Stop words are an important concept in text analysis (also called text mining).

When we do text mining, one of our objectives is to analyze the different words occurring in a text along with how often each one occurs. The primary presumption is that the more often a specific word occurs in a text, the more important that word is to the text.

Often, however, some words occur frequently in a text but carry very little information. These are called stop words, and we usually want to remove them from our analysis.
Some common English stop words include "I", "she'll", "the", etc.
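
To see why this matters, here is a minimal sketch in base R that counts word frequencies before and after dropping stop words. The sample sentence and the three-word stop list are made up purely for illustration; they are not taken from any package.

```r
# Toy text for illustration only (not from any real data set).
text <- "The cat sat on the mat and the dog sat on the rug"

# Lower-case the text and split it into words on whitespace.
words <- strsplit(tolower(text), "\\s+")[[1]]

# Count how often each word occurs, most frequent first.
word_freq <- sort(table(words), decreasing = TRUE)
print(word_freq)

# A tiny hand-written stop word list, assumed here just for the demo.
demo_stopwords <- c("the", "on", "and")

# Drop the stop words and count again.
content_words <- words[!(words %in% demo_stopwords)]
content_freq <- sort(table(content_words), decreasing = TRUE)
print(content_freq)
```

In the first table the most frequent word is "the", which tells us nothing about the text. Once the stop words are removed, the words that actually describe the content come to the top.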

Handling stop words is easy in R using the "tm" package.

The "tm" package provides lists of stop words for various languages, including English.

We can get the full list of stop words by calling one of the following functions -

stopwords("english")

or 

stopwords("SMART")

The second option provides a more comprehensive list of stop words.
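
A quick way to compare the two lists is to look at their sizes and at the words that appear only in the SMART list. This is a small sketch that assumes the "tm" package is installed; the variable names are chosen here for illustration.

```r
library(tm)

# Fetch both built-in lists.
english_words <- stopwords("english")
smart_words   <- stopwords("SMART")

# Compare how many words each list contains.
length(english_words)
length(smart_words)

# Show a few words that are in the SMART list but not in the default English list.
head(setdiff(smart_words, english_words), 10)
```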

Now, when we are doing text analysis, we might come across other words that occur very frequently in our text and have no significance for the analysis, but are not present in the provided list of stop words.

In that case we have 2 choices -

Option 1 : Add new words to the stop word list only for the current R script

library(tm)

# "word1" and "word2" are placeholders; list as many words as you need
New_Stopwords <- c("word1", "word2")

My_Stopwords <- c(New_Stopwords, stopwords("english"))

We can now use the variable My_Stopwords in place of stopwords("english").
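
As a rough sketch of what that looks like in practice, the combined list can be passed to removeWords() from "tm", either directly on a character vector or through tm_map() on a corpus. The two sample sentences and the placeholder words "word1" and "word2" are assumptions made for this example.

```r
library(tm)

# Placeholder custom words combined with the built-in English list.
New_Stopwords <- c("word1", "word2")
My_Stopwords  <- c(New_Stopwords, stopwords("english"))

# Two made-up documents, already in lower case.
docs <- c("word1 is the first example", "the second example has word2")

# Remove the stop words directly from the character vector.
cleaned <- removeWords(docs, My_Stopwords)

# Tidy up the extra spaces left behind by the removed words.
cleaned <- trimws(gsub("\\s+", " ", cleaned))
print(cleaned)

# The same removal applied to a corpus.
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, My_Stopwords)
```

Note that the stop word lists are in lower case, so it is usually worth lower-casing the text before removing stop words.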

Option 2 : Update the stop word list permanently, so that the updated list is available to all R scripts

It is possible to add your own stop words to the default list that ships with the "tm" installation. The "tm" package comes with many data files, including stop word files for many languages. You can add, delete, or update words in the english.dat file under the stopwords directory.

First, locate the "english.dat" file on your system -

Step 1 :- Run the following R command to find the folder path where packages are saved on your system
> .libPaths()
[1] "D:/Users/XXXXXX/Documents/R/R-3.5.1/library"

Step 2 :- Go to that folder on your system and find the subfolder "tm\stopwords"

Step 3 :- Find the file called "english.dat" in that folder

Step 4 :- Open the file "english.dat" in RStudio (or any plain text editor)

Step 5 :- Add the word(s) you want to include in the stop word list at the end of the "english.dat" file and save it.
          Note - put each new word on its own line (press Enter after each word)


Now, whenever you call stopwords("english") from the "tm" package, you will get the updated list of stop words.
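
If you would rather not browse the folders by hand, the same file can be located from within R. The sketch below assumes "tm" is installed and uses the base R function system.file() to find the file; "word1" is a hypothetical placeholder for whatever word you added.

```r
library(tm)

# Locate the English stop word file inside the installed "tm" package.
stopword_file <- system.file("stopwords", "english.dat", package = "tm")
print(stopword_file)

# Read the file; it holds one stop word per line.
file_words <- readLines(stopword_file, encoding = "UTF-8")
tail(file_words)

# Check whether a newly added word is picked up.
# "word1" is a placeholder; replace it with a word you actually added.
"word1" %in% stopwords("english")
```

Keep in mind that this file belongs to the package installation, so reinstalling or updating "tm" will most likely replace it with the original version, and the added words would then need to be entered again.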

Happy Learning 
Priyaranjan Mohanty
@AUTHOR : Admin
