Tuesday, April 18, 2017

Text Mining (TM) with an example of WordCloud on RStudio

It is estimated that a major part of usable business information is unstructured, often in the form of text data. Text mining provides a collection of methods that help us derive actionable insights from these data.

The main package for text mining tasks in R is tm. The structure for managing documents in tm is the Corpus, which represents a collection of text documents. More generally, a corpus is a large body of natural language text used for accumulating statistics on natural language; the plural is corpora. A lexicon is a collection of information about the words of a language and the lexical categories to which they belong. It is usually structured as a collection of lexical entries, for example the same word listed separately for its use as a verb, a noun and an adjective.
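As a minimal sketch of what a corpus looks like in practice, the snippet below builds one from a character vector. The sample sentences and the names texts and docs are made up for illustration; VCorpus() and VectorSource() are functions from tm.

```r
library(tm)

# Hypothetical sample texts; each vector element becomes one document
texts <- c("Text mining turns raw text into insight.",
           "A corpus is a collection of documents.")

# VectorSource() treats every element of the vector as a separate document
docs <- VCorpus(VectorSource(texts))

print(length(docs))        # number of documents in the corpus
print(content(docs[[1]]))  # text of the first document
```

Corpus(VectorSource(texts)) can be used in the same way; VCorpus is chosen here only to make the type of corpus explicit.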

Once we have a corpus we typically want to modify the documents in it, e.g., by stemming or stopword removal. In tm, all this functionality is subsumed into the concept of a transformation. Transformations are done via the tm_map() function, which applies (maps) a function to all elements of the corpus. Basically, each transformation works on a single text document, and tm_map() just applies it to every document in the corpus.
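To illustrate the idea that a transformation is just a function mapped over every document, here is a small sketch with a custom transformation. The helper name toSpace and the sample texts are assumptions made for this example; content_transformer() is the tm wrapper that lets an ordinary string function work on documents.

```r
library(tm)

# Hypothetical two-document corpus
docs <- VCorpus(VectorSource(c("Text/Mining in R", "Word-Cloud example")))

# Hypothetical helper: replace every match of `pattern` with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# tm_map() applies each transformation to every document in turn
docs <- tm_map(docs, toSpace, "[/-]")
docs <- tm_map(docs, content_transformer(tolower))

result <- unname(sapply(docs, as.character))
print(result)
```

Extra arguments given to tm_map(), such as the pattern here, are passed on to the transformation function.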

Eliminating Extra Whitespace
> sample <- tm_map(sample, stripWhitespace)

Convert to Lower Case
> sample <- tm_map(sample, content_transformer(tolower))

Remove Stopwords
> sample <- tm_map(sample, removeWords, stopwords("english"))

Stemming (stemDocument needs the SnowballC package to be installed)
> sample <- tm_map(sample, stemDocument)
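The order of these transformations matters, so here is a hedged sketch on a one-sentence corpus (the sentence is made up). It assumes that lower-casing should come before stopword removal, because stopwords("english") is a lower-case list and removeWords matches case-sensitively, and that whitespace is stripped last, because removed words leave gaps behind. Stemming is left out so the sketch runs without SnowballC.

```r
library(tm)

# Hypothetical one-document corpus with mixed case and extra spaces
docs <- VCorpus(VectorSource("The  Miners ARE mining   the data"))

# 1. Lower-case first, so "The" and "ARE" match the lower-case stopword list
docs <- tm_map(docs, content_transformer(tolower))
# 2. Remove stopwords; each removed word leaves its surrounding spaces behind
docs <- tm_map(docs, removeWords, stopwords("english"))
# 3. Collapse the leftover runs of whitespace into single spaces
docs <- tm_map(docs, stripWhitespace)

cleaned <- trimws(content(docs[[1]]))
print(cleaned)
```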

Wordcloud _example_1:

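Step 1 : Install package "tm" with install.packages("tm")

Step 2 : Install package "RColorBrewer" with install.packages("RColorBrewer")

Step 3 : Install package "wordcloud" with install.packages("wordcloud")

Step 4 : Load the libraries tm, wordcloud and RColorBrewer with library()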

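Step 5 : Execute the R script :
# Load the libraries (Step 4)
library(tm)
library(wordcloud)
library(RColorBrewer)

# Read the input text, one element per line
my_data_file <- readLines("/home/spb/data/input.txt")

# Build a corpus in which every line is a document
myCorpus <- Corpus(VectorSource(my_data_file))

# Clean the text; tolower is wrapped in content_transformer() so the
# documents keep their tm structure
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))

# Keep terms of any length (by default, words shorter than 3 characters are dropped)
myTDM <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))

# Sum each term's counts across all documents and sort by frequency
m <- as.matrix(myTDM)
v <- sort(rowSums(m), decreasing = TRUE)

# Plot only the terms that occur at least 50 times
wordcloud(names(v), v, min.freq = 50)
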
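To make it clearer what the frequency vector v holds, the following sketch counts terms with base R only. It illustrates the idea and is not how TermDocumentMatrix works internally; the sample lines and the names lines_in and v_demo are hypothetical.

```r
# Hypothetical sample input standing in for the lines read from the input file
lines_in <- c("Text mining finds patterns in text",
              "Mining text data needs clean text")

# Lower-case, split on anything that is not a letter, and drop empty tokens
tokens <- unlist(strsplit(tolower(lines_in), "[^a-z]+"))
tokens <- tokens[nzchar(tokens)]

# Count each term and sort by decreasing frequency, like v in the script
freq <- sort(table(tokens), decreasing = TRUE)
v_demo <- setNames(as.integer(freq), names(freq))

print(head(v_demo, 3))

# Terms that would survive a threshold such as min.freq = 2
print(names(v_demo)[v_demo >= 2])
```

If no term reaches the min.freq value passed to wordcloud(), nothing is drawn, so it is worth checking the top of v before choosing a threshold such as 50.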
Step 6 : wordcloud visualization :

Wordcloud _example_2:
wordcloud(names(v), v, min.freq = 50, colors=brewer.pal(7, "Dark2"), random.order = TRUE) 

Wordcloud _example_3: 
wordcloud(names(v), v, min.freq = 50, colors=brewer.pal(7, "Dark2"), random.order = FALSE)
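With random.order = FALSE the words are plotted in decreasing frequency, while random.order = TRUE plots them in random order. The placement still involves randomness, so the sketch below fixes the seed and writes the cloud to a PNG file so that it can be reproduced. The frequencies in v_demo, the seed value and the output file name are assumptions made for illustration.

```r
library(wordcloud)
library(RColorBrewer)

# Hypothetical frequencies standing in for v from the script
v_demo <- c(text = 12, mining = 9, corpus = 7, word = 5, cloud = 4, data = 3)

# Assumed output location; change it to any writable path
out_file <- file.path(tempdir(), "wordcloud_demo.png")

set.seed(42)  # fix the random layout so reruns give the same picture
png(out_file, width = 600, height = 600)
wordcloud(names(v_demo), v_demo, min.freq = 1,
          colors = brewer.pal(7, "Dark2"), random.order = FALSE)
dev.off()  # close the device so the file is written

print(out_file)
```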