WordCloud Using R Language
Why is R considered one of the most prominent languages for data science? What is the special recipe that lets this language work with data so efficiently? In this blog we cover the key features of R and give a short trial run of working with it by generating a word cloud from an article.
Introduction to R programming :
R is a programming language developed by Ross Ihaka and Robert Gentleman in 1993. It is also a free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. R possesses an extensive catalog of statistical and graphical methods, including machine learning algorithms, linear regression, time series, and statistical inference, to name a few. Most R libraries are written in R itself. The language is widely used among statisticians and data miners for developing statistical software and for data analysis. R is trusted not only by academics; many large companies also use it, including Uber, Google, Airbnb, and Facebook.
Why do we use R nowadays?
A large share of data scientists use R for data mining and statistical analysis, and it is a common language of choice within the rather nebulous “Big data” industry you keep hearing about. R includes built-in functions and variables designed to make statistical analysis easier, and it also provides graphics tools that produce publication-quality data visualizations. R is highly extensible, and many packages exist to address specific data analysis tasks and problems. It owes part of its popularity to its open-source status, which means that anyone can use R and have access to world-class statistical analysis tools. R is designed to work on virtually any platform and can be run on systems with a Unix, Linux, Windows, or Mac OS operating system.
R – Installation :
To install R, follow the link below:
https://cran.r-project.org/
RStudio Installation :
To install RStudio, follow the link below:
https://www.rstudio.com/products/rstudio/download/#download
R Packages :
Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages. Others are available for download and installation. Once installed, they have to be loaded into the session to be used.
Install : Package Description
install.packages("tm")         # for text mining
install.packages("wordcloud")  # word-cloud generator
The package “tm” is used for text mining, and the package “wordcloud” is used to generate the word cloud.
Load : Library
library("tm")
library("wordcloud")
The library() function loads the installed packages into the current session.
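If you are not sure whether the packages are already on your machine, a small check like the one below can help. This is only a sketch: missing_packages is a hypothetical helper name we made up for this post, and the install call is left commented out so nothing is downloaded unless you want it.

```r
# Hypothetical helper (not part of any package): returns the names of the
# requested packages that are not installed in the current R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tm", "wordcloud")
to_install <- missing_packages(needed)

if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

Packages that ship with R, such as "stats", should never show up in the result.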
Choosing The File
Choose the text file, such as your article, for which you want to generate the word cloud.
Load the data as a corpus, that is, a collection of documents containing (natural language) text.
text <- readLines(file.choose())  # pick the article file; one element per line
words <- Corpus(VectorSource(text))
inspect(words)  # view the corpus
This builds the corpus from our text; inspect() prints its contents so we can check that the article was loaded.
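The text object handed to VectorSource is assumed here to be a plain character vector with one element per line of the article. The sketch below shows one way to get such a vector using only base R. The file content is a made-up stand-in written to a temporary file, so swap in the path to your own article.

```r
# Write a small stand-in article to a temporary file (hypothetical content).
article_path <- tempfile(fileext = ".txt")
writeLines(c("R is a language for statistical computing.",
             "R makes word clouds easy."),
           article_path)

# readLines() returns a character vector with one element per line.
text <- readLines(article_path)

length(text)  # number of lines read
text[1]       # first line of the article
```

A vector like this can then be passed on as VectorSource(text).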
Cleaning the Text
# Convert the text to lower case
words <- tm_map(words, content_transformer(tolower))
# Remove numbers
words <- tm_map(words, removeNumbers)
# Remove English common stopwords
words <- tm_map(words, removeWords, stopwords("english"))
# Specify your stopwords as a character vector
words <- tm_map(words, removeWords, c("the", "is"))
# Remove punctuation
words <- tm_map(words, removePunctuation)
# Eliminate extra white spaces
words <- tm_map(words, stripWhitespace)
Before processing the corpus we need to clean it, for example by removing stop words, punctuation, and extra whitespace.
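To see what these cleaning steps do, here is a rough base R imitation applied to a single sentence. It is a sketch for illustration only, not how the tm package is implemented. The sample sentence and the tiny stop word list are made up.

```r
raw <- "The 3 quick Foxes are jumping, and the dog is sleeping!"
my_stopwords <- c("the", "is", "are", "and")  # tiny illustrative list

clean <- tolower(raw)                                   # lower case
clean <- gsub("[[:digit:]]+", "", clean)                # remove numbers
stop_pattern <- paste0("\\b(", paste(my_stopwords, collapse = "|"), ")\\b")
clean <- gsub(stop_pattern, "", clean, perl = TRUE)     # remove stop words
clean <- gsub("[[:punct:]]+", "", clean)                # remove punctuation
clean <- trimws(gsub("[[:space:]]+", " ", clean))       # squeeze whitespace

clean
```

Only the content words "quick foxes jumping dog sleeping" are left at the end, which is the kind of text we want to count.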
Build a term-document matrix
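# Build the term-document matrix and convert it to a plain matrix
tdm <- TermDocumentMatrix(words)
m <- as.matrix(tdm)
# Total frequency of each word, sorted from most to least frequent
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)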
This counts how often each word occurs in the document and lists the ten most frequent words.
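The idea behind the frequency table can be tried out without any packages. The sketch below counts words in three made-up, already cleaned lines using base R. The numbers from tm may differ slightly on real text, since tm does its own tokenizing.

```r
# Three already-cleaned lines of text (made-up sample data).
docs <- c("r makes data analysis easy",
          "data analysis with r",
          "word clouds show data")

# Split every line on spaces and count how often each word appears.
tokens <- unlist(strsplit(docs, " ", fixed = TRUE))
counts <- sort(table(tokens), decreasing = TRUE)

# Same two-column layout (word, freq) that the word cloud step expects.
d_demo <- data.frame(word = names(counts), freq = as.integer(counts))
head(d_demo, 3)
```

Here "data" comes out on top because it appears in all three lines.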
Generate the Word cloud
set.seed(1)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.15,
          colors = brewer.pal(8, "Dark2"))
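Two of the arguments above decide which words make it into the picture: min.freq leaves out words that occur fewer times than the threshold, and max.words caps how many words are drawn, keeping the most frequent ones. The base R sketch below approximates that selection on a made-up frequency table. It illustrates the idea and is not the package's internal code.

```r
# Made-up frequency table in the same (word, freq) layout as above.
d_demo <- data.frame(word = c("data", "analysis", "language", "cloud", "rare"),
                     freq = c(10, 6, 4, 2, 1))

min_freq  <- 2  # plays the role of min.freq
max_words <- 3  # plays the role of max.words

# Drop words below the threshold, then keep the most frequent ones.
kept <- d_demo[d_demo$freq >= min_freq, ]
kept <- head(kept[order(kept$freq, decreasing = TRUE), ], max_words)

kept$word
```

With a threshold of 2 and a cap of 3, only "data", "analysis", and "language" would be drawn.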
WORDCLOUD FOR THE DATASET
Dataset : https://drive.google.com/open?id=1XEw73_0DmYYM48C1sasqfVLdTtUgu9r4
Conclusion:
R is free and open-source, making it possible for anyone to have access to world-class statistical analysis tools. It is used widely in academia and the private sector and is one of the most popular programming languages for statistical analysis today. Learning R isn’t easy; if it were, data scientists wouldn’t be in such high demand. However, there is no shortage of quality resources you can use to learn R if you’re willing to put in the time and effort.