Part II: Text Data Processing (30 minutes)
Manipulating and analyzing text data using regular expressions
# Load necessary libraries
library(stringr)
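# Example: Extracting email addresses from a string
# The pattern is deliberately simple: a local part (letters, digits, "_", "."),
# "@", one domain label, ".", and a top-level domain of two or more letters.
# It does not handle subdomains or hyphenated domains.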
text <- "Contact us at support@example.com or feedback@example.net"
emails <- str_extract_all(text, "[[:alnum:]_.]+@[[:alnum:]]+\\.[[:alpha:]]{2,}")
print(emails)
#> [[1]]
#> [1] "support@example.com" "feedback@example.net"
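Extraction is only one of the stringr verbs that accept a regular expression. As a minimal sketch (the `notes` string and all variable names here are made up for illustration), the same email pattern can be used to detect, count, and mask matches, and the word-boundary anchor `\\b` can be used to pull out whole words with a given ending:

```r
library(stringr)

# Hypothetical sample text, invented for this illustration
notes <- "Mail admin@example.org about parsing and cleaning; cc ops@example.org."
email_pattern <- "[[:alnum:]_.]+@[[:alnum:]]+\\.[[:alpha:]]{2,}"

# Does the string contain at least one email address?
has_email <- str_detect(notes, email_pattern)

# How many addresses does it contain?
n_emails <- str_count(notes, email_pattern)

# Replace every address with a placeholder
masked <- str_replace_all(notes, email_pattern, "<email>")

# \\b marks a word boundary, so only whole words ending in "ing" match
ing_words <- str_extract_all(notes, "\\b[[:alpha:]]+ing\\b")[[1]]
```

`str_extract_all()` returns a list with one element per input string, which is why the last line takes `[[1]]` to get a plain character vector.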
Exercise:
- Write a regular expression to find all the words starting with ‘b’ in a given text.
Text mining basics
# Load the 'tm' package for text mining
library(tm)
#> Loading required package: NLP
# Example: Basic text mining with a simple corpus
docs <- Corpus(VectorSource(c("Text mining is awesome", "R is a versatile tool for text analysis")))
dtm <- DocumentTermMatrix(docs)
inspect(dtm)
#> <<DocumentTermMatrix (documents: 2, terms: 7)>>
#> Non-/sparse entries: 8/6
#> Sparsity           : 43%
#> Maximal term length: 9
#> Weighting          : term frequency (tf)
#> Sample             :
#>     Terms
#> Docs analysis awesome for mining text tool versatile
#>    1        0       1   0      1    1    0         0
#>    2        1       0   1      0    1    1         1
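Once the document-term matrix exists, a common next step is to summarize how often each term occurs across the whole corpus. The sketch below rebuilds the same two-document corpus so it runs on its own. The names `m`, `term_totals`, and `frequent` are illustrative, and converting to a dense matrix with `as.matrix()` is assumed to be acceptable only because the corpus is tiny.

```r
library(tm)

# Same two-document corpus as in the example above
docs <- Corpus(VectorSource(c("Text mining is awesome",
                              "R is a versatile tool for text analysis")))
dtm <- DocumentTermMatrix(docs)

# Dense copy of the sparse matrix (reasonable for small corpora only)
m <- as.matrix(dtm)

# Total count of each term across all documents, most frequent first
term_totals <- sort(colSums(m), decreasing = TRUE)

# Terms that occur at least twice in the corpus
frequent <- findFreqTerms(dtm, lowfreq = 2)
```

Here "text" is the only term that appears in both documents, so it is the only one returned by `findFreqTerms()` with `lowfreq = 2`. Words such as "is", "R", and "a" never enter the matrix because `DocumentTermMatrix()` drops terms shorter than three characters by default.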
Exercise:
- Create a corpus from your own text data and compute its term frequency-inverse document frequency (tf-idf) matrix.