Extracting URLs from text with tidytext in R
- Load the necessary packages:
library(tidyverse) - This loads the tidyverse, a collection of R packages for data manipulation and visualization. The steps below use readr (read_csv), dplyr (count, filter, anti_join) and stringr (str_detect) from it.
library(tidytext) - This loads the tidytext package, which provides functions for text mining and analysis.
- Read the data:
data <- read_csv("data.csv") - This reads the data from a CSV file called "data.csv" and stores it in the variable "data". The later steps assume the file has a column named "text".
- Tokenize the text:
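data_tokenized <- data %>% unnest_tokens(word, text, token = "regex", pattern = "\\s+", to_lower = FALSE) - This tokenizes the text column by splitting it on whitespace, so each URL stays in one piece. The default word tokenizer drops punctuation and would break "https://example.com/page" into "https", "example.com" and "page", leaving no complete URL to find later. Setting to_lower = FALSE preserves the case of URL paths, at the cost that capitalized stop words such as "The" will not be matched in the next step. The result is stored in a new variable called "data_tokenized".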
- Remove stop words:
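data_filtered <- data_tokenized %>% anti_join(stop_words, by = "word") - This removes common stop words (e.g., "the", "and", "is") from the tokenized data, using the stop_words data set that ships with tidytext. Naming the join column with by = "word" avoids the message dplyr prints when it has to guess it. The result is stored in a new variable called "data_filtered".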
- Count word frequencies:
word_counts <- data_filtered %>% count(word, sort = TRUE) - This counts the frequency of each word in the filtered data and sorts the result in descending order. The result is stored in a new variable called "word_counts".
- Extract URLs:
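url <- word_counts %>% filter(str_detect(word, "^https?://")) - This filters the word_counts data to keep only the tokens that start with "http://" or "https://". Matching the "://" as well avoids picking up ordinary words that merely begin with "http". It assumes the tokenization step kept each URL as a single token, since the default word tokenizer splits URLs at punctuation. The result is stored in a new variable called "url".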
- View the extracted URLs:
url - This displays the extracted URLs on the console.
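The steps above can be tried without a CSV file. The following is a minimal sketch, not a definitive implementation: the sample tibble and its "id" column are made up for illustration and stand in for data.csv, and the text is split on whitespace so that each URL survives as one token.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical sample data standing in for data.csv (assumed "text" column).
data <- tibble::tibble(
  id = 1:3,
  text = c(
    "Read the docs at https://example.com/docs today",
    "No links in this line",
    "Mirror: http://example.org and https://example.com/docs"
  )
)

urls <- data %>%
  # Split on whitespace so URLs stay whole; keep the original case.
  unnest_tokens(word, text, token = "regex", pattern = "\\s+", to_lower = FALSE) %>%
  # Keep only tokens that begin with http:// or https://.
  filter(str_detect(word, "^https?://")) %>%
  # Count how often each URL appears, most frequent first.
  count(word, sort = TRUE)

print(urls)
```

Splitting on whitespace keeps any punctuation that touches a URL, so a link at the end of a sentence would carry its trailing period or comma. If that matters for your data, trim it afterwards, for example with str_remove(word, "[.,;:!?)]+$").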
Note: Please ensure that you have the "tidyverse" and "tidytext" packages installed before running this code.
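One way to check the note above is sketched below. The helper name missing_packages is made up for this example; it relies only on base R's requireNamespace, which also loads a package's namespace as a side effect when the package is found. The install line is left commented out so nothing is installed without your say-so.

```r
# Return the names of the packages in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "tidytext")
to_install <- missing_packages(needed)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to install them from CRAN:
  # install.packages(to_install)
} else {
  message("All required packages are installed.")
}
```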