Install the package with install.packages('tidytext').
We first use the `unnest_tokens` function to split the text into individual words, and later remove some common "stop words". This function takes a vector of strings and extracts the tokens so that each token gets its own row in a new table. Here is a simple example:
library(tidytext)
text <- c('I am JoJo,','he is kimi','life is short')
example <- tibble(line = c(1, 2, 3),
                  text = text)
example %>% unnest_tokens(word,text)
A tibble: 9 × 2
| line | word |
|------|------|
| 1 | i |
| 1 | am |
| 1 | jojo |
| 2 | he |
| 2 | is |
| 2 | kimi |
| 3 | life |
| 3 | is |
| 3 | short |
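The stop-word removal mentioned above can be sketched with tidytext's built-in `stop_words` table: `anti_join` drops every token that also appears in that table. (A minimal sketch; the exact tokens kept depend on the lexicons in `stop_words`.)

```r
library(dplyr)
library(tidytext)

example <- tibble(line = c(1, 2, 3),
                  text = c('I am JoJo,', 'he is kimi', 'life is short'))

example %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")  # drop common stop words such as "i", "am", "is"
```

After filtering, mostly content words such as "jojo" and "kimi" remain.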
Now let's look at the first tweet in the data:
campaign_tweets[1,]
A tibble: 1 × 8
| source | id_str | text | created_at | retweet_count | in_reply_to_user_id_str | favorite_count | is_retweet |
|---|---|---|---|---|---|---|---|
| Android | 612063082186174464 | Why did @DanaPerino beg me for a tweet (endorsement) when her book was launched? | 2015-06-19 20:03:05 | 166 | NA | 348 | FALSE |
campaign_tweets[1,] %>%
  unnest_tokens(word, text) %>%
  pull(word)
- ‘why’
- ‘did’
- ‘danaperino’
- ‘beg’
- ‘me’
- ‘for’
- ‘a’
- ‘tweet’
- ‘endorsement’
- ‘when’
- ‘her’
- ‘book’
- ‘was’
- ‘launched’
By default, unnest_tokens strips special characters such as the @ in a Twitter handle. We can keep them by specifying the token argument:
campaign_tweets[1,] %>%
  unnest_tokens(word, text, token = "tweets") %>%
  pull(word)
- ‘why’
- ‘did’
- ‘@danaperino’
- ‘beg’
- ‘me’
- ‘for’
- ‘a’
- ‘tweet’
- ‘endorsement’
- ‘when’
- ‘her’
- ‘book’
- ‘was’
- ‘launched’
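As a quick check, the "tweets" tokenizer keeps both @-mentions and #-hashtags as single tokens. (A sketch with a made-up tweet; note that recent tidytext releases have removed the "tweets" tokenizer, so this may require an older version of the package.)

```r
library(tibble)
library(tidytext)

# Hypothetical tweet text, for illustration only
tibble(line = 1, text = "Thanks @kimi for the book #reading") %>%
  unnest_tokens(word, text, token = "tweets") %>%
  pull(word)
# "@kimi" and "#reading" come through intact as single tokens
```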
The other small adjustment we make is to remove links to pictures before extracting the `word` tokens:
links <- "https://t.co/[A-Za-z\\d]+|&"  # picture links (and stray ampersands)
tweet_words <- campaign_tweets %>%
  mutate(text = str_replace_all(text, links, "")) %>%
  unnest_tokens(word, text, token = "tweets")
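The effect of the regex can be checked on a made-up tweet (the t.co link below is fabricated for illustration):

```r
library(stringr)

links <- "https://t.co/[A-Za-z\\d]+|&"
str_replace_all("Great rally tonight! https://t.co/AbC123", links, "")
# the t.co link is stripped, leaving "Great rally tonight! " (trailing space included)
```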
Next, let's see which words appear most often:
tweet_words %>%
  count(word) %>%
  top_n(10, n) %>%
  mutate(word = reorder(word, n)) %>%
  arrange(desc(n))
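The unfiltered counts will largely be common stop words ("the", "to", "and", …), so a natural follow-up, sketched below, is to filter those out, along with tokens that are just numbers, before counting (this reuses the `stop_words` table from tidytext):

```r
library(dplyr)
library(stringr)
library(tidytext)

tweet_words %>%
  filter(!word %in% stop_words$word,       # drop stop words like "the", "to"
         !str_detect(word, "^\\d+$")) %>%  # drop tokens that are only digits
  count(word, sort = TRUE) %>%
  top_n(10, n)
```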