Text Mining in R: A Case Study Using Trump's Tweets

2401_84166965

Article link: https://blog.csdn.net/2401_84166965/article/details/138686895

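| | source | id_str | text | created_at | retweet_count | in_reply_to_user_id_str | favorite_count | is_retweet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | Twitter Web Client | 5775731054 | Donald Trump Partners with TV1 on New Reality Series Entitled, Omarosa’s Ultimate Merger: http://tinyurl.com/yk5m3lc | 2009-11-16 16:06:10 | 5 | NA | 3 | FALSE |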
| 5 | Twitter Web Client | 5364614040 | --Work has begun, ahead of schedule, to build the greatest golf course in history: Trump International – Scotland. | 2009-11-02 09:57:56 | 7 | NA | 6 | FALSE |
| 6 | Twitter Web Client | 5203117820 | --From Donald Trump: “Ivanka and Jared’s wedding was spectacular, and they make a beautiful couple. I’m a very proud father.” | 2009-10-27 10:31:48 | 4 | NA | 5 | FALSE |

names(trump_tweets)

‘source’
‘id_str’
‘text’
‘created_at’
‘retweet_count’
‘in_reply_to_user_id_str’
‘favorite_count’
‘is_retweet’

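1.2 Basic information about the data

Running ?trump_tweets shows the documentation for each variable:

source. The device or service used to compose the tweet.
id_str. The tweet ID.
text. The text of the tweet.
created_at. The date and time the tweet was posted.
retweet_count. How many times the tweet was retweeted.
in_reply_to_user_id_str. If the tweet is a reply, the user ID of the person being replied to.
favorite_count. The number of likes (favorites).
is_retweet. Whether the tweet is a retweet.

Next, we use source to count the tweets by where they were posted from: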

trump_tweets %>% count(source) %>% arrange(desc(n)) %>% head(5)

A data.frame: 5 × 2

	source	n

1	Twitter Web Client	10718
2	Twitter for Android	4652
3	Twitter for iPhone	3962
4	TweetDeck	468
5	TwitLonger Beta	288

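We are interested in what happened during the campaign, so this analysis focuses on tweets posted between the day Trump announced his candidacy and election day. The table defined below contains only tweets from that period. Note that we use extract to strip the "Twitter for " prefix from source, keeping only the device name captured by (.*), and that we filter out retweets.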

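# tidyverse provides %>%, extract() and filter(); lubridate provides ymd()
library(tidyverse)
library(lubridate)

campaign_tweets <- trump_tweets %>% 
  # keep only the device name that follows "Twitter for "
  extract(source, "source", "Twitter for (.*)") %>%
  filter(source %in% c("Android", "iPhone") &
           created_at >= ymd("2015-06-17") & 
           created_at < ymd("2016-11-08")) %>%
  filter(!is_retweet) %>%   # drop retweets
  arrange(created_at) %>% 
  as_tibble()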
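
To see what extract does with this pattern, here is a minimal sketch on a toy table. The `toy` data frame and its three source strings are made up for illustration and are not taken from trump_tweets.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): two mobile clients and one other client
toy <- tibble(source = c("Twitter for Android",
                         "Twitter for iPhone",
                         "Twitter Web Client"))

# Replace source with the text captured by the group (.*)
toy_extracted <- toy %>%
  extract(source, "source", "Twitter for (.*)")

toy_extracted$source
```

Rows that do not match the pattern become NA, which is why the later filter on "Android" and "iPhone" also drops tweets sent from other clients.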

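Now we explore how tweeting patterns differ between the two devices. For each tweet we extract the hour of day it was posted, in East Coast time (EST), and then compute the proportion of tweets posted in each hour for each device: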

campaign_tweets %>%
  mutate(hour = hour(with_tz(created_at, "EST"))) %>%
  count(source, hour) %>%
  group_by(source) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup %>%
  ggplot(aes(hour, percent, color = source)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = percent_format()) +
  labs(x = "Hour of day (EST)", y = "% of tweets", color = "")

[Figure: percentage of tweets by hour of day (EST), one line per device]

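Trump's Android tweets were posted more often in the morning, while the campaign tweets from the iPhone were posted more often in the afternoon and early evening.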
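
The time-zone step in the pipeline above can be checked on a single timestamp. This is only a sketch: the timestamp is illustrative, and treating it as UTC is an assumption about how created_at is stored.

```r
library(lubridate)

# Illustrative timestamp, assumed to be recorded in UTC
posted_utc <- ymd_hms("2015-06-19 20:03:05", tz = "UTC")

# The same instant shown on the EST clock
posted_est <- with_tz(posted_utc, "EST")

# Hour of day used for the hourly counts
hour_est <- hour(posted_est)
hour_est
```

Note that "EST" is a fixed UTC-5 offset without daylight saving time. A zone name such as "America/New_York" would follow daylight saving time instead, which shifts summer tweets by one hour.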

Another difference between the two devices is whether the tweets share links or pictures.

tweet_picture_counts <- campaign_tweets %>%
  filter(!str_detect(text, '^"')) %>%
  count(source,
        picture = ifelse(str_detect(text, "t.co"),
                         "Picture/link", "No picture/link"))

ggplot(tweet_picture_counts, aes(source, n, fill = picture)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "", y = "Number of tweets", fill = "")

[Figure: number of tweets with and without a picture or link, by device]

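It turns out that tweets from the iPhone were 38 times as likely to contain a picture or link. This fits our narrative: the iPhone (presumably run by the campaign) tends to write "announcement" tweets about events. Next we look at how the wording of tweets differs between Android and iPhone. For this we introduce the tidytext package.

2. Text data analysis

The tidytext package helps us convert free-form text into a tidy table. Having the data in this format greatly facilitates data visualization and the use of statistical techniques.

Install it with install.packages('tidytext').

We first use the unnest_tokens function to split the text into individual words, and later remove common "stop words". This function takes a vector of strings and extracts the tokens so that each token gets its own row in the new table. Here is a simple example: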

library(tidytext)
text <- c('I am JoJo,','he is kimi','life is short')
example <- tibble(line = c(1, 2, 3),
                      text = text)

example %>% unnest_tokens(word,text)

A tibble: 9 × 2

line	word

1	i
1	am
1	jojo
2	he
2	is
2	kimi
3	life
3	is
3	short

Now let's look at the first tweet in the data:

campaign_tweets[1,]

A tibble: 1 × 8

source	id_str	text	created_at	retweet_count	in_reply_to_user_id_str	favorite_count	is_retweet

Android	612063082186174464	Why did @DanaPerino beg me for a tweet (endorsement) when her book was launched?	2015-06-19 20:03:05	166	NA	348	FALSE

campaign_tweets[1,] %>% 
  unnest_tokens(word, text) %>%
  pull(word)

‘why’
‘did’
‘danaperino’
‘beg’
‘me’
‘for’
‘a’
‘tweet’
‘endorsement’
‘when’
‘her’
‘book’
‘was’
‘launched’

By default, unnest_tokens strips special characters such as @. We can keep them by specifying the token argument:

campaign_tweets[1,] %>% 
  unnest_tokens(word, text, token = "tweets") %>%
  pull(word)

‘why’
‘did’
‘@danaperino’
‘beg’
‘me’
‘for’
‘a’
‘tweet’
‘endorsement’
‘when’
‘her’
‘book’
‘was’
‘launched’
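
If the installed tidytext does not accept token = "tweets" (to our knowledge the option was dropped in tidytext 0.4.0), one possible workaround is the "regex" tokenizer with a pattern that treats everything except letters, digits, #, @ and apostrophes as a separator. The pattern below is an assumption chosen for illustration. It is not the tokenizer used in this post and will not reproduce the "tweets" tokenizer exactly.

```r
library(dplyr)
library(tidytext)

# One-row toy table holding the text of the first campaign tweet
toy <- tibble(text = "Why did @DanaPerino beg me for a tweet (endorsement) when her book was launched?")

# Split on runs of characters that are not letters, digits, #, @ or '
toy_words <- toy %>%
  unnest_tokens(word, text, token = "regex",
                pattern = "[^A-Za-z\\d#@']+") %>%
  pull(word)

toy_words
```

Because # and @ are excluded from the separator pattern, hashtags and mentions survive as single tokens.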

Another small adjustment is to remove links to pictures before extracting the words:

links <- "https://t.co/[A-Za-z\\d]+|&amp;"
tweet_words <- campaign_tweets %>% 
  mutate(text = str_replace_all(text, links, ""))  %>%
  unnest_tokens(word, text, token = "tweets")
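
As a quick check of what the links pattern removes, here is a sketch on a single made-up string. The tweet text and the short URL are hypothetical.

```r
library(stringr)

# Same pattern as above: t.co short links, or the HTML entity &amp;
links <- "https://t.co/[A-Za-z\\d]+|&amp;"

# Hypothetical tweet text, not taken from the data
toy_text <- "Big rally tonight &amp; tomorrow! https://t.co/AbC123xyz"

# Delete every match of the pattern
cleaned <- str_replace_all(toy_text, links, "")
cleaned
```

The removal leaves the surrounding whitespace in place, which is harmless here because the tokenizer discards it in the next step.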

Next, let's see which words appear most often:

tweet_words %>% 
  count(word) %>%
  top_n(10, n) %>%
  mutate(word = reorder(word, n)) %>%
  arrange(desc(n))

A tibble: 10 × 2

word	n

the	2329
to	1410
and	1239
in	1185
i	1143
a	1112
you	999
of	982
is	942
on	874

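It is no surprise that these are the most frequent words, but they carry no information. The tidytext package includes a database of such common words, which are called stop words in text mining. The stop_words table lists them: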

head(stop_words)

A tibble: 6 × 2

word	lexicon

a	SMART
a’s	SMART
able	SMART
about	SMART
above	SMART
according	SMART

Next we filter out the tokens that are stop words:

tweet_words <- campaign_tweets %>% 
  mutate(text = str_replace_all(text, links, ""))  %>%
  unnest_tokens(word, text, token = "tweets") %>%
  filter(!word %in% stop_words$word )

tweet_words %>% 
  count(word) %>%
  top_n(10, n) %>%
  mutate(word = reorder(word, n)) %>%
  arrange(desc(n))

A tibble: 10 × 2

word	n

#trump2016	414
hillary	405
people	303
#makeamericagreatagain	294
america	254
clinton	237
poll	217
crooked	205
trump	195
cruz	159
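
The same stop-word filter is often written as an anti-join. A minimal sketch, assuming a toy table `toy_words` that is made up for illustration:

```r
library(dplyr)
library(tidytext)

# Toy tokens (hypothetical): two stop words and two informative words
toy_words <- tibble(word = c("the", "crooked", "is", "poll"))

# Keep only rows whose word has no match in stop_words
kept <- toy_words %>%
  anti_join(stop_words, by = "word")

kept$word
```

Both forms drop the same rows. The filter with %in% is convenient when it is combined with other conditions in a single call, as is done in the next step.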

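Now the most frequent words tell us something. We still want to use regular expressions to clean up some unwanted tokens: tokens that consist only of digits are dropped, and a leading apostrophe is removed from each word.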

tweet_words <- campaign_tweets %>% 
  mutate(text = str_replace_all(text, links, ""))  %>%
  unnest_tokens(word, text, token = "tweets") %>%
  filter(!word %in% stop_words$word &
           !str_detect(word, "^\\d+$")) %>%
  mutate(word = str_replace(word, "^'", ""))

Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.

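Now that all the words are in one table, along with the device used to compose the tweet each word came from, we can start exploring which words are more common on Android than on iPhone.

For each word, we want to know whether it is more likely to come from an Android tweet or an iPhone tweet. We measure this with the odds ratio. For each device and a given word, call it y, we compute the odds, that is, the ratio between the proportion of words that are y and the proportion that are not y, and then take the ratio of those odds between the two devices. Many of these proportions will be 0, so we apply a 0.5 correction: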

$$
\frac{(\#\text{ in Android} + 0.5) \,/\, (\text{total Android} + 0.5)}{(\#\text{ in iPhone} + 0.5) \,/\, (\text{total iPhone} + 0.5)}
$$

android_iphone_or <- tweet_words %>%
  count(word, source) %>%
  # reshape from long to wide: one column of counts per device
  pivot_wider(names_from = "source", values_from = "n", values_fill = 0) %>%
  mutate(or = (Android + 0.5) / (sum(Android) + 0.5) / 
           ( (iPhone + 0.5) / (sum(iPhone) + 0.5)))

So the larger the or value, the more strongly the word is associated with Android.
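
To make the 0.5 correction concrete, here is a small worked sketch of the same calculation. The counts in `toy_counts` are invented for illustration and do not come from the tweets.

```r
library(dplyr)
library(tidyr)

# Hypothetical word counts per device, in long format
toy_counts <- tibble(
  word   = c("bad", "bad", "join", "wow"),
  source = c("Android", "iPhone", "iPhone", "Android"),
  n      = c(8, 2, 5, 3)
)

toy_or <- toy_counts %>%
  # one column per device; missing combinations become 0
  pivot_wider(names_from = "source", values_from = "n", values_fill = 0) %>%
  # corrected ratio of the per-device proportions
  mutate(or = ((Android + 0.5) / (sum(Android) + 0.5)) /
              ((iPhone + 0.5) / (sum(iPhone) + 0.5)))

toy_or
```

Here "join" never appears on Android and "wow" never appears on iPhone. Without the correction their ratios would be 0 and infinite, while with it they are small and large but finite.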

Here are the words with the largest odds ratios for Android:

android_iphone_or %>% arrange(desc(or))%>%head()

A tibble: 6 × 4

word	iPhone	Android	or

poor	0	13	23.08133
poorly	0	12	21.37160
turnberry	0	11	19.66188
@cbsnews	0	10	17.95215
angry	0	10	17.95215
bosses	0	10	17.95215

And here are the words most strongly associated with the iPhone, that is, those with the smallest or values:

android_iphone_or %>% arrange(or) %>% head()

A tibble: 6 × 4

word	iPhone	Android	or

#makeamericagreatagain	294	0	0.001451382
#americafirst	71	0	0.005978071
#draintheswamp	63	0	0.006731214
#trump2016	411	3	0.007271020
#votetrump	56	0	0.007565170
join	157	1	0.008141564

Since several of these words are low-frequency overall, we can impose a filter based on total frequency:

android_iphone_or %>% filter(Android+iPhone > 70) %>%
  arrange(desc(or))%>%head()

A tibble: 6 × 4

word	iPhone	Android	or

@cnn	17	90	4.420869
republican	12	63	4.342710
bernie	13	59	3.767735
bad	26	104	3.371068
wow	23	74	2.710101
