twitteR examples

weixin_33989058

Tags: R, big data, php

=============================================

Using R to analyze tweets and gauge consumer satisfaction with individual airlines

R by example: mining Twitter for consumer attitudes towards airlines

=============================================

Reference: http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment

======================

1. Industry satisfaction survey results by sector:

======================

Airlines

Airline	Baseline	95	96	97	98	99	00	01	02	03	04	05	06	07	08	09	10	11	Previous Year % Change	First Year % Change
Southwest	78	76	76	76	74	72	70	70	74	75	73	74	74	76	79	81	79	81	2.5	3.8
All Others	NM	70	74	70	62	67	63	64	72	74	73	74	74	75	75	77	75	76	1.3	8.6
Airlines	72	69	69	67	65	63	63	61	66	67	66	66	65	63	62	64	66	65	-1.5	-9.7
Continental	67	64	66	64	66	64	62	67	68	68	67	70	67	69	62	68	71	64	-9.9	-4.5
American	70	71	71	62	67	64	63	62	63	67	66	64	62	60	62	60	63	63	0.0	-10.0
United	71	67	70	68	65	62	62	59	64	63	64	61	63	56	56	56	60	61	1.7	-14.1
US Airways	72	67	66	68	65	61	62	60	63	64	62	57	62	61	54	59	62	61	-1.6	-15.3
Delta	77	72	67	69	65	68	66	61	66	67	67	65	64	59	60	64	62	56	-9.7	-27.3
Northwest Airlines	69	71	67	64	63	53	62	56	65	64	64	64	61	61	57	57	61	#	N/A	N/A

======================

2. What is being said on Twitter:

======================

RT @dave_mcgregor: Publicly pledging to never fly @delta again. The worst airline ever. U have lost my patronage forever due to ur incompetence

Completely unimpressed with @continental or @united. Poor communication, goofy reservations systems and all to turn my trip into a mess.

@United Weather delays may not be your fault, but you are in the customer service business. It's atrocious how people are getting treated!

We were just told we are delayed 1.5 hrs & next announcement on @JetBlue - “We're selling headsets.” Way to capitalize on our misfortune.

...


======================

3. Game Plan

======================

4. Fetching tweets about each airline

======================

This uses Jeff Gentry's twitteR package (for more general web scraping, R has other packages such as XML and RCurl).

> # load the package
> library(twitteR)
> # search by @-mention and cap the result at the n most recent tweets
> # get the 1,500 most recent tweets mentioning ‘@delta’:
> delta.tweets = searchTwitter('@delta', n=1500)
# The result is a list whose elements are objects of type “status” from the “twitteR” package.
# A “list” in R is a collection of objects, and its elements may be named or just numbered.
# “[[ ]]” is used to access elements.

> length(delta.tweets)
[1] 1500
> class(delta.tweets)
[1] "list"

> tweet = delta.tweets[[1]]
> class(tweet)
[1] "status"
attr(,"package")
[1] "twitteR"

The help page (“?status”) describes some accessor methods like getScreenName() and getText() which do what you would expect:

> tweet$getScreenName()
[1] "Alaqawari"
> tweet$getText()
[1] "I am ready to head home. Inshallah will try to get on the earlier flight to Fresno. @Delta @DeltaAssist"


======================

5. Extracting the text of the fetched tweets

======================

> library(plyr)  # provides laply()
> # laply() applies the same function to every element of a list and collects the results
> delta.text = laply(delta.tweets, function(t) t$getText() )
> length(delta.text)
[1] 1500
> head(delta.text, 5)
[1] "I am ready to head home. Inshallah will try to get on the earlier flight to Fresno. @Delta @DeltaAssist"
[2] "@Delta Releases 2010 Corporate Responsibility Report - @PRNewswire (press release) : http://tinyurl.com/64mz3oh"
[3] "Another week, another upgrade! Thanks @Delta!"
[4] "I'm not able to check in or select a seat for flight DL223/KL6023 to Seattle tomorrow. Help? @KLM @delta"
[5] "In my boredom of waiting realized @deltaairlines is now @delta seriously..... Stil waiting and your not even unloading status yet"
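
The per-element extraction above can also be sketched in base R, without plyr. The `make.status` helper and the `fake.tweets` list below are hypothetical stand-ins for twitteR's “status” objects (plain lists carrying a `getText` function), not real twitteR output:

```r
# Hypothetical stand-ins for twitteR "status" objects: each element is a
# plain list carrying a getText() function, so that t$getText() works.
make.status = function(text) list(getText = function() text)

fake.tweets = list(
    make.status("Another week, another upgrade! Thanks @Delta!"),
    make.status("Still waiting at the gate @delta")
)

# vapply() calls the function on every list element and returns a
# character vector, much as laply() does in the transcript above.
fake.text = vapply(fake.tweets, function(t) t$getText(), character(1))

length(fake.text)
fake.text[2]
```

`vapply()` also checks that every call returns exactly one string, which catches malformed elements early.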


======================

6. Loading the sentiment word lists

======================

1. Download Hu & Liu’s opinion lexicon from http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html (search for ‘opinion lexicon’).

2. Loading data is one of R’s strengths. These are simple text files, though they use “;” as a comment character at the beginning:

> hu.liu.pos = scan('../data/opinion-lexicon-English/positive-words.txt', what='character', comment.char=';')
> hu.liu.neg = scan('../data/opinion-lexicon-English/negative-words.txt', what='character', comment.char=';')


3. Add a few industry-specific and/or especially emphatic terms:

> pos.words = c(hu.liu.pos, 'upgrade')
> neg.words = c(hu.liu.neg, 'wtf', 'wait',
'waiting', 'epicfail', 'mechanical')
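
To see what `comment.char=';'` does without downloading the lexicon, here is a minimal sketch that writes a tiny stand-in file to a temporary path. The file contents and the `demo.pos` name are made up for illustration:

```r
# Write a tiny stand-in lexicon file; like the real files, it starts with
# header lines that begin with ";".
lexicon.file = tempfile(fileext = ".txt")
writeLines(c("; header line, skipped by scan",
             ";",
             "awesome",
             "love",
             "happy"), lexicon.file)

# what='character' reads whitespace-separated tokens as strings;
# comment.char=';' drops everything from ';' to the end of each line.
demo.pos = scan(lexicon.file, what = 'character', comment.char = ';', quiet = TRUE)

# Extend the list with a domain-specific term, as in step 3.
demo.pos = c(demo.pos, 'upgrade')
demo.pos
```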


======================

7. Defining a simple function that turns the word lists into a sentiment score:

======================

score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
    require(plyr)
    require(stringr)

    # we got a vector of sentences. plyr will handle a list or a vector as an "l" for us
    # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
    scores = laply(sentences, function(sentence, pos.words, neg.words) {

        # clean up sentences with R's regex-driven global substitute, gsub():
        sentence = gsub('[[:punct:]]', '', sentence)
        sentence = gsub('[[:cntrl:]]', '', sentence)
        sentence = gsub('\\d+', '', sentence)
        # and convert to lower case:
        sentence = tolower(sentence)

        # split into words. str_split is in the stringr package
        word.list = str_split(sentence, '\\s+')
        # sometimes a list() is one level of hierarchy too much
        words = unlist(word.list)

        # compare our words to the dictionaries of positive & negative terms
        pos.matches = match(words, pos.words)
        neg.matches = match(words, neg.words)

        # match() returns the position of the matched term or NA
        # we just want a TRUE/FALSE:
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)

        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        score = sum(pos.matches) - sum(neg.matches)

        return(score)
    }, pos.words, neg.words, .progress=.progress )

    scores.df = data.frame(score=scores, text=sentences)
    return(scores.df)
}
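
To make the scoring rule concrete, the sketch below traces a single sentence through the same steps in base R. The two word lists and the sentence are toy assumptions, not the Hu & Liu lexicon:

```r
# Toy word lists, made up for this trace.
toy.pos = c('awesome', 'love', 'happy', 'upgrade')
toy.neg = c('hate', 'angry', 'delay', 'waiting')

sentence = "Love the upgrade, but I HATE waiting 45 minutes for a delay!"

# Same cleaning steps as score.sentiment(): drop punctuation, control
# characters and digits, then lower-case.
clean = gsub('[[:punct:]]', '', sentence)
clean = gsub('[[:cntrl:]]', '', clean)
clean = gsub('\\d+', '', clean)
clean = tolower(clean)

# strsplit() is the base-R counterpart of stringr::str_split().
words = unlist(strsplit(clean, '\\s+'))

# %in% gives TRUE/FALSE per word, equivalent to !is.na(match(...)).
pos.hits = words %in% toy.pos
neg.hits = words %in% toy.neg

# Positive hits (love, upgrade) minus negative hits (hate, waiting, delay).
toy.score = sum(pos.hits) - sum(neg.hits)
toy.score
```

The rule only counts words, so negation ("not happy") and sarcasm are scored at face value.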


======================

8. Scoring text with the function defined above:

======================

Example 1: sample text

> sample = c("You're awesome and I love you",
"I hate and hate and hate. So angry. Die!",
"Impressed and amazed: you are peerless in your achievement of unparalleled mediocrity.")
> result = score.sentiment(sample, pos.words, neg.words)
> class(result)
[1] "data.frame"
> result$score
[1]  2 -5  4

# data.frames hold tabular data, so they consist of columns & rows which can be accessed by name or number. Here, “score” is the name of a column.

Example 2: several real tweets

> score.sentiment(c("@Delta I'm going to need you to get it together. Delay on tarmac, delayed connection, crazy gate changes... #annoyed",
"Surprised and happy that @Delta helped me avoid the 3.5 hr layover I was scheduled for. Patient and helpful agents. #remarkable"), pos.words, neg.words)$score
[1] -4  5
> result
  score text
1     2 You're awesome and I love you
2    -5 I hate and hate and hate. So angry. Die!
3     4 Impressed and amazed: you are peerless in your achievement of unparalleled mediocrity.
> result[1,1]
[1] 2
> result[1,'score']
[1] 2
> result[1:2, 'score']
[1]  2 -5
> result[c(1,3), 'score']
[1] 2 4
> result[,'score']
[1]  2 -5  4

Example 3: Delta tweets: just feed their text into score.sentiment():

> delta.scores = score.sentiment(delta.text, pos.words, neg.words, .progress='text')  # progress bar provided by plyr
|==================================================| 100%

# Let’s add two new columns to identify the airline for when we combine all the scores later:
> delta.scores$airline = 'Delta'
> delta.scores$code = 'DL'


======================

9. Viewing the results as histograms:

======================

# R’s built-in hist() function will create and plot histograms of your data:
> hist(delta.scores$score)

# ggplot2 is an alternative graphics package which generates more refined graphics:
> library(ggplot2)
> qplot(delta.scores$score)

# Combine the data for all airlines to compare the results.
# combine all the results into a single “all.scores” data.frame:
> all.scores = rbind( american.scores, continental.scores, delta.scores, jetblue.scores, southwest.scores, united.scores, us.scores ) # rbind() combines rows from data.frames, arrays, and matrices

# ggplot2 implements the “grammar of graphics”, building plots in layers:
> ggplot(data=all.scores) + # ggplot works on data.frames, always
    geom_histogram(mapping=aes(x=score, fill=airline), binwidth=1) + # current ggplot2: geom_bar() no longer accepts binwidth
    facet_grid(airline~.) + # make a separate plot for each airline
    theme_bw() + scale_fill_brewer() # plain display, nicer colors

# ggplot2’s faceting capability makes it easy to generate the same graph for different values of a variable, in this case “airline”.


======================

10. Improving the signal: ignore the middle

======================

# Let’s focus on very negative (score <= -2) and very positive (score >= 2) tweets:
> all.scores$very.pos = as.numeric( all.scores$score >= 2 )
> all.scores$very.neg = as.numeric( all.scores$score <= -2 )

# For each airline ( airline + code ), let’s use the percentage of very positive tweets among all very positive and very negative tweets as the overall sentiment score:
> twitter.df = ddply(all.scores, c('airline', 'code'), summarise, pos.count = sum( very.pos ), neg.count = sum( very.neg ) )
> twitter.df$all.count = twitter.df$pos.count + twitter.df$neg.count
> twitter.df$score = round( 100 * twitter.df$pos.count / twitter.df$all.count )
# Sort with orderBy() from the doBy package:
> library(doBy)
> orderBy(~-score, twitter.df)
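
The same per-airline score can be sketched in base R, with `aggregate()` and `order()` standing in for ddply() and orderBy(). The `toy.scores` data frame below is invented for illustration and is not real tweet data:

```r
# Hypothetical scores for two airlines; not real data.
toy.scores = data.frame(
    airline = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
    score   = c( 3,   2,   0,  -2,   2,  -3,  -2,   1)
)

# Flag the strongly positive and strongly negative tweets.
toy.scores$very.pos = as.numeric(toy.scores$score >= 2)
toy.scores$very.neg = as.numeric(toy.scores$score <= -2)

# Sum both flag columns per airline (base-R stand-in for ddply + summarise).
toy.df = aggregate(toy.scores[c('very.pos', 'very.neg')],
                   by = list(airline = toy.scores$airline), FUN = sum)
names(toy.df)[2:3] = c('pos.count', 'neg.count')

# Percentage of very positive tweets among the strongly opinionated ones.
toy.df$all.count = toy.df$pos.count + toy.df$neg.count
toy.df$score = round(100 * toy.df$pos.count / toy.df$all.count)

# Sort from highest to lowest score.
toy.df[order(-toy.df$score), ]
```

Tweets scoring between -1 and 1 (here the 0 and the 1) contribute to neither count, which is the point of ignoring the middle.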


======================

11. Comparing with the survey scores

======================

Step 1: use the XML package to extract the table data from the web page

# XML package provides amazing readHTMLTable() function:
> library(XML)
> acsi.url = 'http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines'
> acsi.df = readHTMLTable(acsi.url, header=T, which=1, stringsAsFactors=F)
> # only keep column #1 (name) and #18 (2010 score)
> acsi.df = acsi.df[,c(1,18)]
> head(acsi.df,1)
                     10
1 Southwest Airlines 79

# Well, typing metadata is OK, I guess... clean up column names, etc:

> colnames(acsi.df) = c('airline', 'score')
> acsi.df$code = c('WN', NA, 'CO', NA, 'AA', 'DL', 
'US', 'NW', 'UA')
> acsi.df$score = as.numeric(acsi.df$score)


Step 2: compare the two data sets

# merge() joins two data.frames by the specified “by=” fields. You can specify ‘suffixes’ to rename conflicting column names:

> compare.df = merge(twitter.df, acsi.df, by='code',
suffixes=c('.twitter', '.acsi'))

# Unless you specify “all=T”, non-matching rows are dropped (like a SQL INNER JOIN), and that’s what happened to top scoring JetBlue.

# With a very low score, and low traffic to boot, soon-to-disappear Continental looks like an outlier. Let’s exclude:

> compare.df = subset(compare.df, all.count > 100)
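
The difference between the default inner join and `all=TRUE` is easy to check on two miniature tables. The `tw` and `ac` data frames and their scores are hypothetical:

```r
# Hypothetical miniature versions of twitter.df and acsi.df.
tw = data.frame(code = c('DL', 'UA', 'B6'),
                airline = c('Delta', 'United', 'JetBlue'),
                score = c(40, 35, 70))
ac = data.frame(code = c('DL', 'UA', 'WN'),
                airline = c('Delta', 'United', 'Southwest'),
                score = c(62, 60, 79))

# Default merge: only codes present in both tables survive (inner join).
inner = merge(tw, ac, by = 'code', suffixes = c('.twitter', '.acsi'))

# all=TRUE keeps unmatched rows from both sides and fills the gaps with NA.
outer = merge(tw, ac, by = 'code', all = TRUE, suffixes = c('.twitter', '.acsi'))

names(inner)
as.character(inner$code)
nrow(outer)
```

In the inner join the unmatched 'B6' row disappears, just as JetBlue does in the text; in the outer join it stays, with NA in the `.acsi` columns.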

# View the comparison graphically

# ggplot will even run lm() linear (and other) regressions for you with its geom_smooth() layer:

> ggplot( compare.df ) +
    geom_point(aes(x=score.twitter, y=score.acsi, color=airline.twitter), size=5) +
    geom_smooth(aes(x=score.twitter, y=score.acsi, group=1), se=F, method="lm") +
    theme_bw() +
    theme(legend.position=c(0.2, 0.85)) # theme() replaces opts(), which was removed from ggplot2
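
The line that `geom_smooth(method="lm")` draws is an ordinary least-squares fit, which can be reproduced directly with `lm()`. The `toy.compare` values below are made up and chosen to lie exactly on a line:

```r
# Hypothetical scores, constructed to lie exactly on a line; not real data.
toy.compare = data.frame(score.twitter = c(20, 30, 40, 50),
                         score.acsi    = c(60, 65, 70, 75))

# The same kind of model geom_smooth(method="lm") fits: ACSI score as a
# linear function of the Twitter score.
fit = lm(score.acsi ~ score.twitter, data = toy.compare)
coef(fit)

# Pearson correlation between the two scores.
cor(toy.compare$score.twitter, toy.compare$score.acsi)
```

With only a handful of airlines, a fit like this is descriptive rather than evidence of a reliable relationship.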


====================

A sample analysis of Chinese-language Twitter users with the twitteR package

=======================

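Social media carries an enormous volume of information and has long been a popular field for data analysis. As the hottest microblogging site, Twitter has attracted a large number of users and holds a great deal of information. Twitter data really has two parts: the content of the tweets and the users who post them. This article attempts a rough analysis of Chinese-language Twitter users, in the hope of prompting better work from others.

Some time ago someone ran a survey of Chinese-language Twitter users, collecting data from 500 people by questionnaire and reaching some interesting conclusions. This example instead uses the twitteR package to gather a larger sample of Chinese-language users for analysis. The first problem is how to identify Chinese-language users. Inspired by a blog post, the author's approach is to start from a prominent figure in the Chinese Twitter community, whose followers should be mostly Chinese-language users. The account chosen here is Lian Yue (lianyue), who has about 80,000 followers. Why not Ai Weiwei, who has 120,000 followers? Because he is more international, and a good many of his followers are foreigners.

We first load the twitteR package, then get the “user” object and fetch 5,000 of lianyue's followers. Note that if you are located in mainland China you will need a VPN to get past the firewall before you can fetch the data, and it will take some time. The data is then converted to a data frame for easier handling.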

library(twitteR)
lianyue <- getUser('lianyue')
follow.lian <- lianyue$getFollowers(n=5000)
df.lian <- do.call('rbind',lapply(follow.lian,as.data.frame))
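
The data above contains four variables that can be used for analysis:

statusesCount: number of tweets posted
followersCount: number of followers
friendsCount: number of friends (accounts followed)
created: account creation time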

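A quick plot of the data above reveals outliers (for example a few compulsive tweeters and big names), which you could of course study further. This example only looks at the ordinary crowd; after removing the outliers, 4,989 samples remain and are stored in the subset df.sub. The account creation time is then converted into two formats that are easier to work with.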

df.sub <- subset(df.lian,friendsCount<2300 & followersCount<3000 & statusesCount<10000)
df.sub$time <- as.Date(df.sub$created)
df.sub$ntime <- as.numeric(df.sub$time)
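
The filtering and date conversion can be tried on a few invented rows. The `toy.users` data frame below is hypothetical; only the column names and thresholds follow the text:

```r
# Hypothetical follower records; column names follow the text above.
toy.users = data.frame(
    friendsCount   = c(120, 5000, 300),
    followersCount = c(80, 200, 45000),
    statusesCount  = c(950, 400, 20000),
    created = as.POSIXct(c('2010-03-01 08:00:00', '2011-09-15 12:30:00',
                           '2009-06-20 23:10:00'), tz = 'UTC')
)

# Keep only rows below all three thresholds.
toy.sub = subset(toy.users,
                 friendsCount < 2300 & followersCount < 3000 & statusesCount < 10000)

# as.Date() drops the time of day; as.numeric() then gives the number of
# days since 1970-01-01, which a model can use as a plain numeric predictor.
toy.sub$time  = as.Date(toy.sub$created)
toy.sub$ntime = as.numeric(toy.sub$time)
toy.sub[, c('time', 'ntime')]
```

Only the first row passes all three thresholds: the second follows too many accounts, and the third has too many followers and tweets.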

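Skipping descriptive statistics, we first look at how the account creation times are distributed, loading ggplot2 and drawing a histogram with 30-day bins. New accounts surged in the second half of 2011. Was there a breakthrough in firewall-circumvention tools, or did a state-backed team move in en masse? Interested readers can dig further into the cause.

library(ggplot2)
p <- ggplot(df.sub,aes(x=time))
# current ggplot2: geom_bar() no longer accepts binwidth, so bin the dates with geom_histogram()
p + geom_histogram(fill='red',colour='black',binwidth=30)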

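Second, we draw a scatter plot of these variables to look at how they relate. Besides the two numeric variables on the X and Y axes, point size represents the number of tweets and colour represents the account creation time: larger points mean more tweets, bluer points mean older accounts, and redder points mean newer ones.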

p <- ggplot(data=df.sub,aes(x=friendsCount,y=followersCount))
p + geom_point(aes(size=statusesCount,colour=ntime),alpha=0.8)

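The plot shows most users clustered in the lower-left corner, and users with more followers seem mostly to be those who tweet more or who joined earlier. To check this we try an additive model. For background on additive models, see one of the author's earlier blog posts.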

library(mgcv)
model <- gam(followersCount~s(friendsCount)+s(statusesCount)+s(ntime),data=df.sub)
par(mfrow=c(1,3))
plot(model,se=T)

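Now the final results. The first panel shows that the more people you follow, the more likely you are to be followed, though there are exceptions. The second panel shows that the more you tweet, the more likely you are to be followed, but nobody likes a chatterbox: tweet too much and you get unfollowed, so what matters is tweet quality. The third panel roughly shows that accounts created earlier have more followers. Presumably the first wave of Twitter users was an elite founding cohort.

This article draws only a sample of a few thousand records to analyze Chinese-language Twitter users. The sampling method is not necessarily rigorous, so the sample may not be representative. Interested readers can split the sample into finer categories for further analysis, or apply other methods such as cluster analysis to reach more interesting conclusions.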

=====================

How to analyze the investor sentiment contained in tweets

=========================

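Investor sentiment is an important factor reflecting investor psychology. It is a market-mood indicator of investors' willingness to invest or of their expectations, and it has a large influence on how securities markets operate and develop. Investor sentiment indices have traditionally been measured in two ways: directly, through questionnaires and similar surveys, or indirectly, by analyzing trading data.

With the rise of social media, tweets can also be treated as a sentiment indicator. The London-based fund Derwent Capital Markets used tweets to gauge public mood and predict stock market movements; during the global market slump of 2011 it still achieved a 1.85% return, ahead of the S&P 500 index.

This example relies entirely on the code and ideas of the article Mining Twitter for Airline Consumer Sentiment (which recently took second place in an R business applications contest) to measure investor sentiment towards the S&P 500. The analysis is simple and crude: it ignores the timing of the tweets and any feedback effects with the stock market. Just for fun.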

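# First load the twitteR package, then search for 1,500 tweets with the keyword sp500
library(twitteR)
sp500 <- searchTwitter('sp500',n=1500)
# Then load the plyr package and use laply() to extract the text of every tweet.
library(plyr)
sp.text <- laply(sp500,function(t) t$getText())
# Download the archive of positive and negative sentiment words from http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar and extract it locally
pos <- scan('d:/positive-words.txt',what='character',comment.char=';')
neg <- scan('d:/negative-words.txt',what='character',comment.char=';')
pos.words <- c(pos, 'up','bull')
neg.words <- c(neg, 'down','bear')
# Finally, use score.sentiment() to match the tweets against the word lists. The result is stored in result, where score is the number of positive words minus the number of negative words in each tweet. A positive score indicates positive sentiment, a negative score negative sentiment.
result <- score.sentiment(sp.text,pos.words,neg.words)
data=result$score
table(factor(data))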

 -5  -3  -2  -1   0   1   2   3   4   5
  4  13  38 113 813 120  34   2   3   1

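The results above show that sentiment towards sp500 is essentially neutral, with no strong positive or negative feeling. Derwent's system is of course far more sophisticated: it randomly tracks 10% of the tweets on Twitter, which amounts to 100 million tweets a day, and then processes the data in two ways: 1. comparing positive and negative comments, and 2. using a Google program to identify six moods: calm, alert, sure, vital, kind, and happy. The system's developer, Professor Johan Bollen, is reported to predict the direction of the Dow Jones index with an accuracy of up to 87.6%.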
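
The “essentially neutral” reading of the score table can be checked with a few lines of arithmetic. The counts below are copied from the table; the variable names are only illustrative:

```r
# Score counts copied from the table above.
score.counts = c('-5' = 4, '-3' = 13, '-2' = 38, '-1' = 113, '0' = 813,
                 '1' = 120, '2' = 34, '3' = 2, '4' = 3, '5' = 1)
score.values = as.numeric(names(score.counts))

# Shares of neutral, negative, and positive tweets among those tabulated.
total = sum(score.counts)
neutral.share  = score.counts[['0']] / total
negative.share = sum(score.counts[score.values < 0]) / total
positive.share = sum(score.counts[score.values > 0]) / total

# Count-weighted mean score.
mean.score = sum(score.values * score.counts) / total

round(100 * c(neutral = neutral.share, negative = negative.share,
              positive = positive.share), 1)
mean.score
```

The table covers 1,141 tweets. Roughly seven in ten score exactly 0, and the negative and positive tails are nearly the same size (168 versus 160 tweets), so the mean score sits just below zero.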

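Derwent is not alone in wanting to use a “sentiment index” to predict the stock market. Bloomberg and the market-prediction firm WiseWindow have also begun studying the relationship between conversations on social networks and stock prices. Twitter, Facebook, and blogs are all included, and information is collected by “industry”: for example, gathering the opinions people express about “cars” in order to anticipate the share performance of Ford or General Motors. WiseWindow CEO Sid Mohasseb says the research is centered on “people”: when you can capture people saying “I want this, I don't want that”, you can anticipate which companies' revenue those choices will drive; what we do is speed up the analysis.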

=================

Who should you follow on Twitter?

====================

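Data mining in social media is one of today's hot fields. Every platform company wants to find users' preferences in their data and offer personalized services. One such service is recommending accounts worth following on a microblogging platform. If a user has used the service for a while and already follows some accounts, there is a simple way to suggest more.

As the saying goes, birds of a feather flock together. Friends mostly share the same tastes, and by extension a friend's friend could well become a new friend. A recommendation service can therefore work like this: first get the list of accounts the user already follows, then find the accounts each of those friends follows, and use the frequency counts as recommendation weights. Below we implement this idea in R, using Twitter as the example.

Starting from the author's own account (@xccds), we extract information on 100 followed accounts. For each of them we then extract 100 of the accounts they follow, merge the results into a frequency table, and remove accounts the author already follows. After sorting, we keep the five with the highest frequency and draw a bar chart of the final recommendations. A look at these five accounts suggests the results are fairly reasonable and that they are worth following. Regular Twitter users may want to give it a try.

Note: because of Twitter API limits, not all 10,000 records were retrieved; the chart was drawn from roughly 2,000 records.

The R code is as follows: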

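rm(list=ls())
library(twitteR) # load the package
myid <- getUser('xccds') # get the user's information
# get the screen names of 100 followed accounts
myfo <- twListToDF(myid$getFriends(n=100))$screenName
record <- character()
for (i in seq_along(myfo)){
  user <- getUser(myfo[i])  # get the followed account's information
  # get the accounts that this account follows
  ffo <- twListToDF(user$getFriends(n=100))$screenName
  record <- c(record, as.character(ffo))
}
# build the frequency table
table.record <- table(record)
# convert the table to a data frame
data = as.data.frame(table.record,stringsAsFactors=F)
# remove accounts that are already followed
data <- data[data$record%in%setdiff(record,myfo),]
# keep the five with the highest frequency
data <- data[order(data$Freq,decreasing=T)[1:5],]
# load ggplot2 and draw the bar chart
library(ggplot2)
p <- ggplot(data,aes(record,Freq))
# stat='identity' plots the Freq values themselves as bar heights
p + geom_bar(aes(fill=Freq), stat='identity')+coord_flip()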
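
The friend-of-a-friend counting can be tested without the Twitter API on a small in-memory graph. The `follows` list and every account name in it are invented for illustration:

```r
# Hypothetical follow graph: each entry lists the accounts that user follows.
follows = list(
    me    = c('alice', 'bob', 'carol'),
    alice = c('bob', 'dave', 'erin'),
    bob   = c('dave', 'erin', 'frank'),
    carol = c('dave', 'alice', 'me')
)

my.friends = follows[['me']]

# Collect everyone my friends follow into one long character vector.
candidates = unlist(follows[my.friends], use.names = FALSE)

# Drop myself and the accounts I already follow, then count the rest.
candidates = candidates[!(candidates %in% c('me', my.friends))]
freq = sort(table(candidates), decreasing = TRUE)

# Top recommendations, most frequently followed first.
head(freq, 2)
```

Here 'dave' is followed by all three friends and 'erin' by two, so they rank first and second.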

====================

A Twitter word cloud about Jeremy Lin

========================

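A tag cloud or word cloud is a visual depiction of keywords, used to summarize user-generated tags or the text content of a website. A word cloud is in effect a visualization of a document's word-frequency table, with the importance of each word shown mainly through font size or colour. The technique is often used to make hot topics or text content concrete and vivid. Enough preamble: while Jeremy Lin is still in the news, let's see which key words people use when they talk about him on Twitter.

Take a look at the word cloud. It makes sense that his team's name, knicks, appears frequently; the nickname linsanity that is circulating online is clearly popular; and many tweets compare him with kobe bryant. The rest is left for you to explore.

R code: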

# load the packages used
library(twitteR)
library(tm)
library(wordcloud)

# set the search term, fetch 1,000 tweets and convert them to a data frame
searchTerm = '#Jeremy Lin'
raw.data <- searchTwitter(searchTerm,n=1000)
tw.df <- twListToDF(raw.data)

# strip URLs from the tweets with gsub(); "http\\S+" matches each link up to the next whitespace
Remove <- function(tweet) {
  gsub("http\\S+", "", tweet)
}
tweets <- as.vector(sapply(tw.df$text, Remove))

# build a corpus with Corpus() from the tm package, then preprocess it
tw.corpus <- Corpus(VectorSource(tweets))
tw.corpus <- tm_map(tw.corpus, stripWhitespace)
tw.corpus <- tm_map(tw.corpus, removePunctuation)
# newer tm versions need content_transformer() around base functions such as tolower
tw.corpus <- tm_map(tw.corpus, content_transformer(tolower))
tw.corpus <- tm_map(tw.corpus,removeWords,stopwords('english'))

# build the term-document matrix; wordLengths replaces the older minWordLength option
doc.matrix <- TermDocumentMatrix(tw.corpus,control = list(wordLengths = c(1, Inf)))
dm <- as.matrix(doc.matrix)
v <- sort(rowSums(dm),decreasing=T)
d <- data.frame(word=names(v),freq=v)

# drop the two most frequent words (expected to be jeremy and lin), then draw the word cloud
data <- d[c(-1,-2),]
mycolors <- colorRampPalette(c("white","red"))(200)
wc <-wordcloud(data$word,data$freq,min.freq=15,colors=mycolors[100:200])
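
The word-frequency table behind a word cloud can be sketched in base R, without tm. The three tweets and the short stop-word list below are made up for illustration; tm's `stopwords('english')` list is much longer:

```r
# Hypothetical tweets, for illustration only.
toy.tweets = c("Linsanity! Jeremy Lin leads the Knicks again http://t.co/abc",
               "Jeremy Lin and the Knicks beat Kobe",
               "linsanity is real")

# Strip URLs, punctuation, and case, mirroring the preprocessing above.
clean = gsub("http\\S+", "", toy.tweets)
clean = tolower(gsub("[[:punct:]]", "", clean))

# Split on whitespace and drop a small hand-picked stop-word list,
# together with the search terms themselves.
words = unlist(strsplit(clean, "\\s+"))
words = words[!(words %in% c("the", "and", "is", "jeremy", "lin", ""))]

# Frequency table sorted from most to least common.
freq = sort(table(words), decreasing = TRUE)
d = data.frame(word = names(freq), freq = as.vector(freq))
d
```

A data frame of this shape (one `word` column, one `freq` column) is what the `wordcloud()` call above takes as its first two arguments.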

========================

What a Twitter job posting says about the importance of R

============================

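With the arrival of the big-data era, demand for people skilled in data science and R keeps growing. A while ago US President Obama's campaign offered data-analysis positions when recruiting its team and asked for R skills. The well-known internet company Twitter recently posted a job opening for a data scientist or analyst to help it extract information from big data. The requirements show how the industry views data science and R. If you want to work in data science in the future, this information can also help you see which skills you need. Twitter's posting reads as follows:

About the job:
We are looking for highly motivated people to help us extract insights from Twitter's large-scale data. As a data scientist on the analytics team, you will use statistical analysis and data-mining techniques to help us better understand our users, decide whether new features should be launched, and measure the success of the organization as a whole. You should be passionate about finding knowledge in data and use quantitative methods to answer complex questions.

Responsibilities: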