Scraping Weibo content with PHP: Sina Weibo data scraping, with word segmentation and a word cloud on the side

Latest recommended article published 2024-01-25 10:15:57

栗迦南

Article tags: php, Weibo content scraping

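# Next comes the function that fetches the data. So far only the feeds part is implemented; the other parts are similar and somewhat simpler, because they do not require refreshing the page.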

f_weibo_get

# N is the number of posts to fetch; hisnick is the target user's ID.

library(rjson)

memory.limit(4000)  # Windows-only; a no-op stub since R 4.2.0

# First, check how many pages there are

pg=1

the1url

the1get

write(the1get, "temp.txt")

the1get

idi

oid

idi

uid

# Account information (including the total number of posts)

infoi

a1 STK && STK.pageletM && STK.pageletM.view\\(','',the1get[infoi])

a1 ','',a1)

write(a1, 'a1.txt')

numberi ', a1))

number ')[[1]][2]

number ')[[1]][1]

pages

weibo_data

# Loop over the pages

for (pg in 1:pages){

# First screen

the1url

the1get

write(the1get, "temp.txt")

the1get

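# When viewing someone else's page the block is hisFeed; when viewing your own it is myFeed (the later URLs also differ slightly, mainly because the refresh requests need the uid)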

if(uid == oid){

myfeedi

}

if(uid != oid){

myfeedi

}

a1 STK && STK.pageletM && STK.pageletM.view\\(','',the1get[myfeedi])

a1 ','',a1)

write(a1, 'a1.txt')

# ID of the last post on this screen

lastmidi

lastmid

# Use it to request the second screen

the2url

'&count=15&max_id=', lastmid, '&pre_page=', pg, '&end_id=&pagebar=0&uid=', oid, sep='')

the2get

write(the2get, "temp.txt")

the2get

write(a2, 'a2.txt')

# ID of the last post on this screen

lastmidi

lastmid

# Use it to request the third screen

the3url

'&count=15&max_id=', lastmid, '&pre_page=', pg, '&end_id=&pagebar=1&uid=', oid, sep='')

the3get

write(the3get, "temp.txt")

the3get

write(a3, 'a3.txt')

# Pick out the body text of each post and join the pieces together

a123

index

a11

b [^<>]*

getcontent

paste(substring(string, greg+1, greg+attr(greg,'match.length')-2), collapse=' ')

}

a111

names(a111)

weibo_data

gc()

}

# Remove English letters and digits, and remove @mentions

weibo_data

return(weibo_data[1:min(as.numeric(number), N)])

}
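
The text-extraction and clean-up steps of the function can be sketched on their own, as below. This is only a rough illustration, not the author's exact code: the helper names `get_content` and `clean_text` are hypothetical, and the pattern `>[^<>]*<` is an assumption inferred from the offsets in the listing (`greg+1` and `match.length-2` drop one leading and one trailing character). The clean-up regular expressions are likewise assumptions about what "remove English letters, digits and @mentions" means in practice.

```r
# Hypothetical helper: collect the text that sits between HTML tags.
# Assumed pattern: a '>' followed by non-tag characters and then a '<'.
get_content <- function(string) {
  greg <- gregexpr(">[^<>]*<", string)[[1]]
  if (greg[1] == -1) return("")            # no tag-delimited text found
  # Drop the leading '>' and trailing '<' of every match
  pieces <- substring(string, greg + 1,
                      greg + attr(greg, "match.length") - 2)
  paste(pieces[nzchar(pieces)], collapse = " ")
}

# Hypothetical helper: strip @mentions, English letters and digits.
clean_text <- function(x) {
  x <- gsub("@[^[:space:]]+", "", x)       # drop @mentions
  x <- gsub("[A-Za-z0-9]+", "", x)         # drop English letters and digits
  x <- gsub("[[:space:]]+", " ", x)        # collapse runs of whitespace
  trimws(x)
}

# Usage on a made-up HTML fragment (not real Weibo markup)
html <- "<p>\u4eca\u5929 <a href='x'>@bob</a> \u5929\u6c14 123</p>"
cleaned <- clean_text(get_content(html))
print(cleaned)
```

Keeping extraction and clean-up as separate functions makes it easy to inspect the raw text before anything is thrown away, which helps when the page markup changes.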

# Log in

ch0

ch1

# Fetch the Weibo data (only done for my own account here; 10000 is simply a large enough number)

weibo_10000_0

weibo_10000_1

# The two results differ very slightly; for now it looks as if the posts shown to the account owner are more complete.

all(weibo_10000_0 %in% weibo_10000_1)

# FALSE

all(weibo_10000_1 %in% weibo_10000_0)

# TRUE
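
The two `%in%` checks above only say that one result is a subset of the other. To see which posts actually differ, `setdiff()` can be used. The sketch below uses made-up stand-in vectors (`result_a`, `result_b` are hypothetical names, not the real scraped data) purely to show the pattern.

```r
# Stand-in data: result_a plays the role of the more complete result
result_a <- c("post A", "post B", "post C")
result_b <- c("post A", "post B")

# Same pair of checks as in the listing
a_in_b <- all(result_a %in% result_b)    # some of a is missing from b
b_in_a <- all(result_b %in% result_a)    # everything in b is also in a

# The posts that appear only in the more complete result
only_in_a <- setdiff(result_a, result_b)
print(only_in_a)
```

Note that `setdiff()` works on unique values, so identical posts that occur more than once are reported a single time.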
