The target is the Weibo hot-search page: https://s.weibo.com/top/summary/ — a simple ranked list of topics.
We extract it with pyquery. To be safe, add a User-Agent to the request headers:
from pyquery import PyQuery as pq

# pass the URL and headers as keyword arguments so pyquery fetches the page itself
html = pq(url="https://s.weibo.com/top/summary/",
          headers={
              'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36'
          })
Printing html at this point shows that the page was fetched successfully.
for item in html("#pl_top_realtimehot > table > tbody > tr").items():
    # skip rows with fewer than 3 fields (the pinned entry has no rank/heat)
    if len(item.text().split()) >= 3:
        print({
            '排名': int(item('td.td-01.ranktop').text()),  # rank
            '名称': item('td.td-02 > a').text(),           # title
            '热度': int(item('td.td-02 > span').text())    # heat
        })
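The length check above exists because the pinned promotional row at the top of the list has no rank or heat number. A small helper (my own sketch, not from the original) makes that guard explicit and keeps the int() conversions from raising:

```python
def parse_row(rank_text, title_text, heat_text):
    """Turn the three cell texts of one hot-search row into a dict.

    Returns None for rows that lack a numeric rank or heat value
    (e.g. the pinned promotional entry at the top of the list).
    """
    rank_text, heat_text = rank_text.strip(), heat_text.strip()
    if not (rank_text.isdigit() and heat_text.isdigit()):
        return None
    return {'排名': int(rank_text),       # rank
            '名称': title_text.strip(),   # title
            '热度': int(heat_text)}       # heat
```

Each item in the loop could then be passed as `parse_row(item('td.td-01.ranktop').text(), item('td.td-02 > a').text(), item('td.td-02 > span').text())`, and `None` results simply skipped.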
Then this loop writes out all of the results:
with open("weibo.csv", "a", encoding="utf-8") as f:  # open once, not once per row
    for item in html("#pl_top_realtimehot > table > tbody > tr").items():
        if len(item.text().split()) >= 3:
            f.write(item('td.td-01.ranktop').text() + "," +
                    item('td.td-02 > a').text() + "," +
                    item('td.td-02 > span').text() + "\n")
This stores the data as a CSV file.
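One caveat: joining fields with raw commas will corrupt any title that itself contains a comma. The standard-library csv module quotes such fields automatically; a minimal sketch (the function name is my own, not from the original):

```python
import csv

def append_rows(path, rows):
    """Append (rank, title, heat) tuples to a CSV file with proper quoting."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```

`newline=""` matters here: it lets the csv module control line endings itself, avoiding blank lines on Windows.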
But since we want to crawl the list on a daily schedule, each row also needs a timestamp:
import time

with open("weibo.csv", "a", encoding="utf-8") as f:
    # note: in append mode this header line is written again on every run
    f.write("日期,时间,排名,名称,热度\n")  # date, time, rank, title, heat
    for item in html("#pl_top_realtimehot > table > tbody > tr").items():
        if len(item.text().split()) >= 3:
            f.write(time.strftime("%Y-%m-%d") + "," +
                    time.strftime("%H:%M:%S") + "," +
                    item('td.td-01.ranktop').text() + "," +
                    item('td.td-02 > a').text() + "," +
                    item('td.td-02 > span').text() + "\n")
That does it.
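The timestamp columns only record when a snapshot was taken; they do not make the script run daily by themselves. On a server, cron or the Windows Task Scheduler is the robust way to do that, but a minimal in-process sketch looks like this (`crawl_and_save` is hypothetical, standing in for the scraping code above):

```python
import time
from datetime import datetime, timedelta

def seconds_until(hour, minute=0):
    """Seconds from now until the next occurrence of hour:minute."""
    now = datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # that time already passed today
    return (target - now).total_seconds()

# while True:
#     time.sleep(seconds_until(8))  # wait until 08:00 each day
#     crawl_and_save()              # hypothetical: re-fetch the page and append rows
```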
You can also save to an Excel file (this uses xlrd and xlutils):
from xlrd import open_workbook
from xlutils.copy import copy

try:
    r_xls = open_workbook("weibo.xls")  # open the existing workbook
    row = r_xls.sheets()[0].nrows       # index of the next empty row
    excel = copy(r_xls)                 # make a writable copy with xlutils
    table = excel.get_sheet(0)          # get_sheet takes a sheet index
    for item in html(