Task goal: collect the id, title, and URL of at least 50 news items from the 热点精选 (Hot Picks) section of the Tencent News homepage (https://news.qq.com/).
1. Scraping the data with Selenium
Use Selenium to simulate mouse/scroll actions so that the rest of the page gets loaded. Analyzing the requests fired when the page first opens also revealed the API endpoint that serves the news data.
# Scroll the page down so that enough data gets loaded
time.sleep(5)
# Scroll down 1000 pixels per step
js = "window.scrollBy(0, 1000)"
# Jumping straight to the bottom was also tried; neither of these worked:
# js = "window.scrollTo(0, document.body.scrollHeight)"
# js = "document.documentElement.scrollTop=500000"
for i in range(10):
    driver.execute_script(js)
    time.sleep(3)
from lxml import etree

html = driver.page_source
tree = etree.HTML(html)
# //li[@class="item cf"]/a[@class="picture"]/@href and
# //li[@class="item cf"]/a[@class="picture"]/img/@alt were tried first,
# but the resulting lists came back with inconsistent lengths
lis = tree.xpath('//li[@class="item cf"]/div[@class="detail"]/h3/a')
link = tree.xpath('//li[@class="item cf"]/div[@class="detail"]/h3/a/@href')
title = tree.xpath('//li[@class="item cf"]/div[@class="detail"]/h3/a/text()')
print(len(lis))
print(len(link))
print(len(title))
# Only move on once these three lengths agree
# Print the results
for i in range(len(lis)):
    print(title[i], link[i])
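The length mismatch mentioned above typically happens because not every `li` carries the picture anchor, while extracting all fields relative to the same `<a>` node keeps the lists aligned. A minimal sketch with synthetic markup (structure assumed from the XPaths above, not copied from the live page):

```python
from lxml import etree

# Synthetic HTML: the second item deliberately has no a.picture child
html = '''
<ul>
  <li class="item cf"><a class="picture" href="u1"><img alt="t1"/></a>
      <div class="detail"><h3><a href="u1">t1</a></h3></div></li>
  <li class="item cf">
      <div class="detail"><h3><a href="u2">t2</a></h3></div></li>
</ul>'''
tree = etree.HTML(html)
# Picture-based paths drift apart when an item lacks a picture ...
pics = tree.xpath('//li[@class="item cf"]/a[@class="picture"]/@href')
# ... while paths rooted at the same detail anchor stay aligned
links = tree.xpath('//li[@class="item cf"]/div[@class="detail"]/h3/a/@href')
titles = tree.xpath('//li[@class="item cf"]/div[@class="detail"]/h3/a/text()')
print(len(pics), len(links), len(titles))
```

Here `pics` has one entry while `links` and `titles` both have two, which is exactly the inconsistency observed on the real page.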
# Write the results to a file
import csv
# On Windows, omitting newline='' leaves a blank line between rows.
# Without an encoding the file is garbled in VS Code; with encoding='utf-8'
# it is garbled in Excel; only encoding='utf-8-sig' opens cleanly in both.
with open('test.csv', 'w', newline='', encoding='utf-8-sig') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(['index', 'title', 'url'])
    # Alternatively, f_csv.writerows([[row1], [row2], ...]) writes all rows at once
    for i in range(len(lis)):
        f_csv.writerow([i + 1, title[i], link[i]])
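The encoding behavior described above can be verified directly: utf-8-sig prepends a byte-order mark, which is what lets Excel detect the file as UTF-8. A small self-contained check (file name and row data illustrative):

```python
import csv

rows = [['index', 'title', 'url'],
        [1, '示例标题', 'https://news.qq.com/']]
# newline='' avoids blank rows on Windows; utf-8-sig writes a BOM for Excel
with open('demo.csv', 'w', newline='', encoding='utf-8-sig') as f:
    csv.writer(f).writerows(rows)

with open('demo.csv', 'rb') as f:
    raw = f.read()
print(raw[:3])  # the BOM: b'\xef\xbb\xbf'
```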
2. Finding the dynamic-loading URL and requesting it directly
- Ajax dynamic loading
- Find the URL of the dynamically loaded page
- The page parameter must be placed in the URL itself, otherwise only the first page of data is returned
- Save the results to a DataFrame and write them to a CSV file
- Column names: '标题' (title), '父类别' (parent category), '子类别' (subcategory), '标签' (tags), '文章来源' (source), '显示类型' (display type), '更新时间' (update time), '链接' (link)
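Since the page number has to live in the URL string itself, it is concatenated first and the remaining query parameters are passed via `params`; requests merges them onto the existing query. A quick way to inspect the final URL without sending anything (the parameter values here are an illustrative subset, not the real token):

```python
import requests

pre_url = 'https://pacaio.match.qq.com/irs/rcd?page='
params = {'cid': '137', 'ext': 'top'}  # illustrative subset of the real parameters
# .prepare() builds the request without sending it, so the URL can be inspected
req = requests.Request('GET', pre_url + str(3), params=params).prepare()
print(req.url)
```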
import requests
import time
import pandas as pd
import numpy as np

pre_url = 'https://pacaio.match.qq.com/irs/rcd?page='
headers = {
    'user-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'
}
info_list = []
for i in range(10):
    try:
        # The page number has to be part of the URL itself (see above)
        url = pre_url + str(i)
        params = {
            'cid': '137',
            'token': 'd0f13d594edfc180f5bf6b845456f3ea',
            'id': '',
            'ext': 'top',
            'expIds': '',
            # 'callback': '__jp1'
        }
        res = requests.get(url=url, headers=headers, params=params).json()
        for info in res['data']:
            result = {}
            result['标题'] = info['title']
            result['链接'] = info['vurl']
            result['父类别'] = info['category1_chn']
            result['子类别'] = info['category2_chn']
            # result['关键字'] = info['keywords']
            result['显示类型'] = info['showtype']
            result['文章来源'] = info['source']
            result['标签'] = info['tags']
            result['更新时间'] = info['update_time']
            info_list.append(result)
        time.sleep(2)
    except Exception:
        print(f'page {i} failed')

df = pd.DataFrame(info_list, columns=['标题', '父类别', '子类别', '标签',
                                      '文章来源', '显示类型', '更新时间', '链接'])
df.index = np.arange(1, len(df) + 1)
df.to_csv('QQ_news.csv')
print('Data collection finished')
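The commented-out callback parameter deserves a note: if it is sent, endpoints like this answer with JSONP (`__jp1({...})`) instead of plain JSON, and `.json()` raises. A small helper for unwrapping such a payload (the sample string below is synthetic):

```python
import json
import re

def strip_jsonp(text):
    """Remove a JSONP wrapper like __jp1({...}) and return the JSON body."""
    m = re.match(r'^[^(]*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    return m.group(1) if m else text

payload = '__jp1({"data": [{"title": "demo"}]})'
data = json.loads(strip_jsonp(payload))
print(data['data'][0]['title'])  # -> demo
```

Plain JSON passes through untouched, since the regex only fires when a wrapper is present.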
3. Scraping Zhihu data
Not finished yet.
The link:
https://www.zhihu.com/search?q=Datawhale&utm_content=search_history&type=content
Implement it with the requests library; Selenium-style browser automation is not allowed.
Hints:
The link requires login. Zhihu-login code can be found by searching GitHub and elsewhere; understand its logic, since copy-pasting code is allowed for this task.
As with the Ajax loading above, this time the Ajax requests must be crawled with requests. The storage format is up to you, but the Ajax flow, as analyzed with Chrome's developer tools, must be written up.
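Once logged in, the crawl itself is the same follow-the-next-link loop as in the Ajax section above. A sketch of that loop, decoupled from the network layer so the pagination logic stands on its own (the `paging` / `is_end` / `next` field names are assumptions to confirm in Chrome's developer tools; the mock dict stands in for real HTTP responses):

```python
def crawl_paged(fetch, start_url, max_pages=10):
    """Follow paging.next links until paging.is_end, collecting data items."""
    items, url = [], start_url
    for _ in range(max_pages):
        page = fetch(url)
        items.extend(page.get('data', []))
        paging = page.get('paging', {})
        if paging.get('is_end', True):
            break
        url = paging['next']
    return items

# Mock responses standing in for the real Ajax endpoint
mock = {
    'p1': {'data': [1, 2], 'paging': {'is_end': False, 'next': 'p2'}},
    'p2': {'data': [3], 'paging': {'is_end': True}},
}
items = crawl_paged(mock.get, 'p1')
print(items)  # -> [1, 2, 3]
```

In the real crawler, `fetch` would be something like `lambda url: session.get(url, headers=headers).json()` on a logged-in `requests.Session`.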