Let's keep improving together! Thanks for your support and for following.
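Problems encountered
- The difference between dynamically loaded pages and static pages
- Why json() is used to deserialize the response
- Persistent storage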
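On the json() question, a minimal sketch may help. It assumes the endpoint returns JSON text (a list of objects) rather than HTML, which is what the script in this post relies on. The sample body below is made up for illustration; only the `title` and `score` field names come from the script.

```python
import json

# Hypothetical sample of a response body; the values are illustrative,
# not real data. Only the field names match the script in this post.
sample_body = '[{"title": "Movie A", "score": "9.7"}, {"title": "Movie B", "score": "9.6"}]'

# Response.json() in requests does essentially this step:
# it parses the JSON text into Python objects (here, a list of dicts).
movies = json.loads(sample_body)

for movie in movies:
    # After deserialization the fields are plain dict lookups,
    # so no HTML parsing is needed.
    print(movie["title"], movie["score"])
```

Once the body is a list of dicts, each field is a plain key lookup, which is why the script can read `dic['title']` directly.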
Source code
import requests

# URL analysis: only the `start` parameter changes from page to page.
# https://movie.douban.com/j/chart/top_list?type=19&interval_id=100%3A90&action=&start=0&limit=20
# https://movie.douban.com/j/chart/top_list?type=19&interval_id=100%3A90&action=&start=20&limit=20
# https://movie.douban.com/j/chart/top_list?type=19&interval_id=100%3A90&action=&start=40&limit=20
url = "https://movie.douban.com/j/chart/top_list"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
}

# Open the output file once and let `with` close it,
# instead of reopening it on every page without closing.
with open('./douban.txt', 'a', encoding="utf-8") as fp:
    for num in range(0, 401, 20):
        params = {
            "type": 19,
            "interval_id": "100:90",  # requests URL-encodes this as 100%3A90
            "action": "",
            "start": num,  # the value that changes per page
            "limit": 20,
        }
        print(num)
        # timeout keeps a stalled request from hanging the script
        res = requests.get(url=url, params=params, headers=headers, timeout=10).json()
        # print(res)
        for dic in res:
            title = dic['title']
            score = dic['score']
            # f-string works whether score is a string or a number
            fp.write(f"{title}:{score}\n")
            print(title, 'saved successfully!')
print("Scraping finished")
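For the persistent storage point, the script writes `title:score` lines to a text file. One possible alternative is CSV, which is easier to reload later. The sketch below is only an illustration: `save_movies` and the `douban.csv` file name are hypothetical, and it assumes each movie is a dict with at least `title` and `score` keys.

```python
import csv

def save_movies(movies, path="douban.csv"):
    """Write the title and score of each movie dict to a CSV file.

    `save_movies` and the default path are illustrative names,
    not part of the original script.
    """
    # newline="" is what the csv module expects when opening files for writing.
    with open(path, "w", newline="", encoding="utf-8") as fp:
        # extrasaction="ignore" drops any other keys the dicts may carry.
        writer = csv.DictWriter(fp, fieldnames=["title", "score"], extrasaction="ignore")
        writer.writeheader()
        writer.writerows(movies)
```

It could be called once with all collected rows, for example `save_movies(all_movies)`, where `all_movies` is the list gathered across pages.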
Reflection
- Scraping is too slow: the pages are requested one at a time
- How can it be improved?
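One common way to speed this up is to request several pages at the same time with a thread pool, since most of the time is spent waiting on the network. The following is a rough sketch, not a tuned solution: the function names and `max_workers=5` are assumptions, and it presumes the site tolerates a few parallel requests.

```python
from concurrent.futures import ThreadPoolExecutor

URL = "https://movie.douban.com/j/chart/top_list"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # placeholder; use a full browser UA as in the script

def page_starts(total=400, limit=20):
    """Return the `start` offsets 0, limit, 2*limit, ... up to and including total."""
    return list(range(0, total + 1, limit))

def fetch_page(start):
    """Fetch one page of results as a list of dicts (hypothetical helper)."""
    import requests  # third-party; imported here so the other helpers run without it
    params = {"type": 19, "interval_id": "100:90", "action": "", "start": start, "limit": 20}
    resp = requests.get(URL, params=params, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.json()

def crawl_concurrently(fetch, starts, max_workers=5):
    """Call fetch(start) for every offset on a thread pool and flatten the pages.

    Executor.map returns results in the order of `starts`,
    so the output order matches a sequential crawl.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(fetch, starts))
    return [movie for page in pages for movie in page]
```

A call such as `crawl_concurrently(fetch_page, page_starts())` would cover the same offsets as the original loop. Keeping the worker count small is a deliberate choice here, to avoid hammering the server.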