Scraping Douban Comments
Part of the code is shown below.
Configure the proxy pool:

```python
proxies_pool = [
    {'http': '114.233.71.160:9000'},
    {'http': '111.225.153.132:8089'},
    {'http': '180.105.117.139:8089'},
    {'http': '111.225.152.68:8089'},
    {'http': '36.138.56.214:3128'},
    {'http': '175.100.72.95:57938'},
    {'http': '198.199.74.99:59166'},
    {'http': '176.100.216.154:8087'},
    {'http': '4.16.68.158:443'},
    {'http': '161.35.70.249:8080'},
]
```
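Each outgoing request can rotate through this pool with `random.choice`. A minimal standalone sanity check of the selection logic (a two-entry pool stands in for the full list above):

```python
import random

# trimmed stand-in for the full proxies_pool defined above
proxies_pool = [
    {'http': '114.233.71.160:9000'},
    {'http': '111.225.153.132:8089'},
]

# pick one proxy dict at random for a single request
proxy = random.choice(proxies_pool)
print(proxy)
```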
Import the required Python libraries:

```python
from lxml import etree
import urllib.request
import urllib.parse
import random
```
Build the request, encoding the page number into the URL and attaching a browser User-Agent:

```python
def creat_request(page):
    base_url = 'https://movie.douban.com/subject/35267208/comments?'
    # Douban serves 20 comments per page, paged via the start offset
    data = urllib.parse.urlencode({
        'start': (page - 1) * 20,
        'limit': 20,
    })
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.56'
    }
    return urllib.request.Request(url=base_url + data, headers=headers)
```
Define a method to fetch the Douban comment data, sending each request through a proxy drawn at random from the pool:

```python
def get_content(request):
    # rotate proxies: pick one at random from the pool for this request
    proxies = random.choice(proxies_pool)
    handler = urllib.request.ProxyHandler(proxies=proxies)
    opener = urllib.request.build_opener(handler)
    response = opener.open(request)
    tree = etree.HTML(response.read().decode('utf-8'))
    # each short comment lives in a <span class="short"> element
    result = tree.xpath('//span[@class="short"]/text()')
    return result
```
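The XPath step can be tried offline. Here is a self-contained sketch using a small HTML snippet modeled on the `span.short` structure the scraper targets (the sample markup is made up for illustration):

```python
from lxml import etree

# minimal HTML mimicking the structure of a Douban comment page
html = '''
<div class="comment">
  <span class="short">Great movie, loved it.</span>
  <span class="short">A bit slow in the middle.</span>
</div>
'''

tree = etree.HTML(html)
# same XPath as in get_content: the text of every <span class="short">
comments = tree.xpath('//span[@class="short"]/text()')
print(comments)
```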
Define a method to save the scraped comments for each page as a text file:

```python
def down_load(page, result):
    # write one comment per line to ./doubancommentN.txt
    with open('./doubancomment' + str(page) + '.txt', mode='w', encoding='utf-8') as file:
        file.write('\n'.join(result))
```
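A driver loop then ties the three functions together. The sketch below shows the loop in comments and factors out a testable helper for the pagination query string (the `start`/`limit` parameters are an assumption about Douban's comment pagination):

```python
import urllib.parse

def page_params(page):
    # 20 comments per page, addressed via a start offset (assumed layout)
    return urllib.parse.urlencode({'start': (page - 1) * 20, 'limit': 20})

# the scraping loop would then look like:
#   for page in range(1, 11):
#       request = creat_request(page)   # build the Request for this page
#       result = get_content(request)   # fetch and parse via a random proxy
#       down_load(page, result)         # save to ./doubancommentN.txt

print(page_params(3))
```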
That covers the core of the code. The scraper is simple and fast, and it can be extended to capture more fields (such as the reviewer's username and star rating) for richer results.
![规规矩矩的爬取](https://img-blog.csdnimg.cn/direct/021542391d0f481a85b0ce6b1f48ad95.png#pic_center)
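As a sketch of that extension, the reviewer's name and star rating can be pulled with two more XPath expressions. The class names and `title` attribute below are assumptions modeled on Douban's comment markup, not verified against the live page:

```python
from lxml import etree

# hypothetical markup modeled on a Douban comment header: the reviewer's
# name sits in an <a>, the star rating in the title of a "rating" span
html = '''
<span class="comment-info">
  <a href="https://www.douban.com/people/alice/">alice</a>
  <span class="allstar50 rating" title="力荐"></span>
</span>
<span class="comment-info">
  <a href="https://www.douban.com/people/bob/">bob</a>
  <span class="allstar30 rating" title="还行"></span>
</span>
'''

tree = etree.HTML(html)
names = tree.xpath('//span[@class="comment-info"]/a/text()')
ratings = tree.xpath(
    '//span[@class="comment-info"]/span[contains(@class, "rating")]/@title')
print(list(zip(names, ratings)))
```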
Another article packages this program into a standalone executable; take a look if you're interested.