Verified working as of November 20, 2023. If this post helps you, please like, favorite, and share~
This post compares three ways to write the same crawler: single-threaded serial requests, two-threaded serial requests, and two-threaded asynchronous requests, comparing only the running time when scraping the first 20 pages of comments.
The crawler's input is a JD.com product ID. As shown in the figure, 100057562590 is a product ID, the identifier JD uses to distinguish each product. Strictly speaking, products are distinguished at the SKU level, but multiple SKUs share the same comment section.
In other words, the relationship between a JD link, its SKUs, SKU_nums, and the comment section is:
one link contains multiple SKUs; each SKU corresponds to one SKU_num; and one link corresponds to one comment section.
So a single SKU_num is enough to locate the comment section of a link.
Single-threaded serial requests
import requests
import time

# Single-threaded serial processing: took about 9 s
if __name__ == "__main__":
    print(f"start at {time.strftime('%X')}")
    Brandname = "Pelliot"
    # sku_nums = get_SKU_num(Brandname)  # many SKU_nums: read them from a txt file; a few: write the list inline
    sku_nums = [10077210061258, 100055582110]
    for productId in sku_nums:
        for page_num in range(20):
            url = ("https://api.m.jd.com/?appid=item-v3&functionId=pc_club_productPageComments"
                   "&client=pc&clientVersion=1.0.0&t=1700360288121&loginType=3"
                   "&uuid=122270672.1987596784.1700214663.1700357553.1700358318.4"
                   "&productId={}&score=0&sortType=5&page={}&pageSize=10").format(productId, page_num)
            headers = {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
            }
            response = requests.get(url=url, headers=headers)
    print(f"end at {time.strftime('%X')}")
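The scripts in this post only time the requests and discard the responses. To actually collect the comments you would parse the JSON body of each page. A minimal sketch, assuming the API returns a top-level "comments" list whose items carry a "content" field (these field names are assumptions based on observed responses from this endpoint, not guaranteed by JD):

```python
def extract_comments(payload):
    """Pull the plain comment texts out of one page of API JSON."""
    return [item.get("content", "") for item in payload.get("comments", [])]

# Usage with a hand-made sample payload:
sample = {"comments": [{"content": "质量不错"}, {"content": "发货很快"}]}
print(extract_comments(sample))  # -> ['质量不错', '发货很快']
```

In the scripts below, you would call this on `response.json()` (or the awaited JSON in the async version) and append the results to a list or file.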
Two-threaded serial requests
import requests
import time
from multiprocessing.dummy import Pool

# Two threads, serial requests inside each thread: took about 5 s
def get_comments(productId):
    for page_num in range(20):
        url = ("https://api.m.jd.com/?appid=item-v3&functionId=pc_club_productPageComments"
               "&client=pc&clientVersion=1.0.0&t=1700360288121&loginType=3"
               "&uuid=122270672.1987596784.1700214663.1700357553.1700358318.4"
               "&productId={}&score=0&sortType=5&page={}&pageSize=10").format(productId, page_num)
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
        }
        requests.get(url=url, headers=headers)

if __name__ == "__main__":
    print(f"start at {time.strftime('%X')}")
    Brandname = "Pelliot"
    # sku_nums = get_SKU_num(Brandname)  # many product IDs: read them from a txt file; a few: write the list inline
    sku_nums = [10077210061258, 100055582110]
    pool = Pool(2)  # multiprocessing.dummy gives a thread pool, not processes
    pool.map(get_comments, sku_nums)
    print(f"end at {time.strftime('%X')}")
Two-threaded asynchronous requests
# Two threads, asynchronous coroutines inside each thread: took about 1 s
import asyncio
import sys
import time
from multiprocessing.dummy import Pool

import aiohttp

def get_comments(productId):
    async def request(url):
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
        }
        async with aiohttp.ClientSession() as session:
            async with session.get(url=url, headers=headers) as response:
                return await response.json()

    async def main():
        task_list = []  # collect the tasks in a list so gather can await them together
        for page_num in range(20):
            url = ("https://api.m.jd.com/?appid=item-v3&functionId=pc_club_productPageComments"
                   "&client=pc&clientVersion=1.0.0&t=1700360288121&loginType=3"
                   "&uuid=122270672.1987596784.1700214663.1700357553.1700358318.4"
                   "&productId={}&score=0&sortType=5&page={}&pageSize=10").format(productId, page_num)
            task_list.append(asyncio.create_task(request(url)))
        await asyncio.gather(*task_list)

    # each worker thread needs its own event loop
    new_loop = asyncio.new_event_loop()
    asyncio.set_event_loop(new_loop)
    new_loop.run_until_complete(main())
    new_loop.close()

if __name__ == "__main__":
    if sys.platform == "win32":
        # the default Proactor loop on Windows can raise noisy errors at aiohttp shutdown
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    print(f"start at {time.strftime('%X')}")
    Brandname = "Pelliot"
    # sku_nums = get_SKU_num(Brandname)  # many product IDs: read them from a txt file; a few: write the list inline
    sku_nums = [10077210061258, 100055582110]
    pool = Pool(2)
    pool.map(get_comments, sku_nums)
    print(f"end at {time.strftime('%X')}")
Asynchronous programming does deliver a real speedup, but firing a large number of requests at JD's servers in a short window risks getting your IP banned. The author cannot afford reliable proxy IPs, so the request count here is deliberately kept small to avoid a ban. When you use this code, consider adding a short time.sleep pause so that not too many requests go out in a short time, or rotate through proxy IPs.
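The pause suggested above can be wrapped in a small helper instead of sprinkling bare time.sleep calls around. The Throttle class below is a hypothetical sketch (not from the original scripts) that enforces a minimum gap between consecutive requests; call wait() right before each requests.get:

```python
import time

class Throttle:
    """Enforce at least min_interval seconds between consecutive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = 0.0  # monotonic timestamp of the previous call

    def wait(self):
        # sleep only for whatever part of the interval has not already elapsed
        remaining = self.min_interval - (time.monotonic() - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()
```

For proxies, requests accepts a standard `proxies=` mapping on each call, e.g. `requests.get(url, headers=headers, proxies={"https": "http://host:port"})`, where the host and port are whatever your proxy provider gives you.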