JD Comments Crawler
(1) Crawler that requests a weather API
# The backend accepts requests and responds with raw data; this is called an API. The frontend JS is responsible for sending the requests.
# APIs also come in legitimate and illegitimate kinds (backend programs that serve someone else's data).
import requests
import json
url = 'http://t.weather.itboy.net/api/weather/city/101030100'
resp = requests.get(url)
weather_json_str = resp.text
#print(resp.status_code)
weather_obj = json.loads(weather_json_str)
#print(weather_obj)
weather_data = weather_obj['data']
day_weather_list = weather_data['forecast']
#print(day_weather_list)
for day in day_weather_list:
    #print(day)
    date = day["date"]
    high = day["high"]
    low = day["low"]
    weather_type = day["type"]  # renamed so it does not shadow the built-in type()
    print(f'Today {date}: {weather_type}, {high}, {low}')
(2) JD comment API request crawler
import json
import requests
base_url = 'https://club.jd.com/comment/productPageComments.action'
# For this request only the User-Agent header needs to be forged; in tests a while back, a cookie field was also required.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
# Referer: https://item.jd.com/ — may be needed against anti-crawling checks; it tells the server where the request came from.
params = {
    'productId': 100009077475,  # product id
    'score': 0,
    'sortType': 5,
    'page': 1,  # page number
    'pageSize': 10,
    'isShadowSku': 0,
    'rid': 0,
    'fold': 1
}
#for i in range(1,21):
#params['page'] = i
resp = requests.get(base_url,headers=headers,params=params)
comments_json = resp.text
print(comments_json)
# The JD comment API can return jsonp (used to work around cross-origin restrictions), which must be converted to JSON.
# Method 1: strip the wrapper with Python string operations; in this example I found that dropping the first (callback) parameter from the request already returns clean JSON.
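As a sketch of method 1, the wrapper can also be stripped in code before `json.loads`. The callback name `fetchJSON_comment98` below is an assumption used only in the fake sample; the regex accepts any callback name:

```python
import json
import re

def jsonp_to_json(jsonp_text: str):
    """Strip a jsonp wrapper like callbackName({...}); and parse the JSON inside."""
    match = re.search(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', jsonp_text, re.S)
    if match is None:
        # No wrapper found: assume the text is already plain JSON.
        return json.loads(jsonp_text)
    return json.loads(match.group(1))

# Example with a made-up jsonp payload (the callback name is hypothetical):
sample = 'fetchJSON_comment98({"comments": [{"id": 1, "content": "good"}]});'
obj = jsonp_to_json(sample)
print(obj['comments'][0]['content'])
```

If the server returns plain JSON (as in this example, after dropping the callback parameter), the helper falls through to a direct `json.loads`.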
comments_obj = json.loads(comments_json)
comments = comments_obj['comments']
for c in comments:
    cid = c['id']
    comment = c['content']
    creation_time = c['creationTime']
    images = c['images']
    product_color = c['productColor']
    product_size = c['productSize']
    print(cid, comment)
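The commented-out `for i in range(1,21)` loop above hints at pagination. A sketch that makes the page fetcher injectable (so it can be exercised without the network) and pauses between pages — the page count and delay values are assumptions:

```python
import time

def crawl_comment_pages(fetch_page, max_pages=20, delay=1.0):
    """Collect comments from pages 1..max_pages, stopping early at an empty page.

    fetch_page(page) should return the parsed JSON object for that page,
    e.g. a wrapper around requests.get that sets params['page'] = page.
    """
    all_comments = []
    for page in range(1, max_pages + 1):
        obj = fetch_page(page)
        comments = obj.get('comments', [])
        if not comments:
            break  # no more comments, stop early
        all_comments.extend(comments)
        time.sleep(delay)  # be polite to the server between pages
    return all_comments

# Usage with a stub fetcher (real use would call requests.get as above):
def fake_fetch(page):
    if page <= 3:
        return {'comments': [{'id': page, 'content': f'comment {page}'}]}
    return {'comments': []}

print(len(crawl_comment_pages(fake_fetch, max_pages=5, delay=0)))  # 3
```

Keeping the fetcher as a parameter also makes it easy to swap in retries or proxies later without touching the loop.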
Difficulties encountered:
Today I studied the JD comment crawler. While scraping I found that the JD page is a dynamic page: if you request the page URL directly, the response does not contain the content you want, so nothing matches. In that case you have to find the corresponding API. In the developer tools' Network tab, first clear all requests, then trigger the data you want on the page and refresh, which captures the requests the page sends; the API address can then be read from the Request URL in that request's headers.
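The first check described above — "the page URL's response does not contain the content you want" — can be done in code before writing any extraction logic: search the raw HTML for a snippet of the target data. The HTML below is a made-up example, not JD's actual page:

```python
def is_dynamic(static_html: str, expected_text: str) -> bool:
    """Return True if the target text is missing from the static HTML,
    i.e. the data is probably rendered by JS from an API."""
    return expected_text not in static_html

# Made-up static HTML: the comment list is filled in by JS, so it is absent here.
static_html = '<html><body><div id="comment"></div><script src="app.js"></script></body></html>'
print(is_dynamic(static_html, 'good phone'))  # True -> look for the API in the Network tab
```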