JD Comments Crawler
(1) Crawler that requests a weather API
# The backend accepts requests and responds with raw data; this is called an API. The frontend JS is responsible for sending the requests.
# APIs also come in legitimate and illegitimate kinds (backend programs that serve someone else's data).
import requests
import json
url = 'http://t.weather.itboy.net/api/weather/city/101030100'
resp = requests.get(url)
weather_json_str = resp.text
#print(resp.status_code)
weather_obj = json.loads(weather_json_str)
#print(weather_obj)
weather_data = weather_obj['data']
day_weather_list = weather_data['forecast']
#print(day_weather_list)
for day in day_weather_list:
    #print(day)
    date = day["date"]
    high = day["high"]
    low = day["low"]
    weather_type = day["type"]  # renamed so it does not shadow the built-in type()
    print(f'Today {date}: {weather_type}, {high}, {low}')
(2) JD comment API request crawler
import json
import requests
base_url = 'https://club.jd.com/comment/productPageComments.action'
# For this request only the User-Agent header needs to be forged; in tests a while back, a cookie field was also required.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
# Referer: https://item.jd.com/ — may be needed against anti-crawling checks; it tells the server where the request came from.
params = {
    'productId': 100009077475,  # product id
    'score': 0,
    'sortType': 5,
    'page': 1,  # page number
    'pageSize': 10,
    'isShadowSku': 0,
    'rid': 0,
    'fold': 1
}
#for i in range(1,21):
#params['page'] = i
resp = requests.get(base_url,headers=headers,params=params)
comments_json = resp.text
print(comments_json)
# The JD comment API can return jsonp (used to work around cross-origin restrictions), which must be converted to JSON.
# Method 1: strip the wrapper with Python string operations; in this example I found that dropping the first (callback) parameter from the request already returns clean JSON.
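As a sketch of method 1, the wrapper can also be stripped in code before `json.loads`. The callback name `fetchJSON_comment98` below is an assumption used only in the fake sample; the regex accepts any callback name:

```python
import json
import re

def jsonp_to_json(jsonp_text: str):
    """Strip a jsonp wrapper like callbackName({...}); and parse the JSON inside."""
    match = re.search(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', jsonp_text, re.S)
    if match is None:
        # No wrapper found: assume the text is already plain JSON.
        return json.loads(jsonp_text)
    return json.loads(match.group(1))

# Example with a made-up jsonp payload (the callback name is hypothetical):
sample = 'fetchJSON_comment98({"comments": [{"id": 1, "content": "good"}]});'
obj = jsonp_to_json(sample)
print(obj['comments'][0]['content'])
```

If the server returns plain JSON (as in this example, after dropping the callback parameter), the helper falls through to a direct `json.loads`.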
comments_obj = json.loads(comments_json)
comments = comments_obj['comments']
for c in comments:
    cid = c['id']
    comment = c['content']
    creation_time = c['creationTime']
    images = c['images']
    product_color = c['productColor']
    product_size = c['productSize']
    print(cid, comment)
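The commented-out `for i in range(1,21)` loop above hints at pagination. A sketch that makes the page fetcher injectable (so it can be exercised without the network) and pauses between pages — the page count and delay values are assumptions:

```python
import time

def crawl_comment_pages(fetch_page, max_pages=20, delay=1.0):
    """Collect comments from pages 1..max_pages, stopping early at an empty page.

    fetch_page(page) should return the parsed JSON object for that page,
    e.g. a wrapper around requests.get that sets params['page'] = page.
    """
    all_comments = []
    for page in range(1, max_pages + 1):
        obj = fetch_page(page)
        comments = obj.get('comments', [])
        if not comments:
            break  # no more comments, stop early
        all_comments.extend(comments)
        time.sleep(delay)  # be polite to the server between pages
    return all_comments

# Usage with a stub fetcher (real use would call requests.get as above):
def fake_fetch(page):
    if page <= 3:
        return {'comments': [{'id': page, 'content': f'comment {page}'}]}
    return {'comments': []}

print(len(crawl_comment_pages(fake_fetch, max_pages=5, delay=0)))  # 3
```

Keeping the fetcher as a parameter also makes it easy to swap in retries or proxies later without touching the loop.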
Difficulties encountered:
Today I studied the JD comment crawler. While scraping I found that the JD page is a dynamic page: if you request the page URL directly, the response does not contain the content you want, so nothing matches. In that case you have to find the corresponding API. In the developer tools' Network tab, first clear all requests, then trigger the data you want on the page and refresh, which captures the requests the page sends; the API address can then be read from the Request URL in that request's headers.
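The first check described above — "the page URL's response does not contain the content you want" — can be done in code before writing any extraction logic: search the raw HTML for a snippet of the target data. The HTML below is a made-up example, not JD's actual page:

```python
def is_dynamic(static_html: str, expected_text: str) -> bool:
    """Return True if the target text is missing from the static HTML,
    i.e. the data is probably rendered by JS from an API."""
    return expected_text not in static_html

# Made-up static HTML: the comment list is filled in by JS, so it is absent here.
static_html = '<html><body><div id="comment"></div><script src="app.js"></script></body></html>'
print(is_dynamic(static_html, 'good phone'))  # True -> look for the API in the Network tab
```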