P4 Scraping Product Reviews

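"""
JSON data comes in two main forms: objects and arrays.

① An object represents key-value pairs. Each key is a string, and each value can be a string, number, boolean, array, another object, or null.
    For example, a JSON object can be written as {"key": "value"}, where "key" is the key and "value" can be any of the data types listed above.
② An array is an ordered list of values, which can include objects or values of any other type.
    For example, it is written as [value1, value2, ...].

*** JSON data can also be nested: an object can contain another object or an array. Nesting is what lets JSON represent complex data structures flexibly.
    For example, a nested JSON object can be written as {"key": {"subkey": "value"}}.

Personal note: this is similar to Python's dict format.
"""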

import requests
import csv

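with open('C:/Users/86189/Desktop/京东评论.csv',mode='w',encoding='utf-8-sig',newline="") as f:
    csv.writer(f).writerow(["content","referenceTime","location","productColor","productSize"]) # Header row. Typing tip: Shift + the quote key produces a double quote

### How to find the URL that serves the product reviews
# ① Open DevTools ("Inspect"), click Network, and press CTRL+R to reload the page. In the search box, search for text taken from a product review (for example "很好用,拍照好"), double-click the search result, and copy the request URL of the matching entry;
# ② To scrape other fields later, click the "Preview" tab next to "Headers" to browse the structure of the response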

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Cookie':'__jdv=122270672%7Cdirect%7C-%7Cnone%7C-%7C1723344360225; mba_muid=1723344360223889980773; __jdu=1723344360223889980773; TrackID=1pbo0AydHEXqZegmwoFWyywmjd4a2izt8ogqS4ZvcYXOWap6t9c_lxIuBd3NzHtWz-r2EZoMzhYZvxI_ym9L9SbUwXfiUii33Wex7DbfPptY; thor=7AA0A3083246CBD2944A28D8BE895268E8BF56F3367134CFACE59C015C53BAC6A16E44F39DD8D193A15C1AFB682106A289C10B05376B9743E68B14D18EC2C1A602A6286A9233DD1610DD4983934F5DCF3925B4FD63C1AD7A63732F1A654E9F73870D3AA8AC5FEB7CA33C672B53FD3A2CC3D271266A709120A9704EF7B08D18E66FC3BD5E9A30A8A1134279F50EC48E73402E255796DA3E71D7AD0724969F185B; pinId=96knPOjqUXCM7BezoGw-nQ; pin=jd_owZNkKmyKwCf; unick=jd_03iqkp6agdzw23; ceshi3.com=000; _tp=qJRXnwzwrWtxb9VoxCkPrg%3D%3D; _pst=jd_owZNkKmyKwCf; shshshfpa=7262d375-eb13-1324-42eb-086d4cbb31e6-1723344429; shshshfpx=7262d375-eb13-1324-42eb-086d4cbb31e6-1723344429; areaId=1; jsavif=1; 3AB9D23F7A4B3C9B=DDJ35H5KJH5BJE4JUAXMLFRVPFKV3ICM2WC2UBXBVIMSPKBPREDMRX5E56R4DDVPDCBHZAYF7KY4ONN2UDI5TUZGYQ; token=8e63d1176095d527a6efb93feadaae51,3,957460; __jda=181111935.1723344360223889980773.1723344360.1723349471.1723427266.3; __jdc=181111935; ipLoc-djd=1-2802-54745-0; shshshfpb=BApXS76xpR_RAlH1lGVT9kPAiYRr3fYA8BmF1QwZp9xJ1MpMdpYC2; flash=3_v8yE5vrClwqkXLfJ-vyicKaVR3ymmV0BFm9COudDVHDkQzkjp_Mos2BL4YQtLVGFwQEIu-elecuUcnF2OaIW3xr-itky6Iy1eTR-al1AvzvwqF6tSZigFlKrHUP2jevGGEEWoMMRGw2bsQU_XTUO5I8T0xSMvVl7HPJps0FVt0MsQ5oyvH_L; 3AB9D23F7A4B3CSS=jdd03DDJ35H5KJH5BJE4JUAXMLFRVPFKV3ICM2WC2UBXBVIMSPKBPREDMRX5E56R4DDVPDCBHZAYF7KY4ONN2UDI5TUZGYQAAAAMRIRSJSNYAAAAAC4WUH5T757P7GIX; _gia_d=1; __jdb=181111935.5.1723344360223889980773|3.1723427266',
    'Referer':'https://item.jd.com/',
    'Origin':'https://item.jd.com'
}
# NOTE: t, h5st and the tokens in this URL were captured from a single browser request and may expire; re-copy the URL from DevTools if the request fails.
for page in range(0,10):
    # The loop variable goes into the "page" field of body. "score" is fixed at 0 (assumed here to mean reviews of all ratings).
    url = f'https://api.m.jd.com/?appid=item-v3&functionId=pc_club_productPageComments&client=pc&clientVersion=1.0.0&t=1723429329034&body=%7B%22productId%22%3A100071390082%2C%22score%22%3A0%2C%22sortType%22%3A5%2C%22page%22%3A{page}%2C%22pageSize%22%3A10%2C%22isShadowSku%22%3A0%2C%22fold%22%3A1%2C%22bbtf%22%3A%22%22%2C%22shield%22%3A%22%22%7D&h5st=20240812102209038%3B9imzn5ytiittg9i8%3Bfb5df%3Btk03wbff01d0a18nHS5zEjW9TdrbIdu6G7z3rlvGSsddiMXjzHay0GNhHeWDoS3YH3uTlW2j9w_WT0YpQsmYec0sOqjl%3B9c3a62fd014257801696cb34a980bdf9%3B4.7%3B1723429329038%3BVGsAYhfApuvEblr2td1-PAsrIQ9d1MLYDJPwIgRaQXxdimTMt3XM6tUA1D9kdd_NMKiBmI4TRgI8gMjsYOWeWj88T8ytOE0Spp1uJ0XcDvo1v1e374wqdCiDbO0_-R-luAGjg6qQEuVPLMjqDsA9USEH4td_oFwm__HZQaJPDj3hkvyHRYDKgb2m40DNdSGKNfXs3XuQxXEeNKgtW4uBCQg035ZxChaAr0RZttw5uaexMvcNfvbS9arMC_KU3J_kcweu6kclxRzI4dVvFFXLFy1KaSxTYY7I18hA00hkmsNn2qLEN_HzSuYNh_3O7Tk8t50u06OmbEoXwoWn2XulZEOPos0AlZm_16qkyDBlDooUbYiJvaL5CNWvBE7Z6U1XdMualcf2uyp02_p8XajjR55LKrTgqagr6_LpjwRWFQIua_m1O_LiXl4mu79D6qCAY4XmNs8Jz2o1PClz3N0-cB_Nn6dLfuswNjdXhOZyz4X7qzjN24KYQNTsmz5SmElT0N-zV85XOrZVb8ieCh4lG1dNjEDuzPOSE07tekiH_1SLwDUeAmwLE1bL4ru0_ea3-GOV3sYE2MYdHAcFkZVjDWM8fY_p5xIpUsrVxOLCu7nZggE7nDk8PeheJO0dl8zjLad9Prk3hGJ0DQIeqffFGvzEemLTD52YgeDqWQHLXbk3%3B089bfe45dc157f4443031cb51332d186&x-api-eid-token=jdd03DDJ35H5KJH5BJE4JUAXMLFRVPFKV3ICM2WC2UBXBVIMSPKBPREDMRX5E56R4DDVPDCBHZAYF7KY4ONN2UDI5TUZGYQAAAAMRIRSJSNYAAAAAC4WUH5T757P7GIX&loginType=3&uuid=181111935.1723344360223889980773.1723344360.1723349471.1723427266.3'
    response = requests.get(url=url, headers=headers, timeout=10)
    response.raise_for_status()  # stop on an HTTP error instead of failing later inside .json()
    json_data = response.json()
    # Extract the data using the JSON structure described in the docstring at the top
    comments = json_data['comments']
    '''
    [comments is the name of the outermost list, which holds every review]
    Response data as shown in the Preview tab:
    comments: [{id: 21061847367, guid: "013e474939507a7347bc6a9dbb3d9a31",…},…]
    0: {id: 21061847367, guid: "013e474939507a7347bc6a9dbb3d9a31",…}
    1: {id: 21081430845, guid: "e01361fba7506aad2c8331f541eb8f04",…}
    2: {id: 21151929424, guid: "d7b2c2821036a3154e6aff458f6c48e8",…}
    3: {id: 21075005942, guid: "5defd49a500f8ff39954d791897c7bed",…}
    '''

    # Open the CSV once per page instead of once per review
    with open('C:/Users/86189/Desktop/京东评论.csv', mode='a', encoding='utf-8-sig', newline="") as f:
        writer = csv.writer(f)
        for comment in comments:
            # .get() yields '' when a review lacks a field, instead of raising KeyError
            content = comment.get('content', '')
            productColor = comment.get('productColor', '')
            productSize = comment.get('productSize', '')
            referenceTime = comment.get('referenceTime', '')
            location = comment.get('location', '')
            print(content,referenceTime,location,productColor,productSize)
            writer.writerow([content,referenceTime,location,productColor,productSize])
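
The request URL in the loop above carries its paging parameters inside a percent-encoded JSON `body`, which is easy to mis-edit by hand. As a rough sketch, the same query string can be assembled from a plain dict. The helper name `build_comment_url` is hypothetical, and the sketch deliberately leaves out `t`, `h5st`, `x-api-eid-token` and `uuid`, which are captured from the browser and cannot be regenerated here; whether the server accepts a request without them is not something this sketch can confirm.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://api.m.jd.com/"


def build_comment_url(product_id, page, page_size=10, score=0):
    """Hypothetical helper: rebuild the query string from the copied URL.

    Field names and values mirror the body of the URL copied from DevTools;
    the signed/session parameters (t, h5st, x-api-eid-token, uuid) are omitted.
    """
    body = {
        "productId": product_id,
        "score": score,
        "sortType": 5,
        "page": page,
        "pageSize": page_size,
        "isShadowSku": 0,
        "fold": 1,
        "bbtf": "",
        "shield": "",
    }
    params = {
        "appid": "item-v3",
        "functionId": "pc_club_productPageComments",
        "client": "pc",
        "clientVersion": "1.0.0",
        # Compact separators reproduce the spacing of the original body
        "body": json.dumps(body, separators=(",", ":")),
    }
    return BASE_URL + "?" + urlencode(params)


url = build_comment_url(100071390082, page=3)
# Decode the URL again to check which value ended up in which field
decoded_body = json.loads(parse_qs(urlsplit(url).query)["body"][0])
print(decoded_body["page"], decoded_body["score"])
```

Decoding the URL back into a dict, as the last lines do, is a quick way to confirm that the loop variable landed in `page` and not in a neighboring field.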

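# Problem: the CSV file written by the scraper shows garbled characters when opened
# Fix: change the encoding from utf-8 to utf-8-sig

'''
How to reason about it:
① Open DevTools on the original page and look for the charset in a meta tag (usually inside <head>)
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
② The charset is utf-8, so write the file as utf-8; if that still shows garbled text, try utf-8-sig, which prepends a byte order mark (BOM) that programs such as Excel use to detect UTF-8
'''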
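
To see what the switch to utf-8-sig changes, here is a minimal sketch that runs the extraction and CSV steps on a made-up payload and compares the bytes produced by the two encodings. The sample data and the helper names `comments_to_rows` and `rows_to_csv_bytes` are assumptions for illustration; only the five field names come from the script above.

```python
import csv
import io
import json

# Made-up sample shaped like the "comments" list in the Preview tab.
# The second review has no "location" key on purpose.
SAMPLE = '''
{"comments": [
  {"id": 1, "content": "拍照很好", "referenceTime": "2024-08-01 10:00:00",
   "location": "北京", "productColor": "黑色", "productSize": "256GB"},
  {"id": 2, "content": "Battery lasts all day", "referenceTime": "2024-08-02 09:30:00",
   "productColor": "White", "productSize": "512GB"}
]}
'''

FIELDS = ["content", "referenceTime", "location", "productColor", "productSize"]


def comments_to_rows(payload):
    """Walk the nested structure: object -> "comments" array -> one object per review."""
    return [[c.get(k, "") for k in FIELDS] for c in payload.get("comments", [])]


def rows_to_csv_bytes(rows, encoding="utf-8-sig"):
    """Render header + rows as CSV text, then encode it with the given codec."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(FIELDS)
    writer.writerows(rows)
    return buffer.getvalue().encode(encoding)


payload = json.loads(SAMPLE)
rows = comments_to_rows(payload)
with_bom = rows_to_csv_bytes(rows, "utf-8-sig")
without_bom = rows_to_csv_bytes(rows, "utf-8")

print(with_bom[:3])  # b'\xef\xbb\xbf'
print(len(with_bom) - len(without_bom))  # 3
```

The two outputs differ only in the three leading BOM bytes; the review text itself is encoded identically, which is why the file reads fine in a text editor either way and only BOM-dependent programs show the difference.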


