python爬虫实训第五天

最新推荐文章于 2022-07-04 10:38:02 发布

啊水水水啊

最新推荐文章于 2022-07-04 10:38:02 发布

阅读量115

点赞数

分类专栏：笔记

本文链接：https://blog.csdn.net/qq_46020525/article/details/112850155

版权

笔记专栏收录该内容

37 篇文章 1 订阅

订阅专栏

京东评论爬虫数据库版

（1）京东评论爬虫数据库版

import  json
import sqlite3
import requests
import time
def get_one(pid,pageno):
    #取一个商品的一页评论

    base_url = 'https://club.jd.com/comment/productPageComments.action'
#本次请求头只要伪造请求头User_Agent,但前端时间测试需要cookie字段
    headers = {

'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'

}
#Referer: https://item.jd.com/，反爬虫可能用到，从哪里来的
    params = {

    'productId': pid,#商品id
    'score': 0,
    'sortType': 5,
    'page': pageno,#第n页
    'pageSize': 10,
    'isShadowSku': 0,
    'rid': 0,
    'fold': 1
    }
    resp = requests.get(base_url,headers=headers,params=params)
    comments_json = resp.text
    #(comments_json)
#京东评论接口返回jsonp 涉及跨域问题，需要将jsonp转化为json
#方法一：python字符串方式删除  2、3、本例中发现删掉第一个阐述，就可以返回完美的json
    comments_obj = json.loads(comments_json)
    comments = comments_obj['comments']
    #print(comments)
    return comments
def write_comments_to_db(c,cursor):
        cid = c['id']
        content = c['content']
        creation_time = c['creationTime']
        product_color = c['productColor']
        #product_size = c['productSize']
        cursor.execute("""
        insert into contents(cid,content,product_color,creation_time) values(?,?,?,?);
        """, [cid,content,product_color,creation_time])
if __name__=='__main__':
    connect = sqlite3.connect('./jd.db')
    cursor = connect.cursor()
    cursor.execute("""
    create table if not exists contents(
    id integer primary key,
    cid integer,
    content Text,
    product_color TEXT,
    creation_time DATETIME);
    """)
    product_id = '100009077475'
    for pageno in range(1,100):
        one_page_comments = get_one(product_id,pageno)
        for c in one_page_comments:
            write_comments_to_db(c,cursor)
        connect.commit()
        print(f'第{pageno}页数据插入完成')
        time.sleep(1)
    connect.close()

主要困难及问题：
今天在昨天的基础上学习了如何通过datagrip将在京东评论区获取的数据存入数据库中，使用sqlite3包创建和数据库的连接，然后使用cursor的execute方法来实现数据库的插入操作，最后记得用commit提交。此时获取的数据需要进行分析才能分析出产品的好坏，在使用jieba分词之后，需要去除数据中的无意义的停止词，用python的in来处理。