项目之爬取当当网的评论

最新推荐文章于 2024-04-26 19:04:43 发布

无毒有偶

最新推荐文章于 2024-04-26 19:04:43 发布

阅读量1.2k

点赞数

分类专栏：爬虫

本文链接：https://blog.csdn.net/weixin_43157068/article/details/104332822

版权

爬虫专栏收录该内容

15 篇文章 0 订阅

订阅专栏

本文只是用于检测正则表达式书写

还有就是淘宝太难了，爬不了

import urllib.request
import urllib.parse
import json
import jsonpath
import re
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings('ignore')

url_front = 'http://product.dangdang.com/index.php?r=comment%2Flist&productId=1204926048&categoryPath=01.41.26.21.00.00&mainProductId=1204926048&mediumId=0&pageIndex='
url_back = '&sortType=1&filterType=1&isSystem=1&tagId=0&tagFilterCount=0&template=publish&long_or_short=short'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0',

}

def handle_request(url):
    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_response(request):
    response = urllib.request.urlopen(request)
    return response

def parse_json(json_text):
    obj = json.loads(json_text)
    ret = jsonpath.jsonpath(obj, '$.data.list.html')
    return ret[0]



def main():
    start_page = int(input('请输入起始页码：'))
    end_page = int(input('请输入结束页码：'))
    url = url_front + str(start_page) + url_back
    json_text = get_response(handle_request(url=url)).read().decode('gbk')
    # print(response.read().decode('gbk'))
    # 因为是json对象，所以如果不转成json格式，就显示不了中文
    ret = parse_json(json_text)
    soup = BeautifulSoup(ret)
    commet_list = soup.select('.item_wrap > div')
    # print(commet_list[0])
    # 去掉首尾中括号
    commet_list = str(commet_list)[1:-1]
    pattern = re.compile(r'''<div class="comment_items clearfix">
.*?
<em>(.*?)</em>
.*?
<span><a href="(.*?)" target="_blank">(.*?)</a></span>
.*?
<span>(.*?)</span>
.*?
<div class="support" data-comment-id="(.*?)">
.*?
<a class="pic" href="javascript:"><img alt="(.*?)" src="(.*?)"/></a>
.*?''', re.S)
    commet_list = pattern.findall(commet_list)
    for commet in commet_list:
        print(commet)
    # print(len(commet_list))
    # print(commet_list)
    # with open('dangdang.html', 'w', encoding='gbk')as fp:
    #     fp.write(commet_list)

if __name__ == '__main__':
    main()

无毒有偶

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
项目之爬取当当网的评论

本文只是用于检测正则表达式书写还有就是淘宝太难了，爬不了import urllib.requestimport urllib.parseimport jsonimport jsonpathimport refrom bs4 import BeautifulSoupimport warningswarnings.filterwarnings('ignore')url_front...
复制链接

扫一扫