Qiushibaike crawler for the 2017-10-01 version of the Qiushibaike site (Python 3.x)

mu7zp

Article link: https://blog.csdn.net/mu7zp/article/details/78471227

Learned from and improved upon http://cuiqingcai.com/990.html

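1. First, download the page

a. A basic page download fails with the following error:

http.client.RemoteDisconnected: Remote end closed connection without response

This is probably because the request does not simulate a browser header.

b. The fix needs a browser User-Agent string. To find yours, enter about:version in the browser's address bar.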
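As a minimal sketch of step 1, the snippet below attaches a browser-like User-Agent header to the request before fetching the page. The helper names `build_request` and `fetch` are illustrative assumptions, and the UA string is only an example value; substitute the one reported by your own browser.

```python
import urllib.request

# Example value only (an assumption); copy the real string from your browser.
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/56.0.2924.87 Safari/537.36")


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    # The value is only the UA string; the header name is the dict key.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download a page and decode it as UTF-8 (performs a network call)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Building the request does not touch the network.
    request = build_request("https://www.qiushibaike.com/hot/page/1")
    print(request.get_header("User-agent"))
```

Note that the header value must not repeat the `User-Agent:` prefix; the name is supplied separately as the dictionary key.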

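2. Page parser

a. Regular expressions are used here. Note that when an element may be absent (for example an image), capture the text between the markup before and after it, then test the captured text for that element.

b. The captured text contains a lot of whitespace; a.strip() removes the leading and trailing spaces and newlines.

c. When only part of a post is shown, go to the post's own page and extract the text from there. Note that posts containing images are still not displayed in this case.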
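The points in step 2 can be sketched on a small, made-up HTML fragment. `SAMPLE` only imitates the shape of a list page and is not the site's real markup; `PATTERN` and `parse_posts` are likewise hypothetical names. The idea is the same: capture the span where an image may or may not appear, test it afterwards, and strip whitespace from every captured field.

```python
import re

# Hypothetical fragment that only imitates the list-page markup.
SAMPLE = '''
<div class="author clearfix">
<h2>
  alice
</h2>
</div>
<div class="content"><span>
  first post
</span></div>
<div class="thumb"><img src="x.jpg"></div>
<div class="stats"><i class="number">12</i></div>
<div class="author clearfix">
<h2>
  bob
</h2>
</div>
<div class="content"><span>
  second post
</span></div>
<div class="stats"><i class="number">7</i></div>
'''

# re.S lets "." match newlines, so one pattern can span a whole post.
PATTERN = re.compile(
    r'<h2>(.*?)</h2>'                     # author
    r'.*?<span>(.*?)</span>'              # post text
    r'(.*?)<div class="stats">'           # span that may hold an image
    r'.*?<i class="number">(.*?)</i>',    # like count
    re.S)


def parse_posts(html):
    """Return one dict per post found in html."""
    posts = []
    for author, text, optional_part, likes in PATTERN.findall(html):
        posts.append({
            "author": author.strip(),      # strip() drops spaces and newlines
            "text": text.strip(),
            # The image is optional, so test the captured span afterwards.
            "has_image": "<img" in optional_part,
            "likes": int(likes.strip()),
        })
    return posts


if __name__ == "__main__":
    for post in parse_posts(SAMPLE):
        if not post["has_image"]:
            print(post["author"], post["text"], post["likes"])
```

Capturing the optional span and testing it separately keeps a single pattern usable for posts with and without images.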

3. Basic code:

# -*- coding: utf-8 -*-

import re
import urllib.error
import urllib.request

__author__ = "muzp"

page = 2
url = 'https://www.qiushibaike.com/hot/page/' + str(page)
# The value must be the UA string only; the header name is the dict key below.
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
headers = {'User-Agent': user_agent}


try:

    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode("utf-8")

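    # Capture groups: 0 author, 1 link to the post page, 2 post text,
    # 3 remainder of the content block (may contain "查看全文", i.e. "view full text"),
    # 4 image/gif area, 5 like count
    pattern = re.compile('''<div class="author clearfix">.*?<h2>(.*?)</h2>'''+
                         '''.*?<a href="(.*?)"''' +
                         '''.*?<span>(.*?)</span>'''+
                         '''(.*?)</div>'''+
                         '''.*?<!-- 图片或gif -->(.*?)<div class="stats">''' +
                         '''.*?<i.*?number">(.*?)</i>''', re.S)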

    items = re.findall(pattern, content)

    for item in items:
        haveImg = re.search("img", item[4])
        # "查看全文" ("view full text") marks a post truncated on the list page
        havere = re.search("查看全文", item[3])
        temp = ""
        if havere:
            # Fetch the post's own page to get the full text
            url1 = "https://www.qiushibaike.com" + item[1]
            print(url1)
            request1 = urllib.request.Request(url1, headers=headers)
            response1 = urllib.request.urlopen(request1)
            content1 = response1.read().decode("utf-8")
            pattern1 = re.compile('<div class="content">(.*?)</div>(.*?)</div>', re.S)
            items1 = re.findall(pattern1, content1)
            for item1 in items1:
                haveImg1 = re.search("img", item1[1])
                if not haveImg1:
                    haveImg = None
                    temp = item1[0]
                else:
                    haveImg = True

        # Only posts without images are printed
        if not haveImg:
            print("Author: " + item[0].strip())
            if not havere:
                print("Content: " + item[2].strip())
            else:
                print("Content: " + temp.strip())
            print("Likes: " + item[5].strip() + "\n")
except urllib.error.URLError as e:
    # HTTPError carries both code and reason; a plain URLError only has reason
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)
