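Learned from and improved upon http://cuiqingcai.com/990.html
1. First, download the basic page content
a. The basic page-download approach fails with the following error:
http.client.RemoteDisconnected: Remote end closed connection without response
The likely cause is that no browser request headers were simulated.
b. A browser User-Agent string is required. To find it, enter about:version in the browser's address bar.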
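As a minimal sketch of step 1, assuming Python 3's urllib.request, the snippet below attaches a browser-like User-Agent to a request before downloading. The names build_request, download and EXAMPLE_USER_AGENT are illustrative and not from the original notes, and the User-Agent value is a placeholder to be replaced with the one shown by about:version.

```python
import urllib.request

# Hypothetical placeholder; replace with the value shown by about:version.
EXAMPLE_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"


def build_request(url, user_agent=EXAMPLE_USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, user_agent=EXAMPLE_USER_AGENT, timeout=10):
    """Fetch url and return the body decoded as UTF-8 (assumed encoding)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling download('https://www.qiushibaike.com/hot/page/2') would perform the actual fetch. Whether the server accepts the request still depends on the site, so this only addresses the missing-header cause suggested above.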
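2. Page parser
a. Regular expressions are used here. Note that when an element may be absent, capture the text before and after its position and then test that captured text for the element.
b. The captured text contains a lot of extra whitespace; a.strip() removes leading and trailing spaces and newlines.
c. When a post is only partially shown, locate its source page and extract the full text from there. Note that posts containing images are not displayed in this case either.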
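The parsing notes above can be sketched as follows. The sample markup, the class names in it, and the names SAMPLE_HTML, POST_PATTERN and parse_posts are invented for illustration and do not reproduce the real site's HTML. The sketch shows the "capture the region, then test it" approach for an optional image, and strip() for trimming whitespace.

```python
import re

# Invented sample markup; the real page structure may differ.
SAMPLE_HTML = """
<div class="post">
  <h2>
     alice
  </h2>
  <span>  first joke  </span>
  <div class="thumb"></div>
</div>
<div class="post">
  <h2> bob </h2>
  <span> second joke </span>
  <div class="thumb"><img src="x.jpg"></div>
</div>
"""

# re.S lets "." match newlines, so one pattern can span several lines.
POST_PATTERN = re.compile(
    r'<h2>(.*?)</h2>.*?<span>(.*?)</span>.*?<div class="thumb">(.*?)</div>',
    re.S,
)


def parse_posts(html):
    """Return (author, text) pairs for posts whose thumb area has no image."""
    posts = []
    for author, text, thumb in POST_PATTERN.findall(html):
        # The image is optional, so capture the region and test it afterwards.
        if re.search("img", thumb):
            continue
        # strip() drops the leading and trailing spaces and newlines.
        posts.append((author.strip(), text.strip()))
    return posts
```

Regular expressions are fragile against markup changes, so a pattern like this needs adjusting whenever the page layout changes.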
3. Basic code:
# -*- coding: utf-8 -*-
import re
import urllib.error
import urllib.request

__author__ = "muzp"

page = 2
url = 'https://www.qiushibaike.com/hot/page/' + str(page)
# The value holds only the UA string; the header name is the dict key.
user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/56.0.2924.87 Safari/537.36')
headers = {'User-Agent': user_agent}

try:
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode("utf-8")
    # Groups: 0 author, 1 detail-page link, 2 text, 3 rest of the
    # content block, 4 image area, 5 vote count.
    pattern = re.compile(
        '''<div class="author clearfix">.*?<h2>(.*?)</h2>'''
        '''.*?<a href="(.*?)"'''
        '''.*?<span>(.*?)</span>'''
        '''(.*?)</div>'''
        '''.*?<!-- 图片或gif -->(.*?)<div class="stats">'''
        '''.*?<i.*?number">(.*?)</i>''', re.S)
    items = re.findall(pattern, content)
    for item in items:
        haveImg = re.search("img", item[4])
        # "查看全文" ("view full text") marks a truncated post.
        havere = re.search("查看全文", item[3])
        temp = ""
        if havere:
            # Fetch the detail page to get the complete text.
            url1 = "https://www.qiushibaike.com" + item[1]
            print(url1)
            request1 = urllib.request.Request(url1, headers=headers)
            response1 = urllib.request.urlopen(request1)
            content1 = response1.read().decode("utf-8")
            pattern1 = re.compile(
                '<div class="content">(.*?)</div>(.*?)</div>', re.S)
            items1 = re.findall(pattern1, content1)
            for item1 in items1:
                haveImg1 = re.search("img", item1[1])
                if not haveImg1:
                    haveImg = None
                    temp = item1[0]
                else:
                    haveImg = True
        # Posts that contain images are skipped.
        if not haveImg:
            print("Author: " + item[0].strip())
            if not havere:
                print("Content: " + item[2].strip())
            else:
                print("Content: " + temp.strip())
            print("Likes: " + item[5].strip() + "\n")
except urllib.error.URLError as e:
    # HTTPError (a URLError subclass) also carries a status code.
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)
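
The error handling at the end of the script checks for the code and reason attributes with hasattr. An alternative sketch, assuming only the standard urllib.error classes, distinguishes the two exception types explicitly. The helper name describe_error is hypothetical and not part of the original code.

```python
import urllib.error


def describe_error(exc):
    """Summarise a urllib error as a short string."""
    # HTTPError subclasses URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return "HTTP {}: {}".format(exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return "URL error: {}".format(exc.reason)
    return "unexpected error: {!r}".format(exc)
```

It could be used as print(describe_error(e)) inside the except block, which keeps the status code and the reason on one line.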