Scraping Qiushibaike

～张贵轩

Last modified 2024-03-13 10:27:34

First published 2018-10-21 11:37:32

Article link: https://blog.csdn.net/weixin_43422232/article/details/83240211

# coding=utf-8
import re
import urllib.error
import urllib.request

# Send a browser-like User-Agent; the same headers are reused for every page.
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}

# Compile the patterns once. re.S lets '.' match newlines inside a tag.
author_pattern = re.compile(r'<h2>(.*?)</h2>', re.S)          # poster name
content_pattern = re.compile(r'<span>(.*?)</span>', re.S)     # post text
likes_pattern = re.compile(r'<i class=.*?>(.*?)</i>', re.S)   # like count

# 'with' closes the output file even if an unexpected error is raised.
with open('D:/python文件/张贵轩-任务2.txt', 'a', encoding='utf-8') as f:
    try:
        for page in range(1, 11):
            url = 'https://www.qiushibaike.com/8hr/page/' + str(page) + '/'
            request = urllib.request.Request(url, headers=headers)
            response = urllib.request.urlopen(request)
            content = response.read().decode('utf-8')

            authors = author_pattern.findall(content)
            posts = content_pattern.findall(content)
            likes = likes_pattern.findall(content)

            # Pair the three result lists by position. zip() stops at the
            # shortest list, so a page with fewer than 25 posts does not
            # raise IndexError.
            for author, post, like in zip(authors, posts, likes):
                author = author.replace('<br/>', '').strip()
                post = post.replace('<br/>', '').strip()
                f.write('Posted by: ' + author + '\n')
                f.write('Content: ' + post + '\n')
                f.write('Likes: ' + like + '\n\n')
                f.write('==================================================\n')
    except urllib.error.URLError as e:
        # HTTPError (a subclass of URLError) also carries a status code.
        if hasattr(e, 'code'):
            print(e.code)
        if hasattr(e, 'reason'):
            print(e.reason)
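
The extraction step can be tried without touching the network. The sketch below applies the same three patterns to a small HTML string. The sample markup, the `clean` helper and the `extract_posts` function are made up for illustration; the real page layout may differ, and positional pairing only works if the three patterns match exactly once per post.

```python
import re

# Same patterns as the scraper; re.S lets '.' match newlines inside a tag.
AUTHOR_RE = re.compile(r'<h2>(.*?)</h2>', re.S)
CONTENT_RE = re.compile(r'<span>(.*?)</span>', re.S)
LIKES_RE = re.compile(r'<i class=.*?>(.*?)</i>', re.S)

# Hypothetical sample markup, invented for this example only.
SAMPLE_HTML = """
<div class="article">
<h2>
alice
</h2>
<span>
a short joke<br/>
</span>
<i class="number">42</i>
</div>
<div class="article">
<h2>bob</h2>
<span>only line</span>
<i class="number">7</i>
</div>
"""


def clean(fragment):
    """Drop <br/> tags and surrounding whitespace from a matched fragment."""
    return fragment.replace('<br/>', '').strip()


def extract_posts(html):
    """Return a list of (author, content, likes) tuples paired by position."""
    authors = AUTHOR_RE.findall(html)
    contents = CONTENT_RE.findall(html)
    likes = LIKES_RE.findall(html)
    # zip() stops at the shortest list, so uneven match counts cannot
    # cause an IndexError (they can still misalign the columns).
    return [(clean(a), clean(c), clean(n))
            for a, c, n in zip(authors, contents, likes)]


posts = extract_posts(SAMPLE_HTML)
for author, content, like_count in posts:
    print(author, '|', content, '|', like_count)
```

The non-greedy `(.*?)` stops at the first closing tag, which keeps each match inside a single element.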

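Study summary:
(1) Learned the methods of the re and urllib.request libraries and used them to scrape web page content;
(2) Practised writing regular expressions;
(3) Setting user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' and headers = {'User-Agent': user_agent} makes the Python script look like a browser, so it can reach sites that restrict non-browser clients;
(4) Used file operations: opening a file, choosing the open mode, and saving;
(5) Used try/except for error handling;
(6) Other operations, such as removing specified content from text.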
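
Points (3) and (5) can be separated into small helpers, shown here as a minimal sketch. The names `build_request`, `describe_error` and `fetch` are hypothetical and not part of the original script; only `fetch` touches the network, and it is not called here.

```python
import urllib.error
import urllib.request

USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'


def build_request(url):
    """Return a Request carrying a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def describe_error(error):
    """Summarise a URLError or HTTPError as one line of text."""
    # HTTPError is a subclass of URLError, so it is checked first.
    if isinstance(error, urllib.error.HTTPError):
        return 'HTTP error {}: {}'.format(error.code, error.reason)
    return 'connection problem: {}'.format(error.reason)


def fetch(url):
    """Download a page as text, or return None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url)) as response:
            return response.read().decode('utf-8')
    except urllib.error.URLError as error:
        print(describe_error(error))
        return None


# Building the request does not contact the server.
request = build_request('https://www.qiushibaike.com/8hr/page/1/')
print(request.get_header('User-agent'))
```

urllib stores header names in capitalized form, which is why the lookup uses 'User-agent' rather than 'User-Agent'.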