Python Web Scraping Primer, with a Qiushibaike Example

xieyu_pei

Article link: https://blog.csdn.net/xieyu_pei/article/details/79777183

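This article assumes you already have some programming background and are familiar with object-oriented programming.

Written two weeks after I started learning web scraping.

Python is a popular choice for writing crawlers because it has many scraping-related libraries, such as urllib (urllib2 in Python 2). Let's get started.

Below is a first example that fetches the Baidu home page. It shows the basic structure of a crawler; simple crawlers are essentially built on this pattern: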

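import urllib.request  # standard-library HTTP client

# Address of the page to fetch
url = 'http://www.baidu.com'
# Build the request object
request = urllib.request.Request(url)
# Send the request and receive the server's response
response = urllib.request.urlopen(request)
# Read the response body as bytes
data = response.read()
# Decode the bytes as UTF-8 text
data = data.decode('utf-8')
# Print the result
print(data)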

The output is a long dump of the page's HTML source.
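Note that `response.read()` returns raw bytes, and the example assumes the page is encoded as UTF-8. As a minimal sketch (the helper name `decode_body` and its fallback order are assumptions for illustration, not part of the original example), decoding could prefer the charset the server declares and only then fall back to UTF-8:

```python
def decode_body(raw, declared_charset=None):
    """Decode response bytes, preferring the server-declared charset.

    Hypothetical helper: in a live request the charset could come from
    response.headers.get_content_charset().
    """
    for charset in (declared_charset, 'utf-8'):
        if not charset:
            continue
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next one
    # Last resort: keep going, replacing undecodable bytes with U+FFFD.
    return raw.decode('utf-8', errors='replace')


# Simulated GBK-encoded body, so no network access is needed.
sample = '百度一下'.encode('gbk')
print(decode_body(sample, 'gbk'))
```

With a real response, the call might look like `decode_body(response.read(), response.headers.get_content_charset())`.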

Next we try to fetch Qiushibaike the same way and run into a problem:

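import urllib.request

url = 'http://www.qiushibaike.com'
try:
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    data = response.read().decode('utf-8')
except (OSError, UnicodeDecodeError) as e:
    # urllib's URLError and HTTPError are both subclasses of OSError
    print('Request failed:', e)
else:
    # Only print when the request succeeded, so data is always defined here
    print(data)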

The request fails.

This is because our request is missing a request header: it does not send a browser-like User-Agent.
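To confirm a diagnosis like this, it helps to look at the exception itself rather than printing a fixed message. The following is a sketch only: the helper name `describe_error` is hypothetical, and the 403 status used in the simulated failure is an assumption, since the original post does not say which error the site returned.

```python
import io
import urllib.error


def describe_error(exc):
    """Turn a urllib exception into a short, readable message."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return 'HTTP error {}: {}'.format(exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return 'Could not reach server: {}'.format(exc.reason)
    return 'Unexpected error: {!r}'.format(exc)


# Simulated failures, constructed by hand so that no network access is needed.
# The 403 status is an assumed example, not taken from the original post.
rejected = urllib.error.HTTPError(
    'http://www.qiushibaike.com', 403, 'Forbidden', {}, io.BytesIO())
unreachable = urllib.error.URLError('connection refused')

print(describe_error(rejected))
print(describe_error(unreachable))
```

In a real crawler, `describe_error(e)` would be called inside an `except urllib.error.URLError as e:` block around `urlopen`.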

Adjust the code as follows:

import urllib.request

url = 'http://www.qiushibaike.com'
# Identify ourselves as a browser through the User-Agent request header
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
# Attach the headers to the request before sending it
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)

data = response.read()
data = data.decode('utf-8')
print(data)

This now prints output similar to the first example.
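One way to see why the header makes a difference is to compare what urllib sends by default with what the adjusted code sends. The sketch below only builds objects and inspects them, so it does not contact the server; that the site filters on this particular default value is an assumption.

```python
import urllib.request

url = 'http://www.qiushibaike.com'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

# Building a Request does not contact the server, so this runs offline.
request = urllib.request.Request(url, headers={'User-Agent': user_agent})

# urllib stores header names capitalized, so the key becomes 'User-agent'.
print(request.get_header('User-agent'))

# Without an explicit header, the default opener announces itself as Python.
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers['User-agent'])  # a string starting with "Python-urllib/"
```

A server that checks the User-Agent can tell the default value apart from a browser easily, which is consistent with the failure seen earlier.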

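The output above is a wall of raw markup that is a headache to read. It would be much nicer to show only the parts we want, and for that we need regular expressions.

Looking at the scraped data (or pressing F12 in the browser), we can see that the jokes we want on Qiushibaike are all wrapped in <span>...</span> tags.

So we add the following to the code: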

re.findall(r'<span>(.*?)</span>', content, re.S)
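Two details of this pattern matter: the non-greedy `.*?` and the `re.S` flag. The sketch below shows their effect on a small hypothetical HTML fragment (the fragment is made up for illustration and is not real Qiushibaike markup):

```python
import re

# Hypothetical HTML fragment standing in for a scraped page.
html = '<div><span>first\njoke</span></div><div><span>second joke</span></div>'

# Non-greedy with re.S: each <span> is captured separately, newlines included.
non_greedy = re.findall(r'<span>(.*?)</span>', html, re.S)
# Greedy with re.S: one match running from the first <span> to the last </span>.
greedy = re.findall(r'<span>(.*)</span>', html, re.S)
# Non-greedy without re.S: "." stops at the newline, so the first joke is missed.
no_dotall = re.findall(r'<span>(.*?)</span>', html)

print(non_greedy)  # ['first\njoke', 'second joke']
print(len(greedy))  # 1
print(no_dotall)  # ['second joke']
```

Regular expressions are a quick way to get started, but they match any `<span>` on the page, which is why the results later need filtering.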

To get more jokes, we add a loop over pages and write the results to a local file. The full code is as follows:

import re
import urllib.request

targetpath = '/home/xieyupei/1.txt'


def writ(text):
    # Append one joke per line; the with block closes the file automatically
    try:
        with open(targetpath, 'a', encoding='utf-8') as f:
            f.write(text + "\n")
    except OSError:
        print("Failed to save")


def req(page):
    url = 'http://www.qiushibaike.com/hot/page/' + str(page)
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    try:
        request = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(request)
        content = response.read().decode('utf-8')
    except (OSError, UnicodeDecodeError):
        # urllib's URLError and HTTPError are both subclasses of OSError
        print('Request got no response')
        return
    # re.S lets "." match newlines so a match can span lines; set() drops duplicates
    for joke in set(re.findall(r'<span>(.*?)</span>', content, re.S)):
        print(joke)
        writ(joke)


# Fetch pages 1 through 9
for page in range(1, 10):
    req(page)

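The result is still fairly rough and contains quite a lot that we do not want. It needs further refinement, but we will stop here for now.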
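As one possible direction for that refinement, each match could be passed through a small cleaning step before it is printed or saved. This is a sketch under assumptions: the helper name `clean_joke` is hypothetical, and the filtering rules (dropping leftover tags, empty matches, and bare numbers) are illustrative guesses about what the unwanted matches look like.

```python
import html
import re


def clean_joke(fragment):
    """Return readable text from a matched <span> body, or None to skip it.

    Hypothetical filter: the rules below are illustrative guesses.
    """
    text = re.sub(r'<br\s*/?>', '\n', fragment)  # line breaks become newlines
    text = re.sub(r'<[^>]+>', '', text)          # drop any remaining tags
    text = html.unescape(text).strip()           # e.g. &quot; becomes "
    # Skip empty matches and bare numbers such as counters.
    if not text or text.isdigit():
        return None
    return text


print(clean_joke('\n\nHe said &quot;hi&quot;<br/>and left\n\n'))
print(clean_joke('128'))
```

In the loop over matches, a joke would then be written only when `clean_joke(joke)` returns something other than `None`.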

April 19, 2018

Qiushibaike has been redesigned and its request URLs were changed across the board, so the URL line in our code becomes:

url = 'http://www.qiushibaike.com/hot/page/' + str(page)+'/'

Only this one line was tweaked; nothing else needs to change.
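Since the site can change its URL format again, it may be worth keeping URL construction in one place. A minimal sketch, assuming the trailing-slash format described above (the names `BASE_URL` and `page_url` are hypothetical):

```python
BASE_URL = 'http://www.qiushibaike.com/hot/page/'


def page_url(page):
    """Build the URL for one page of the hot list, with the trailing slash."""
    return '{}{}/'.format(BASE_URL, page)


print([page_url(p) for p in range(1, 4)])
```

The request function could then call `page_url(page)` instead of concatenating strings inline.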
