Web scraping is actually quite simple; with a little care you can pick up the technique quickly. Below we build a scraper for the hot jokes on Qiushibaike (糗事百科) to show why a crawler is, in practice, a very simple thing.
Goals
Scrape the hot jokes from Qiushibaike.
On each press of Enter, display one joke's publication time, author, content, and upvote count.
Fetching the page source
Fetch the page source with the Requests library.
import requests
import re

head = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
TimeOut = 30

def requestpageText(url):
    try:
        Page = requests.session().get(url, headers=head, timeout=TimeOut)
        Page.encoding = "utf-8"  # the site serves UTF-8; forcing "gb2312" would garble the text
        return Page.text
    except BaseException as e:
        print("Network request failed...", e)

site = "http://www.qiushibaike.com/8hr/page/1"
text = requestpageText(site)  # fetch the page source
print(text)
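The `Page.encoding` line matters: the site serves UTF-8, and forcing a mismatched codec such as `gb2312` produces mojibake. A minimal offline sketch of that failure mode (pure Python, no network):

```python
# -*- coding: utf-8 -*-
# Decode UTF-8 bytes with the wrong codec to see the mojibake a bad
# `Page.encoding` setting would produce.
raw = "糗事百科".encode("utf-8")                  # bytes as the server sends them
wrong = raw.decode("gb2312", errors="replace")    # what a forced gb2312 yields
right = raw.decode("utf-8")                       # the intended text
print(wrong)
print(right)
```

Setting `Page.encoding` explicitly, or simply trusting the `charset` in the response headers, avoids this.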
Extracting and printing the jokes
Match the joke data with a regular expression.
# The original pattern was lost in transcription; the one below is a plausible
# reconstruction for the page markup of the time, capturing
# (author, content, funny-votes, comment-count) for each entry.
patterns = re.compile(r'<h2>(.*?)</h2>.*?<div class="content">(.*?)</div>.*?'
                      r'<i class="number">(\d+)</i>.*?<i class="number">(\d+)</i>', re.S)
items = re.findall(patterns, text)
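As an aside, `re.findall` with multiple capture groups returns a list of tuples, one per match. A toy fragment makes the shape clear (the tag and class names here are illustrative, not the site's real markup):

```python
import re

# Toy fragment standing in for one joke entry; not the site's actual HTML.
html = ('<h2>alice</h2><div class="content">hello world</div>'
        '<i class="number">12</i><i class="number">3</i>')

pattern = re.compile(r'<h2>(.*?)</h2>.*?<div class="content">(.*?)</div>.*?'
                     r'<i class="number">(\d+)</i>.*?<i class="number">(\d+)</i>', re.S)
items = re.findall(pattern, html)
print(items)  # each match is one (author, content, votes, comments) tuple
```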
Press Enter for the next item
Store the scraped items in a local list; each press of Enter advances to the next item, and when the list is exhausted, move on to the next page.
index = 0
while index < len(items):
    try:
        x = items[index]
        print("Author: {0}  Funny: {1}  Comments: {2}".format(x[0], x[2], x[3]))
        print(x[1])
        text = input("Press Enter for the next item")
        print()
        print()
    except Exception as e:
        print(e)
    index += 1
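Since the index is only used to walk the list, the same traversal reads more idiomatically as a plain `for` loop. A sketch over dummy data in the same (author, content, votes, comments) shape:

```python
# Dummy tuples in the same (author, content, votes, comments) shape.
items = [("alice", "first joke", "10", "2"), ("bob", "second joke", "7", "1")]

lines = []
for x in items:  # iterate directly instead of indexing
    lines.append("Author: {0}  Funny: {1}  Comments: {2}".format(x[0], x[2], x[3]))
    lines.append(x[1])

print("\n".join(lines))
```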
Putting it all together
Open the Qiushibaike hot page at http://www.qiushibaike.com/8hr/page/1; paging through a few times shows that the trailing number is the page index.
Right-click the page to view its source, and inspect it to locate the data we need.
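Because the page number is the only part of the URL that changes, the scraper can keep a `%d` template and fill it in per request:

```python
# The trailing number in the URL is the page index.
url_template = "http://www.qiushibaike.com/8hr/page/%d"

# Build the first three page URLs from the template.
urls = [url_template % page for page in range(1, 4)]
print(urls)
```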
import requests
import re

class qiushibaike:
    def __init__(self):
        self.head = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
        self.TimeOut = 30
        self.url = "http://www.qiushibaike.com/8hr/page/%d"
        self.page = 1

    def requestpageText(self, url):
        try:
            print("Fetching:", url)
            Page = requests.session().get(url, headers=self.head, timeout=self.TimeOut)
            Page.encoding = "utf-8"
            return Page.text
        except BaseException as e:
            print("Network request failed...", e)

    def downurl(self, page):
        url = self.url % (page)
        text = self.requestpageText(url)
        # Reconstructed pattern (the original was lost in transcription): captures
        # (author, content, funny-votes, comment-count) for each entry.
        patterns = re.compile(r'<h2>(.*?)</h2>.*?<div class="content">(.*?)</div>.*?'
                              r'<i class="number">(\d+)</i>.*?<i class="number">(\d+)</i>', re.S)
        items = re.findall(patterns, text)
        index = 0
        while index < len(items):
            try:
                x = items[index]
                print("Author: {0}  Funny: {1}  Comments: {2}".format(x[0], x[2], x[3]))
                print(x[1])
                text = input("Press Enter for the next item")
                print()
                print()
            except Exception as e:
                print(e)
            index += 1
        self.page += 1
        self.downurl(self.page)  # note: recursion; each page deepens the call stack

    def start(self):
        self.downurl(self.page)

q = qiushibaike()
text = q.start()
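One caveat about `downurl` calling itself for the next page: each page adds a stack frame, so a long session eventually hits Python's recursion limit (1000 frames by default). A plain loop avoids this; a minimal sketch with a stubbed-out fetch (the `fetch_page` stub is hypothetical, standing in for the real request):

```python
def fetch_page(page):
    # Stub standing in for the real HTTP request; returns None after
    # page 5 so the loop terminates, as an empty result page would.
    return "items for page %d" % page if page <= 5 else None

def crawl():
    pages = []
    page = 1
    while True:              # iterate instead of recursing
        text = fetch_page(page)
        if text is None:     # no more pages: stop
            break
        pages.append(text)
        page += 1
    return pages

result = crawl()
print(len(result))  # the stack stays flat no matter how many pages
```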
Sample output
C:\Users\Administrator>E:\python\learn\qiushibaike\qiushibaike.py
Fetching: http://www.qiushibaike.com/8hr/page/1
Author: 快乐二霸~武寒  Funny: 3118  Comments: 115
致我们终将逝去的青春
Press Enter for the next item