Web Scraping in Python with BeautifulSoup, and Using the pyspider Framework

chengguixian0057

Original link: https://my.oschina.net/u/3294842/blog/856831

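2017.3.11 I first tried web scraping a long time ago and have used BeautifulSoup before, but every time I write a scraper I come away with something new.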

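BeautifulSoup's CSS selectors can match by tag, class, id, and so on. When I write a scraper with bs4 now, I first use a CSS selector to locate the block that contains the data I want, and then use find_all or select, depending on the situation, to pull out the details. Enough talk, here is the code: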

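# coding: utf8
# Python 2 code (it uses urllib2 and the print statement).
# Target page: http://www.qiushibaike.com/8hr/page/2/
# div id="content-left": the outer container
# div class="article block untagged mb15": one child container per joke
# Plain urllib could not fetch the page, so urllib2.Request is used
# to send a forged User-Agent header.

from bs4 import BeautifulSoup
import urllib2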

'''
url = "http://www.qiushibaike.com"
Get a BeautifulSoup instance
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36'}
req = urllib2.Request(url, headers=headers)
html = BeautifulSoup(urllib2.urlopen(req).read())
Find the div container on the page by its id
divlist = html.select('#content-left')
#print len(divlist)  # printing shows that this list has exactly one element
Take the element in divlist and keep only the div containers we want
tags = divlist[0].select("div[class='article block untagged mb15']")
# Iterate over tags to get each tag
for tag in tags:
    print tag.select("div[class='author clearfix']")[0].select("h2")[0].text.encode("utf8")  # author of the joke
    print tag.select(".contentHerf")[0].select("span")[0].text.encode("utf8")  # text of the joke
    #print tag
# Find the "next page" span and go up to its parent tag, which holds the link to the next page.
newurl = divlist[0].select(".pagination")[0].select("span.next")[0].parent["href"]
'''
'''End of the analysis above'''
# The actual function starts here
count = 1
url = "http://www.qiushibaike.com"

def getDuanzi(url, count):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36'}
    req = urllib2.Request(url, headers=headers)
    # Name the parser explicitly so bs4 does not have to guess one
    html = BeautifulSoup(urllib2.urlopen(req).read(), "html.parser")
    divlist = html.select('#content-left')
    tags = divlist[0].select("div[class='article block untagged mb15']")
    try:
        # Open a new text file for this page; "with" closes it even on errors
        with open("e:\\out\\%d.txt" % count, "w") as outfile:
            for tag in tags:
                user = tag.select("div[class='author clearfix']")[0].select("h2")[0].text.encode("utf8")  # author of the joke
                print user
                content = tag.select(".contentHerf")[0].select("span")[0].text.encode("utf8")  # text of the joke
                outfile.write("%s\r\n%s\r\n\r\n" % (user, content))
        count += 1
        # The parent of the "next" span holds the link to the next page
        newurl = divlist[0].select(".pagination")[0].select("span.next")[0].parent["href"]
        url = "http://www.qiushibaike.com" + newurl
        getDuanzi(url, count)
    except Exception:
        # Any failure here (for example, no "next" span on a page) silently ends the crawl
        pass

getDuanzi(url, count)
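
The "select the block first, then narrow down" approach can also be tried offline. Below is a minimal Python 3 sketch, not the original script: the selectors and class names come from the code above, while `PAGE_1`, `PAGE_2`, `FAKE_SITE`, the `example.test` URLs, and the helper names `fetch`, `parse_page`, `crawl`, and `max_pages` are all made up for illustration. It also replaces the recursive call with a loop, on the assumption that a long run of pages could otherwise approach Python's recursion limit.

```python
from urllib.parse import urljoin
import urllib.request

from bs4 import BeautifulSoup

# Hypothetical pages that mimic the structure described in the comments above.
PAGE_1 = """
<div id="content-left">
  <div class="article block untagged mb15">
    <div class="author clearfix"><h2>alice</h2></div>
    <a class="contentHerf" href="/article/1"><span>first joke</span></a>
  </div>
  <ul class="pagination"><li><a href="/8hr/page/2/"><span class="next">Next</span></a></li></ul>
</div>
"""
PAGE_2 = """
<div id="content-left">
  <div class="article block untagged mb15">
    <div class="author clearfix"><h2>bob</h2></div>
    <a class="contentHerf" href="/article/2"><span>second joke</span></a>
  </div>
</div>
"""
FAKE_SITE = {
    "http://example.test/": PAGE_1,
    "http://example.test/8hr/page/2/": PAGE_2,
}


def fetch(url):
    """Real download with a browser-like User-Agent (not called in the demo below)."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()


def parse_page(html):
    """Return ([(author, text), ...], next_href or None) for one listing page."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one("#content-left")
    posts = []
    # select() takes a CSS selector; chained class names must all be present, in any order.
    for block in container.select("div.article.block.untagged.mb15"):
        author = block.select_one("div.author h2").get_text(strip=True)
        # find()/find_all() filter by tag name and attributes instead of a CSS selector.
        spans = block.find("a", class_="contentHerf").find_all("span")
        posts.append((author, spans[0].get_text(strip=True)))
    next_span = container.select_one(".pagination span.next")
    return posts, (next_span.parent.get("href") if next_span else None)


def crawl(start_url, fetch_page, max_pages=50):
    """Follow next-page links in a loop instead of recursing."""
    posts, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_posts, next_href = parse_page(fetch_page(url))
        posts.extend(page_posts)
        url = urljoin(url, next_href) if next_href else None
    return posts


# Offline demo: look pages up in FAKE_SITE instead of hitting the network.
result = crawl("http://example.test/", fetch_page=FAKE_SITE.__getitem__)
print(result)
```

To run it against a real site, pass `fetch_page=fetch` instead of the dictionary lookup. The selectors would need to be checked against the site's current markup first.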

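That is the code. After that I learned how to use pyspider. Overall it does not reduce the amount of code we have to write, but it makes the code easier to maintain and debug, so it is still very useful. Its options are also powerful: it can handle JavaScript, meaning it can wait until the page has finished rendering before reading the page source, so we can get at data that is produced by JavaScript. Pretty nice! There are plenty of tutorials online, and they are easy to follow.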
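
For a rough idea of what that looks like, here is a minimal sketch of a pyspider handler that follows the project template pyspider generates. It is meant to be pasted into pyspider's project editor, not run as a standalone script. The selectors are borrowed from the BeautifulSoup code above and are assumptions about the page, as is the idea that this particular page needs JavaScript rendering at all. `fetch_type='js'` is the `self.crawl` option that requests a rendered page, and it requires pyspider's PhantomJS component to be running.

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)  # re-run on_start once a day
    def on_start(self):
        # fetch_type='js' asks pyspider to render the page before the callback sees it.
        self.crawl('http://www.qiushibaike.com/', callback=self.index_page,
                   fetch_type='js')

    @config(age=60 * 60)  # treat a fetched page as fresh for one hour
    def index_page(self, response):
        posts = []
        # response.doc is a PyQuery object, so it accepts CSS selectors too.
        for block in response.doc('#content-left div.article').items():
            posts.append({
                "author": block.find('.author h2').text(),
                "content": block.find('.contentHerf span').text(),
            })
        # Queue the next page, if the pagination bar has a "next" link.
        for link in response.doc('.pagination a').items():
            if link.find('span.next'):
                self.crawl(link.attr.href, callback=self.index_page,
                           fetch_type='js')
        return {"url": response.url, "posts": posts}
```

The scheduling, retrying, and result storage are handled by the framework, which is where the maintenance and debugging benefit comes from. The parsing code itself is about as long as before.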

Reposted from: https://my.oschina.net/u/3294842/blog/856831
