A Simple Python Crawler

Zhou_xinke

Tags: python, url, crawler

Article link: https://blog.csdn.net/Zhou_xinke/article/details/62037260

I wrote this Python crawler by following someone else's tutorial. Here is the link first: http://python.jobbole.com/77825/

Now the code:

# Lacks exception handling; can only crawl static pages
import re
import urllib.request
from collections import deque

queue = deque()
visited = set()  # set of pages that have already been crawled

url = 'http://news.dbanotes.net'  # entry point of the site to crawl; change it as you like

queue.append(url)
cnt = 0

# This regex matches every link, so addresses such as .jpg files are included too
linkre = re.compile(r'href="(.+?)"')

while queue:
    url = queue.popleft()  # take the next URL off the queue
    visited.add(url)       # mark the URL as crawled

    print('Crawled so far: ' + str(cnt) + '      now fetching <----' + url)
    cnt = cnt + 1
    urlop = urllib.request.urlopen(url)
    # urlop is an HTTPResponse object; its attributes and methods are listed at
    # https://docs.python.org/3/library/http.client.html#httpresponse-objects

    # Check that this is a normal page and not an address such as a .jpg file
    # (the '' default avoids a TypeError when the header is missing)
    if 'html' not in urlop.getheader('Content-Type', ''):
        continue

    try:
        data = urlop.read().decode('utf-8')  # decode as UTF-8
    except Exception:
        continue

    for x in linkre.findall(data):  # look for URLs in the decoded page
        if 'http' in x and x not in visited:
            queue.append(x)
            print('Added to queue ---->' + x)

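What follows is mainly an explanation of the code.
This crawler uses a queue and a set. The queue is not a plain list because removing an item from the front of a list is slow (list.pop(0) is O(n)), so the code imports deque from the collections module, which removes items from either end in O(1).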
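To make the queue behaviour concrete, here is a minimal sketch of first-in, first-out processing with deque. The helper name drain_fifo is made up for this illustration and is not part of the crawler above.

```python
from collections import deque


def drain_fifo(items):
    """Return the items in the order a FIFO queue hands them out."""
    queue = deque(items)
    order = []
    while queue:
        # popleft() removes from the front in O(1);
        # list.pop(0) would shift every remaining element instead.
        order.append(queue.popleft())
    return order


print(drain_fifo(['a', 'b', 'c']))  # prints ['a', 'b', 'c']
```

Items come out in the order they went in, which is what makes the crawl proceed level by level.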
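The program takes a starting URL, finds every usable URL on that page, and adds each one to the queue unless it has already been visited. This is clearly a breadth-first search. In practice the crawler ran for a while and then stopped with an exception. I will read through the tutorial author's improvements and share them later!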
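One way the crash might be avoided is sketched below. This is my own guess at a more tolerant version, not the tutorial author's improvement. It assumes that fetch errors should simply skip the page, marks a URL as seen when it is queued (so the same URL is never queued twice), and resolves relative links with urljoin, which the original code does not do. The names fetch_html, crawl, and max_pages are hypothetical.

```python
import http.client
import re
import urllib.error
import urllib.request
from collections import deque
from urllib.parse import urljoin

LINK_RE = re.compile(r'href="(.+?)"')


def fetch_html(url, timeout=10):
    """Return the page text, or None if the URL fails or is not HTML."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if 'html' not in resp.getheader('Content-Type', ''):
                return None
            # errors='replace' keeps going on pages that are not valid UTF-8
            return resp.read().decode('utf-8', errors='replace')
    except (urllib.error.URLError, http.client.HTTPException,
            ValueError, OSError):
        return None


def crawl(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl; returns the URLs in the order they were visited."""
    queue = deque([start_url])
    seen = {start_url}  # marked at enqueue time, so each URL is queued once
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        html = fetch(url)
        if html is None:
            continue  # unreachable or non-HTML page: skip it
        for href in LINK_RE.findall(html):
            link = urljoin(url, href)  # turn relative links into absolute ones
            if link.startswith('http') and link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Calling crawl('http://news.dbanotes.net') would start from the same entry point as the code above. The fetch parameter exists only so the traversal can be tried against canned pages without touching the network.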