Learning Python (Part 1)

Author: 愤怒的北方酱

Article link: https://blog.csdn.net/h123120/article/details/52862863

Goal: write a simple single-threaded crawler that starts from the Baidu home page and crawls some of the URLs under the baidu.com domain.

I. Algorithms and data structures

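1. BFS (breadth-first search): used to traverse the whole site.

2. set: records the URLs already seen during the traversal, so that no page is crawled twice.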
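How the two pieces work together is easier to see on a tiny in-memory graph than on a live site. Here is a minimal sketch, assuming a made-up link graph `site` and a hypothetical helper `bfs_order` (neither comes from the crawler itself):

```python
from collections import deque


def bfs_order(graph, start):
    """Return the nodes reachable from start, in breadth-first order."""
    visited = {start}          # every node that has ever been queued
    queue = deque([start])     # nodes waiting to be processed, oldest first
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)   # mark before queuing, so it is queued once
                queue.append(neighbour)
    return order


# Hypothetical link graph: each key is a page, each value lists the pages it links to.
site = {
    "home": ["news", "maps"],
    "news": ["home", "sports"],
    "maps": ["home"],
    "sports": [],
}
print(bfs_order(site, "home"))  # ['home', 'news', 'maps', 'sports']
```

The deque gives cheap pops from the left end, and the set makes the "seen before?" check cheap, which is why this pair is a common choice for BFS.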
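
II. Code
Basic idea: traverse the pages with BFS, pull the links out of each page with a regular expression, and use a set to mark the links that have already been visited.
The details are explained in the code comments.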
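
The link-extraction step can be tried on its own, without touching the network. This is a minimal sketch, assuming a made-up `sample_html` string and a hypothetical helper `extract_links`. It applies an `href="..."` pattern and keeps only links that contain both `http` and `baidu.com`:

```python
import re

# Captures the value of every href="..." attribute (non-greedy, double quotes only).
LINK_PATTERN = re.compile(r'href="(.+?)"')


def extract_links(html, domain="baidu.com"):
    """Return the href values that contain both 'http' and the given domain."""
    links = LINK_PATTERN.findall(html)
    return [link for link in links if "http" in link and domain in link]


# Made-up HTML fragment, used only for illustration.
sample_html = (
    '<a href="http://news.baidu.com/">News</a>'
    '<a href="/local/path">Relative</a>'
    '<a href="http://example.com/">Elsewhere</a>'
    '<link href="https://www.baidu.com/style.css">'
)
print(extract_links(sample_html))
# ['http://news.baidu.com/', 'https://www.baidu.com/style.css']
```

A regular expression like this is only a rough filter: it misses single-quoted attributes and drops relative links, which is acceptable for a learning exercise but not for a robust crawler.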

<pre name="code" class="python">#encoding:UTF-8
#python3.3
import re
import urllib.request
from collections import deque

url="http://www.baidu.com"
visited={url}#set of URLs that have already been queued
queue=deque([url])#double-ended queue, seeded with the crawler's start URL
cnt=0
#regular expression that captures the value of each href="..." attribute
LinkPattern=re.compile(r'href="(.+?)"')

#BFS over the pages
while queue:
    url=queue.popleft()

    try:#some queued URLs are incomplete or otherwise broken, so guard the request
        response=urllib.request.urlopen(url)
    except Exception:
        continue

    print('Page '+str(cnt)+', fetching: '+url)
    cnt+=1

    #skip anything that is not HTML (gif, css, ...); the header may be missing
    if 'html' not in (response.getheader('Content-Type') or ''):
        continue

    #some pages fail to decode as UTF-8, so guard that as well
    try:
        Page=response.read().decode('utf-8')
    except Exception:
        continue

    Links=LinkPattern.findall(Page)
    #walk through every link the regular expression matched
    for Link in Links:
        #keep links that contain "http" and "baidu.com" and have not been queued yet
        if Link not in visited and 'baidu.com' in Link and 'http' in Link:
            visited.add(Link)#mark at enqueue time so a URL is never queued twice
            queue.append(Link)#add the link to the queue
</pre>

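III. Summary
This is only a simple single-threaded crawler: it is very inefficient, does little processing of the page content, and does not skip requests that time out, so there is plenty of room for improvement.
Result screenshot:
(the image would not upload)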
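
One of the gaps mentioned in the summary, requests that hang instead of being skipped, can be narrowed with the `timeout` argument of `urllib.request.urlopen`. This is a minimal sketch, not a drop-in replacement: the helper name `fetch_html` and the 5-second default are assumptions chosen for illustration.

```python
import http.client
import urllib.request


def fetch_html(url, timeout=5.0):
    """Return the page text, or None if the URL cannot be fetched as UTF-8 HTML.

    Hypothetical helper; the 5-second default timeout is an arbitrary assumption.
    """
    try:
        # timeout (in seconds) makes a stalled connection raise instead of blocking
        with urllib.request.urlopen(url, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "")
            if "html" not in content_type:
                return None
            return response.read().decode("utf-8")
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError and socket timeouts;
        # ValueError covers malformed URLs and UTF-8 decoding errors.
        return None


print(fetch_html("not-a-url"))  # malformed URL, so the helper returns None
```

Inside the BFS loop, a `None` result would simply mean "skip this URL and move on to the next one in the queue".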
