A Simple Web Crawler in Python

Latest recommended article published on 2024-05-20 20:26:39

木独猪_xss

Category: Frontend. Tags: python, crawler

Article link: https://blog.csdn.net/u012442987/article/details/69774088

Simple crawler architecture
[Figure: simple crawler architecture diagram]

Sequence diagram
[Figure: crawler sequence diagram]

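URL manager

Manages the set of URLs waiting to be crawled and the set of URLs already crawled.
Keeping two collections (crawled URLs and URLs not yet crawled) prevents fetching the same page twice and prevents crawl loops.
[Figure: URL manager]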
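
The bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the article's actual url_manager module: the method names are taken from the calls made in the scheduler code later in the article, while the attribute names new_urls and old_urls and the use of sets instead of lists are assumptions. Sets are used here because membership checks on them stay fast as the crawl grows.

```python
class UrlManager:
    """Tracks URLs waiting to be crawled and URLs already crawled."""

    def __init__(self):
        self.new_urls = set()  # discovered but not yet crawled (assumed name)
        self.old_urls = set()  # already handed out for crawling (assumed name)

    def add_new_url(self, url):
        # Skip empty values and anything already known in either set.
        if not url or url in self.new_urls or url in self.old_urls:
            return
        self.new_urls.add(url)

    def add_new_urls(self, urls):
        # Accept None or an empty collection without failing.
        for url in urls or ():
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from the pending set to the crawled set.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Because a URL moves to old_urls as soon as it is handed out, a page that links back to itself or to an earlier page is never queued again, which is what breaks crawl loops.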

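Web page downloader
A tool that downloads the page behind a URL from the internet to the local machine.
It is implemented here with Python's urllib2 module, which exists only in Python 2; in Python 3 the same functionality lives in urllib.request.
An example downloader: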

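# coding=utf-8
# Python 2 example: urllib2 and cookielib are Python 2 module names.
import urllib2
import cookielib

url = "http://www.baidu.com"
request = urllib2.Request(url)
request.add_header("User-Agent", "Mozilla/5.0")  # send a browser-like User-Agent

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)  # make urllib2.urlopen handle cookies

response = urllib2.urlopen(request)
print(response.getcode())  # HTTP status code
print(cj)                  # cookies set by the server

# Write the raw bytes; binary mode avoids newline translation on Windows.
with open("baidu.txt", "wb") as fout:
    fout.write(response.read())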
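
For readers on Python 3, roughly the same steps can be written with urllib.request and http.cookiejar, the Python 3 homes of urllib2 and cookielib. The sketch below is an adaptation, not the article's code: the function names build_opener_with_cookies, build_request and download, and the 10-second timeout, are assumptions chosen for illustration.

```python
import http.cookiejar
import urllib.request


def build_opener_with_cookies():
    """Return (opener, cookie_jar); the opener stores cookies in the jar."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))
    return opener, cookie_jar


def build_request(url):
    """Create a request that carries a browser-like User-Agent header."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", "Mozilla/5.0")
    return request


def download(url, timeout=10):
    """Fetch url and return the body as bytes, or None if the status is not 200.

    HTTP errors such as 404 are raised by urllib as urllib.error.HTTPError.
    """
    opener, _ = build_opener_with_cookies()
    with opener.open(build_request(url), timeout=timeout) as response:
        if response.getcode() != 200:
            return None
        return response.read()
```

Calling download("http://www.baidu.com") needs network access and returns bytes, so the result should be written to a file opened in binary mode or decoded before further processing.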
 
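Web page parser (BeautifulSoup)
BeautifulSoup is a powerful third-party Python library for extracting information from web pages. It can use either html.parser or lxml as the underlying parser.
The parser processes the downloaded page content and extracts the valuable data along with new URLs; the scheduler keeps adding those new URLs to the URL manager.
A small example: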

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
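# Parse with the standard-library parser. from_encoding applies here because
# html_doc is a byte string in Python 2.
soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
# Print the tag name, link target and link text of every <a> element.
for node in soup.find_all('a'):
    print("%s %s %s" % (node.name, node['href'], node.get_text()))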
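
The "extract new URLs" half of the parser's job can also be illustrated without any third-party install. The sketch below deliberately swaps BeautifulSoup for the standard-library html.parser module in Python 3, so it is not the article's html_parser module; the names LinkCollector and extract_links are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # urljoin turns relative links into absolute URLs.
            self.links.add(urljoin(self.page_url, href))


def extract_links(page_url, html_text):
    """Return the set of absolute link targets found in html_text."""
    collector = LinkCollector(page_url)
    collector.feed(html_text)
    return collector.links
```

Resolving every link against the page URL matters for a crawler, since the URL manager can only detect duplicates reliably when the same page is always written as the same absolute URL.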
 
Before crawling, first analyze the target page, as shown below:
[Figure: analysis of the target page]

Finally, here is the scheduler code:

# coding=utf-8
# Scheduler: wires the URL manager, downloader, parser and outputer together.
import url_manager, html_downloader, html_parser, html_outputer


class SpiderMain(object):
    def __init__(self):
        self.urls = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownloader()
        self.parser = html_parser.HtmlParser()
        self.outputer = html_outputer.HtmlOutputer()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print('craw %d: %s' % (count, new_url))
                html_cont = self.downloader.download(new_url)
                # parse() returns the links found on the page and the extracted data.
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
                if count == 100:  # stop after 100 pages
                    break
                count = count + 1
            except IOError as e:
                # Report the failure and continue with the next URL.
                print(e)
                print('craw failed')
        self.outputer.output_html()


if __name__ == '__main__':
    root_url = 'http://baike.baidu.com/view/21087.htm'
    spider = SpiderMain()
    spider.craw(root_url)
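
The scheduler calls collect_data and output_html on the outputer, but that module is not shown in the article. A minimal Python 3 sketch of what such a class might look like is given below. Only the two method names come from the scheduler code; the render helper, the default output.html path, and the assumption that each data item is a dict with url, title and summary keys are all hypothetical.

```python
import html


class HtmlOutputer:
    """Collects parsed records and writes them out as a simple HTML table."""

    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        # The parser may return None for pages with nothing to extract.
        if data is not None:
            self.datas.append(data)

    def render(self):
        # Assumed record layout: a dict with "url", "title" and "summary" keys.
        rows = []
        for data in self.datas:
            cells = "".join(
                "<td>%s</td>" % html.escape(str(data.get(key, "")))
                for key in ("url", "title", "summary"))
            rows.append("<tr>%s</tr>" % cells)
        return ('<html><head><meta charset="utf-8"></head><body><table>\n'
                "%s\n</table></body></html>" % "\n".join(rows))

    def output_html(self, path="output.html"):
        with open(path, "w", encoding="utf-8") as fout:
            fout.write(self.render())
```

Escaping each cell with html.escape keeps markup found in the crawled text from breaking the generated table.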
 
The full source code can be downloaded here: http://download.csdn.net/detail/zxc123e/9506792

Part of the crawled content:
[Figure: sample of the crawled output]

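Note: web page structure keeps changing as sites are upgraded. If an exception occurs while running the crawler, re-analyze the target page and update the program so that it crawls correctly again.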
