Building a Simple Web Crawler in Python (Part 1)


walxiaosage


Article link: https://blog.csdn.net/walxiaosage/article/details/50437328


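1. Introduction

2. What a crawler is and why it is valuable

1. What is a crawler

Crawler: a program that automatically fetches information from the Internet.

2. The value of crawler technology

Value: Internet data, put to work for your own purposes.

3. Simple crawler architecture

1. Simple crawler architecture

1. Scheduler: starts and stops the crawler, and monitors its execution.

2. URL manager: manages the URLs waiting to be crawled and the URLs already crawled, and hands a pending URL to the page downloader.

3. Page downloader: downloads the page behind a pending URL, stores it as a string, and passes that string to the page parser.

4. Page parser: parses the string. It extracts the valuable data, and it also extracts URLs pointing to other pages, which are fed back into the URL manager.

The URL manager, downloader, and parser form a loop that keeps running as long as there are relevant URLs left to crawl.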
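The loop described above can be sketched in a few lines. This is a minimal illustration, not the article's implementation: it is written for Python 3, and the names crawl, fake_download, fake_parse, and FAKE_SITE are hypothetical. The downloader and parser are passed in as functions, and a small in-memory dictionary stands in for the network so the sketch runs offline.

```python
from collections import deque


def crawl(root_url, download, parse, max_pages=10):
    """Run the manager -> downloader -> parser loop until no URLs remain."""
    pending = deque([root_url])  # URL manager: URLs waiting to be crawled
    seen = {root_url}            # URL manager: every URL ever queued
    results = []
    while pending and len(results) < max_pages:
        url = pending.popleft()
        page = download(url)               # downloader: URL -> page content
        if page is None:
            continue                       # skip pages that failed to download
        data, new_urls = parse(url, page)  # parser: page -> data + new URLs
        results.append(data)
        for new_url in new_urls:
            if new_url not in seen:        # never queue the same URL twice
                seen.add(new_url)
                pending.append(new_url)
    return results


# Hypothetical in-memory "site": URL -> (title, outgoing links).
FAKE_SITE = {
    "a": ("page A", ["b", "c"]),
    "b": ("page B", ["a"]),  # links back to "a": a crawl loop
    "c": ("page C", []),
}


def fake_download(url):
    """Stand-in downloader: look the page up instead of using the network."""
    return FAKE_SITE.get(url)


def fake_parse(url, page):
    """Stand-in parser: the fake page already holds its data and links."""
    title, links = page
    return title, links


print(crawl("a", fake_download, fake_parse))
```

Page "b" links back to "a", yet each page is visited only once, because the URL bookkeeping refuses to queue a URL it has already seen.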

2. Dynamic run flow of the simple crawler architecture

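4. URL manager and how to implement it

1. URL manager

It manages the set of URLs waiting to be crawled and the set of URLs already crawled.

This prevents crawling the same page twice and prevents crawl loops (pages that link to each other).

Functions the URL manager supports:

2. Ways to implement the URL manager

1) Keep the pending and crawled URLs in memory, in a set; capacity is limited by available memory.

2) Store them in a relational database.

3) Store them in a cache database, using its set type; this is currently the mainstream approach.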
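The first option, keeping both collections in memory, might look like the sketch below. It is a Python 3 illustration under stated assumptions: the class name UrlManager and its method names are hypothetical and do not come from the article.

```python
class UrlManager:
    """In-memory URL manager backed by two Python sets (hypothetical API)."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        """Queue a URL unless it is empty, already queued, or already crawled."""
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        """Queue several URLs, for example the ones a parser just extracted."""
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        """Report whether any URL is still waiting to be crawled."""
        return bool(self.new_urls)

    def get_new_url(self):
        """Take one pending URL and mark it as crawled."""
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


manager = UrlManager()
manager.add_new_urls(["http://example.com/1", "http://example.com/1"])
print(len(manager.new_urls))   # the duplicate is stored only once
first = manager.get_new_url()
manager.add_new_url(first)     # re-adding a crawled URL is ignored
print(manager.has_new_url())
```

A set gives constant-time membership checks, which is what makes duplicate and loop prevention cheap; the trade-off is that everything is lost when the process exits and the sets must fit in memory.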

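5. Page downloader and the urllib2 module

1. About page downloaders

A page downloader is a tool that downloads the page behind a URL from the Internet to the local machine.

Page downloaders available in Python:

1) urllib2: the official basic module (Python 2 standard library)

2) requests: a third-party package, more powerful

2. Three ways to download a page with urllib2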

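1) The simplest way: urllib2.urlopen(url)

import urllib2  # Python 2 module for fetching pages

# Request the URL directly
response = urllib2.urlopen('http://www.baidu.com')

# Get the status code; 200 means the request succeeded
print response.getcode()

# Read the page content
cont = response.read()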

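2) Adding data and an HTTP header

Use the Request class to build a Request object.

import urllib2

# Create the Request object
request = urllib2.Request('http://www.baidu.com')

# Add data: add_data takes a single argument, the URL-encoded body,
# and attaching data turns the request into a POST
request.add_data('a=100')

# Add an HTTP header
request.add_header('User-Agent', 'Mozilla/5.0')

# Send the request and get the result
response = urllib2.urlopen(request)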

3) Adding handlers for special situations

import urllib2, cookielib

# Create a cookie container
cj = cookielib.CookieJar()

# Create an opener that handles cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Install the opener into urllib2
urllib2.install_opener(opener)

# Fetch the page with a cookie-aware urllib2
response = urllib2.urlopen("http://www.baidu.com/")

3. urllib2 sample code

import urllib2, cookielib

url = "http://www.baidu.com"

print "Method 1:"
response1 = urllib2.urlopen(url)
print response1.getcode()
print len(response1.read())

print "Method 2:"
request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0')
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

print "Method 3:"
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print cj  # cookies collected while fetching the page
print len(response3.read())
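The listings above target Python 2. In Python 3, urllib2 and cookielib were reorganized into urllib.request, urllib.parse, and http.cookiejar. The sketch below shows the same three building blocks under that assumption. The helper name fetch is hypothetical, and nothing is sent over the network unless fetch is actually called.

```python
import http.cookiejar
import urllib.parse
import urllib.request

url = "http://www.baidu.com"

# Method 2 equivalent: a Request object carrying form data and a header.
# In Python 3 the body must be bytes, so the encoded form is turned into bytes.
data = urllib.parse.urlencode({"a": "100"}).encode("ascii")
request = urllib.request.Request(url, data=data)
request.add_header("User-Agent", "Mozilla/5.0")

# Method 3 equivalent: an opener that stores cookies in a CookieJar.
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))


def fetch(target, timeout=10):
    """Hypothetical helper: open a URL or Request through the cookie-aware opener."""
    with opener.open(target, timeout=timeout) as response:
        return response.status, response.read()


# Inspect the prepared request without sending it.
print(request.get_method())  # attaching data makes this a POST
print(request.data)
```

Calling fetch(url) or fetch(request) performs the actual download and returns the status code together with the body as bytes.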

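6. Page parser and the third-party BeautifulSoup module

1. About page parsers

A page parser is a tool that extracts valuable data from a web page.

It takes the HTML page as a string and, by parsing it, produces the valuable data and a list of new URLs.

Page parsers available in Python:

1) Regular expressions: fuzzy matching on the string

2) The html.parser module

3) BeautifulSoup, a third-party library: it can use html.parser or lxml as its underlying parser

4) lxml, a third-party library: parses XML or HTML

The last three perform structured parsing:

They work on the DOM (Document Object Model) tree.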
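To illustrate the second option in the list, here is a minimal Python 3 sketch that collects links using only the standard html.parser module. The class name LinkCollector and the sample HTML string are assumptions made for this example. Note that html.parser is event-driven: it reports tags as it meets them and does not build a tree itself.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical sample page fragment.
sample = ('<p>See <a href="/view/123.html">Python</a> and '
          '<a href="/view/456.html">Crawler</a>.</p>')
collector = LinkCollector()
collector.feed(sample)
print(collector.links)
# prints [('/view/123.html', 'Python'), ('/view/456.html', 'Crawler')]
```

The amount of bookkeeping needed even for this small task is one reason a tree-based library such as BeautifulSoup is convenient for crawlers.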

2. Introducing and installing BeautifulSoup

BeautifulSoup

- A third-party Python library for extracting data from HTML or XML

- Official site: http://www.crummy.com/software/BeautifulSoup/

Install and test:

- Install: pip install beautifulsoup4

or easy_install beautifulsoup4, or apt-get install python-bs4

- Test: import bs4

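3. BeautifulSoup syntax

Create a BeautifulSoup object from the HTML page; creating it builds the DOM tree, which can then be searched.

For example, a node can be searched by its name, attributes, and text:

Steps: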

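from bs4 import BeautifulSoup
import re

######## First ########
# Create a BeautifulSoup object from the HTML page string
soup = BeautifulSoup(
    html_doc,               # the HTML document string
    'html.parser',          # the HTML parser
    from_encoding='utf-8'   # encoding of the HTML document
)
# If the page's encoding and the encoding given in the code do not match, parsing produces garbled text

######## Next, search for nodes (find_all, find): find_all(name, attrs, string)
# Find all nodes whose tag is a
soup.find_all('a')

# Find all a nodes whose link has the form /view/123.html
soup.find_all('a', href='/view/123.html')
soup.find_all('a', href=re.compile(r'/view/\d+\.html'))

# Find all div nodes with class abc and text Python
# (class is a Python keyword, so the argument is spelled class_)
soup.find_all('div', class_='abc', string='Python')

######## Finally, access the node's information
# For example, for the node <a href='1.html'>Python</a>
# Get the tag name of the node
node.name

# Get the href attribute of the node
node['href']

# Get the link text of the node
node.get_text()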
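One detail of the regular-expression search is easy to miss: in a regular expression an unescaped dot matches any character, so a pattern meant for links of the form /view/123.html is stricter when the dot is escaped. The Python 3 sketch below compares the two spellings using only the standard re module; the sample href values are made up for illustration.

```python
import re

loose = re.compile(r'/view/\d+.html')    # "." matches any single character
strict = re.compile(r'/view/\d+\.html')  # "\." matches only a literal dot

# Hypothetical href values, as a parser might extract them from a page.
hrefs = ['/view/123.html', '/view/45xhtml', '/item/9.html']

print([h for h in hrefs if loose.search(h)])
# prints ['/view/123.html', '/view/45xhtml']
print([h for h in hrefs if strict.search(h)])
# prints ['/view/123.html']
```

A compiled pattern like either of these can be passed as the href argument of find_all, as in the search step above.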

4. BeautifulSoup worked example

from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Create a BeautifulSoup object from the HTML page string
soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

print "Get all links"
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()

print "Get the link for Lacie"
link_node = soup.find('a', href='http://example.com/lacie')
print link_node.name, link_node['href'], link_node.get_text()

print "Regular expression match"
link_node = soup.find('a', href=re.compile(r'ill'))
print link_node.name, link_node['href'], link_node.get_text()

print "Get the content of the p node with a given class"
p_node = soup.find('p', class_="title")  # class is a Python keyword, so an underscore is appended
print p_node.name, ':', p_node.get_text()
