Analyzing a Simple Python Crawler: [Python Topic Study] Developing a Simple Crawler in Python

Latest recommended article published on 2023-05-03 14:51:55

weixin_39999730


Article link: https://blog.csdn.net/weixin_39999730/article/details/111458958


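The goal is to learn how to develop a lightweight crawler. The case study here scrapes static web pages that do not require a login. Topics covered: an introduction to crawlers, a simple crawler architecture, the URL manager, the web page downloader (urllib2), and the web page parser (BeautifulSoup).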

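I. Introduction to crawlers and their technical value

1. What a crawler is

Crawler: a program that automatically fetches information from the internet.

In other words, a crawler visits the internet on its own and extracts data from the pages it finds.

2. The value of crawlers

Internet data, put to work for you!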

II. Simple crawler architecture

Run flow:
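The run flow can be sketched as a small scheduler loop that ties together the components this article names (URL manager, downloader, parser). This is a minimal sketch under assumptions, not the code behind the original diagram: the function `crawl` and its parameters `download`, `parse`, and `max_pages` are hypothetical names, and the downloader and parser are passed in as plain functions so the loop can be tried without network access.

```python
from collections import deque


def crawl(root_url, download, parse, max_pages=10):
    """Breadth-first crawl loop (illustrative sketch).

    download(url) -> page content, or None on failure
    parse(url, content) -> (iterable of new URLs, extracted data)
    """
    to_visit = deque([root_url])   # URLs waiting to be crawled
    seen = {root_url}              # every URL ever queued, to avoid repeats and loops
    results = []

    while to_visit and len(results) < max_pages:
        url = to_visit.popleft()
        content = download(url)
        if content is None:
            continue               # skip pages that failed to download
        new_urls, data = parse(url, content)
        results.append(data)
        for new_url in new_urls:
            if new_url not in seen:
                seen.add(new_url)
                to_visit.append(new_url)
    return results


if __name__ == "__main__":
    # A fake three-page "site" stands in for the network (hypothetical data).
    site = {"a": ["b", "c"], "b": ["a"], "c": []}
    pages = crawl("a", download=site.get, parse=lambda url, links: (links, url))
    print(pages)
```

In this sketch the `seen` set plays the role of the URL manager; a real crawler would plug an HTTP downloader and an HTML parser into the two function slots.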

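III. The URL manager and how to implement it

1. The URL manager

The URL manager maintains two collections, the URLs waiting to be crawled and the URLs already crawled, so that no page is fetched twice and the crawler does not loop between pages that link to each other.

2. Implementation approaches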
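One straightforward approach is to keep both collections in memory as Python sets. The sketch below assumes that approach; the class name `UrlManager` and its method names are hypothetical and chosen only for illustration. For larger crawls the same interface could be backed by other storage.

```python
class UrlManager:
    """Tracks URLs waiting to be crawled and URLs already crawled, in memory."""

    def __init__(self):
        self.new_urls = set()   # waiting to be crawled
        self.old_urls = set()   # already crawled

    def add_new_url(self, url):
        # Ignore empty values and anything already known in either set.
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or ():
            self.add_new_url(url)

    def has_new_url(self):
        return bool(self.new_urls)

    def get_new_url(self):
        # Move one URL from the waiting set to the crawled set and return it.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


if __name__ == "__main__":
    manager = UrlManager()
    manager.add_new_urls(["/item/a", "/item/b", "/item/a"])  # the duplicate is dropped
    first = manager.get_new_url()
    manager.add_new_url(first)   # already crawled, so it is ignored
    print(len(manager.new_urls), len(manager.old_urls))
```

Because a set rejects duplicates by construction, membership checks stay cheap, and checking both sets before adding is what prevents repeat and circular crawling.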

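IV. The web page downloader and the urllib2 module

1. The web page downloader

A tool that downloads the web page at a given URL to the local machine.

Which web page downloaders does Python have?

2. Three ways to download a web page with urllib2

a. Method 1: the simplest approach

b. Method 2: add data and an HTTP header

c. Method 3: add handlers for special scenarios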
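Method 2 can be illustrated without touching the network by building the request object and inspecting it. This is a minimal sketch using Python 3's `urllib.request` (the successor to urllib2); the helper name `build_request`, the address `http://example.com/search`, and the form field `q` are placeholders assumed for the example.

```python
import urllib.parse
import urllib.request


def build_request(url, form=None, user_agent="Mozilla/5.0"):
    """Create a Request carrying optional form data and a User-Agent header."""
    data = None
    if form is not None:
        # The request body must be bytes, so URL-encode the form fields first.
        data = urllib.parse.urlencode(form).encode("utf-8")
    request = urllib.request.Request(url, data=data)
    request.add_header("User-Agent", user_agent)
    return request


if __name__ == "__main__":
    request = build_request("http://example.com/search", form={"q": "python"})
    print(request.get_method())   # POST when data is supplied, GET otherwise
    print(request.data)
    # Passing the object to urllib.request.urlopen(request) would actually send it.
```

Supplying `data` is what switches the request from GET to POST, and the header lets the crawler identify itself the way a browser would.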

3. urllib2 example code

Because I am using Python 3.x, the module to import is urllib.request rather than urllib2.

import urllib.request
import http.cookiejar

url = "http://www.baidu.com"

# Method 1: pass the URL straight to urlopen
print('Method 1')
response1 = urllib.request.urlopen(url)
print(response1.getcode())
print(len(response1.read()))

# Method 2: build a Request object so that headers (or data) can be attached
print('Method 2')
request = urllib.request.Request(url)
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib.request.urlopen(request)
print(response2.getcode())
print(len(response2.read()))

# Method 3: install an opener with a cookie handler
print('Method 3')
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)
response3 = urllib.request.urlopen(url)
print(response3.getcode())
print(cj)
print(response3.read())

Output:

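V. The web page parser and the third-party BeautifulSoup module

1. Introduction to web page parsers

A tool that extracts valuable data from a web page.

Which web page parsers does Python have?

Structured parsing: the DOM (Document Object Model) tree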
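To make the idea of a DOM tree concrete, the sketch below parses a tiny document and prints its elements as an indented outline. It uses the standard library's `xml.dom.minidom` purely for illustration (the article itself goes on to use BeautifulSoup); the sample markup and the helper name `outline` are assumptions made for the example.

```python
from xml.dom import minidom

# A tiny, well-formed document. Real-world HTML is often not well-formed XML,
# which is one reason to prefer an HTML-aware parser in practice.
DOC = ("<html><head><title>Demo</title></head>"
       "<body><p>Hello <a href='/x'>link</a></p></body></html>")


def outline(node, depth=0):
    """Return one indented line per element node, in document order."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        depth += 1
    for child in node.childNodes:
        lines.extend(outline(child, depth))
    return lines


if __name__ == "__main__":
    print("\n".join(outline(minidom.parseString(DOC))))
```

Each level of indentation corresponds to one level of nesting in the tree, which is the structure a structured parser lets you navigate.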

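2. Introducing and installing the BeautifulSoup module

Install BeautifulSoup4 and test it. To install: pip install beautifulsoup4

Even after this installation succeeded, the module still could not be imported in PyCharm. I then downloaded the package from the official site, unpacked it, and installed it again, but the import kept failing with No module named 'bs4'.

In the end, it only worked after installing the package from within PyCharm, as follows.

Open the following window.

Click "Install Package" to install; the following message indicates that the installation succeeded.

After a successful installation, reopening the window shows the installed version and other details, as shown below.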
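A likely cause of this kind of mismatch (an assumption, since the original does not diagnose it) is that pip installed the package into a different Python environment from the one the PyCharm project runs. A small check like the following, with the hypothetical helper `module_available`, shows which interpreter is running and whether it can find bs4.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the running interpreter can locate the named top-level module."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("bs4 importable:", module_available("bs4"))
```

Running it once inside PyCharm and once from the terminal makes the comparison easy: if the interpreter paths differ, installing with that specific interpreter (`python -m pip install beautifulsoup4`) targets the right environment.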

3. BeautifulSoup syntax

4. BeautifulSoup example test

from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
# In Python 3 a str is already Unicode, so from_encoding='utf-8' would be
# ignored for this input; the argument is therefore left out.
soup = BeautifulSoup(html_doc, 'html.parser')

print("All links")
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print("The Lacie link")
link_node = soup.find('a', href="http://example.com/lacie")
print(link_node.name, link_node['href'], link_node.get_text())

print("Regular expression match")
link_node = soup.find('a', href=re.compile(r"ill"))
print(link_node.name, link_node['href'], link_node.get_text())

print("Text of the title paragraph")
p_node = soup.find('p', class_="title")
print(p_node.name, p_node.get_text())

Output:

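VI. Hands-on practice: crawling data from 1,000 Baidu Baike pages

1. Analyzing the target

Target: the Baidu Baike entry for Python and the entry pages related to it; the data to collect from each page is the title and the summary.

URL format: entry page URLs have the form /item/计算机程序设计语言/7073760

Data format:

Title:

***

Summary:

***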
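Given the URL format above, links found on a page can be filtered down to entry pages and turned into absolute URLs. The sketch below is one way to do that under stated assumptions: the base address `https://baike.baidu.com`, the pattern `^/item/`, and the helper name `entry_urls` are illustrative choices, not taken from the original.

```python
import re
from urllib.parse import urljoin

# Assumed base address of Baidu Baike; the article only gives the relative form.
BASE_URL = "https://baike.baidu.com"
ITEM_PATH = re.compile(r"^/item/.+")


def entry_urls(hrefs, base_url=BASE_URL):
    """Keep hrefs that look like entry pages and make them absolute, without duplicates."""
    urls = []
    for href in hrefs:
        if ITEM_PATH.match(href):
            full = urljoin(base_url, href)
            if full not in urls:
                urls.append(full)
    return urls


if __name__ == "__main__":
    # Sample hrefs as they might appear on a page; only the first form is an entry link.
    sample = ["/item/计算机程序设计语言/7073760", "/help", "#top",
              "/item/计算机程序设计语言/7073760"]
    print(entry_urls(sample))
```

The resulting list is what a parser would hand back to the URL manager as new pages to crawl.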
