Python Data Analysis: Web Crawler Basics

Latest recommended article published on 2024-05-30 00:04:56

Sweeney Chen

Category: Python Data Analysis. Tags: Python, data analysis, web crawler

Article link: https://blog.csdn.net/weixin_41792682/article/details/89503924

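Web crawler:

A program that automatically fetches information from the internet
The collected data can be used for analysis and for product development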

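Basic crawler architecture:

URL management module
- Manages the URLs that are scheduled to be crawled and those that have already been crawled
Page download module
- Visits and downloads the URLs handed over by the URL management module
Page parsing module
- Parses the pages fetched by the page download module, then processes or saves the data
- If a parsed page contains URLs that should be crawled next, they go back to the URL management module and the loop continues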
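
The loop formed by the three modules above can be sketched roughly as follows. This is only an illustrative sketch: the function name `crawl` and the `download` and `parse` callables are hypothetical placeholders, not part of any library, and `max_pages` is an assumed safety limit.

```python
from collections import deque


def crawl(seed_url, download, parse, max_pages=100):
    """Breadth-first crawl loop tying the three modules together.

    download(url) returns the page content (hypothetical download module).
    parse(url, page) returns (data, iterable_of_new_urls) (hypothetical parser).
    """
    to_visit = deque([seed_url])  # URL management: URLs waiting to be crawled
    seen = {seed_url}             # URL management: every URL ever queued
    results = {}

    while to_visit and len(results) < max_pages:
        url = to_visit.popleft()
        page = download(url)               # page download module
        data, new_urls = parse(url, page)  # page parsing module
        results[url] = data
        for new_url in new_urls:
            if new_url not in seen:        # skip duplicates and circular links
                seen.add(new_url)
                to_visit.append(new_url)
    return results
```

Passing the download and parse steps in as functions keeps the loop independent of how pages are actually fetched, so it can be tried against an in-memory dictionary before pointing it at a real site.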

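URL management module:

Prevents repeated crawling and circular links
Implementations:
- A Python set, because a set guarantees that it holds no duplicates
- A table in a database
- Redis as a cache database, suited to large-scale crawling at big internet companies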
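
A minimal sketch of the set-based option might look like this. The class name `UrlManager` and its method names are hypothetical, chosen only for illustration; a database or Redis backend would presumably expose the same operations.

```python
class UrlManager:
    """Tracks pending and already-crawled URLs with two sets."""

    def __init__(self):
        self._new = set()  # URLs waiting to be crawled
        self._old = set()  # URLs that have already been handed out

    def add(self, url):
        """Queue url unless it is empty or already known. Returns True if queued."""
        if not url or url in self._new or url in self._old:
            return False
        self._new.add(url)
        return True

    def has_next(self):
        return bool(self._new)

    def next(self):
        """Hand out one pending URL and mark it as crawled."""
        url = self._new.pop()
        self._old.add(url)
        return url
```

Note that `set.pop()` returns an arbitrary element, so this sketch does not preserve crawl order; a queue alongside the sets would be needed for that.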

Page download module:

Downloads the page behind a URL to local disk or reads it into memory (as a string)
Implementations:
- urllib
- requests
Downloading directly from a URL

response = urllib.request.urlopen(url)
response.getcode()  # HTTP status code
response.read()     # response body as bytes
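Downloading via a Request object

request = urllib.request.Request(url)
request.add_header(key, value)  # set a request header
request.data = data             # optional POST body as bytes (add_data() no longer exists in Python 3.4+)
response = urllib.request.urlopen(request)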
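Accessing with cookies

Use the http.cookiejar module

cookie_jar = http.cookiejar.CookieJar()
# The cookie jar has to be attached to the opener through HTTPCookieProcessor
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
urllib.request.install_opener(opener)
response = urllib.request.urlopen(url)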

import urllib.request

test_url = "https://www.baidu.com/"

Downloading directly from a URL

# Direct download with urlopen
response = urllib.request.urlopen(test_url)
print(response.getcode())  # 200 means the request succeeded
print(response.read())     # response body as bytes

Downloading via a Request object

# Access through a Request object
request = urllib.request.Request(test_url)
request.add_header("user-agent", "Mozilla/5.0")  # present a browser-like user agent

response = urllib.request.urlopen(request)
print(response.getcode())  # 200 means the request succeeded
print(response.read())

Accessing with cookies

# Access with cookie handling
import http.cookiejar

cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
urllib.request.install_opener(opener)  # later urlopen calls go through this opener

response = urllib.request.urlopen(test_url)
print(response.getcode())  # 200 means the request succeeded
print(response.read())
print(cookie_jar)          # cookies set by the server
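
The download examples print the raw bytes of the response. To read a page into memory as a string or save it to local disk, as the module description mentions, one possible sketch is shown here. The helper names `fetch_page` and `save_page`, the 10-second timeout, and the UTF-8 fallback are all assumptions made for illustration.

```python
import urllib.request
from pathlib import Path


def fetch_page(url, user_agent="Mozilla/5.0", timeout=10):
    """Download url and return the body decoded to a str (hypothetical helper)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Use the charset from the Content-Type header; fall back to UTF-8 (assumption)
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


def save_page(text, path):
    """Write the decoded page to local disk as UTF-8 (hypothetical helper)."""
    Path(path).write_text(text, encoding="utf-8")
```

A call such as `save_page(fetch_page(test_url), "page.html")` would then cover both the in-memory and the on-disk case.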

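Page parsing module:

Extracts data from pages that have already been downloaded
Implementations:
- Regular expressions: fuzzy string matching
- html.parser
- BeautifulSoup: structured page parsing
- lxml
Structured parsing
Treats the page as a Document Object Model (DOM), a tree structure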
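
As a small illustration of the html.parser option, the sketch below collects the page title and every link from a downloaded page. The class name `LinkExtractor` and the function `extract` are hypothetical; BeautifulSoup or lxml would give a fuller tree-based interface for the same task.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the <title> text and all <a href> targets of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(base_url, html):
    """Return (title, links) for one page; links can be fed back to the URL manager."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links
```

The returned links are what the parsing module would hand back to the URL management module for the next round of crawling.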
