Common Techniques for Python Web Crawlers

Latest recommended article published on 2024-02-17 08:15:00

朝阳区靓仔_James

Tags: python, crawler, pandas, programming languages, artificial intelligence

Article link: https://blog.csdn.net/weixin_58753619/article/details/130587163

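The urllib module

urllib ships with Python's standard library and is the most basic library for making network requests. Its urlopen() function sends a request to a given URL and returns the response data.

urllib is a package that collects several modules for working with URLs.

urllib.request: opens and reads URLs

Three lines of code are enough to fetch the source of the Baidu homepage: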

import urllib.request
# Open the page to crawl
response = urllib.request.urlopen('http://www.baidu.com')
# Alternatively:
# from urllib import request
# response = request.urlopen('http://www.baidu.com')

# Print the page source
print(response.read().decode())

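read() returns bytes. Without decode(), print() shows a bytes literal in which non-ASCII characters appear as hexadecimal escapes such as \xe7\x99\xbe. Calling decode(), which uses UTF-8 by default, turns those bytes into readable text.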
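
decode() assumes UTF-8 unless told otherwise, which fails or garbles the text on pages served in another encoding. The sketch below is one possible way to pick the charset from the Content-Type header instead. The helper names decode_body and fetch_html are hypothetical, and falling back to UTF-8 with errors='replace' is an assumption, not something the original example does.

```python
import urllib.request
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str]) -> str:
    """Decode response bytes, falling back to UTF-8 when no charset is declared."""
    return raw.decode(charset or 'utf-8', errors='replace')


def fetch_html(url: str) -> str:
    """Download a page and decode it with the charset from its Content-Type header."""
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset()  # None if not declared
        return decode_body(response.read(), charset)


# Offline demonstration of what decode() changes:
raw = '百度一下'.encode('utf-8')
print(raw)                     # a bytes literal made of \x.. escapes
print(decode_body(raw, None))  # the readable string
```

Calling fetch_html('http://www.baidu.com') would then return the page source as a str.
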
The remaining three submodules are not covered in detail in this article:

urllib.error: contains the exceptions raised by urllib.request
urllib.parse: parses URLs
urllib.robotparser: parses robots.txt files
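
Although these submodules are outside the scope of this article, urllib.parse comes up in almost every crawler, so a very short sketch may help. The example.com URLs and the query parameters are made up for illustration.

```python
from urllib.parse import urlencode, urljoin, urlparse

# Split a URL into its components.
parts = urlparse('http://example.com/news/index.html?page=2')
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Build a query string from a dict.
query = urlencode({'word': 'hello', 'page': 2})
print(query)

# Resolve a relative link found on a page against that page's URL.
absolute = urljoin('http://example.com/news/index.html', 'detail/1.html')
print(absolute)
```
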

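The requests module

requests is a third-party module for making HTTP requests in Python. It is considerably simpler to use than urllib and has a friendlier API.
Taking a GET request as an example: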

import requests
response = requests.get('http://www.baidu.com/')
print('Status code:', response.status_code)
print('Request URL:', response.url)
print('Headers:', response.headers)
print('Cookies:', response.cookies)
# print('Source as text:', response.text)
# print('Source as bytes:', response.content)

The output is:

Status code: 200
Request URL: http://www.baidu.com/
Headers: {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Sun, 10 May 2020 02:43:33 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:28:23 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
Cookies: <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

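A note on the difference between response.text and response.content:

response.content is the data exactly as it was received from the network, with no decoding applied, so it is of type bytes.
response.text is response.content decoded into a str. Decoding needs an encoding, and requests picks one from the response headers or by guessing. The guess is sometimes wrong, which produces garbled text. In that case, decode manually with response.content.decode('utf-8').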
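
The effect of a wrong encoding guess can be reproduced without any network access. The sketch below uses a plain bytes object as a stand-in for response.content; the sample string is an assumption chosen for illustration.

```python
# Stand-in for response.content: UTF-8 encoded bytes as a server might send them.
raw = '百度一下，你就知道'.encode('utf-8')

# Decoding with the wrong encoding does not fail, it silently yields mojibake.
garbled = raw.decode('iso-8859-1')

# Decoding with the right encoding gives the intended text.
correct = raw.decode('utf-8')

# ISO-8859-1 maps every byte to one character, so the damage is reversible.
recovered = garbled.encode('iso-8859-1').decode('utf-8')

print(garbled == correct)    # False
print(recovered == correct)  # True
```

With requests itself, assigning response.encoding = 'utf-8' before reading response.text has the same effect as decoding response.content manually.
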

Taking a POST request as an example:

import requests
data={'word':'hello'}
response = requests.post('http://www.baidu.com',data=data)
print(response.content)
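
When data is a dict, requests sends it as a form-encoded body. To make that concrete, the sketch below builds an equivalent POST request with the standard library and inspects it without sending anything. Using urllib here is a substitution made for illustration, not the approach taken in the example above.

```python
from urllib.parse import urlencode
from urllib.request import Request

data = {'word': 'hello'}

# Form-encode the dict and convert it to bytes, as a POST body must be.
body = urlencode(data).encode('utf-8')

# Supplying data makes urllib treat the request as a POST.
req = Request('http://www.baidu.com', data=body)

print(req.get_method())
print(req.data)
```

Passing req to urllib.request.urlopen() would actually send it.
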

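Handling request headers

Some sites use anti-crawler measures to block automated scraping and refuse requests that do not look like they come from a browser. Sending browser-like request headers often gets past this check.

To find those headers, open the page in a browser, right-click and choose "Inspect", switch to the "Network" tab, refresh the page, and select the first entry. The Headers panel on the right lists the request headers, including User-Agent.

For example: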

import requests
url = 'https://www.bilibili.com/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.content.decode())
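
When many requests need the same browser-like headers, it can be convenient to build them in one place. This is a minimal sketch: the helper name browser_headers is hypothetical, and the Accept-Language and Referer values are assumptions rather than headers the target site is known to require.

```python
from typing import Dict, Optional

# Headers copied from a browser session; the User-Agent is the one used above.
DEFAULT_HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/81.0.4044.129 Safari/537.36'),
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',  # assumed value
}


def browser_headers(referer: Optional[str] = None) -> Dict[str, str]:
    """Return a fresh copy of the default headers, optionally with a Referer."""
    headers = dict(DEFAULT_HEADERS)
    if referer:
        headers['Referer'] = referer
    return headers


headers = browser_headers(referer='https://www.bilibili.com/')
for name, value in headers.items():
    print(f'{name}: {value}')
```

The result can be passed straight to requests.get(url, headers=headers).
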

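Network timeouts

If a page does not respond within a set time, the request is treated as timed out and the page cannot be loaded. The timeout parameter of requests.get() sets that limit in seconds.
For example: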

import requests

url = 'http://www.baidu.com'
# Send the request 50 times
for a in range(0, 50):
    try:
        # Adjust the timeout to suit your network speed
        response = requests.get(url, timeout=0.03)  # time out after 0.03 seconds
        print(response.status_code)
    except Exception as e:
        print('Exception: ' + str(e))  # print the exception message
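
In practice a timed-out request is usually retried rather than just logged. The following is a rough sketch of retrying with an increasing delay. The function fetch_with_retries and its parameters are hypothetical, and flaky_fetch is a stand-in that simulates two timeouts so the example runs without a network.

```python
import time


def fetch_with_retries(fetch, attempts=3, delay=0.5, backoff=2.0):
    """Call fetch() until it succeeds; re-raise the last error if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # narrow this to timeout errors in real code
            if attempt == attempts:
                raise
            print(f'attempt {attempt} failed ({exc}); retrying in {delay:.2f}s')
            time.sleep(delay)
            delay *= backoff  # wait longer before each further attempt


# Stand-in for a real request: fails twice, then succeeds.
calls = {'count': 0}


def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise TimeoutError('simulated timeout')
    return 200


status = fetch_with_retries(flaky_fetch, attempts=5, delay=0.01)
print(status, 'after', calls['count'], 'calls')
```

With requests, fetch could be something like lambda: requests.get(url, timeout=(3, 10)), where the tuple sets the connect and read timeouts separately, and the except clause could catch requests.exceptions.Timeout instead of Exception.
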


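Proxy servers

A page that could be crawled until recently may suddenly become unreachable and fail with the error "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond." Routing requests through a proxy IP can work around this.

For example: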

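import requests

# Set the proxy IPs (each proxy URL includes its scheme)
proxy = {'http': 'http://117.45.139.139:9006',
         'https': 'http://121.36.210.88:8080'
         }
# Send the request
url = 'https://www.baidu.com'
response = requests.get(url, proxies=proxy)
# Use response.text for text data
# and response.content for images and other files
# response.content is bytes; decode() converts it to a str before printing
print(response.content.decode())
# response.text is already a str
print(response.text)  # the encoding comes from the response headers (ISO-8859-1 for text/* without a charset) or, failing that, is guessed from the body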
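
Free proxies tend to stop working quickly, so crawlers often keep a small pool and rotate through it. The sketch below shows one way to do that. The helper names build_proxies and next_proxies are hypothetical, the addresses are the placeholders from the example above and are not expected to be live, and sending HTTPS traffic through an http:// proxy URL is an assumption about how those proxies work.

```python
from itertools import cycle


def build_proxies(address, scheme='http'):
    """Turn 'host:port' into the proxies dict that requests expects."""
    proxy_url = f'{scheme}://{address}'
    return {'http': proxy_url, 'https': proxy_url}


# Placeholder addresses; replace them with proxies you are allowed to use.
PROXY_POOL = ['117.45.139.139:9006', '121.36.210.88:8080']
rotation = cycle(PROXY_POOL)


def next_proxies():
    """Return the proxies dict for the next address in the pool."""
    return build_proxies(next(rotation))


print(next_proxies())
print(next_proxies())
```

Each call to requests.get(url, proxies=next_proxies(), timeout=5) would then go out through the next proxy in the pool.
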

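The Beautiful Soup module

Beautiful Soup is a Python library for extracting data from HTML and XML files. It automatically converts incoming documents to Unicode and outgoing documents to UTF-8, so you normally do not have to think about encodings. The exception is a document that does not declare an encoding and whose encoding Beautiful Soup cannot detect; in that case you only need to state the original encoding.

For example: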

from bs4 import BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>
"""
# Create the BeautifulSoup object
soup = BeautifulSoup(html_doc, features='lxml')
# Or create it from an HTML file to parse
# soup = BeautifulSoup(open('index.html'), features='lxml')
print('Source:', soup)  # print the parsed HTML

The output is:

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
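
Printing the parsed document is only the first step; the point of Beautiful Soup is pulling specific pieces out of it. The sketch below extracts the title and the links from a shortened copy of the same sample document. It uses the built-in html.parser instead of lxml so that no extra parser is required, which is a substitution made for this example.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Text of the <title> tag.
title = soup.title.string

# Every <a> tag with class "sister", as (text, href) pairs.
links = [(a.get_text(), a['href']) for a in soup.find_all('a', class_='sister')]

# A single tag looked up by its id attribute.
second = soup.find(id='link2')

print(title)
for text, href in links:
    print(text, href)
print(second['href'])
```
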

Fetching the title of the Baidu News homepage with Beautiful Soup

from bs4 import BeautifulSoup
import requests


response = requests.get('http://news.baidu.com')
soup = BeautifulSoup(response.text, features='lxml')
print(soup.find('title').text)

The output is:

百度新闻——海量中文资讯平台
