Web Scraping Basics: Request Tools

招财财猫

Last modified: 2022-11-04 20:50:46

First published: 2022-10-29 10:29:28

Article link: https://blog.csdn.net/m0_70592782/article/details/127578789

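This article is an introduction to web scraping with Python. It focuses on fetching HTML documents with the requests library, urllib.request, and urllib3, then parsing the data with BeautifulSoup together with lxml or XPath, and saving the extracted data to files. It also covers handling JSON responses and worked examples, including a multithreaded fetching strategy that avoids being blocked by the target site.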

I. HTML Document Responses

Step 1: Fetch the data. Send an HTTP request to download the page, using the requests library, urllib.request, or urllib3.

Method 1: fetching data with the requests library

import requests

# Send a GET request and decode the body as UTF-8 before printing it
response = requests.get('https://www.baidu.com/')
response.encoding = 'utf-8'
print(response.text)

# If the response is incomplete, send browser-like request headers

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

response = requests.get('https://www.baidu.com/', headers=headers)
response.encoding = 'utf-8'
print(response.text)
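
The request above has no timeout and never checks the HTTP status, so a stalled connection or an error page can go unnoticed. The following is a minimal sketch of one way to wrap it, not the author's method: the helper name `fetch_html`, the 10-second timeout, and the optional `session` parameter are assumptions made for illustration.

```python
import requests

# Browser-like header copied from the example above; adjust as needed.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
    )
}


def fetch_html(url, timeout=10, session=None):
    """Return the page text; raise on network errors or HTTP 4xx/5xx."""
    # A caller may pass its own session (hypothetical parameter) to reuse connections.
    http = session if session is not None else requests.Session()
    response = http.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # turns 4xx/5xx responses into exceptions
    response.encoding = "utf-8"  # same assumption as the example above
    return response.text
```

Calling `fetch_html('https://www.baidu.com/')` should return the same text as the snippet above, while failing loudly when the server answers with an error status.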

Method 2: fetching data with the urllib.request library

import urllib.request

# Open the URL, read the raw bytes, and decode them as UTF-8
response = urllib.request.urlopen('https://www.baidu.com/')
info = response.read().decode('utf-8')
print(info)

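# If the returned content is incomplete, add request headers

import urllib.request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

# Wrap the URL and headers in a Request object, then open it
# (the result is named "response" rather than "re" so it cannot be confused with the re module)
request = urllib.request.Request('https://www.baidu.com/', headers=headers)
response = urllib.request.urlopen(request)

info = response.read().decode('utf-8')
print(info)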
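
Unlike requests, urllib.request does not build the query string for you, so search parameters have to be encoded by hand. Here is a small sketch of how that might look; the helper names `build_request` and `fetch`, the short `Mozilla/5.0` User-Agent, and the 10-second timeout are placeholders chosen for illustration.

```python
import urllib.parse
import urllib.request


def build_request(base_url, params, user_agent):
    """Build a GET request with a percent-encoded query string and a User-Agent."""
    query = urllib.parse.urlencode(params)  # non-ASCII values are encoded as UTF-8
    return urllib.request.Request(
        f"{base_url}?{query}", headers={"User-Agent": user_agent}
    )


def fetch(request, timeout=10):
    """Send the request and return the body decoded as UTF-8."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the request does not touch the network, so it is safe to inspect.
request = build_request("https://www.baidu.com/s", {"wd": "学习"}, "Mozilla/5.0")
print(request.full_url)  # https://www.baidu.com/s?wd=%E5%AD%A6%E4%B9%A0
```

Passing `request` to `fetch` then performs the actual download. A real crawl would likely need the full browser User-Agent string shown above rather than the short placeholder.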

Method 3: fetching data with urllib3

import urllib3

# PoolManager handles connection pooling; response.data holds the raw bytes
http = urllib3.PoolManager()
response = http.request('GET', 'https://www.baidu.com/')
print(response.data.decode('utf-8'))
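
A PoolManager can also carry default headers, timeouts, and a retry policy, so every request made through it shares the same settings. The sketch below shows one possible configuration; the specific numbers, the short `Mozilla/5.0` User-Agent, and the helper name `fetch` are assumptions for illustration, not values from the original post.

```python
import urllib3

# Assumed values for illustration: tune timeouts and retries for your target site.
http = urllib3.PoolManager(
    headers={"User-Agent": "Mozilla/5.0"},  # placeholder; use a full browser string
    timeout=urllib3.Timeout(connect=5.0, read=10.0),
    retries=urllib3.Retry(total=3, backoff_factor=0.5),
)


def fetch(url, fields=None):
    """GET a page through the shared pool and return the body decoded as UTF-8."""
    response = http.request("GET", url, fields=fields)
    if response.status != 200:
        raise RuntimeError(f"unexpected HTTP status {response.status} for {url}")
    return response.data.decode("utf-8")
```

Note that passing `headers=` to an individual `http.request(...)` call replaces the pool's default headers for that call instead of merging with them.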

Example:

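# Response for a Baidu search for '学习' ("study"). Without headers, Baidu returns
# its security-verification page, so the headers are required.

import urllib3

http = urllib3.PoolManager()
url = 'https://www.baidu.com/s'
headers = {
    # Both values are truncated in the original post; paste the full values from your own browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Cookie': 'BIDUPSID=39EB78AC324289C741888E2487E09AEA; PSTM=1666156248...',
}
# For GET requests, urllib3 encodes fields into the URL query string
response = http.request('GET', url, fields={'wd': '学习'}, headers=headers)
result = response.data.decode('UTF-8')
print(result)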
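
The snippets in this section all decode the body with a hard-coded 'utf-8', which raises an error or produces garbled text on pages served in another encoding such as GBK. One possible way to handle this, sketched here with only the standard library, is to read the charset from the Content-Type response header and fall back to UTF-8. The helper names `charset_from_content_type` and `decode_body` are hypothetical.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(data, content_type=None):
    """Decode raw bytes using the declared charset, falling back to UTF-8."""
    charset = charset_from_content_type(content_type)
    try:
        return data.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: decode as UTF-8 and replace undecodable bytes
        return data.decode("utf-8", errors="replace")
```

With urllib3 this could be called as `decode_body(response.data, response.headers.get('Content-Type'))`. When the server declares no charset, the helper behaves like the plain `.decode('utf-8')` calls above, except that undecodable bytes are replaced instead of raising an error.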