1.1 Data Requests: the urllib Library

TUJC

Category: Python crawlers

Article link: https://blog.csdn.net/TU_JCN/article/details/98501813

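Python's built-in urllib library provides HTTP requests, exception handling, URL parsing and more. The request module includes urlopen and Request and supports basic reads, setting headers, handling Chinese characters and building URLs. The error module handles exceptions, the parse module provides URL utilities, and the robotparser module reads robots.txt files so that a crawler can follow the robots protocol.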


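The urllib library
urllib is Python's built-in HTTP request library; no extra installation is needed. It has four modules:

1. request: the basic HTTP request module, used to simulate sending requests.
2. error: the exception handling module. Catch its exceptions to retry or take other action so the program does not stop unexpectedly.
3. parse: a utility module with many URL helpers, such as splitting, parsing and joining.
4. robotparser: reads a site's robots.txt file and decides which pages may and may not be crawled. It is rarely used in practice.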

I. The request module (quick reference)

1. urlopen

(1) Basic read: str_data = urllib.request.urlopen(url).read().decode("utf-8")

(2) Escaping Chinese characters: encode_new_url = urllib.parse.quote(final_url, safe=string.printable)

(3) Building the query string: str_params = urllib.parse.urlencode(params)

2. Request

(1) Basic read: urllib.request.urlopen(urllib.request.Request(url))

(2) Request headers: request.add_header(key, value) adds a header dynamically

(3) Rotating several User-Agents: random.choice(user_agent_list)

3. Handlers

(1) Password authentication: HTTPBasicAuthHandler

(2) IP proxy: ProxyHandler

I. The request module (details)

urlopen, Request, handlers

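1. urlopen

urllib.request.urlopen(url, data=None, [timeout, ]*,
                       cafile=None, capath=None,
                       cadefault=False, context=None)

data: must be bytes. If data is given, the request method changes from GET to POST.

timeout: the timeout in seconds. If omitted, the global default timeout is used. It applies to HTTP, HTTPS and FTP connections.

context: must be an ssl.SSLContext instance; it specifies the SSL settings.

cafile and capath: a CA certificate file and a directory of CA certificates; useful for HTTPS requests.

cadefault: deprecated and ignored; defaults to False.

Note: cafile, capath and cadefault were removed in Python 3.13; pass context instead.

Official documentation: https://docs.python.org/3/library/urllib.request.html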

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

(1) Basic read: str_data = urllib.request.urlopen(url).read().decode("utf-8")

Three steps: 1. urlopen 2. read 3. decode

import urllib.request

def load_data():
    url = "http://www.baidu.com/"

    # 1. Send a GET request; response is the HTTP response object
    response = urllib.request.urlopen(url)

    # 2. Read the body (bytes)
    data = response.read()

    # 3. Decode the bytes into a string
    str_data = data.decode("utf-8")

    # 4. Write the data to a file
    with open("baidu.html", "w", encoding="utf-8") as f:
        f.write(str_data)

    # Convert a string to bytes
    str_name = "baidu"
    bytes_name = str_name.encode("utf-8")
    print(bytes_name)

load_data()

# Data fetched in Python is either str or bytes:
# if you fetched bytes but need to write a string, call decode("utf-8");
# if you have a str but need to write bytes, call encode("utf-8").
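The str/bytes rule above can be tried without any network access. The sketch below is only an illustration: raw_body is a made-up payload standing in for response.read(), and to_text / to_bytes are hypothetical helper names, not part of urllib.

```python
# Hypothetical payload standing in for the bytes returned by response.read()
raw_body = "<title>百度一下</title>".encode("utf-8")


def to_text(data, encoding="utf-8"):
    """Return data as str, decoding only when it is bytes."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def to_bytes(data, encoding="utf-8"):
    """Return data as bytes, encoding only when it is str."""
    if isinstance(data, str):
        return data.encode(encoding)
    return data


text = to_text(raw_body)      # bytes -> str, ready for open(..., "w")
round_trip = to_bytes(text)   # str -> bytes, ready for open(..., "wb")
print(type(raw_body).__name__, type(text).__name__, type(round_trip).__name__)
```

Opening the output file in "wb" mode and writing the bytes directly avoids the decode step altogether when the text itself is not needed.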

(2) Escaping Chinese characters: encode_new_url = urllib.parse.quote(final_url, safe=string.printable)

import urllib.request
import urllib.parse
import string

def get_method_params():

    url = "http://www.baidu.com/s?wd="
    # Append a Chinese search term to the URL.
    # The form that can actually be sent looks like:
    # https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3

    name = "美女"
    final_url = url + name
    print(final_url)

    # The URL contains Chinese characters, which are not ASCII,
    # so percent-encode it before sending the request
    encode_new_url = urllib.parse.quote(final_url, safe=string.printable)
    print(encode_new_url)

    # Send the request
    response = urllib.request.urlopen(encode_new_url)
    print(response)

    # Read the body
    data = response.read().decode()
    print(data)

    # Save locally
    with open("02-encode.html", "w", encoding="utf-8") as f:
        f.write(data)

    # Without quote(), urlopen(final_url) fails with:
    # UnicodeEncodeError: 'ascii' codec can't encode
    # characters in position 10-11: ordinal not in range(128)
    # The HTTP request line is encoded as ASCII (0-127),
    # so raw Chinese characters cannot be sent.

get_method_params()
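To see what quote() does without sending a request, the following sketch encodes the same search term offline. It also shows why safe=string.printable is passed when quoting a whole URL: with the default safe="/", the ":" and "?" of the URL itself would be escaped too. The variable names are only illustrative.

```python
import string
import urllib.parse

keyword = "美女"

# Encoding just the keyword
encoded_keyword = urllib.parse.quote(keyword)
print(encoded_keyword)  # %E7%BE%8E%E5%A5%B3

# Encoding a whole URL: keep every printable ASCII character as-is
full_url = "http://www.baidu.com/s?wd=" + keyword
encoded_url = urllib.parse.quote(full_url, safe=string.printable)
print(encoded_url)

# With the default safe="/", URL punctuation is escaped as well
over_encoded = urllib.parse.quote(full_url)
print(over_encoded)

# unquote() reverses the encoding
decoded_keyword = urllib.parse.unquote(encoded_keyword)
```

An alternative that avoids the safe argument is to quote only the keyword and append it to the ASCII part of the URL.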

(3) Building the query string: str_params = urllib.parse.urlencode(params)

Encoded result: wd=%E4%B8%AD%E6%96%87&key=zhang&value=san

params = {
    "wd": "中文",
    "key": "zhang",
    "value": "san"
}
str_params = urllib.parse.urlencode(params)

import urllib.request
import urllib.parse
import string

def get_params():
    url = "http://www.baidu.com/s?"

    params = {
        "wd":"中文",
        "key":"zhang",
        "value":"san"
    }
    str_params = urllib.parse.urlencode(params)
    print(str_params)
    final_url = url + str_params

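    # Percent-encode any non-ASCII characters left in the URL
    # (urlencode has already escaped the values, so this is only a safeguard)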
    end_url = urllib.parse.quote(final_url,safe=string.printable)

    response = urllib.request.urlopen(end_url)

    data = response.read().decode("utf-8")
    print(data)

get_params()
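The query-string step can be checked offline as well. This is a small sketch using the same params dict; parse_qs is used only to show that the encoding is reversible, and the tag example with doseq=True is an added illustration that is not in the original article.

```python
import urllib.parse

params = {"wd": "中文", "key": "zhang", "value": "san"}

# dict -> "key=value&key=value", with non-ASCII values percent-encoded
query = urllib.parse.urlencode(params)
final_url = "http://www.baidu.com/s?" + query
print(final_url)

# Parse the query string back into a dict (each value becomes a list)
parsed_back = urllib.parse.parse_qs(urllib.parse.urlsplit(final_url).query)
print(parsed_back)

# A list value needs doseq=True to become repeated keys
multi = urllib.parse.urlencode({"tag": ["a", "b"]}, doseq=True)
print(multi)
```

Because urlencode already produces pure ASCII, a URL built this way can be passed to urlopen directly.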

2. urllib.request.Request

urllib.request.Request(url, data=None, headers={},
                       origin_req_host=None, unverifiable=False, method=None)

data: must be bytes (a byte stream). If you have a dict, encode it first with urlencode() from urllib.parse and then call encode().

headers: a dict of request headers. You can pass it when constructing the Request, or add headers later with the add_header() method of the Request instance. The most common use is changing User-Agent to imitate a browser; the default User-Agent is Python-urllib.

origin_req_host: the host name or IP address of the requester.

unverifiable: whether the request is unverifiable; defaults to False.

method: a string naming the HTTP method, such as GET, POST or PUT.
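A Request object can be built and inspected without sending anything, which makes the effect of data and method easy to see. The sketch below assumes made-up form fields and example.com URLs; no network request is made.

```python
import urllib.parse
import urllib.request

# Hypothetical form fields; a dict must be urlencoded and then turned into bytes
form = {"name": "zhang", "city": "中文"}
body = urllib.parse.urlencode(form).encode("utf-8")

get_request = urllib.request.Request("http://www.example.com/get")
post_request = urllib.request.Request("http://www.example.com/post", data=body)
put_request = urllib.request.Request("http://www.example.com/put", data=body, method="PUT")

# get_method() reports the HTTP method that urlopen would use
for req in (get_request, post_request, put_request):
    print(req.get_method(), req.full_url)
```

Without data the method is GET, with data it becomes POST, and an explicit method argument overrides both.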

(1) Basic read: urllib.request.urlopen(urllib.request.Request(url))

Two steps: 1. request object: Request(url) 2. response object: urlopen(request)

import urllib.request

request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

(2) Request headers: request.add_header(key, value) adds a header dynamically

Steps: 0. write the headers 1. create the request object 2. request.add_header(key, value) 3. urlopen(request)


import urllib.request

def load_baidu():
    url = "https://www.baidu.com"

    # 0. Write the headers
    header = {
        # Browser version
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
        # "haha": "hehe"
    }

    # 1. Create the request object
    request = urllib.request.Request(url)

    # 2. Add header information dynamically
    request.add_header("User-Agent", header["User-Agent"])

    # 3. Send the request (urlopen itself has no headers parameter,
    #    so headers must be set on the Request object)
    response = urllib.request.urlopen(request)
    print(response)
    data = response.read().decode("utf-8")

    # Save
    with open("02header.html", "w", encoding="utf-8") as f:
        f.write(data)

    # Get the full URL
    final_url = request.get_full_url()
    print(final_url)

    # Show the response headers
    print(response.headers)

    # Show the request headers
    # Option 1: all headers
    request_headers = request.headers
    print(request_headers)

    # Option 2: a single header
    request_headers = request.get_header("User-agent")
    print(request_headers)
    # Note: only the first letter is capitalized; the rest are lowercase

load_baidu()

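User-Agent notes:
(1) Imitate a real browser when sending requests. To get a User-Agent string: 1) search Baidu for a "User-Agent list"; 2) copy one from the browser's Inspect Element panel
(2) request.add_header(key, value) adds a header dynamically
(3) Response headers: response.headers
(4) Create the request: urllib.request.Request(url)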
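The capitalization rule for get_header() is easy to trip over, so here is a small offline check. add_header() stores the name with only its first letter capitalized, which is why the lookup key has to be "User-agent". The header value is a shortened placeholder.

```python
import urllib.request

request = urllib.request.Request("https://www.baidu.com")
request.add_header("User-Agent", "Mozilla/5.0 (example)")  # placeholder value

# The header name is stored as "User-agent"
stored_names = list(request.headers)
print(stored_names)

found = request.get_header("User-agent")    # matches the stored name
missed = request.get_header("User-Agent")   # does not match, returns None
print(found, missed)

# has_header() follows the same rule
print(request.has_header("User-agent"))
```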

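(3) Rotating several User-Agents: random.choice(user_agent_list)

Steps: 0. create the User-Agent list 1. pick one at random with random.choice(user_agent_list)

2. create the request object 3. request.add_header(key, value) 4. urlopen(request)

A single browser version is easily recognized as a crawler, so rotate several User-Agents (search Baidu for a "User-Agent list").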

import urllib.request
import random

def load_baidu():

    url = "http://www.baidu.com"
    # 0. List of User-Agent strings
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50"
    ]
    # 1. Pick one at random, so each request can look like a different browser
    random_user_agent = random.choice(user_agent_list)

    # 2. Create the request object
    request = urllib.request.Request(url)

    # 3. Add the chosen User-Agent header
    request.add_header("User-Agent", random_user_agent)

    # 4. Send the request
    response = urllib.request.urlopen(request)
    # Print the User-Agent that was sent
    print(request.get_header("User-agent"))

load_baidu()
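If several requests are sent, it can help to wrap the random choice in a small function so every request gets a fresh User-Agent. The sketch below is one possible shape; build_request and its rng parameter are hypothetical names, and nothing is sent over the network.

```python
import random
import urllib.request

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0",
    "Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50",
]


def build_request(url, user_agents=USER_AGENTS, rng=random):
    """Return a Request whose User-Agent is picked at random from user_agents."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", rng.choice(user_agents))
    return request


# Build a few requests; each one may carry a different User-Agent
requests_to_send = [build_request("http://www.baidu.com") for _ in range(5)]
for req in requests_to_send:
    print(req.get_header("User-agent"))
```

Passing a seeded random.Random instance as rng makes the choice repeatable, which is convenient when debugging.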

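3. Handlers

urlopen has no parameter for adding a proxy, so we have to build that ourselves: create a custom opener and use it to request data.

1. Create a handler: handler = urllib.request.HTTPHandler()

2. Create an opener: opener = urllib.request.build_opener(handler)

3. Call the opener's open method to request data: response = opener.open(url)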

import urllib.request

def handler_openner():

    url = "https://blog.csdn.net/m0_37499059/article/details/79003731"

    # 1. Create a handler
    handler = urllib.request.HTTPHandler()

    # 2. Create an opener
    opener = urllib.request.build_opener(handler)

    # 3. Use the opener's open method to request data
    response = opener.open(url)
    data = response.read().decode("utf-8")

    with open("02header.html", "w", encoding="utf-8") as f:
        f.write(data)

handler_openner()

(1) Password authentication: HTTPBasicAuthHandler

Create the authentication handler:
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, username, password)
auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

import urllib.request
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:500/'

# 1. Create a handler that carries the credentials
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, username, password)
auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

# 2. Create the opener
opener = urllib.request.build_opener(auth_handler)

try:
    # 3. opener.open(url)
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
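To get a feel for what the password manager and Basic authentication do, the lookup and the header value can be reproduced offline. This is a sketch with the same placeholder credentials; the realm string and the extra URLs are made up, and the Authorization value is built by hand here only to show its format.

```python
import base64
import urllib.request

username, password = "username", "password"
url = "http://localhost:500/"

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Realm None means "use these credentials for any realm at this URL"
password_mgr.add_password(None, url, username, password)

# Credentials are found for pages under the registered URL...
found = password_mgr.find_user_password("any realm", url + "page")
# ...but not for an unrelated site
other = password_mgr.find_user_password("any realm", "http://www.example.com/")
print(found, other)

# Basic authentication sends base64("user:password") in the Authorization header
token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
authorization = "Basic " + token
print(authorization)
```

Base64 is an encoding, not encryption, so Basic authentication should only be used over HTTPS.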

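(2) IP proxy: ProxyHandler

Create the proxy handler:

proxy = {...}

proxy_handler = urllib.request.ProxyHandler(proxy)

IP proxies:
(1) Free proxies: short-lived, with a high failure rate
(2) Paid proxies: cost money, and can also expire or stop working

Proxy types:
Transparent: the server knows our real IP
Anonymous: the server does not know our real IP, but knows that we are using a proxy
High anonymity: the server knows neither our real IP nor that we are using a proxy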

import urllib.request

def create_proxy_handler():
    url = "https://blog.csdn.net/m0_37499059/article/details/79003731"

    # 1. Add the proxy
    proxy = {
        # Free proxy format:  "http":""
        # "http":"120.77.249.46:8080"

        # Paid proxy:
        # "http":"xiaoming":123@115.
    }

    # 2. Create the proxy handler
    proxy_handler = urllib.request.ProxyHandler(proxy)

    # 3. Create our own opener
    opener = urllib.request.build_opener(proxy_handler)

    # 4. Send the request through the proxy
    response = opener.open(url)
    data = response.read().decode("utf-8")

    # 5. Save
    with open("03header.html", "w", encoding="utf-8") as f:
        f.write(data)

create_proxy_handler()
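One detail worth checking is which URL schemes a proxy dict actually covers. A ProxyHandler only intercepts the schemes that appear as keys, so a dict with just "http" leaves https:// URLs unproxied. The sketch below inspects the handler offline; the address is the one from the commented-out example above and is almost certainly no longer valid.

```python
import urllib.request

# Only the "http" scheme is mapped to a proxy here
proxy = {"http": "120.77.249.46:8080"}
proxy_handler = urllib.request.ProxyHandler(proxy)

# The handler gets one <scheme>_open method per key in the dict
handles_http = hasattr(proxy_handler, "http_open")
handles_https = hasattr(proxy_handler, "https_open")
print(handles_http, handles_https)

# An empty dict means "use no proxy at all";
# ProxyHandler() with no argument reads the proxy settings of the environment
no_proxy_handler = urllib.request.ProxyHandler({})
print(no_proxy_handler.proxies)
```

To proxy an https:// URL such as the CSDN page above, the dict would also need an "https" entry.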

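Paid proxies (two approaches)

Approach 1:

Create the proxy handler directly with ProxyHandler, putting the user name and password in the proxy address

Approach 2:

First create a password manager, HTTPPasswordMgrWithDefaultRealm()

Then create the proxy authentication handler, ProxyBasicAuthHandler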

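import urllib.request

def money_proxy_use():

    # Approach 1:
    # 0. Proxy address with credentials
    # money_proxy = {"http": "username:pwd@192.168.12.11:8080"}
    #
    # 1. Create the proxy handler
    # proxy_handler = urllib.request.ProxyHandler(money_proxy)
    #
    # 2. Create an opener from the handler
    # opener = urllib.request.build_opener(proxy_handler)
    #
    # 3. Send the request
    # opener.open("http://www.baidu.com")

    # Approach 2:
    # 0. Proxy address and credentials
    use_name = "abcname"
    pwd = "123456"
    proxy_money = "123.158.63.130:8888"

    # 1. Create a password manager and add the user name and password
    password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_manager.add_password(None, proxy_money, use_name, pwd)

    # 2. Create the handler that authenticates against the proxy
    handle_auth_proxy = urllib.request.ProxyBasicAuthHandler(password_manager)

    # 3. Create the opener. A ProxyHandler is added as well, because
    #    ProxyBasicAuthHandler only answers the proxy's 407 challenge;
    #    it does not route requests through the proxy by itself
    proxy_handler = urllib.request.ProxyHandler({"http": proxy_money})
    opener_auth = urllib.request.build_opener(proxy_handler, handle_auth_proxy)

    # 4. Send the request
    response = opener_auth.open("http://www.baidu.com")
    print(response.read())

    # Typical use: crawling your own company's data for analysis
    # admin

money_proxy_use()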

Iterating over a list of proxies

import urllib.request

def proxy_user():

    proxy_list = [
        {"https":""},
        # {"https":"106.75.226.36:808"},
        # {"https":"61.135.217.7:80"},
        # {"https":"125.70.13.77:8080"},
        # {"https":"118.190.95.35:9001"}
    ]
    for proxy in proxy_list:
        print(proxy)
        # Create a handler from the current proxy
        proxy_handler = urllib.request.ProxyHandler(proxy)
        # Create the opener
        opener = urllib.request.build_opener(proxy_handler)

        try:
            data = opener.open("http://www.baidu.com",timeout=1)

            haha = data.read()
            print(haha)
        except Exception as e:
            print(e)


proxy_user()

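(3) Cookies

Logging in with cookies

Method 1: extract the cookie by hand

Log in to the site in a browser, open the member-center page, copy the cookies from the request captured in the browser, and put them in the request headers of the Request object.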

import urllib.request

# 1. Target URL
url = 'https://www.yaozh.com/member/'

# 2. Add the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
    'Cookie': '_ga=GA1.2.1820447474.1535025127; MEIQIA_EXTRA_TRACK_ID=199Tty9OyANCXtHaSobJs67FU7J; UtzD_f52b_ulastactivity=1511944816%7C0; WAF_SESSION_ID=7d88ae0fc48bffa022729657cf09807d; PHPSESSID=7jsc60esmb6krgthnj99dfq7r3; _gid=GA1.2.358950482.1540209934; _gat=1; MEIQIA_VISIT_ID=1BviNX3zYEKVS7bQVpTRHOTFV8M; yaozh_logintime=1540209949; yaozh_user=381740%09xiaomaoera12; yaozh_userId=381740; db_w_auth=368675%09xiaomaoera12; UtzD_f52b_saltkey=CfYyYFY2; UtzD_f52b_lastvisit=1540206351; UtzD_f52b_lastact=1540209951%09uc.php%09; UtzD_f52b_auth=2e13RFf%2F3R%2BNjohcx%2BuoLcVRx%2FhF0NvwUbslgSZX%2FOUMkCRRcgh5Ayg6RGnklcG3d2DkUFAXJxjhlIS8fPvr9rrwa%2FY; yaozh_uidhas=1; yaozh_mylogin=1540209953; MEIQIA_EXTRA_TRACK_ID=199Tty9OyANCXtHaSobJs67FU7J; WAF_SESSION_ID=7d88ae0fc48bffa022729657cf09807d; Hm_lvt_65968db3ac154c3089d7f9a4cbb98c94=1535025126%2C1535283389%2C1535283401%2C1539351081%2C1539512967%2C1540209934; MEIQIA_VISIT_ID=1BviNX3zYEKVS7bQVpTRHOTFV8M; Hm_lpvt_65968db3ac154c3089d7f9a4cbb98c94=1540209958'
}

# 3. Build the request object
request = urllib.request.Request(url, headers=headers)

# 4. Send the request
response = urllib.request.urlopen(request)

# 5. Read the data
data = response.read()
print(type(data))

# Save to a file to verify the data
with open('01cook.html', 'wb') as f:
    f.write(data)

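Method 2: obtain the cookie automatically (two steps)

Log in from code first and let a CookieJar store the cookies automatically, then request the member center with those cookies.

1. Log in from code. If the login succeeds, the CookieJar stores the cookies automatically. Create a cookie handler and build an opener from it.

2. Use the same opener, which now carries the cookies, to visit the member center.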

import urllib.request
from http import cookiejar
from urllib import parse


# 1. Log in from code
# If the login succeeds, the CookieJar stores the cookies automatically

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'
}
# Login URL
login_url = 'https://www.yaozh.com/login'
# Login form fields
login_form_data = {
    "username": "xiaomaoera12",
    "pwd": "lina081012",
    "formhash": "CE3ADF28C5",
    "backurl": "https%3A%2F%2Fwww.yaozh.com%2F"
}


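# 1.1 Create the CookieJar that will hold the cookies
cook_jar = cookiejar.CookieJar()

# 1.2 Create the cookie handler
cook_handler = urllib.request.HTTPCookieProcessor(cook_jar)

# 1.3 Build an opener from the handler
opener = urllib.request.build_opener(cook_handler)

# 1.4 Send the POST request with the form data
#     (1) the form fields must be URL-encoded
#     (2) the data of a POST request must be bytes
login_str = parse.urlencode(login_form_data).encode('utf-8')
login_request = urllib.request.Request(login_url, headers=headers, data=login_str)

opener.open(login_request)


# 2. Visit the member center with the stored cookies
center_url = 'https://www.yaozh.com/member/'

center_request = urllib.request.Request(center_url, headers=headers)

# Pass the Request object so that the custom headers are sent too
response = opener.open(center_request)

data = response.read().decode()  # bytes --> str

with open('02cook.html', 'w', encoding='utf-8') as f:
    f.write(data)


# If one account keeps logging in from different places (IPs in Fujian,
# Shanghai, Hangzhou, Henan) and from different browsers, the site treats it
# as automated and bans the account, so crawlers often rotate N accounts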
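What the CookieJar does during those two steps can be simulated without a real site. HTTPCookieProcessor calls extract_cookies() on each response and add_cookie_header() on each outgoing request; the sketch below calls the same two methods by hand. FakeResponse, the example.com URLs and the sessionid cookie are all made up for the illustration.

```python
import email.message
import urllib.request
from http.cookiejar import CookieJar


class FakeResponse:
    """Stand-in for an HTTP response; CookieJar only needs info()."""

    def __init__(self, set_cookie_values):
        self._headers = email.message.Message()
        for value in set_cookie_values:
            # Assigning to a Message appends a header instead of replacing it
            self._headers["Set-Cookie"] = value

    def info(self):
        return self._headers


cookie_jar = CookieJar()

# Step 1: the "login" response sets a session cookie, which the jar stores
login_request = urllib.request.Request("https://www.example.com/login")
login_response = FakeResponse(["sessionid=abc123; Path=/"])
cookie_jar.extract_cookies(login_response, login_request)

# Step 2: a later request to the same site gets the cookie attached
member_request = urllib.request.Request("https://www.example.com/member/")
cookie_jar.add_cookie_header(member_request)

stored = {cookie.name: cookie.value for cookie in cookie_jar}
sent = member_request.get_header("Cookie")
print(stored, sent)
```

This is why the same opener has to be reused for the second request: the cookies live in the jar attached to that opener.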

II. The error module

1. URLError


from urllib import request, error

try:
    response = request.urlopen('https://abc.com')
except error.URLError as e:
    print(e.reason)

2. HTTPError


from urllib import request, error

try:
    response = request.urlopen('https://abc.com')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')


# Errors raised by urllib.request: HTTPError and URLError
"""
     raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
    
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

"""

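import urllib.request
import urllib.error

# A page that is expected to raise HTTPError (for example 404)
url = 'https://blog.csdn.net/zjsxxzh/article/details/110'

# A domain that is expected to raise URLError; this overrides the URL above
url = 'https://affdsfsfsdfd.cn'

try:
    response = urllib.request.urlopen(url)

# HTTPError is a subclass of URLError, so catch it first
except urllib.error.HTTPError as error:
    print(error.code)

except urllib.error.URLError as error:
    print(error)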
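The relationship between the two exceptions can be checked offline by constructing them by hand. The sketch below is only an illustration: describe_error is a hypothetical helper, and the exception objects are built manually rather than raised by a real request.

```python
from urllib.error import HTTPError, URLError


def describe_error(exc):
    """Return a short description, testing the more specific class first."""
    if isinstance(exc, HTTPError):
        return f"HTTP error {exc.code}: {exc.reason}"
    if isinstance(exc, URLError):
        return f"URL error: {exc.reason}"
    return f"other error: {exc}"


# Hand-built examples mirroring the two tracebacks shown earlier
http_error = HTTPError("https://www.example.com/missing", 404, "Not Found", None, None)
url_error = URLError("nodename nor servname provided, or not known")

print(issubclass(HTTPError, URLError))
print(describe_error(http_error))
print(describe_error(url_error))
```

Because HTTPError inherits from URLError, an except URLError clause placed first would also swallow HTTP status errors.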

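III. The parse module

1. urlparse(): parses a URL and splits it into 6 parts

2. urlunparse(): builds a URL; the argument must have length 6

3. urlsplit(): splits a URL into 5 parts

4. urlunsplit(): builds a URL; the argument must have length 5

5. urljoin(): joins a base URL with a relative URL

6. urlencode(): converts key/value pairs, turning key: value into key=value

7. quote(): percent-encodes Chinese text for use in a URL

8. unquote(): decodes percent-encoded text back into Chinese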
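The functions listed above can be tried on a single sample URL. This is a small sketch; the URL is made up to contain all six parts.

```python
from urllib.parse import (
    unquote, urljoin, urlparse, urlsplit, urlunparse, urlunsplit,
)

url = "https://www.baidu.com/index.html;user?id=5#comment"

# urlparse: 6 parts (scheme, netloc, path, params, query, fragment)
parsed = urlparse(url)
print(parsed)

# urlsplit: 5 parts; params stay attached to the path
split = urlsplit(url)
print(split)

# urlunparse / urlunsplit rebuild the URL from 6 or 5 parts
rebuilt = urlunparse(parsed)
rebuilt_from_split = urlunsplit(split)

# urljoin resolves a relative link against a base URL
joined = urljoin("https://www.baidu.com/a/b.html", "c.html")
print(joined)

# unquote decodes percent-encoded text
decoded = unquote("wd=%E4%B8%AD%E6%96%87")
print(decoded)
```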

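IV. The robotparser module

1. The Robots protocol

Also called the crawler protocol or robots exclusion protocol. It tells crawlers which pages of a site may be crawled and which may not.

Sample robots.txt: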

User-agent: *

Disallow: /

Allow: /public/

2. Crawler names

BaiduSpider

Googlebot

360Spider

YodaoBot

ia_archiver

Scooter

3. robotparser
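The standard-library parser can be fed robots.txt lines directly, so the sample rules can be tested without fetching anything. One thing to be aware of: as far as the standard library's behavior goes, RobotFileParser applies the first rule that matches, so with "Disallow: /" listed before "Allow: /public/" the public path is still refused. The sketch below compares both orders; make_parser and the example.com URLs are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser


def make_parser(robots_txt):
    """Build a parser from robots.txt text instead of downloading it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Rules in the same order as the sample robots.txt
sample_order = "User-agent: *\nDisallow: /\nAllow: /public/\n"
# Same rules with the Allow line first
allow_first = "User-agent: *\nAllow: /public/\nDisallow: /\n"

in_sample_order = make_parser(sample_order)
with_allow_first = make_parser(allow_first)

public_url = "http://www.example.com/public/index.html"
private_url = "http://www.example.com/private/index.html"

for name, parser in (("sample order", in_sample_order), ("allow first", with_allow_first)):
    print(name, parser.can_fetch("*", public_url), parser.can_fetch("*", private_url))
```

For a live site, set_url() followed by read() downloads and parses the real robots.txt before can_fetch() is called.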
