urllib Basics

Latest recommended article published 2023-05-19 20:13:41

YangJZ_ByteMaster

Article link: https://blog.csdn.net/qq_44537267/article/details/106975163

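urlretrieve(url, local_file_path) downloads a page straight to a local file.
info() returns the response's header information.
getcode() returns the HTTP status code of the response.
geturl() returns the URL that was actually retrieved.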
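
The four calls above can be tried without touching a real website. The sketch below is only an illustration: it assumes a throwaway local HTTP server (the `DemoHandler` class, the page content and the file name `page.html` are all made up for the demo) and then runs urlretrieve, info(), getcode() and geturl() against it.

```python
import http.server
import os
import tempfile
import threading
import urllib.request

PAGE = b"<html><body>hello urllib</body></html>"


class DemoHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with the same small page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 asks the operating system for any free port.
server = http.server.HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/index.html" % server.server_address[1]

# Demo-only assumption: ignore any system proxy so the local server is reached directly.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

# urlretrieve(url, filename) writes the response body to a local file.
local_path = os.path.join(tempfile.mkdtemp(), "page.html")
urllib.request.urlretrieve(url, local_path)
with open(local_path, "rb") as f:
    saved = f.read()

# The response object returned by urlopen offers info(), getcode() and geturl().
with urllib.request.urlopen(url, timeout=5) as resp:
    headers = resp.info()        # response headers
    status = resp.getcode()      # HTTP status code
    final_url = resp.geturl()    # URL that was actually retrieved

server.shutdown()
server.server_close()
print(status, final_url, headers.get("Content-Type"), len(saved))
```

Against a real site, only the `url` value would change; the local server exists purely so the example runs offline.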

urllib is Python's built-in HTTP request library. It includes the following modules:

urllib.request: sends requests
urllib.error: exception handling
urllib.parse: URL parsing
urllib.robotparser: robots.txt parsing
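
To make the division of labour between the four modules concrete, here is a small offline sketch that touches each one once. The bot name `demo-bot`, the `example.com` URLs and the two robots.txt lines are invented for illustration; nothing is sent over the network.

```python
import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

# urllib.parse: split a URL into its parts and decode the query string.
parts = urllib.parse.urlparse("http://www.baidu.com/s?wd=python&pn=10")
query = urllib.parse.parse_qs(parts.query)

# urllib.request: build a request object (it is not sent here).
req = urllib.request.Request("http://www.baidu.com/s?wd=python")
method = req.get_method()  # GET, because no data was attached

# urllib.error: HTTPError is a subclass of URLError, so catching
# URLError also covers HTTP status errors.
http_error_is_url_error = issubclass(
    urllib.error.HTTPError, urllib.error.URLError
)

# urllib.robotparser: evaluate robots.txt rules given as a list of lines.
rules = urllib.robotparser.RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])
allowed = rules.can_fetch("demo-bot", "http://example.com/index.html")
blocked = rules.can_fetch("demo-bot", "http://example.com/private/data.html")

print(parts.netloc, query, method, http_error_is_url_error, allowed, blocked)
```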

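urlopen

The parameters of urllib.request.urlopen:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

(cafile, capath and cadefault have been deprecated since Python 3.6 in favor of context.)

urlopen returns a response object. Calling read() on it returns the page content as bytes, and decode() then turns those bytes into a string of HTML.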
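
The read-then-decode step can be checked without a network connection. As a stand-in for a real page, the sketch below feeds urlopen a `data:` URL (an assumption made only to keep the example offline; the HTML string is made up) and shows that read() yields bytes while decode() yields a str.

```python
import urllib.parse
import urllib.request

# Made-up page content, packed into a data: URL so no server is needed.
html = "<html><body>你好, urllib</body></html>"
url = "data:text/html;charset=utf-8," + urllib.parse.quote(html)

with urllib.request.urlopen(url) as resp:
    raw = resp.read()          # bytes, exactly as received

text = raw.decode("utf-8")     # str, ready for searching or parsing

print(type(raw).__name__, type(text).__name__)
print(text)
```

With a real http:// address the two steps are the same; the encoding passed to decode() has to match the one the page actually uses.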

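Using the timeout parameter

When the network is poor or the server misbehaves, a request can be very slow or fail outright. Setting a timeout on the request keeps the program from waiting indefinitely for a result. Add exception handling where needed, as in this example: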

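import urllib.request

for i in range(0, 100):
    try:
        # Open the URL inside the try block so that a timeout is caught too
        fileread = urllib.request.urlopen('http://xxxxxx', timeout=1)
        print(len(fileread.read().decode("utf-8")))
    except Exception as err:
        print("An exception occurred: " + str(err))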
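
To watch a timeout actually fire, a server that never answers is needed. The sketch below fakes one with a local listening socket that accepts connections but never replies; the socket, the helper name `fetch_with_timeout` and the retry count are assumptions for the demo, not part of the original example.

```python
import socket
import urllib.request

# Stand-in for a hung server: connections queue up in the listen
# backlog, but nothing ever sends a response.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(5)
url = "http://127.0.0.1:%d/" % listener.getsockname()[1]

# Demo-only assumption: bypass any system proxy for the local address.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_with_timeout(target, timeout, attempts):
    """Try to fetch target a few times; return (body or None, error list)."""
    errors = []
    for attempt in range(1, attempts + 1):
        try:
            with opener.open(target, timeout=timeout) as resp:
                return resp.read(), errors
        except OSError as err:
            # URLError and socket.timeout are both OSError subclasses.
            errors.append("attempt %d failed: %r" % (attempt, err))
    return None, errors


body, errors = fetch_with_timeout(url, timeout=0.5, attempts=2)
listener.close()

for line in errors:
    print(line)
print("body:", body)
```

Catching OSError is narrower than the bare Exception used above, yet it still covers both a timeout while connecting and a timeout while waiting for the response.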

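GET request: automating a Baidu search

This code may be out of date. Baidu's page markup has probably changed since it was written, so the patterns may no longer extract anything.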

import re
import urllib.parse
import urllib.request

keywd = "python"
keywd = urllib.parse.quote(keywd)  # percent-encode the keyword; required when searching for Chinese text
pat = re.compile("title:'(.*?)'")
pat2 = re.compile('title:"(.*?)"')  # some titles use single quotes, others double quotes
for i in range(1, 11):  # crawl result pages 1 to 10
    url = "http://www.baidu.com/s?wd=" + keywd + "&pn=" + str((i - 1) * 10)  # pn is the result offset; Baidu adds 10 for each next page
    d = urllib.request.urlopen(url).read().decode("utf-8")
    rst1 = pat.findall(d)
    rst2 = pat2.findall(d)
    for title in rst1:
        print(title)
    for title in rst2:  # the original printed rst1[z] here by mistake
        print(title)
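
Since the live page may no longer match, the two moving parts of this script (building the paged URLs and pulling titles out with a regular expression) can be checked offline. In the sketch below, the helper names `build_search_urls` and `extract_titles` and the `sample` text are made up, and the single pattern with a backreference is a variant of the two patterns above rather than the original approach.

```python
import re
import urllib.parse


def build_search_urls(keyword, pages, base="http://www.baidu.com/s"):
    """Return one search URL per result page; pn advances by 10 per page."""
    urls = []
    for page in range(1, pages + 1):
        # urlencode percent-encodes the keyword, including Chinese text.
        query = urllib.parse.urlencode({"wd": keyword, "pn": (page - 1) * 10})
        urls.append(base + "?" + query)
    return urls


# One pattern for both quote styles: \1 must match the opening quote.
TITLE_PATTERN = re.compile(r"""title:(['"])(.*?)\1""")


def extract_titles(page_text):
    """Return every title found in page_text, in order of appearance."""
    return [m.group(2) for m in TITLE_PATTERN.finditer(page_text)]


# Made-up text imitating the two title formats the script looks for.
sample = "{title:'First result'} ... {title:\"Second result\"}"

urls = build_search_urls("爬虫", 2)
titles = extract_titles(sample)
print(urls)
print(titles)
```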

POST request example

import urllib.request
import urllib.parse

posturl = "https://www.iqianyue.com/mypost"
postdata = urllib.parse.urlencode({
    "name": "djfasdklf",  # string values must be quoted, otherwise Python raises: xxx is not defined
    "pass": "jdskfajd",
}).encode("utf-8")
# To send a POST, use Request from urllib.request: Request(real POST URL, POST data)
# The test URL is https://www.iqianyue.com/mypost
req = urllib.request.Request(posturl, postdata)
rst = urllib.request.urlopen(req).read().decode("utf-8")
print(rst)
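
If the test site is unreachable, the same POST flow can be exercised against a local echo server. Everything server-side in this sketch (the `EchoHandler` class, the port, the field values `demo_user` and `demo_pass`) is a made-up stand-in; the client side follows the same urlencode, encode, Request, open sequence as above.

```python
import http.server
import threading
import urllib.parse
import urllib.request


class EchoHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler: sends the POST body straight back."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = http.server.HTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
posturl = "http://127.0.0.1:%d/mypost" % server.server_address[1]

# Form fields are URL-encoded and then converted to bytes.
postdata = urllib.parse.urlencode({
    "name": "demo_user",
    "pass": "demo_pass",
}).encode("utf-8")

# Attaching data to a Request turns it into a POST.
req = urllib.request.Request(posturl, data=postdata)
method = req.get_method()

# Demo-only assumption: bypass any system proxy for the local address.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(req, timeout=5) as resp:
    echoed = urllib.parse.parse_qs(resp.read().decode("utf-8"))

server.shutdown()
server.server_close()
print(method, echoed)
```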
