Python Study Notes, Part 8 (urllib)

xuanjat

Article link: https://blog.csdn.net/xuanjat/article/details/96445487

2019-07-18 08:50:59, Thursday

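1. Introduction to Web Crawlers

Lesson Outline

Preliminary notes
Homework review
About study methods
urllib basics
Timeout settings
Automatically simulating HTTP requests

Publisher Crawler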

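# Publisher crawler: collect publisher names from the Douban Read provider page
import re
import urllib.request

data = urllib.request.urlopen("http://read.douban.com/provider/all").read()
data = data.decode("utf-8")
# Capture the text inside every <div class="name"> element
pat = '<div class="name">(.*?)</div>'
mydata = re.compile(pat).findall(data)
mydata  # in an interactive session this displays the matches
# Write one name per line; the with-block closes the file automatically
with open("F:/urllib.txt", "w", encoding="utf-8") as fh:
    for name in mydata:
        fh.write(name + "\n")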
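
The extraction step above can be tried without network access. The following is a minimal sketch that applies the same pattern to a small hand-written HTML fragment. The fragment and the names in it are made up for illustration and are not taken from the Douban page.

```python
import re

# Hypothetical HTML fragment standing in for the downloaded page
sample_html = (
    '<div class="name">Example Press</div>'
    '<div class="intro">not a name</div>'
    '<div class="name">Sample Books</div>'
)

# Same pattern as in the crawler: non-greedy capture of the div's text
pat = '<div class="name">(.*?)</div>'
names = re.compile(pat).findall(sample_html)
print(names)
```

The non-greedy `(.*?)` stops at the first closing `</div>`, so each match stays inside a single element.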

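urllib Basics

To study the urllib module systematically, we start with the basics. This section walks through urlretrieve(), urlcleanup(), info(), getcode() (the status code of the current page), and geturl() (the URL of the current page).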

# Save a web page to a local file
import urllib.request
urllib.request.urlretrieve("http://www.hellobi.com", filename="F:/urllib2.html")

('F:/urllib2.html', <http.client.HTTPMessage at 0x2c9bc883080>)

getcode() status codes

200: the page loaded normally
403: access forbidden

# Clear temporary files left behind by earlier urlretrieve() calls
urllib.request.urlcleanup()
file = urllib.request.urlopen("http://www.hellobi.com")
file.info()  # the response headers

<http.client.HTTPMessage at 0x2c9bc883710>
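
The notes mention getcode() and geturl() but only show info(). The sketch below calls all three on one response. To stay runnable offline it starts a throwaway HTTP server on 127.0.0.1; the `HelloHandler` class and the page it serves are assumptions made for this example, not part of the original lesson. The request goes through an opener built without proxies, which returns the same kind of response object as urlopen().

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny HTML page."""

    def do_GET(self):
        page = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(page)))
        self.end_headers()
        self.wfile.write(page)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 lets the operating system pick a free port
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

# An opener with no proxies, so the local request is sent directly
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
resp = opener.open(url)
status = resp.getcode()                     # HTTP status code
final_url = resp.geturl()                   # URL that was actually retrieved
content_type = resp.info()["Content-Type"]  # info() returns the headers
body = resp.read()
resp.close()
server.shutdown()
server.server_close()

print(status, final_url, content_type, len(body))
```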

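Timeout Settings

Fetching a web page always takes some time, depending on network speed and on the remote server. If a page does not respond for a long time, the system decides that the request has timed out, meaning the page could not be opened.
Sometimes we want to choose the timeout value ourselves. For a site that normally responds quickly, we may want to give up after 2 seconds without a response, so timeout is 2. For a site whose server is slow, we may want to wait 100 seconds before giving up, so timeout is 100. The example below shows how to set the timeout when crawling.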

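import urllib.request

# A single request with a 10-second timeout
file = urllib.request.urlopen("http://www.baidu.com", timeout=10)
# Request the same page 100 times with a 1-second timeout
for i in range(0, 100):
    try:
        file = urllib.request.urlopen("http://yum.iqianyue.com", timeout=1)
        data = file.read()
        print(len(data))
    except Exception as e:
        print("Exception occurred: " + str(e))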

[Screenshot of the results]
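
The timeout behavior can also be reproduced locally. This is a rough sketch under one assumption: a listening socket that never replies stands in for a slow server. The socket fixture is made up for the example; only the `timeout` argument comes from the lesson.

```python
import socket
import urllib.request

# A listening socket that never answers: the connection is accepted by the
# operating system, but no response is ever sent (made-up test fixture)
silent = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
silent.bind(("127.0.0.1", 0))
silent.listen(1)
url = "http://127.0.0.1:%d/" % silent.getsockname()[1]

# No proxies, so the request goes straight to the local socket
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

timed_out = False
try:
    opener.open(url, timeout=0.5)
except OSError as e:
    # Both urllib.error.URLError and socket.timeout derive from OSError
    timed_out = True
    print("Exception occurred: " + str(e))
finally:
    silent.close()
```

Catching OSError here covers both the case where urllib wraps the failure in URLError and the case where the socket timeout surfaces directly.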

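Automatically Simulating HTTP Requests

A client communicates with a server through HTTP requests. There are many request methods; here we cover two of them, POST and GET. They are used, for example, when logging in or searching for information.

Simulating a GET request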

# Simulate a GET request
import urllib.request
keywd = "Python"
url = "http://www.baidu.com/s?wd=" + keywd
req = urllib.request.Request(url)  # wrap the URL in a Request object
data = urllib.request.urlopen(req).read()
with open("F:/PL/Pythonsearch.html", "wb") as fh:
    fh.write(data)
len(data)

456008

# Simulate a GET request with a Chinese search keyword
import urllib.parse
import urllib.request
keywd = "南大鳥"
keywd = urllib.parse.quote(keywd)  # percent-encode the non-ASCII keyword
url = "http://www.baidu.com/s?wd=" + keywd + "&ie=utf-8"
req = urllib.request.Request(url)  # wrap the URL in a Request object
data = urllib.request.urlopen(req).read()
with open("F:/PL/PythonsearchChinese.html", "wb") as fh:
    fh.write(data)
len(data)

294958
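
The encoding step can be checked on its own, without sending anything. A small sketch, assuming the same keyword and base URL as above: quote() encodes a single value, while urlencode() builds the whole query string from a dict.

```python
import urllib.parse

keywd = "南大鳥"

# quote() percent-encodes the UTF-8 bytes of the keyword
encoded = urllib.parse.quote(keywd)
url = "http://www.baidu.com/s?wd=" + encoded + "&ie=utf-8"

# urlencode() builds "key=value&key=value" from a dict in one step
query = urllib.parse.urlencode({"wd": keywd, "ie": "utf-8"})
url2 = "http://www.baidu.com/s?" + query

print(url)
print(url2)
```

Both approaches give the same URL here. urlencode() tends to be less error-prone once there are several parameters.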

Simulating a POST request


<html>
<head>
<title>Post Test Page</title>
</head>

<body>
<form action="" method="post">
name:<input name="name" type="text" /><br>
passwd:<input name="pass" type="text" /><br>
<input name="" type="submit" value="submit" />
<br />
</form>
</body>
</html>

import urllib.parse
import urllib.request

url = "http://www.iqianyue.com/mypost/"
# Encode the form fields, then convert them to bytes for the request body
mydata = urllib.parse.urlencode({"name": "ceo@iqianyue.com",
                                 "pass": "1235jkds"}).encode("utf-8")
req = urllib.request.Request(url, mydata)  # passing data makes this a POST request
data = urllib.request.urlopen(req).read()
with open("F:/PL/postname.html", "wb") as fh:
    fh.write(data)
print(len(data))
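
To see what the Request object holds before it is sent, it can be inspected directly. This sketch reuses the URL and form fields from the example above and sends nothing over the network.

```python
import urllib.parse
import urllib.request

url = "http://www.iqianyue.com/mypost/"
form = {"name": "ceo@iqianyue.com", "pass": "1235jkds"}

# urlencode() produces "key=value&key=value"; the body must be bytes
mydata = urllib.parse.urlencode(form).encode("utf-8")
req = urllib.request.Request(url, mydata)

# Nothing is sent here; we only inspect the prepared request
method = req.get_method()  # "POST" because data is not None
decoded = urllib.parse.parse_qs(req.data.decode("utf-8"))
print(method, decoded)
```

Without the data argument, the same Request would report GET as its method.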

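2. Crawler Exception Handling

2019-07-18 15:40:38, Thursday

Lesson Outline

Overview of exception handling
Common status codes and their meanings
URLError and HTTPError
Exception handling in practice

Overview of Exception Handling

A running crawler frequently runs into exceptions of one kind or another. Without exception handling, the crawler crashes and stops as soon as one occurs, and the next run has to start over from the beginning. To build a crawler that keeps running, exception handling is essential.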

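Common Status Codes and Their Meanings

301 Moved Permanently: redirected to a new URL, permanently
302 Found: redirected to a temporary URL, not permanent
304 Not Modified: the requested resource has not changed
400 Bad Request: invalid request
401 Unauthorized: the request is not authorized
403 Forbidden: access is forbidden
404 Not Found: the requested page was not found
500 Internal Server Error: an error occurred inside the server
501 Not Implemented: the server does not support the functionality needed to fulfil the request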
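URLError and HTTPError

Both are exception classes, and HTTPError is a subclass of URLError. HTTPError carries both a status code and a reason, whereas URLError has a reason but no status code. For that reason URLError cannot simply be used in place of HTTPError when handling errors. If you do catch URLError for both cases, check whether the exception has a status code attribute before reading it.

Example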

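# Crawler exception handling
# Causes of URLError:
# 1. Cannot connect to the server
# 2. The remote URL does not exist
# 3. No local network connection
# 4. The HTTPE... subclass was triggered
import urllib.error
import urllib.request
try:
    urllib.request.urlopen("http://www.bilibili.com/")
    print("copy success")
except urllib.error.URLError as e:
    # HTTPError carries a status code; a plain URLError does not
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)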

403
Forbidden
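
An alternative to the hasattr() checks is to test for the more specific class first. The sketch below is one possible way to do that. The `describe` helper is a made-up name, and the two exceptions are built by hand so the code runs without network access.

```python
import io
import urllib.error
from email.message import Message


def describe(e):
    """Summarize an exception raised by urllib (hypothetical helper)."""
    if isinstance(e, urllib.error.HTTPError):
        # HTTPError: the server answered, so a status code is available
        return "HTTP error %d: %s" % (e.code, e.reason)
    if isinstance(e, urllib.error.URLError):
        # Plain URLError: no response arrived, so only a reason exists
        return "URL error: %s" % e.reason
    return "other error: %s" % e


# Exceptions constructed by hand purely for demonstration
http_err = urllib.error.HTTPError(
    "http://example.com/", 403, "Forbidden", Message(), io.BytesIO())
url_err = urllib.error.URLError("no network")

print(describe(http_err))
print(describe(url_err))
http_err.close()
```

Because HTTPError is a subclass of URLError, the HTTPError test has to come first. In a try statement the same rule applies to the order of the except clauses.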

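Browser Disguise Techniques for Crawlers

Lesson Outline

How browser disguise works
Browser disguise in practice

How Browser Disguise Works

Try crawling the CSDN blog and you will find that it returns 403, because the server blocks crawlers. To fetch the page, the crawler has to disguise itself as a browser.
Browser disguise is usually done through the request headers.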

# Make the crawler look like a browser
import urllib.request
url = "https://blog.csdn.net/"
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]  # replaces the opener's default header list
data = opener.open(url).read()
with open("F:/PL/browser1.html", "wb") as fh:
    fh.write(data)
print("copy finished")

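News Crawler

Browser Disguise in Practice

Because urlopen() does not support some advanced HTTP features, we can use urllib.request.build_opener() when the request headers need to be modified. Alternatively, add_header() on a urllib.request.Request() object can also be used to simulate a browser.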
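
The add_header() route might look like the following. This is a sketch that only prepares the request and inspects it; the URL and User-Agent string are the ones used in the opener example, and nothing is sent.

```python
import urllib.request

url = "https://blog.csdn.net/"
user_agent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36")

req = urllib.request.Request(url)
req.add_header("User-Agent", user_agent)

# Request stores header names capitalized, hence "User-agent" in the lookup
stored = req.get_header("User-agent")
print(stored == user_agent)

# Sending it would then be: data = urllib.request.urlopen(req).read()
```

Unlike the opener approach, the header here belongs to one Request object, so other requests are unaffected.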

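# News crawler (the notes call it a Sina News crawler; the URL used here is taptap.com):
# fetch the front page, then download every linked page to a local file
import re
import urllib.error
import urllib.request

data = urllib.request.urlopen("http://www.taptap.com/").read()
data2 = data.decode("utf-8", "ignore")
# Capture the target of every link that points back to the site
pat = 'href="(http://www.taptap.com/.*?)">'
allurl = re.compile(pat).findall(data2)
print("every thing is ok")
for i in range(0, len(allurl)):
    try:
        print("Fetch #" + str(i))
        thisurl = allurl[i]
        file = "F:/PL/sinanews/" + str(i) + ".html"
        urllib.request.urlretrieve(thisurl, file)
        print("-----(❤ ω ❤) download succeeded-------")
    except urllib.error.URLError as e:
        # Keep going after a failed download instead of stopping the crawl
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
print("news copy is ok")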
