Python Basics and Web Crawlers (3) - CSDN Blog

1. urllib basics. To learn the urllib module systematically, we start with the basics: urlretrieve(), urlcleanup(), info(), getcode(), geturl(), and so on.
urlretrieve() downloads a web page straight to a local file.

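import urllib.request
# urlretrieve() saves the page to the given file and returns a (filename, headers) tuple
data1=urllib.request.urlretrieve("http://www.baidu.com",filename="F:/python_video/1.html")
# urlcleanup(): clean up temporary files left behind by urlretrieve()
urllib.request.urlcleanup()
# info(): return the response headers (metadata about the page)
file=urllib.request.urlopen("http://www.baidu.com")
print(file.info())
# getcode(): return the HTTP status code
print(file.getcode())
# geturl(): return the URL of the page that was actually fetched
print(file.geturl())
# Timeout setting
# Servers can be slow to respond. A timeout (in seconds) bounds how long we wait, which helps filter out dead URLs.
file=urllib.request.urlopen("http://www.baidu.com",timeout=1)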
for i in range(0,100):
    try:
        file=urllib.request.urlopen("http://www.baidu.com",timeout=1)
        data=file.read()
        print(len(data))
    except Exception as e:
        print("Exception occurred: "+str(e))
# Choose the timeout according to how fast the server responds.
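# Simulating HTTP requests automatically (POST and GET)
# In a URL, the part after "?" carries the GET parameters: the field name comes before "=", the field value after it.
# For example, in www.baidu.com/s?wd=python&dsdfsd, putting a search keyword after www.baidu.com/s?wd= is enough to run a search; the rest can be ignored.
# For a GET request, just build the URL in this format.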
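import urllib.request
import urllib.parse
keywd="python"
keywd=urllib.parse.quote(keywd)# percent-encode the keyword; non-ASCII text such as Chinese causes an error if left unencoded
url="http://www.baidu.com/s?wd="+keywd
req=urllib.request.Request(url)# build the request
data=urllib.request.urlopen(req).read()
fh=open("F:/python_video/2.html","wb")# read() returns bytes, so write in binary mode
fh.write(data)
fh.close()
# The author's advice here is to crawl with http rather than https, since https is the more secure scheme. (urllib itself can open https URLs too.)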

# Using POST to submit a username and password automatically.
# First inspect the page to locate the login and password fields and find their name attributes.
import urllib.request
import urllib.parse
url="http://www.iqianyue.com/mypost"
mydata=urllib.parse.urlencode({"name":"email or username","pass":"password"}).encode("utf-8")
req=urllib.request.Request(url,mydata)# passing data makes this a POST request
data=urllib.request.urlopen(req).read()
fh=open("F:/python_video/3.html","wb")
fh.write(data)
fh.close()
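The GET and POST snippets above both rely on URL encoding. As a minimal sketch that runs without a network connection, the two helpers below (build_search_url and build_post_body are hypothetical names, not part of urllib) show what the encoded query string and POST body look like. The field values "user" and "secret" are placeholders.

```python
import urllib.parse

def build_search_url(base, params):
    """Return base + '?' + the URL-encoded query string built from params."""
    return base + "?" + urllib.parse.urlencode(params)

def build_post_body(fields):
    """Encode form fields as bytes, the form urllib.request.Request expects for data."""
    return urllib.parse.urlencode(fields).encode("utf-8")

# A keyword containing a space and Chinese characters gets percent-encoded.
url = build_search_url("http://www.baidu.com/s", {"wd": "python 爬虫"})
body = build_post_body({"name": "user", "pass": "secret"})
print(url)
print(body)
```

Letting urlencode() build the whole query string avoids encoding each value by hand with quote().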

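2. Exception handling in crawlers
A crawler runs into all kinds of exceptions. An unhandled exception crashes it, and the next run has to start crawling from scratch, so exception handling is essential.
Common status codes and their meanings
301 Moved Permanently: permanent redirect to a new URL
302 Found: temporary redirect to another URL
304 Not Modified: the requested resource has not changed
400 Bad Request: malformed request
401 Unauthorized: the request lacks authorization
403 Forbidden: access is forbidden
404 Not Found: no matching page was found
500 Internal Server Error: an error occurred inside the server
501 Not Implemented: the server does not support the functionality needed to fulfil the request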
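The table above can also be looked up from Python's standard library. The sketch below uses http.HTTPStatus to print the standard phrase for each code listed; describe_status is a hypothetical helper name introduced only for illustration.

```python
from http import HTTPStatus

def describe_status(code):
    """Return 'code phrase' for a known HTTP status code, or a fallback string."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define.
        return f"{code} (unknown status code)"
    return f"{status.value} {status.phrase}"

for code in (301, 302, 304, 400, 401, 403, 404, 500, 501):
    print(describe_status(code))
```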

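URLError and HTTPError
Both are exception classes, and HTTPError is a subclass of URLError. HTTPError carries both a status code and a reason; URLError carries a reason but no status code.
URLError is raised when:
1. the server cannot be reached
2. the remote server does not exist
3. there is no network connection
4. the HTTPError subclass is triggered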

import urllib.error
import urllib.request
try:
    urllib.request.urlopen("http://blog.csdn.net")
except urllib.error.URLError as e:
    if hasattr(e,"code"):# a status code is present only for HTTPError, so check first
        print(e.code)
    if hasattr(e,"reason"):
        print(e.reason)
# 403 Forbidden: the request was not disguised as a browser, so the page could not be crawled
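As an alternative to the hasattr() checks, the two exception types can be told apart with isinstance(). The sketch below builds the exceptions by hand so it runs without a network connection; describe_error is a hypothetical helper name, and the error messages are made-up sample values.

```python
import io
import urllib.error

def describe_error(err):
    """Summarize a URLError or HTTPError as a short string."""
    # Check HTTPError first because it is a subclass of URLError.
    if isinstance(err, urllib.error.HTTPError):
        return f"HTTP error {err.code}: {err.reason}"
    if isinstance(err, urllib.error.URLError):
        return f"URL error: {err.reason}"
    return f"other error: {err}"

# Hand-built sample exceptions (arguments: url, code, msg, hdrs, fp).
http_err = urllib.error.HTTPError("http://example.com", 403, "Forbidden", None, io.BytesIO())
url_err = urllib.error.URLError("no route to host")
print(describe_error(http_err))
print(describe_error(url_err))
```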

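3. Browser disguise
If we try to crawl a CSDN blog, the server returns 403 because it blocks crawlers. To crawl the page, we need to disguise the request as coming from a browser.
Below we use browser disguise to crawl the content of a CSDN page.
Open "http://blog.csdn.net/column.html", press F12, select the Network tab, and open a page. The request headers are listed under Headers; the key field is User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299.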

import urllib.request
url="http://blog.csdn.net/BULpreZHt1ImlN4N/article/details/81009409"
headers=("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36")
opener=urllib.request.build_opener()
opener.addheaders=[headers]
data=opener.open(url).read()
fh=open("F:/python_video/4.html","wb")
fh.write(data)
fh.close()
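Besides opener.addheaders, the User-Agent can be attached to a single request through the headers argument of urllib.request.Request. This is a minimal sketch of that alternative; it only builds the request and sends nothing. make_browser_request is a hypothetical helper name.

```python
import urllib.request

def make_browser_request(url, user_agent):
    """Build a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36")
req = make_browser_request("http://blog.csdn.net", ua)
# Request stores header names capitalized, so the key to look up is "User-agent".
print(req.get_header("User-agent"))
```

The request would then be sent with urllib.request.urlopen(req), just as in the GET example earlier.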

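4. News crawler: requirements and approach
Requirement: save every news article linked from the Sina News home page (http://news.sina.com.cn/) to local disk.
Approach: fetch the home page first, extract the news links with a regular expression, then fetch each article in turn and store it locally.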

import urllib.request
import urllib.error
import re
data=urllib.request.urlopen("http://news.sina.com.cn").read()
data2=data.decode("utf-8","ignore")
pat='<a target="_blank" href="(http://news.sina.com.cn/.*?)">'
allurl=re.compile(pat).findall(data2)
for i in range(0,len(allurl)):
    try:
        print("Fetch #"+str(i))
        thisurl=allurl[i]
        fill="F:/python_video/"+str(i)+".shtml"
        urllib.request.urlretrieve(thisurl,fill)
        print("----success----")
    except urllib.error.URLError as e:
        if hasattr(e,"code"):
            print(e.code)
        if hasattr(e,"reason"):
            print(e.reason)
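# Installing the opener globally
import urllib.request
import re
url=""# the page to crawl (left blank in the original)
headers=("User-Agent","")# the browser's User-Agent string (left blank in the original)
opener=urllib.request.build_opener()
opener.addheaders=[headers]
# Install the opener globally so that urlopen() sends the browser header too; without this step the 403 error would still appear.
urllib.request.install_opener(opener)
data=urllib.request.urlopen(url).read().decode("utf-8","ignore")
pat=""# regular expression for the links (left blank in the original)
result=re.compile(pat).findall(data)
for i in range(0,len(result)):
    file="F:/python_video/"+str(i)+".html"
    urllib.request.urlretrieve(result[i],filename=file)
    print("Fetch #"+str(i+1)+" succeeded")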
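The link-extraction step of the news crawler can be tried offline on a small piece of HTML. The sketch below uses the same pattern as the Sina example (with the dots escaped) and additionally drops duplicate links, which is an assumption added here and not part of the original crawler. The sample HTML and the helper name extract_links are made up for illustration.

```python
import re

LINK_PATTERN = re.compile(r'<a target="_blank" href="(http://news\.sina\.com\.cn/.*?)">')

def extract_links(html):
    """Return the matching news links in page order, without duplicates."""
    seen = set()
    links = []
    for link in LINK_PATTERN.findall(html):
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links

# Made-up sample page: two distinct news links, one repeat, one unrelated link.
sample_html = (
    '<a target="_blank" href="http://news.sina.com.cn/a/1.shtml">A</a>'
    '<a target="_blank" href="http://news.sina.com.cn/a/2.shtml">B</a>'
    '<a target="_blank" href="http://news.sina.com.cn/a/1.shtml">A again</a>'
    '<a href="http://example.com/other">ignored</a>'
)
print(extract_links(sample_html))
```

Removing duplicates before the download loop avoids fetching the same article more than once.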

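5. Proxy servers
A proxy server sits between us and the internet. When we browse through a proxy, our request goes to the proxy server first; the proxy fetches the information from the internet and passes it back to us. Crawling through a proxy helps keep our own IP from being blocked. With a dial-up (ADSL) connection, part of the IP address stays the same across reconnections, so the whole address range may end up blocked.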

import urllib.request
def use_proxy(url,proxy_addr):
    proxy=urllib.request.ProxyHandler({"http":proxy_addr})
    opener=urllib.request.build_opener(proxy,urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data=urllib.request.urlopen(url).read().decode("utf-8","ignore")
    return data
proxy_addr="119.183.220.224:8888"# 8888 is the port
# An IP pool can be stored as a list and its entries used in turn.
#proxy_addr=["119.183.220.224:8888",""]
url="http://www.baidu.com"
data=use_proxy(url,proxy_addr)
print(len(data))
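The IP-pool idea mentioned in the comments above could be sketched roughly as follows. ProxyPool and make_proxy_opener are hypothetical names, and the second address (127.0.0.1:8080) is a placeholder rather than a real proxy. The code only rotates addresses and builds an opener; it sends no request.

```python
import itertools
import urllib.request

def make_proxy_opener(proxy_addr):
    """Build an opener that routes http requests through proxy_addr ('host:port')."""
    handler = urllib.request.ProxyHandler({"http": proxy_addr})
    return urllib.request.build_opener(handler)

class ProxyPool:
    """Hand out proxy addresses in round-robin order."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("the proxy pool needs at least one address")
        self._cycle = itertools.cycle(addresses)

    def next_address(self):
        return next(self._cycle)

# The second entry is a placeholder address used only for illustration.
pool = ProxyPool(["119.183.220.224:8888", "127.0.0.1:8080"])
picked = [pool.next_address() for _ in range(3)]
print(picked)
opener = make_proxy_opener(picked[0])  # built only; no request is sent
```

Each page fetch would then call opener.open(url) with an opener built from pool.next_address(), so consecutive requests go out through different proxies.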