Python Data Analysis: Getting Started with Web Scraping, Part 2
1 Web Scraping Basics:
1.1 Practicing the urllib library:
First, this is based on Python 3, not Python 2; the two differ.
Second, Python has many libraries for fetching web pages; urllib is one of them.
Third, practice the urlopen function from the request module of the urllib library
from urllib.request import urlopen
response=urlopen("https://www.hao123.com")
print(response.read())
Note:
The Fiddler capture tool is recommended for visualizing the request/response exchange;
if Fiddler is running, write the address as https://www.hao123.com;
if it is not running, write http://www.hao123.com;
The Python code above prints:
b'<!DOCTYPE html><html><head><noscript><meta http-equiv="refresh" content="0; URL=\'/?
...
(many, many lines omitted)
...
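Note that read() returns bytes (hence the b'...' prefix above), not str. A minimal sketch of decoding, using a hard-coded bytes value in place of response.read() so no network is needed:

```python
# Stand-in for the bytes that response.read() returns.
raw = b'<!DOCTYPE html><html><head></head></html>'
html = raw.decode("utf-8")  # most pages are UTF-8; check the Content-Type header to be sure
print(type(raw).__name__)   # bytes
print(type(html).__name__)  # str
```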
1.2 Changing the User-Agent:
First, in Fiddler, double-click www.hao123.com in the left pane; on the right, the Raw view of the request shows: User-Agent: Python-urllib/3.6
Second, as covered in the previous part, the User-Agent (browser name) identifies the client; pages usually adapt to different clients and may even return different content.
Third, we therefore change the User-Agent, so the web server does not take us for a non-browser and block the request
from urllib.request import urlopen,Request
a=Request("http://www.hao123.com") # create a Request instance; the constructor needs the url
a.add_header("User-agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36")
# this step modifies the request header, disguising our PyCharm request as a Chrome browser request
response=urlopen(a)
with response: # context manager: closes the response automatically after it is read
    print(response.read())
Note: urlopen() also accepts a Request instance;
the Request constructor takes a url, and Request instances provide an add_header method for adding request headers;
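Besides add_header(), the Request constructor also accepts a headers dict directly. A small sketch (no request is actually sent, so it runs offline):

```python
from urllib.request import Request

ua = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36"
# Pass the headers when building the Request instead of calling add_header() afterwards.
req = Request("http://www.hao123.com", headers={"User-Agent": ua})
print(req.get_header("User-agent"))  # header names are stored capitalized: "User-agent"
```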
1.3 Inspecting the Response:
Check some attributes of the returned response: the status code, the url, and so on
from urllib.request import urlopen,Request
from http.client import HTTPResponse
url="http://www.bing.com"
a=Request(url)
a.add_header("User-agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36")
response=urlopen(a)
print(response.closed) # prints False
with response:
    print(type(response)) # prints: <class 'http.client.HTTPResponse'>
    print(response._method) # GET
    print(response.status) # prints: 200
    print(response.reason) # prints: OK
    print(response.geturl()) # prints: http://cn.bing.com/
    print(response.info()) # prints the response headers: Cache-Control: private, max-age=0, etc.
    print(response.read()) # b'<!DOCTYPE html><html lang="zh"><script ..., etc.
print(response.closed) # prints True
1.4 Multiple User-agents:
The User-agent sent to the server above (the browser we impersonate) was a single, fixed one (Chrome);
we can just as well pick one at random, impersonating IE, 360 Browser, and so on, and send a random choice to the server
import random
from urllib.request import urlopen,Request
from http.client import HTTPResponse
url="http://www.bing.com"
a=Request(url)
list_1=["Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6"
]
a.add_header("User-agent",random.choice(list_1)) # pick one at random
response=urlopen(a)
print(response.closed)
with response:
    print(type(response))
    print(response._method)
    print(response.status)
    print(response.reason)
    print(response.geturl())
    print(response.info())
    print(response.read())
    print(a.get_header("User-agent")) # shows which User-agent was picked
print(response.closed)
1.5 Practicing the parse module:
from urllib import parse
dict_1={"a":1,"b":2,"c":3}
a=parse.urlencode(dict_1)
print(a) # prints: a=1&b=2&c=3
dict_2={"a":1,"b":2,"url":"http://www.hao123.com"}
b=parse.urlencode(dict_2)
print(b) # prints: a=1&b=2&url=http%3A%2F%2Fwww.hao123.com
# ":" was encoded as %3A, and "/" as %2F
Practicing two functions from parse: urlencode and unquote
from urllib import parse
dict_1={"name":"科比",
"age":18,
"url":"http://www.hao123.com"}
a=parse.urlencode(dict_1)
print("encoded:",a)
b=parse.unquote(a)
print("decoded:",b)
# encoded: name=%E7%A7%91%E6%AF%94&age=18&url=http%3A%2F%2Fwww.hao123.com
# decoded: name=科比&age=18&url=http://www.hao123.com
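unquote() only undoes the percent-escapes; to get a dict back from a query string, parse.parse_qs also splits the key=value pairs (values come back as lists of strings):

```python
from urllib import parse

query = parse.urlencode({"name": "科比", "age": 18})
d = parse.parse_qs(query)  # decodes the escapes and splits the key=value pairs
print(d)  # {'name': ['科比'], 'age': ['18']}
```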
1.6 Assembling a Url:
from urllib import parse
dict_1={"wd":"中"}
a=parse.urlencode(dict_1)
b="http://www.baidu.com/s?{}".format(a)
print("encoded:",b)
c=parse.unquote(b)
print("decoded:",c)
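The reverse operation, taking a finished URL apart, is handled by parse.urlsplit:

```python
from urllib import parse

parts = parse.urlsplit("http://www.baidu.com/s?wd=%E4%B8%AD")
print(parts.scheme)  # http
print(parts.netloc)  # www.baidu.com
print(parts.path)    # /s
print(parts.query)   # wd=%E4%B8%AD
```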
1.7 Crawling the Bing site:
#https://cn.bing.com/search?q=%E5%95%A6%E5%95%A6%E5%95%A6
import random
from urllib.request import urlopen,Request
from urllib import parse
base_dir="https://cn.bing.com/search"
query_1={"q":"科比666"}
query_2=parse.urlencode(query_1)
address_1="{}?{}".format(base_dir,query_2)
list_1=["Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6"
]
user_agent=random.choice(list_1)
request_1=Request(address_1)
request_1.add_header("User-agent",user_agent)
response=urlopen(request_1)
with response:
    with open("e://biying.html","wb") as f:
        content=response.read()
        f.write(content)
        f.flush()
# flush() writes any buffered data out to the file immediately and empties the buffer, rather than passively waiting for the buffer to fill. Normally the buffer is flushed automatically when the file is closed, but flush() is there for when the data must be written out before closing.
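Since closing a file flushes it, the explicit flush() above matters only when the data must reach disk before the file closes. A small offline demonstration, writing to the system temp directory (the file name is just an example):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "demo.html")  # example file name
with open(path, "wb") as f:
    f.write(b"<html></html>")
# Leaving the with block closed the file, which flushed the buffer to disk.
with open(path, "rb") as f:
    print(f.read())  # b'<html></html>'
```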
1.8 Practicing the POST method
# uses the well-known testing site: http://httpbin.org
import random
from urllib.request import urlopen,Request
from urllib import parse
base_dir="http://httpbin.org/post"
data_1={"ok1":1,"ok2":2,"ok3":3}
data_2=parse.urlencode(data_1)
request_1=Request(base_dir)
list_1=["Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6"
]
request_1.add_header("User-Agent",random.choice(list_1))
response=urlopen(request_1,data=data_2.encode()) # passing a data argument makes urlopen issue a POST request
with response:
    response_1=response.read()
    print(response_1.decode())
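httpbin.org/post echoes the submitted form back as JSON, so the body can be parsed with the standard json module. A sketch on a trimmed, hard-coded sample of that response (an assumption about the payload's shape), so it runs without the network:

```python
import json

# Trimmed stand-in for the JSON text that httpbin.org/post returns.
sample = '{"form": {"ok1": "1", "ok2": "2", "ok3": "3"}, "url": "http://httpbin.org/post"}'
payload = json.loads(sample)
print(payload["form"])  # {'ok1': '1', 'ok2': '2', 'ok3': '3'}
```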