01 Three libraries: urllib, urllib2, and requests
urllib.urlopen()          # Python 2
urllib2.urlopen()         # Python 2
urllib.request.urlopen()  # Python 3
In summary: for now we can only use Python 3's urllib (the urllib.request module).
Examples:
# Method 1
import urllib.request

request_url = 'http://www.baidu.com'            # URL to request
response = urllib.request.urlopen(request_url)  # send the request
print(response.read().decode('utf-8'))          # print the response body, decoded as UTF-8
# Method 2: use Python to fetch a given page
import urllib.request

url = "http://www.baidu.com"
data = urllib.request.urlopen(url).read()
data = data.decode('UTF-8')
print(data)
# Download an image (note the raw string for the Windows path, so backslashes are not treated as escapes)
import urllib.request

urllib.request.urlretrieve('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1529243320258&di=0bedf00ad8d1f35a1b7538548a3679a2&imgtype=0&src=http%3A%2F%2Fh.hiphotos.baidu.com%2Fzhidao%2Fpic%2Fitem%2F95eef01f3a292df59fc681e5bc315c6034a8733c.jpg', filename=r'H:\Qt\1.png')
The requests module: once you have used it, you will not want to go back to urllib.
1. Installation: pip install requests
2. Send a network request
>>r = requests.get(url)
The method can also be post/put/delete/head/options.
3. Pass parameters in the URL
>>payload = {'key1': 'value1', 'key2': 'value2'}
>>r = requests.get(url, params=payload)
>>print(r.url)
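To see how requests encodes the params dict into the query string without actually sending anything over the network, you can build a PreparedRequest. This is a minimal sketch; the URL and payload values are placeholders:

```python
import requests

# Hypothetical URL and payload, just to illustrate query-string encoding
payload = {'key1': 'value1', 'key2': 'value2'}
req = requests.Request('GET', 'http://example.com/search', params=payload)
prepared = req.prepare()

# The parameters are URL-encoded and appended as a query string
print(prepared.url)  # http://example.com/search?key1=value1&key2=value2
```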
4. Response content
>>r = requests.get(url)
>>r.text
>>r.encoding
'utf-8'
>>r.encoding = 'ISO-8859-1'
5. Binary response content
>>r = requests.get(url)
>>r.content
6. Custom request headers
>>url = ''
>>headers = {'content-type': 'application/json'}
>>r = requests.get(url, headers=headers)
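The effect of the headers dict can also be inspected offline via a PreparedRequest; the endpoint here is a stand-in:

```python
import requests

# Hypothetical endpoint; no request is actually sent
headers = {'content-type': 'application/json'}
req = requests.Request('GET', 'http://example.com/api', headers=headers)
prepared = req.prepare()

# PreparedRequest.headers is a case-insensitive dict
print(prepared.headers['Content-Type'])  # application/json
```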
7. More complex POST requests
>>payload = {'key1': 'value1', 'key2': 'value2'}
>>r = requests.post(url, data=payload)
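A sketch of the difference between sending the payload as a form body (data=) and as a JSON body (json=), again using PreparedRequest so nothing hits the network; the URL is a placeholder:

```python
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# data= form-encodes the payload into the request body
form = requests.Request('POST', 'http://example.com/post', data=payload).prepare()
print(form.body)                     # key1=value1&key2=value2
print(form.headers['Content-Type'])  # application/x-www-form-urlencoded

# json= serializes the payload as a JSON body instead
js = requests.Request('POST', 'http://example.com/post', json=payload).prepare()
print(js.headers['Content-Type'])    # application/json
```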
8. Response status code
>>r = requests.get(url)
>>r.status_code
200
9. Response headers
>>r.headers
10. Cookies
>>r.cookies
>>r.cookies['example_cookie_name']
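r.cookies is a RequestsCookieJar, which supports dict-style access. A small offline sketch with a made-up cookie name and value:

```python
import requests

# Build a cookie jar locally, no network needed; the name and value are made up
jar = requests.cookies.RequestsCookieJar()
jar.set('session_id', 'abc123', domain='example.com', path='/')

# Response cookies support the same dict-style lookup
print(jar['session_id'])  # abc123
```

A jar like this can also be sent along with a request via requests.get(url, cookies=jar).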
11. Timeouts
>>requests.get(url,timeout=0.001)
12. Errors and exceptions
On a network problem (e.g. DNS lookup failure, connection refused), requests raises a ConnectionError exception.
On an invalid HTTP response, it raises an HTTPError.
On a request timeout, it raises a Timeout exception.
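All of these exceptions inherit from requests.exceptions.RequestException, so one except clause can catch anything requests raises. A sketch of the usual error-handling pattern (the URL passed to fetch would be supplied by the caller):

```python
import requests
from requests.exceptions import ConnectionError, HTTPError, Timeout, RequestException

def fetch(url):
    """Fetch a URL, handling each of requests' failure modes in one place."""
    try:
        r = requests.get(url, timeout=3)
        r.raise_for_status()        # raises HTTPError on 4xx/5xx responses
        return r.text
    except Timeout:
        print("request timed out")
    except ConnectionError:
        print("network problem (DNS failure, refused connection, ...)")
    except HTTPError as e:
        print("bad HTTP status:", e)
    except RequestException as e:   # catch-all base class for requests errors
        print("request failed:", e)

# The three exception types share a common base class:
print(issubclass(Timeout, RequestException))  # True
```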
02 Introduction to crawlers
Uses: extracting information from web pages / directory scanning / scanning for vulnerabilities
# coding=utf-8
import requests
import json

url = "https://www.ichunqiu.com/courses/qyaqll"
headers = {}
r = requests.get(url=url, headers=headers)
print(r.text)
data = json.loads(r.text)  # json.loads() parses a string; json.load() expects a file object
print(data)
03 Developing a crawler with Python
A for loop can be used.
# coding: utf-8
import urllib.request
# JSON parsing library (analogous to lxml)
import json
# JSONPath query syntax (analogous to XPath); third-party: pip install jsonpath
import jsonpath

url = "http://www.lagou.com/lbs/getAllCitySearchLabels.json"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0"}
request = urllib.request.Request(url, headers=header)
response = urllib.request.urlopen(request)
# Read the JSON response; the body comes back as a byte string
html = response.read()
# Convert the JSON string into a Python object
unicodestr = json.loads(html)
# A Python list of every "name" field
city_list = jsonpath.jsonpath(unicodestr, "$..name")
# Print each city
for i in city_list:
    print(i)
# dumps() escapes Chinese characters to ASCII by default (ensure_ascii defaults to True);
# disabling it returns a string with the characters intact
array = json.dumps(city_list, ensure_ascii=False)
# Write the result to the file lagouCity.json
with open("lagouCity.json", "w", encoding="utf-8") as f:
    f.write(array)
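The effect of ensure_ascii in the script above can be seen in isolation; the city names here are made-up sample data standing in for the scraped list:

```python
import json

# Made-up sample data standing in for the scraped city list
city_list = ["北京", "上海"]

# Default: non-ASCII characters are escaped to \uXXXX sequences
print(json.dumps(city_list))                      # ["\u5317\u4eac", "\u4e0a\u6d77"]

# ensure_ascii=False keeps the characters as-is
print(json.dumps(city_list, ensure_ascii=False))  # ["北京", "上海"]
```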