HTTP Protocol
-
HTTP: Hypertext Transfer Protocol, the method for publishing and receiving HTML pages. Port 80.
HTTPS: the encrypted version of HTTP, with SSL added. Port 443.
Request flow:
Type a URL and press Enter; the browser sends the request.
The server sends back a response.
The browser parses the response and sends further requests for images, CSS, JS, etc.
Once everything has downloaded, the page is displayed.
-
URL in detail: Uniform Resource Locator
-
Components (a sketch taking a URL apart follows this list):
scheme://host:port/path/?query-string=xx#anchor
scheme: protocol, e.g. http, https, ftp
host: hostname, which maps to an IP address
port: port number (the browser fills in the default automatically)
path: the path to a particular page
query-string=xx: query string; multiple parameters are joined with &
anchor: fragment anchor (used by the front end)
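As a quick check of the parts above, urllib.parse can take a URL apart (a minimal sketch; the URL here is made up for illustration):
from urllib.parse import urlparse

# made-up URL, chosen only to show every component
result = urlparse('https://www.example.com:443/path/?wd=python#section')
print(result.scheme)    # https
print(result.netloc)    # www.example.com:443  (host:port)
print(result.path)      # /path/
print(result.query)     # wd=python
print(result.fragment)  # section  (the anchor)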
-
Common request methods: GET (fetch a page), POST (log in / upload). HTTP defines 8 methods in total.
Anti-scraping note: some endpoints only respond to POST.
-
Common request headers:
A request carries three kinds of data: the URL, the body (for POST), and the headers.
Headers (a sketch follows this list):
User-Agent: disguise the client as a browser.
Referer: the URL this request came from; used for anti-scraping, since servers may refuse requests that do not supply it.
Cookie: HTTP is stateless, so cookies identify the client across requests.
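A minimal sketch sending these three headers with urllib; the header values are placeholders, and httpbin.org/headers simply echoes back what it receives:
from urllib import request

req = request.Request('http://httpbin.org/headers', headers={
    'User-Agent': 'Mozilla/5.0 ...',       # placeholder browser string
    'Referer': 'http://www.example.com/',  # placeholder source page
    'Cookie': 'sessionid=xxx',             # placeholder identifier
})
resp = request.urlopen(req)
print(resp.read().decode('utf-8'))  # the echoed headers, as JSON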
-
Common status codes (an error-handling sketch follows this list):
200: OK
301: permanent redirect
302: temporary redirect
400: bad request (the server cannot parse it)
404: the requested URL was not found on the server
403: forbidden, insufficient permissions
500: internal server error
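With urllib, a non-2xx status raises urllib.error.HTTPError; a small sketch using httpbin's status endpoint to trigger a 403:
from urllib import request, error

try:
    request.urlopen('http://httpbin.org/status/403')
except error.HTTPError as e:
    print(e.code)    # 403
    print(e.reason)  # Forbidden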
-
Chrome DevTools (request inspection)
Right-click, then Inspect:
Elements: the page markup; Console: the JS console; Sources: the files the page is built from; Network: the requests; Performance.
-
urllib library
Simulates browser behavior: sends requests and saves the returned data.
urlopen, in the urllib.request module:
from urllib import request
resp = request.urlopen('https://www.baidu.com')
print(resp.read())
Parameters: url; data=None (supplying data turns the request into a POST); timeout.
Return value: an http.client.HTTPResponse object.
Methods: read(size), readline, readlines, getcode (status code); see the sketch below.
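For example, reusing the request above:
from urllib import request

resp = request.urlopen('https://www.baidu.com')
print(resp.getcode())   # 200
print(resp.readline())  # the first line of the body (bytes)
print(resp.read(10))    # the next 10 bytes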
-
urlretrieve
Function: saves a URL straight to a local file; keep the file extension consistent (also used to download images).
request.urlretrieve('https://www.baidu.com', 'baidu.html')
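urlretrieve also accepts a reporthook callback, called after each block is downloaded, which is handy for progress output (a sketch; the image URL is only an example and may no longer exist):
from urllib import request

def progress(block_num, block_size, total_size):
    # block_num: blocks fetched so far; total_size: from Content-Length
    print('downloaded %d / %d bytes' % (block_num * block_size, total_size))

request.urlretrieve('https://www.baidu.com/img/bd_logo1.png', 'logo.png', progress)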
-
urlencode: encodes a dict into URL query-string form; parse_qs decodes it back (sketch below).
from urllib import parse, request
url = 'http://www.baidu.com/s'
params = {'wd': '爬虫数据'}
qs = parse.urlencode(params)
url = url + '?' + qs
resp = request.urlopen(url)
print(resp.read())
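parse_qs reverses urlencode; note that every value comes back as a list:
from urllib import parse

qs = parse.urlencode({'wd': '爬虫数据', 'pn': 1})
print(qs)                  # wd=%E7%88%AC%E8%99%AB%E6%95%B0%E6%8D%AE&pn=1
print(parse.parse_qs(qs))  # {'wd': ['爬虫数据'], 'pn': ['1']}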
-
urlparse and urlsplit: split a URL into its parts.
from urllib import parse
url = 'http://www.baidu.com/s'
result = parse.urlsplit(url)  # urlsplit has no params attribute
# result = parse.urlparse(url)
print(result)
print(result.scheme)
print(result.netloc)
print(result.path)
print(result.query)
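The practical difference: urlparse additionally splits off the rarely used ;params segment of the path, while urlsplit does not have that attribute:
from urllib import parse

url = 'http://www.baidu.com/s;param1?wd=python#head'
print(parse.urlparse(url).params)  # 'param1'
print(parse.urlsplit(url).path)    # '/s;param1' (urlsplit keeps it in the path)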
-
request.Request class: lets you attach request headers (headers), a body (data, which must be URL-encoded and then byte-encoded), and the method.
# encode with str.encode(), decode with bytes.decode('utf-8')
from urllib import request, parse
# the Ajax URL that actually returns the JSON data
url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false'
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)',
    'Referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?labelWords=&fromSearch=true&suginput='
}
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'python'
}
req = request.Request(url, headers=headers, data=parse.urlencode(data).encode('utf-8'), method='POST')
resp = request.urlopen(req)
ProxyHandler (proxy settings)
西刺 free proxy IPs: xicidaili.com
快代理: kuaidaili.com
代理云: dailiyun.com
from urllib import request
handler = request.ProxyHandler({'http': '218.66.161.88:31769'})
opener = request.build_opener(handler)
url = 'http://httpbin.org/ip'
resp = opener.open(url)
print(resp.read())
httpbin.org: a service for inspecting the parameters of your own HTTP requests.
cookie
-
Data the server returns after the first request; the browser attaches it to every later request automatically. Storage is limited (< 4 KB).
-
Format
Set-Cookie: NAME=VALUE; Expires/Max-Age=DATE; Path=PATH; Domain=DOMAIN_NAME; SECURE
Expires: expiry time
Domain: domain name
Secure: only sent over HTTPS (a sketch printing these attributes follows)
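A small sketch that prints these attributes for real cookies, captured with the CookieJar introduced below (any site that sets cookies will do):
from urllib import request
from http.cookiejar import CookieJar

cookiejar = CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cookiejar))
opener.open('http://www.baidu.com')
for cookie in cookiejar:
    # name=value plus the attributes from the Set-Cookie format above
    print(cookie.name, cookie.value, cookie.expires, cookie.domain, cookie.secure)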
- Simulated login with http.cookiejar and HTTPCookieProcessor
from urllib import request
url = ''
headers = {
'User-Agent':'...',
'Cookie': '...'
}
req = request.Request(url, headers=headers)
resp = request.urlopen(req)
with open('renren.html', 'w', encoding='utf-8') as fp:
    fp.write(resp.read().decode('utf-8'))
-
http.cookiejar module:
CookieJar: the class that manages and stores cookies and attaches them to requests; keeps them in memory.
FileCookieJar(filename): stores cookies in a file; delayload=True enables lazy loading from the file.
MozillaCookieJar: file format compatible with Mozilla browsers; derived from FileCookieJar.
LWPCookieJar: file format compatible with the libwww-perl Set-Cookie3 format; derived from FileCookieJar.
# using http.cookiejar and request.HTTPCookieProcessor
from urllib import request, parse
from http.cookiejar import CookieJar
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)'
}
def get_opener():
    cookiejar = CookieJar()
    handler = request.HTTPCookieProcessor(cookiejar)
    opener = request.build_opener(handler)
    return opener
def login_renren(opener):
    data = {'email': '', 'passwd': ''}
    data = parse.urlencode(data).encode('utf-8')
    login_url = ''
    req = request.Request(login_url, headers=headers, data=data)
    opener.open(req)
def visit_profile(opener):
    url = ''
    req = request.Request(url, headers=headers)
    resp = opener.open(req)
    with open('renren.html', 'w', encoding='utf-8') as fp:
        fp.write(resp.read().decode('utf-8'))
if __name__ == '__main__':
    opener = get_opener()
    login_renren(opener)
    visit_profile(opener)
Saving cookies to disk:
from urllib import request
from http.cookiejar import MozillaCookieJar
cookiejar = MozillaCookieJar('cookie.txt')
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)'
}
req = request.Request('', headers=headers)
resp = opener.open(req)
print(resp.read())
# to also keep session cookies and expired ones: save(ignore_discard=True, ignore_expires=True)
cookiejar.save()
# loading back works the same way, with the same flags: cookiejar.load()
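A matching sketch for loading the saved cookies back, with the same two flags:
from urllib import request
from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar('cookie.txt')
# also load session cookies and cookies that have already expired
cookiejar.load(ignore_discard=True, ignore_expires=True)
opener = request.build_opener(request.HTTPCookieProcessor(cookiejar))
for cookie in cookiejar:
    print(cookie)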
requests library (supplement)
- Example: scraping Lagou with POST
import requests
url = ''  # the request whose name contains 'position' in the Network panel
data = {
    'first': '',
    'pn': '',
    'kd': ''  # copy whatever fields the page actually sends
}
headers = {
    'Referer': '',
    'User-Agent': ''
}
# check in the Network panel whether the request is POST or GET
response = requests.post(url, data=data, headers=headers)
print(response.json())
- Using a proxy
import requests
url = ''
headers = {'User-Agent': ''}
# key is the scheme ('http' or 'https'), value is 'ip:port'
proxy = {'http': 'ip:port'}
resp = requests.get(url, headers=headers, proxies=proxy)
with open('xx.html', 'w', encoding='utf-8') as fp:
    fp.write(resp.text)
-
Handling cookies
response.cookies returns the cookies; response.cookies.get_dict() gives them as a dict.
session (the requests counterpart of urllib's opener): shares cookies across multiple requests.
import requests
url = ''
headers = {}
data = {}  # login credentials
# log in once
session = requests.session()
session.post(url, data=data, headers=headers)
# then visit other pages with the same session
resp = session.get('')
print(resp.text)
-
Handling untrusted SSL certificates
resp = requests.get(url, verify=False)
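verify=False makes urllib3 emit an InsecureRequestWarning on every request; it can be silenced explicitly. A sketch against badssl.com, which hosts deliberately broken certificates for testing:
import requests
import urllib3

urllib3.disable_warnings()  # suppress InsecureRequestWarning
resp = requests.get('https://expired.badssl.com/', verify=False)
print(resp.status_code)  # 200 despite the expired certificate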
XPath and the lxml module
Developer tools: Chrome extension XPath Helper; Firefox: Try XPath.
1. Selecting nodes:
nodename: all child nodes of this node
/: at the start it means the root; elsewhere, a direct child
//: descendants, at any position
@: select a node attribute, e.g. //book[@id]
2. Predicates:
[1] the first
[last()] the last
[position()<3] the first two
[@price=10]
Fuzzy matching: //div[contains(@class, 'f1')]
3. Wildcards: * matches any node, @* any attribute
4. Multiple paths: | means "or"
5. Operators (a sketch exercising these selectors follows this list):
|, +, -, *, div, =, !=, mod, and, or...
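A minimal sketch exercising the selectors above on a made-up fragment:
from lxml import etree

# tiny made-up document, just to exercise the syntax
text = '''
<div>
  <book id="b1" price="10"><title>A</title></book>
  <book id="b2" price="20"><title>B</title></book>
</div>
'''
html = etree.HTML(text)
print(html.xpath('//book'))                       # all book nodes, any depth
print(html.xpath('//book[1]/@id'))                # ['b1'], the first book
print(html.xpath('//book[@price=10]'))            # predicate on an attribute
print(html.xpath('//book/@*'))                    # every attribute value
print(html.xpath('//title/text() | //book/@id'))  # two paths joined with |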
Parsing HTML code
from lxml import etree
text = '''
...
'''
# etree.HTML completes missing tags (html/body) for you
# parse an HTML string
html_ele = etree.HTML(text)  # returns an Element
print(etree.tostring(html_ele, encoding='utf-8').decode('utf-8'))
# parse an HTML file
# no tag completion here
parser = etree.HTMLParser(encoding='utf-8')  # pass a parser when the HTML is not well-formed
html_ele = etree.parse('tencent.html', parser=parser)
print(etree.tostring(html_ele, encoding='utf-8').decode('utf-8'))
-
Using XPath syntax in lxml: Element.xpath()
1. get all tr tags
2. get the 2nd tr tag
3. get all tags with class=even
4. get the href attribute of every a tag
5. get all the job postings
from lxml import etree
parser = etree.HTMLParser(encoding='utf-8')
html = etree.parse('tencent.html', parser=parser)
# 1. get all tr tags
trs = html.xpath('//tr')  # returns a list
for tr in trs:
    print(etree.tostring(tr, encoding='utf-8').decode('utf-8'))
# 2. get the 2nd tr tag
tr_2 = html.xpath('//tr[2]')[0]
# 3. get all tags with class=even
trs_c = html.xpath("//tr[@class='even']")
# 4. get the href attribute of every a tag
a_href = html.xpath('//a/@href')
# 5. get all the job postings
trs_all = html.xpath('//tr[position()>1]')
# 6. a tags under the current tag: note the leading .//
for tr in trs_all:
    href = tr.xpath('.//a/@href')[0]
    fullurl = 'http://hr.tencent.com/' + href
    title = tr.xpath('./td[1]//text()')[0]  # the text sits inside a child tag, hence //text()
    category = tr.xpath('./td[2]/text()')[0]
    print(href)
    break
Example: Douban Top 250 movies
import requests
from lxml import etree
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)',
    'Referer': 'https://movie.douban.com/explore'
}
url = 'https://movie.douban.com/top250'
resp = requests.get(url, headers=headers)
# print(resp.status_code)
text = resp.text
# 2. extract the data
douban_html = etree.HTML(text)
# the divs that contain every target field
div_list = douban_html.xpath("//div[@class='item']")
# print(etree.tostring(div_list[0], encoding='utf-8').decode('utf-8'))
movies = []
for li in div_list:
    title = li.xpath(".//a/span/text()")[0]
    author = li.xpath(".//p[@class]/text()")[0].strip().replace('\xa0', '')
    types = li.xpath(".//p[@class]/text()")[1].strip().replace('\xa0', '')
    rating_num = li.xpath(".//div[@class='star']/span[@class='rating_num']/text()")[0]
    movie = {'title': title, 'author': author, 'types': types, 'rating_num': rating_num}
    movies.append(movie)
print(movies)
Example: 电影天堂 (dytt8.net)
from lxml import etree
import requests
# fetch the list pages
BASE_DOMAIN = 'https://www.dytt8.net/'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)',
    'Referer': 'https://www.dytt8.net/css/db.css'
}
movie_list = []
def get_movie_list(url):
    resp = requests.get(url, headers=HEADERS)
    resp.encoding = 'gbk'  # check the page source for the declared charset
    mov_text = resp.text
    mov_html = etree.HTML(mov_text)
    table_t = mov_html.xpath('//table[@style="margin-top:6px"]//a/@href')
    lists = list(map(lambda x: BASE_DOMAIN + x, table_t))
    return lists
# walk the list pages, then scrape each detail page
def spider():
    base_url = 'https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html'
    for i in range(1, 6):
        url = base_url.format(i)
        detail_urls = get_movie_list(url)
        for d_url in detail_urls:
            movie = get_details(d_url)
            movie_list.append(movie)
    print(movie_list)
def get_details(d_url):
    resp_i = requests.get(d_url, headers=HEADERS)
    resp_i.encoding = 'gbk'
    html_i = etree.HTML(resp_i.text)
    img = html_i.xpath('//div[@id="Zoom"]//img/@src')
    info_list = html_i.xpath('//div[@id="Zoom"]//p/text()')
    download_url = html_i.xpath('//div[@id="Zoom"]//a/@href')
    info = list(map(lambda x: x.replace('\u3000', ''), info_list))
    title = info[1].replace('◎译名', '')
    movie = {'title': title, 'img': img, 'info': info[2:], 'download_url': download_url}
    return movie
if __name__ == '__main__':
    spider()