urllib, XPath and lxml

The HTTP protocol

  1. HTTP: HyperText Transfer Protocol, the method for publishing and receiving HTML pages. Port 80.

    HTTPS: the encrypted version of HTTP, with SSL added. Port 443.

    Request flow:

    Type a URL and press Enter; the browser sends the request.

    The server sends back a response.

    The browser parses the response and sends further requests for images, CSS, JS, etc.

    Once everything has downloaded, the page is rendered.

  2. URL in detail: Uniform Resource Locator

    1. Structure: scheme://host:port/path/?query-string=xx#anchor

      scheme: the protocol, e.g. http, https, ftp

      host: the hostname, which maps to an IP address

      port: the port number (the browser fills in the default port automatically)

      path: the path to a specific page

      query-string=xx: the query string; multiple parameters are joined with &

      anchor: the anchor/fragment (used by the front end)

    2. Common request methods: GET (fetching/downloading) and POST (login, uploads), out of 8 methods in total

      Anti-scraping: some sites use POST where you would expect GET

    3. Common request header fields:

      A request has three parts: the URL, the body (for POST requests), and the headers

      Common headers:

      User-Agent: disguises the request as coming from a browser

      Referer: which URL the request came from; used for anti-scraping, some servers ignore requests that do not set it

      Cookie: HTTP is stateless, so cookies are used to identify the client across requests

    4. Common status codes (a short sketch of checking them in code follows this list):

      200: OK

      301: permanent redirect

      302: temporary redirect

      400: bad request (the server cannot parse the request)

      404: the requested URL cannot be found on the server

      403: forbidden (insufficient permissions)

      500: internal server error
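      A minimal sketch of checking status codes with urllib; urllib raises HTTPError for 4xx/5xx responses, and httpbin.org is only an illustrative test endpoint (an assumption, not part of the original notes):

      from urllib import request, error

      try:
          resp = request.urlopen('http://httpbin.org/status/404')
          print(resp.getcode())   # only reached for successful responses
      except error.HTTPError as e:
          print(e.code)           # 4xx/5xx codes are raised as HTTPError, e.g. 404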

    5. Chrome developer tools (request capture)

      Right click → Inspect →

      Elements: the page's markup; Console; Sources: the files that make up the page; Network: the requests; Performance

The urllib library

Simulates browser behaviour to send requests and save the returned data.

  1. urlopen, in the urllib.request module
from urllib import request
resp = request.urlopen('https://www.baidu.com')
print(resp.read())

Parameters: url, data=None (passing data turns the request into a POST), timeout

Return value: an http.client.HTTPResponse object.

Methods: read(size), readline, readlines, getcode (status code)
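A minimal sketch of these response methods; httpbin.org is used here only as a convenient test endpoint (an assumption, not part of the original notes):

from urllib import request

resp = request.urlopen('http://httpbin.org/get')
print(resp.getcode())    # status code, e.g. 200
print(resp.readline())   # first line of the body
print(resp.read(100))    # up to 100 further bytes of the body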

  2. The urlretrieve function: saves the resource to a local file with the matching extension (also used to download images)

    request.urlretrieve('https://www.baidu.com', 'baidu.html')

  3. urlencode: converts a dict into URL-encoded form

    parse_qs: decodes a query string back into a dict (a short sketch follows the example below)

    from urllib import parse, request

    url = 'http://www.baidu.com/s'
    params = {'wd': '爬虫数据'}
    qs = parse.urlencode(params)   # 'wd=%E7%88%AC%E8%99%AB%E6%95%B0%E6%8D%AE'
    url = url + '?' + qs
    resp = request.urlopen(url)
    print(resp.read())
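    A quick sketch of the reverse direction with parse_qs, which percent-decodes a query string back into a dict of lists:

    from urllib import parse

    qs = 'wd=%E7%88%AC%E8%99%AB%E6%95%B0%E6%8D%AE&pn=10'
    print(parse.parse_qs(qs))   # {'wd': ['爬虫数据'], 'pn': ['10']}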
    
  4. urlparse and urlsplit: split a URL into its components

    from urllib import parse

    url = 'http://www.baidu.com/s'
    result = parse.urlsplit(url)
    # urlsplit has no params attribute, unlike urlparse
    # result = parse.urlparse(url)
    print(result)
    print(result.scheme)
    print(result.netloc)
    print(result.path)
    print(result.query)
    
  5. The request.Request class: lets you add request headers, data (must be encoded), and the method

    # encode with .encode(), decode with .decode('utf-8')
    from urllib import request, parse

    # the Ajax URL that returns JSON
    url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false'

    headers = {
        'User-Agent': 'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko) ',
        'Referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?labelWords=&fromSearch=true&suginput='
    }
    data = {
        'first': 'true',
        'pn': 1,
        'kd': 'python'
    }

    req = request.Request(url, headers=headers, data=parse.urlencode(data).encode('utf-8'), method='POST')
    resp = request.urlopen(req)
    print(resp.read().decode('utf-8'))
    

    The ProxyHandler handler (proxy settings)

    Xici free proxy IPs: xicidaili.com

    Kuaidaili: kuaidaili.com

    Dailiyun: dailiyun.com

    from urllib import request

    # route requests through an HTTP proxy
    handler = request.ProxyHandler({'http': '218.66.161.88:31769'})
    opener = request.build_opener(handler)
    url = 'http://httpbin.org/ip'
    resp = opener.open(url)
    print(resp.read())
    

    httpbin.org: a service for inspecting the parameters of your HTTP requests

Cookies

  1. Data returned by the server after the first request and then sent automatically with later requests. Storage is limited to under 4 KB.

  2. Format

Set-Cookie: NAME=VALUE; Expires/Max-age=DATE; Path=PATH; Domain=DOMAIN_NAME; SECURE

			Expires: expiry time

			Domain: the domain name

			Secure: only sent over HTTPS
  3. cookielib (http.cookiejar in Python 3) and HTTPCookieProcessor for simulating a login
   from urllib import request
   
   url = ''
   headers = {
       'User-Agent':'...',
       'Cookie': '...'
   }
   req = request.Request(url, headers=headers)
   resp = request.urlopen(req)
   with open ('renren.html', 'w', encoding='utf-8') as fp:
       fp.write(resp.read().decode('utf-8'))
  4. The http.cookiejar module:

    CookieJar: the class that manages, stores, and attaches cookies to requests; cookies are kept in memory

    FileCookieJar(filename, delayload=True): stores cookies in a file; delayload enables lazy access to the file

    MozillaCookieJar: compatible with the Mozilla browser cookies.txt format; derived from FileCookieJar

    LWPCookieJar: instances compatible with the Set-Cookie3 file format; derived from FileCookieJar

# use http.cookiejar and request.HTTPCookieProcessor
from urllib import request, parse
from http.cookiejar import CookieJar

headers = {
    'User-Agent':'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)'
}

def get_opener():
    # build an opener whose CookieJar keeps cookies across requests
    cookiejar = CookieJar()
    handler = request.HTTPCookieProcessor(cookiejar)
    opener = request.build_opener(handler)
    return opener

def login_renren(opener):
    # POST the login form; the session cookie ends up in the CookieJar
    data = {'email': '', 'passwd': ''}
    data = parse.urlencode(data).encode('utf-8')
    login_url = ''
    req = request.Request(login_url, headers=headers, data=data)
    opener.open(req)

def visit_profile(opener):
    # reuse the same opener so the login cookie is sent automatically
    url = ''
    req = request.Request(url, headers=headers)
    resp = opener.open(req)
    with open('renren.html', 'w', encoding='utf-8') as fp:
        fp.write(resp.read().decode('utf-8'))

if __name__ == '__main__':
    opener = get_opener()
    login_renren(opener)
    visit_profile(opener)

Saving cookies to a local file:

from urllib import request
from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar('cookie.txt')
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)

headers = {
    'User-Agent':'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)'
}
req = request.Request('', headers=headers)
resp = opener.open(req)
print(resp.read())
# ignore_discard=True also saves session cookies; ignore_expires=True also saves expired ones
cookiejar.save(ignore_discard=True, ignore_expires=True)
# to read the cookies back later, call cookiejar.load() with the same parameters
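A minimal sketch of reloading the saved file, assuming cookie.txt was written by the block above:

from urllib import request
from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar('cookie.txt')
# the same flags apply to load() as to save()
cookiejar.load(ignore_discard=True, ignore_expires=True)
opener = request.build_opener(request.HTTPCookieProcessor(cookiejar))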

The requests library (supplement)

  1. Example: scraping Lagou with a POST request
import requests

url = ''   # the Ajax URL found in the Network panel
data = {
    'first': '',
    'pn': '',
    'kd': ''	# check what the page actually sends (the request whose name contains 'position')
}
headers = {
    'Referer': '',
    'User-Agent': ''
}
# check in the Network panel whether the request is POST or GET
response = requests.post(url, data=data, headers=headers)
print(response.json())
  2. Using a proxy
import requests

url = ''
headers = {'User-Agent': ''}
# the key is 'http' or 'https', the value is 'ip:port'
proxy = {'http': 'ip:port'}
resp = requests.get(url, headers=headers, proxies=proxy)
with open('xx.html', 'w', encoding='utf-8') as fp:
    fp.write(resp.text)
  3. Handling cookies

    The cookies a response sets are available as response.cookies

    response.cookies.get_dict() returns them as a plain dict (a short sketch follows the session example below)

    session: (the counterpart of urllib's opener) sends multiple requests while keeping cookies between them

    import requests

    url = ''
    headers = {}
    data = {}   # login credentials
    # log in; the session stores the cookies the server returns
    session = requests.session()
    session.post(url, data=data, headers=headers)
    # visit another page; the stored cookies are sent automatically
    resp = session.get('')
    print(resp.text)
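    A minimal sketch of get_dict(); httpbin.org is used only as a stand-in endpoint that sets a cookie (an assumption, not from the original notes):

    import requests

    session = requests.session()
    # httpbin sets the cookie name=value and then redirects
    session.get('http://httpbin.org/cookies/set?name=value')
    print(session.cookies.get_dict())   # {'name': 'value'}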
    
  4. Handling untrusted SSL certificates

resp = requests.get(url, verify=False)
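With verify=False, requests emits an InsecureRequestWarning on every call; a small optional sketch of silencing it via urllib3 (an addition to the original notes):

import urllib3

urllib3.disable_warnings()   # suppress the InsecureRequestWarning triggered by verify=False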

XPath and the lxml module

Development tools: the Chrome extension XPath Helper; for Firefox, Try XPath

  1. Syntax (a small lxml sketch exercising these expressions follows the list):

    1. Selecting nodes:

      nodename: all child nodes with this tag name

      /: at the start it means the root; elsewhere it selects a direct child

      //: descendant nodes, at any position in the document

      @: selects an attribute, e.g. //book[@id]

    2. Predicates:

      [1]: the first element

      [last()]: the last element

      [position()<3]: the first two

      [@price=10]: elements whose price attribute equals 10

      Fuzzy matching: //div[contains(@class, 'f1')]

    3. Wildcards: * matches any element, @* matches any attribute

    4. Multiple paths: | means "or"

    5. Operators:

|, +, -, *, div, =, !=, mod, and, or...
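A small sketch exercising a few of these expressions with lxml on a made-up snippet (the HTML and the class names are illustrative assumptions):

from lxml import etree

text = '''
<div class="f1 item"><a href="/a">first</a></div>
<div class="f2 item"><a href="/b">second</a></div>
<div class="f1"><a href="/c">third</a></div>
'''
html = etree.HTML(text)
print(html.xpath('//div[1]/a/text()'))                       # ['first'], predicate [1]
print(html.xpath("//div[contains(@class, 'f1')]/a/@href"))   # ['/a', '/c'], fuzzy class match
print(html.xpath("//div[@class='f1']/a/text()"))             # ['third'], exact attribute match
print(html.xpath('//a/@href | //a/text()'))                  # union of two paths with |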

  2. The lxml library

Parsing HTML code

from lxml import etree

text = ''' 
	...
'''

# etree.HTML() fills in missing tags (html, body, unclosed elements)
# parse an HTML string
html_Ele = etree.HTML(text)  # returns an Element
print(etree.tostring(html_Ele, encoding='utf-8').decode('utf-8'))

# parse an HTML file
# parse() does not fix up the markup by itself
parser = etree.HTMLParser(encoding='utf-8')  # pass an HTMLParser when the HTML is not well-formed
html_Ele = etree.parse('tencent.html', parser=parser)
print(etree.tostring(html_Ele, encoding='utf-8').decode('utf-8'))

  3. Using XPath syntax in lxml: Element.xpath()

    1. Get all tr elements

    2. Get the 2nd tr element

    3. All tr elements with class="even"

    4. The href attribute of every a element

    5. All the job postings

from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
html = etree.parse('tencent.html', parser=parser)

# 1. get all tr elements
trs = html.xpath('//tr')	# returns a list
for tr in trs:
    print(etree.tostring(tr, encoding='utf-8').decode('utf-8'))

# 2. get the 2nd tr element
tr_2 = html.xpath('//tr[2]')[0]

# 3. all tr elements with class="even"
trs_c = html.xpath("//tr[@class='even']")

# 4. the href attribute of every a element
a_href = html.xpath('//a/@href')

# 5. all the job postings
trs_all = html.xpath('//tr[position()>1]')

# 6. a elements under the current element: note the relative .//
for tr in trs_all:
    href = tr.xpath('.//a/@href')[0]
    fullurl = 'http://ht.tencent.com/' + href
    title = tr.xpath('./td[1]//text()')[0] 	# the text sits inside a child element, hence //text()
    category = tr.xpath('./td[2]/text()')[0]

    print(href)
    break

Example: Douban Top 250 movies

import requests
from lxml import etree

headers = {
    'User-Agent':'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)',
    'Referer': 'https://movie.douban.com/explore'
}
url = 'https://movie.douban.com/top250'
resp = requests.get(url, headers=headers)
#print(resp.status_code)
text = resp.text

# 2. extract the data
douban_html = etree.HTML(text)
# each of these divs contains one movie's data
div_list = douban_html.xpath("//div[@class='item']")
# print(etree.tostring(div_list[0], encoding='utf-8').decode('utf-8'))
movies = []
for li in div_list:
    title = li.xpath(".//a/span/text()")[0]
    author = li.xpath(".//p[@class]/text()")[0].strip().replace('\xa0', '')
    types = li.xpath(".//p[@class]/text()")[1].strip().replace('\xa0', '')
    rating_num = li.xpath(".//div[@class='star']/span[@class='rating_num']/text()")[0]
    movie = {'title': title, 'author': author, 'types': types, 'rating_num': rating_num}
    movies.append(movie)
print(movies)



Example: Movie Heaven (dytt8.net)

from lxml import etree
import requests

# fetch the list pages
BASE_DOMAIN = 'https://www.dytt8.net/'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)',
    'Referer': 'https://www.dytt8.net/css/db.css'
}
movie_list = []


def get_movie_list(url):
    resp = requests.get(url, headers=HEADERS)
    resp.encoding = 'gbk'
    # check the page source for its declared encoding (gbk here)
    mov_text = resp.text
    mov_html = etree.HTML(mov_text)
    table_t = mov_html.xpath('.//table[@style="margin-top:6px"]//a/@href')
    lists = list(map(lambda x : BASE_DOMAIN + x, table_t))
    return lists


# crawl the list pages and fetch each detail page
def spider():
    base_url = 'https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html'
    for i in range(1, 6):
        url=base_url.format(i)
        detail_urls = get_movie_list(url)
        for d_url in detail_urls:
            movie = get_details(d_url)
            movie_list.append(movie)
    print(movie_list)


def get_details(d_url):
    resp_i = requests.get(d_url, headers=HEADERS)
    resp_i.encoding = 'gbk'
    html_i = etree.HTML(resp_i.text)
    img = html_i.xpath('//div[@id="Zoom"]//img/@src')
    info_list = html_i.xpath('//div[@id="Zoom"]//p/text()')
    download_url = html_i.xpath('//div[@id="Zoom"]//a/@href')
    info = list(map(lambda x: x.replace('\u3000', ''), info_list))
    title = info[1].replace('◎译名', '')
    movie = {'title':title, 'img':img, 'info':info[2:], 'download_url':download_url}
    return movie

if __name__ == '__main__':
    spider()
