Python3 Web Crawler: Using the Basic Libraries

1. HTTP Fundamentals

1. URL & URN

  • URL, Uniform Resource Locator. Example: https://github.com/favicon.ico

  • URN, Universal Resource Name. Example: urn:isbn:0451450523 identifies a book by its ISBN

  • URI, Uniform Resource Identifier

    URI = URL + URN. URNs are rarely used nowadays, so almost every URL is also a URI

2. HTTP & HTTPS

A URL usually starts with http or https; you may also see URLs beginning with other protocols such as smb

  • HTTP: Hyper Text Transfer Protocol, the protocol for transferring hypertext
  • HTTPS: HTTP over SSL/TLS, the secure version of HTTP

3. The Request

  • Request methods

    • GET: the request parameters are carried in the URL itself, so the data is visible there; a POST request's URL does not
      contain the data, which is sent as a form and carried in the request body instead
    • Data submitted with GET is limited to at most 1024 bytes, whereas POST has no such limit
    • Use POST for forms, sensitive information, and file uploads (see the example at the end of this section)
  • Request URL

    The Uniform Resource Locator (URL), which uniquely identifies the resource we want to request

  • Request headers

    Additional information passed along to the server

    • Accept: a request header specifying which content types the client can accept
    • Accept-Language: specifies the languages the client can accept
    • Accept-Encoding: specifies the content encodings the client can accept
    • Host: specifies the host (IP) and port of the requested resource, i.e. the location of the origin server or gateway for the request URL. From HTTP 1.1 onward every request must include this header
    • Cookie: often used in the plural, Cookies. Data that the website stores locally so it can recognise the user and
      track the session; its main job is to keep the current session alive
    • Referer: identifies the page the request came from; the server can use this information for things like
      traffic-source statistics and hotlink protection
    • User-Agent: UA for short, a special header string that lets the server identify the client's operating system
      and version, browser and version, and so on. Adding it to a crawler disguises the crawler as a browser; without it, the request is very
      likely to be identified as a crawler
    • Content-Type: also called the Internet Media Type or MIME type. In HTTP headers
      it indicates the media type of the request body. For example, text/html means HTML, image/gif a GIF image, application/json JSON. The full list of mappings is at http://tool.oschina.net/commons
  • Request body

    The request body generally carries the form data of a POST request; for a GET request the body is empty

    | Content-Type | Data submitted |
    | --- | --- |
    | application/x-www-form-urlencoded | Form data |
    | multipart/form-data | File upload via a form |
    | application/json | Serialized JSON data |
    | text/xml | XML data |
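
To make the GET/POST distinction concrete, here is a small sketch using the requests library (introduced later in these notes) against httpbin.org; the field names are just placeholders:

import requests

# GET: the parameters travel in the URL query string and are visible there
r = requests.get('http://httpbin.org/get', params={'name': 'germey', 'age': 22})
print(r.url)             # http://httpbin.org/get?name=germey&age=22

# POST: the same data travels in the request body as an urlencoded form
r = requests.post('http://httpbin.org/post', data={'name': 'germey', 'age': 22})
print(r.json()['form'])  # {'age': '22', 'name': 'germey'}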

4. The Response

The response is returned by the server to the client. It consists of three parts: the response status code (Response Status Code), the response headers (Response Headers), and the response body (Response Body).

  • Response status code

    Common status codes and what they mean

    | Status code | Meaning | Description |
    | --- | --- | --- |
    | 100 | Continue | The requester should continue with the request; the server has received part of it and is waiting for the rest |
    | 101 | Switching Protocols | The requester asked the server to switch protocols; the server has agreed and is ready to switch |
    | 200 | OK | The server successfully processed the request |
    | 201 | Created | The request succeeded and the server created a new resource |
    | 202 | Accepted | The server has accepted the request but has not yet processed it |
    | 203 | Non-Authoritative Information | The server successfully processed the request, but the returned information may come from another source |
    | 204 | No Content | The server successfully processed the request but returned no content |
    | 205 | Reset Content | The server successfully processed the request; the content was reset |
    | 206 | Partial Content | The server successfully processed part of the request |
    | 300 | Multiple Choices | The server can perform several different actions for the request |
    | 301 | Moved Permanently | The requested page has moved permanently to a new location (permanent redirect) |
    | 302 | Found | The requested page temporarily redirects to another page (temporary redirect) |
    | 303 | See Other | If the original request was a POST, the redirect target should be retrieved with GET |
    | 304 | Not Modified | The page has not changed since the last request; the cached copy can be reused |
    | 305 | Use Proxy | The requester should access the page through a proxy |
    | 307 | Temporary Redirect | The requested resource temporarily responds from a different location |
    | 400 | Bad Request | The server could not parse the request |
    | 401 | Unauthorized | The request lacks authentication or authentication failed |
    | 403 | Forbidden | The server refuses the request |
  • Response headers

    The response headers contain the server's reply information for the request

    • Date: when the response was generated.
    • Last-Modified: when the resource was last modified.
    • Content-Encoding: the encoding of the response content.
    • Server: information about the server, such as its name and version.
    • Content-Type: the document type of the returned data, e.g. text/html for an HTML document, application/x-javascript for a JavaScript file, image/jpeg for an image.
    • Set-Cookie: sets cookies. It tells the browser to store the given content in its cookies and send them back with the next request
    • Expires: the expiry time of the response. It lets proxy servers or browsers cache the loaded content, so that
      a later visit can be served straight from the cache, reducing server load and shortening load time.
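
A small sketch that prints these response headers for a live request (httpbin.org is used here only because it appears throughout these notes):

import requests

r = requests.get('http://httpbin.org/get')
# r.headers is a case-insensitive dict of the response headers described above
for name, value in r.headers.items():
    print(name, ':', value)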

2. Using the Basic Libraries

1. urllib

Detecting a timeout

import urllib.request
import urllib.error
import socket

try:
    # A deliberately tiny timeout so the request fails quickly
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
    

Constructing HTTP request headers

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
form = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form), encoding='utf-8')
req = request.Request(url=url, headers=headers, data=data, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

3. requests

A simple crawl of Zhihu Explore questions

import requests
import re
import sys
#Set a realistic User-Agent, otherwise Zhihu returns 400; grabbing a fresh one from Chrome dev tools (F12) is best
headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36' 
}
r=requests.get('https://www.zhihu.com/explore',headers=headers)
#If anything is unclear, send the request to http://httpbin.org/get first and inspect what was actually sent
#r=requests.get('http://httpbin.org/get',headers=headers)
if r.status_code != 200 :
    print("return status_code : %s" % r.status_code)
    sys.exit()
pattern=re.compile('explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles=re.findall(pattern,r.text)
print(titles)
    

Fetching images, video, and audio

import requests
headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36' 
}
r=requests.get('http://github.com/favicon.ico' , headers=headers )
#print( r.text )    # the body as text (str)
#print( r.content ) # the body as bytes
with open( 'favicon.ico' , 'wb' ) as f:
    f.write( r.content )

  

Status code lookup

import requests
import sys
r = requests.get('http://www.jianshu.com')
sys.exit() if r.status_code != requests.codes.ok else print('Request Successfully')
#class 'requests.structures.CaseInsensitiveDict'
print( type(r.headers) , r.headers)
#class 'requests.cookies.RequestsCookieJar'
print( type(r.cookies) , r.cookies)
print( type(r.url) , r.url )

requests.codes

import requests

codes_dict = requests.codes.__dict__
#swap each (name, status_code) pair so the code comes first
si = [(status_code, info) for info, status_code in codes_dict.items()]
dist_si = {}
#merge names that share the same status code into one dict entry
for status_code, info in si:
	print(status_code, info)
	if dist_si.get(status_code):
		dist_si[status_code].append(info)
	else:
		dist_si[status_code] = [info]
for (status_code, info) in dist_si.items():
	print(status_code, info)
================================== Output ==================================
status_codes ['name']
#Informational status codes
100 ['continue', 'CONTINUE']
101 ['switching_protocols', 'SWITCHING_PROTOCOLS']
102 ['processing', 'PROCESSING']
103 ['checkpoint', 'CHECKPOINT']
122 ['uri_too_long', 'URI_TOO_LONG', 'request_uri_too_long', 'REQUEST_URI_TOO_LONG']
#Success status codes
200 ['ok', 'OK', 'okay', 'OKAY', 'all_ok', 'ALL_OK', 'all_okay', 'ALL_OKAY', 'all_good', 'ALL_GOOD', '\\o/', '✓']
201 ['created', 'CREATED']
202 ['accepted', 'ACCEPTED']
203 ['non_authoritative_info', 'NON_AUTHORITATIVE_INFO', 'non_authoritative_information', 'NON_AUTHORITATIVE_INFORMATION']
204 ['no_content', 'NO_CONTENT']
205 ['reset_content', 'RESET_CONTENT', 'reset', 'RESET']
206 ['partial_content', 'PARTIAL_CONTENT', 'partial', 'PARTIAL']
207 ['multi_status', 'MULTI_STATUS', 'multiple_status', 'MULTIPLE_STATUS', 'multi_stati', 'MULTI_STATI', 'multiple_stati', 'MULTIPLE_STATI']
208 ['already_reported', 'ALREADY_REPORTED']
226 ['im_used', 'IM_USED']
#Redirection status codes
300 ['multiple_choices', 'MULTIPLE_CHOICES']
301 ['moved_permanently', 'MOVED_PERMANENTLY', 'moved', 'MOVED', '\\o-']
302 ['found', 'FOUND']
303 ['see_other', 'SEE_OTHER', 'other', 'OTHER']
304 ['not_modified', 'NOT_MODIFIED']
305 ['use_proxy', 'USE_PROXY']
306 ['switch_proxy', 'SWITCH_PROXY']
307 ['temporary_redirect', 'TEMPORARY_REDIRECT', 'temporary_moved', 'TEMPORARY_MOVED', 'temporary', 'TEMPORARY']
308 ['permanent_redirect', 'PERMANENT_REDIRECT', 'resume_incomplete', 'RESUME_INCOMPLETE', 'resume', 'RESUME']
#Client error status codes
400 ['bad_request', 'BAD_REQUEST', 'bad', 'BAD']
401 ['unauthorized', 'UNAUTHORIZED']
402 ['payment_required', 'PAYMENT_REQUIRED', 'payment', 'PAYMENT']
403 ['forbidden', 'FORBIDDEN']
404 ['not_found', 'NOT_FOUND', '-o-', '-O-']
405 ['method_not_allowed', 'METHOD_NOT_ALLOWED', 'not_allowed', 'NOT_ALLOWED']
406 ['not_acceptable', 'NOT_ACCEPTABLE']
407 ['proxy_authentication_required', 'PROXY_AUTHENTICATION_REQUIRED', 'proxy_auth', 'PROXY_AUTH', 'proxy_authentication', 'PROXY_AUTHENTICATION']
408 ['request_timeout', 'REQUEST_TIMEOUT', 'timeout', 'TIMEOUT']
409 ['conflict', 'CONFLICT']
410 ['gone', 'GONE']
411 ['length_required', 'LENGTH_REQUIRED']
412 ['precondition_failed', 'PRECONDITION_FAILED']
428 ['precondition', 'PRECONDITION', 'precondition_required', 'PRECONDITION_REQUIRED']
413 ['request_entity_too_large', 'REQUEST_ENTITY_TOO_LARGE']
414 ['request_uri_too_large', 'REQUEST_URI_TOO_LARGE']
415 ['unsupported_media_type', 'UNSUPPORTED_MEDIA_TYPE', 'unsupported_media', 'UNSUPPORTED_MEDIA', 'media_type', 'MEDIA_TYPE']
416 ['requested_range_not_satisfiable', 'REQUESTED_RANGE_NOT_SATISFIABLE', 'requested_range', 'REQUESTED_RANGE', 'range_not_satisfiable', 'RANGE_NOT_SATISFIABLE']
417 ['expectation_failed', 'EXPECTATION_FAILED']
418 ['im_a_teapot', 'IM_A_TEAPOT', 'teapot', 'TEAPOT', 'i_am_a_teapot', 'I_AM_A_TEAPOT']
421 ['misdirected_request', 'MISDIRECTED_REQUEST']
422 ['unprocessable_entity', 'UNPROCESSABLE_ENTITY', 'unprocessable', 'UNPROCESSABLE']
423 ['locked', 'LOCKED']
424 ['failed_dependency', 'FAILED_DEPENDENCY', 'dependency', 'DEPENDENCY']
425 ['unordered_collection', 'UNORDERED_COLLECTION', 'unordered', 'UNORDERED']
426 ['upgrade_required', 'UPGRADE_REQUIRED', 'upgrade', 'UPGRADE']
429 ['too_many_requests', 'TOO_MANY_REQUESTS', 'too_many', 'TOO_MANY']
431 ['header_fields_too_large', 'HEADER_FIELDS_TOO_LARGE', 'fields_too_large', 'FIELDS_TOO_LARGE']
444 ['no_response', 'NO_RESPONSE', 'none', 'NONE']
449 ['retry_with', 'RETRY_WITH', 'retry', 'RETRY']
450 ['blocked_by_windows_parental_controls', 'BLOCKED_BY_WINDOWS_PARENTAL_CONTROLS', 'parental_controls', 'PARENTAL_CONTROLS']
451 ['unavailable_for_legal_reasons', 'UNAVAILABLE_FOR_LEGAL_REASONS', 'legal_reasons', 'LEGAL_REASONS']
499 ['client_closed_request', 'CLIENT_CLOSED_REQUEST']
#Server error status codes
500 ['internal_server_error', 'INTERNAL_SERVER_ERROR', 'server_error', 'SERVER_ERROR', '/o\\', '✗']
501 ['not_implemented', 'NOT_IMPLEMENTED']
502 ['bad_gateway', 'BAD_GATEWAY']
503 ['service_unavailable', 'SERVICE_UNAVAILABLE', 'unavailable', 'UNAVAILABLE']
504 ['gateway_timeout', 'GATEWAY_TIMEOUT']
505 ['http_version_not_supported', 'HTTP_VERSION_NOT_SUPPORTED', 'http_version', 'HTTP_VERSION']
506 ['variant_also_negotiates', 'VARIANT_ALSO_NEGOTIATES']
507 ['insufficient_storage', 'INSUFFICIENT_STORAGE']
509 ['bandwidth_limit_exceeded', 'BANDWIDTH_LIMIT_EXCEEDED', 'bandwidth', 'BANDWIDTH']
510 ['not_extended', 'NOT_EXTENDED']
511 ['network_authentication_required', 'NETWORK_AUTHENTICATION_REQUIRED', 'network_auth', 'NETWORK_AUTH', 'network_authentication', 'NETWORK_AUTHENTICATION']

File upload ("Content-Type": "multipart/form-data")

import requests
files={
    'file':open('1.pem','rb')
}
r=requests.post( 'http://httpbin.org/post' , files=files )
print(r.text)

Cookies

  • #Get the cookies returned by the server
    import requests
    r=requests.get('https://baidu.com')
    for key , val in r.cookies.items() :
        print( "%s=%s" % (key , val) )
    
  • #Set cookies manually --- copy them from Chrome dev tools (F12)
    #####################Method 1#####################################
    import requests
    headers={
        'Cookie':'_zap=cc672834-3e63-4a4e-9246-93b54dc74a76; __DAYU_PP=yuUeiiVeaVZEayUab2rFffffffffd3f1f0f5bc9c; d_c0="AMCkrWxHuw2PTh4QnK1aQBQcA2l7rd2aSjY=|1528686380"; l_n_c=1; q_c1=35d4a692ec7d4c3c88351f8b8959668b|1553738732000|1516775913000; _xsrf=d632891773e10dc462a07feb2f829368; n_c=1; _xsrf=aDKGdn6TfOkYfk43vsekRV75FfebYNba; SL_GWPT_Show_Hide_tmp=1; SL_wptGlobTipTmp=1; __utmc=51854390; __utmz=51854390.1553738668.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); BL_D_PROV=; BL_T_PROV=; tgw_l7_route=66cb16bc7f45da64562a077714739c11; l_cap_id="YzJjYzEyY2ExZGMxNGJkMmFjNmNkNTM3MDg1ZWRiM2E=|1553762062|9d1547776eebfb3b42ca92369b2d3a9df4245339"; r_cap_id="Yjg3NTg0YjRhNmZjNDEyMDk2MmFkMjI4NzgyODgzYzU=|1553762062|efff30851f845765634ec9bae5bde07dce11315e"; cap_id="M2M0MjNjMzUyNzdlNGQxMThlNTRhOGVhOTY5ZDkwMjM=|1553762062|48aac3689381c89f5ecccbdc02c001de923e6fe2"; __utma=51854390.1821104099.1553738668.1553738668.1553761992.2; __utmb=51854390.0.10.1553761992; capsion_ticket="2|1:0|10:1553762071|14:capsion_ticket|44:ODBmZjRiMWMzN2MxNDM1OTlkMDUzNTA5NTNjM2ZlMDI=|6a6ccc9cf7d944da04671d627a7be433a0911b39d8918dc4ae65184d1d7fff89"; z_c0="2|1:0|10:1553762113|4:z_c0|92:Mi4xVHg3NkRnQUFBQUFBd0tTdGJFZTdEU1lBQUFCZ0FsVk5RZFdKWFFBU2RTWmpnTUIwSXF3ODZ1TEFNTlJraFJsbjh3|fb442f693e4ef8cc9837064a6e4e1bdd766d26db24f0bb4b0b765f36e7672ac8"; tst=r; __utmv=51854390.100--|2=registration_date=20190328=1^3=entry_date=20180124=1' ,
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
    r=requests.get('https://www.zhihu.com/collections' ,headers=headers)
    print(r.text)
    #####################Method 2#####################################
    import requests
    headers={
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
    cookies='_zap=cc672834-3e63-4a4e-9246-93b54dc74a76; __DAYU_PP=yuUeiiVeaVZEayUab2rFffffffffd3f1f0f5bc9c; d_c0="AMCkrWxHuw2PTh4QnK1aQBQcA2l7rd2aSjY=|1528686380"; l_n_c=1; q_c1=35d4a692ec7d4c3c88351f8b8959668b|1553738732000|1516775913000; _xsrf=d632891773e10dc462a07feb2f829368; n_c=1; _xsrf=aDKGdn6TfOkYfk43vsekRV75FfebYNba; SL_GWPT_Show_Hide_tmp=1; SL_wptGlobTipTmp=1; __utmc=51854390; __utmz=51854390.1553738668.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); BL_D_PROV=; BL_T_PROV=; tgw_l7_route=66cb16bc7f45da64562a077714739c11; l_cap_id="YzJjYzEyY2ExZGMxNGJkMmFjNmNkNTM3MDg1ZWRiM2E=|1553762062|9d1547776eebfb3b42ca92369b2d3a9df4245339"; r_cap_id="Yjg3NTg0YjRhNmZjNDEyMDk2MmFkMjI4NzgyODgzYzU=|1553762062|efff30851f845765634ec9bae5bde07dce11315e"; cap_id="M2M0MjNjMzUyNzdlNGQxMThlNTRhOGVhOTY5ZDkwMjM=|1553762062|48aac3689381c89f5ecccbdc02c001de923e6fe2"; __utma=51854390.1821104099.1553738668.1553738668.1553761992.2; __utmb=51854390.0.10.1553761992; capsion_ticket="2|1:0|10:1553762071|14:capsion_ticket|44:ODBmZjRiMWMzN2MxNDM1OTlkMDUzNTA5NTNjM2ZlMDI=|6a6ccc9cf7d944da04671d627a7be433a0911b39d8918dc4ae65184d1d7fff89"; z_c0="2|1:0|10:1553762113|4:z_c0|92:Mi4xVHg3NkRnQUFBQUFBd0tTdGJFZTdEU1lBQUFCZ0FsVk5RZFdKWFFBU2RTWmpnTUIwSXF3ODZ1TEFNTlJraFJsbjh3|fb442f693e4ef8cc9837064a6e4e1bdd766d26db24f0bb4b0b765f36e7672ac8"; tst=r; __utmv=51854390.100--|2=registration_date=20190328=1^3=entry_date=20180124=1'
    jar=requests.cookies.RequestsCookieJar()
    for cookie in cookies.split(';'):
        key,val = cookie.split('=',1)
        jar.set(key,val)
    r=requests.get('https://www.zhihu.com/collections' ,cookies=jar,headers=headers)
    print(r.text)
    
    
  • #Maintain a session across requests
    import requests
    s=requests.Session()
    r=s.get('http://httpbin.org/cookies/set/number/123456789')
    print(r.text)
    r=s.get('http://httpbin.org/cookies')
    print(r.text)
    

SSL certificate verification

References

[Understanding server certificates, CA & SSL][https://www.v2ex.com/t/436240]

[SSL/TLS explained in detail][https://segmentfault.com/a/1190000002554673]

Python ships its own CA list (unlike IE or Chrome, it does not use the operating system's store), provided by the certifi module. The CA file in my test environment:

(site_test) wujun@wujun-VirtualBox:~$ sudo find ./ -name cacert.pem 
./env_site_test/lib/python3.6/site-packages/pip/_vendor/certifi/cacert.pem
(site_test) wujun@wujun-VirtualBox:~$ python
Python 3.6.7 (default, Oct 22 2018, 11:32:17) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import certifi
>>> 
#Silence the warning output
import logging
import requests
#Route warnings into the logging module
logging.captureWarnings(True)
#By the time I ran this, 12306 no longer used a self-signed certificate, so the failure cannot be reproduced
response=requests.get('https://www.12306.cn')
#For mutual (two-way) authentication you must specify the client certificate and private key; requests requires the private key to be unencrypted (no passphrase)
#response=requests.get('https://www.12306.cn',cert=('/path/ser.crt','/path/key'))
print(response.status_code)
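
For reference, a hedged sketch of the two knobs requests exposes here: verify=False skips certificate verification entirely (which is what produces the warning captured above), and verify can instead point at a CA bundle such as the one certifi ships:

import requests
import certifi

# Skip verification entirely (insecure; this raises the InsecureRequestWarning captured above)
r = requests.get('https://www.12306.cn', verify=False)
print(r.status_code)

# Or verify explicitly against certifi's CA bundle
r = requests.get('https://www.12306.cn', verify=certifi.where())
print(r.status_code)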

Proxies

  • If you need the SOCKS protocol, install the extra separately: pip install "requests[socks]"
import requests
proxies={
    'http':'http://211.149.172.228:9999',
    'https':'https://182.150.35.173:80',
    #With HTTP Basic Auth over SOCKS; note that a dict keeps only one 'https' entry,
    #so uncommenting this line would replace the entry above
    #'https':'socks5://user:password@10.10.110:3128/'
}
#Keep timeout above ~3 seconds (the author notes TCP's default retransmission window is 3). timeout can also be
#split into a (connect, read) tuple; the default timeout=None blocks and waits indefinitely.
requests.get('http://httpbin.org/get' , proxies = proxies , timeout=(4, 5))

A tcpdump capture shows that the destination address in the TCP/IP headers has already become the proxy's address (211.149.172.228)

[Free and paid proxies][http://www.qydaili.com/free/]


Authentication

  • Basic auth

    import requests
    from requests.auth import HTTPBasicAuth 
    #Test username: test_name, password: 123456; the basic-auth path segment tells httpbin to require Basic auth
    r=requests.get( 'http://httpbin.org/basic-auth/test_name/123456' ,auth = HTTPBasicAuth('test_name','123456'))
    r.text
    '''
    Output test 1: correct password (200 OK):
    >>> r=requests.get( 'http://httpbin.org/basic-auth/test_name/123456' ,auth = HTTPBasicAuth('test_name','123456'))
    >>> print(r.headers)
    {'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Content-Encoding': 'gzip', 'Content-Type': 'application/json', 'Date': 'Sun, 31 Mar 2019 12:27:00 GMT', 'Server': 'nginx', 'Content-Length': '68', 'Connection': 'keep-alive'}
    >>> print(r.status_code)
    200
    
    Output test 2: wrong password (401 Unauthorized):
    >>> r=requests.get( 'http://httpbin.org/basic-auth/test_name/123456' ,auth = HTTPBasicAuth('test_name','1234567'))
    >>> print(r.status_code)
    401
    >>> print(r.headers)    
    {'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Date': 'Sun, 31 Mar 2019 12:30:17 GMT', 'Server': 'nginx', 'WWW-Authenticate': 'Basic realm="Fake Realm"', 'Content-Length': '0', 'Connection': 'keep-alive'}
    >>> 
    
    Request test: what a Basic auth request looks like on the wire
    >>> r=requests.get( 'http://httpbin.org/get' ,auth = HTTPBasicAuth('test_name','1234567')) 
    >>> print(r.text)
    {
      "args": {}, 
      "headers": {
        "Accept": "*/*", 
        "Accept-Encoding": "gzip, deflate", 
        "Authorization": "Basic dGVzdF9uYW1lOjEyMzQ1Njc=", 
        "Host": "httpbin.org", 
        "User-Agent": "python-requests/2.18.4"
      }, 
      "origin": "218.88.16.199, 218.88.16.199", 
      "url": "https://httpbin.org/get"
    }
    '''
    

    1. As shown above, when the server requires Basic auth it responds with 401, and the WWW-Authenticate header says that the "Fake Realm" realm requires authentication

    2. The client then adds "Authorization": "Basic dGVzdF9uYW1lOjEyMzQ1Njc=": the user:password string is base64-encoded and placed after "Basic" before being sent to the server (a small sketch of doing this by hand follows)

    3. If the username and password do not match, the server responds with 401 again and asks once more for Basic authentication
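
    To make step 2 concrete, here is a minimal sketch (not from the original notes) that builds the same header manually with base64 instead of using HTTPBasicAuth:

    import base64
    import requests

    #base64-encode "user:password" and place it after "Basic", exactly as step 2 describes
    credentials = base64.b64encode(b'test_name:1234567').decode('ascii')
    r = requests.get('http://httpbin.org/get',
                     headers={'Authorization': 'Basic ' + credentials})
    #httpbin echoes the request headers back, so the value can be compared with the one shown above
    print(r.json()['headers']['Authorization'])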

  • Digest auth

    import requests
    from requests.auth import HTTPDigestAuth
    url = 'http://httpbin.org/digest-auth/auth/user/pass'
    r=requests.get(url, auth=HTTPDigestAuth('user', 'pass'))
    r.status_code
    print(r.headers)
    '''
    #Output test 1
    >>> r.status_code
    200
    >>> print(r.headers)
    {'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Content-Encoding': 'gzip', 'Content-Type': 'application/json', 'Date': 'Mon, 01 Apr 2019 03:14:28 GMT', 'Server': 'nginx', 'Set-Cookie': 'fake=fake_value; Path=/, stale_after=never; Path=/', 'Content-Length': '59', 'Connection': 'keep-alive'}
    #Output test 2: the server returns 401
    import requests
    from requests.auth import HTTPDigestAuth
    text=requests.get('http://httpbin.org/digest-auth/auth/user/pass1', auth=HTTPDigestAuth('user', 'pass')).headers
    for head,response_msg in text.items():
    	print(head,response_msg) 
        
    Access-Control-Allow-Credentials true
    Access-Control-Allow-Origin *
    Content-Type text/html; charset=utf-8
    Date Mon, 01 Apr 2019 04:26:46 GMT
    Server nginx
    Set-Cookie stale_after=never; Path=/, last_nonce=d0d5882d37dcf4b76dee54e9c0d2bb5a; Path=/, fake=fake_value; Path=/
    WWW-Authenticate Digest realm="me@kennethreitz.com", nonce="3969731c4f2ce3545a8266fe7d41a67c", qop="auth", opaque="3f15a8256cb961c0e0add04854f1f15d", algorithm=MD5, stale=FALSE
    Content-Length 0
    Connection keep-alive
    >>> 
    Request test 1: what the request looks like on the wire
    (see the capture below)
    '''
    
    
    

    (screenshot: tcpdump capture of the two digest-auth requests)

    1. The tcpdump capture shows that requests actually sends two requests: the first obtains the server nonce, digest algorithm, and related fields; only the second carries the username and password

    2. In the second request, the response field inside the Authorization header is the computed digest. See also [OAuth 2.0: Bearer Token Usage][https://www.cnblogs.com/XiongMaoMengNan/p/6785155.html]

prepared request

  • Introduced to make scheduling easier: each request becomes an independent object that can be queued and sent later

    from requests import Request, Session
    url='http://httpbin.org/post'
    data={
        'name':'wujun'
    }
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
    s=Session()
    req=Request('POST',url,data=data,headers=headers)
    prepped=s.prepare_request(req)
    r=s.send(prepped)
    print(r.text)
    

4. Regular Expressions

Online testing site: http://tool.oschina.net/regex#

| Pattern | Description |
| --- | --- |
| \w | Matches a letter, digit, or underscore |
| \W | Matches any character that is not a letter, digit, or underscore |
| \s | Matches any whitespace character, equivalent to [\t\n\r\f] |
| \S | Matches any non-whitespace character |
| \d | Matches any digit, equivalent to [0-9] |
| \D | Matches any non-digit character |
| \A | Matches the start of the string |
| \Z | Matches the end of the string; if there is a trailing newline, only up to the newline |
| \z | Matches the end of the string, including a trailing newline |
| \G | Matches the position where the last match finished |
| \n | Matches a newline character |
| \t | Matches a tab character |
| ^ | Matches the start of a line |
| $ | Matches the end of a line |
| . | Matches any character except a newline; with the re.DOTALL flag it matches newlines too |
| [...] | A set of characters, listed individually |
| [^...] | Any character not listed inside the brackets |
| * | Matches 0 or more of the preceding expression |
| + | Matches 1 or more of the preceding expression |
| ? | Matches 0 or 1 of the preceding expression, non-greedy |
| {n} | Matches exactly n of the preceding expression |
| {n,m} | Matches n to m of the preceding expression, greedy |
| a\|b | Matches a or b |
| ( ) | Matches the expression inside the parentheses, which also forms a group |
  • match()

    Wrap the parts you want to extract in parentheses and read them out in order with group()

    import re
    content = 'Hello 1234567 World_tHIS is Regex Demo'
    result= re.match('^Hello\s(\d+)\s',content)
    print(result)
    print(result.group(1))
    print(result.span())
    #Non-greedy mode 1: prints 1234567
    result=re.match('^Hello.*?(\d+).*Demo$',content)
    >>> print(result.group(1))
    1234567
    #Non-greedy mode 2: prints '' - surprising at first, because the trailing .*? matches as few characters as possible (i.e. none)
    result=re.match('^Hello.*Regex (.*?)',content)
    >>> print(result.group(1))                        
    
    #Greedy mode: prints 7, because .* swallows as much as it can and leaves only the last digit for \d+
    result=re.match('^Hello.*(\d+).*Demo$',content)
    >>> print(result.group(1))                         
    7
    #Newlines: . does not match "\n" by default, so a pattern that spans lines needs the re.S flag, which makes . match any character including newlines
    content = '''Hello 1234567 World_tHIS 
    is Regex Demo'''
    result= re.match('^Hello.*?(\d+).*?Demo$',content,re.S)
    >>> print(result.group(1))  
    1234567
    #Escaping: use "\" to match special characters literally
    
    
  • search()

    Unlike match(), it scans the whole string and returns the first successful match.

    import re
    content = 'extra Hello 1234567 World_tHIS is Regex Demo'
    result= re.search('Hello\s(\d+)\s',content)
    >>> print(result.group(1))
    1234567
    
  • findall()

    Extracts every match from the string; mind the difference between greedy and non-greedy mode

    import re
    html='''
    <li data-view="5"><a href="/4.mp3" singer="beyond">尤辉岁月</a></li>
    <li data-view="5"><a href="/4.mp3" 
    singer="beyond">尤辉岁月</a></li>
    '''
    result= re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a></li>',html,re.S)
    for r in result:
        print(r[0],r[1],r[2])
    
  • sub()

    Replaces matched text

    import re
    content='123wujun456'
    result=re.sub('\d+' , '' , content)
    >>> print(result)
    wujun
    
  • compile()

    Compiles a regex string into a pattern object so it can be reused in later matches

    import re
    content1 = '2019-12-15 12:00'
    content2 = '2019-12-16 12:00'
    content3 = '2019-12-17 12:00'
    pattern = re.compile('\d{2}:\d{2}',re.S)
    result1=re.sub(pattern ,'' , content1 )
    result2=re.sub(pattern ,'' , content2 )
    result3=re.sub(pattern ,'' , content3 )
    >>> print( result1 , result2 , result3)
    2019-12-15  2019-12-16  2019-12-17 
    
    
  • Scraping the Maoyan Top 100 movies

    import requests
    import re
    import json
    def write_to_file(content):
    	with open('result.txt' , 'a' , encoding='utf-8') as f :
    		f.write( json.dumps(content , ensure_ascii = False) + '\n' ) 
    		
    def get_one_page(url):
    	headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
    	response = requests.get( url , headers = headers )
    	if response.status_code != 200 :
    		print(response.status_code)
    		return None
    	return response.text
    def parse_one_page(html):
    	pattern=re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?<a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?score.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>' , re.S)
    	items = re.findall( pattern , html )
    	'''
    	print( "items=",items)
    	for item in items:
    		print( 
    		item[0] , 
    		item[1] , 
    		item[2].strip() , 
    		item[3].strip()[3:]   if len(item[3].strip()) > 3 else ''  , 
    		item[4][5:]   if len(item[4]) > 5 else ''  , 
    		item[5]+item[6] ) 
    		print( "="*50 )
    	'''
    	for item in items:
    		yield {
    			'index' : item[0],
    			'image' : item[1],
    			'title' : item[2].strip(),
    			'actor' : item[3].strip()[3:]   if len(item[3].strip()) > 3 else '',
    			'time'  : item[4][5:]   if len(item[4]) > 5 else '',
    			'score' : item[5]+item[6]
    		}
    if __name__ == "__main__":
    	for pages in range( 10 ):
    		url='https://maoyan.com/board/4?offset=' + str(pages*10)
    		html=get_one_page(url)
    		for content in parse_one_page(html) :
    			print(content)
    			write_to_file(content)
    	
    
    

5. XPath

  • A first XPath program
from lxml import etree
text='''
<div>
<ul>
<li class ="item-0"><a href="link1.html">first item</a></li>
<li class ="item-1"><a href="link2.html">second item</a></li>
<li class ="item-inactive"><a href="link3.html">third item</a></li>
<li class ="item-1"><a href="link4.html">fourth item</a></li>
<li class ="item-0"><a href="link5.html">程序</a>
<li class ="item-3 item-4" name = "name" ><a href="link5.html">程序</a>
</ul>
</div>
'''
html=etree.HTML(text)
#etree automatically fixes up the broken HTML
result=etree.tostring(html)
#convert bytes to str
print(result.decode('utf-8'))

###Or parse an HTML file directly (assumes the markup above has been saved as test.html; etree.parse() takes a file path)
html=etree.parse('./test.html', etree.HTMLParser())
result=etree.tostring(html)
print(result.decode('utf-8'))
###Attribute matching
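#The heading above has no body in the original notes; what follows is a minimal sketch
#of attribute matching, reusing `text` and `etree` from the example above
html=etree.HTML(text)
#li nodes whose class attribute is exactly "item-0"
print(html.xpath('//li[@class="item-0"]'))
#contains() is handy when a node carries several classes, e.g. class="item-3 item-4"
print(html.xpath('//li[contains(@class, "item-3")]/a/text()'))
#read attribute values instead of whole nodes
print(html.xpath('//li/a/@href'))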

  • Selecting by position

    from lxml import etree
    text='''
    <div>
    <ul>
    <li class ="item-0"><a href="link1.html">first item</a></li>
    <li class ="item-1"><a href="link2.html">second item</a></li>
    <li class ="item-inactive"><a href="link3.html">third item</a></li>
    <li class ="item-1"><a href="link4.html">fourth item</a></li>
    <li class ="item-0"><a href="link5.html">程序</a>
    <li class ="item-3 item-4" name = "name" ><a href="link5.html">程序</a>
    </ul>
    </div>
    '''
    html=etree.HTML(text)
    #the first li node
    html.xpath('//li[1]')
    #the last one
    html.xpath('//li[last()]')
    #nodes at a position less than 3 (the first two)
    html.xpath('//li[position()<3]')
    #the third from the end (last()-1 would be the second from the end)
    html.xpath('//li[last()-2]')
    
    
  • Selecting with node axes

    from lxml import etree
    text='''
    <div>
    <ul>
    <li class ="item-0"><a href="link1.html">first item</a></li>
    <li class ="item-1"><a href="link2.html">second item</a></li>
    <li class ="item-inactive"><a href="link3.html">third item</a></li>
    <li class ="item-1"><a href="link4.html">fourth item</a></li>
    <li class ="item-0"><a href="link5.html">程序</a>
    <li class ="item-3 item-4" name = "name" ><a href="link5.html">程序</a>
    </ul>
    <ul>
    <li class ="a-item-0"><a href="link1.html">first item</a></li>
    </ul>
    <ul>
    <li class ="b-item-0"><a href="link1.html">first item</a></li>
    </ul>
    </div>
    '''
    html=etree.HTML(text)
    #all ancestor nodes
    html.xpath('//li[1]/ancestor::*')
    #the body ancestor only
    html.xpath('//li[1]/ancestor::body')
    #all attributes of the selected node
    html.xpath('//li[1]/attribute::*')
    #direct child nodes, here filtered to <a> whose href contains "link1.html"
    html.xpath('//li[1]/child::a[contains(@href , "link1.html")]')
    #all descendant nodes
    html.xpath('//li[1]/descendant::*')
    #all following sibling nodes
    html.xpath('//li[1]/following-sibling::*')
    

6. Beautiful Soup

  • Basic usage

    text='''
    <html><head><title>The Dormouse's story </title></head>
    <body>	
    <p class = "title 1 2 3" name = "dromouse"> <b>The Dormouse's story</b></p>
    <div>
    <ul>
    <li class ="item-0"><a href="link1.html">first item</a></li>
    <li class ="item-1"><a href="link2.html">second item</a></li>
    <li class ="item-inactive"><a href="link3.html">third item</a></li>
    <li class ="item-1"><a href="link4.html">fourth item</a></li>
    <li class ="item-0"><a href="link5.html">程序</a>
    <li class ="item-3 item-4" name = "name" ><a href="link5.html">程序</a>
    </ul>
    <ul>
    <li class ="a-item-0"><a href="link1.html">first item</a></li>
    <ul>
    <li class ="b-item-0"><a href="link1.html">first item</a></li>
    </div>
    '''
    from bs4 import BeautifulSoup
    #use the lxml parser
    soup= BeautifulSoup(text,'lxml')
    #print the prettified HTML
    print(soup.prettify())
    #Tag type; string is an attribute of the Tag
    print(type(soup.title))
    <class 'bs4.element.Tag'>
    #print the text of the li tag (only the first one is selected)
    print(soup.li.string)
    #without narrowing further, a whole chunk of markup is selected
    print(soup.head)
    
  • Extracting information

    #node name: name
    >>> print(soup.head.name)
    head
    #attributes: attrs
    >>> print(soup.p.attrs['name'])
    dromouse
    >>> print(soup.p['name'])      
    dromouse
    >>> print(soup.p['class'])
    ['title', '1', '2', '3']
    >>> 
    #get the text content
    >>> print(soup.title.string)
    The Dormouse's story
    #nested selection
    >>> print(soup.p.b.string)  
    The Dormouse's story
    >>> 
    #get child nodes
    >>> soup.div.contents
    ['\n', <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">程序</a>
    </li><li class="item-3 item-4" name="name"><a href="link5.html">程序</a>
    </li></ul>, '\n', <ul>
    <li class="a-item-0"><a href="link1.html">first item</a></li>
    </ul>, '\n', <ul>
    <li class="b-item-0"><a href="link1.html">first item</a></li>
    </ul>, '\n']
    >>> soup.div.children
    <list_iterator object at 0x7f1fbcea9908>
    
    >>> for i , child  in enumerate(soup.div.children): 
    ...     print(i, child)
    ... 
    0 
    
    1 <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">程序</a>
    </li><li class="item-3 item-4" name="name"><a href="link5.html">程序</a>
    </li></ul>
    2 
    
    3 <ul>
    <li class="a-item-0"><a href="link1.html">first item</a></li>
    </ul>
    4 
    
    5 <ul>
    <li class="b-item-0"><a href="link1.html">first item</a></li>
    </ul>
    6 
    
    #get all descendant nodes
    >>> for i , child  in enumerate(soup.div.descendants):
    ...     print(i,child)
    
    #get the parent node: the parent of the first li
    >>> soup.li.parent
    <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">程序</a>
    </li><li class="item-3 item-4" name="name"><a href="link5.html">程序</a>
    </li></ul>
    #get all ancestor nodes
    >>> list(enumerate(soup.div.parents))
    #get sibling nodes
    text='''
    <p>a<a>a</a>c<a></a>d</p>
    '''
    soup= BeautifulSoup(text,'lxml')
    soup.a.previous_sibling
    soup.a.next_sibling
    list(enumerate(soup.a.previous_siblings))
    list(enumerate(soup.a.next_siblings))
    'a'
    >>> soup.a.next_sibling
    'c'
    >>> list(enumerate(soup.a.previous_siblings))
    [(0, 'a')]
    >>> list(enumerate(soup.a.next_siblings))
    [(0, 'c'), (1, <a></a>), (2, 'd')]
    >>> 
    #extract information from siblings and parents
    text='''
    <p class="1234">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    '''
    soup= BeautifulSoup(text,'lxml')
    soup.a.previous_sibling
    soup.a.next_sibling.string
    list(soup.a.parents)
    list(soup.a.parents)[0]
    list(soup.a.parents)[0].attrs['class']
    
    
  • find_all()

    #query by node name
    text='''
    <p class="1234">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    '''
    soup= BeautifulSoup(text,'lxml')
    print(soup.find_all(name='a'))
    print(type(soup.find_all(name='a')[0]))
    for a in soup.find_all(name='a'):
        print(a.string)
    #query by attribute
    text='''
    <p class="1234">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    <p id = "1" class="12345">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    '''
    soup= BeautifulSoup(text,'lxml')
    #print(soup.find_all(attrs={'class':'12345'}))  or  print( soup.find_all(class_="12345") )
    print( soup.find_all(id="1") )
    print(type(soup.find_all(attrs={'class':'1234'})[0]))
    for a in soup.find_all(attrs={'class':'1234'}):
        print(a.string)
    #text: regex-match against the node's *text*, not its markup
    text='''
    <p>
    Hello,this is link
    </p>
    <p>
    Hello,this is link,too
    </p>
    '''
    soup= BeautifulSoup(text,'lxml')
    print(soup.find_all(text=re.compile('link')))
    
  • find()

    Compared with find_all(), it returns a single Tag (the first match) rather than a list
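
    A minimal sketch of the difference (the snippet here is made up for illustration):

    from bs4 import BeautifulSoup

    text = '<p><a>a1</a><a>a2</a></p>'
    soup = BeautifulSoup(text, 'lxml')
    print(soup.find(name='a'))       # a single Tag: <a>a1</a>
    print(soup.find_all(name='a'))   # a list: [<a>a1</a>, <a>a2</a>]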

  • Other functions

    | Function | Purpose |
    | --- | --- |
    | find_parents() | Returns all ancestor nodes |
    | find_parent() | Returns the direct parent node |
    | find_next_siblings() | Returns all following sibling nodes |
    | find_next_sibling() | Returns the first following sibling node |
    | find_previous_siblings() | Returns all preceding sibling nodes |
    | find_previous_sibling() | Returns the first preceding sibling node |
    | find_all_next() | Returns all later nodes that meet the condition |
    | find_next() | Returns the first later node that meets the condition |
    | find_all_previous() | Returns all earlier nodes that meet the condition |
    | find_previous() | Returns the first earlier node that meets the condition |
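
    A quick, hedged illustration of the sibling variants (the snippet is made up for this note):

    from bs4 import BeautifulSoup

    text = '<ul><li>one</li><li>two</li><li>three</li></ul>'
    soup = BeautifulSoup(text, 'lxml')
    first = soup.li
    print(first.find_next_sibling())    # <li>two</li>
    print(first.find_next_siblings())   # [<li>two</li>, <li>three</li>]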
  • CSS selectors

[W3C CSS selectors][http://www.w3school.com.cn/cssref/css_selectors.asp]

#query with CSS selectors
text='''
<div class ='panle'>
<div class = 'panle-heading' >
<p class="1234">a
<a>a1</a>
<a>a2</a>
d</p>
</div>
<div>
<ul class='ul-1'>
<li id = "item-1">test1</li>
<li id = "item-3">test2</li>
</ul>
<ul class='ul-2'>
<li id = "item-1">test1</li>
<li id = "item-3">test2</li>
</ul>
</div>
'''
from bs4 import BeautifulSoup
soup= BeautifulSoup(text,'lxml')
print(soup.select('.panle .panle-heading'))
print(soup.select('ul li'))
print(soup.select('.ul-1 #item-1'))
print(type(soup.select('ul')[0]))
print(soup.select('ul')[0])
>>> for ul in soup.select('ul'):
...     print( ul.select('li'))
... 
[<li id="item-1">test1</li>, <li id="item-3">test2</li>]
[<li id="item-1">test1</li>, <li id="item-3">test2</li>]
>>> print(soup.select('ul li')[0].get_text())
test1
>>> print(soup.select('ul li')[0].string)
test1
>>> 

7. pyquery

  • Initializing from a string

    text='''
    <div class ='panle'>
    <div class = 'panle-heading' >
    <p class="1234">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    </div>
    <div>
    <ul class='ul-1'>
    <li id = "item-1">test1</li>
    <li id = "item-3">test2</li>
    </ul>
    <ul class='ul-2'>
    <li id = "item-1">test1</li>
    <li id = "item-3">test2</li>
    </ul>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc=pq(text)
    >>>print(doc('li'))
    <li id="item-1">test1</li>
    <li id="item-3">test2</li>
    <li id="item-1">test1</li>
    <li id="item-3">test2</li>
    
  • Initializing from a URL

    from pyquery import PyQuery as pq
    >>> html=pq(url='http://www.sina.com.cn',encoding='utf-8')     
    >>> print(html('title'))                                   
    <title>新浪首页</title>
    
  • Initializing from a file

    from pyquery import PyQuery as pq
    html=pq(filename='demo.html',encoding='utf-8') 
    
  • CSS

    text='''
    <div id='AAA' class ='panle'>
    <div class = 'panle-heading' >
    <p class="1234">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    </div>
    <div>
    <ul class='ul-1'>
    <li id = "item-1">test1</li>
    <li id = "item-3">test2</li>
    </ul>
    <ul class='ul-2'>
    <li id = "item-1">test1</li>
    <li id = "item-3">test2</li>
    </ul>
    </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc=pq(text)
    >>> print(doc('.panle .panle-heading a')) 
    <a>a1</a>
    <a>a2</a>
    d
    >>> print(type(doc('.panle .panle-heading a')) )
    <class 'pyquery.pyquery.PyQuery'>
    
    
  • Finding nodes

    1. Child nodes: find() selects descendants, children() selects direct children

      #using the HTML text from above
      from pyquery import PyQuery as pq
      doc=pq(text)
      items=doc('.ul-1')
      >>> print(type(items))
      <class 'pyquery.pyquery.PyQuery'>
      >>> print(items)
      <ul class="ul-1">
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      </ul>
      
      >>> lis=items.find('li')
      >>> print(type(lis))
      <class 'pyquery.pyquery.PyQuery'>
      >>> print(lis)
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      
      >>> lis=items.children()
      >>> print(lis)
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      #filter by id
      >>> lis=items.children('#item-1')
      >>> print(lis)                   
      <li id="item-1">test1</li>
      
      
    2. Parent nodes: parent() returns the direct parent, parents() returns the ancestors

      #using the HTML text from above
      from pyquery import PyQuery as pq
      doc=pq(text)
      items=doc('.ul-1')
      container=items.parent()
      print(type(container))
      print(container)
      >>> items=doc('.ul-1')
      >>> container=items.parent()
      >>> print(type(container))
      <class 'pyquery.pyquery.PyQuery'>
      >>> print(container)
      <div>
      <ul class="ul-1">
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      </ul>
      <ul class="ul-2">
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      </ul>
      </div>
      
      >>> container=items.parents('.panle')        
      >>> print(container)                 
      <div id="AAA" class="panle">
      <div class="panle-heading">
      <p class="1234">a
      <a>a1</a>
      <a>a2</a>
      d</p>
      </div>
      <div>
      <ul class="ul-1">
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      </ul>
      <ul class="ul-2">
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      </ul>
      </div>
      </div>
      
      
    3. Sibling nodes

      from pyquery import PyQuery as pq
      doc=pq(text)
      >>> li=doc( '#item-1')     
      >>> print(li)
      <li id="item-1">test1</li>
      <li id="item-1">test1</li>
      
      >>> print(li.siblings())
      <li id="item-3">test2</li>
      <li id="item-3">test2</li>
      
      
    4. Iteration

      text='''
      <div class= "div0 div1">
      <li id="1" >li-1</li>
      <li>li-2</li>
      <li>li-3</li>
      <li>li-3</li>
      </div>
      '''
      from pyquery import PyQuery as pq
      doc=pq(text)
      >>> lis=doc('li').items()
      >>> print(type(lis))     
      <class 'generator'>
      >>> for li in lis:
      ...     print(li,type(li))
      ... 
      <li id="1">li-1</li>
       <class 'pyquery.pyquery.PyQuery'>
      <li>li-2</li>
       <class 'pyquery.pyquery.PyQuery'>
      <li>li-3</li>
       <class 'pyquery.pyquery.PyQuery'>
      <li>li-3</li>
       <class 'pyquery.pyquery.PyQuery'>
      
      
    5. Getting attribute information

      text='''
      <div class= "div0 div1">
      <li id="1" ><span class='bold1'>li-1</span></li>
      <li id="2" ><span class='bold2'>li-2</span></li>
      <li id="3" ><span class='bold3'>li-3</span></li>
      <li id="4" ><span class='bold4'>li-4</span></li>
      </div>
      '''
      from pyquery import PyQuery as pq
      doc=pq(text)
      >>> a=doc('li')
      >>> print(a , type(a))
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li>
       <class 'pyquery.pyquery.PyQuery'>
      >>> print(a.attr('id'))
      1
      >>> print(a.attr.id)
      1
      #iterate
      >>> a=doc('li').items()
      >>> for li in a:
      ...     print(li.attr.id)
      ... 
      1
      2
      3
      4
      
      
    6. Getting text

      text='''
      <div class= "div0 div1">
      <li id="1" ><span class='bold1'>li-1</span></li>
      <li id="2" ><span class='bold2'>li-2</span></li>
      <li id="3" ><span class='bold3'>li-3</span></li>
      <li id="4" ><span class='bold4'>li-4</span></li>
      </div>
      '''
      from pyquery import PyQuery as pq
      doc=pq(text)
      li_text=doc('li')
      >>> print(a,li_text.text())
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li>
       li-1 li-2 li-3 li-4
      >>> print(a,li_text.html())
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li>    
       <span class="bold1">li-1</span>
      >>> text=li_text.items()       
      >>> for html in text:
      ...     print(html.html())
      ... 
      <span class="bold1">li-1</span>
      <span class="bold2">li-2</span>
      <span class="bold3">li-3</span>
      <span class="bold4">li-4</span>
          
      
      
    7. Manipulating nodes

      text='''
      <div class= "div0 div1">
      <li id="1" ><span class='bold1'>li-1</span></li>
      <li id="2" ><span class='bold2'>li-2</span></li>
      <li id="3" ><span class='bold3'>li-3</span></li>
      <li id="4" ><span class='bold4'>li-4</span></li>
      </div>
      '''
      from pyquery import PyQuery as pq
      doc=pq(text)
      >>> li_text=doc('div')      
      >>> print(li_text)          
      <div class="div0 div1">
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li>
      </div>
      >>> li_text.removeClass('div0')
      [<div.div1>]
      >>> print(li_text)             
      <div class="div1">
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li>
      </div>
      >>> li_text.addClass('div2')   
      [<div.div1.div2>]
      >>> print(li_text)          
      <div class="div1 div2">
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li>
      </div>
      
      >>> li_text=doc('#1')    
      >>> print(li_text)
      <li id="1"><span class="bold1">li-1</span></li>
      
      >>> print(li_text.attr('name','modify'))
      <li id="1" name="modify"><span class="bold1">li-1</span></li>
      
      >>> print(li_text.text('test modify'))  
      <li id="1" name="modify">test modify</li>
      
      >>> print(li_text.html('<b>AAA</b>'))     
      <li id="1" name="modify"><b>AAA</b></li>
      >>>
      
      