I. Logging in with requests
Automated login with requests:
Step 1: log in to the target page manually
Step 2: grab the site's cookie value from the logged-in session
Step 3: add that cookie to the request headers when sending requests
import requests
headers = {
'cookie': '_zap=dab5d00e-ae9e-4fd3-9406-9aa38b5dd510; _xsrf=Qv8SuVAEk3B8ArpedFiXQJzA3aZnTRJD; d_c0=AICTv_7HjhaPTjP8MFdxDckcwFSB5KxHCKo=|1680319008; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1680319009; gdxidpyhxdE=hoK%2FgbGDdaH0%2BTzIJbUKGQZ3uBaLDT2ZgswE77PdJw5PGNK7ye%2FKElr0uVURTp7B0DRugY63UHXLZaTatyhkgYrKXuzVvAxPNUs6ev3TRAQx8hbV6oSGNxTaGQks%5CavDWf4qs8ldClpN32nV2wPU8xEp%2BUiqAHuu7koPfUmmjBeij%5Czr%3A1680319912464; YD00517437729195%3AWM_NI=er%2BgCwpZ8hSu8ARy0fCpxvJxwYu1BHHnhzmMEgkmn0GGsAE%2F%2Fi26TlJGvkr829Zj0ikEkNJcgT4jT2NGIp2%2BEGHBvIDGOyCtVT%2BxaJTNVJhqVkycc727eiuF0TNcrSqMNGo%3D; YD00517437729195%3AWM_NIKE=9ca17ae2e6ffcda170e2e6eeb2cd67a5ba9989ce6e878a8ba7c44f839e9f83d46eacea82dab470b49baab4b42af0fea7c3b92aa7aee588c27db486f889d27ffc9f9e9ab36e81b0b897c974b587bbd2c1448cb2a89bf04787b9a58eed3c8599bda2c825edb7fed7d7688a99ba8cf134f2e7968bb363a99ba5afc969a58cfc8bd73a8e998791b121879988a3f5438899819ac280968ce599e44298929aabeb7b96b5b784e23df2e9ab8af84ef48c8284f159b39382a8ee37e2a3; YD00517437729195%3AWM_TID=Rsi%2B93AkUs5BVEEFUFaBfhrY8hM5RLD0; captcha_session_v2=2|1:0|10:1680319050|18:captcha_session_v2|88:Nk1OcWcybVo1QVFTZUNQUW44OGp4RmxTTS9rbTQvRmEvcmp4SWR5OGVUUVNCRFBBalQ4aXNqMnFub1NwNmtIRA==|a0f16b0a383b63158c1347a141f35dfbc8a82d06be26550677440863bd1ed3f7; __snaker__id=21lWaUO5iZBaHVVE; 
captcha_ticket_v2=2|1:0|10:1680319108|17:captcha_ticket_v2|704:eyJ2YWxpZGF0ZSI6IkNOMzFfQy5qenpScGdkTDJlUHVsazlGd3prUEZjMEpCcEt2aktJMHpCLXY5WHJjYm51LXJMTjB6NEFVRk93c2RqZnUxdkN4SVZHMUF5ekR5T0tUbm52anRjNjE4OG8tNGEuLWhKbHZxZ2FubVhLbW1zYjA3a0xLdi10V2FKclV2Z0lzZFB4XzB1UlJVSkI0R282bXh2dUxIcm81ZGxzN1pWRl9hRXJIeFFnNmpIakY0MmF1bU5pb1hMYlk3SzBnOWtSUDhfeGFXV21LdEdubUlqUS1tRVUyajRaMWxSQVhiQlR6WFJZV0tLblRfU1NfMW5XRjRtdGZyTmx4RF95eGktQU9YWWk1emhRdmlMTXQxbl9GT3JINDZvZEpPOHZEMjJWLnV5SUtxUFdRc1dyeERyNEhyTXJjZVR1di01NG41QlJqTDlZbTlCUWtPOUFVMi5fRi1tLTQ2UngxYmw3UjlTQVVzR3U1OGNEYTQ4UjU0S3psOHQtX012RXZBd3NlR0xUazF4TUguRmxLVm90RTlLRlRDWkRSdC1TeV9fdFRWTkxyNkZsRnRmSS15UDg0ZVhuTU9LWFpLYzlqd1hHbEtqRWpCak9JNXp3QllEUlhnQ0NDTDFUZXlOVUtxYmZKcy5FRTY2OXJNUU9rU3kyMG5KVk0tNTVYUDRCRjJ6QmxBMyJ9|15fdffa5a39af8450cb65442d937c6fa158204305025f73e5ac450c339749580; z_c0=2|1:0|10:1680319120|4:z_c0|92:Mi4xeFJDb0NnQUFBQUFBZ0pPX19zZU9GaVlBQUFCZ0FsVk5rUEFVWlFESTZPaERycWlyQ2QxR21nQTE0aVo1ZldRUG1B|0dfdca1d5eaf9cfc3c5637d198620224951f0fc8633680ec057320fab3873b29; q_c1=22823c6626ba4523aab02aa5ea3a1b86|1680319121000|1680319121000; KLBRSID=b33d76655747159914ef8c32323d16fd|1680319149|1680319008; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1680319150',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'
}
response = requests.get('https://www.zhihu.com/', headers=headers)
print(response.text)
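The cookie above is pasted as one raw header string. As a minimal, hedged sketch, such a 'k1=v1; k2=v2' string can also be split into a dict and passed to requests via the cookies= parameter instead of the header; every cookie value below is a made-up placeholder:

```python
def parse_cookie_header(raw: str) -> dict:
    """Split a 'k1=v1; k2=v2' cookie string into a dict."""
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if not pair:
            continue
        # partition on the first '=' only, since values may contain '='
        key, _, value = pair.partition('=')
        cookies[key] = value
    return cookies

raw = '_zap=abc123; _xsrf=def456; z_c0=token=with=equals'
print(parse_cookie_header(raw))
# {'_zap': 'abc123', '_xsrf': 'def456', 'z_c0': 'token=with=equals'}
```

Usage would then be `requests.get(url, headers=headers, cookies=parse_cookie_header(raw))`, keeping the header dict free of session data.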
II. Getting cookies with selenium
from selenium.webdriver import Chrome
1. Create a browser and open the page that needs automated login
b = Chrome()
b.get('https://www.taobao.com')
2. Leave enough time to finish logging in manually (make sure the window b points to actually shows the logged-in page)
input('Press enter once you have logged in: ')
3. After a successful login, grab the cookies and save them to a local file
result = b.get_cookies()
print(result)
with open('files/taobao.txt', 'w', encoding='utf-8') as f:
    f.write(str(result))
III. Reusing cookies with selenium
from selenium.webdriver import Chrome
1. Create a browser and open the page that needs automated login
b = Chrome()
b.get('https://www.taobao.com')
2. Read the locally saved cookies
with open('files/taobao.txt', encoding='utf-8') as f:
    result = eval(f.read())  # the file holds the repr of a list of cookie dicts; ast.literal_eval is the safer choice
3. Add the cookies
for x in result:
    b.add_cookie(x)
4. Reload the page
b.get('https://www.taobao.com')
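The saved cookies can also be reused outside selenium. A hedged sketch: get_cookies() returns a list of dicts (with keys like 'name', 'value', 'domain', 'path'), while requests only needs name/value pairs; the sample list below is made up:

```python
def selenium_cookies_to_dict(cookie_list):
    """Keep only the name/value pairs, which is what requests expects."""
    return {c['name']: c['value'] for c in cookie_list}

# made-up sample of what b.get_cookies() might return
saved = [
    {'name': 'thw', 'value': 'cn', 'domain': '.taobao.com'},
    {'name': 't', 'value': 'abc123', 'domain': '.taobao.com'},
]
print(selenium_cookies_to_dict(saved))
# {'thw': 'cn', 't': 'abc123'}
```

The result can be passed as `requests.get(url, cookies=...)`.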
IV. Using a proxy with selenium
from selenium.webdriver import Chrome, ChromeOptions
options = ChromeOptions()
# set the proxy
options.add_argument('--proxy-server=http://59.56.84.244:4531')
b = Chrome(options=options)
b.get('https://movie.douban.com/top250?start=0&filter=')
V. Using a proxy with requests
import requests
headers = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}
# proxy IP (the scheme prefix is required by newer versions of requests)
proxies = {
    'https': 'http://59.56.84.244:4531'
}
res = requests.get('https://movie.douban.com/top250?start=0&filter=', headers=headers, proxies=proxies)
print(res.text)
VI. XPath
import json
from lxml import etree
XPath is a method for parsing HTML or XML data: it selects tags (elements) by their path in the document tree.
"""
Python data: {'name':'xiaoming', 'age':18, 'is_ad':True, 'car_no':None}
JSON data: {"name":"xiaoming", "age":18, "is_ad":true, "car_no":null}
XML data:
<allStudent>
<student class='优秀学员'>
<name>xiaoming</name>
<age>18</age>
<is_ad>是</is_ad>
<car_no></car_no>
</student>
</allStudent>
"""
1. Common terms
1) Tree: an entire HTML page or XML document forms a tree structure
2) Element (node): each tag in the HTML tree
3) Root node: the first node in the tree
4) Content: the text inside a tag
5) Attribute: an attribute set on a tag
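The terms above map directly onto lxml objects. A tiny sketch using an inline document (not the notes' data.html):

```python
from lxml import etree

# etree.HTML returns the root <html> element of the tree
root = etree.HTML('<html><body><p id="p1">hello</p></body></html>')
p = root.xpath('//p')[0]

print(root.tag)   # 'html'        -> the root node
print(p.tag)      # 'p'           -> an element (node)
print(p.text)     # 'hello'       -> the content
print(p.attrib)   # {'id': 'p1'}  -> the attributes
```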
2. XPath syntax
1. Selecting tags
1) Absolute path: starts with '/', then spells out the path level by level from the root node
2) Relative path: starts with '.' or '..', where '.' is the current node and '..' is the current node's parent.
Note: a leading './' may be omitted
3) Full path: a path starting with '//'
2. Getting tag content: append '/text()' to the path that selects the tag
3. Getting a tag attribute: append '/@attribute-name' to the path that selects the tag
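The three path styles can be tried on a self-contained snippet; the inline document below stands in for the notes' data.html, which is not included:

```python
from lxml import etree

html = '''
<html><body>
  <div><a href="https://example.com">link1</a></div>
  <div><a href="https://example.org">link2</a></div>
</body></html>
'''
root = etree.HTML(html)

# absolute path: from the root down
print(root.xpath('/html/body/div/a/text()'))  # ['link1', 'link2']

# relative path: starts from the node whose xpath() is called
div = root.xpath('/html/body/div')[0]
print(div.xpath('./a/text()'))                # ['link1']

# full path: // matches at any depth
print(root.xpath('//a/@href'))                # ['https://example.com', 'https://example.org']
```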
3. Usage
1. Build the tree and get the root node
html = open('data.html', encoding='utf-8').read()  # data.html: a local sample page
root = etree.HTML(html)
2. Select tags by path
node.xpath(path): returns every tag matching the path as a list of node objects
1) Absolute path
result = root.xpath('/html/body/div/a')
print(result)
# get the tag content
result = root.xpath('/html/body/div/a/text()')
print(result)
# get the tag attribute
result = root.xpath('/html/body/div/a/@href')
print(result)
Note: an absolute path returns the same result no matter which node's xpath() is called
div = root.xpath('/html/body/div')[0]
print(div)
result = div.xpath('/html/body/div/a/text()')
print(result)
2) Relative path
result = root.xpath('./body/div/a/text()')
print(result) # ['我是超链接2', '我是超链接4']
result = div.xpath('./a/text()')
print(result) # ['我是超链接2', '我是超链接4']
result = div.xpath('a/text()')
print(result) # ['我是超链接2', '我是超链接4']
3) Full path
# find all a tags
result = root.xpath('//a/text()')
print(result) # ['我是超连接11', '我是超连接22', '我是超连接33', '我是超连接1', '我是超链接2', '我是超链接4', '我超链接3']
result = div.xpath('//a/text()')
print(result) # ['我是超连接11', '我是超连接22', '我是超连接33', '我是超连接1', '我是超链接2', '我是超链接4', '我超链接3']
result = root.xpath('//div/a/text()')
print(result) # ['我是超链接2', '我是超链接4', '我超链接3']
4. Predicates (conditions on a node in the path)
1) Position predicates:
[N]: the Nth matching tag
[last()]: the last matching tag
[last()-N]: the (N+1)th matching tag from the end
[position()>N], [position()>=N], [position()<N], [position()<=N]
result = root.xpath('//span/p[2]/text()')
print(result) # ['我是段落22']
result = root.xpath('//span/p[last()]/text()')
print(result) # ['我是段落55']
result = root.xpath('//span/p[position()<=2]/text()')
print(result) # ['我是段落11', '我是段落22']
result = root.xpath('//span/p[position()>2]/text()')
print(result) # ['我是段落33', '我是段落44', '我是段落55']
2) Attribute predicates
[@attribute=value]
result = root.xpath('//span/p[@id="p1"]/text()')
print(result) # ['我是段落33']
result = root.xpath('//span/p[@class="c1"]/text()')
print(result) # ['我是段落11', '我是段落33', '我是段落44']
result = root.xpath('//span/p[@data="5"]/text()')
print(result) # ['我是段落55']
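The predicate examples above run against data.html, which is not included. The same queries can be tried on an assumed inline snippet (parsed with the XML parser here so the structure is preserved exactly as written):

```python
from lxml import etree

xml = ('<span>'
       '<p class="c1">p1</p>'
       '<p>p2</p>'
       '<p id="x" class="c1">p3</p>'
       '<p>p4</p>'
       '</span>')
root = etree.fromstring(xml)

print(root.xpath('//span/p[2]/text()'))              # ['p2']
print(root.xpath('//span/p[last()]/text()'))         # ['p4']
print(root.xpath('//span/p[position()<=2]/text()'))  # ['p1', 'p2']
print(root.xpath('//span/p[@id="x"]/text()'))        # ['p3']
print(root.xpath('//span/p[@class="c1"]/text()'))    # ['p1', 'p3']
```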
5. Wildcards
In XPath, * matches any tag and @* matches any attribute
result = root.xpath('//span/*/text()')
print(result) # ['我是span11', '我是超连接11', '我是段落11', '我是段落22', '我是段落33', '我是段落44', '我是段落55', '我是超连接22', '我是超连接33']
result = root.xpath('//span/p[@class="c1"]/text()')
print(result) # ['我是段落11', '我是段落33', '我是段落44']
result = root.xpath('//span/*[@class="c1"]/text()')
print(result) # ['我是超连接11', '我是段落11', '我是段落33', '我是段落44', '我是超连接33']
result = root.xpath('//span/span/@*')
print(result) # ['hello', 'world']
result = root.xpath('//*[@class="c1"]/text()')
print(result) # ['我是超连接11', '我是段落11', '我是段落33', '我是段落44', '我是超连接33', '我是段落3']
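A self-contained sketch of the wildcard queries above, again on an assumed inline snippet:

```python
from lxml import etree

root = etree.fromstring(
    '<div>'
    '<a class="c1" href="u1">a1</a>'
    '<p class="c1">p1</p>'
    '<span data-x="5">s1</span>'
    '</div>'
)

print(root.xpath('//div/*/text()'))               # ['a1', 'p1', 's1']  any child tag
print(root.xpath('//div/*[@class="c1"]/text()'))  # ['a1', 'p1']        any tag with class c1
print(root.xpath('//span/@*'))                    # ['5']               any attribute of span
```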
VII. Scraping douban with XPath
import requests
from lxml import etree
1. Fetch the page
headers = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}
# proxy IP (the scheme prefix is required by newer versions of requests)
proxies = {
    'https': 'http://59.56.84.244:4531'
}
res = requests.get('https://movie.douban.com/top250?start=0&filter=', headers=headers, proxies=proxies)
# print(res.text)
2. Parse the data
root = etree.HTML(res.text)
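The notes stop after building the tree. As a hedged sketch of the extraction step: the class names 'hd' and 'title' reflect douban's Top250 markup at the time of writing and may change, and the inline sample below stands in for res.text so the snippet runs offline:

```python
from lxml import etree

# a stripped-down imitation of one douban Top250 page (not real response data)
sample = '''
<ol class="grid_view">
  <li><div class="hd"><a><span class="title">肖申克的救赎</span></a></div></li>
  <li><div class="hd"><a><span class="title">霸王别姬</span></a></div></li>
</ol>
'''
root = etree.HTML(sample)

# one title span per movie, nested in the 'hd' block
titles = root.xpath('//div[@class="hd"]/a/span[@class="title"]/text()')
print(titles)  # ['肖申克的救赎', '霸王别姬']
```

Against the real page, the same xpath would be applied to `etree.HTML(res.text)`.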