1. XPath selectors
1.1 Introduction to XPath
XPath is a language for locating information in XML/HTML documents.
Install:
pip install lxml
Import:
from lxml import etree
Create a document object:
html = etree.HTML('HTML document string')
html = etree.parse('path/to/file.html', etree.HTMLParser())
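The pieces above can be combined into a minimal runnable sketch (the HTML snippet here is made up for illustration):

```python
from lxml import etree

# Parse an HTML string and run a few XPath queries against it.
doc = ("<html><body><div id='box'>"
       "<a href='a.html'>first</a><a href='b.html'>second</a>"
       "</div></body></html>")
html = etree.HTML(doc)

print(html.xpath('//a/text()'))    # text of every a tag: ['first', 'second']
print(html.xpath('//a[1]/@href'))  # href of the first a tag: ['a.html']
```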
1.2 Node selection expressions
Selecting nodes:
/  : select from the root node (returns element objects)
// : select from anywhere in the document, regardless of position (returns element objects)
.  : select from the current node
.. : select from the parent node
/@attr  : get an attribute value
/text() : get the tag's text content
1. Selecting all nodes
* : wildcard matching any node
from lxml import etree
doc = '''
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html' aa='bb'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
<a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
<a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
<a href='image5.html' class='li li-item' name='items'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
<a href='image6.html' name='items'><span><h5>test</h5></span>Name: My image 6 <br /><img src='image6_thumb.jpg' /></a>
</div>
</body>
</html>
'''
html = etree.HTML(doc)
node = html.xpath('//*')
print(node)
"""
[<Element html at 0x1b5918a8680>, <Element head at 0x1b5919bfb80>, ...]
"""
2. Selecting a specific node
//tag_name
node = html.xpath('//head')
print(node)
"""
[<Element head at 0x1ad61d0fb80>]
"""
3. Child nodes
Selecting child tags:
1. //parent_tag/child_tag
2. tag_name/child::child_tag
node = html.xpath('//div/a')
print(node)
node = html.xpath('//a[1]/child::img/@src')
print(node)
node = html.xpath('//a[1]/child::*')
print(node)
4. Descendant nodes
Child/descendant tags:
1. //ancestor_tag//descendant_tag
2. tag_name/descendant::*
node = html.xpath('//body/a')
print(node)
node = html.xpath('//body//a')
print(node)
node = html.xpath('//a[6]/descendant::*')
print(node)
"""
[<Element span at 0x2826142fc80>, <Element h5 at 0x2826142fb80>,
<Element br at 0x2826142fcc0>, <Element img at 0x2826142fd00>]
"""
node = html.xpath('//a[6]/descendant::h5/text()')
print(node)
5. Parent node
child_node/.. selects the parent node
node = html.xpath('//body//a[@href="image1.html"]/..')
print(node)
node = html.xpath('//body//a[1]/..')
print(node)
node = html.xpath('//body//a[1]/parent::div')
print(node)
node = html.xpath('//body//a[1]/parent::*')
print(node)
6. Ancestor nodes
node = html.xpath('//a/ancestor::div')
print(node)
node = html.xpath('//a/ancestor::*')
print(node)
7. Attribute matching
Single attribute value:
tag_name[@attr='value']
Multiple attribute values:
When a tag's class attribute holds several values, an exact match fails; use contains instead:
tag_name[contains(@attr, 'value')]
node = html.xpath('//body//a[@href="image1.html"]')
print(node)
node = html.xpath('//body//a[@class="li"]')
print(node)
node = html.xpath('//body//a[contains(@class,"li")]')
print(node)
8. Getting text content
tag_name/text() gets the current tag's text content
node = html.xpath('//body//a[@href="image1.html"]/text()')
print(node)
node = html.xpath('//body//a/text()')
print(node)
"""
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ',
'Name: My image 4 ', 'Name: My image 5 ', 'Name: My image 6 ']
"""
9. Getting attribute values
tag_name/@attr         gets the current tag's attribute value
tag_name/attribute::*  gets all attribute values
node = html.xpath('//body//a/@href')
print(node)
node = html.xpath('//body//a[1]/@href')
print(node)
node = html.xpath('//a[1]/@aa')
print(node)
node = html.xpath('//a[1]/attribute::*')
print(node)
10. Selecting by position
Forward:
tag_name[n]        indices start at 1
Reverse:
tag_name[last()]   the last one
tag_name[last()-n] the (n+1)-th from the end
node = html.xpath('//a[2]/text()')
print(node)
node = html.xpath('//a[2]/@href')
print(node)
node = html.xpath('//a[last()]/@href')
print(node)
node = html.xpath('//a[last()-1]/@href')
print(node)
11. Position conditions
tag_name[position()<n]
node = html.xpath('//a[position()<3]/@href')
print(node)
12. Finding sibling nodes
following         : all nodes after the current node in document order (including their descendants)
following-sibling : nodes at the same level after the current node (siblings only)
node = html.xpath('//a[3]/following-sibling::*')
print(node)
node = html.xpath('//a[3]/following-sibling::a')
print(node)
node = html.xpath('//a[1]/following-sibling::*[2]')
print(node)
node = html.xpath('//a[1]/following-sibling::*[2]/@href')
print(node)
1.3 Copying an XPath path
In the browser devtools you can right-click an element and copy its path:
XPath path: //*[@id="hotsearch-content-wrapper"]/li[1]/a/span[2]
Full XPath path: /html/body/div[1]/div[1]/div[5]/div/div/div[3]/ul/li[1]/a/span[2]
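A copied path can be passed to xpath() unchanged. A small sketch, using a made-up fragment shaped like the structure the copied path expects:

```python
from lxml import etree

# Minimal made-up fragment mimicking the element the devtools path points at.
doc = """
<ul id="hotsearch-content-wrapper">
  <li><a><span>1</span><span>hot topic</span></a></li>
</ul>
"""
html = etree.HTML(doc)
# The XPath copied from devtools works as-is:
print(html.xpath('//*[@id="hotsearch-content-wrapper"]/li[1]/a/span[2]/text()'))  # ['hot topic']
```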
2. Web application testing tool
Selenium is a tool for testing web applications.
requests is fast and supports multithreading, but it cannot execute JavaScript directly.
Selenium is used in crawlers precisely to work around that limitation, at the cost of speed.
2.1 Installing Selenium
Install Selenium: pip3 install selenium==3.141.0
(Many of the methods used below are deprecated in the latest versions...)
2.2 Downloading the driver
Chrome driver download site: http://npm.taobao.org/mirrors/chromedriver/
* 1. Check your Chrome version
* 2. Download the matching driver version (driver versions are backward compatible)
* 3. Unzip to get a single executable (no installation needed)
2.3 Waiting for elements to load
A page takes time to load, while code looks tags up almost instantly; the lookup may run
before the tag has loaded and raise an error if it is not found.
Wait for the tag to finish loading before running the lookup.
Two approaches:
1. Explicit wait: write wait logic for each individual tag.
2. Implicit wait: write one rule that all tag lookups follow.
browser_object.implicitly_wait(seconds)  raises an error on timeout
from selenium import webdriver
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('https://www.jd.com/')
bro.implicitly_wait(10)
...
2.4 Basic usage
Create a browser object:
browser_object = webdriver.Chrome(executable_path='driver path')
Open a page:
browser_object.get('URL')
Print the page source:
browser_object.page_source
Close the current page:
browser_object.close()
Quit the browser:
browser_object.quit()
from selenium import webdriver
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('https://www.baidu.com/')
print(bro.page_source)
bro.close()
2.5 Finding tags
(Many of these methods are deprecated in the latest versions!)
# =============== find-series element lookup methods ===============
Singular (element):
1. find_element_by_id                 find by id
2. find_element_by_link_text          find an a tag by its exact text
3. find_element_by_partial_link_text  find an a tag by its text, fuzzy match
4. find_element_by_tag_name           find by tag name
5. find_element_by_class_name         find by class name
6. find_element_by_name               find by the name attribute
7. find_element_by_css_selector       find by CSS selector
8. find_element_by_xpath              find by XPath selector
Plural (elements):
Note: the find_elements_by_xxx forms find multiple elements and return a list.
element.send_keys('keyword')   type a search keyword into a control
element.clear()                clear the typed content
element.click()                click a button
element.get_attribute('attr')  get an element's attribute
element.text                   get the element's text
How to copy a CSS selector (from devtools):
#app > div > div > div > div.el-col.el-col-24 > section > div >
div.scroll_main.el-scrollbar__wrap.el-scrollbar__wrap--hidden-default > div
1. Automated search example
* 1. Find the input box
* 2. Find the search button
from selenium import webdriver
bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('https://www.so.com/')
search = bro.find_element_by_id('input')
search.send_keys("美女")
button = bro.find_element_by_id('search-button')
button.click()
print(bro.page_source)
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(bro.page_source)
bro.close()
2. Automated login example
Fully automated login keeps getting harder; handle the captcha manually when necessary.
* 1. Find the login button
* 2. Find the account-login tab
* 3. Find the username and password input boxes
* 4. Find the submit button
* 5. Login code (captcha handled manually)
from selenium import webdriver
import time
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('https://www.baidu.com/')
bro.implicitly_wait(10)
user_login = bro.find_element_by_id('s-top-loginbtn')
user_login.click()
account_login = bro.find_element_by_id('TANGRAM__PSP_11__changePwdCodeItem')
account_login.click()
username = bro.find_element_by_id('TANGRAM__PSP_11__userName')
username.send_keys('1360012768@qq.com')
password = bro.find_element_by_id('TANGRAM__PSP_11__password')
password.send_keys('1314.qqq')
button = bro.find_element_by_id('TANGRAM__PSP_11__submit')
button.click()
time.sleep(10)
bro.close()
2.6 Headless browser
Selenium normally opens a browser window; a crawler doesn't need to show one, so configure a headless browser instead.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('window-size=1920x3000')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--hide-scrollbars')
chrome_options.add_argument('blink-settings=imagesEnabled=false')
chrome_options.add_argument('--headless')
bro = webdriver.Chrome(executable_path='./chromedriver.exe', options=chrome_options)
bro.get('https://www.baidu.com')
print(bro.page_source)
2.7 Cropping images with Pillow
Install the Pillow module: pip install pillow
browser_object.save_screenshot('path')  saves the whole page as an image
element.location   the element's top-left coordinates
element.size       the element's dimensions
element.id         the element id (assigned by Selenium)
element.tag_name   the element's tag name
from selenium import webdriver
from PIL import Image
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('https://www.jd.com/')
img = bro.find_element_by_css_selector('a.logo_tit_lk')
print(img.location)
print(img.size)
print(img.id)
print(img.tag_name)
location = img.location
size = img.size
bro.save_screenshot('./main.png')
img_tu = (int(location['x']), int(location['y']),
          int(location['x'] + size['width']), int(location['y'] + size['height']))
img = Image.open('./main.png')
code_img = img.crop(img_tu)
code_img.save('./code.png')
bro.close()
2.8 Executing JavaScript
browser_object.execute_script('js code')
Common uses:
1. Run JS code
2. Use the page's own variables and functions
1. alert dialog
from selenium import webdriver
import time
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('https://www.csdn.net/')
bro.execute_script("alert('hello')")
time.sleep(3)
bro.switch_to.alert.accept()
bro.close()
2. Scrolling the page
Vertical scroll: window.scrollBy(x_offset, y_offset)
document.body.scrollHeight gets the page height
from selenium import webdriver
import time
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('https://www.csdn.net/')
bro.execute_script("window.scrollBy(0, 500)")
time.sleep(2)
bro.execute_script("window.scrollBy(0, document.body.scrollHeight)")
time.sleep(2)
bro.close()
3. Using page variables
Use variables defined in the page itself.
from selenium import webdriver
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('https://www.baidu.com')
bro.execute_script('console.log(bds)')
2.9 Tab operations
Tab --> a newly opened page
Open a new tab: window.open()
Get all tabs: browser_object.window_handles
Switch tabs: browser_object.switch_to.window(tab_handle)
from selenium import webdriver
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('https://www.baidu.com')
bro.execute_script('window.open()')
all_window = bro.window_handles
bro.switch_to.window(all_window[0])
bro.get('https://www.cnblogs.com/')
bro.switch_to.window(all_window[1])
bro.get('https://www.csdn.net/')
bro.close()
bro.quit()
2.10 Page back and forward
Back:    browser_object.back()
Forward: browser_object.forward()
from selenium import webdriver
import time
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('https://www.baidu.com')
bro.get('https://www.taobao.com')
bro.get('https://www.bilibili.com/')
time.sleep(1)
bro.back()
time.sleep(1)
bro.back()
time.sleep(1)
bro.forward()
time.sleep(1)
bro.forward()
time.sleep(1)
bro.quit()
2.11 Exception handling
from selenium import webdriver
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
try:
    bro.get('https://www.baidu.com')
    bro.find_element_by_id('xxxx')
except Exception as e:
    print(f'Error: {e}')
finally:
    bro.quit()  # quit in finally so the browser closes even without an exception
2.12 Semi-automated login to cnblogs
Steps:
1. Semi-automatically log in to cnblogs.
2. Save the cookies locally.
3. Visit cnblogs carrying those cookies.
* 1. Find the login link
* 2. Find the password-login tab
* 3. Find the username and password inputs
* 4. Log in semi-automatically and save the cookies locally
from selenium import webdriver
import json
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
try:
    bro.get('https://www.cnblogs.com/')
    bro.implicitly_wait(10)
    login_button = bro.find_element_by_link_text('登录')
    login_button.click()
    password_button = bro.find_element_by_class_name('mat-tab-label-content')
    password_button.click()
    username_input = bro.find_element_by_id('mat-input-0')
    username_input.send_keys('your username')
    password_input = bro.find_element_by_id('mat-input-1')
    password_input.send_keys('your password')
    button = bro.find_element_by_class_name('mat-button-wrapper')
    button.click()
    input()  # pause: finish the captcha manually, then press Enter
    with open('cookie.json', mode='w') as wf:
        json.dump(bro.get_cookies(), wf)
except Exception as e:
    print(f'Error: {e}')
finally:
    bro.quit()
* 5. The saved cookie contents
[
{
"domain" : "www.cnblogs.com" ,
"httpOnly" : true ,
"name" : ".AspNetCore.Antiforgery.b8-pDmTq1XM" ,
"path" : "/" ,
"secure" : false ,
"value" : "CfDJ8EOBBtWq0dNFoDS-ZHPSe53mEWd-ZGyjWftpCaA67Ju_PAmyKJdgIMJ6TQroItTC3KugfG1kyhlNdZx9twkZXOMpcOw8OMkPl0v3uajxTJTOJKtxX4sy1Az7e2VbFXcrcgff2l2J1QRpKn75hQ0ldtYSAD"
} ,
{
"domain" : ".cnblogs.com" ,
"expiry" : 1720163214 ,
"httpOnly" : false ,
"name" : "_ga" ,
"path" : "/" ,
"secure" : false ,
"value" : "GA1.2.1702123200.1657091158"
} ,
{
"domain" : ".cnblogs.com" ,
"httpOnly" : true ,
"name" : ".CNBlogsCookie" ,
"path" : "/" ,
"secure" : false ,
"value" : "6AE367FDC883C9497C0965F5DCB0773D77C7B6E04AC8D3483B085CC7C8C7FD46E080F1CFF9028730A81B4781393E850814E684ABDFA2FFD7D01C0CAEB96C28EA39E26578AFF0E5355617C5C2A5191DB59937CC937D"
} ,
{
"domain" : ".cnblogs.com" ,
"expiry" : 1690787158 ,
"httpOnly" : false ,
"name" : "__gpi" ,
"path" : "/" ,
"secure" : false ,
"value" : "UID=00000769a61d749c:T=1657091158:RT=1657091158:S=ALNI_MYpovhSSJNIllzFre6jRxKvDbXmXA"
} ,
{
"domain" : ".cnblogs.com" ,
"expiry" : 1690787158 ,
"httpOnly" : false ,
"name" : "__gads" ,
"path" : "/" ,
"secure" : false ,
"value" : "ID=614211f6e18ef14e:T=1657091158:S=ALNI_MabIFcMdHavfJFTtGjdvxUNM6oWJA"
} ,
{
"domain" : ".cnblogs.com" ,
"expiry" : 1657177614 ,
"httpOnly" : false ,
"name" : "_gid" ,
"path" : "/" ,
"secure" : false ,
"value" : "GA1.2.2027133757.1657091158"
} ,
{
"domain" : ".cnblogs.com" ,
"httpOnly" : true ,
"name" : ".Cnblogs.AspNetCore.Cookies" ,
"path" : "/" ,
"secure" : false ,
"value" : "CfDJ8EOBBtWq0dNFoDS-ZHPSe50ngXRAr8WvkjMPVK2CErFjHpfDDCUA5wWx_coJ_pBtFO5I5aDCaZKVAU3ENMhSzukVskoTcTgvCsxz6lBceGIdIGBAjpxkahkqzDHb323TpdV2X3KMcJUTH-Fzz5NDhvMzDBfrcgOuvhUiu67tqzJeweta9Ld_qo2d7zGzHcCQOhVZJAXsZYB6lERqnNx83pRWzwUbmeoxPjvpQiILl6Amab0RkkoGS4wP5K1l0_gn1XBdke5Vp2fXqVIAJoIpV12PC2AjcrV2ABKdYMts_qAZ6UrhK_Rk7cc8wrvyNPP63dvg8pqsceIPl45GS0XuqfPLg1K9nCydFp426a-2UUix2pIwyxKDsq3IpP6qgq4QlkzfZm9CvgF7Tq-14s4327l9uCJEYmrNyeghaBM-4WhHabI_FD6K-xweqaFVx_n5aN5vhXV9yFRiUOFD71kn5FcwOhnImFKDHnmRUaSSy4AyhawQ8hT6UTQcXcigkDStc4wkz-jXpsDdYYxED3fZAp9IwLQv63U9mEG51LlyM7jQ8"
} ,
{
"domain" : ".cnblogs.com" ,
"httpOnly" : false ,
"name" : "Hm_lpvt_866c9be12d4a814454792b1fd0fed295" ,
"path" : "/" ,
"secure" : false ,
"value" : "1657091215"
} ,
{
"domain" : ".cnblogs.com" ,
"expiry" : 1720163214 ,
"httpOnly" : false ,
"name" : "_ga_3Q0DVSGN10" ,
"path" : "/" ,
"secure" : false ,
"value" : "GS1.1.1657091159.1.1.1657091214.5"
} ,
{
"domain" : ".cnblogs.com" ,
"expiry" : 1688627214 ,
"httpOnly" : false ,
"name" : "Hm_lvt_866c9be12d4a814454792b1fd0fed295" ,
"path" : "/" ,
"secure" : false ,
"value" : "1657091158"
}
]
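Depending on the driver version, add_cookie can reject entries carrying extra keys; a hedged helper that keeps only the commonly accepted fields (the allowed set here is an assumption, trim it to taste):

```python
def sanitize_cookie(item):
    # Keep only the keys add_cookie reliably accepts; drop the rest (e.g. httpOnly).
    allowed = {'name', 'value', 'path', 'domain', 'secure', 'expiry'}
    return {k: v for k, v in item.items() if k in allowed}

raw = {"domain": ".cnblogs.com", "httpOnly": False, "name": "_ga",
       "path": "/", "secure": False, "value": "GA1.2.1702123200.1657091158"}
print(sanitize_cookie(raw))  # httpOnly is dropped
```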
* 6. Visit cnblogs carrying the cookies
from selenium import webdriver
import json
import time
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
try:
    bro.get('https://www.cnblogs.com/')
    with open('cookie.json', mode='r', encoding='utf8') as rf:
        cookie = json.load(rf)
    for item in cookie:
        bro.add_cookie(item)
    bro.refresh()
    time.sleep(3)
except Exception as e:
    print(f'Error: {e}')
finally:
    bro.quit()
2.13 Auto-liking articles on Chouti
Steps:
1. Use Selenium to semi-automatically log in to Chouti and grab the cookies.
2. Use requests with those cookies to like articles in bulk.
* 1. Find the login button
* 2. Find the phone-number login tab
* 3. Find the phone input, the password input, and the login button
* 4. Automated login code
from selenium import webdriver
import json
import time
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
try:
    bro.get('https://dig.chouti.com/')
    bro.implicitly_wait(10)
    login_button = bro.find_element_by_id('login_btn')
    bro.execute_script('arguments[0].click()', login_button)
    phone_login = bro.find_element_by_link_text('手机号登录')
    phone_login.click()
    phone_input = bro.find_element_by_name('phone')
    phone_input.send_keys('account')
    time.sleep(2)
    password_input = bro.find_element_by_name('password')
    password_input.send_keys('password')
    time.sleep(2)
    # click the submit button, not the password input again (adjust the selector to the live page)
    button_btn = bro.find_element_by_css_selector('button[type="submit"]')
    button_btn.click()
    input()  # pause: finish any captcha manually, then press Enter
    with open('chouti_cookie.json', mode='w') as wf:
        json.dump(bro.get_cookies(), wf)
except Exception as e:
    print(f'Error: {e}')
finally:
    bro.quit()
* 5. The cookies obtained
[
{
"domain" : "dig.chouti.com" ,
"expiry" : 2147483647 ,
"httpOnly" : false ,
"name" : "YD00000980905869%3AWM_NI" ,
"path" : "/" ,
"secure" : false ,
"value" : "hVmgjDuEehm%2F6tUcue5fPsyZBX4g%2BiVrsda5Y2A%2BAlPh5Q9JvDwOUT75TtZvqQSBAJT0GPwQDrndVOoDV6BF%2FM2FysGrBvko6XTGutmHh5yXaXVnRwGhFNF6B0E2IN3UpudUlU%3D"
} ,
{
"domain" : "dig.chouti.com" ,
"expiry" : 1814786532 ,
"httpOnly" : false ,
"name" : "_9755xjdesxxd_" ,
"path" : "/" ,
"secure" : false ,
"value" : "32"
} ,
{
"domain" : "dig.chouti.com" ,
"expiry" : 1814786532 ,
"httpOnly" : false ,
"name" : "gdxidpyhxdE" ,
"path" : "/" ,
"secure" : false ,
"value" : "XUz83Gg7sk4v6wKgX6oScjyLZD7IOVNSrpWzlqERCDA2o1hH1BbZYPc58ewHkCKaUMqoZyHX%2BNtoujYBmJlLnvPj1cg6yK2nlPDJDbWKGZo%2FICGr%5CLmiL2ZHNV9lEvGjnRsa%2B%5CArVE1PLTD7%2FnAD7Jbrm%2BKBV7V0IIg6eR%5CLUeRseNE6x6%3A1657106532758"
} ,
{
"domain" : "dig.chouti.com" ,
"expiry" : 2147483647 ,
"httpOnly" : false ,
"name" : "YD00000980905869%3AWM_TID" ,
"path" : "/" ,
"secure" : false ,
"value" : "wLMI2HmPOO5FVVQAQBfUBrdKG0JzBhyT"
} ,
{
"domain" : "dig.chouti.com" ,
"expiry" : 1688641631 ,
"httpOnly" : false ,
"name" : "__snaker__id" ,
"path" : "/" ,
"secure" : false ,
"value" : "akew1JdgZb6KMz9y"
} ,
{
"domain" : ".chouti.com" ,
"httpOnly" : false ,
"name" : "Hm_lpvt_03b2668f8e8699e91d479d62bc7630f1" ,
"path" : "/" ,
"secure" : false ,
"value" : "1657105631"
} ,
{
"domain" : ".chouti.com" ,
"expiry" : 1688641631 ,
"httpOnly" : false ,
"name" : "Hm_lvt_03b2668f8e8699e91d479d62bc7630f1" ,
"path" : "/" ,
"secure" : false ,
"value" : "1657105631"
} ,
{
"domain" : "dig.chouti.com" ,
"expiry" : 2147483647 ,
"httpOnly" : false ,
"name" : "YD00000980905869%3AWM_NIKE" ,
"path" : "/" ,
"secure" : false ,
"value" : "9ca17ae2e6ffcda170e2e6eed6e868fbf18dd8c569b4a88ba3c44e979b9facd55eedbc81a8b443ae9fa4d4d22af0fea7c3b92aac8aa388e950b79bab89f025f6ac9b84f86ff7baa094fc528288fc88aa6396aefbbab14be99fa4b1eb3f93e78697c77d8d8b9cd8b860b886ba92d4598sa29f86b5b34d94a99fa5f166a88a8190c75ef69e8ad0d03a86a7bda8f14ab6b5a1d7db6085abbc8ecb64f7a79882db5eae8eb9a8eb4e8e9afaa3b34ff29782d4d87cb7bc9b8dd837e2a3"
} ,
{
"domain" : "dig.chouti.com" ,
"expiry" : 1688641630 ,
"httpOnly" : false ,
"name" : "deviceId" ,
"path" : "/" ,
"secure" : false ,
"value" : "web.eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQisaOiI5ZmY2Nzk5Yy04NTdlLTQ3MGYtOGMzYS0yMTY1ZTE3MDBkZGMiLCJleHBpcmUiOiIxNjU5Njk3NjMwNDQzIn0.ZxRk1tBgdJ4EZraM_AnGOxvKNl6Mgv1x7FJqCfklTTg"
}
]
* Note!!!
* 6. Vote request URL
Request URL: https://dig.chouti.com/link/vote
* 7. Vote payload: linkId: the article id
The article id is in the data-id attribute of the article's div tag (or of the a tag inside it)
* 8. A response is returned once the vote succeeds
* 9. Fetch Chouti with requests and get every article div node
import requests
import json
from bs4 import BeautifulSoup
header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36'
}
res = requests.get('https://dig.chouti.com/', headers=header)
soup = BeautifulSoup(res.text, 'lxml')
div_list = soup.find_all(class_='link-item')
# build a name -> value dict once from the cookies saved by Selenium
cookie = {}
with open('chouti_cookie.json', 'r') as f:
    for item in json.load(f):
        cookie[item['name']] = item['value']
for div in div_list:
    article_id = div.attrs.get('data-id')
    print(article_id)
    if article_id:
        data = {'linkId': article_id}
        res = requests.post('https://dig.chouti.com/link/vote', headers=header, data=data, cookies=cookie)
        print(res.text)
2.14 JD product info
* 1. Find the search box
* 2. Find the search button, or just press Enter
* 3. Extract the product info
* Getting the image kept causing problems: the link suffix is sometimes .jpg and
sometimes .jpg.avif, and only the first four products yield an src; after that it is None.
For lazy-loaded images, the src value is stored in the data-lazy-img attribute instead.
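One pitfall with that suffix: str.strip('.avif') strips *characters* from both ends, not the suffix, so it can eat into the filename. A safe removal sketch:

```python
def strip_avif(url):
    # Remove only a trailing '.avif' suffix (Python 3.9+ could use url.removesuffix('.avif')).
    if url.endswith('.avif'):
        return url[:-len('.avif')]
    return url

print('x.gif.avif'.strip('.avif'))  # 'x.g' -- character stripping went too far
print(strip_avif('x.gif.avif'))     # 'x.gif'
```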
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
bro = webdriver.Chrome(executable_path='./chromedriver.exe')

def get_commodity(bro):
    li_list = bro.find_elements_by_class_name('gl-item')
    for commodity in li_list:
        try:
            name = commodity.find_element_by_css_selector('.p-name em').text
            price = commodity.find_element_by_css_selector('.p-price i').text
            url = commodity.find_element_by_css_selector('.p-img a').get_attribute('href')
            commit = commodity.find_element_by_css_selector('.p-commit a').text
            img = commodity.find_element_by_css_selector('.p-img img').get_attribute('src')
            if not img:
                img = 'https:' + commodity.find_element_by_css_selector('.p-img img').get_attribute('data-lazy-img')
            # remove a trailing .avif suffix (str.strip would strip characters, not the suffix)
            if img.endswith('.avif'):
                img = img[:-len('.avif')]
            print(f"""
            Name: {name}
            Price: {price}
            Link: {url}
            Image: {img}
            Comments: {commit}
            """)
        except Exception:
            continue
    next_button = bro.find_element_by_class_name('pn-next')
    time.sleep(2)
    next_button.click()
    get_commodity(bro)  # recurse to the next page; raises on the last page

try:
    bro.get('https://www.jd.com/')
    bro.implicitly_wait(10)
    search_input = bro.find_element_by_id('key')
    search_input.send_keys('Python')
    search_input.send_keys(Keys.ENTER)
    get_commodity(bro)
except Exception as e:
    print(f'Exception: {e}')
finally:
    bro.quit()
Result:
Name: 零基础学Python(Python3.9全彩版)(编程入门 项目实践 同步视频)
Price: 69.40
Link: https://item.jd.com/12353915.html
Image: https://img10.360buyimg.com/n1/s200x200_jfs/t1/192162/30/9469/137831/60cff716E24a6f3a9/f11a344fb18010fc.jpg
Comments: 200,000+
...
2.15 Action chains
from selenium import webdriver
import time
from PIL import Image
from chaojiying import Chaojiying_Client
from selenium.webdriver import ActionChains
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.implicitly_wait(10)
try:
    bro.get('https://kyfw.12306.cn/otn/resources/login.html')
    bro.maximize_window()
    button_z = bro.find_element_by_css_selector('.login-hd-account a')
    button_z.click()
    time.sleep(2)
    # screenshot the page, then crop out the captcha image
    bro.save_screenshot('./main.png')
    img_t = bro.find_element_by_id('J-loginImg')
    print(img_t.size)
    print(img_t.location)
    size = img_t.size
    location = img_t.location
    img_tu = (int(location['x']), int(location['y']),
              int(location['x'] + size['width']), int(location['y'] + size['height']))
    img = Image.open('./main.png')
    fram = img.crop(img_tu)
    fram.save('code.png')
    # send the captcha image to the Chaojiying recognition service
    chaojiying = Chaojiying_Client('username', 'password', '903641')
    im = open('code.png', 'rb').read()
    res = chaojiying.PostPic(im, 9004)
    print(res)
    result = res['pic_str']
    # parse the returned "x1,y1|x2,y2|..." coordinate string
    all_list = []
    if '|' in result:
        for pair in result.split('|'):
            x, y = pair.split(',')
            all_list.append([int(x), int(y)])
    else:
        x, y = result.split(',')
        all_list.append([int(x), int(y)])
    print(all_list)
    # click each recognized point, offset from the captcha image's top-left corner
    for x, y in all_list:
        ActionChains(bro).move_to_element_with_offset(img_t, x, y).click().perform()
        time.sleep(1)
    username = bro.find_element_by_id('J-userName')
    username.send_keys('account')
    password = bro.find_element_by_id('J-password')
    password.send_keys('password')
    time.sleep(3)
    submit_login = bro.find_element_by_id('J-login')
    submit_login.click()
    time.sleep(3)
    print(bro.get_cookies())
    time.sleep(10)
    bro.get('https://www.12306.cn/index/')
    time.sleep(5)
except Exception as e:
    print(e)
finally:
    bro.close()