Python Web Scraping Study Notes, Chapter 2: XPath

小兔子要开心呀

Published 2024-08-09 16:11:41

Tags: python, web scraping study

Article link: https://blog.csdn.net/m0_58905839/article/details/140958355

Chapter 2: Learning XPath

2.1 Installing the XPath plugin

First, install an XPath plugin in the browser. The keyboard shortcut that toggles it on and off is Ctrl+Shift+X.

2.2 Basic XPath usage

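Basic XPath syntax
1. Path queries:
// : selects all descendant nodes, regardless of depth
/ : selects direct child nodes only
2. Predicate queries:
//div[@id]
//div[@id="maincontent"]
3. Attribute queries:
//@class
4. Fuzzy queries:
//div[contains(@id, "he")]
//div[starts-with(@id, "he")]    starts-with: the value begins with ...
5. Content queries:
//div/h1/text()
6. Logical operators:
//div[@id="head" and @class="s_down"]
//title | //price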
XPath parsing
(1) Local file: etree.parse()
(2) Data from a server response, i.e. response.read().decode('utf-8'): etree.HTML()
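
The difference between the two entry points can be seen in a minimal sketch. It assumes lxml is installed; the markup string and the variable names are made up for illustration. etree.parse() uses the XML parser unless told otherwise, so it rejects HTML that is not well-formed, while etree.HTML() always uses the tolerant HTML parser.

```python
import io
from lxml import etree

# Hypothetical markup: the <li> tags are never closed, so this is not well-formed XML.
broken_html = "<html><body><ul><li id='l1'>Beijing<li id='l2'>Shanghai</ul>"
as_file = broken_html.encode("utf-8")

# (1) File-like source: etree.parse() defaults to the XML parser, which rejects this input.
try:
    etree.parse(io.BytesIO(as_file))
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# Passing an HTMLParser makes etree.parse() tolerant of real-world HTML.
tree_from_file = etree.parse(io.BytesIO(as_file), etree.HTMLParser())

# (2) String source (for example a decoded server response): etree.HTML() uses the HTML parser.
tree_from_string = etree.HTML(broken_html)

file_result = tree_from_file.xpath("//li/text()")
string_result = tree_from_string.xpath("//li/text()")
print(xml_ok, file_result, string_result)
```

If a local file is ordinary, loosely written HTML, passing etree.HTMLParser() to etree.parse() is probably the simpler choice than editing the file until it is valid XML.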

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8"/>
    <title>Title</title>
</head>
<body>
    <ul>
        <li id="l1" class="c1">北京</li>
        <li id="l2" class="c2">上海</li>
        <li>深圳</li>
        <li>武汉</li>
    </ul>
</body>
</html>

from lxml import etree

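# XPath parsing
# (1) Local file:                                               etree.parse()
# (2) Server response data, response.read().decode('utf-8'):    etree.HTML()

# Parse a local file with XPath.
# etree.parse() uses the XML parser by default, so the file must be well-formed
# (which is why the <meta> tag in the sample file is self-closed).
tree = etree.parse('01_解析_xpath的基本使用.html')

# Find the li elements under ul
# text() returns the text content of an element
# li_list = tree.xpath("//ul/li/text()")

# Find all li elements that have an id attribute
# li_list = tree.xpath("//ul/li[@id]/text()")
# # Check the length of the list
# print(len(li_list))

# Get the class attribute of the li whose id is l1 (mind the nested quotes)
# li = tree.xpath("//ul/li[@id='l1']/@class")

# Fuzzy query: find li elements whose id contains "l"
# li_list = tree.xpath("//ul/li[contains(@id,'l')]/text()")

# starts-with: the value begins with ...
# Find li elements whose id starts with "l"
# li_list = tree.xpath("//ul/li[starts-with(@id,'l')]/text()")

# Find the li whose id is l1 and whose class is c1
# li_list = tree.xpath("//ul/li[@id='l1' and @class='c1']/text()")

# Find the li whose id is l1 or whose id is l2
li_list = tree.xpath("//ul/li[@id='l1']/text() | //ul/li[@id='l2']/text()")
print(li_list)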
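
To see what each query returns without creating a local file, the same list can be inlined as a string and parsed with etree.HTML(). This is only a sketch: the inlined string and the variable names are assumptions, and the last query (`or` inside a single predicate) is an alternative to the `|` union that the original code does not use.

```python
from lxml import etree

# The <ul> from the sample file, inlined as a string (an assumption for portability).
html = """
<ul>
    <li id="l1" class="c1">北京</li>
    <li id="l2" class="c2">上海</li>
    <li>深圳</li>
    <li>武汉</li>
</ul>
"""
tree = etree.HTML(html)

all_items = tree.xpath("//ul/li/text()")
with_id = tree.xpath("//ul/li[@id]/text()")
class_of_l1 = tree.xpath("//ul/li[@id='l1']/@class")
id_contains_l = tree.xpath("//ul/li[contains(@id,'l')]/text()")
id_starts_with_l = tree.xpath("//ul/li[starts-with(@id,'l')]/text()")
id_and_class = tree.xpath("//ul/li[@id='l1' and @class='c1']/text()")
union = tree.xpath("//ul/li[@id='l1']/text() | //ul/li[@id='l2']/text()")
# Alternative to the union: "or" inside one predicate
either_id = tree.xpath("//ul/li[@id='l1' or @id='l2']/text()")

print(all_items)     # ['北京', '上海', '深圳', '武汉']
print(with_id)       # ['北京', '上海']
print(class_of_l1)   # ['c1']
print(id_and_class)  # ['北京']
print(union)         # ['北京', '上海']
```

Every call returns a list, even when only one node matches, so a single value has to be taken out with an index.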

2.3 Getting the "百度一下" button text from the Baidu site

import urllib.request
from lxml import etree

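url = "https://www.baidu.com"

# Send a browser-like User-Agent with the request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
}

# Build the request object
request = urllib.request.Request(url=url, headers=headers)

# Send the request and get the server response
response = urllib.request.urlopen(request)

# Decode the response bytes into a string
content = response.read().decode('utf-8')

# Server response data is parsed with etree.HTML()
tree = etree.HTML(content)

# xpath() returns a list; read the value attribute of the input whose id is "su"
li = tree.xpath("//input[@id='su']/@value")

print(li)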
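
Because xpath() returns a list, the script above prints a list, and that list is empty if the page markup changes or the server sends a different page. A small helper can make that case explicit. This is a sketch only: `first_or_none` is a hypothetical helper, and the inline markup is a stand-in for the downloaded page, so the real Baidu markup may differ.

```python
from lxml import etree

def first_or_none(tree, expression):
    """Return the first XPath match, or None when nothing matched."""
    matches = tree.xpath(expression)
    return matches[0] if matches else None

# Hypothetical stand-in for the downloaded page content.
sample = '<html><body><form><input type="submit" id="su" value="百度一下"></form></body></html>'
tree = etree.HTML(sample)

button_text = first_or_none(tree, "//input[@id='su']/@value")
missing = first_or_none(tree, "//input[@id='does-not-exist']/@value")
print(button_text, missing)
```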

2.4 Scraping images from 站长素材 (sc.chinaz.com)

# (1) Build the request object
# (2) Fetch the page source
# (3) Download

import os
import urllib.request
from lxml import etree

# https://sc.chinaz.com/tupian/shugantupian.html
# https://sc.chinaz.com/tupian/shugantupian_2.html
# https://sc.chinaz.com/tupian/shugantupian_3.html

def create_request(page):
    # Page 1 has no numeric suffix; later pages append _<page>
    if page == 1:
        url = "https://sc.chinaz.com/tupian/shugantupian.html"
    else:
        url = "https://sc.chinaz.com/tupian/shugantupian_" + str(page) + ".html"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
    }

    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content

def down_load(content):
    # Download the images
    tree = etree.HTML(content)
    name_list = tree.xpath("//div/img/@alt")
    # Sites with many images usually lazy-load them, so the URL is read from data-original here
    src_list = tree.xpath("//div/img/@data-original")
    # Create the target directory first; urlretrieve fails if it does not exist
    os.makedirs('./picture', exist_ok=True)
    # zip() stops at the shorter list, so a length mismatch cannot raise IndexError
    for name, src in zip(name_list, src_list):
        url = "https:" + src

        urllib.request.urlretrieve(url=url, filename='./picture/' + name + '.jpg')

if __name__ == '__main__':
    start_page = int(input("Enter the start page: "))
    end_page = int(input("Enter the end page: "))

    for page in range(start_page, end_page + 1):
        # (1) Build the request object
        request = create_request(page)
        # (2) Fetch the page source
        content = get_content(request)
        # (3) Download
        down_load(content)
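
The string handling in the script (building page URLs, turning a scraped src into a full URL, using the alt text as a file name) can be checked offline, without touching the network. The sketch below uses only the standard library. `page_url`, `image_url`, and `safe_filename` are hypothetical helpers that are not part of the original script, and the img.example.com addresses are made up.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://sc.chinaz.com/tupian/shugantupian.html"

def page_url(page):
    """Page 1 has no suffix; later pages insert _<page> before .html."""
    if page == 1:
        return BASE_URL
    return BASE_URL.replace(".html", "_" + str(page) + ".html")

def image_url(src, page_address=BASE_URL):
    """Resolve a scraped src (absolute, protocol-relative, or relative) to a full URL."""
    return urljoin(page_address, src)

def safe_filename(name, extension=".jpg"):
    """Replace characters that are not allowed in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", name).strip()
    return (cleaned or "untitled") + extension

urls = [page_url(p) for p in range(1, 4)]
protocol_relative = image_url("//img.example.com/a/b.jpg")
already_absolute = image_url("https://img.example.com/a/b.jpg")
site_relative = image_url("/files/b.jpg")
print(urls)
print(protocol_relative, already_absolute, site_relative)
print(safe_filename("a/b:c"), safe_filename(""))
```

Using urljoin instead of prepending "https:" keeps the code working if the site ever switches between protocol-relative and absolute image addresses, and cleaning the alt text avoids a failed save when a title contains a slash.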