Python Internship, Day 5: Code

Latest recommended article published 2022-02-27 16:29:16

大前端工程师

Category: Python  Tags: python

程麒阁, School of Software Engineering, Dalian University

Article link: https://blog.csdn.net/chengqige/article/details/107283014

XPath Commands

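Yesterday we wrote some simple crawlers and learned a few disguise techniques: IP spoofing, header spoofing, and skipping SSL certificate verification. We fetched entire web pages, saw how sending a request and receiving a response works, and installed the requests package, which simplifies much of the crawler code.
Today we learn XPath commands.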

So what is XPath, and what is it for?

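Yesterday we simply fetched whole pages, but a crawler usually does not need everything on a page. When we only want certain images or text, we can filter the page and extract just those parts.
How do we filter HTML tags?
That is what XPath, today's topic, is for.
To learn more about using XPath, see my friend's site: Common XPath Commands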

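Here are the most commonly used XPath symbols

XPath	Meaning
/	Starts from the root node at the beginning of a path; between steps, selects direct children only (one level down)
//	Selects matching nodes at any depth below the current node (grandchildren included)
@	Selects an attribute of the current tag
text()	Selects the text directly inside a tag, e.g. <div>hello</div> yields hello
string()	Returns all the text under the node, descendants included, as a single string
contains()	Matches part of an attribute value; useful for elements with several classes, e.g. class="list listitem"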

Still not quite clear?
No problem, an example will help. First we give a piece of HTML,
then we run XPath commands against it.

HTML (the whole page)

<bookstore>
    <book>
      <title lang="en">Harry Potter</title>
      <author>J K. Rowling</author> 
      <year>2005</year>
      <price>29.99</price>
    </book>

    <book>
      <title lang="zh">哈利波特2</title>
      <author>J K. Rowling</author> 
      <year>2008</year>
      <price>39.99</price>
    </book>

    <book>
      <title lang="zh">哈利波特3</title>
      <author>J K. Rowling</author> 
      <year>2012</year>
      <price>49</price>
    </book>

</bookstore>

<book>
      <title lang="zh">哈利波特</title>
      <author>J K. Rowling</author> 
      <year>2008</year>
      <price>39.99</price>
</book>

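XPath commands and their meanings

Filtering against the HTML above

XPath command	Meaning
//bookstore	Selects the bookstore node
//bookstore//book	Selects every book node under bookstore
//bookstore//book[position()>1]	Selects the book nodes under bookstore except the first (because of >1); note that positions count from 1, unlike array indexes, which count from 0
//bookstore/book/title[@lang="zh"]	Selects the title nodes whose lang attribute is zh
//bookstore/book/title[@lang="zh"]/text()	Selects the text inside those nodes: ['哈利波特2', '哈利波特3']
//@lang	Selects every lang attribute value in the document and returns them as a list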

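Now that we know the XPath commands, let's use them in code.
Python's standard library supports only a limited subset of XPath (in xml.etree.ElementTree),
so we download and install the lxml package.

Open the Terminal

Type the command and press Enter to install:

pip install lxml

Practice with this code: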

from lxml import etree

data = '''
<bookstore>
    <book>
      <title lang="en">Harry Potter</title>
      <author>J K. Rowling</author> 
      <year>2005</year>
      <price>29.99</price>
    </book>

    <book>
      <title lang="zh">哈利波特2</title>
      <author>J K. Rowling</author> 
      <year>2008</year>
      <price>39.99</price>
    </book>

    <book>
      <title lang="zh">哈利波特3</title>
      <author>J K. Rowling</author> 
      <year>2012</year>
      <price>49</price>
    </book>

</bookstore>

<book>
      <title lang="zh">哈利波特</title>
      <author>J K. Rowling</author> 
      <year>2008</year>
      <price>39.99</price>
</book>
'''
html = etree.HTML(data)
e = html.xpath('//bookstore')  # select the bookstore node
print(e)

e = html.xpath('//bookstore//book')  # select every book node under bookstore
print(e)

e = html.xpath('//bookstore//book[position()>1]')  # book nodes under bookstore, skipping the first (position() is 1-based)
print(e)

e = html.xpath('//bookstore/book/title[@lang="zh"]')  # title nodes whose lang attribute is "zh"
print(e)
# text() returns the text directly inside the current node
e = html.xpath('//bookstore/book/title[@lang="zh"]/text()')
print(e)

e = html.xpath('//@lang')  # find every lang attribute and return its value
print(e)

A suggestion:
run one print at a time and comment out the rest. That makes each XPath command easier to understand.
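
The symbol table also lists string() and contains(), which the practice code does not exercise. Below is a minimal sketch on a made-up snippet (the list markup and the variable names are assumptions for illustration, not part of the course material). It contrasts text() with string() and shows contains() on an element with several classes:

```python
from lxml import etree

# Hypothetical markup, invented for this illustration
snippet = '''
<ul>
  <li class="list listitem"><a href="/a">First <b>item</b></a></li>
  <li class="list"><a href="/b">Second</a></li>
</ul>
'''
doc = etree.HTML(snippet)

# contains() matches when the class attribute includes the given substring
items = doc.xpath('//li[contains(@class, "listitem")]')
print(len(items))  # 1

# text() returns only the text directly inside <a>, as a list
direct_text = doc.xpath('//li[1]/a/text()')
print(direct_text)  # ['First ']

# string() joins the text of <a> and all its descendants into one str
all_text = doc.xpath('string(//li[1]/a)')
print(all_text)  # First item
```

Note that string() returns a plain str rather than a list, so indexing its result with [0] yields a single character.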

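Scraping specific content (images, text) with XPath

Python crawler example 1: scraping courses from Codingke (www.codingke.com)

In class we used the Codingke site as the example. We want to scrape the course information on the search results page and also download each course's cover image.

Press F12 to inspect the HTML, then click the small arrow (the element picker) and select an element to examine it.
Write the XPath command that filters it out, then download the files.

Try the F12 analysis yourself as described above, using the small arrow to locate elements.

Now create a directory inside your project folder (the one shown in bold): right-click it, choose Directory, name it images, and press Enter.

Our approach to writing a crawler:

1. Import the requests package and send a request to fetch the page (with header spoofing and IP spoofing, of course)
2. Open the page in a browser, press F12, and analyze the HTML structure
3. Use XPath commands to filter out the data we want
4. Run the crawl and check the returned data and status code
5. Save the data locally

The code for scraping the Python courses from Codingke: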

# Scrapes course information, downloads the course images, and saves them locally
# 1. Fetch the page with requests  2. Parse it (extract elements from the tags)
import requests
from lxml import etree


# 1. Fetch the page with requests
def get_html(url, params=None, flag='html'):
    headers = {
        'Host': 'www.codingke.com',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    }

    response = requests.get(url=url, params=params, headers=headers)
    # print(response.status_code)
    # print(response.encoding)

    if flag == 'images':
        return response.content  # raw bytes
    elif flag == 'html':
        return response.text  # decoded text

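# 2. Parse (extract elements from the tags)
def parse_html(content):
    # courses: the list of courses
    courses = []
    html = etree.HTML(content)
    # locate each course entry with XPath
    li_list = html.xpath('//ul[@class="search_list"]/li')
    for li in li_list:
        # image path; xpath() returns a list here
        img_src = li.xpath('a[@class="search_img"]/img/@src')
        print('img_src:', img_src)  # ['/files/course/2019/05-14/153718e7b0eb880760.jpg']
        # title text; string() returns a single str, not a list
        a_tag = li.xpath('string(div[@class="search_info"]/a)').strip()  # <a href=''>xxxxx </a>  ->  'xxxxx'
        print("a_tag:", a_tag)
        if not img_src:
            continue  # skip entries that have no image
        # one course is one list
        # why [0] on img_src? xpath() returns a list and the value we want is its first item
        # a_tag is already a str, so indexing it would keep only its first character
        course = [img_src[0], a_tag]  # ['/files/course....', 'xxxx']
        # add the course to the course list
        courses.append(course)  # [['/files/course....', 'xxxx'], ...]

    return courses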


# When this file is run directly, __name__ is '__main__'; when another file imports it, __name__ is this file's module name.
# The guard lets other .py files import the functions defined here without running the script:
# the code below runs only when this file itself is executed.
if __name__ == '__main__':
    url = 'http://www.codingke.com/search/course'
    params = {'keywords': 'python'}
    # fetch the HTML
    content = get_html(url, params, 'html')
    # print(content)
    # parse it
    courses = parse_html(content)

    print(courses)
    # download the images
    for image, title in courses:
        url = 'http://www.codingke.com' + image
        # fetch the image bytes with requests
        content = get_html(url, flag='images')

        filename = url.rsplit('/', 1)[-1]  # ['http://www.codingke.com/files/course/2019/03-05', '170936087d5a394261.jpg']
        print(filename)
        # save locally (the images directory must already exist)
        with open('images/' + filename, 'wb') as fw:
            fw.write(content)
        print('Downloaded:', filename)
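
The download loop builds the image URL by string concatenation and relies on the images directory having been created by hand. As a hedged alternative, here is a small sketch using only the standard library; the helper names image_url, image_filename, and save_image are hypothetical and not part of the original script:

```python
import os
import posixpath
from urllib.parse import urljoin, urlsplit

BASE_URL = 'http://www.codingke.com'  # site root used by the script above


def image_url(src):
    """Resolve a relative (or already absolute) src against the site root."""
    return urljoin(BASE_URL, src)


def image_filename(url):
    """Return the last path segment of the URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


def save_image(content, url, directory='images'):
    """Write the image bytes into directory, creating it if it is missing."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, image_filename(url))
    with open(path, 'wb') as fw:
        fw.write(content)
    return path


url = image_url('/files/course/2019/05-14/153718e7b0eb880760.jpg')
print(url)  # http://www.codingke.com/files/course/2019/05-14/153718e7b0eb880760.jpg
print(image_filename(url))  # 153718e7b0eb880760.jpg
```

In the loop, save_image(content, url) could then replace the open/write lines, with content being the bytes returned by get_html(url, flag='images').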

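Python crawler example 2: scraping Douban movie reviews
The page we scrape: Douban Movie, reviews of 庆余年 (Joy of Life)
Press F12 to analyze the page
As described above, the general crawler approach is: disguise, request, filter, extract, save locally

The code for scraping the Douban reviews: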

# Scrape review information
# Douban
import random

import requests
from lxml import etree

user_agents = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
]


# 1. Fetch the HTML
def get_resource(url, params=None, flag='html'):
    headers = {
        'Host': 'movie.douban.com',
        'User-Agent': random.choice(user_agents)  # pick a random User-Agent per request
    }
    # send the request with requests
    response = requests.get(url=url, params=params, headers=headers)
    # check the response status code
    if response.status_code == 200:
        # choose the return type by flag
        if flag == 'html':
            return response.text
        elif flag == 'media':
            return response.content
    else:
        print('Failed to fetch the resource! Status code:', response.status_code)
    return None

# 2. Parse the page content
# Method 1: XPath
def parse_html(resource):
    html = etree.HTML(resource)
    divs = html.xpath('//div[@id="comments"]/div[@class="comment-item"]')
    for div in divs:
        # avatar image
        image = div.xpath('div[@class="avatar"]/a/img/@src')
        print('image:', image)
        # user info; xpath() returns a list of elements, so take the first
        comment_info = div.xpath('div[@class="comment"]/h3/span[@class="comment-info"]')[0]
        # print(comment_info)
        username = comment_info.xpath('a/text()')
        print('username:', username)
        # user rating; contains() matches a span whose class list includes "rating"
        rating = comment_info.xpath('span[contains(@class,"rating")]/@title')
        print('rating:', rating)
        # comment time
        comment_time = comment_info.xpath('span[contains(@class,"comment-time")]/@title')
        print('comment_time:', comment_time)
        # comment text
        comment = div.xpath('div[@class="comment"]/p//text()')
        print('comment:',''.join(comment).strip())  # strip() removes surrounding whitespace
        print('*' * 50)

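if __name__ == '__main__':
    url = 'https://movie.douban.com/subject/25853071/comments'
    params = {'status': 'P'}
    # call the function to fetch the resource
    resource = get_resource(url=url, params=params)
    # print(resource)
    # parse only if the request succeeded (get_resource returns None otherwise)
    if resource:
        parse_html(resource)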
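
The general approach ends with saving the data locally, but the parser above only prints each field. Below is a minimal sketch of that last step, assuming parse_html were changed to collect one dict per comment. The column names and the helpers first and save_comments are hypothetical, and the sample row is made up:

```python
import csv

# Hypothetical column names, mirroring the values parse_html prints
FIELDS = ['username', 'rating', 'comment_time', 'comment']


def first(values, default=''):
    """xpath() returns a list; take its first item, or a default when empty."""
    return values[0] if values else default


def save_comments(rows, path):
    """Write a list of dicts to a CSV file and return the number of rows."""
    # utf-8-sig adds a BOM so spreadsheet programs detect the encoding
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Made-up sample row standing in for one parsed comment
sample = [{
    'username': first(['some_user']),
    'rating': first([]),  # a comment without a rating yields an empty list
    'comment_time': first(['2020-01-01 12:00:00']),
    'comment': 'placeholder text',
}]
```

Calling save_comments(sample, 'comments.csv') would write a header line followed by one row; the first helper keeps a missing rating from raising IndexError.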
