2021-12-14 Python requests third-party library, part 2: document structure

紫云无堤

Category: Python. Tags: python, html

Article link: https://blog.csdn.net/Vissence/article/details/121939519

Contents

Web page structure
- HTML
- XML
- JSON
Parsing HTML with BeautifulSoup4
Filtering with select and select_one
- The select method
- The select_one method
Testing it out

Environment: macOS, Python 3
References: Requests API; reference book download

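Web page structure

A brief introduction to the structure of HTML, XML, and JSON.

HTML and XML documents are both represented as a DOM tree. For details, see:
HTML DOM tutorial
XML DOM tutorial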

HTML

HTML: HyperText Markup Language.

<html>
    <head>
        <title> xxx </title>
        <link href='xxx'>
        <script src='xxx'></script>
        <style>
        </style>
    </head>
    <body>
        <div class='xxx' id='xxx'></div>
    </body>
</html>

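Tags

Tag	Description
html	The whole page
head	The header: descriptive metadata, script imports, styles, and so on
body	The page content, i.e. what is visible
div	A block, used to organize the page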

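Attributes
Tag attributes are key-value pairs, such as name="value". Commonly used ones:

Key	Value
id	Unique ID of the element
class	Class name of the element
href	Hyperlink target
src	Address of the referenced file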

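XML

XML: eXtensible Markup Language.

Similarities between XML and HTML:
- Both are markup languages.
- Content is defined by tags.
- Both follow W3C standards.
Differences between XML and HTML:
- HTML uses a predefined set of tags; XML has no predefined tags, so you define your own.
- HTML is used to display data; XML is used to structure, transport, and store data.
- HTML allows some closing tags to be omitted; XML requires every tag to be closed.
- HTML tag names are case-insensitive; XML tag names are case-sensitive.
- An HTML document has no mandatory elements; an XML document must have a root element.

Example: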

<?xml version="1.0" encoding="utf-8"?>
<mail>
	<to>Cuihua</to>
	<from>Ergou</from>
	<heading>Reminder</heading>
	<body>See you at the cinema tomorrow afternoon!</body>
</mail>

JSON

JSON: JavaScript Object Notation.
Its structure is similar to a Python dictionary.
Example:

{
	"links": [
		{
			"name": "Google",
			"url": "http://www.google.com"
		}, {
			"name": "Baidu",
			"url": "http://www.baidu.com"
		}
	] 
}
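
To make the comparison with Python dictionaries concrete, here is a minimal sketch that parses strings modeled on the JSON and XML samples above using only the standard library. The variable names (json_text, xml_text, names, fields) are illustrative assumptions, not part of the original post.

```python
import json
import xml.etree.ElementTree as ET

# Strings modeled on the JSON and XML samples (values are illustrative).
json_text = """
{
    "links": [
        {"name": "Google", "url": "http://www.google.com"},
        {"name": "Baidu", "url": "http://www.baidu.com"}
    ]
}
"""

xml_text = """<mail>
    <to>Cuihua</to>
    <from>Ergou</from>
    <heading>Reminder</heading>
    <body>See you at the cinema tomorrow afternoon!</body>
</mail>"""

# A JSON object becomes a dict and a JSON array becomes a list.
data = json.loads(json_text)
names = [link["name"] for link in data["links"]]

# An XML document has exactly one root element; iterating it yields its children.
root = ET.fromstring(xml_text)
fields = {child.tag: child.text for child in root}

print(type(data).__name__, names)
print(root.tag, fields["to"], fields["heading"])
```

JSON maps directly onto dicts and lists, while XML is navigated as a tree starting from the root element.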

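Parsing HTML with BeautifulSoup4

We use the BeautifulSoup4 library for parsing. Its official tutorial is "Installing Beautiful Soup"; if you are short on time, you can skip it.

Installing BeautifulSoup4

On a Mac, either one of the following commands is enough; for other systems, see the official documentation.

pip install beautifulsoup4
pip3 install beautifulsoup4

Parsing in Python (note: the Host value in header must match the site being requested, or simply be commented out)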

import requests
from bs4 import BeautifulSoup

# Request
url = "https://www.baidu.com"
header = {
    'Host': 'www.baidu.com',  # Must match the site being requested, or simply comment this line out.
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36',
    'Connection': 'Keep-Alive',
    'Content-Type': 'text/plain; Charset=UTF-8', 'Accept-Language': 'zh-cn',
    'Cookie': '1111;',
}
r = requests.get(url, headers = header)

# Parsing
# ---------------------------------------------------------------
# Parse with the built-in "html.parser"; nothing else needs to be installed
soup = BeautifulSoup(r.content, "html.parser")
print("---------------------------------- parsed content")
print(soup.prettify()[:100])
print("---------------------------------- page title")
print(soup.title)


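Why can BeautifulSoup use HTML tags as if they were attributes?

I searched online for material on this BeautifulSoup usage for several days and found nothing that explains how it works, so I looked into it myself.
Start with help(soup) to read the built-in help (note: soup here is the variable from soup = BeautifulSoup(…)). It prints several pages of documentation; the relevant parts are picked out below.

Parsing returns soup, whose type is bs4.BeautifulSoup, a subclass of bs4.element.Tag.
Tags can be used like attributes because bs4.element.Tag defines a custom __getattr__ method.
__getattr__ is a special method: Python calls it when normal attribute lookup fails, that is, when you access an attribute the object does not have. See the official Python documentation for details.
The source of BeautifulSoup's custom __getattr__ is shown below, taken from:
[/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/bs4/element.py]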

    def __getattr__(self, tag):
        """Calling tag.subtag is the same as calling tag.find(name="subtag")"""
        #print("Getattr %s.%s" % (self.__class__, tag))
        if len(tag) > 3 and tag.endswith('Tag'):
            # BS3: soup.aTag -> "soup.find("a")
            tag_name = tag[:-3]
            warnings.warn(
                '.%(name)sTag is deprecated, use .find("%(name)s") instead. If you really were looking for a tag called %(name)sTag, use .find("%(name)sTag")' % dict(
                    name=tag_name
                )
            )
            return self.find(tag_name)
        # We special case contents to avoid recursion.
        elif not tag.startswith("__") and not tag == "contents":
            return self.find(tag)
        raise AttributeError(
            "'%s' object has no attribute '%s'" % (self.__class__, tag))

So soup.<tag> is effectively the same as calling soup.find(tag). Brilliant, I had never noticed this special method before!
[Figure: screenshot of the test]
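
The mechanism can be reproduced without BeautifulSoup. The following is a minimal sketch using a hypothetical Node class (not part of bs4) whose __getattr__ forwards unknown attribute names to find(), in the same spirit as the source quoted above.

```python
class Node:
    """Toy stand-in for a parsed tag (hypothetical, not part of bs4)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def find(self, name):
        # Depth-first search for the first descendant with this name.
        for child in self.children:
            if child.name == name:
                return child
            found = child.find(name)
            if found is not None:
                return found
        return None

    def __getattr__(self, attr):
        # Python calls this only when normal attribute lookup fails,
        # so self.name and self.children never reach this method.
        if attr.startswith("__"):
            raise AttributeError(attr)
        return self.find(attr)


doc = Node("html", [Node("head", [Node("title")]), Node("body")])

print(doc.title is doc.find("title"))  # attribute access is forwarded to find()
print(doc.body.name)
print(doc.missing)                     # no such descendant, so find() returns None
```

As with BeautifulSoup, asking for a tag that does not exist returns None rather than raising AttributeError, which is worth remembering when chaining lookups.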

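Another similar special method
While reading the help output I found a similar method, __call__, which finds all matching tags.
__call__ lives in the same file as __getattr__; its source is: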

    def __call__(self, *args, **kwargs):
        """Calling a Tag like a function is the same as calling its
        find_all() method. Eg. tag('a') returns a list of all the A tags
        found within this tag."""
        return self.find_all(*args, **kwargs)

Testing it out:

Print the meta tag.
Print the first 10 tags (note: soup() with no arguments returns every tag).
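
Since the original test is only available as a screenshot, here is a small sketch of the same idea on an inline document. The HTML string and variable names are assumptions made for illustration; it requires beautifulsoup4 to be installed.

```python
from bs4 import BeautifulSoup

# A tiny inline document, so the example does not depend on the network.
html = """
<html><head><meta charset="utf-8"><title>Demo</title></head>
<body><p>first</p><p>second</p></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup("p")                  # calling soup is the same as soup.find_all("p")
print([p.string for p in paragraphs])
print(soup("p") == soup.find_all("p"))

print(soup.meta)                        # first <meta> tag, resolved through __getattr__
all_tags = soup()                       # no arguments: every tag in the document
print(len(all_tags), [t.name for t in all_tags[:10]])
```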

Extracting content

[Figure: screenshot of the extracted content]
Full code:

import requests
from bs4 import BeautifulSoup

url = "https://www.baidu.com"
header = {
    #'Host': 'www.baidu.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36',
    'Connection': 'Keep-Alive',
    'Content-Type': 'text/plain; Charset=UTF-8', 'Accept-Language': 'zh-cn',
    'Cookie': '1111;',
}
r = requests.get(url, headers = header)

# ---------------------------------------------------------------
# Parse with the built-in "html.parser"
soup = BeautifulSoup(r.content, "html.parser")

a = ['title', 'p']
for k in a:
    for i in soup(k):
        print(k.ljust(5, '.'), ' ', i.string)

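Filtering with soup's find_all

As described above, soup(tag) is the same as soup.find_all(tag), so it can be written directly as in the figure below.
By accident I once wrote the second argument as charset="utf-8", and it still ran. The reason is that find_all accepts arbitrary keyword arguments (**kwargs), collects them into a dictionary, and treats them as attribute filters, so charset="utf-8" has the same effect as passing {"charset": "utf-8"}. This is the same mechanism by which dict(a="1") returns the dictionary {"a": "1"}.
[Figure: screenshot of the find_all call]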
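
A short sketch of the keyword-argument behaviour, assuming beautifulsoup4 is installed. The inline HTML and the helper collect() are hypothetical and exist only to illustrate the point.

```python
from bs4 import BeautifulSoup

html = (
    '<html><head><meta charset="utf-8">'
    '<meta name="author" content="demo"></head>'
    '<body><p id="intro">hi</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# A keyword argument and an explicit attrs dict select the same tags.
by_keyword = soup.find_all("meta", charset="utf-8")
by_attrs = soup.find_all("meta", attrs={"charset": "utf-8"})
print(by_keyword == by_attrs, len(by_keyword))

# "name" is already the first parameter of find_all (the tag name),
# so an HTML attribute called name has to be filtered through attrs.
by_name_attr = soup.find_all("meta", attrs={"name": "author"})
print(by_name_attr)


def collect(**kwargs):
    """Hypothetical helper: shows how keyword arguments arrive as a dict."""
    return kwargs


print(collect(charset="utf-8"))
```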

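Filtering with select and select_one

The select method

Source: https://www.cnblogs.com/yizhenfeng168/p/6979339.html
CSS basics: a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. soup.select() uses the same syntax to filter elements.

Contents of soup: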

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "html.parser")  # assumes: from bs4 import BeautifulSoup

Explanation:

(1) Find by tag name
print(soup.select('title'))
#[<title>The Dormouse's story</title>]
print(soup.select('a'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('b'))
#[<b>The Dormouse's story</b>]

(2) Find by class name
print(soup.select('.sister'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(3) Find by id
print(soup.select('#link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

(4) Combined lookup
# Combined lookup works the same way as combining tag names, class names, and ids in a CSS file. For example, to find the element with id link1 inside a p tag, separate the two parts with a space.
print(soup.select('p #link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
# Direct child lookup
print(soup.select("head > title"))
#[<title>The Dormouse's story</title>]
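
The difference between the space (descendant) and > (direct child) combinators is easy to miss, because both match in the examples above. The sketch below uses a shortened, assumed variant of the sample document to show a case where they differ; it requires beautifulsoup4 to be installed.

```python
from bs4 import BeautifulSoup

# Shortened variant of the sample document (illustrative only).
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

descendants = soup.select("body a")      # <a> at any depth below <body>
children = soup.select("body > a")       # <a> that are direct children of <body>
tag_and_class = soup.select("a.sister")  # tag name and class combined, no space
by_id = soup.select("p > #link2")        # direct child of <p> with id link2

print(len(descendants), len(children))
print([a["id"] for a in tag_and_class])
print(by_id[0].get_text())
```

Here the links sit inside a p element, so "body a" finds them while "body > a" returns an empty list.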