Scraping information from the web with Python (comparing soup.select() and soup.find_all() in a crawler)

Latest recommended article published on 2022-11-02 09:17:23

LFX今天发财了吗

Column: python入门学习笔记 (Python beginner study notes). Tags: python, html, css

Article link: https://blog.csdn.net/qq_45894443/article/details/107890815

1) Open a given URL with the webbrowser module

Read the command-line argument from sys.argv, or fall back to the clipboard contents.
Open the page with webbrowser.open().

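import webbrowser, sys, pyperclip
# Use the first command-line argument if one was given, otherwise the clipboard text
if len(sys.argv) > 1:
    content = sys.argv[1]
else:
    content = pyperclip.paste()
webbrowser.open(content)  # open the URL in the default browser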

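Open the cmd command prompt and change the current working directory. Note that a plain cd to a path on another drive does not switch drives; either type the drive letter afterwards, as below, or use cd /d "F:\python_work" to do both at once.

C:\Users\Lenovo>cd "F:\python_work"    # enter the path you want to switch to
C:\Users\Lenovo>                       # the prompt does not change: cd set the current directory of drive F:, but the active drive is still C:
C:\Users\Lenovo>F:                     # switch to drive F:
F:\python_work>test.py https://blog.csdn.net/qq_45894443  # run the script with a command-line argument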
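
The "argument first, clipboard second" choice can also be pulled into a small function, which makes it easy to check without opening a browser. This is only a sketch: choose_target and open_target are hypothetical names, and the clipboard reader is passed in as a parameter so that pyperclip (a third-party package) is needed only by the caller.

```python
import webbrowser


def choose_target(argv, read_clipboard):
    """Return argv[1] if a command-line argument was given, else the clipboard text."""
    if len(argv) > 1:
        return argv[1]
    return read_clipboard()


def open_target(argv, read_clipboard):
    """Open the chosen URL in the default browser (hypothetical helper)."""
    return webbrowser.open(choose_target(argv, read_clipboard))


# Simulated inputs: a fake argv list and a fake clipboard reader.
from_argument = choose_target(["test.py", "https://blog.csdn.net/qq_45894443"],
                              lambda: "https://example.com/from-clipboard")
from_clipboard = choose_target(["test.py"],
                               lambda: "https://example.com/from-clipboard")
print(from_argument)
print(from_clipboard)
```

In the real script the call would be open_target(sys.argv, pyperclip.paste).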

2) Download a web page with the requests module and check for errors

import requests
res = requests.get("https://editor.csdn.net/md?articleId=107890815")
try:
    res.raise_for_status()
except Exception as exc:
    print("There was a problem: %s"%(exc))

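When the download succeeds, res.raise_for_status() does nothing. When it fails (an HTTP 4xx or 5xx status, for example because the page does not exist), it raises an exception, which the try-except structure catches and prints: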

There was a problem: 404 Client Error: Not Found for url:
http://…
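
To see both outcomes without touching the network, the check can be run against hand-built Response objects. This is a sketch under assumptions: check and fake_response are hypothetical helpers, the example.com URLs are placeholders, and real code would get its Response from requests.get() instead of constructing one.

```python
import requests


def check(res):
    """Return None if the response is OK, otherwise the error message."""
    try:
        res.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        return "There was a problem: %s" % (exc)
    return None


def fake_response(status_code, reason, url):
    # Hand-built Response for illustration only; normally requests.get() returns it.
    res = requests.Response()
    res.status_code = status_code
    res.reason = reason
    res.url = url
    return res


ok = check(fake_response(200, "OK", "http://example.com/"))
missing = check(fake_response(404, "Not Found", "http://example.com/missing"))
print(ok)
print(missing)
```

Catching requests.exceptions.HTTPError instead of the broad Exception keeps unrelated bugs from being reported as download problems.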

3) Save the downloaded file to the hard drive

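First, the file must be opened in "write binary" mode, that is, by passing the string 'wb' as the second argument to open(). To write the web page to the file, use a for loop over the Response object's iter_content() method. If you do not need to write a file and only want to work with the HTML directly, use res.text instead.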

import requests
res = requests.get("https://www.sigs.tsinghua.edu.cn/zsjz/115163.jhtml")
try:
    res.raise_for_status()
except Exception as exc:
    print("There was a problem: %s"%(exc))
# 'wb' writes the raw bytes unchanged; the with block closes the file automatically
with open("F:\\python_work\\zsjz_page.txt", 'wb') as file_object:
    for chunk in res.iter_content(100000):  # up to 100000 bytes per chunk
        file_object.write(chunk)
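
The write loop itself does not depend on requests: anything that yields bytes objects can be saved the same way. The sketch below uses only the standard library. save_chunks is a hypothetical helper, and the list fake_chunks stands in for res.iter_content(100000).

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes objects to path in binary mode; return bytes written."""
    written = 0
    with open(path, "wb") as file_object:
        for chunk in chunks:
            written += file_object.write(chunk)
    return written


# Stand-in for res.iter_content(100000): any iterable of bytes objects works.
fake_chunks = [b"<html>", b"<body>hello</body>", b"</html>"]
target = os.path.join(tempfile.mkdtemp(), "page.txt")
size = save_chunks(fake_chunks, target)

# Read the file back to confirm that the chunks were concatenated unchanged.
with open(target, "rb") as f:
    saved = f.read()
print(size, saved)
```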

4) Parse HTML with the BeautifulSoup module

Create a new HTML file with the following content and name it example.html:

<!-- This is the example.html example file. --> 

<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http:// inventwithpython.com">my website</a>.</p>
<p>Download my <strong>Python</strong> book from <a href="http:// inventwithpython-the-copied-one.com">my copied website</a>.</p> 
<p class="slogan">Learn Python the easy way!</p> 
<p>By <span id="author">Al Sweigart</span></p> 
</body></html>

Next, use BeautifulSoup to parse the HTML, find the element whose id attribute is author, and look up the links:

import bs4
fileObject = open("F:\\python_work\\CSDN\\example.html", 'rb')
soup = bs4.BeautifulSoup(fileObject, features='html.parser')
linkElem = soup.select('p #author') # select() returns a list of Tag objects
print(linkElem[0].getText()) # Tag.getText() returns the text inside the matched Tag
print(str(linkElem[0])) # str(Tag) shows the HTML tag it represents
print(linkElem[0].attrs, '\n') # Tag.attrs holds all of the tag's HTML attributes as a dictionary

linkElem1 = soup.select('a[href]') # find <a> elements that have an href attribute; returns a list
print(linkElem1, '\n')

linkElem2 = soup.find_all('a') # find all <a> elements; returns a list
for elem in linkElem2:
    print(elem.get('href')) # loop over the list and extract each link
print('\n')

linkElem3 = soup.find_all('a', text='my website')[0]['href'] # find <a> elements whose text is exactly 'my website'; [0]['href'] takes the link of the first match
print(linkElem3)

Output:

Al Sweigart
<span id="author">Al Sweigart</span>
{'id': 'author'} 

[<a href="http:// inventwithpython.com">my website</a>, <a href="http:// inventwithpython-the-copied-one.com">my copied website</a>] 

http:// inventwithpython.com
http:// inventwithpython-the-copied-one.com


http:// inventwithpython.com

Examples of CSS selectors for select():

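Selector passed to select()	Will match...
soup.select('div')	All elements named <div>
soup.select('#author')	The element with an id attribute of author
soup.select('.notice')	All elements that use a CSS class attribute named notice
soup.select('div span')	All <span> elements inside a <div> element
soup.select('div > span')	All <span> elements directly inside a <div> element, with no other element in between
soup.select('input[name]')	All elements named <input> that have a name attribute with any value
soup.select('input[type="button"]')	All elements named <input> that have a type attribute with the value button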

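Different selector patterns can be combined to form more complex matches. For example, soup.select('p #author') matches every element whose id attribute is author, as long as it is also inside a <p> element.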
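
The selectors above can be tried on a small inline document. The markup below is made up for illustration (it is not example.html), chosen so that 'div span' and 'div > span' give different results.

```python
import bs4

# Made-up markup for illustration: one <span> directly under the <div>,
# and one nested one level deeper inside a <p>.
HTML = """
<div id="box">
  <span class="notice">direct child</span>
  <p><span id="author">Al Sweigart</span></p>
</div>
<input name="q" type="text">
<input type="button" value="Go">
"""
soup = bs4.BeautifulSoup(HTML, features="html.parser")

descendants = soup.select("div span")          # every <span> anywhere inside a <div>
children = soup.select("div > span")           # only <span> elements directly under a <div>
by_id = soup.select("#author")                 # element whose id is author
by_class = soup.select(".notice")              # elements with CSS class notice
named_inputs = soup.select("input[name]")      # <input> elements that have a name attribute
buttons = soup.select('input[type="button"]')  # <input> elements with type="button"
combined = soup.select("p #author")            # id author, but only inside a <p>

print([tag.get_text() for tag in descendants])
print([tag.get_text() for tag in children])
print(combined[0].get_text())
```

Every call returns a list, so an empty result is simply an empty list, not an error.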

The find_all() method of the soup object returned by BeautifulSoup:

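find_all(tag, attributes, recursive, text, limit, keywords)

(These are descriptive names; in the Beautiful Soup documentation the parameters are called name, attrs, recursive, string, limit and **kwargs.)

tag: the tag name. You can search for a single tag, find_all('h1'), or for several tags at once, find_all(['h1', 'h2', 'h3']).

attributes: find tags by the attributes they have. The attributes are passed as a dictionary, e.g. find_all(attrs={'class': 'red'}), or as a keyword argument, e.g. find_all(class_='red').

recursive: whether to search recursively. The default is True, which searches all descendants of the element. With recursive=False only the element's direct children are searched.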

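text: match tags by the text they contain, e.g. find_all('a', text='inspirational'). Since Beautiful Soup 4.4.0 this argument is named string; text is the older name.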
For example, to find the link whose text is my website and extract its URL, soup.find_all('a', text='my website')[0]['href'] does it in a single expression.
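
The parameters described above can be exercised on a short inline document, with one select() call at the end for comparison. This is a sketch under assumptions: the markup is made up, the URLs are written without the stray space seen in example.html, and string= is used in place of text= (the newer name for the same argument).

```python
import bs4

# Made-up markup for illustration; URLs are written without the stray space.
HTML = """
<html><body>
<h1 class="red">Title</h1>
<div><h2>Nested heading</h2></div>
<p>Download from <a href="http://inventwithpython.com">my website</a>.</p>
<p>Mirror: <a href="http://inventwithpython-the-copied-one.com">my copied website</a>.</p>
</body></html>
"""
soup = bs4.BeautifulSoup(HTML, features="html.parser")

headings = soup.find_all(["h1", "h2", "h3"])        # several tag names at once
red_attrs = soup.find_all(attrs={"class": "red"})   # attributes as a dictionary
red_kw = soup.find_all(class_="red")                # same search as a keyword argument
direct = soup.body.find_all(["h1", "h2"], recursive=False)  # direct children of <body> only
exact = soup.find_all("a", string="my website")     # text must match exactly
first_only = soup.find_all("a", limit=1)            # stop after the first match

# The same set of links, found with each method.
hrefs_find_all = [a["href"] for a in soup.find_all("a", href=True)]
hrefs_select = [a["href"] for a in soup.select("a[href]")]

print([tag.name for tag in headings])
print([tag.name for tag in direct])
print(exact[0]["href"])
print(hrefs_find_all == hrefs_select)
```

In this sketch, select() takes one CSS selector string, while find_all() takes keyword filters such as string and limit.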