Python Web Scraping Lesson 1: Selecting Tag Content

k___0___

Tags: python

Article link: https://blog.csdn.net/k___0___/article/details/104703119


Getting the title

```python
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def getTitle(url):
    # Return None if the server answers with an HTTP error (e.g. 404)
    try:
        html = urlopen(url)
    except HTTPError:
        return None
    # Return None if the page has no <body>: bsObj.body is then None,
    # so looking up .h1 on it raises AttributeError
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError:
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)
```
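To see how the missing-tag case behaves without touching the network, here is a minimal sketch that runs the same `body.h1` lookup on inline HTML. The helper name `get_title_from_html` and both sample strings are made up for illustration and are not part of the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, standing in for a downloaded page.
SAMPLE_WITH_H1 = "<html><body><h1>An Interesting Title</h1></body></html>"
SAMPLE_WITHOUT_BODY = "<html><head><title>No body here</title></head></html>"


def get_title_from_html(html):
    """Return the first <h1> inside <body>, or None if it cannot be found."""
    soup = BeautifulSoup(html, "html.parser")
    try:
        # soup.body is None when there is no <body> tag,
        # and None.h1 raises AttributeError.
        return soup.body.h1
    except AttributeError:
        return None


found = get_title_from_html(SAMPLE_WITH_H1)
missing = get_title_from_html(SAMPLE_WITHOUT_BODY)
print(found)
print(missing)
```

Note that a page with a `<body>` but no `<h1>` raises no exception at all: `body.h1` simply evaluates to None, which is why the caller still needs the `is None` check.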
Getting the content of specific tags with get_text
findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)
.findAll({"h1","h2","h3","h4","h5","h6"}) returns every heading tag in the document, h1 through h6.
.findAll("span", {"class":{"green", "red"}}) returns every span tag whose class is green or red.
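As a rough illustration of these two calls, the sketch below runs them on a small inline HTML string (made up for this example). It uses `find_all`, the current spelling of `findAll` in Beautiful Soup 4, and passes the alternatives as lists, which is the form the Beautiful Soup documentation shows.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from any real page.
html = """
<h1>Main heading</h1>
<h2>Sub heading</h2>
<p><span class="green">Anna</span>, <span class="red">the Tsar</span>
and <span class="blue">nobody</span></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Match any of several tag names.
headings = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
heading_names = [tag.name for tag in headings]

# Match one tag name, restricted to either of two class values.
colored = soup.find_all("span", {"class": ["green", "red"]})
colored_names = [tag.get_text() for tag in colored]

print(heading_names)
print(colored_names)
```

Results come back in document order, and the blue span is skipped because its class is in neither list.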
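When recursive is True, findAll searches every descendant of the tag it is called on (children, grandchildren, and so on) for tags that match your criteria. When recursive is False, findAll only checks the top-level tags, that is, the direct children. Recursive search is the default (recursive defaults to True).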
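A small sketch of the difference, assuming a made-up fragment in which one `<p>` is a direct child of a `<div>` and another is nested one level deeper:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> directly under the div, one nested deeper.
html = (
    '<div id="outer">'
    "<p>direct child</p>"
    "<section><p>nested grandchild</p></section>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="outer")

# Default behaviour: search all descendants of the div.
all_p = outer.find_all("p")

# recursive=False: only look at the div's direct children.
direct_p = outer.find_all("p", recursive=False)

print(len(all_p), len(direct_p))
print(direct_p[0].get_text())
```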
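The limit parameter caps the number of results, so it only applies to findAll. find is equivalent to findAll with limit set to 1, except that it returns the tag itself rather than a one-element list.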
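The relationship between the two methods can be checked on a short list, sketched here with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical three-item list.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Stop after the first two matches.
first_two = [li.get_text() for li in soup.find_all("li", limit=2)]

# find returns a single tag; find_all with limit=1 returns a list holding it.
single = soup.find("li")
limited = soup.find_all("li", limit=1)

print(first_two)
print(single.get_text(), len(limited))
```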
Finally, there are keyword arguments, which let you select tags that have a particular attribute.
bsObj.findAll(id="text") is equivalent to bsObj.findAll("", {"id":"text"})
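Here is a minimal sketch of keyword-style filtering on made-up markup. It spells the dictionary form as `attrs={...}` instead of passing an empty tag name, and it uses `class_` for the class attribute, since `class` is a reserved word in Python and cannot be used as a keyword argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with an id and a class to filter on.
html = (
    '<div id="text">Chapter One</div>'
    '<span class="green">Prince</span>'
    '<span class="red">Well</span>'
)
soup = BeautifulSoup(html, "html.parser")

# Keyword form and dictionary form select the same tags.
by_keyword = soup.find_all(id="text")
by_attrs = soup.find_all(attrs={"id": "text"})

# class is a Python keyword, so Beautiful Soup accepts class_ instead.
green = soup.find_all(class_="green")

print([tag.get_text() for tag in by_keyword])
print([tag.get_text() for tag in green])
```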
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")

# Every tag with id="text" (named text_tags to avoid shadowing the built-in list)
text_tags = bsObj.findAll(id="text")
# Every <span class="green">
namelist = bsObj.findAll("span", {"class": "green"})

for name in namelist:
    print(name.get_text())
for a in text_tags:
    print(a.get_text())
```
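What get_text actually does is easiest to see next to the raw tag. The sketch below uses a made-up sentence rather than the page above:

```python
from bs4 import BeautifulSoup

# Hypothetical sentence with a nested tag inside the span.
html = '<p>He was <span class="green">Prince <b>Vasili</b></span> himself.</p>'
soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", class_="green")

raw = str(span)          # the tag with all of its markup
text = span.get_text()   # only the text, nested tags stripped

print(raw)
print(text)
```

Because get_text throws the markup away, it is usually best called last, once you have finished navigating to the tag you want.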
That's all for this post! The material here is drawn from the book Python网络采集 (Web Scraping with Python).
