Python Web Scraping, Lesson 1: Selecting Tag Content
- Getting the title

```python
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def getTitle(url):
    # Fetch the page; on an HTTP error (e.g. 404 or 500) there is nothing to parse.
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    # Parse the page. bsObj.body is None when the page has no <body>,
    # so accessing .h1 on it raises AttributeError.
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)
```
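The second `try` block above can be tried without any network access. The sketch below is only an illustration: `extract_title` is a hypothetical helper (not part of the original code), the HTML strings are made up, and `"html.parser"` is assumed as the parser. It shows the two ways the title can be missing: no `<body>` at all, which raises `AttributeError`, and a `<body>` without an `<h1>`, which simply yields `None`.

```python
from bs4 import BeautifulSoup


def extract_title(html_text):
    """Return the first <h1> inside <body>, or None if either is missing.

    Hypothetical helper for illustration; it mirrors the parsing half of getTitle.
    """
    soup = BeautifulSoup(html_text, "html.parser")
    try:
        # soup.body is None when there is no <body>; None has no .h1 attribute.
        return soup.body.h1
    except AttributeError:
        return None


# Made-up documents covering the three cases.
print(extract_title("<html><body><h1>An Interesting Title</h1></body></html>"))
print(extract_title("<html><head></head></html>"))                    # no <body>
print(extract_title("<html><body><p>no heading</p></body></html>"))   # no <h1>
```

In both failure cases the caller only has to check for `None`, which is why `getTitle` returns `None` on every error path.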
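- Getting specific content from tags, and get_text
- findAll(tag, attributes, recursive, text, limit, keywords)
  find(tag, attributes, recursive, text, keywords)
- `.findAll({"h1", "h2", "h3", "h4", "h5", "h6"})` returns every heading tag in the document, from h1 to h6.
- `.findAll("span", {"class": {"green", "red"}})` returns the span tags whose class is green or red.
- recursive: when set to True, findAll searches all descendants (children, and children's children) for tags that match your parameters. When set to False, it looks only at the direct children of the object it is called on, which for the whole document means the top-level tags. findAll searches recursively by default (recursive defaults to True).
- limit: the range-limiting parameter, which applies only to findAll. find is equivalent to findAll with limit set to 1.
- keyword: the keyword argument lets you select tags that have a particular attribute.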
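The parameters described above can be tried on a small inline document. This is a minimal sketch rather than the original author's code: the HTML is invented for the example, `"html.parser"` is an assumed parser choice, and `find_all` is used because it is the current Beautiful Soup 4 name for `findAll`.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, so the example runs without network access.
HTML = """
<div id="text">
  <h1>War and Peace</h1>
  <h2>Chapter 1</h2>
  <span class="green">Anna Pavlovna</span>
  <span class="red">Well, Prince</span>
  <span class="green">Prince Vasili</span>
</div>
<span class="green">top-level span</span>
"""

soup = BeautifulSoup(HTML, "html.parser")

# tag: a list of names matches any of them, in document order.
headings = soup.find_all(["h1", "h2"])

# attributes: a dict mapping an attribute name to the value it must have.
green = soup.find_all("span", {"class": "green"})

# recursive=False: only direct children of the object being searched.
top_level = soup.find_all("span", recursive=False)

# limit: stop after the first N matches; find() behaves like limit=1.
first_two = soup.find_all("span", limit=2)
first = soup.find("span")

# keyword: the attribute is passed as a keyword argument instead of a dict.
by_keyword = soup.find_all(id="text")
by_attrs = soup.find_all(attrs={"id": "text"})

print([h.name for h in headings])
print(len(green), len(top_level), len(first_two))
print(first == first_two[0], list(by_keyword) == list(by_attrs))
```

Only one span is a direct child of the document here, so `recursive=False` finds one match while the default recursive search finds three green spans.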
Using the keyword argument, `bsObj.findAll(id="text")` is equivalent to `bsObj.findAll("", {"id": "text"})`.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")

# Every tag whose id attribute is "text"
textlist = bsObj.findAll(id="text")
# Every <span> whose class is "green"
namelist = bsObj.findAll("span", {"class": "green"})

# get_text() strips the markup and returns only the text content
for name in namelist:
    print(name.get_text())
for a in textlist:
    print(a.get_text())
```
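To see what `get_text()` changes, it helps to print a tag both ways. The snippet below is a small sketch on a made-up HTML string (not the warandpeace page), again assuming `"html.parser"`; it compares the tag itself with its text-only form.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a page fetched with urlopen.
html = '<p>Hello, <span class="green">Anna <b>Pavlovna</b></span>!</p>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span", {"class": "green"})

print(span)             # the Tag object, markup included
print(span.get_text())  # text only, with the nested <b> tag stripped
print(soup.get_text())  # text of the whole document
```

Because `get_text()` returns a plain string and discards the tag structure, it is usually best called last, once you have finished navigating or filtering the tags.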
- That's all for this post! The material here is drawn from the book Python网络采集.