from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)
My machine turned out to have urllib2 (Python 2) instead.
BeautifulSoup4 is a parsing library.
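import urllib2
from bs4 import BeautifulSoup
response = urllib2.urlopen('http://python.org/')
html = response.read()
# html is already a string here, so pass it directly instead of calling read() again
bsObj = BeautifulSoup(html, "html.parser")
# The response can only be read once; fetch the page again to read it a second time
print(bsObj.h1)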
Scrape the contents of span tags whose class is green:
#! /usr/bin/env python
#coding=utf-8
import urllib2
from bs4 import BeautifulSoup
html=urllib2.urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj=BeautifulSoup(html)
nameList=bsObj.findAll("span",{"class":"green"})
for name in nameList:
    print(name.get_text())
find() and findAll() with BeautifulSoup
BeautifulSoup’s find() and findAll() are the two functions you will likely use the most.
With them, you can easily filter HTML pages to find lists of desired tags, or a single tag,
based on their various attributes.
The two functions are extremely similar, as evidenced by their definitions in the
BeautifulSoup documentation:
findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)
In all likelihood, 95% of the time you will find yourself only needing to use the first two
arguments: tag and attributes. However, let’s take a look at all of the arguments in
greater detail.
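As a rough illustration of how the two calls differ, the sketch below runs both against a small inline HTML string. The markup and variable names are made up for this example and are not taken from the book's pages. It uses find_all, the current bs4 spelling of findAll, and passes "html.parser" explicitly.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration
sample_html = """
<html><body>
<h1>Title</h1>
<span class="green">Anna</span>
<span class="red">Well, Prince</span>
<span class="green">Pierre</span>
</body></html>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list of every matching tag
green_spans = soup.find_all("span", {"class": "green"})
green_names = [tag.get_text() for tag in green_spans]

# find returns only the first match, or None when nothing matches
first_green = soup.find("span", {"class": "green"})
missing = soup.find("span", {"class": "blue"})

# limit caps the number of results that find_all returns
limited = soup.find_all("span", limit=2)

print(green_names)
print(first_green.get_text())
print(missing)
print(len(limited))
```

Because find can return None, it is usually worth checking the result before calling get_text() on it.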
The tag argument is one that we’ve seen before: you can pass a string name of a tag or
even a Python list of string tag names. For example, the following will return a list of all
the header tags in a document:
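.findAll(["h1","h2","h3","h4","h5","h6"])
The attributes argument takes a dictionary of attributes. To match span tags whose class is
green or red, list both values under a single "class" key, because a repeated key in a
dictionary literal keeps only the last value:
.findAll("span", {"class":["green", "red"]})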
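One thing worth checking when passing several values for the same attribute: a Python dictionary literal cannot hold the same key twice, so only the last value survives. The sketch below shows this on made-up sample markup, alongside a list of tag names and a list of class values. It uses find_all, the current bs4 spelling of findAll, and all names in it are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration
sample_html = """
<h1>Chapter</h1><h2>Section</h2><p>Body</p>
<span class="green">Anna</span>
<span class="red">Well, Prince</span>
<span class="blue">other</span>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# A list of tag names matches any of the listed tags
headers = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
header_names = [tag.name for tag in headers]

# Repeating a key in a dict literal keeps only the last value
duplicate_key = {"class": "green", "class": "red"}

# A list of values under one key matches either class
both = soup.find_all("span", {"class": ["green", "red"]})
both_text = [tag.get_text() for tag in both]

print(header_names)
print(duplicate_key)
print(both_text)
```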
nameList = bsObj.findAll(text="the prince")
print(len(nameList))
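To make the text argument more concrete, here is a minimal sketch on made-up markup. It assumes a reasonably recent bs4, where the argument is spelled string (text is the older name for the same filter) and findAll is spelled find_all. A plain string matches only text nodes whose content equals it exactly, and a compiled regular expression is one way to match partial text.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration
sample_html = """
<p>Then <span class="green">the prince</span> spoke.</p>
<p>Everyone watched <span class="green">the prince</span> closely.</p>
<p>The princess left.</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Exact match: only text nodes equal to "the prince"
exact = soup.find_all(string="the prince")

# Partial match: any text node containing "prince"
partial = soup.find_all(string=re.compile("prince"))

print(len(exact))
print(len(partial))
```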
allText = bsObj.findAll(id="text")
print(allText[0].get_text())
A Caveat to the keyword Argument
The keyword argument can be very helpful in some situations. However, it is technically redundant as a
BeautifulSoup feature. Keep in mind that anything that can be done with keyword can also be accomplished
using techniques we will discuss later in this chapter (see Regular Expressions and Lambda Expressions).
For instance, the following two lines are identical:
bsObj.findAll(id="text")
bsObj.findAll("", {"id":"text"})
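The equivalence can be checked on a small made-up document. This is only a sketch: it uses find_all, the current bs4 spelling of findAll, and passes the dictionary through the attrs parameter instead of an empty tag string, which is an assumption of this example and not the notation used above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration
sample_html = '<div id="text">War and Peace</div><div id="other">Notes</div>'
soup = BeautifulSoup(sample_html, "html.parser")

# Keyword form: the attribute name is the argument name
by_keyword = soup.find_all(id="text")

# Dictionary form: the attribute name is a dictionary key
by_attributes = soup.find_all(attrs={"id": "text"})

keyword_text = [tag.get_text() for tag in by_keyword]
attributes_text = [tag.get_text() for tag in by_attributes]

print(keyword_text)
print(attributes_text)
```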
In addition, you might occasionally run into problems using keyword, most notably when searching for
elements by their class attribute, because class is a protected keyword in Python. That is, class is a
reserved word in Python that cannot be used as a variable or argument name (no relation to the
BeautifulSoup.findAll() keyword argument, previously discussed). For example, if you try the following
call, you’ll get a syntax error due to the nonstandard use of class:
bsObj.findAll(class="green")
Instead, you can use BeautifulSoup’s somewhat clumsy solution, which involves adding an underscore:
bsObj.findAll(class_="green")
Alternatively, you can enclose class in quotes:
bsObj.findAll("", {"class":"green"})
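The three variants can be compared side by side in a short sketch. The markup is made up, compile() is used only to show that the class= spelling is rejected before it ever runs, and the dictionary is passed through the attrs parameter as an assumption of this example. find_all is the current bs4 spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration
sample_html = '<span class="green">Anna</span><span class="red">Prince</span>'
soup = BeautifulSoup(sample_html, "html.parser")

# class is a reserved word, so this source text does not even compile
try:
    compile('soup.find_all(class="green")', "<example>", "eval")
    raised_syntax_error = False
except SyntaxError:
    raised_syntax_error = True

# Workaround 1: the trailing underscore
with_underscore = soup.find_all(class_="green")

# Workaround 2: "class" as a quoted dictionary key
with_dict = soup.find_all(attrs={"class": "green"})

underscore_text = [tag.get_text() for tag in with_underscore]
dict_text = [tag.get_text() for tag in with_dict]

print(raised_syntax_error)
print(underscore_text)
print(dict_text)
```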