3-3 Introduction to BeautifulSoup
https://www.crummy.com/software/BeautifulSoup/#Download
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
3-4 Using BeautifulSoup
pip install beautifulsoup4
C:\Users\Administrator\PycharmProjects\python_data_collection\python_data_collection\beautiful.py
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup as bs
soup = bs(html_doc)
print(soup.prettify())
Output
C:/Users/Administrator/PycharmProjects/python_data_collection/python_data_collection/beautiful.py:17: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 17 of the file C:/Users/Administrator/PycharmProjects/python_data_collection/python_data_collection/beautiful.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
  soup = bs(html_doc)
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
C:\Users\Administrator\PycharmProjects\python_data_collection\python_data_collection\beautiful.py
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup as bs
soup = bs(html_doc, "html.parser")
print(soup.prettify())
Output
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
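The warning text asks for the keyword argument features="html.parser", while the script above passes "html.parser" as the second positional argument. Both spellings name the same parser. A minimal sketch to check this, assuming beautifulsoup4 is installed; the short markup string and the variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

# Illustrative markup, a cut-down piece of the html_doc used in these notes.
markup = "<p class=\"title\"><b>The Dormouse's story</b></p>"

# Parser named positionally, as in the script above.
soup_positional = BeautifulSoup(markup, "html.parser")

# Parser named by keyword, as the warning message suggests.
soup_keyword = BeautifulSoup(markup, features="html.parser")

# Both calls build the same tree, so the prettified text is identical.
same_tree = soup_positional.prettify() == soup_keyword.prettify()
print(same_tree)
```

Naming the parser either way keeps the result the same across machines and virtual environments, which is the point of the warning.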
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup as bs
import re
soup = bs(html_doc, "html.parser")
# print(soup.prettify())
# print(soup.title)
# print(soup.title.string)
# print(soup.a)
# print(soup.find(id="link2"))
# print(soup.find(id="link2").string)
# print(soup.find(id="link2").get_text())
# print(soup.find_all("a"))
# for link in soup.find_all("a"):
#     print(link.string)
# print(soup.find("p", {"class": "story"}))
# print(soup.find("p", {"class": "story"}).get_text())
# for tag in soup.find_all(re.compile("^b")):
#     print(tag.name)
# find_all is the current name; findAll is an older alias for the same method
data = soup.find_all("a", href=re.compile(r"^http://example\.com/"))
print(data)
Output
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
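The list printed above holds Tag objects, not strings, so each link's attributes and text can still be read from it. A minimal sketch of that follow-up step, assuming beautifulsoup4 is installed; the shortened html_doc and the names links and rows are chosen here for illustration and are not part of the original script:

```python
import re

from bs4 import BeautifulSoup

# Shortened copy of the sample document: only the paragraph with the links.
html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first matching tag; .string is its single text child.
second_name = soup.find(id="link2").string
print(second_name)

# find_all() returns every match; the compiled regex filters on the href value.
links = soup.find_all("a", href=re.compile(r"^http://example\.com/"))

# Attributes are read like dictionary keys; get_text() returns the tag's text.
rows = [{"id": a["id"], "href": a["href"], "text": a.get_text()} for a in links]
for row in rows:
    print(row)
```

Collecting plain dictionaries like this is one possible way to hand the scraped values to later steps such as writing a CSV file; it is a design choice for the sketch, not something the original notes prescribe.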