1. Opening a web page with Python
from urllib.request import urlopen
# read() returns bytes; the page contains Chinese text, so decode it as UTF-8
html = urlopen(
"https://mofanpy.com/static/scraping/basic-structure.html"
).read().decode('utf-8')
print(html)
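The fetch above can be wrapped in a small helper. This is only a sketch: `fetch_html`, `timeout`, and `fallback_encoding` are hypothetical names, not part of the original tutorial. It assumes the server may declare its charset in the Content-Type header and falls back to UTF-8 when it does not.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10.0, fallback_encoding="utf-8"):
    """Download a page and return its body as text.

    fetch_html and its parameters are illustrative assumptions,
    not names used by the original tutorial.
    """
    # The with-block closes the connection even if read() fails.
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Prefer the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or fallback_encoding
    # errors="replace" substitutes undecodable bytes instead of raising.
    return raw.decode(charset, errors="replace")
```

Calling `fetch_html("https://mofanpy.com/static/scraping/basic-structure.html")` should return the same text as the snippet above, provided the page is reachable.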
2. Matching page content
**2.1 Regular expressions**
## Regular expressions
import re
res = re.findall(r"<title>(.+?)</title>", html)
print("\nPage title is: ", res[0])
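## In the HTML, this paragraph spans several lines (it contains tabs and newlines). By default "." does not match a newline, so we pass flags=re.DOTALL to let "." match newlines as well.
res = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL) # re.DOTALL: "." also matches newlines, needed for multi-line content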
print("\nPage paragraph is: ", res[0])
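To see what re.DOTALL changes, the same two patterns can be tried on a small inline string, with no network access needed. This is a minimal sketch: `SAMPLE_HTML` and `first_match` are hypothetical names made up for illustration, and the markup is invented sample data rather than the real page.

```python
import re

# Hypothetical sample markup; the paragraph deliberately spans several lines.
SAMPLE_HTML = """<html>
<head>
	<title>Sample page</title>
</head>
<body>
	<p>
		First line of the paragraph,
		second line of the paragraph.
	</p>
</body>
</html>"""


def first_match(pattern, text, flags=0):
    """Return the first captured group, or None when nothing matches."""
    found = re.search(pattern, text, flags=flags)
    return found.group(1) if found else None


# The title sits on one line, so no flag is needed.
title = first_match(r"<title>(.+?)</title>", SAMPLE_HTML)

# Without re.DOTALL, "." stops at the newline after <p>, so nothing matches.
para_plain = first_match(r"<p>(.*?)</p>", SAMPLE_HTML)

# With re.DOTALL, "." also matches newlines and the whole paragraph is captured.
para_dotall = first_match(r"<p>(.*?)</p>", SAMPLE_HTML, flags=re.DOTALL)

print("Title:", title)
print("Paragraph without DOTALL:", para_plain)
print("Paragraph with DOTALL:", para_dotall.strip())
```

Using re.search with a None check is a design choice here: indexing res[0] on an empty re.findall result raises IndexError when the pattern does not match.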