Python and the Web
Screen Scraping
# A simple screen-scraping program
from urllib.request import urlopen
import re
p = re.compile(r'<a href="(/jobs/\d+)/">(.*?)</a>')
text = urlopen('http://python.org/jobs').read().decode()
for url, name in p.findall(text):
    print('{} ({})'.format(name, url))
# Output looks something like this:
Python Developer (/jobs/7209)
Python developer (/jobs/7208)
🛠 Experienced Data Engineer (/jobs/7200)
🤖 Experienced Machine Learning Engineer (/jobs/7199)
Lead / Senior Python Software Engineer (/jobs/7198)
Senior Back-End Developer (/jobs/7197)
IT Specialist (Data Science) (/jobs/7196)
Software Engineer (Mid/Senior) (/jobs/7194)
Python Back-end Developer - Planning and Tasking Team (/jobs/7193)
Senior Python Developer (/jobs/7187)
Health / Intermountain Healthcare (/jobs/7185)
Senior Python Developer (/jobs/7184)
Senior Python Developer (/jobs/7181)
Principal Python Engineer (/jobs/7180)
Senior Backend Software Engineer (f/m/d) (/jobs/7177)
Senior Software Engineer @ Omnipresent (/jobs/7175)
Principal Software Engineer @ Omnipresent (/jobs/7174)
Senior Software Engineer - Python (/jobs/7173)
Sr. Python Developer (/jobs/7171)
Senior Python Software Engineer (/jobs/7170)
Python Engineer @ Aeguana (/jobs/7165)
Algorithms Engineer (/jobs/7164)
Experienced Django Developer (Python) (/jobs/7163)
Python Software Engineer (/jobs/7162)
Senior Python/Django Engineer (/jobs/7129)
Drawbacks
- Regular expressions are not easy to read or maintain.
- They cannot cope with HTML peculiarities such as CDATA sections and character entities (for example, &amp;).
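As a small illustration of the entity problem: the regular expression above captures entity references such as &amp; verbatim, and the standard library's html.unescape can at least decode them afterward. A minimal sketch, using a made-up link fragment:

```python
import html
import re

# The same pattern as above, written as a raw string.
p = re.compile(r'<a href="(/jobs/\d+)/">(.*?)</a>')

# A made-up fragment containing an entity reference.
snippet = '<a href="/jobs/42/">R&amp;D Engineer</a>'
url, name = p.findall(snippet)[0]
print(html.unescape(name))  # R&D Engineer
```

This only decodes entities after the fact; it does nothing for CDATA sections or malformed markup, which is why the parsers discussed next are the more robust route.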
Tidy and XHTML Parsing
- Tidy
Tidy is a tool for repairing ill-formed and sloppy HTML.
Malformed HTML code
<h1>Pet Shop
<h2>Complaints</h3>
<p>There is <b>no <i>way</b> at all</i> we can accept returned
parrots.
<h1><i>Dead Pets</h1>
<p>Our pets may tend to rest at times, but rarely die within the
warranty period.
<i><h2>News</h2></i>
<p>We have just received <b>a really nice parrot.
<p>It's really nice.</b>
<h3><hr>The Norwegian Blue</h3>
<h4>Plumage and <hr>pining behavior</h4>
<a href="#norwegian-blue">More information<a>
<p>Features:
<body>
<li>Beautiful plumage
The version repaired by Tidy
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<h1>Pet Shop</h1>
<h2>Complaints</h2>
<p>There is <b>no <i>way</i></b> <i>at all</i> we can accept
returned parrots.</p>
<h1><i>Dead Pets</i></h1>
<p><i>Our pets may tend to rest at times, but rarely die within the
warranty period.</i></p>
<h2><i>News</i></h2>
<p>We have just received <b>a really nice parrot.</b></p>
<p><b>It's really nice.</b></p>
<hr>
<h3>The Norwegian Blue</h3>
<h4>Plumage and</h4>
<hr>
<h4>pining behavior</h4>
<a href="#norwegian-blue">More information</a>
<p>Features:</p>
<ul>
<li>Beautiful plumage</li>
</ul>
</body>
</html>
Of course, Tidy cannot fix every problem with an HTML file, but it does make sure the file is well-formed (that is, all elements are nested properly), which makes parsing much easier.
- Getting Tidy
$ pip install pytidylib
For example, assuming you have a messy HTML file (messy.html) and the command-line version of Tidy on your execution path, the following program runs Tidy on that file and prints the result:
from subprocess import Popen, PIPE
text = open('messy.html').read()
tidy = Popen('tidy', stdin=PIPE, stdout=PIPE, stderr=PIPE)
# communicate() writes stdin and reads stdout for us, avoiding the
# deadlock that manually writing and reading the pipes can cause on
# large files
out, err = tidy.communicate(text.encode())
print(out.decode())
- Why XHTML?
The main difference between XHTML and older forms of HTML is that XHTML is quite strict about closing all elements explicitly (at least for our current purposes).
To parse the well-formed XHTML produced by Tidy, a very simple approach is to use the HTMLParser class from the standard library module html.parser.
- Using HTMLParser
Callback methods of HTMLParser
Callback method | When called |
---|---|
handle_starttag(tag, attrs) | Called when a start tag is encountered. attrs is a sequence of (name, value) pairs |
handle_startendtag(tag, attrs) | Called when an empty tag is encountered. By default, the start tag and end tag are handled separately |
handle_endtag(tag) | Called when an end tag is encountered |
handle_data(data) | Called when textual data is encountered |
handle_charref(ref) | Called when a character reference of the form &#ref; is encountered |
handle_entityref(name) | Called when an entity reference of the form &name; is encountered |
handle_comment(data) | Called when a comment is encountered; called with only the comment contents |
handle_decl(decl) | Called when a declaration of the form <!...> is encountered |
handle_pi(data) | Called for processing instructions |
unknown_decl(data) | Called when an unknown declaration is encountered |
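A quick way to get a feel for these callbacks is a parser subclass that simply records every event it sees. A minimal sketch; note that with the default convert_charrefs=True, character and entity references are converted and merged into the surrounding text passed to handle_data, rather than triggering handle_charref or handle_entityref:

```python
from html.parser import HTMLParser

class Logger(HTMLParser):
    """Record each parser event as a tuple for inspection."""
    def __init__(self):
        super().__init__()
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag, dict(attrs)))
    def handle_data(self, data):
        self.events.append(('data', data))
    def handle_endtag(self, tag):
        self.events.append(('end', tag))

parser = Logger()
parser.feed('<p class="intro">Hi &amp; bye</p>')
parser.close()
print(parser.events)
# [('start', 'p', {'class': 'intro'}), ('data', 'Hi & bye'), ('end', 'p')]
```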
# A screen-scraping program using the module HTMLParser
from urllib.request import urlopen
from html.parser import HTMLParser

def isjob(url):
    try:
        a, b, c, d = url.split('/')
    except ValueError:
        return False
    return a == d == '' and b == 'jobs' and c.isdigit()

class Scraper(HTMLParser):
    in_link = False
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        url = attrs.get('href', '')
        if tag == 'a' and isjob(url):
            self.url = url
            self.in_link = True
            self.chunks = []
    def handle_data(self, data):
        if self.in_link:
            self.chunks.append(data)
    def handle_endtag(self, tag):
        if tag == 'a' and self.in_link:
            print('{} ({})'.format(''.join(self.chunks), self.url))
            self.in_link = False

text = urlopen('http://python.org/jobs').read().decode()
parser = Scraper()
parser.feed(text)
parser.close()
# Output looks something like this:
Python Developer (/jobs/7209/)
Python developer (/jobs/7208/)
🛠 Experienced Data Engineer (/jobs/7200/)
🤖 Experienced Machine Learning Engineer (/jobs/7199/)
Lead / Senior Python Software Engineer (/jobs/7198/)
Senior Back-End Developer (/jobs/7197/)
IT Specialist (Data Science) (/jobs/7196/)
Software Engineer (Mid/Senior) (/jobs/7194/)
Python Back-end Developer - Planning and Tasking Team (/jobs/7193/)
Senior Python Developer (/jobs/7187/)
Health / Intermountain Healthcare (/jobs/7185/)
Senior Python Developer (/jobs/7184/)
Senior Python Developer (/jobs/7181/)
Principal Python Engineer (/jobs/7180/)
Senior Backend Software Engineer (f/m/d) (/jobs/7177/)
Senior Software Engineer @ Omnipresent (/jobs/7175/)
Principal Software Engineer @ Omnipresent (/jobs/7174/)
Senior Software Engineer - Python (/jobs/7173/)
Sr. Python Developer (/jobs/7171/)
Senior Python Software Engineer (/jobs/7170/)
Python Engineer @ Aeguana (/jobs/7165/)
Algorithms Engineer (/jobs/7164/)
Experienced Django Developer (Python) (/jobs/7163/)
Python Software Engineer (/jobs/7162/)
Senior Python/Django Engineer (/jobs/7129/)
Beautiful Soup
Beautiful Soup is a small, excellent module for parsing the sloppy and poorly formatted HTML you are likely to encounter on the Web.
# A screen-scraping program using Beautiful Soup
from urllib.request import urlopen
from bs4 import BeautifulSoup
text = urlopen('http://python.org/jobs').read()
soup = BeautifulSoup(text, 'html.parser')
jobs = set()
for job in soup.body.section('h2'):
jobs.add('{} ({})'.format(job.a.string, job.a['href']))
print('\n'.join(sorted(jobs, key=str.lower)))
# Output looks something like this:
Algorithms Engineer (/jobs/7164/)
Experienced Django Developer (Python) (/jobs/7163/)
Health / Intermountain Healthcare (/jobs/7185/)
IT Specialist (Data Science) (/jobs/7196/)
Lead / Senior Python Software Engineer (/jobs/7198/)
Principal Python Engineer (/jobs/7180/)
Principal Software Engineer @ Omnipresent (/jobs/7174/)
Python Back-end Developer - Planning and Tasking Team (/jobs/7193/)
Python developer (/jobs/7208/)
Python Developer (/jobs/7209/)
Python Engineer @ Aeguana (/jobs/7165/)
Python Software Engineer (/jobs/7162/)
Senior Back-End Developer (/jobs/7197/)
Senior Backend Software Engineer (f/m/d) (/jobs/7177/)
Senior Python Developer (/jobs/7181/)
Senior Python Developer (/jobs/7184/)
Senior Python Developer (/jobs/7187/)
Senior Python Software Engineer (/jobs/7170/)
Senior Python/Django Engineer (/jobs/7129/)
Senior Software Engineer - Python (/jobs/7173/)
Senior Software Engineer @ Omnipresent (/jobs/7175/)
Software Engineer (Mid/Senior) (/jobs/7194/)
Sr. Python Developer (/jobs/7171/)
🛠 Experienced Data Engineer (/jobs/7200/)
🤖 Experienced Machine Learning Engineer (/jobs/7199/)
Creating Dynamic Web Pages with CGI
The Common Gateway Interface (CGI) is a standard mechanism by which a Web server can hand a query (typically supplied through a Web form) to a dedicated program (for example, a Python program you have written) and display the result as a Web page.
Step 1: Prepare the Web Server
$ python3 -m http.server --cgi
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
If you now point your browser at http://127.0.0.1:8000 or http://localhost:8000, you will see the contents of the directory in which the server is running; the server will also print information about each connection.
A CGI program must also be placed in a directory that is reachable via the Web. In addition, it must be identified as a CGI script, so that the Web server does not simply serve its source code as a Web page. There are two common ways to do this:
- place the script in a subdirectory named cgi-bin;
- give the script file the extension .cgi.
Note: opening the required port (firewalld)
# Check the firewall status
systemctl status firewalld
# Check whether the port is already open
firewall-cmd --query-port=8000/tcp
# Open the given port
firewall-cmd --permanent --add-port=8000/tcp
# Close the given port again
firewall-cmd --permanent --remove-port=8000/tcp
# Restart the firewall after adding a port
systemctl restart firewalld
Step 2: Add the #! Line
Add the following as the very first line of the script (with no blank lines before it):
#!/usr/bin/python3
Step 3: Set File Permissions
chmod 755 hello.cgi
A simple CGI script
#!/usr/bin/python3
print('Content-type: text/plain')
print()  # print a blank line to end the headers
print('Hello, world!')
Note: the script must be placed in the cgi-bin directory.
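The hello.cgi script above ignores its input. As a sketch of how a CGI script can read query parameters with nothing but the standard library, you can parse the QUERY_STRING environment variable yourself (the parameter name `name` here is just an example, not something the server prescribes):

```python
#!/usr/bin/python3
import os
from urllib.parse import parse_qs

# The Web server passes the query string (everything after '?' in the
# URL) to the script in the QUERY_STRING environment variable.
params = parse_qs(os.environ.get('QUERY_STRING', ''))
name = params.get('name', ['world'])[0]

print('Content-type: text/plain')
print()  # a blank line ends the headers
print('Hello, {}!'.format(name))
```

Requesting http://localhost:8000/cgi-bin/hello.cgi?name=Python would then answer with "Hello, Python!".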
Summary
Screen scraping: automatically downloading Web pages and extracting information from them. The Tidy program, and its library version, are useful tools for fixing badly formed HTML before parsing it with an HTML parser. An alternative is to use Beautiful Soup, which handles even messy input.
CGI: The Common Gateway Interface is a way of creating dynamic Web pages, by having the Web server run and communicate with a client program and display the results. The modules cgi and cgitb are useful for writing CGI scripts. CGI scripts are usually invoked from HTML forms.
Flask: A simple Web framework that lets you publish your code as a Web application without worrying too much about the Web parts.
Web application frameworks: For developing large, complex Web applications in Python, a Web application framework is essential. Flask is a good choice for simple projects; for larger ones, you should probably consider Django or TurboGears.
Web services: Web services are to programs what Web pages are to users. You can see them as a way of doing network programming at a higher level of abstraction. Common Web service standards include RSS (and its relatives RDF and Atom), XML-RPC, and SOAP.
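Of the Web service standards just listed, XML-RPC is the one the standard library supports directly. A minimal sketch of a server and client in one process (the function name `add` and the loopback address are illustrative choices, not required by the protocol):

```python
from threading import Thread
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Bind to an ephemeral port on the loopback interface.
server = SimpleXMLRPCServer(('127.0.0.1', 0), logRequests=False)
server.register_function(lambda x, y: x + y, 'add')
port = server.server_address[1]

# Serve exactly one request in a background thread.
Thread(target=server.handle_request, daemon=True).start()

# The client calls the remote function as if it were a local method.
proxy = ServerProxy('http://127.0.0.1:{}'.format(port))
result = proxy.add(2, 3)
print(result)  # 5
server.server_close()
```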