Python and the Web
Screen Scraping
# A simple screen-scraping program
from urllib.request import urlopen
import re
p = re.compile(r'<a href="(/jobs/\d+)/">(.*?)</a>')
text = urlopen('http://python.org/jobs').read().decode()
for url, name in p.findall(text):
    print('{} ({})'.format(name, url))
# Output looks something like this:
Python Developer (/jobs/7209)
Python developer (/jobs/7208)
🛠 Experienced Data Engineer (/jobs/7200)
🤖 Experienced Machine Learning Engineer (/jobs/7199)
Lead / Senior Python Software Engineer (/jobs/7198)
Senior Back-End Developer (/jobs/7197)
IT Specialist (Data Science) (/jobs/7196)
Software Engineer (Mid/Senior) (/jobs/7194)
Python Back-end Developer - Planning and Tasking Team (/jobs/7193)
Senior Python Developer (/jobs/7187)
Health / Intermountain Healthcare (/jobs/7185)
Senior Python Developer (/jobs/7184)
Senior Python Developer (/jobs/7181)
Principal Python Engineer (/jobs/7180)
Senior Backend Software Engineer (f/m/d) (/jobs/7177)
Senior Software Engineer @ Omnipresent (/jobs/7175)
Principal Software Engineer @ Omnipresent (/jobs/7174)
Senior Software Engineer - Python (/jobs/7173)
Sr. Python Developer (/jobs/7171)
Senior Python Software Engineer (/jobs/7170)
Python Engineer @ Aeguana (/jobs/7165)
Algorithms Engineer (/jobs/7164)
Experienced Django Developer (Python) (/jobs/7163)
Python Software Engineer (/jobs/7162)
Senior Python/Django Engineer (/jobs/7129)
Drawbacks
- Regular expressions are not easy to read or maintain.
- They cannot cope with HTML peculiarities such as CDATA sections and character entities (for example, &amp;).
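As a small illustration of the entity problem: the regular expression above captures entity references such as &amp; verbatim, and the standard library's html.unescape can at least decode them afterward. A minimal sketch, using a made-up link fragment:

```python
import html
import re

# The same pattern as above, written as a raw string.
p = re.compile(r'<a href="(/jobs/\d+)/">(.*?)</a>')

# A made-up fragment containing an entity reference.
snippet = '<a href="/jobs/42/">R&amp;D Engineer</a>'
url, name = p.findall(snippet)[0]
print(html.unescape(name))  # R&D Engineer
```

This only decodes entities after the fact; it does nothing for CDATA sections or malformed markup, which is why the parsers discussed next are the more robust route.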
Tidy and XHTML Parsing
- Tidy
Tidy is a tool for repairing ill-formed and sloppy HTML.
Malformed HTML code
<h1>Pet Shop
<h2>Complaints</h3>
<p>There is <b>no <i>way</b> at all</i> we can accept returned
parrots.
<h1><i>Dead Pets</h1>
<p>Our pets may tend to rest at times, but rarely die within the
warranty period.
<i><h2>News</h2></i>
<p>We have just received <b>a really nice parrot.
<p>It's really nice.</b>
<h3><hr>The Norwegian Blue</h3>
<h4>Plumage and <hr>pining behavior</h4>
<a href="#norwegian-blue">More information<a>
<p>Features:
<body>
<li>Beautiful plumage
The version repaired by Tidy
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<h1>Pet Shop</h1>
<h2>Complaints</h2>
<p>There is <b>no <i>way</i></b> <i>at all</i> we can accept
returned parrots.</p>
<h1><i>Dead Pets</i></h1>
<p><i>Our pets may tend to rest at times, but rarely die within the
warranty period.</i></p>
<h2><i>News</i></h2>
<p>We have just received <b>a really nice parrot.</b></p>
<p><b>It's really nice.</b></p>
<hr>
<h3>The Norwegian Blue</h3>
<h4>Plumage and</h4>
<hr>
<h4>pining behavior</h4>
<a href="#norwegian-blue">More information</a>
<p>Features:</p>
<ul>
<li>Beautiful plumage</li>
</ul>
</body>
</html>
Of course, Tidy cannot fix every problem with an HTML file, but it does make sure the file is well-formed (that is, all elements are nested properly), which makes parsing much easier.
- Getting Tidy
$ pip install pytidylib
For example, assuming you have a messy HTML file (messy.html) and the command-line version of Tidy on your execution path, the following program runs Tidy on that file and prints the result:
from subprocess import Popen, PIPE
text = open('messy.html').read()
tidy = Popen('tidy', stdin=PIPE, stdout=PIPE, stderr=PIPE)
# communicate() writes stdin and reads stdout for us, avoiding the
# deadlock that manually writing and reading the pipes can cause on
# large files
out, err = tidy.communicate(text.encode())
print(out.decode())
- Why XHTML?
The main difference between XHTML and older forms of HTML is that XHTML is quite strict about closing all elements explicitly (at least for our current purposes).
To parse the well-formed XHTML produced by Tidy, a very simple approach is to use the HTMLParser class from the standard library module html.parser.
- Using HTMLParser
Callback methods of HTMLParser
Callback method | When called |
---|---|
handle_starttag(tag, attrs) | Called when a start tag is encountered. attrs is a sequence of (name, value) pairs |
handle_startendtag(tag, attrs) | Called when an empty tag is encountered. By default, the start tag and end tag are handled separately |
handle_endtag(tag) | Called when an end tag is encountered |
handle_data(data) | Called when textual data is encountered |
handle_charref(ref) | Called when a character reference of the form &#ref; is encountered |
handle_entityref(name) | Called when an entity reference of the form &name; is encountered |
handle_comment(data) | Called when a comment is encountered; called with only the comment contents |
handle_decl(decl) | Called when a declaration of the form <!...> is encountered |
handle_pi(data) | Called for processing instructions |
unknown_decl(data) | Called when an unknown declaration is encountered |
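A quick way to get a feel for these callbacks is a parser subclass that simply records every event it sees. A minimal sketch; note that with the default convert_charrefs=True, character and entity references are converted and merged into the surrounding text passed to handle_data, rather than triggering handle_charref or handle_entityref:

```python
from html.parser import HTMLParser

class Logger(HTMLParser):
    """Record each parser event as a tuple for inspection."""
    def __init__(self):
        super().__init__()
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag, dict(attrs)))
    def handle_data(self, data):
        self.events.append(('data', data))
    def handle_endtag(self, tag):
        self.events.append(('end', tag))

parser = Logger()
parser.feed('<p class="intro">Hi &amp; bye</p>')
parser.close()
print(parser.events)
# [('start', 'p', {'class': 'intro'}), ('data', 'Hi & bye'), ('end', 'p')]
```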
# A screen-scraping program using the module HTMLParser
from urllib.request import urlopen
from html.parser import HTMLParser

def isjob(url):
    try:
        a, b, c, d = url.split('/')
    except ValueError:
        return False
    return a == d == '' and b == 'jobs' and c.isdigit()

class Scraper(HTMLParser):
    in_link = False
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        url = attrs.get('href', '')
        if tag == 'a' and isjob(url):
            self.url = url
            self.in_link = True
            self.chunks = []
    def handle_data(self, data):
        if self.in_link:
            self.chunks.append(data)
    def handle_endtag(self, tag):
        if tag == 'a' and self.in_link:
            print('{} ({})'.format(''.join(self.chunks), self.url))
            self.in_link = False

text = urlopen('http://python.org/jobs').read().decode()
parser = Scraper()
parser.feed(text)
parser.close()
# Output looks something like this:
Python Developer (/jobs/7209/)
Python developer (/jobs/7208/)
🛠 Experienced Data Engineer (/jobs/7200/)
🤖 Experienced Machine Learning Engineer (/jobs/7199/)
Lead / Senior Python Software Engineer (/jobs/7198/)
Senior Back-End Developer (/jobs/7197/)
IT Specialist (Data Science) (/jobs/7196/)
Software Engineer (Mid/Senior) (/jobs/7194/)
Python Back-end Developer - Planning and Tasking Team (/jobs/7193/)
Senior Python Developer (/jobs/7187/)
Health / Intermountain Healthcare (/jobs/7185/)
Senior Python Developer (/jobs/7184/)
Senior Python Developer (/jobs/7181/)
Principal Python Engineer (/jobs/7180/)
Senior Backend Software Engineer (f/m/d) (/jobs/7177/)
Senior Software Engineer @ Omnipresent (/jobs/7175/)
Principal Software Engineer @ Omnipresent (/jobs/7174/)
Senior Software Engineer - Python (/jobs/7173/)
Sr. Python Developer (/jobs/7171/)
Senior Python Software Engineer (/jobs/7170/)
Python Engineer @ Aeguana (/jobs/7165/)
Algorithms Engineer (/jobs/7164/)
Experienced Django Developer (Python) (/jobs/7163/)
Python Software Engineer (/jobs/7162/)
Senior Python/Django Engineer (/jobs/7129/)
Beautiful Soup
Beautiful Soup is a small, excellent module for parsing the sloppy and poorly formatted HTML you are likely to encounter on the Web.
# A screen-scraping program using Beautiful Soup
from urllib.request import urlopen
from bs4 import BeautifulSoup
text = urlopen('http://python.org/jobs').read()
soup = BeautifulSoup(text, 'html.parser')
jobs = set()
for job in soup.body.section('h2'):
jobs.add('{} ({})'.format(job.a.string, job.a['href']))
print('\n'.join(sorted(jobs, key=str.lower)))
# Output looks something like this:
Algorithms Engineer (/jobs/7164/)
Experienced Django Developer (Python) (/jobs/7163/)
Health / Intermountain Healthcare (/jobs/7185/)
IT Specialist (Data Science) (/jobs/7196/)
Lead / Senior Python Software Engineer (/jobs/7198/)
Principal Python Engineer (/jobs/7180/)
Principal Software Engineer @ Omnipresent (/jobs/7174/)
Python Back-end Developer - Planning and Tasking Team (/jobs/7193/)
Python developer (/jobs/7208/)
Python Developer (/jobs/7209/)
Python Engineer @ Aeguana (/jobs/7165/)
Python Software Engineer (/jobs/7162/)
Senior Back-End Developer (/jobs/7197/)
Senior Backend Software Engineer (f/m/d) (/jobs/7177/)
Senior Python Developer (/jobs/7181/)
Senior Python Developer (/jobs/7184/)
Senior Python Developer (/jobs/7187/)
Senior Python Software Engineer (/jobs/7170/)
Senior Python/Django Engineer (/jobs/7129/)
Senior Software Engineer - Python (/jobs/7173/)
Senior Software Engineer @ Omnipresent (/jobs/7175/)
Software Engineer (Mid/Senior) (/jobs/7194/)
Sr. Python Developer (/jobs/7171/)
🛠 Experienced Data Engineer (/jobs/7200/)
🤖 Experienced Machine Learning Engineer (/jobs/7199/)
Creating Dynamic Web Pages with CGI
The Common Gateway Interface (CGI) is a standard mechanism by which a Web server can hand a query (typically supplied through a Web form) to a dedicated program (for example, a Python program you have written) and display the result as a Web page.
Step 1: Prepare the Web Server
$ python3 -m http.server --cgi
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
If you now point your browser at http://127.0.0.1:8000 or http://localhost:8000, you will see the contents of the directory in which the server is running; the server will also print information about each connection.
A CGI program must also be placed in a directory that is reachable via the Web. In addition, it must be identified as a CGI script, so that the Web server does not simply serve its source code as a Web page. There are two common ways to do this:
- place the script in a subdirectory named cgi-bin;
- give the script file the extension .cgi.
Note: opening the required port (firewalld)
# Check the firewall status
systemctl status firewalld
# Check whether the port is already open
firewall-cmd --query-port=8000/tcp
# Open the given port
firewall-cmd --permanent --add-port=8000/tcp
# Close the given port again
firewall-cmd --permanent --remove-port=8000/tcp
# Restart the firewall after adding a port
systemctl restart firewalld
Step 2: Add the #! Line
Add the following as the very first line of the script (with no blank lines before it):
#!/usr/bin/python3
Step 3: Set File Permissions
chmod 755 hello.cgi
A simple CGI script
#!/usr/bin/python3
print('Content-type: text/plain')
print()  # print a blank line to end the headers
print('Hello, world!')
Note: the script must be placed in the cgi-bin directory.
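The hello.cgi script above ignores its input. As a sketch of how a CGI script can read query parameters with nothing but the standard library, you can parse the QUERY_STRING environment variable yourself (the parameter name `name` here is just an example, not something the server prescribes):

```python
#!/usr/bin/python3
import os
from urllib.parse import parse_qs

# The Web server passes the query string (everything after '?' in the
# URL) to the script in the QUERY_STRING environment variable.
params = parse_qs(os.environ.get('QUERY_STRING', ''))
name = params.get('name', ['world'])[0]

print('Content-type: text/plain')
print()  # a blank line ends the headers
print('Hello, {}!'.format(name))
```

Requesting http://localhost:8000/cgi-bin/hello.cgi?name=Python would then answer with "Hello, Python!".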
Summary
Screen scraping: automatically downloading Web pages and extracting information from them. The Tidy program, and its library version, are useful tools for fixing badly formed HTML before parsing it with an HTML parser. An alternative is to use Beautiful Soup, which handles even messy input.
CGI: The Common Gateway Interface is a way of creating dynamic Web pages, by having the Web server run and communicate with a client program and display the results. The modules cgi and cgitb are useful for writing CGI scripts. CGI scripts are usually invoked from HTML forms.
Flask: A simple Web framework that lets you publish your code as a Web application without worrying too much about the Web parts.
Web application frameworks: For developing large, complex Web applications in Python, a Web application framework is essential. Flask is a good choice for simple projects; for larger ones, you should probably consider Django or TurboGears.
Web services: Web services are to programs what Web pages are to users. You can see them as a way of doing network programming at a higher level of abstraction. Common Web service standards include RSS (and its relatives RDF and Atom), XML-RPC, and SOAP.
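Of the Web service standards just listed, XML-RPC is the one the standard library supports directly. A minimal sketch of a server and client in one process (the function name `add` and the loopback address are illustrative choices, not required by the protocol):

```python
from threading import Thread
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Bind to an ephemeral port on the loopback interface.
server = SimpleXMLRPCServer(('127.0.0.1', 0), logRequests=False)
server.register_function(lambda x, y: x + y, 'add')
port = server.server_address[1]

# Serve exactly one request in a background thread.
Thread(target=server.handle_request, daemon=True).start()

# The client calls the remote function as if it were a local method.
proxy = ServerProxy('http://127.0.0.1:{}'.format(port))
result = proxy.add(2, 3)
print(result)  # 5
server.server_close()
```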