Python Study Notes 15: Python and the Web

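This article covers basic approaches to Web screen scraping in Python, including regular expressions, Tidy with XHTML parsing, and the Beautiful Soup library. It also explains how to create dynamic Web pages with CGI, with emphasis on configuring CGI scripts and setting their permissions.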

Python and the Web

Screen Scraping

# A simple screen-scraping program
from urllib.request import urlopen
import re

# Capture the job URL (without the trailing slash) and the link text
p = re.compile(r'<a href="(/jobs/\d+)/">(.*?)</a>')
text = urlopen('http://python.org/jobs').read().decode()
for url, name in p.findall(text):
    print('{} ({})'.format(name, url))

# Output resembles
Python Developer (/jobs/7209)
Python developer (/jobs/7208)
🛠 Experienced Data Engineer (/jobs/7200)
🤖 Experienced Machine Learning Engineer (/jobs/7199)
Lead / Senior Python Software Engineer (/jobs/7198)
Senior Back-End Developer (/jobs/7197)
IT Specialist (Data Science) (/jobs/7196)
Software Engineer (Mid/Senior) (/jobs/7194)
Python Back-end Developer - Planning and Tasking Team (/jobs/7193)
Senior Python Developer (/jobs/7187)
Health / Intermountain Healthcare (/jobs/7185)
Senior Python Developer (/jobs/7184)
Senior Python Developer (/jobs/7181)
Principal Python Engineer (/jobs/7180)
Senior Backend Software Engineer (f/m/d) (/jobs/7177)
Senior Software Engineer @ Omnipresent (/jobs/7175)
Principal Software Engineer @ Omnipresent (/jobs/7174)
Senior Software Engineer - Python (/jobs/7173)
Sr. Python Developer (/jobs/7171)
Senior Python Software Engineer (/jobs/7170)
Python Engineer @ Aeguana (/jobs/7165)
Algorithms Engineer (/jobs/7164)
Experienced Django Developer (Python) (/jobs/7163)
Python Software Engineer (/jobs/7162)
Senior Python/Django Engineer (/jobs/7129)

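Drawbacks

  • Regular expressions are not easy to read.
  • They cannot cope with HTML peculiarities such as CDATA sections and character entities (such as &amp;).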
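
The second drawback can be seen in a small offline sketch. The sample markup and the helper name extract_jobs are made up for illustration; html.unescape from the standard library is one possible way to decode the entities that the regular expression leaves untouched.

```python
import re
from html import unescape

# Hypothetical sample markup, modeled on the job links above
SAMPLE = ('<a href="/jobs/1/">R&amp;D Engineer</a>\n'
          '<a href="/jobs/2/">Data &lt;Platform&gt; Lead</a>')

pattern = re.compile(r'<a href="(/jobs/\d+)/">(.*?)</a>')

def extract_jobs(text):
    """Return (name, url) pairs with character entities decoded."""
    return [(unescape(name), url) for url, name in pattern.findall(text)]

if __name__ == '__main__':
    # The raw regex match still contains the undecoded entity
    print(pattern.findall(SAMPLE)[0])
    # After unescape() the name reads as a browser would show it
    print(extract_jobs(SAMPLE))
```

The regular expression alone returns 'R&amp;D Engineer'; the decoding has to be added as a separate step, which is exactly the kind of detail a real HTML parser handles for you.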

Tidy and XHTML Parsing

  1. Tidy

Tidy is a tool for repairing malformed and sloppy HTML.

Malformed HTML code

<h1>Pet Shop 
<h2>Complaints</h3> 
<p>There is <b>no <i>way</b> at all</i> we can accept returned 
parrots. 
<h1><i>Dead Pets</h1>
<p>Our pets may tend to rest at times, but rarely die within the 
warranty period. 
<i><h2>News</h2></i> 
<p>We have just received <b>a really nice parrot. 
<p>It's really nice.</b> 
<h3><hr>The Norwegian Blue</h3> 
<h4>Plumage and <hr>pining behavior</h4> 
<a href="#norwegian-blue">More information<a> 
<p>Features: 
<body> 
<li>Beautiful plumage

The version repaired by Tidy

<!DOCTYPE html> 
<html> 
<head> 
<title></title> 
</head> 
<body> 
<h1>Pet Shop</h1> 
<h2>Complaints</h2> 
<p>There is <b>no <i>way</i></b> <i>at all</i> we can accept 
returned parrots.</p> 
<h1><i>Dead Pets</i></h1> 
<p><i>Our pets may tend to rest at times, but rarely die within the 
warranty period.</i></p> 
<h2><i>News</i></h2> 
<p>We have just received <b>a really nice parrot.</b></p> 
<p><b>It's really nice.</b></p> 
<hr> 
<h3>The Norwegian Blue</h3> 
<h4>Plumage and</h4> 
<hr> 
<h4>pining behavior</h4> 
<a href="#norwegian-blue">More information</a> 
<p>Features:</p> 
<ul> 
<li>Beautiful plumage</li> 
</ul> 
</body> 
</html>

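Of course, Tidy cannot fix every problem in an HTML file, but it does ensure that the file is well formed (that is, all elements are nested correctly), which makes parsing much easier.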

  2. Getting Tidy
$ pip install pytidylib

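For example, suppose you have a messy HTML file (messy.html) and the command-line version of Tidy is on your executable path. The following program runs Tidy on that file and prints the result: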

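from subprocess import Popen, PIPE

# Read the messy input, closing the file when done
with open('messy.html') as f:
    text = f.read()

tidy = Popen('tidy', stdin=PIPE, stdout=PIPE, stderr=PIPE)

# communicate() sends the input, closes stdin, and collects the
# output, which avoids blocking on a full pipe with large files
out, err = tidy.communicate(text.encode())

print(out.decode())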
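
If the same step is needed in several places, it can be wrapped in a function. The following is only a sketch: the name run_tidy is made up, and the -q (quiet) and -asxhtml (emit XHTML) options are an assumption about the installed command-line Tidy, so check them against your version.

```python
import shutil
import subprocess

def run_tidy(html_text, executable='tidy'):
    """Run command-line Tidy on html_text and return the repaired markup.

    Returns None if the executable cannot be found on the path.
    """
    if shutil.which(executable) is None:
        return None
    # Assumed options: -q suppresses the summary, -asxhtml requests XHTML.
    # Tidy exits with a nonzero status on warnings, so no check=True here.
    result = subprocess.run(
        [executable, '-q', '-asxhtml'],
        input=html_text.encode(),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.stdout.decode()

if __name__ == '__main__':
    fixed = run_tidy('<h1>Pet Shop<p>Features:<li>Beautiful plumage')
    print(fixed if fixed is not None else 'tidy is not installed')
```

Returning None when Tidy is missing lets the caller decide whether to fall back to parsing the raw HTML.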

  3. Why use XHTML?

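The main difference between XHTML and older forms of HTML (at least for our current purposes) is that XHTML is very strict: every element must be closed explicitly.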
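
This strictness can be observed with any XML parser. The sketch below uses xml.etree.ElementTree from the standard library, which the original notes do not use; the helper name is_well_formed and the sample fragments are illustrative only.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True

if __name__ == '__main__':
    # Every element is closed and properly nested
    print(is_well_formed('<p>We have <b>a parrot.</b></p>'))
    # <b> is never closed, so an XML parser rejects it
    print(is_well_formed('<p>We have <b>a parrot.</p>'))
    # Old-style <br> is rejected; the XHTML form <br/> is accepted
    print(is_well_formed('<p>line<br></p>'), is_well_formed('<p>line<br/></p>'))
```

Markup that passes this kind of check is what Tidy aims to produce, and it is much easier for a parser to handle.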

A very simple way to parse the well-formed XHTML that Tidy produces is the HTMLParser class from the standard library module html.parser.

  4. Using HTMLParser

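Callback methods in HTMLParser

Callback method                    When it is called
handle_starttag(tag, attrs)        When a start tag is found. attrs is a sequence of (name, value) tuples
handle_startendtag(tag, attrs)     When an empty tag is found. By default, handles the start and end tags separately
handle_endtag(tag)                 When an end tag is found
handle_data(data)                  When text data is found
handle_charref(ref)                When a character reference of the form &#ref; is found
handle_entityref(name)             When an entity reference of the form &name; is found
handle_comment(data)               When a comment is found; called with the comment contents only
handle_decl(decl)                  When a declaration of the form <!...> is found
handle_pi(data)                    For processing instructions
unknown_decl(data)                 When an unknown declaration is found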
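
To see when these callbacks fire, a parser can simply record each call. This is a minimal sketch; the class name EventLogger and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Record every callback as a tuple so the call order can be inspected."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag, dict(attrs)))

    def handle_startendtag(self, tag, attrs):
        # Overridden so that empty tags are reported as a single event
        self.events.append(('startend', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

if __name__ == '__main__':
    logger = EventLogger()
    logger.feed('<p class="x">Hi<br/></p>')
    logger.close()
    for event in logger.events:
        print(event)
```

Collecting events in a list instead of printing inside the callbacks keeps the parser easy to test and reuse.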

# A screen-scraping program using the HTMLParser module
from urllib.request import urlopen 
from html.parser import HTMLParser 

def isjob(url): 
    try: 
        a, b, c, d = url.split('/') 
    except ValueError: 
        return False 
    return a == d == '' and b == 'jobs' and c.isdigit() 

class Scraper(HTMLParser): 
    in_link = False  
    def handle_starttag(self, tag, attrs): 
        attrs = dict(attrs) 
        url = attrs.get('href', '') 
        if tag == 'a' and isjob(url): 
            self.url = url 
            self.in_link = True 
            self.chunks = [] 
 
    def handle_data(self, data): 
        if self.in_link: 
            self.chunks.append(data)

    def handle_endtag(self, tag): 
        if tag == 'a' and self.in_link: 
            print('{} ({})'.format(''.join(self.chunks), self.url)) 
            self.in_link = False 

text = urlopen('http://python.org/jobs').read().decode() 
parser = Scraper() 
parser.feed(text) 
parser.close()

# Output resembles
Python Developer (/jobs/7209/)
Python developer (/jobs/7208/)
🛠 Experienced Data Engineer (/jobs/7200/)
🤖 Experienced Machine Learning Engineer (/jobs/7199/)
Lead / Senior Python Software Engineer (/jobs/7198/)
Senior Back-End Developer (/jobs/7197/)
IT Specialist (Data Science) (/jobs/7196/)
Software Engineer (Mid/Senior) (/jobs/7194/)
Python Back-end Developer - Planning and Tasking Team (/jobs/7193/)
Senior Python Developer (/jobs/7187/)
Health / Intermountain Healthcare (/jobs/7185/)
Senior Python Developer (/jobs/7184/)
Senior Python Developer (/jobs/7181/)
Principal Python Engineer (/jobs/7180/)
Senior Backend Software Engineer (f/m/d) (/jobs/7177/)
Senior Software Engineer @ Omnipresent (/jobs/7175/)
Principal Software Engineer @ Omnipresent (/jobs/7174/)
Senior Software Engineer - Python (/jobs/7173/)
Sr. Python Developer (/jobs/7171/)
Senior Python Software Engineer (/jobs/7170/)
Python Engineer @ Aeguana (/jobs/7165/)
Algorithms Engineer (/jobs/7164/)
Experienced Django Developer (Python) (/jobs/7163/)
Python Software Engineer (/jobs/7162/)
Senior Python/Django Engineer (/jobs/7129/)

Beautiful Soup

Beautiful Soup is a small but excellent module for parsing the sloppy, ill-formed HTML you are likely to find on the Web.

# A screen-scraping program using BeautifulSoup
from urllib.request import urlopen 
from bs4 import BeautifulSoup 

text = urlopen('http://python.org/jobs').read() 
soup = BeautifulSoup(text, 'html.parser') 

jobs = set() 
for job in soup.body.section('h2'): 
    jobs.add('{} ({})'.format(job.a.string, job.a['href'])) 

print('\n'.join(sorted(jobs, key=str.lower)))

# Output resembles
Algorithms Engineer (/jobs/7164/)
Experienced Django Developer (Python) (/jobs/7163/)
Health / Intermountain Healthcare (/jobs/7185/)
IT Specialist (Data Science) (/jobs/7196/)
Lead / Senior Python Software Engineer (/jobs/7198/)
Principal Python Engineer (/jobs/7180/)
Principal Software Engineer @ Omnipresent (/jobs/7174/)
Python Back-end Developer - Planning and Tasking Team (/jobs/7193/)
Python developer (/jobs/7208/)
Python Developer (/jobs/7209/)
Python Engineer @ Aeguana (/jobs/7165/)
Python Software Engineer (/jobs/7162/)
Senior Back-End Developer (/jobs/7197/)
Senior Backend Software Engineer (f/m/d) (/jobs/7177/)
Senior Python Developer (/jobs/7181/)
Senior Python Developer (/jobs/7184/)
Senior Python Developer (/jobs/7187/)
Senior Python Software Engineer (/jobs/7170/)
Senior Python/Django Engineer (/jobs/7129/)
Senior Software Engineer - Python (/jobs/7173/)
Senior Software Engineer @ Omnipresent (/jobs/7175/)
Software Engineer (Mid/Senior) (/jobs/7194/)
Sr. Python Developer (/jobs/7171/)
🛠 Experienced Data Engineer (/jobs/7200/)
🤖 Experienced Machine Learning Engineer (/jobs/7199/)

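Creating Dynamic Web Pages with CGI

CGI stands for Common Gateway Interface. It is a standard mechanism by which a Web server can pass a query (typically supplied through a Web form) to a dedicated program (such as a Python program you write) and display the result as a Web page.

Step 1: Prepare the Web server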

$ python3 -m http.server --cgi
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...

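If you now point your browser at http://127.0.0.1:8000 or http://localhost:8000, you will see the contents of the directory in which the server is running. The server will also print information about each connection.

A CGI program must also be placed in a directory that is accessible via the Web. In addition, it must be identified as a CGI script so that the Web server does not serve its source code as a plain page. There are two common ways of doing this:

  • Put the script in a subdirectory called cgi-bin.
  • Give the script file the extension .cgi.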

P.S. Opening a specific port

# Check the firewall status
systemctl status firewalld

# Check whether the port is already open
firewall-cmd --query-port=8000/tcp
# Open the required port
firewall-cmd --permanent --add-port=8000/tcp
# Remove the port again
firewall-cmd --permanent --remove-port=8000/tcp
# After adding a port, restart the firewall
systemctl restart firewalld
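
Whether the port is actually reachable can also be checked from Python. The sketch below is one possible approach, not part of the original notes; the function name port_is_open is made up, and it simply tries to open a TCP connection.

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False

if __name__ == '__main__':
    # 8000 is the port used by the server started in step 1
    print(port_is_open('127.0.0.1', 8000))
```

Run it from another machine with the server's address to test the firewall rule rather than just the local listener.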

Step 2: Add the #! line

Add the following as the very first line of the script (with no blank lines before it):

#!/usr/bin/python3

Step 3: Set the file permissions

chmod 755 hello.cgi
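
The mode 755 means read, write, and execute for the owner, and read and execute for the group and for everyone else. As a sketch for POSIX systems, the same permissions can be set from Python; the names MODE_755 and make_executable are made up for illustration.

```python
import os
import stat
import tempfile

# 755 = owner rwx (7), group r-x (5), others r-x (5)
MODE_755 = (stat.S_IRWXU
            | stat.S_IRGRP | stat.S_IXGRP
            | stat.S_IROTH | stat.S_IXOTH)

def make_executable(path):
    """Apply the equivalent of `chmod 755 path` and return the new mode bits."""
    os.chmod(path, MODE_755)
    return stat.S_IMODE(os.stat(path).st_mode)

if __name__ == '__main__':
    # Demonstrate on a throwaway file rather than a real script
    with tempfile.NamedTemporaryFile(suffix='.cgi') as tmp:
        print(oct(make_executable(tmp.name)))
```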

A simple CGI script

#!/usr/bin/python3

print('Content-type: text/plain')
print()  # Print a blank line to end the headers

print('Hello, world!')

P.S. With the http.server setup above, the script needs to be placed in the cgi-bin directory.
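
Since CGI exists to hand queries to a program, a natural next step is to read one. A CGI server passes the query part of the URL in the QUERY_STRING environment variable. The sketch below parses it with urllib.parse.parse_qs; the parameter name "name" and the helper build_response are assumptions made for illustration.

```python
#!/usr/bin/python3
import os
from urllib.parse import parse_qs

def build_response(query_string):
    """Build the full CGI response (headers, blank line, body) as a string."""
    params = parse_qs(query_string)
    # parse_qs maps each key to a list of values; fall back to 'world'
    name = params.get('name', ['world'])[0]
    return 'Content-type: text/plain\n\nHello, {}!\n'.format(name)

if __name__ == '__main__':
    # The Web server sets QUERY_STRING for requests such as ?name=Gumby
    print(build_response(os.environ.get('QUERY_STRING', '')), end='')
```

Saved as, say, cgi-bin/greet.cgi (a hypothetical file name) and made executable, it could be requested as http://localhost:8000/cgi-bin/greet.cgi?name=Gumby. Keeping the response in a function makes the script testable without a Web server.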

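Summary

Screen scraping: automatically downloading Web pages and extracting information from them. The Tidy program and its library version are useful tools for fixing ill-formed HTML before handing it to an HTML parser. Another option is Beautiful Soup, which copes even with messy input.

CGI: the Common Gateway Interface is a way of creating dynamic Web pages by having the Web server run your programs, communicate with them, and display the results. The cgi and cgitb modules are useful for writing CGI scripts (note that both were deprecated in Python 3.11 and removed in Python 3.13). CGI scripts are usually invoked from HTML forms.

Flask: a simple Web framework that lets you publish your code as a Web application without worrying too much about the Web part.

Web application frameworks: for developing large, complex Web applications in Python, a Web application framework is almost a must. Flask is a good choice for simpler projects; for larger ones, you may want to consider Django or TurboGears.

Web services: Web services are to programs what Web pages are to people. You can think of them as a way of doing network programming at a higher level of abstraction. Common Web service standards include RSS (and its relatives, RDF and Atom), XML-RPC, and SOAP.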
