Python Web Services (15), Continuously Updated

Article link: https://blog.csdn.net/weixin_30605469/article/details/112838830

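This post covers web programming in Python. It introduces screen scraping, points out the drawbacks of scraping with regular expressions, and presents alternatives such as Tidy and the Beautiful Soup library. It also explains how to create dynamic web pages with CGI, including preparing the server, adding the pound bang line, and setting file permissions, and touches on the mod_python extension, web application frameworks, and web services.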


Screen scraping

from urllib import urlopen
import re

p = re.compile('<h3><a .*?><a .*? href="(.*?)">(.*?)</a>')
text = urlopen('http://python.org/community/jobs').read()
for url, name in p.findall(text):
    print '%s (%s)' % (name, url)

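This program has the following drawbacks:

The regular expression is not easy to read.

It does not handle HTML peculiarities such as CDATA sections and character entities (for example &amp;).

The regular expression is tied to the details of the HTML source code.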
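
To make the last two drawbacks concrete, here is a minimal sketch in Python 3 (the listing above targets Python 2). The sample markup, the simplified pattern, and the helper name extract are assumptions chosen for illustration; they are not the real python.org page or the pattern used above.

```python
import re
from html import unescape

# Hypothetical sample markup, not taken from a real page.
SAMPLE = '<h3><a href="/jobs/1">R&amp;D Engineer</a></h3>'
REORDERED = '<h3><a class="x" href="/jobs/1">R&amp;D Engineer</a></h3>'

# Simplified pattern: it expects href to be the first attribute of <a>.
LINK = re.compile(r'<h3><a href="(.*?)">(.*?)</a>')

def extract(html):
    """Return (name, url) pairs found by the regular expression."""
    return [(name, url) for url, name in LINK.findall(html)]

print(extract(SAMPLE))                  # the entity is left undecoded
print(extract(REORDERED))               # one extra attribute defeats the pattern
print(unescape(extract(SAMPLE)[0][0]))  # decoding has to be done by hand
```

The second call returns an empty list even though the link is still there, which is what "tied to the details of the HTML source code" means in practice.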

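There are two main solutions to these problems:

Have the program call Tidy (Python wrappers are available) and parse the resulting XHTML.

Use the Beautiful Soup library, which is designed specifically for screen scraping.

There are other tools as well, such as scrape.py.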

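Tidy and XHTML parsing

Tidy is a tool for fixing ill-formed and sloppy HTML documents. Tidy cannot fix every problem in an HTML file, but it does make sure that the result is well-formed (that is, that all elements are properly nested).

Tidy-related tools

tidy: the original Tidy, written in C

utidy: a Python wrapper library, rather old

mxtidy: a Python wrapper library, rather old, supports Python only up to 2.5

jtidy: Tidy written in Java

tidy-html5: Tidy written in C, with HTML5 support

npp-tidy2: a Tidy plugin for the Notepad++ editor

Installing on Windows: download the zip archive from http://binaries.html-tidy.org/ and extract it, then move tidy.exe into the same directory as your program.

Python can then run the tidy program through Popen from the subprocess module.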

messy.html

Pet Shop

Complaints

There is no way at allwe can accetp returned parrots.

Dead pets

Our pets may tend to rest at times. but rarely die within the warranty period.

News

We have just received a really nice parrot.

It's really nice

The Norwegian Blue

Plumage andpining behavior

More information

Features:

Beautiful plumage

tidy_test.py

from subprocess import Popen, PIPE

text = open('messy.html').read()

tidy = Popen('tidy', stdin=PIPE, stdout=PIPE, stderr=PIPE)

tidy.stdin.write(text)

tidy.stdin.close()

print tidy.stdout.read()

Output

"HTML Tidy for HTML5 for Windows version 5.2.0">

Pet Shop

Complaints

There is no way at allwe can accetp

returned parrots.

Dead pets

Our pets may tend to rest at times. but rarely die within the

warranty period.

News

We have just received a really nice parrot.

It's really nice

The Norwegian Blue

Plumage and

pining behavior

More information

Features:

Beautiful plumage

This is effectively the same as running the command in a shell:

D:\python_basic_course\course_15>tidy messy.html

line 1 column 1 - Warning: missing declaration
line 1 column 1 - Warning: inserting implicit
line 1 column 1 - Warning: missing before
line 2 column 1 - Warning: missing before
line 2 column 15 - Warning: discarding unexpected
line 4 column 19 - Warning: replacing unexpected b with
line 4 column 29 - Warning: inserting implicit
line 6 column 5 - Warning: missing before
line 8 column 4 - Warning: inserting implicit
line 10 column 1 - Warning: missing before
line 8 column 4 - Warning: missing before
line 10 column 8 - Warning: inserting implicit
line 10 column 17 - Warning: discarding unexpected
line 12 column 26 - Warning: missing before
line 14 column 4 - Warning: inserting implicit
line 14 column 30 - Warning: isn't allowed in elements
line 14 column 26 - Info: previously mentioned
line 15 column 9 - Warning: isn't allowed in elements
line 14 column 61 - Info: previously mentioned
line 16 column 5 - Warning: is probably intended as
line 17 column 22 - Warning: discarding unexpected
line 17 column 29 - Warning: isn't allowed in elements
line 1 column 1 - Info: previously mentioned
line 17 column 29 - Warning: inserting implicit
line 17 column 29 - Warning: missing
line 1 column 1 - Warning: inserting missing 'title' element
line 10 column 1 - Warning: trimming empty
Info: Document content looks like HTML5

Tidy found 24 warnings and 0 errors!

"HTML Tidy for HTML5 for Windows version 5.2.0">

Pet Shop

Complaints

There is no way at allwe can accetp

returned parrots.

Dead pets

Our pets may tend to rest at times. but rarely die within the

warranty period.

News

We have just received a really nice parrot.

It's really nice

The Norwegian Blue

Plumage and

pining behavior

More information

Features:

Beautiful plumage

About HTML Tidy: https://github.com/htacg/tidy-html5

Bug reports and comments: https://github.com/htacg/tidy-html5/issues

Official mailing list: https://lists.w3.org/Archives/Public/public-htacg/

Latest HTML specification: http://dev.w3.org/html5/spec-author-view/

Validate your HTML documents: http://validator.w3.org/nu/

Lobby your company to join the W3C: http://www.w3.org/Consortium

Do you speak a language other than English, or a different variant of

English? Consider helping us to localize HTML Tidy. For details please see

https://github.com/htacg/tidy-html5/blob/master/README/LOCALIZE.md

D:\python_basic_course\course_15>
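
tidy_test.py writes to stdin and then reads stdout while stderr is also connected to a pipe. A common way to express the same idea in Python 3 is communicate(), which feeds the input and collects both output streams in one call. The sketch below is only an illustration: the helper name run_filter is made up, and the UPPERCASE command is a stand-in so that the example runs on a machine without Tidy installed.

```python
import subprocess
import sys

def run_filter(command, text):
    """Feed text to the command's stdin; return (stdout, stderr, exit code)."""
    proc = subprocess.Popen(
        command,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,  # exchange str instead of bytes
    )
    # communicate() writes the input and drains both output pipes together,
    # so a long list of warnings on stderr cannot block the child process.
    out, err = proc.communicate(text)
    return out, err, proc.returncode

# Stand-in filter (an assumption for the demo): upper-cases its input.
UPPERCASE = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]

if __name__ == "__main__":
    out, err, code = run_filter(UPPERCASE, "<h1>Pet Shop")
    print(out)
```

With Tidy on the PATH, calling run_filter(['tidy'], text) should return the cleaned markup as the first element and the warnings shown in the shell session as the second.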

XHTML

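The main difference between XHTML and older forms of HTML is that XHTML is stricter about explicitly closing all elements.

For example, in HTML you might end one paragraph and begin the next with nothing more than an opening <p> tag, whereas in XHTML you must explicitly close the current paragraph first (with </p>).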
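
The difference can be checked with any XML parser, since XHTML has to be well-formed XML. The sketch below uses Python 3's xml.etree.ElementTree as a stand-in for an XHTML parser; the helper name is_well_formed and the two sample strings are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# HTML style: the second <p> implicitly ends the first one.
html_style = "<div><p>First paragraph<p>Second paragraph</div>"
# XHTML style: every paragraph is closed explicitly.
xhtml_style = "<div><p>First paragraph</p><p>Second paragraph</p></div>"

print(is_well_formed(html_style))   # False
print(is_well_formed(xhtml_style))  # True
```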

The XHTML produced by Tidy can then be parsed with HTMLParser.

HTMLParser

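Using HTMLParser means subclassing it and overriding event-handling methods such as handle_starttag or handle_data.

HTMLParser callback methods

handle_starttag(tag, attrs): called when a start tag is found. attrs is a sequence of (name, value) pairs.

handle_startendtag(tag, attrs): called for empty tags. By default it handles the start and end tags separately.

handle_endtag(tag): called when an end tag is found.

handle_data(data): called for textual data.

handle_charref(ref): called for character references of the form &#ref;.

handle_entityref(name): called for entity references of the form &name;.

handle_comment(data): called for comments, with only the comment contents.

handle_decl(decl): called for declarations of the form <!...>.

handle_pi(data): called for processing instructions.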
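
To see when each callback fires, the following Python 3 sketch records events instead of acting on them. In Python 3 the class lives in html.parser, and its default setting decodes references and passes them to handle_data, so convert_charrefs=False is passed here to keep handle_entityref and handle_charref in play. The names EventLogger and log_events are made up for this example.

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Record every parser event as a tuple, in the order it occurs."""

    def __init__(self):
        # convert_charrefs=False keeps the reference callbacks active.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_entityref(self, name):
        self.events.append(('entity', name))

    def handle_charref(self, name):
        self.events.append(('charref', name))

    def handle_comment(self, data):
        self.events.append(('comment', data))

def log_events(html):
    """Parse html and return the list of recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events

for event in log_events('<p class="x">a &amp; b</p><!-- hi -->'):
    print(event)
```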

htmlparser_test.py

#coding: utf-8
from urllib import urlopen
from HTMLParser import HTMLParser

class Scraper(HTMLParser):
    in_h3 = False
    in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h3':
            self.in_h3 = True
        if tag == 'a' and 'href' in attrs:
            self.in_link = True
            self.chunks = []
            self.url = attrs['href']

    def handle_data(self, data):
        # Collect text only while inside a link
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'h3':
            self.in_h3 = False
        if tag == 'a':
            if self.in_h3 and self.in_link:
                string = ''.join(self.chunks)
                # Normalize the link text to UTF-8 before printing
                if isinstance(string, unicode):
                    string = string.encode("utf8")
                else:
                    string = unicode(string, "gb2312")
                    string = string.encode("utf8")
                print '%s (%s)' % (string, self.url)
            self.in_link = False

try:
    response = urlopen("http://www.qq.com/")
except Exception as e:
    print "Error: problem while downloading the page: " + str(e)
    raise SystemExit(1)  # stop here, there is no response to parse
if response.code != 200:
    print "Error: expected status code 200 but got " + str(response.code)
text = response.read()
parser = Scraper()
parser.feed(text)
parser.close()

Output

台风“妮妲”登陆广东大树被连根拔起 (http://news.qq.com/a/20160802/003497.htm)

航拍张家界玻璃栈道绝壁凌空令人头晕目眩 (http://news.qq.com/a/20160802/006341.htm#p=1)

疯狂敛财10亿的“心灵培训班”，别无人监管 (http://view.news.qq.com/original/intouchtoday/n3605.html)

中国高铁盈利地图：东部赚翻中西部巨亏 (http://finance.qq.com/a/20160802/005982.htm)

20万多买奥迪Q3 讴歌CDX竞品SUV最高降6万 (http://auto.qq.com/a/20160802/004725.htm)

美国男篮44分狂胜尼日利亚奥运热身5战净胜215分 (http://sports.qq.com/nba/)

央视：房价未来怎么走？看完这个就明白了 (http://news.house.qq.com/)

韦德遭热火怠慢詹皇抱不平称韦德是热火科比 (http://sports.qq.com/a/20160802/005100.htm)

霍建华林心如返台准备归宁宴走商务通道避媒体 (http://ent.qq.com/a/20160802/006231.htm#p=1)

企鹅智酷：魏则西事件后，网民如何看网上就医？ (http://tech.qq.com/a/20160503/006393.htm#p=1)

iPhone指纹扫描弱爆了，LG把指纹做到了屏幕里 (http://tech.qq.com/a/20160503/002791.htm)

这才是中国的奢侈品，惊艳上千年！ (http://cul.qq.com/a/20160802/004627.htm#p=1)

全都输范冰冰？但比腿我站张馨予 (http://fashion.qq.com/visual/photo.shtml)

张檬又变脸了，这次真的认不出！ (http://fashion.qq.com/a/20160802/008502.htm#p=1)

遭遇恐怖袭击怎么办？一个听天由命者的视角 (http://dajia.qq.com/)

星运365 8月2日12星座运势哪个星座运势最差 (http://astro.fashion.qq.com)

星座控：从南北交点探寻你的前世今生(上) (http://astro.fashion.qq.com/original/constellationControl/NBJD.html)

点赞！河南双腿瘫痪高考生被武大录取 (http://edu.qq.com/photo/)

暑期充电助你变身学霸 (http://edu.qq.com/class/onecourse/shujiaxuexi.htm)

两只考拉树上打架一只被打哭哈哈哈！ (http://v.qq.com/cover/a/aekiwhvmdhhwa23/k00206jkl9t.html)

中国海军和平方舟医院船凯旋而归 (http://news.qq.com/a/20160127/011493.htm#p=1)

CCTV称直播行业烧钱曝小智1.2亿被挖！ (http://games.qq.com/a/20160802/000757.htm)

Sky李晓峰晒魔兽选手聚会网友看哭：都是青春 (http://games.qq.com/a/20160802/001398.htm)

压力山大的现代人你可以试着用佛法减压 (http://foxue.qq.com/)

净慧长老：《心经》里的一个“心”字奥义无穷 (http://rufodao.qq.com/a/20160801/023724.htm)

存在：84岁老爹和他13岁的娃 (http://gongyi.qq.com/original/exist/oldfatherinfamily.html)
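
htmlparser_test.py is written for Python 2 and depends on the live qq.com page. As a sketch of the same technique on Python 3, the version below returns the links as a list and runs against an inline sample. The SAMPLE markup, the example.com URLs, and the scrape helper are assumptions for illustration only.

```python
from html.parser import HTMLParser

class Scraper(HTMLParser):
    """Collect (text, url) for every link that sits inside an <h3>."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.in_link = False
        self.chunks = []
        self.url = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h3':
            self.in_h3 = True
        if tag == 'a' and 'href' in attrs:
            self.in_link = True
            self.chunks = []
            self.url = attrs['href']

    def handle_data(self, data):
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'h3':
            self.in_h3 = False
        if tag == 'a':
            if self.in_h3 and self.in_link:
                self.links.append((''.join(self.chunks), self.url))
            self.in_link = False

# Hypothetical sample page: two headline links and one ordinary link.
SAMPLE = """
<h3><a href="http://example.com/a">First &amp; best</a></h3>
<p><a href="http://example.com/skip">not in a heading</a></p>
<h3><a href="http://example.com/b">Second</a></h3>
"""

def scrape(html):
    """Return the headline links found in html."""
    parser = Scraper()
    parser.feed(html)
    parser.close()
    return parser.links

for text, url in scrape(SAMPLE):
    print('%s (%s)' % (text, url))
```

Python 3 strings are already Unicode, so the gb2312 and utf8 conversions of the original are not needed in this sketch.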

Beautiful Soup

beautifulsoup_test.py

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

text = urlopen("http://www.qq.com/").read()
soup = BeautifulSoup(text)

jobs = set()
for header in soup('h3'):
    links = header('a', 'reference')
    if not links: continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))

print '\n'.join(sorted(jobs, key=lambda s: s.lower()))

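Creating dynamic web pages with CGI

CGI stands for Common Gateway Interface.

Step 1: Preparing the web server

CGI programs must be placed in a directory that is accessible via the web, and they must be identified as CGI scripts so that the web server does not serve the plain source code as a web page. Common ways to do this:

Put the script in a subdirectory called cgi-bin.

Give the script file the .cgi extension.

If you are using Apache, you need to turn on the ExecCGI option for the directory.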

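Step 2: Adding the pound bang line

Once the script is in the right place, you must add a pound bang line at the very start of the script. Without it, the web server will not know how to execute the script.

(Scripts can also be written in other languages, such as Perl or Ruby.) For Python, simply add the following at the start of the script:

#!/usr/bin/env python

Note that it must be the very first line, with no blank lines before it. If that does not work, find the exact location of the Python executable and use it instead:

#!/usr/bin/python

If it still does not work, make sure the line ends with \n rather than \r\n, that is, that the file is saved as a UNIX-style text file.

On Windows:

#!C:\Python22\python.exe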
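
Both requirements (the pound bang line comes first, and it ends with \n) can be checked mechanically. The following Python 3 sketch does that; the function names read_first_line and shebang_problems are made up for this example.

```python
def read_first_line(path):
    """Read the first line of a file as bytes, line ending included."""
    with open(path, 'rb') as f:  # binary mode keeps \r\n intact
        return f.readline()

def shebang_problems(first_line):
    """Return a list of problems found in a script's first line."""
    problems = []
    if not first_line.startswith(b'#!'):
        problems.append('first line is not a #! line')
    if first_line.endswith(b'\r\n'):
        problems.append('line ends with \\r\\n (Windows-style line ending)')
    return problems

print(shebang_problems(b'#!/usr/bin/env python\n'))
print(shebang_problems(b'#!/usr/bin/env python\r\n'))
print(shebang_problems(b'\n#!/usr/bin/env python\n'))
```

For a real script, shebang_problems(read_first_line('somescript.cgi')) would report an empty list when the first line is fine.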

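Step 3: Setting the file permissions

Make sure that everyone can read and execute the script file, and that only you can write to it.

Sometimes you edit a script on Windows while it is stored on a UNIX disk server (accessed through Samba or FTP, for example), and the file permissions may get messed up after you change the file. So if a script will not run, check that its permissions are still correct.

The UNIX command for changing file permissions (or file mode) is chmod. Simply run the following command:

chmod 755 somescript.cgi

Usage: chmod XXX filename

XXX is one digit each for the owner, the group, and other users. Each digit is the sum of the following values (so 7 = 4 + 2 + 1 and 5 = 4 + 1):

4: read permission

2: write permission

1: execute permission

Commonly used permission commands:

sudo chmod 600 filename (only the owner can read and write)

sudo chmod 644 filename (the owner can read and write; the group and other users can only read)

sudo chmod 700 filename (only the owner can read, write, and execute)

sudo chmod 666 filename (everyone can read and write)

sudo chmod 777 filename (everyone can read, write, and execute)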
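
The digit arithmetic can be spelled out in a few lines of Python 3. The helper name describe_mode is made up for this sketch; the last line cross-checks the result against the standard library's stat.filemode.

```python
import stat

def describe_mode(mode_digits):
    """Turn a three-digit chmod string such as '755' into 'rwxr-xr-x'."""
    result = []
    for digit in mode_digits:  # owner, group, others
        value = int(digit, 8)
        result.append('r' if value & 4 else '-')  # 4 = read
        result.append('w' if value & 2 else '-')  # 2 = write
        result.append('x' if value & 1 else '-')  # 1 = execute
    return ''.join(result)

for digits in ('755', '644', '600'):
    print(digits, describe_mode(digits))

# stat.filemode adds a leading file-type character ('-' for a regular file).
assert '-' + describe_mode('755') == stat.filemode(stat.S_IFREG | 0o755)
```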

If it is still unclear how to set up the server, see the following articles:

http://koda.iteye.com/blog/556393

http://8796902.blog.51cto.com/8786902/1560549

http://www.111cn.net/sys/Windows/63254.htm

Once the server is set up, you can test it directly by visiting:

http://localhost/cgi-bin/test.py

test.py

#!D:\Python27\python.exe

print 'Content-type: text/html'

print #Prints an empty line, to end the headers

print 'Hello, world2222!'

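Note that the 'Content-type: text/html' header must be followed by a blank line (that is, two consecutive newlines) before the body begins. That is why the bare print in the example above is required; without it the server reports an error.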

http://soige.blog.51cto.com/512568/325409
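
The header rule can be captured in a small helper. This is a Python 3 sketch, with cgi_response as a made-up name; it builds the complete output as one string so that the blank line cannot be forgotten.

```python
import sys

def cgi_response(body, content_type='text/html'):
    """Build CGI output: the header line, one blank line, then the body."""
    # '\n\n' ends the header line and adds the empty line that closes
    # the header section.
    return 'Content-type: %s\n\n%s\n' % (content_type, body)

if __name__ == '__main__':
    sys.stdout.write(cgi_response('Hello, world!'))
```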

Debugging with cgitb

#!D:\Python27\python.exe
import cgitb
cgitb.enable()

print 'Content-type: text/html'
print

print 1/0
print 'Hello,Python'

Note that you should turn off cgitb once development is finished, because the traceback page is not meant for ordinary users of the program.
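
One possible replacement once cgitb is off is to log the traceback and show visitors a generic page. The Python 3 sketch below uses only the traceback module; run_safely and broken_page are hypothetical names, and the StringIO object merely stands in for the server's error log.

```python
import io
import sys
import traceback

def run_safely(handler, log=None):
    """Run handler(); on failure, log the traceback and return a generic page."""
    log = sys.stderr if log is None else log
    try:
        return handler()
    except Exception:
        # Full details go to the log, never to the visitor.
        traceback.print_exc(file=log)
        return ('Content-type: text/html\n\n'
                'Sorry, something went wrong.\n')

def broken_page():
    # Same deliberate error as in the cgitb example.
    return 'Content-type: text/html\n\n%d\n' % (1 / 0)

log = io.StringIO()  # stands in for the server's error log
page = run_safely(broken_page, log)
print(page)
```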

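Using the cgi module

An HTML form supplies key-value pairs, or fields, to the CGI script. You retrieve these fields in the script with the FieldStorage class. When you create a FieldStorage instance (you should create only one), it fetches the input variables from the request and makes them available to the program through a dictionary-like interface.

Even if you know that the request contains a value named name, you should not do this:

form = cgi.FieldStorage()
name = form['name']

Do this instead:

form = cgi.FieldStorage()
name = form['name'].value

You can also do this, which supplies a default value when the field is missing:

form.getvalue('name', 'Unknown')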
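
The cgi module was removed from the standard library in Python 3.13, so on current Python the same lookup-with-default has to be done another way. The sketch below uses urllib.parse.parse_qs as a stand-in; it is not the FieldStorage API, it only covers query strings, and get_value is a made-up helper that returns the first value for a key.

```python
import os
from urllib.parse import parse_qs

def get_value(query_string, key, default=None):
    """Return the first value for key, or default if the key is absent."""
    values = parse_qs(query_string).get(key)
    return values[0] if values else default

if __name__ == '__main__':
    # For GET requests a web server passes the part after '?' in QUERY_STRING.
    query = os.environ.get('QUERY_STRING', '')
    print(get_value(query, 'name', 'Unknown'))
```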

Example

#!D:\Python27\python.exe
import cgi, cgitb
cgitb.enable()

form = cgi.FieldStorage()
name = form.getvalue('name', 'Python')

print 'Content-type: text/html'
print

print 'Hello,%s!' % name

Example call: http://localhost/cgi-bin/test.py?name=Java&age=12

>>> import urllib
>>> urllib.urlencode({'name':'c++','age':'23'}
... )
'age=23&name=c%2B%2B'
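
In Python 3 the function moved to urllib.parse. A short sketch of the equivalent call follows; the variable names are arbitrary.

```python
from urllib.parse import urlencode, parse_qs

params = {'name': 'c++', 'age': '23'}
query = urlencode(params)  # '+' is escaped as %2B
print(query)

# parse_qs reverses the encoding; every value comes back as a list.
print(parse_qs(query))
```

Python 3.7 and later keep dictionaries in insertion order, so name comes before age here, unlike in the Python 2 session above.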

A greeting script with an HTML form

#!D:\Python27\python.exe
import cgi, cgitb
cgitb.enable()

form = cgi.FieldStorage()
name = form.getvalue('name', 'Python')

print 'Content-type: text/html'
print

print '''
Greeting Page
Hello,%s!
Change name
''' % name

Here test.py could equally be named test.cgi.
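
As a rough Python 3 sketch of what such a greeting page could look like: the HTML tags, the form action test.py, and the helper name greeting_response below are all assumptions for illustration, and the name is passed in as a plain argument instead of being read through the cgi module.

```python
from html import escape

# Assumed page layout: a title, a greeting, and a form that posts back
# to the same script with a text field called 'name'.
PAGE = """<html>
  <head><title>Greeting Page</title></head>
  <body>
    <h1>Hello, %s!</h1>
    <form action="test.py">
      Change name <input type="text" name="name" />
      <input type="submit" />
    </form>
  </body>
</html>"""

def greeting_response(name='Python'):
    """Return the full CGI output for the greeting page."""
    # escape() keeps a name such as '<b>' from being read as markup.
    return 'Content-type: text/html\n\n' + PAGE % escape(name)

if __name__ == '__main__':
    print(greeting_response('Java'))
```

Escaping the name is a choice made in this sketch; a script that echoes form input without it lets visitors inject markup into the page.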

mod_python

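mod_python is an extension to the Apache web server that embeds the Python interpreter directly in Apache. It lets you write Apache handlers in Python rather than in C, which is the norm. The mod_python handler framework gives access to a rich API that reaches deep into Apache's internals.

CGI handler: lets you run CGI scripts in the mod_python interpreter, which speeds up execution considerably.

PSP handler: lets you mix HTML and Python code to create executable web pages, known as Python Server Pages.

Publisher handler: lets you call Python functions through URLs.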