Web Programming in Python (Part 1): Screen Scraping

Siri_only

Tags: python, regular expressions, web, html, html5

Article link: https://blog.csdn.net/qq_45416295/article/details/105662656

1. Parsing with Tidy

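Tidy is a tool for repairing ill-formed, sloppy HTML.
Tidy is cross-platform; rather than setting it up on Windows, I repaired my HTML on a different operating system.
Once you have a binary version, you can run the Tidy program with the subprocess module (or any other module that provides a popen function).
So I tried it out on a Debian-based system: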

Enter the following in a terminal:

sudo apt install tidy

The installation completed successfully.
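Next, write the program in VS Code. It repairs an HTML text that has missing pieces. Install the Python extension from the marketplace beforehand (this is simple, so I will skip it), create a new file, and enter the following code: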

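#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from subprocess import Popen, PIPE

# Read the ill-formed HTML that Tidy should repair.
with open("/lastore/test.txt", encoding="utf-8") as f:
    text = f.read()

# Start Tidy with pipes attached to its standard streams.
tidy = Popen('tidy', stdin=PIPE, stdout=PIPE, stderr=PIPE)

# communicate() writes the input, closes stdin, and reads stdout and
# stderr to the end, so a full stderr pipe cannot block Tidy.
out, err = tidy.communicate(text.encode())

print(out.decode())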

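Save the file as "test.py". To understand this code, it helps to know how Linux handles input and output streams: the script writes the HTML to Tidy's standard input and reads the repaired document from its standard output.
You also need to create the HTML file at the path used in the script. It is deliberately incomplete, for testing: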

<h1>Pet Shop
<h2>Complaints</h3>
<p>There is <b>no <i>way</b> at all</i> we can accept returned
parrots.
<h1><i>Dead Pets</h1>
<p>Our pets may tend to rest at times, but rarely die within the
warranty period.
<i><h2>News</h2></i>
<p>We have just received <b>a really nice parrot.
<p>It's really nice.</b>
<h3><hr>The Norwegian Blue</h3>
<h4>Plumage and <hr>pining behavior</h4>
<a href="#norwegian-blue">More information<a>
<p>Features:
<body>
<li>Beautiful plumage

Run the program and the repaired HTML appears in the debug console. Pretty neat, isn't it?
The repaired document is as follows:

<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.2.0">
<title></title>
</head>
<body>
<h1>Pet Shop</h1>
<h2>Complaints</h2>
<p>There is <b>no <i>way</i></b> <i>at all</i> we can accept
returned parrots.</p>
<h1><i>Dead Pets</i></h1>
<p><i>Our pets may tend to rest at times, but rarely die within the
warranty period. </i></p>
<h2><i>News</i></h2>
<p>We have just received <b>a really nice parrot.</b></p>
<p><b>It's really nice.</b></p>
<hr>
<h3>The Norwegian Blue</h3>
<h4>Plumage and</h4>
<hr>
<h4>pining behavior</h4>
<a href="#norwegian-blue">More information</a>
<p>Features:</p>
<ul>
<li>Beautiful plumage</li>
</ul>
</body>
</html>

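The Tidy call from this section can also be wrapped in a small function. What follows is only a sketch, not the method used above: it swaps Popen for subprocess.run, and the names run_filter and tidy_html are hypothetical helpers introduced for illustration. The flags -q (quiet), -asxhtml (XHTML output) and -utf8 are standard Tidy command-line options, and the exit-code handling assumes Tidy's documented convention of 0 for success, 1 for warnings and 2 for errors.

```python
import subprocess


def run_filter(command, text):
    """Send text to the command's stdin; return (exit_code, stdout, stderr)."""
    result = subprocess.run(
        command,
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return (
        result.returncode,
        result.stdout.decode("utf-8"),
        result.stderr.decode("utf-8"),
    )


def tidy_html(text):
    """Repair HTML with the tidy binary (assumed to be on PATH)."""
    # Assumption: exit code 2 means Tidy hit errors it could not fix;
    # codes 0 and 1 (warnings only) still produce usable output.
    code, out, err = run_filter(["tidy", "-q", "-asxhtml", "-utf8"], text)
    if code >= 2:
        raise RuntimeError("tidy failed:\n" + err)
    return out
```

With Tidy installed, tidy_html("<h1>Pet Shop") would return a complete XHTML document. Keeping run_filter separate makes it possible to try the pipe handling with any other program that reads stdin and writes stdout.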

2. Parsing XHTML

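XHTML is very strict and requires every element to be closed explicitly (at least as far as our current purposes are concerned).
In HTML you can end the current paragraph simply by starting another one (with the <p> tag), but in XHTML you must first end the current paragraph explicitly (with the </p> tag).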
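To make the difference concrete, here is a minimal sketch that checks two fragments for well-formedness. It uses the standard library's xml.etree.ElementTree as a stand-in for a strict XML parser; the helper is_well_formed and both sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# HTML style: the second <p> implicitly ends the first one.
html_style = "<div><p>First paragraph<p>Second paragraph</div>"
# XHTML style: every <p> is closed explicitly.
xhtml_style = "<div><p>First paragraph</p><p>Second paragraph</p></div>"

print(is_well_formed(html_style))   # False
print(is_well_formed(xhtml_style))  # True
```

A strict parser rejects the first fragment because </div> arrives while a <p> is still open, which is exactly the kind of problem Tidy repairs.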
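To parse the well-formed XHTML that Tidy generates, use the HTMLParser class from html.parser.
Using HTMLParser means subclassing it and overriding its event-handling methods, such as handle_starttag and handle_data.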
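To see when each handler is called, the following sketch records the events the parser fires for a one-line document. The class name EventLogger and the sample markup are assumptions made for illustration; handle_starttag, handle_endtag and handle_data are the real HTMLParser callbacks.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every start tag, end tag and text chunk the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples.
        self.events.append(('start', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        # Skip chunks that contain only whitespace.
        if data.strip():
            self.events.append(('data', data))


logger = EventLogger()
logger.feed('<p>See <a href="/jobs/1/">Job one</a>.</p>')
logger.close()

for event in logger.events:
    print(event)
```

The text of the paragraph arrives in three separate handle_data calls ('See ', 'Job one' and '.'), which is why a scraper has to collect chunks rather than rely on a single call.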

A screen-scraping program that uses the HTMLParser module

#!/usr/bin/env python 
# -*- coding:utf-8 -*-

from urllib.request import urlopen
from html.parser import HTMLParser


def isjob(url):
    try:
        a, b, c, d = url.split('/')
    except ValueError:
        return False
    return a == d == '' and b == 'jobs' and c.isdigit()


class Scraper(HTMLParser):
    # Boolean flag that tracks whether we are inside a relevant link
    in_link = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (key, value) tuples; convert it to a dict
        attrs = dict(attrs)

        url = attrs.get('href', '')
        if tag == 'a' and isjob(url):
            self.url = url
            self.in_link = True
            self.chunks = []

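    def handle_data(self, data):
        """
        A single call to handle_data may not deliver all of the text we need.
        Assume the text comes in several chunks, so handle_data is called
        several times (because of buffering, character entities, ignored
        markup and so on), and make sure every chunk is collected.
        To output the result (in handle_endtag), all chunks are joined together.
        To run the parser, call its feed method with text as the argument,
        then call its close method.
        """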
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_link:
            print('{} ({})'.format(''.join(self.chunks), self.url))
            self.in_link = False


text = urlopen('http://python.org/jobs').read().decode()
parser = Scraper()
parser.feed(text)
parser.close()
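The scraper can be tried without a network connection by feeding it a string. The sketch below is a variant of the program above, not a replacement for it: CollectingScraper stores its results in a list instead of printing them, and the SAMPLE markup with its job ID 1234 is invented for illustration.

```python
from html.parser import HTMLParser


def isjob(url):
    # A job link looks like /jobs/<digits>/ and splits into exactly 4 parts.
    try:
        a, b, c, d = url.split('/')
    except ValueError:
        return False
    return a == d == '' and b == 'jobs' and c.isdigit()


class CollectingScraper(HTMLParser):
    """Like Scraper, but collects (title, url) pairs in self.jobs."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.jobs = []

    def handle_starttag(self, tag, attrs):
        url = dict(attrs).get('href', '')
        if tag == 'a' and isjob(url):
            self.url = url
            self.in_link = True
            self.chunks = []

    def handle_data(self, data):
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_link:
            self.jobs.append((''.join(self.chunks), self.url))
            self.in_link = False


# Hypothetical sample page: only the first link matches /jobs/<digits>/.
SAMPLE = '''
<ul>
<li><a href="/jobs/1234/">Python <b>Developer</b></a></li>
<li><a href="/jobs/">All jobs</a></li>
<li><a href="/about/">About</a></li>
</ul>
'''

scraper = CollectingScraper()
scraper.feed(SAMPLE)
scraper.close()
print(scraper.jobs)  # [('Python Developer', '/jobs/1234/')]
```

The title is assembled from two chunks, 'Python ' and 'Developer', because the <b> tag splits the text. The links /jobs/ and /about/ are rejected by isjob since they do not split into four parts.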

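I ran this code in a Linux virtual machine, where it seemed to run slightly faster than on Windows. The program prints each job title followed by its URL in parentheses.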
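The URLs the scraper prints are relative paths such as /jobs/1234/. If absolute links are needed, one option is urllib.parse.urljoin from the standard library. In this small sketch, BASE_URL, the helper name absolute and the job ID are assumptions for illustration, not part of the program above.

```python
from urllib.parse import urljoin

# Assumption: the page that was scraped lives at this address.
BASE_URL = 'https://www.python.org/jobs/'


def absolute(link):
    """Resolve a link found on the page against the page's own URL."""
    return urljoin(BASE_URL, link)


print(absolute('/jobs/1234/'))  # https://www.python.org/jobs/1234/
```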
Links:
1. HTML Tidy manual in Chinese, packed with useful material
2. Tidy website: http://www.html-tidy.org/

If you found this post useful, please give it a like. If you spot any mistakes, corrections are very welcome.
