Python 3 HTML Parser

金金2019

Tags: python, html, HTML parser, encoding, import, c#

Article link: https://blog.csdn.net/dove1980/article/details/7000232

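The HTML parser is one of the more important parts of a web crawler:

Python 3 ships with one of its own (html.parser). I tried it, and it is not very pleasant to use yet.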

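Third-party options:

None of them support Python 3 yet, and there is still not much Python 3 material around.

BeautifulSoup -- supports Python 2.x and is very good. I do not want to downgrade, so I will use it once it supports Python 3.

Chilkat Spider component -- a library built specifically for crawlers. At a glance it looks decent.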

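Using a library would defeat the original purpose of this project, so I am dropping that idea.

What remains is to work through the packet structure and logic step by step.

I really need to put in the effort.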

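This weekend I will release the first C# version, an upgrade based on version 2.

As for the Python 2 version, I plan to rewrite it instead of upgrading the old code.

I hope this round of upgrades gives me a fresh understanding of Python.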

# Parsing hypertext this way is too hard; I will reconsider it later.
# For now there is no library that works well.
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # print("Encountered a {} test".format(tag))
        # print(attrs)

        # Level and location links of the current resource fields
        # if tag == 'area':
        #     print("Encountered a {} start tag".format(tag))
        #     # print(attrs)

        # if tag == 'res':
        #     print(attrs)

        # if tag == 'tbody':
        #     print(tag)
        #     print(attrs)

        if tag == 'l1':
            print(tag)
            print(attrs)

        # Hourly production
        if tag == 'li':
            print(tag)
            print(attrs)

    def handle_endtag(self, tag):
        if tag == 'area':
            print("Encountered a {} end tag".format(tag))
        if tag == 'res':
            # End tags carry no attributes, so only the tag name is printed.
            print(tag)


# 1 page = """<html><h1>Title</h1><p>I'm a paragraph!</p></html>"""

# 2 ============== test code start ==============
with open('temp.html', encoding='utf-8') as file:
    p = file.read()
# 2 ============== test code end ==============

myparser = MyHTMLParser()
myparser.feed(p)
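
Printing from inside the handlers is fine for poking around, but the crawler will eventually need the values themselves. A minimal sketch of one way to do that with the same html.parser module: collect the attributes of the tags of interest into a list instead of printing them. The names TagCollector, wanted, found, and the sample markup are all made up for illustration and are not part of the original script.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record (tag, attributes) for selected start tags instead of printing."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)  # tag names to record (hypothetical parameter)
        self.found = []            # list of (tag, dict of attributes)

    def handle_starttag(self, tag, attrs):
        # html.parser passes attrs as a list of (name, value) tuples
        if tag in self.wanted:
            self.found.append((tag, dict(attrs)))


# Hypothetical sample markup, used only for illustration
sample = (
    '<ul>'
    '<li class="r1" title="120">Wood</li>'
    '<li class="r2" title="95">Clay</li>'
    '</ul>'
)

collector = TagCollector(["li"])
collector.feed(sample)
collector.close()

for tag, attrs in collector.found:
    print(tag, attrs)
```

With the results kept in collector.found, the rest of the program can filter or convert them (for example, turning a title value into a number) without touching the parser class again.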
