Extracting content from an HTML file with Python

Latest recommended article published 2022-06-07 15:11:02

weixin_39793553


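# Python 2 only: urllib2 and sgmllib were both removed in Python 3.
import urllib2
from sgmllib import SGMLParser


class ListName(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.is_h4 = ""   # flag: set to 1 while the parser is inside an <h4>
        self.name = []    # text collected from every <h4>

    def start_h4(self, attrs):
        # called when an opening <h4> tag is seen
        self.is_h4 = 1

    def end_h4(self):
        # called when the closing </h4> tag is seen
        self.is_h4 = ""

    def handle_data(self, text):
        # keep text only while inside an <h4>
        if self.is_h4 == 1:
            self.name.append(text)


content = urllib2.urlopen('http://list.taobao.com/browse/cat-0.htm').read()
listname = ListName()
listname.feed(content)
for item in listname.name:
    # the page is GBK-encoded; re-encode as UTF-8 before printing
    print item.decode('gbk').encode('utf8')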

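This is straightforward. The code defines a class named ListName that inherits from SGMLParser and overrides some of its methods. The variable is_h4 is a flag that records whether the parser is currently inside an <h4> tag in the HTML file. While it is, the text inside the tag is appended to the list variable name.

A note on start_h4() and end_h4(): their prototypes in SGMLParser are start_tagname(self, attrs) and end_tagname(self), where tagname is the name of the tag. For example, when the parser encounters <pre> it calls start_pre, and when it encounters </pre> it calls end_pre. attrs holds the tag's attributes, passed in as a list of the form [(attribute, value), (attribute, value), ...].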
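Since sgmllib no longer exists in Python 3, the same flag technique can be sketched with the standard library's html.parser.HTMLParser instead. This is a swapped-in parser, not the one the article uses: it has a single handle_starttag(tag, attrs) hook rather than per-tag start_tagname methods, although attrs still arrives as a list of (attribute, value) pairs. The class name HeadingCollector, its fields, and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text and attributes of every <h4> element (hypothetical name)."""

    def __init__(self):
        super().__init__()
        self.in_h4 = False   # plays the role of is_h4 in the listing
        self.names = []      # text found inside <h4> elements
        self.h4_attrs = []   # attrs of each <h4>, as [(attribute, value), ...]

    def handle_starttag(self, tag, attrs):
        # One generic hook replaces SGMLParser's start_<tagname> methods.
        if tag == "h4":
            self.in_h4 = True
            self.h4_attrs.append(attrs)

    def handle_endtag(self, tag):
        if tag == "h4":
            self.in_h4 = False

    def handle_data(self, data):
        # Keep text only while the flag is set.
        if self.in_h4:
            self.names.append(data)


# Made-up sample markup; the article fetched a live page instead.
SAMPLE_HTML = """
<div>
  <h4 class="cat" id="c1">虚拟票务</h4>
  <p>ignored</p>
  <h4 class="cat">数码市场</h4>
</div>
"""

collector = HeadingCollector()
collector.feed(SAMPLE_HTML)
collector.close()
for name in collector.names:
    print(name)
```

Text outside an <h4>, such as the <p> element in the sample, is skipped because the flag is cleared again in handle_endtag.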

Output:

虚拟票务

数码市场

家电市场

女装市场

男装市场

童装童鞋
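The last loop of the listing decodes each item from GBK and re-encodes it as UTF-8 before printing. In Python 3, where bytes and text are separate types, that round trip might look roughly like the sketch below. The helper name gbk_to_utf8 and the sample bytes are assumptions for illustration. The bytes are built locally rather than downloaded.

```python
# Bytes as they might arrive from a GBK-encoded page (built here for illustration).
raw_items = ["虚拟票务".encode("gbk"), "数码市场".encode("gbk")]


def gbk_to_utf8(raw):
    """Decode GBK bytes to text, then encode that text as UTF-8 bytes."""
    text = raw.decode("gbk")
    return text.encode("utf-8")


utf8_items = [gbk_to_utf8(raw) for raw in raw_items]

# For display, the decoded text is usually enough; print() handles the
# terminal encoding itself.
for raw in raw_items:
    print(raw.decode("gbk"))
```

The explicit re-encoding step only matters when the UTF-8 bytes are written to a file or socket. For printing, the decoded string can be used directly.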
