How to Write a Page Parser for a Python Crawler: Writing a Crawler in Python, Fetching Web Pages and Parsing HTML (Revised)

Latest recommended article published on 2020-12-22 23:07:55

weixin_39953629



1. Fetching the HTML page

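In fact, the most basic page fetch takes only two lines (this is Python 2, where urllib2 is part of the standard library):

import urllib2
content = urllib2.urlopen('http://XXXX').read()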
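For readers on Python 3, where urllib2 has been folded into urllib.request, the same two-line idea might look like the sketch below. The helper names fetch and fetch_text, the timeout value, and the data: URL used for the demo are assumptions made for illustration; they are not part of the original article.

```python
import urllib.request


def fetch(url, timeout=10):
    """Return the raw response body of `url` as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_text(url, encoding="utf-8", timeout=10):
    """Fetch `url` and decode the body, replacing undecodable bytes."""
    return fetch(url, timeout).decode(encoding, errors="replace")


# A data: URL keeps the demo offline; swap in a real http(s) URL when crawling.
demo = fetch_text("data:text/plain,hello")
print(demo)
```

For a GBK-encoded page such as the Taobao listing used later in this article, passing encoding="gb18030" would be the natural choice.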

2. Parsing the HTML

SGMLParser

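Python ships with HTMLParser, SGMLParser and several other parsers in its standard library. I found the former far too awkward to use, so I wrote a sample program with SGMLParser:

import urllib2
from sgmllib import SGMLParser

class ListName(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.is_h4 = ""    # flag: set to 1 while the parser is inside an <h4>
        self.name = []     # text collected from the <h4> tags

    def start_h4(self, attrs):
        # called when an opening <h4> tag is seen
        self.is_h4 = 1

    def end_h4(self):
        # called when the closing </h4> tag is seen
        self.is_h4 = ""

    def handle_data(self, text):
        # keep the text only while inside an <h4>
        if self.is_h4 == 1:
            self.name.append(text)

content = urllib2.urlopen('http://list.taobao.com/browse/cat-0.htm').read()
listname = ListName()
listname.feed(content)
for item in listname.name:
    # the page is GBK-encoded; re-encode as UTF-8 for printing
    print item.decode('gbk').encode('utf8')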

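It is quite simple. The code defines a class named ListName that inherits from SGMLParser. The variable is_h4 acts as a flag for h4 tags in the HTML: while the parser is inside an h4 tag, the text it finds there is appended to the list name. The start_h4() and end_h4() methods follow the general handler pattern in SGMLParser:

start_tagname(self, attrs)

end_tagname(self)

Here tagname is the name of the tag. For example, when the parser meets <pre> it calls start_pre, and when it meets </pre> it calls end_pre. attrs holds the tag's attributes, passed in as a list of the form [(attribute, value), (attribute, value), ...].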

Output:

女装男装

鞋类箱包

内衣配饰

运动户外

珠宝手表

数码

家电办公

护肤彩妆

母婴用品

家居建材

美食特产

日用百货

汽车摩托

文化玩乐

本地生活

虚拟
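The sgmllib module was removed in Python 3, so the SGMLParser program above only runs on Python 2. As a rough sketch of the same flag-based technique on Python 3, html.parser.HTMLParser can be used instead; it routes every tag through handle_starttag and handle_endtag rather than per-tag start_h4/end_h4 methods. The class name H4Collector and the sample HTML string are invented for this illustration.

```python
from html.parser import HTMLParser


class H4Collector(HTMLParser):
    """Collect the text content of every <h4> element."""

    def __init__(self):
        super().__init__()
        self.in_h4 = False   # True while the parser is inside an <h4>
        self.names = []      # one string per <h4> element

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (attribute, value) tuples
        if tag == "h4":
            self.in_h4 = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "h4":
            self.in_h4 = False

    def handle_data(self, data):
        # text may arrive in several chunks, so append to the current entry
        if self.in_h4:
            self.names[-1] += data


sample = '<div><h4 class="title">Shoes</h4><p>skip me</p><h4>Digital</h4></div>'
parser = H4Collector()
parser.feed(sample)
parser.close()
print(parser.names)
```

Unlike the original, this version concatenates text chunks per heading, because the parser is allowed to deliver the text of one element in more than one handle_data call.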

pyQuery

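pyQuery is a Python implementation of jQuery: it lets you query and manipulate HTML documents with jQuery syntax, which is very convenient. It must be installed before use; easy_install pyquery is enough.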

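Here is an example:

#coding=gbk
from pyquery import PyQuery as pyq

doc = pyq(url=r'http://list.taobao.com/browse/cat-0.htm')
cts = doc('.market-cat')    # every element with class "market-cat"
for i in cts:
    text = pyq(i).find('h4').text()
    # turn the mis-decoded characters back into raw bytes, then decode them as GB18030
    text = text.encode('raw_unicode_escape').decode('GB18030', 'ignore')
    print '====', text, '===='
    print '\n'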
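The encode('raw_unicode_escape').decode('GB18030', 'ignore') chain can look cryptic. One plausible reading, and it is only an assumption about why the author needed it, is that the page's GBK bytes had been decoded one byte per character (as if they were Latin-1), so the text has to be turned back into the original bytes and decoded again with the right codec. The Python 3 sketch below reproduces that situation with an invented sample string.

```python
# Bytes as a GBK/GB18030-encoded page would deliver them.
original = "女装男装"
raw = original.encode("gb18030")

# A wrong charset guess decodes each byte as one Latin-1 character.
garbled = raw.decode("latin-1")

# raw_unicode_escape maps code points below 256 back to single bytes,
# which recovers the original byte string; GB18030 then decodes it properly.
repaired = garbled.encode("raw_unicode_escape").decode("gb18030", "ignore")

print(repaired == original)
```

When the source encoding is known up front, decoding the raw bytes directly with "gb18030" avoids the round trip altogether.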

BeautifulSoup

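One headache is that most web pages do not fully follow the standards, and the assorted baffling errors make you want to track down whoever wrote the page and give them a good thrashing. To cope with this, we can parse the HTML with the well-known BeautifulSoup, which is very tolerant of malformed markup.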

#coding=gbk
import urllib
import BeautifulSoup

url = 'http://list.taobao.com/browse/cat-0.htm'
webfile = urllib.urlopen(url)
webcontext = webfile.read()
# use the HTML-oriented class; BeautifulStoneSoup is intended for XML
soup = BeautifulSoup.BeautifulSoup(webcontext)
htmldata = soup.findAll("h4")
for h4content in htmldata:
    h4content = h4content.string
    print h4content.encode('raw_unicode_escape').decode('GB18030', 'ignore')
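The import BeautifulSoup form above belongs to the old BeautifulSoup 3 release on Python 2. A comparable sketch for the current bs4 package on Python 3 follows; it assumes bs4 is installed (pip install beautifulsoup4), and the sample markup with its unclosed tags is invented to stand in for a sloppy real-world page.

```python
from bs4 import BeautifulSoup

# GB18030-encoded bytes with an unclosed <p> and unclosed <div> elements.
sample = (
    '<div class="market-cat"><h4>女装男装</h4><p>unclosed paragraph'
    '<div class="market-cat"><h4>数码</h4>'
).encode("gb18030")

# from_encoding tells bs4 how to decode the bytes instead of letting it guess;
# "html.parser" selects the parser bundled with the standard library.
soup = BeautifulSoup(sample, "html.parser", from_encoding="gb18030")

names = [h4.get_text(strip=True) for h4 in soup.find_all("h4")]
for name in names:
    print(name)
```

Naming the encoding explicitly makes the raw_unicode_escape workaround unnecessary here, since the text comes out of the parser as ordinary str objects.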
