Writing a Crawler in Python: Fetching Web Pages and Parsing HTML (Revised)

weixin_33737774

Original link: http://blog.51cto.com/jaysonzhang/1214705

Reposted from: http://www.lovelucy.info/python-crawl-pages.html

1. Fetching the HTML page

In fact, the most basic page fetch takes just two lines (Python 2):

import urllib2
content = urllib2.urlopen('http://XXXX').read()
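
For readers on Python 3, where urllib2 no longer exists, the same fetch can be sketched with urllib.request. The helper name fetch and the 10-second timeout are assumptions chosen for illustration, not part of the original article; the demo line uses a data: URL so that it runs without network access.

```python
import urllib.request


def fetch(url, timeout=10):
    """Return the raw bytes at url, or None if the request fails.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:  # URLError, HTTPError and socket timeouts are OSError subclasses
        print("fetch failed:", exc)
        return None


# urllib also understands data: URLs, which makes an offline demo possible.
print(fetch("data:text/plain,hello"))  # b'hello'
```

Returning None on failure keeps the calling code short; a real crawler would probably want to log the error and retry instead.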

2. Parsing the HTML

SGMLParser

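Python ships with parsers such as HTMLParser and SGMLParser. I found the former far too awkward to use, so I wrote a sample program with SGMLParser. (Note that sgmllib is a Python 2 module; it was removed in Python 3.)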

import urllib2
from sgmllib import SGMLParser


class ListName(SGMLParser):
    """Collect the text found inside every <h4> element."""

    def __init__(self):
        SGMLParser.__init__(self)
        self.is_h4 = False  # True while the parser is inside an <h4> tag
        self.name = []      # text chunks collected from <h4> tags

    def start_h4(self, attrs):
        self.is_h4 = True

    def end_h4(self):
        self.is_h4 = False

    def handle_data(self, text):
        if self.is_h4:
            self.name.append(text)


content = urllib2.urlopen('http://list.taobao.com/browse/cat-0.htm').read()
listname = ListName()
listname.feed(content)
# The page is GBK-encoded; re-encode each name as UTF-8 before printing.
for item in listname.name:
    print item.decode('gbk').encode('utf8')

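It's simple: this defines a class called ListName that inherits from SGMLParser. The variable is_h4 serves as a flag that marks whether the parser is currently inside an h4 tag; while it is, the text inside the tag is appended to the list variable name. As for start_h4() and end_h4(), they follow the handler prototypes defined by SGMLParser: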

start_tagname(self, attrs)
end_tagname(self)

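tagname is the name of the tag: for example, when the parser encounters <pre> it calls start_pre, and when it encounters </pre> it calls end_pre. attrs holds the tag's attributes, passed in as a list of the form [(attribute, value), (attribute, value), ...].

Output: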

女装男装
鞋类箱包
内衣配饰
运动户外
珠宝手表
数码
家电办公
护肤彩妆
母婴用品
家居建材
美食特产
日用百货
汽车摩托
文化玩乐
本地生活
虚拟
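
Since sgmllib is not available in Python 3, the same flag-based idea can be sketched with the standard library's html.parser instead. This is a swapped-in parser, not the one used in the article: it has a single handle_starttag/handle_endtag pair rather than per-tag start_tagname/end_tagname methods, but attrs arrives in the same [(attribute, value), ...] form. The class name H4Collector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class H4Collector(HTMLParser):
    """Collect the text and attributes of each <h4> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_h4 = False    # True while the parser is inside an <h4> tag
        self.names = []       # text found inside <h4> tags
        self.h4_attrs = []    # attrs list of every <h4> start tag

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) tuples
        if tag == "h4":
            self.in_h4 = True
            self.h4_attrs.append(attrs)

    def handle_endtag(self, tag):
        if tag == "h4":
            self.in_h4 = False

    def handle_data(self, data):
        if self.in_h4 and data.strip():
            self.names.append(data.strip())


# Made-up sample markup standing in for the downloaded page.
sample = "<div><h4 class='cat'>Shoes</h4><p>skip</p><h4>Digital</h4></div>"
collector = H4Collector()
collector.feed(sample)
collector.close()
print(collector.names)     # ['Shoes', 'Digital']
print(collector.h4_attrs)  # [[('class', 'cat')], []]
```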

pyQuery

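pyQuery brings jQuery to Python: it lets you query and manipulate HTML documents with jQuery syntax, which is very convenient. It has to be installed before use; easy_install pyquery (or pip install pyquery) is enough.

An example: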

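#coding=gbk
from pyquery import PyQuery as pyq

# Fetch the page and select every element with the class "market-cat".
doc = pyq(url=r'http://list.taobao.com/browse/cat-0.htm')
cts = doc('.market-cat')
for i in cts:
    text = pyq(i).find('h4').text()
    # Encoding workaround: turn the text back into raw bytes, then decode them as GB18030.
    text = text.encode('raw_unicode_escape').decode('GB18030', 'ignore')
    print '====', text, '===='
    print '\n'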
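
The encode('raw_unicode_escape').decode('GB18030', 'ignore') step in the example above is easy to misread, so here is a small Python 3 sketch of what it appears to do. It assumes the parser decoded the page's GBK bytes as Latin-1 (the article does not say which encoding was actually guessed), and it uses one of the category names from the earlier output as sample text.

```python
# Sample text taken from the output listed earlier in the article.
original = "女装男装"

# Bytes as a GBK-encoded page would serve them.
raw_bytes = original.encode("gbk")

# Assumption: the parser decoded those bytes as Latin-1, producing mojibake.
garbled = raw_bytes.decode("latin-1")

# raw_unicode_escape writes every code point below 256 back as a single byte,
# which restores the original byte sequence; GB18030 (a superset of GBK)
# then decodes those bytes correctly.
recovered = garbled.encode("raw_unicode_escape").decode("gb18030", "ignore")

print(recovered == original)
```

If the parser had decoded the page correctly in the first place, this round trip would not be needed, and applying it to already-correct Chinese text would produce literal \uXXXX escapes instead.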

BeautifulSoup

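One headache is that most web pages do not fully follow the standards, and the assorted baffling errors make you want to track down whoever wrote the page and give them a beating. To cope with this, we can parse the HTML with the well-known BeautifulSoup, which is very tolerant of malformed markup.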

#coding=gbk
import urllib
import BeautifulSoup  # BeautifulSoup 3 (Python 2)

url = 'http://list.taobao.com/browse/cat-0.htm'
webfile = urllib.urlopen(url)
webcontext = webfile.read()
soup = BeautifulSoup.BeautifulStoneSoup(webcontext)
htmldata = soup.findAll("h4")
for h4content in htmldata:
    h4content = h4content.string
    if h4content is None:  # .string is None unless the tag holds a single text node
        continue
    # Same encoding workaround as in the pyQuery example.
    print h4content.encode('raw_unicode_escape').decode('GB18030', 'ignore')
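
To illustrate the tolerance for malformed markup, here is a minimal Python 3 sketch using the current bs4 package (pip install beautifulsoup4) rather than the BeautifulSoup 3 module imported above. The broken markup is invented for the example, and the choice of the built-in "html.parser" backend is an assumption made to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy, made-up markup: unclosed <li> tags and a stray </div>.
broken_html = "<ul><li><h4>Shoes</h4><li><h4>Digital</h4></ul></div>"

soup = BeautifulSoup(broken_html, "html.parser")

# get_text() also works when an <h4> contains nested tags, where .string would be None.
names = [h4.get_text(strip=True) for h4 in soup.find_all("h4")]
print(names)  # ['Shoes', 'Digital']
```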

Reposted from: https://blog.51cto.com/jaysonzhang/1214705