python + lxml: scraping web pages with XPath instead of regular expressions

Latest recommended article published on 2024-07-17 21:04:45

xscool

Tags: python lxml xpath web scraping

My first beginner Python program:
python + lxml to scrape a web page, using XPath instead of regular expressions


# -*- coding:gb2312 -*-
import urllib
import hashlib
import os
class Spider:
    '''crawler html'''
    def get_html(self,url):
        sock = urllib.urlopen(url)
        htmlSource = sock.read()
        sock.close()
        return htmlSource
    def cache_html(self,filename,htmlSource):
        f = open(filename,'w')
        f.write(htmlSource)
        f.close()
    def analysis_html(self,htmlSource):
        #from lxml import etree
        import lxml.html.soupparser as soupparser
        dom = soupparser.fromstring(htmlSource)
        #doc = dom.parse(dom)
        r = dom.xpath(".//*[@id='lh']/a[2]")
        print len(r)
        print r[0].tag
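        '''
        Printing the Chinese text directly with print r[0].text raises an error,
        so encode('gb2312') is used, and the file encoding is declared at the top of the file.
        Reference: http://blogold.chinaunix.net/u2/60332/showart_2109290.html
        '''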
        print r[0].text.encode('gb2312')
        print 'done'
    def get_cache_html(self,filename):
        if not os.path.isfile(filename):
            return ''
        f = open(filename,'r')
        content = f.read()
        f.close()
        return content
if __name__ == '__main__':
    spider = Spider()
    url = 'http://www.baidu.com'
    md5_str = hashlib.md5(url).hexdigest()
    filename = "html-"+md5_str+".html"
    htmlSource = spider.get_cache_html(filename)
    if not htmlSource:
        htmlSource = spider.get_html(url)
        spider.cache_html(filename,htmlSource)
    spider.analysis_html(htmlSource)

Program flow:
Fetch the page: get_html
Save the page: cache_html
Analyze the page: analysis_html
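
The analysis step can be tried on its own without any network access. The following is a minimal Python 3 sketch, not the original program: the HTML snippet and the function name second_link_text are made up for illustration, and it assumes lxml is installed. It uses lxml.html.fromstring rather than soupparser, which additionally needs BeautifulSoup.

```python
import lxml.html

# Made-up HTML standing in for a downloaded page (an assumption,
# not the real markup of any site).
SAMPLE_HTML = """
<html><body>
  <p id="lh">
    <a href="/first">First link</a>
    <a href="/about">About Baidu</a>
  </p>
</body></html>
"""


def second_link_text(html_source):
    """Return the text of the second <a> directly under the element
    with id='lh', or None when the XPath matches nothing."""
    dom = lxml.html.fromstring(html_source)
    # Same XPath expression as in analysis_html.
    matches = dom.xpath(".//*[@id='lh']/a[2]")
    if not matches:
        return None
    return matches[0].text


print(second_link_text(SAMPLE_HTML))
```

In Python 3 the text attribute is already a Unicode str, so the encode('gb2312') workaround is normally not needed before printing. Checking for an empty result also avoids the IndexError that r[0] would raise when the page layout changes.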

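Helper method: get_cache_html. Once a page has been fetched, it is saved as a local file, so the next run reads the HTML from that file instead of fetching it over the network again.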
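
The caching idea can be sketched separately from the crawler. This is a Python 3 illustration using only the standard library. The names cache_filename and load_or_fetch and the fetch callback parameter are hypothetical, introduced here so the logic can be exercised with a stub instead of a real download.

```python
import hashlib
import os
import tempfile


def cache_filename(url, cache_dir="."):
    """Build the cache path the same way as the program: html-<md5 of url>.html."""
    md5_str = hashlib.md5(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, "html-" + md5_str + ".html")


def load_or_fetch(url, fetch, cache_dir="."):
    """Return the cached bytes for url, calling fetch(url) only on a cache miss."""
    filename = cache_filename(url, cache_dir)
    if os.path.isfile(filename):
        with open(filename, "rb") as f:
            return f.read()
    html_source = fetch(url)
    # Binary mode keeps the downloaded bytes unchanged on every platform.
    with open(filename, "wb") as f:
        f.write(html_source)
    return html_source


if __name__ == "__main__":
    calls = []

    def fake_fetch(url):
        # Stub downloader (an assumption for the demo); records each call.
        calls.append(url)
        return b"<html>stub</html>"

    with tempfile.TemporaryDirectory() as tmp:
        first = load_or_fetch("http://www.baidu.com", fake_fetch, tmp)
        second = load_or_fetch("http://www.baidu.com", fake_fetch, tmp)
        print(first == second, len(calls))  # prints: True 1
```

Passing the downloader in as a parameter is a design choice for this sketch only; the original program calls get_html directly.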

XPath analysis tool: FirePath, a Firefox add-on

[img]http://dl.iteye.com/upload/attachment/553501/98d34f25-76fa-3319-a1c9-15e1a4e341cd.jpg[/img]

lxml learning reference: http://lxml.de/index.html
