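Basic approach: use lxml together with BeautifulSoup to parse HTML and extract URLs.
According to the "Python HTML parser performance" benchmark, lxml parses faster, while BeautifulSoup is somewhat more tolerant of malformed markup.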
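To make "tolerant of malformed markup" concrete, here is a minimal sketch that uses only the Python 3 standard library, not lxml or BeautifulSoup. It feeds the same broken snippet to a strict XML parser and to the lenient `html.parser`. The sample markup and the helper names (`TagCollector`, `strict_parse_ok`, `lenient_tags`) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: <br> and <a> are never closed, which is common in real pages.
BROKEN_HTML = '<p>unclosed paragraph<br><a href="/next">next</p>'


class TagCollector(HTMLParser):
    """Records every start tag the lenient parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def strict_parse_ok(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def lenient_tags(markup):
    """Return the start tags found by the tolerant HTML parser."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


print(strict_parse_ok(BROKEN_HTML))  # False: the strict parser rejects the snippet
print(lenient_tags(BROKEN_HTML))     # ['p', 'br', 'a']
```

A strict parser gives up at the first mismatched tag, so a crawler built on it would lose the whole page. A tolerant parser still reaches the link.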
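I downloaded lxml-2.3-py2.7-win32.egg. Installing it requires setuptools first; after that, run setuptools.exe xxx.egg to install lxml (with a standard setuptools install, the equivalent command is easy_install xxx.egg).
lxml wraps BeautifulSoup through lxml.html.soupparser, but BeautifulSoup itself has to be installed separately. Download it and unpack the archive; it ships with its own setup.py, so run python.exe setup.py install.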
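Before moving on, it can help to confirm that both packages are visible to the interpreter. The following is a small sketch written for Python 3 (the setup described here is Python 2.7); `is_installed` is a hypothetical helper, and the module names checked are assumptions about which BeautifulSoup release is in use.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "BeautifulSoup" is the module name used by the BeautifulSoup 3 series;
# "bs4" is the package name of the later BeautifulSoup 4 series.
for name in ("lxml", "BeautifulSoup", "bs4"):
    status = "found" if is_installed(name) else "missing"
    print("%-14s %s" % (name, status))
```

If lxml is reported as missing, the egg was probably installed into a different Python installation than the one running the script.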
With both installed, we can start parsing a file:
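# -*- coding: utf-8 -*-
# Parse a local HTML file with lxml's BeautifulSoup-backed parser
# and print the href of every <a> element.
# The parenthesized print calls run on both Python 2 and Python 3.
import lxml.html.soupparser as soupparser

print("hello html parser !")
html = r'i:\temp\test.html'  # path of the HTML file to parse
dom = soupparser.parse(html)  # parse() takes a file name or a file object
# dom = soupparser.fromstring(html)  # use this when html holds the markup itself
count = 0
for ele in dom.iter():  # walk every element in document order
    if ele.tag == 'a':
        count += 1
        print(ele.attrib.get('href'))  # prints None for <a> tags without href
print("parse finished ! find url = %d" % count)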
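For comparison, the same link extraction can be sketched with only the Python 3 standard library. This is not the lxml + BeautifulSoup approach used above, just a dependency-free illustration of the idea. The class `LinkExtractor`, the function `extract_links`, the sample markup, and the example.com URLs are all assumptions made for the example. Unlike the script above, it skips `<a>` tags that have no `href` and resolves relative links against a base URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors without a usable href
            self.links.append(urljoin(self.base_url, href))


def extract_links(markup, base_url=""):
    """Return all link targets found in the markup, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(markup)
    parser.close()
    return parser.links


# Hypothetical sample page with one relative link, one anchor without href,
# and one absolute link.
sample = ('<a href="/a.html">A</a>'
          '<a name="top">no href</a>'
          '<a href="http://example.org/b">B</a>')
links = extract_links(sample, "http://example.com/dir/index.html")
for link in links:
    print(link)
print("links found = %d" % len(links))
```

Resolving against a base URL matters for a crawler, because most pages link to their own site with relative paths that cannot be fetched as they stand.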