Python libraries for web scraping

philips
2011.10.07 8:21
  • Python (http://www.python.org/) is a simple yet powerful programming language. FMiner (http://www.fminer.com/) is written in Python and uses PySide (http://www.pyside.org/) for its core scraping features. Besides PySide, Python has many libraries for web scraping (screen scraping); this article lists the most common ones for web extraction.



    Web scraping framework

    Scrapy: http://scrapy.org/

    Scrapy is a fast, high-level web crawling and web scraping (screen scraping) framework, used to crawl websites and to parse and extract structured data from their pages. It can be used for a wide range of purposes, such as data mining, automated testing and site monitoring.



    Page downloading libraries

    urllib: http://docs.python.org/library/urllib.html

    urllib2: http://docs.python.org/library/urllib2.html

    Both are part of Python's standard library and handle the general work of downloading web pages.
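
    As a rough illustration of downloading a page with the standard library, here is a minimal sketch. It assumes Python 3, where the functionality of urllib and urllib2 lives in urllib.request. The helper name fetch, the User-Agent string and the default timeout are illustrative choices, not anything taken from FMiner.

```python
import urllib.request


def fetch(url, timeout=10):
    """Download a URL and return its body decoded to text.

    `fetch`, the User-Agent value and the timeout are hypothetical
    choices made for this example.
    """
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-scraper/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

    Calling fetch("http://www.python.org/") should return the page's HTML as a string; network errors surface as urllib.error.URLError.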



    PycURL: http://pycurl.sourceforge.net/

    PycURL is a Python interface to libcurl. It can be used to fetch objects identified by a URL from a Python program, much like the urllib module. Because its core, libcurl, is written in C, PycURL is very fast and supports a wide range of features.



    mechanize: http://wwwsearch.sourceforge.net/mechanize/

    mechanize provides stateful programmatic web browsing in Python. It can simulate a web browser, but it does not use a real browser core and cannot execute JavaScript.



    twill: http://twill.idyll.org/

    Twill is a simple language that allows users to browse through the web from a command-line interface. With twill, you can navigate through Web sites that use forms, cookies, and most standard web features. Twill supports automated web testing and has a simple Python interface.



    Page parser

    BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/

    Beautiful Soup is a Python HTML/XML parser designed for quick-turnaround projects like screen scraping. It is very easy to use for small Python web scraping projects. Its element selection works much like jQuery's.



    lxml: http://lxml.de/

    The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. lxml.html can parse an HTML page into a DOM tree, whose nodes can then be selected with XPath. Early versions of FMiner used it as a core module, but it was replaced with PySide in order to handle pages that contain JavaScript.



    re: http://docs.python.org/library/re.html

    re is the regular expression module in Python's standard library. You can use regular expressions to extract page content, but writing a reliable expression for real-world pages can be quite complex.
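
    To give a feel for regex-based extraction, the sketch below pulls the href values out of anchor tags. The pattern LINK_RE and the helper extract_links are hypothetical names for this example, and the pattern only covers quoted attributes in simple markup; it is not a general HTML parser.

```python
import re

# Matches the quoted href value of an <a> tag (single or double quotes).
# This is a simplified pattern for illustration; real pages often need more care.
LINK_RE = re.compile(
    r'<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return the href values of all anchor tags found in `html`."""
    return [match.group(2) for match in LINK_RE.finditer(html)]


sample = (
    '<p><a href="http://scrapy.org/">Scrapy</a> and '
    "<A class=\"x\" HREF='/docs'>docs</A></p>"
)
links = extract_links(sample)
```

    Even this small pattern has to account for quoting style, attribute order and letter case, which is part of why a dedicated parser is usually easier for anything beyond simple pages.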



    Browser core

    PyQt: http://www.riverbankcomputing.co.uk/software/pyqt/intro

    PyQt is a set of Python bindings for Nokia's Qt application framework. It has been under development for a long time and is very mature. It includes the WebKit package, which can browse web pages and be used for web extraction. It is available under the GNU GPL (v2 and v3) and a commercial license.



    PySide: http://www.pyside.org/

    The PySide project provides LGPL-licensed Python bindings for the Qt cross-platform application and UI framework. It also includes the WebKit package, and its LGPL license is the reason FMiner chose it.



    Pamie: http://pamie.sourceforge.net/

    Pamie stands for Python Automated Module For I.E.

    Pamie's main use is testing web sites by automating the Internet Explorer client with the Pamie scripting language. It uses the IE COM interface as its core and is aimed mainly at web testing. To use it for screen scraping, you need to do extra work to extract the page's content, and some JavaScript code is needed.

Posted on 2012-03-11 17:39 by lexus

Reposted from: https://www.cnblogs.com/lexus/archive/2012/03/11/2390374.html
