PyQuery Tutorial: Basic HTML Parsing with PyQuery

As Python is my programming language of choice when it comes to getting things done quickly, I need a dead simple XML parser that gets me the data I want and gets the hell out of the way.

Enter PyQuery

PyQuery, as you may have guessed, is a Python port of the extremely popular jQuery JavaScript library. Anyone even remotely experienced with jQuery knows how easy it is to select any element you wish from the DOM. Once you move away from JavaScript, many XML parsers become extremely verbose. PyQuery helps us keep things simple and extract the data we want without wasting any time.

Using PyQuery for Basic Parsing

PyQuery includes many of the jQuery DOM manipulation methods. For this tutorial, we'll just deal with retrieving data from HTML. Once you can read the HTML to a string via PyQuery, you can instantly apply your knowledge of jQuery and append(), remove(), or whatever you need.

The Setup

This guide uses Python 2.6. If you don't have virtualenv, grab it now. We'll use it in a minute to install the PyQuery package. Now create a project directory for this tutorial. I'll call it pyquery_tutorial:
$ mkdir pyquery_tutorial
$ cd pyquery_tutorial
Now create the virtual environment with the Python version of your choice (I have only tested this with 2.6 and 2.7):
$ virtualenv env --python=python2.6
Running virtualenv with interpreter /usr/bin/python2.6
New python executable in env/bin/python2.6
Also creating executable in env/bin/python
Installing distribute.................................................................................................................................................................................done.
Now activate the virtualenv. (You should see (env) beside your prompt if done correctly)
$ . env/bin/activate
Now we install the PyQuery package.
(env) $ pip install pyquery
...
Successfully installed lxml pyquery
Cleaning up...
Woohoo, PyQuery is now ready for use!

Using PyQuery

Using PyQuery for parsing will feel extremely similar to using jQuery. One of the only differences is how the jQuery object is initialized. First, create this HTML file called "index.html" in the project directory.
<!DOCTYPE html>
<html>
  <head>
    <title>PyQuery Test!</title>
  </head>

<body>
  <h1>PyQuery is AWESOME!</h1>
  <p><a href="http://pypi.python.org/pypi/pyquery">PyQuery</a> is a Python port of the famous <a href="http://jquery.com">jQuery</a> JavaScript library.</p>
  <h2>What is it Good For?</h2>
  <ul id="pitch">
    <li>It makes parsing files a <strong>SNAP</strong>!</li>
    <li>DOM Manipulation is EASY!</li>
    <li>You <em>never</em> have to worry about confusing syntax</li>
  </ul>
</body>
</html>
Now fire up Python. (Make sure your virtualenv is still activated!)
$ python
Python 2.6.6 (r266:84292, Mar 25 2011, 19:36:32) 
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 
First we import PyQuery from the pyquery package.
>>> from pyquery import PyQuery
Now let's read in our index.html file and store it to a string.
>>> html = open("index.html", 'r').read()
Now we instantiate a PyQuery object, passing in our html string. To keep things looking familiar, let the instantiating object be named jQuery!
>>> jQuery = PyQuery(html)
Now we can traverse this document using the selectors we've grown to love through CSS and jQuery. It might look strange that we're assigning jQuery to something, but at this point, we use this jQuery variable JUST like we use $ in our JavaScript. For example, let's get the title tag.
>>> jQuery("title").text()
'PyQuery Test!'
jQuery developers who have written their own plugins may already be comfortable using jQuery in place of $ in their JS. Let's mess around with PyQuery some more.
>>> jQuery("li").eq(1).text()
'DOM Manipulation is EASY!'
>>> jQuery("a") # The 'jQuery Object' we're used to is now a list
[<a>, <a>]
>>> for x in jQuery("a"): # We can do for-loops as normal in Python
...     print jQuery(x).text()
...
PyQuery
jQuery
Get the HTML of the first li element.
>>> jQuery("ul").children().eq(0).html()
u'It makes parsing files a <strong>SNAP</strong>!'

Remote Files

Wanna parse a remote file? No problem!
>>> jQuery = PyQuery(url="http://www.vertstudios.com/")
>>> jQuery("title").text()
"Web Design that Doesn't Suck | Vert Studios | Tyler, Texas"

Conclusion

That should give you a nice kickstart with PyQuery. Your knowledge of jQuery, coupled with the PyQuery API, is enough to parse most XML/HTML documents.

September 20, 2011

Ref: http://vertstudios.com/blog/pyquery-tutorial-basic-html-parsing-pyquery/
