7 Python Libraries you should know about

转载备用,谢谢博主分享:http://doda.co/7-python-libraries-you-should-know-about

In my years of programming in Python and roaming around GitHub's Explore section, I've come across a few libraries that stood out to me as being particularly enjoyable to use. This blog post is an effort to further spread that knowledge.

I specifically excluded awesome libs like requestsSQLAlchemyFlaskfabric etc. because I think they're already pretty "main-stream". If you know what you're trying to do, it's almost guaranteed that you'll stumble over the aforementioned. This is a list of libraries that in my opinion should be better known, but aren't.

1. pyquery (with lxml)

pip install pyquery

For parsing HTML in Python, Beautiful Soup is oft recommended and it does a great job. It sports a good pythonic API and it's easy to find introductory guides on the web. All is good in parsing-land .. until you want to parse more than a dozen documents at a time and immediately run head-first into performance problems. It's - simply put - very, very slow.

Just how slow? Check out this chart from the excellent Python HTML Parser comparison Ian Bicking compiled in 2008:

http://blog.ianbicking.org/wp-content/uploads/images/parsing-results.png

What immediately stands out is how fast lxml is. Compared to Beautiful Soup, the lxml docs are pretty sparse and that's what originally kept me from adopting this mustang of a parsing library. lxml is pretty clunky to use. Yeah you can learn and use Xpath or cssselect to select specific elements out of the tree and it becomes kind of tolerable. But once you've selected the elements that you actually want to get, you have to navigate the labyrinth of attributes lxml exposes, some containing the bits you want to get at, but the vast majority just returning None. This becomes easier after a couple dozen uses but it remains unintuitive.

So either slow and easy to use or fast and hard to use, right?

Wrong!

Enter PyQuery

Oh PyQuery you beautiful seductress:

from pyquery import PyQuery
page = PyQuery(some_html)

last_red_anchor = page('#container > a.red:last')

Easy as pie. It's ever-beloved jQuery but in Python!

There are some gotchas, like for example that PyQuery, like jQuery, exposes its internals upon iteration, forcing you to re-wrap:

for paragraph in page('#container > p'):
    paragraph = PyQuery(paragraph)
    text = paragraph.text()

That's a wart the PyQuery creators ported over from jQuery (where they'd fix it if it didn't break compatability). Understandable but still unfortunate for such a great library.

2. dateutil

pip install dateutil

Handling dates is a pain. Thank god dateutil exists. I won't even go near parsing dates without trying dateutil.parser first:

from dateutil.parser import parse

>>> parse('Mon, 11 Jul 2011 10:01:56 +0200 (CEST)')
datetime.datetime(2011, 7, 11, 10, 1, 56, tzinfo=tzlocal())

# fuzzy ignores unknown tokens

>>> s = """Today is 25 of September of 2003, exactly
...        at 10:49:41 with timezone -03:00."""
>>> parse(s, fuzzy=True)
datetime.datetime(2003, 9, 25, 10, 49, 41,
                  tzinfo=tzoffset(None, -10800))

Another thing that dateutil does for you, that would be a total pain to do manually, is recurrence:

>>> list(rrule(DAILY, count=3, byweekday=(TU,TH),
...            dtstart=datetime(2007,1,1)))
[datetime.datetime(2007, 1, 2, 0, 0),
 datetime.datetime(2007, 1, 4, 0, 0),
 datetime.datetime(2007, 1, 9, 0, 0)]

3. fuzzywuzzy

pip install fuzzywuzzy

fuzzywuzzy allows you to do fuzzy comparison on wuzzes strings. This has a whole host of use cases and is especially nice when you have to deal with human-generated data.

Consider the following code that uses the Levenshtein distance comparing some user inputto an array of possible choices.

from Levenshtein import distance

countries = ['Canada', 'Antarctica', 'Togo', ...]

def choose_least_distant(element, choices):
    'Return the one element of choices that is most similar to element'
    return min(choices, key=lambda s: distance(element, s))

user_input = 'canaderp'
choose_least_distant(user_input, countries)
>>> 'Canada'

This is all nice and dandy but we can do better. The ocean of 3rd party libs in Python is so vast, that in most cases we can just import something and be on our way:

from fuzzywuzzy import process

process.extractOne("canaderp", countries)
>>> ("Canada", 97)

More has been written about fuzzywuzzy here.

4. watchdog

pip install watchdog

watchdog is a Python API and shell utilities to monitor file system events. This means you can watch some directory and define a "push-based" system. Watchdog supports all kinds of problems. A solid piece of engineering that does it much better than the 5 or so libraries I tried before finding out about it.

5. sh

pip install sh

sh allows you to call any program as if it were a function:

from sh import git, ls, wc

# checkout master branch
git(checkout="master")

# print(the contents of this directory
print(ls("-l"))

# get the longest line of this file
longest_line = wc(__file__, "-L")

6. pattern

pip install pattern

This behemoth of a library advertises itself quite modestly:

Pattern is a web mining module for the Python programming language.

... that does Data MiningNatural Language ProcessingMachine Learning and Network Analysis all in one. I myself yet have to play with it but a friend's verdict was very positive.

7. path.py

pip install path.py

When I first learned Python os.path was my least favorite part of the stdlib.

Even something as simple as creating a list of files in a directory turned out to be grating:

import os

some_dir = '/some_dir'
files = []

for f in os.listdir(some_dir):
    files.append(os.path.joinpath(some_dir, f))

That listdir is in os and not os.path is unfortunate and unexpected and one would really hope for more from such a prominent module. And then all this manual fiddling for what really should be as simple as possible.

But with the power of path, handling file paths becomes fun again:

from path import path

some_dir = path('/some_dir')

files = some_dir.files()

Done!

Other goodies include:

>>> path('/').owner
'root'

>>> path('a/b/c').splitall()
[path(''), 'a', 'b', 'c']

# overriding __div__
>>> path('a') / 'b' / 'c'
path('a/b/c')

>>> path('ab/c').relpathto('ab/d/f')
path('../d/f')

Best part of it all? path subclasses Python's str so you can use it completely guilt-free without constantly being forced to cast it to str and worrying about libraries that checkisinstance(s, basestring) (or even worse isinstance(s, str)).


That's it! I hope I was able to introduce you to some libraries you didn't know before.


评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值