Commonly used Python packages and modules (continuously updated)

洛洛宝

Article link: https://blog.csdn.net/u010868523/article/details/105423067

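WHOIS protocol wrapper library (python-whois): look up who registered a domain

Install: pip install python-whois

import whois
# Replace example.com with the domain you want to look up
print(whois.whois("example.com"))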

Requests: an HTTP library that is more powerful than urllib

Install: pip install requests
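
As a minimal sketch of how Requests might be used, the snippet below builds a GET request with a query parameter. The URL and the parameter name are placeholders chosen for illustration and do not come from the original post. The request is only prepared, not sent, so the snippet runs without network access.

```python
import requests


def build_search_url(base_url, query):
    """Prepare (but do not send) a GET request and return its final URL."""
    request = requests.Request("GET", base_url, params={"q": query})
    prepared = request.prepare()
    return prepared.url


# example.com is a placeholder host used only for illustration.
print(build_search_url("https://example.com/search", "python"))
```

To actually send the request, requests.get(base_url, params={"q": query}) returns a response object whose text and status_code attributes hold the body and the HTTP status.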

Beautiful Soup: parses web pages and provides a convenient interface for locating content

Install: pip install beautifulsoup4

# Parse malformed HTML
from bs4 import BeautifulSoup
from pprint import pprint
broken_html = '<ul class=test><li>Name<li>Area</ul>'
# html.parser is one of the available parsers (it ships with Python)
soup = BeautifulSoup(broken_html, 'html.parser')
fixed_html = soup.prettify()
pprint(fixed_html)

When html.parser does not repair the markup accurately, html5lib can be used instead

Install: pip install html5lib

# Parse malformed HTML
from bs4 import BeautifulSoup
from pprint import pprint
broken_html = '<ul class=test><li>Name<li>Area</ul>'
# Switch to html5lib
soup = BeautifulSoup(broken_html, 'html5lib')
fixed_html = soup.prettify()
pprint(fixed_html)

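Once the HTML has been parsed correctly, the find() and find_all() methods can be used to locate the elements we need

ul = soup.find('ul', attrs={'class': 'test'})
ul.find('li')  # returns the first matching element
ul.find_all('li')  # returns all matching elements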
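
The two lookup methods can be sketched end to end as follows. The sample markup is a hypothetical, well-formed version of the list used earlier, and list_item_texts is an illustrative helper name rather than part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed sample markup used only for illustration.
html = '<ul class="test"><li>Name</li><li>Area</li></ul>'
soup = BeautifulSoup(html, "html.parser")


def list_item_texts(soup, css_class):
    """Return the text of every <li> inside the <ul> with the given class."""
    ul = soup.find("ul", attrs={"class": css_class})
    if ul is None:
        # find() returns None when nothing matches, so guard before chaining.
        return []
    return [li.get_text(strip=True) for li in ul.find_all("li")]


first_item = soup.find("li")           # first match only
print(first_item.get_text())
print(list_item_texts(soup, "test"))   # every match, as plain text
```

Because find() returns None when there is no match, checking the result before calling further methods on it avoids an AttributeError on pages that lack the expected element.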

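lxml: built on the libxml2 parsing library, which is written in C; it parses faster than Beautiful Soup

Windows installation instructions (link)

Tip: pip install lxml usually works; if it fails with an error, follow the Windows installation instructions linked above

Install the cssselect library to use CSS selectors

Install: pip install cssselect

from lxml.html import fromstring
# html is assumed to hold the page source that was downloaded earlier
tree = fromstring(html)
td = tree.cssselect('tr#id > td.classname')[0]
area = td.text_content()
print(area)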

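When CSS selectors do not work properly (for example, when the HTML is badly incomplete or contains malformed elements), XPath selectors can be used instead.

from lxml.html import fromstring
# html is assumed to hold the page source that was downloaded earlier
tree = fromstring(html)
area = tree.xpath('//tr[@id="id"]/td[@class="classname"]/text()')[0]
print(area)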
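
A self-contained sketch of the XPath approach is shown below. The sample page, the id and class values, and the helper name extract_cell are all made up for illustration. The id and class are passed as XPath variables so they do not have to be spliced into the expression string.

```python
from lxml.html import fromstring

# Hypothetical sample page; the id and class names are invented for this example.
SAMPLE_HTML = (
    "<html><body><table>"
    '<tr id="area_row">'
    '<td class="label">Area:</td>'
    '<td class="value">100 sq km</td>'
    "</tr>"
    "</table></body></html>"
)


def extract_cell(html, row_id, cell_class):
    """Return the text of the first matching cell, or None if nothing matches."""
    tree = fromstring(html)
    matches = tree.xpath(
        "//tr[@id=$row_id]/td[@class=$cell_class]/text()",
        row_id=row_id,
        cell_class=cell_class,
    )
    # xpath() returns a list; indexing [0] blindly would raise IndexError on a miss.
    return matches[0] if matches else None


print(extract_cell(SAMPLE_HTML, "area_row", "value"))
```

Returning None on a miss, instead of indexing the result list directly, keeps a scraper from crashing when a page lacks the expected row.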

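Pillow: image processing

Install: pip install Pillow

OCR: optical character recognition

Install: pip install pytesseract (the Tesseract OCR engine itself must also be installed separately)

# Used as-is, this only recognizes typical, clean text
import pytesseract
# get_captcha_img is a helper that is not shown in this post; it presumably
# returns a Pillow image extracted from the downloaded page (html.content)
img = get_captcha_img(html.content)
print(pytesseract.image_to_string(img))
# The image usually needs preprocessing first, such as conversion to grayscale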
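
The preprocessing step mentioned above might look like the following sketch, which uses only Pillow. The function name to_black_and_white and the threshold of 128 are assumptions made for illustration, and a tiny synthetic image stands in for a real CAPTCHA.

```python
from PIL import Image


def to_black_and_white(img, threshold=128):
    """Convert an image to grayscale, then force each pixel to pure black or white."""
    gray = img.convert("L")  # "L" is Pillow's 8-bit grayscale mode
    # point() maps every pixel value through the given function.
    return gray.point(lambda value: 255 if value >= threshold else 0)


# Synthetic 2x1 image: one dark pixel and one white pixel.
sample = Image.new("RGB", (2, 1), (255, 255, 255))
sample.putpixel((0, 0), (30, 30, 30))

cleaned = to_black_and_white(sample)
print(cleaned.mode, cleaned.getpixel((0, 0)), cleaned.getpixel((1, 0)))
```

The cleaned image can then be passed to pytesseract.image_to_string in place of the raw image. A suitable threshold depends on the images being processed, so 128 is only a starting point.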

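Books for further study:

1. 《用Python写网络爬虫（第二版）》 (Writing Web Crawlers with Python, 2nd edition), China Industry and Information Technology Publishing Group, Posts & Telecom Press

2. 《Python编程无师自通》 (Teach Yourself Python Programming), China Industry and Information Technology Publishing Group, Posts & Telecom Press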
