Python Garbage Classification Source Code: Python Text Processing Tutorial (1)

Latest recommended article published on 2024-05-18 06:42:57

weixin_39662594


Article tags: python garbage classification source code, python regular expression text processing

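Introduction to Text Processing

Text processing is applied directly in natural language processing, also known as NLP. NLP deals with the language humans speak or write when communicating with each other. This differs from communication between a computer and a person, where the communication is either a computer program written by a person or some gesture by a person, such as clicking the mouse at a certain position. NLP tries to understand the natural language humans use, classify it, and analyze it where necessary. Python has a rich set of libraries that serve the needs of NLP. The Natural Language Toolkit (NLTK) is one such suite of libraries, providing the functionality NLP requires.

Below are some applications that use NLP and, indirectly through Python, NLTK.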

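Summarization

Often we need a summary of a news article, a movie plot, or a long story. All of these are written in human language, and without NLP we would have to rely on another person to summarize and interpret them. With the help of NLP, we can write programs that use NLTK to summarize long text according to various parameters, such as the percentage of the text wanted in the final output or the choice of positive and negative words for the summary. Online news feeds rely on such summarization techniques to present news insights.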

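Voice-Based Tools

Voice-based tools such as Apple's Siri or Amazon's Alexa rely on NLP to understand their interactions with humans. They are trained on large datasets of words, sentences, and grammar to interpret questions or commands from people and process them. Although the input is speech, it is first converted to text, and the text produced from the speech is passed through an NLP system to produce a result.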

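Information Extraction

Web scraping is a common example of using Python code to extract data from web pages. It may not be strictly NLP-based, but it does involve text processing. For example, if we only need the headings present in an HTML page, we look for the h1 tags in the page structure and find a way to extract only the text between those tags. This requires a text processing program written in Python.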
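The heading extraction described above can be sketched with the standard library's html.parser module. This is a minimal Python 3 sketch under stated assumptions: the class name H1Collector and the sample page string are hypothetical and are not part of the original tutorial, and a real scraper would usually fetch the page and use a dedicated parsing library.

```python
from html.parser import HTMLParser


class H1Collector(HTMLParser):
    """Collects the text found between <h1> and </h1> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
            self.titles.append("")      # start a new, empty title

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.titles[-1] += data     # text may arrive in several chunks


# Hypothetical sample page used only for this sketch.
page = ("<html><body><h1>First title</h1><p>Some text</p>"
        "<h1>Second <em>title</em></h1></body></html>")

collector = H1Collector()
collector.feed(page)
titles = [title.strip() for title in collector.titles]
print(titles)
```

Text nested inside other tags within the h1 element, such as the em element above, is still collected, because only the h1 start and end tags toggle the in_h1 flag.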

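Spam Filtering

Spam can be identified and removed from email by analyzing the text of the subject line as well as the content of the message. Because spam is usually sent in bulk to many recipients, messages can be matched and flagged as spam even when their subjects and content vary slightly. NLTK can be used for this as well.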

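Language Translation

Computerized language translation relies heavily on NLP. As more and more languages are used on online platforms, automatically translating from one language to another has become essential. This involves programming that handles the vocabulary, grammar, and context markers of the languages involved in the translation. Again, NLTK can be used to address these requirements.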

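Sentiment Analysis

To find the overall reaction to a movie's performance, we might have to read thousands of feedback posts from the audience. This too can be automated by classifying feedback as positive or negative through word and sentence analysis, and then measuring the frequency of positive and negative comments to find the audience's overall sentiment. This clearly requires analyzing the human language the audience wrote, and NLTK can be used to process such text.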

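Python Text Processing Development Environment

To create and run the example code in this tutorial, we need a Python development environment that includes both general-purpose Python and the specialized packages the examples rely on. We first look at installing general-purpose Python, either Python 2 or Python 3. The examples in this tutorial are mostly written in Python 2 syntax (for instance the print statement and the file() built-in), a choice originally justified by Python 2's maturity and broader support among external packages. Python 2 reached end of life in January 2020, so on Python 3 the examples need small changes such as print(...) and open(...).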

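Getting Python

The most recent source code, binaries, documentation, news, and so on are available on the official Python website - https://www.python.org/.

Python documentation can also be downloaded from https://www.python.org/doc/. The documentation is available in HTML, PDF, and PostScript formats.

Installing Python

Python distributions are available for a wide variety of platforms. Just download the binary applicable to your platform and install Python.

If a binary for your platform is not available, you need a C compiler to compile the source code manually. Compiling the source code offers more flexibility in choosing the features to install.

For installing and configuring the Python development environment, see:

https://m.yiibai.com/python/python_environment.html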

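Installing the NLTK Package

NLTK integrates easily into a Python environment. Use the following commands to add NLTK to your Python environment.

# Linux / macOS
sudo pip install -U nltk

# Windows
pip install -U nltk

Other libraries needed by a Python program can be added in a similar way; they are covered in detail in later articles when they are used.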

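String Immutability

In Python, the string data type is immutable, which means a string value cannot be updated in place. We can verify this by trying to update part of a string, which raises an error.

# A character of a string cannot be reassigned
t = "Yiibai"
print type(t)
t[0] = "M"

When we run the above program, we get the following error:

t[0] = "M"
TypeError: 'str' object does not support item assignment

We can look at this further by checking the memory addresses of the characters at each position in a string.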

x = 'banana'

for idx in range (0,5):
    print x[idx], "=", id(x[idx])

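When we run the above program, we get the following output. As you can see, both occurrences of a point to the same memory location, and both occurrences of n point to the same location as well. The actual addresses differ from run to run; the repeated ids reflect CPython reusing single-character string objects, which is safe because strings cannot be modified.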

b = 91909376
a = 91836864
n = 91259888
a = 91836864
n = 91259888
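Since a string cannot be changed in place, an update in practice means building a new string from pieces of the old one. The following is a minimal Python 3 sketch of that idea (the tutorial's own examples use Python 2 syntax); the names error_message and updated are illustrative and not part of the original example.

```python
# Python 3 sketch: strings are immutable, so build a new one instead.
t = "Yiibai"

try:
    t[0] = "M"              # item assignment is not supported for str
except TypeError as exc:
    error_message = str(exc)

# Create a new string from the replacement character and a slice of the old one.
updated = "M" + t[1:]

print(error_message)
print(updated)
print(t)                    # the original string is unchanged
```

The slice t[1:] copies everything after the first character, so the original object bound to t is never modified.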

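Sorting Lines

Often we need to sort the contents of a file for analysis. For example, we may want the sentences written by different students arranged alphabetically by name. This involves sorting not just by the first character of each line but by all characters from the left. In the programs below, we first read the lines from a file and print them, and then sort them with the list's sort method, which is built into Python.

Printing the File

# Read all lines of the file into a list and print them one by one
FileName = ("D:/path/poem.txt")
data = file(FileName).readlines()
for i in range(len(data)):
    print data[i]

When we run the above program, we get the following output: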

Summer is here.

Sky is bright.

Birds are gone.

Nests are empty.

Where is Rain?

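Sorting the Lines in the File

Now we apply the sort method before printing the file contents. The lines are compared character by character from the left, so they come out ordered by their leading letters.

# Read the lines, sort them in place, then print them
FileName = ("D:/path/poem.txt")
data = file(FileName).readlines()
data.sort()
for i in range(len(data)):
    print data[i]

When we run the above program, we get the following output: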

Birds are gone.

Nests are empty.

Sky is bright.

Summer is here.

Where is Rain?
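The same sorting step can be sketched in Python 3 without a file on disk. This is a hedged illustration: the in-memory list stands in for the lines of poem.txt, and the second list (mixed) is a hypothetical example added to show that the default comparison places uppercase letters before lowercase ones unless a key function is supplied.

```python
# Lines as readlines() would return them, including trailing newlines.
lines = [
    "Summer is here.\n",
    "Sky is bright.\n",
    "Birds are gone.\n",
    "Nests are empty.\n",
    "Where is Rain?\n",
]

# sorted() returns a new list and leaves the original untouched.
alphabetical = sorted(lines)

# Hypothetical mixed-case data: default ordering is by code point,
# so "Banana" sorts before "apple" unless the case is folded.
mixed = ["apple\n", "Banana\n", "cherry\n"]
default_order = sorted(mixed)
folded_order = sorted(mixed, key=str.lower)

for line in alphabetical:
    print(line, end="")     # the lines already end with a newline
```

Using end="" avoids the blank lines that appear in the output above, where each printed line carries its own newline in addition to the one added by print.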

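Reformatting Paragraphs

When we deal with a large amount of text and want to present it in a readable format, we need to format paragraphs. We may want to print each line with a specific width, or adjust the indentation of each line when printing a poem. In this chapter, we use the textwrap3 module to format paragraphs as needed.

First, install the required package as follows:

pip install textwrap3

Wrapping to a Fixed Width

In this example, we specify a width of 30 characters for each line of the paragraph by passing that value as the width argument of the wrap function.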

from textwrap3 import wrap

text = 'In late summer 1945, guests are gathered for the wedding reception of Don Vito Corleones daughter Connie (Talia Shire) and Carlo Rizzi (Gianni Russo). Vito (Marlon Brando), the head of the Corleone Mafia family, is known to friends and associates as Godfather. He and Tom Hagen (Robert Duvall), the Corleone family lawyer, are hearing requests for favors because, according to Italian tradition, no Sicilian can refuse a request on his daughters wedding day.'

x = wrap(text, 30)
for i in range(len(x)):
    print(x[i])

When we run the above program, we get the following output:

In late summer 1945, guests
are gathered for the wedding
reception of Don Vito
Corleones daughter Connie
(Talia Shire) and Carlo Rizzi
(Gianni Russo). Vito (Marlon
Brando), the head of the
Corleone Mafia family, is
known to friends and
associates as Godfather. He
and Tom Hagen (Robert Duvall),
the Corleone family lawyer,
are hearing requests for
favors because, according to
Italian tradition, no Sicilian
can refuse a request on his
daughters wedding day.
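On Python 3 the standard library's textwrap module provides the same wrap function, so no extra package is needed there. The sketch below is a minimal illustration using a shortened version of the sample text; the indentation values passed to fill are arbitrary choices made for this example.

```python
import textwrap

text = ("In late summer 1945, guests are gathered for the wedding "
        "reception of Don Vito Corleones daughter Connie.")

# wrap() returns a list of lines, none longer than the given width.
lines = textwrap.wrap(text, width=30)

# fill() returns one string with newlines; the indent arguments count
# toward the width, so the lines still fit within 30 characters.
block = textwrap.fill(text, width=30,
                      initial_indent="    ", subsequent_indent="  ")

for line in lines:
    print(line)
print(block)
```

wrap is convenient when the lines are processed further, while fill is a shortcut for printing the wrapped paragraph directly.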

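Variable Indentation

In this example, each line of a poem starts with a different amount of indentation, and we remove it so that every line starts at the left margin.

import textwrap3

FileName = ("pathpoem.txt")

print("**Before Formatting**")
print(" ")

# Print the lines as stored in the file, without their trailing newlines
data = file(FileName).readlines()
for i in range(len(data)):
    print data[i].rstrip("\n")

print(" ")
print("**After Formatting**")
print(" ")

# Remove the leading whitespace of each line before printing it
for i in range(len(data)):
    dedented_text = textwrap3.dedent(data[i]).strip()
    print dedented_text

When we run the above program, we get the following output: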

**Before Formatting**

 Summer is here.
  Sky is bright.
    Birds are gone.
     Nests are empty.
      Where is Rain?

**After Formatting**

Summer is here.
Sky is bright.
Birds are gone.
Nests are empty.
Where is Rain?
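It may help to see how dedent behaves on a whole block rather than on one line at a time. The following Python 3 sketch uses the standard library's textwrap module and a shortened three-line poem; the variable names are illustrative. On a multi-line string, dedent removes only the whitespace common to all lines, which is why the program above calls it line by line (and strips each line) to flatten the indentation completely.

```python
import textwrap

poem = "  Summer is here.\n    Sky is bright.\n      Birds are gone.\n"

# dedent() on the whole block removes only the common two-space prefix.
common_removed = textwrap.dedent(poem)

# Stripping each line separately removes all leading whitespace.
flush_left = [line.strip() for line in poem.splitlines()]

# indent() adds a uniform prefix to every non-blank line.
indented = textwrap.indent("\n".join(flush_left), "    ")

print(common_removed)
print(indented)
```

textwrap.indent is the counterpart to dedent and is available in Python 3.3 and later.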

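Counting Tokens in a Paragraph

Tokens (sometimes also called marks) are the units a text is split into, typically words and punctuation. When reading text from a source, we sometimes need statistics about the words used. This makes it necessary to count the number of words, as well as the number of lines containing a particular kind of word, in a given text. In the examples below, we show programs that count the words in a paragraph using two different approaches. The sample text contains the summary of a Hollywood movie.

Reading the File

FileName = ("PathGodFather.txt")

with open(FileName, 'r') as file:
    lines_in_file = file.read()
    print lines_in_file

When we run the above program, we get the following output: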

Vito Corleone is the aging don (head) of the Corleone Mafia Family. His youngest son Michael has returned from WWII just in time to see the wedding of Connie Corleone (Michael's sister) to Carlo Rizzi. All of Michael's family is involved with the Mafia, but Michael just wants to live a normal life. Drug dealer Virgil Sollozzo is looking for Mafia families to offer him protection in exchange for a profit of the drug money. He approaches Don Corleone about it, but, much against the advice of the Don's lawyer Tom Hagen, the Don is morally against the use of drugs, and turns down the offer. This does not please Sollozzo, who has the Don shot down by some of his hit men. The Don barely survives, which leads his son Michael to begin a violent mob war against Sollozzo and tears the Corleone family apart.

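Counting Words with nltk

Next, we use the nltk module to count the words in the text. Note that (head) is counted as 3 tokens rather than 1 word, because the tokenizer treats each parenthesis as a separate token. word_tokenize relies on NLTK's Punkt tokenizer data, which may need to be downloaded once (for example with nltk.download('punkt')) before the code runs.

See the following code:

import nltk

FileName = ("PathGodFather.txt")

with open(FileName, 'r') as file:
    lines_in_file = file.read()

    # Split the text into word and punctuation tokens
    nltk_tokens = nltk.word_tokenize(lines_in_file)
    print nltk_tokens
    print "\n"
    print "Number of Words: " , len(nltk_tokens)

When we run the above program, we get the following output: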

['Vito', 'Corleone', 'is', 'the', 'aging', 'don', '(', 'head', ')', 'of', 'the', 'Corleone', 'Mafia', 'Family', '.', 'His', 'youngest', 'son', 'Michael', 'has', 'returned', 'from', 'WWII', 'just', 'in', 'time', 'to', 'see', 'the', 'wedding', 'of', 'Connie', 'Corleone', '(', 'Michael', "'s", 'sister', ')', 'to', 'Carlo', 'Rizzi', '.', 'All', 'of', 'Michael', "'s", 'family', 'is', 'involved', 'with', 'the', 'Mafia', ',', 'but', 'Michael', 'just', 'wants', 'to', 'live', 'a', 'normal', 'life', '.', 'Drug', 'dealer', 'Virgil', 'Sollozzo', 'is', 'looking', 'for', 'Mafia', 'families', 'to', 'offer', 'him', 'protection', 'in', 'exchange', 'for', 'a', 'profit', 'of', 'the', 'drug', 'money', '.', 'He', 'approaches', 'Don', 'Corleone', 'about', 'it', ',', 'but', ',', 'much', 'against', 'the', 'advice', 'of', 'the', 'Don', "'s", 'lawyer', 'Tom', 'Hagen', ',', 'the', 'Don', 'is', 'morally', 'against', 'the', 'use', 'of', 'drugs', ',', 'and', 'turns', 'down', 'the', 'offer', '.', 'This', 'does', 'not', 'please', 'Sollozzo', ',', 'who', 'has', 'the', 'Don', 'shot', 'down', 'by', 'some', 'of', 'his', 'hit', 'men', '.', 'The', 'Don', 'barely', 'survives', ',', 'which', 'leads', 'his', 'son', 'Michael', 'to', 'begin', 'a', 'violent', 'mob', 'war', 'against', 'Sollozzo', 'and', 'tears', 'the', 'Corleone', 'family', 'apart', '.']

Number of Words:  167

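Counting Words with the split Function

Next, we count the words with the split function. Here (head) is counted as a single word rather than 3 tokens as with nltk, because split only breaks the text on whitespace, so punctuation stays attached to the neighboring word.

FileName = ("PathGodFather.txt")

with open(FileName, 'r') as file:
    lines_in_file = file.read()

    # split() with no arguments breaks the text on any run of whitespace
    print lines_in_file.split()
    print "\n"
    print  "Number of Words: ", len(lines_in_file.split())

When we run the above program, we get the following output: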

['Vito', 'Corleone', 'is', 'the', 'aging', 'don', '(head)', 'of', 'the', 'Corleone', 'Mafia', 'Family.', 'His', 'youngest', 'son', 'Michael', 'has', 'returned', 'from', 'WWII', 'just', 'in', 'time', 'to', 'see', 'the', 'wedding', 'of', 'Connie', 'Corleone', "(Michael's", 'sister)', 'to', 'Carlo', 'Rizzi.', 'All', 'of', "Michael's", 'family', 'is', 'involved', 'with', 'the', 'Mafia,', 'but', 'Michael', 'just', 'wants', 'to', 'live', 'a', 'normal', 'life.', 'Drug', 'dealer', 'Virgil', 'Sollozzo', 'is', 'looking', 'for', 'Mafia', 'families', 'to', 'offer', 'him', 'protection', 'in', 'exchange', 'for', 'a', 'profit', 'of', 'the', 'drug', 'money.', 'He', 'approaches', 'Don', 'Corleone', 'about', 'it,', 'but,', 'much', 'against', 'the', 'advice', 'of', 'the', "Don's", 'lawyer', 'Tom', 'Hagen,', 'the', 'Don', 'is', 'morally', 'against', 'the', 'use', 'of', 'drugs,', 'and', 'turns', 'down', 'the', 'offer.', 'This', 'does', 'not', 'please', 'Sollozzo,', 'who', 'has', 'the', 'Don', 'shot', 'down', 'by', 'some', 'of', 'his', 'hit', 'men.', 'The', 'Don', 'barely', 'survives,', 'which', 'leads', 'his', 'son', 'Michael', 'to', 'begin', 'a', 'violent', 'mob', 'war', 'against', 'Sollozzo', 'and', 'tears', 'the', 'Corleone', 'family', 'apart.']

Number of Words:  146
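The difference between the two counts comes down to how punctuation is treated. The Python 3 sketch below illustrates this on the first sentence of the sample text, using a simple regular expression as a rough stand-in for a punctuation-aware tokenizer. It is an assumption-laden approximation and not NLTK's algorithm; for instance, it would split Michael's differently from word_tokenize.

```python
import re

sentence = "Vito Corleone is the aging don (head) of the Corleone Mafia Family."

# Whitespace splitting keeps punctuation attached: "(head)" and "Family."
whitespace_tokens = sentence.split()

# Rough punctuation-aware tokenization: runs of word characters,
# or any single character that is neither a word character nor whitespace.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(len(whitespace_tokens), whitespace_tokens)
print(len(regex_tokens), regex_tokens)
```

Which count is appropriate depends on the analysis: whitespace splitting is enough for a rough word count, while separate punctuation tokens matter for most NLP processing.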

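Converting Binary to ASCII

Conversion from ASCII to binary and from binary to ASCII is handled by the built-in binascii module. Its usage is straightforward: it takes the input data and converts it. The program below shows the binascii module and its functions b2a_uu and a2b_uu. uu stands for "UNIX-to-UNIX encoding" (uuencoding), which converts binary data to a line of ASCII characters and back.

See the following code:

import binascii

text = "Simply Easy Learning"

# Converting binary to ascii
data_b2a = binascii.b2a_uu(text)
print "**Binary to Ascii** \n"
print data_b2a

# Converting back from ascii to binary
data_a2b = binascii.a2b_uu(data_b2a)
print "**Ascii to Binary** \n"
print data_a2b

When we run the above program, we get output similar to the following: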

**Binary to Ascii** 

44VEM<&QY($5A<WD@3&5A<FYI;F< 

**Ascii to Binary** 

Simply Easy Learning
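On Python 3, b2a_uu accepts bytes rather than str, so the text has to be encoded first. The following is a minimal Python 3 sketch of the same round trip; the choice of the ASCII codec is an assumption that holds for this sample text.

```python
import binascii

text = "Simply Easy Learning"

# b2a_uu works on bytes and returns one uuencoded line (at most 45 input bytes).
encoded = binascii.b2a_uu(text.encode("ascii"))

# a2b_uu reverses the encoding; decode the bytes to get the text back.
decoded = binascii.a2b_uu(encoded).decode("ascii")

print(encoded)
print(decoded)
```

The first character of the encoded line carries the length of the original data, which is how a2b_uu knows where the payload ends.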

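Strings as Files

When a file is read with readlines, its contents come back as a list with multiple elements. We can therefore access each line of the file by its index. In the example below, a file contains several lines, and each line becomes one element of the list.

with open("PathGodFather.txt", "r") as BigFile:
    data = BigFile.readlines()

# Print each line together with its index
for i in range(len(data)):
    print "Line No- ", i
    print data[i]

When we run the above example code, we get a result similar to the following: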

Line No-  0
Vito Corleone is the aging don (head) of the Corleone Mafia Family. 

Line No-  1
His youngest son Michael has returned from WWII just in time to see the wedding of Connie Corleone (Michael's sister) to Carlo Rizzi. 

Line No-  2
All of Michael's family is involved with the Mafia, but Michael just wants to live a normal life. Drug dealer Virgil Sollozzo is looking for Mafia families to offer him protection in exchange for a profit of the drug money. 

Line No-  3
He approaches Don Corleone about it, but, much against the advice of the Don's lawyer Tom Hagen, the Don is morally against the use of drugs, and turns down the offer.

Line No-  4
This does not please Sollozzo, who has the Don shot down by some of his hit men. 

Line No-  5
The Don barely survives, which leads his son Michael to begin a violent mob war against Sollozzo and tears the Corleone family apart.
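In Python 3, enumerate is the idiomatic way to pair each line with its index. The sketch below is self-contained: it writes a small temporary file (a hypothetical stand-in for the tutorial's text file) and then reads it back as a list of lines.

```python
import os
import tempfile

sample = "First line.\nSecond line.\nThird line.\n"

# Create a throwaway file so the sketch can run anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    with open(path, "r", encoding="utf-8") as big_file:
        data = big_file.readlines()     # one list element per line
finally:
    os.remove(path)

# Pair every line with its index and drop the trailing newline.
numbered = [(index, line.rstrip("\n")) for index, line in enumerate(data)]

for index, line in numbered:
    print("Line No-", index, line)
```

Because data is an ordinary list, a single line can also be accessed directly, for example data[1] for the second line.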

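File as a String

However, by reading the whole file with the read function and removing the newline characters, the entire file content can be obtained as a single string. The result contains no line breaks.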
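The step described above can be sketched in Python 3 as follows. This is an illustration under stated assumptions, not the tutorial's original listing: it uses a small temporary file instead of the tutorial's text file, and the variable names are hypothetical.

```python
import os
import tempfile

sample = "First sentence. \nSecond sentence.\nThird sentence.\n"

# Create a throwaway file so the sketch can run anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    with open(path, "r", encoding="utf-8") as big_file:
        # read() returns the whole file as one string;
        # replace() then removes every newline character.
        content = big_file.read().replace("\n", "")
finally:
    os.remove(path)

print(content)
```

Note that removing the newlines joins the text exactly as it is, so a line that does not end with a space runs straight into the next one.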

Reading the file this way gives a result similar to the following:

Vito Corleone is the aging don (head) of the Corleone Mafia Family. His youngest son Michael has returned from WWII just in time to see the wedding of Connie Corleone (Michael's sister) to Carlo Rizzi. All of Michael's family is involved with the Mafia, but Michael just wants to live a normal life. Drug dealer Virgil Sollozzo is looking for Mafia families to offer him protection in exchange for a profit of the drug money. He approaches Don Corleone about it, but, much against the advice of the Don's lawyer Tom Hagen, the Don is morally against the use of drugs, and turns down the offer.This does not please Sollozzo, who has the Don shot down by some of his hit men. The Don barely survives, which leads his son Michael to begin a violent mob war against Sollozzo and tears the Corleone family apart.

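Reading a File Backwards

When a file is read normally, its contents are read line by line from the beginning. In some cases, however, we want to read the last line first. For example, a file may hold its newest records at the bottom, and those need to be read first. To do this, install the required package with the following command.

pip install file-read-backwards

Before reading the file backwards, we first read its contents line by line in the normal order, so that the two results can be compared afterwards.

with open("PathGodFather.txt", "r") as BigFile:
    data = BigFile.readlines()

# Print each line together with its index
for i in range(len(data)):
    print "Line No- ", i
    print data[i]

When we run the above program, we get the following output: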

Line No-  0
Vito Corleone is the aging don (head) of the Corleone Mafia Family. 

Line No-  1
His youngest son Michael has returned from WWII just in time to see the wedding of Connie Corleone (Michael's sister) to Carlo Rizzi. 

Line No-  2
All of Michael's family is involved with the Mafia, but Michael just wants to live a normal life. Drug dealer Virgil Sollozzo is looking for Mafia families to offer him protection in exchange for a profit of the drug money. 

Line No-  3
He approaches Don Corleone about it, but, much against the advice of the Don's lawyer Tom Hagen, the Don is morally against the use of drugs, and turns down the offer.

Line No-  4
This does not please Sollozzo, who has the Don shot down by some of his hit men. 

Line No-  5
The Don barely survives, which leads his son Michael to begin a violent mob war against Sollozzo and tears the Corleone family apart.

Reading Lines Backwards

Now we read the file backwards, using the file-read-backwards module installed above.

from file_read_backwards import FileReadBackwards

with FileReadBackwards("PathGodFather.txt", encoding="utf-8") as BigFile:

    # Get the lines one by one, starting from the last line and moving up
    for line in BigFile:
        print line

When we run the above program, we get the following output:

The Don barely survives, which leads his son Michael to begin a violent mob war against Sollozzo and tears the Corleone family apart.
This does not please Sollozzo, who has the Don shot down by some of his hit men. 
He approaches Don Corleone about it, but, much against the advice of the Don's lawyer Tom Hagen, the Don is morally against the use of drugs, and turns down the offer.
All of Michael's family is involved with the Mafia, but Michael just wants to live a normal life. Drug dealer Virgil Sollozzo is looking for Mafia families to offer him protection in exchange for a profit of the drug money. 
His youngest son Michael has returned from WWII just in time to see the wedding of Connie Corleone (Michael's sister) to Carlo Rizzi. 
Vito Corleone is the aging don (head) of the Corleone Mafia Family.

Comparing this with the earlier output confirms that the lines were read in reverse order.
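For small files, a similar result can be obtained with the standard library alone by reading all lines and reversing the list. The Python 3 sketch below is an alternative technique, not what file-read-backwards does internally: it loads the whole file into memory, so it is only a reasonable choice when the file is small. The file name and record texts are hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical log-style file with the newest record at the bottom.
    path = Path(tmp_dir) / "records.txt"
    path.write_text("oldest record\nmiddle record\nnewest record\n",
                    encoding="utf-8")

    with path.open(encoding="utf-8") as records:
        lines = records.read().splitlines()   # lines without newlines

# reversed() walks the list from the last element to the first.
newest_first = list(reversed(lines))

for line in newest_first:
    print(line)
```

For large files, a library that reads from the end of the file avoids holding every line in memory at once.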

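Reading Words Backwards

We can also read the words of the file backwards. We first read the lines backwards, then split each line into word tokens with the nltk module and apply the list's reverse() method to the tokens. The example below prints the reversed tokens of each line of the same file.

import nltk
from file_read_backwards import FileReadBackwards

with FileReadBackwards("PathGodFather.txt", encoding="utf-8") as BigFile:

    # Get the lines one by one, starting from the last line and moving up,
    # then tokenize each line and reverse the order of its tokens
    for line in BigFile:
        word_data = line
        nltk_tokens = nltk.word_tokenize(word_data)
        nltk_tokens.reverse()
        print (nltk_tokens)

Executing the above example code gives the following result: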

['.', 'apart', 'family', 'Corleone', 'the', 'tears', 'and', 'Sollozzo', 'against', 'war', 'mob', 'violent', 'a', 'begin', 'to', 'Michael', 'son', 'his', 'leads', 'which', ',', 'survives', 'barely', 'Don', 'The']
['.', 'men', 'hit', 'his', 'of', 'some', 'by', 'down', 'shot', 'Don', 'the', 'has', 'who', ',', 'Sollozzo', 'please', 'not', 'does', 'This']
['.', 'offer', 'the', 'down', 'turns', 'and', ',', 'drugs', 'of', 'use', 'the', 'against', 'morally', 'is', 'Don', 'the', ',', 'Hagen', 'Tom', 'lawyer', "'s", 'Don', 'the', 'of', 'advice', 'the', 'against', 'much', ',', 'but', ',', 'it', 'about', 'Corleone', 'Don', 'approaches', 'He']
['.', 'money', 'drug', 'the', 'of', 'profit', 'a', 'for', 'exchange', 'in', 'protection', 'him', 'offer', 'to', 'families', 'Mafia', 'for', 'looking', 'is', 'Sollozzo', 'Virgil', 'dealer', 'Drug', '.', 'life', 'normal', 'a', 'live', 'to', 'wants', 'just', 'Michael', 'but', ',', 'Mafia', 'the', 'with', 'involved', 'is', 'family', "'s", 'Michael', 'of', 'All']
['.', 'Rizzi', 'Carlo', 'to', ')', 'sister', "'s", 'Michael', '(', 'Corleone', 'Connie', 'of', 'wedding', 'the', 'see', 'to', 'time', 'in', 'just', 'WWII', 'from', 'returned', 'has', 'Michael', 'son', 'youngest', 'His']
['.', 'Family', 'Mafia', 'Corleone', 'the', 'of', ')', 'head', '(', 'don', 'aging', 'the', 'is', 'Corleone', 'Vito']

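Filtering Duplicate Words

Often we need to analyze a text only in terms of the unique words it contains, so the duplicate words have to be removed from the text. This is done with the word tokenization available in nltk together with Python's set type.

Without Preserving Order

In the example below, we first tokenize the sentence into words and then apply the set() function, which creates an unordered collection of unique elements. The result is a list of unique words in no particular order.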

import nltk
word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour." 

# First Word tokenization
nltk_tokens = nltk.word_tokenize(word_data)

# Applying Set
no_order = list(set(nltk_tokens))

print no_order

Executing the above code gives a result like the following (the order of the words may differ between runs):

['blue', 'Rainbow', 'is', 'Sky', 'colour', 'ocean', 'also', 'a', '.', 'The', 'has', 'the']

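Preserving Order

To remove the duplicates while keeping the words in the order they appear in the sentence, we read the words one by one and append each word to a list only the first time it is seen.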

import nltk
word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour." 
# First Word tokenization
nltk_tokens = nltk.word_tokenize(word_data)

ordered_tokens = set()
result = []
for word in nltk_tokens:
    if word not in ordered_tokens:
        ordered_tokens.add(word)
        result.append(word)

print result

Executing the above code gives the following result:

['The', 'Sky', 'is', 'blue', 'also', 'the', 'ocean', 'Rainbow', 'has', 'a', 'colour', '.']
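On Python 3.7 and later, dictionaries keep their insertion order, which gives a compact way to remove duplicates while preserving order. The sketch below is a hedged alternative to the loop above; it uses a plain whitespace split instead of nltk so that it runs without extra packages, and the final period is left out of the sample sentence for that reason.

```python
sentence = ("The Sky is blue also the ocean is blue also "
            "Rainbow has a blue colour")

# Whitespace tokenization is enough for this punctuation-free sample.
words = sentence.split()

# dict.fromkeys() keeps the first occurrence of every key, in order.
unique_in_order = list(dict.fromkeys(words))

# A set also removes duplicates but does not promise any order.
unique_unordered = set(words)

print(unique_in_order)
```

The comparison is case-sensitive, so The and the are kept as two different words, just as in the examples above.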

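Extracting Email Addresses

To extract email addresses from text, we can use regular expressions. In the example below, we use the regular expression package re to define a pattern for an email address, then use the findall() function to retrieve the text that matches this pattern.

import re

# The parentheses let the string literal continue on the next line
text = ("Please contact us at contact@qq.com for further information." +
        " You can also give feedback at feedback@yiibai.com")

# Escape '.' and '-' so that they match literally
emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", text)
print emails

Executing the above example code gives the following result:

['contact@qq.com', 'feedback@yiibai.com']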
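A slightly more defensive variant can be sketched in Python 3 by compiling the pattern once and ignoring case. The pattern below is a simplified assumption for illustration and does not cover every valid address; the name EMAIL_PATTERN and the capitalized sample address are hypothetical.

```python
import re

# Hyphen placed last in each character class so it is taken literally.
EMAIL_PATTERN = re.compile(r"[a-z0-9._+-]+@[a-z0-9.-]+\.[a-z]+",
                           re.IGNORECASE)

text = ("Please contact us at contact@qq.com for further information. "
        "You can also give feedback at Feedback@yiibai.com.")

emails = EMAIL_PATTERN.findall(text)
print(emails)
```

Because the pattern must end with a dot followed by letters, the period that closes the sentence is not captured as part of the second address.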

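Extracting URLs

URL extraction from a text file is done with a regular expression, which picks up text wherever it matches the pattern. Only the re module is needed for this.

We can take an input file containing some URLs and process it with the following program to extract them. The findall() function finds all instances that match the regular expression.

Input Text File

The input file is shown below. It contains a couple of URLs.

Now a days you can learn almost anything by just visiting http://www.google.com. But if you are completely new to computers or internet then first you need to leanr those fundamentals. Next
you can visit a good e-learning site like - https://m.yiibai.com to learn further on a variety of subjects.

When we take the above input file and process it with the following program, we get the required output, namely the URLs extracted from the file.

import re

# Scan the file line by line and collect the URLs found on each line
with open("pathurl_example.txt") as file:
    for line in file:
        urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
        print(urls)

Executing the above example code gives the following result. Note that the first match includes the period that ends the sentence, because '.' is one of the characters the pattern accepts.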

['http://www.google.com.']
['https://m.yiibai.com']
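If the trailing sentence punctuation is unwanted, it can be stripped after matching. The Python 3 sketch below applies the same pattern to an in-memory string; the shortened sample text, the name URL_PATTERN, and the set of characters passed to rstrip are assumptions made for this illustration.

```python
import re

URL_PATTERN = re.compile(r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+")

text = ("Now a days you can learn almost anything by just visiting "
        "http://www.google.com. Next you can visit a good e-learning site "
        "like - https://m.yiibai.com to learn further.")

# With only non-capturing groups, findall() returns the full matches.
raw_urls = URL_PATTERN.findall(text)

# Remove punctuation that belongs to the sentence, not to the URL.
clean_urls = [url.rstrip(".,;:!?") for url in raw_urls]

print(raw_urls)
print(clean_urls)
```

Like the pattern in the program above, this one only covers the scheme and host part of a URL, so paths and query strings would need a broader expression.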

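Pretty-Printing Numbers

The Python module pprint gives proper printing formats to various data objects in Python. These objects can be of a dictionary data type, or even objects holding JSON data. In the example below, we see how the data is printed before and after applying the pprint module.

import pprint

student_dict = {'Name': 'Tusar', 'Class': 'XII', 
     'Address': {'FLAT ':1308, 'BLOCK ':'A', 'LANE ':2, 'CITY ': 'HYD'}}

print student_dict
print "\n"
print "***With Pretty Print***"
print "-----------------------"
# With a negative width nothing fits on one line, so each item gets its own line
pprint.pprint(student_dict,width=-1)

When we run the above program, we get the following output: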

{'Address': {'FLAT ': 1308, 'LANE ': 2, 'CITY ': 'HYD', 'BLOCK ': 'A'}, 'Name': 'Tusar', 'Class': 'XII'}


***With Pretty Print***
-----------------------
{'Address': {'BLOCK ': 'A',
             'CITY ': 'HYD',
             'FLAT ': 1308,
             'LANE ': 2},
 'Class': 'XII',
 'Name': 'Tusar'}

Handling JSON Data

pprint can also handle JSON data by formatting it into a more readable layout.

import pprint

emp = {"Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru" ],
   "Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5" ],   
   "StartDate":[ "1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
      "7/30/2013","6/17/2014"],
   "Dept":[ "IT","Operations","IT","HR","Finance","IT","Operations","Finance"] }

x= pprint.pformat(emp, indent=2)
print x

When we run the above program, we get the following output:

{ 'Dept': [ 'IT',
            'Operations',
            'IT',
            'HR',
            'Finance',
            'IT',
            'Operations',
            'Finance'],
  'Name': ['Rick', 'Dan', 'Michelle', 'Ryan', 'Gary', 'Nina', 'Simon', 'Guru'],
  'Salary': [ '623.3',
              '515.2',
              '611',
              '729',
              '843.25',
              '578',
              '632.8',
              '722.5'],
  'StartDate': [ '1/1/2012',
                 '9/23/2013',
                 '11/15/2014',
                 '5/11/2014',
                 '3/27/2015',
                 '5/21/2013',
                 '7/30/2013',
                 '6/17/2014']}
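pprint formats the Python representation of the data (single quotes, Python literals), which is not itself valid JSON. When actual JSON text is wanted, the standard library's json module can produce an indented layout. The Python 3 sketch below shows both on a shortened version of the data; the trimmed emp dictionary and the width value are choices made for this illustration.

```python
import json
import pprint

emp = {"Name": ["Rick", "Dan"], "Dept": ["IT", "Operations"]}

# json.dumps() produces valid JSON text with double quotes.
as_json = json.dumps(emp, indent=2, sort_keys=True)

# pformat() returns the pretty-printed Python repr as a string.
as_repr = pprint.pformat(emp, indent=2, width=30)

print(as_json)
print(as_repr)
```

The JSON text can be parsed back with json.loads, which makes it the better choice when the formatted output is meant to be read by other programs.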
