How to Efficiently Remove Punctuation from a String

Recently I found myself spending many hours trying to make sense of messy text data, and decided to review some of the preprocessing involved. There are many different ways to accomplish even a simple cleaning step. Today, I will review several methods for removing punctuation from a string and compare their performance.

Using Translate

The string translate method is a convenient way to replace or delete several characters in a single pass. translate needs a table that works like a dictionary, mapping each character to its replacement. str.maketrans builds that table for you.

The call looks like str.maketrans('abcd', '0123', 'xyz'). It creates a table that tells translate to replace every a with 0, b with 1, c with 2, and d with 3, and to remove x, y, and z. The first two arguments must be the same length.
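
As a minimal sketch of that call, the snippet below builds the table and applies it to a made-up input (the `sample` string is hypothetical, not from the original post). It also peeks inside the table to show that it is an ordinary dict keyed by Unicode code points.

```python
# Build the table from the example above: a->0, b->1, c->2, d->3; drop x, y, z.
table = str.maketrans('abcd', '0123', 'xyz')

# The table is a plain dict keyed by code points.
# Replaced characters map to the code point of the replacement,
# and deleted characters map to None.
print(table[ord('a')])   # 48, the code point of '0'
print(table[ord('x')])   # None

sample = 'abcd-xyz-efg'  # hypothetical input for illustration
print(sample.translate(table))   # 0123--efg
```

Characters that do not appear in the table, such as the hyphens and e, f, g here, pass through unchanged.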

The full code to remove punctuation and digits using translate is below.

# importing a string of punctuation and digits to remove
import string

exclist = string.punctuation + string.digits

# remove punctuation and digits from oldtext
table_ = str.maketrans('', '', exclist)
newtext = oldtext.translate(table_)

This approach removes every character that appears in string.punctuation or string.digits: the 32 ASCII punctuation characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ and the digits 0-9.
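
To see the effect on real input, here is a small runnable version of the snippet above. The `oldtext` value is a hypothetical sample chosen for illustration.

```python
import string

exclist = string.punctuation + string.digits
table_ = str.maketrans('', '', exclist)

oldtext = "well-known fact: 3 cats, 2 dogs!"   # hypothetical sample text
newtext = oldtext.translate(table_)

print(repr(newtext))   # 'wellknown fact  cats  dogs'
```

Two side effects are visible in the output: "well-known" collapses into one word because the hyphen is deleted outright, and double spaces appear where a digit used to sit between two spaces.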

Using Translate + Join

But sometimes we want to put a space in place of these special characters instead of removing them entirely. We can do so by building a table that maps each special character to a space instead of deleting it.

table_ = str.maketrans(exclist, ' '*len(exclist))

Additionally, we can simply split and join to make sure this operation does not result in multiple spaces between words.

newtext = ' '.join(oldtext.translate(table_).split())
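
Putting the two steps together on a hypothetical sample string gives a rough picture of what each stage does:

```python
import string

exclist = string.punctuation + string.digits
# Map every excluded character to a space (both arguments have equal length).
table_ = str.maketrans(exclist, ' ' * len(exclist))

oldtext = "well-known fact: 3 cats, 2 dogs!"   # hypothetical sample text

spaced = oldtext.translate(table_)     # same length as oldtext, with runs of spaces
newtext = ' '.join(spaced.split())     # collapse the runs and trim both ends

print(repr(newtext))   # 'well known fact cats dogs'
```

Calling split() with no arguments splits on any run of whitespace and discards leading and trailing whitespace, which is why a single join is enough to normalize the spacing.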

Using Join + String

We can also use join instead of translate, with the same exclusion list we built from the string module above.

# using exclist from above
newtext = ''.join(x for x in oldtext if x not in exclist)
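
A runnable sketch of this variant follows. The sample text is hypothetical, and converting the exclusion string to a set (`excset`) is an optional tweak that is not part of the original post; it only changes how membership is looked up, not the result.

```python
import string

exclist = string.punctuation + string.digits
excset = set(exclist)   # assumption: optional, not in the original post

oldtext = "well-known fact: 3 cats, 2 dogs!"   # hypothetical sample text

# Keep every character that is not in the exclusion list.
newtext = ''.join(x for x in oldtext if x not in exclist)
# Same filter, with set membership instead of a substring scan.
newtext_set = ''.join(x for x in oldtext if x not in excset)

print(repr(newtext))   # 'wellknown fact  cats  dogs'
```

The output matches what translate produces with a deletion table, so the two approaches are interchangeable in terms of results.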

Using Join + isalpha

We can forgo the exclusion list and use the string method isalpha to keep only the letters.

newtext = ''.join(x for x in oldtext if x.isalpha())

This approach keeps only alphabetic characters. As a result, it also removes the spaces between words.
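
The snippet below illustrates that behavior on a hypothetical sample, along with one possible way to keep the spaces. The `isspace` variant is an addition for illustration, not something the original post proposes.

```python
oldtext = "well-known fact: 3 cats, 2 dogs!"   # hypothetical sample text

# Letters only: the spaces disappear along with punctuation and digits.
letters_only = ''.join(x for x in oldtext if x.isalpha())
print(repr(letters_only))   # 'wellknownfactcatsdogs'

# Assumption: one way to preserve word boundaries is to also keep whitespace.
with_spaces = ''.join(x for x in oldtext if x.isalpha() or x.isspace())
print(repr(with_spaces))    # 'wellknown fact  cats  dogs'

# str.isalpha is Unicode-aware, so accented letters are kept too.
print(''.join(x for x in "café!" if x.isalpha()))   # café
```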

Using Join + Filter

Instead of the generator expression, we can do the same thing with the built-in filter. This is slightly more efficient than the comprehension and produces the same output.

newtext = ''.join(filter(str.isalpha, oldtext))
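
As a quick check, this sketch runs the filter version on a hypothetical sample and confirms it matches the comprehension. The `keep` predicate is a hypothetical helper showing that filter accepts any one-argument function, not just str.isalpha.

```python
oldtext = "well-known fact: 3 cats, 2 dogs!"   # hypothetical sample text

# filter yields the characters for which str.isalpha returns True.
newtext = ''.join(filter(str.isalpha, oldtext))
print(repr(newtext))   # 'wellknownfactcatsdogs'


def keep(ch):
    """Hypothetical predicate: keep letters and whitespace."""
    return ch.isalpha() or ch.isspace()


# Combine with split/join to normalize the leftover spacing.
spaced = ' '.join(''.join(filter(keep, oldtext)).split())
print(repr(spaced))    # 'wellknown fact cats dogs'
```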

Using Replace

Another way to remove punctuation (or any selected characters) is to iterate through the special characters and remove them one at a time with the replace method.

# using exclist from above
newtext = oldtext
for s in exclist:
    newtext = newtext.replace(s, '')
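
Wrapped in a function, the loop might look like the sketch below. The name `strip_chars` and the sample text are hypothetical.

```python
import string

exclist = string.punctuation + string.digits


def strip_chars(text, chars):
    """Remove every character in chars from text, one replace call per character."""
    for s in chars:
        text = text.replace(s, '')
    return text


oldtext = "well-known fact: 3 cats, 2 dogs!"   # hypothetical sample text
newtext = strip_chars(oldtext, exclist)
print(repr(newtext))   # 'wellknown fact  cats  dogs'
```

Each replace call scans the whole string and builds a new one, so this version makes one pass over the text per excluded character (42 passes with this exclusion list).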

Using Regex

There are many ways to accomplish something similar with regex, depending on the exact goal. One way is to replace every character that is not a letter with a space.

import re
newtext = re.sub(r'[^A-Za-z]+', ' ', oldtext)

[^A-Za-z]+ matches a run of one or more (+) consecutive characters that are not (^) in the set inside the square brackets, that is, not an upper case letter (A-Z) or a lower case letter (a-z). re.sub then replaces each such run in the old text with a single space.
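
The role of the + quantifier is easiest to see by comparing the pattern with and without it. This is a small sketch on a hypothetical sample string.

```python
import re

oldtext = "cats, dogs & 2 birds"   # hypothetical sample text

# Without +, every non-letter becomes its own space.
without_plus = re.sub(r'[^A-Za-z]', ' ', oldtext)
print(repr(without_plus))   # 'cats  dogs     birds'

# With +, each run of non-letters collapses into one space.
with_plus = re.sub(r'[^A-Za-z]+', ' ', oldtext)
print(repr(with_plus))      # 'cats dogs birds'
```

If the text starts or ends with a non-letter, the result will have a leading or trailing space, which strip() can remove.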

Another method is to match all non-word characters with the metacharacter \W. Word characters include letters, digits, and the underscore (_), so \W does not match digits or underscores, and they stay in the text.

newtext = re.sub(r'\W+', ' ', oldtext)
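
The difference between the two patterns shows up on text that contains digits and underscores. A sketch with a hypothetical sample:

```python
import re

oldtext = "user_name42 says: hello, world!"   # hypothetical sample text

# [^A-Za-z]+ treats digits and underscores as characters to replace.
letters = re.sub(r'[^A-Za-z]+', ' ', oldtext)
print(repr(letters))   # 'user name says hello world '

# \W+ leaves digits and underscores alone because they are word characters.
words = re.sub(r'\W+', ' ', oldtext)
print(repr(words))     # 'user_name42 says hello world '

# Both leave a trailing space where the final '!' was; strip() removes it.
print(repr(words.strip()))   # 'user_name42 says hello world'
```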

Performance

We reviewed a handful of methods, but which one is the best? I used the timeit module to measure how long each method takes to process a string of approximately 1 KB 10,000 times.
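
The original benchmark script is not included in the post. The following is only a sketch of how such a measurement could be set up with timeit, under stated assumptions: the sample text, the `methods` dictionary, and the `benchmark` helper are all hypothetical, and the example call uses fewer repetitions than the 10,000 mentioned above so that it finishes quickly.

```python
import re
import string
import timeit

exclist = string.punctuation + string.digits
delete_table = str.maketrans('', '', exclist)


def replace_loop(text):
    """Remove excluded characters one replace call at a time."""
    for s in exclist:
        text = text.replace(s, '')
    return text


# Hypothetical registry of the methods reviewed in this post.
methods = {
    'translate': lambda t: t.translate(delete_table),
    'join + exclist': lambda t: ''.join(x for x in t if x not in exclist),
    'join + isalpha': lambda t: ''.join(x for x in t if x.isalpha()),
    'join + filter': lambda t: ''.join(filter(str.isalpha, t)),
    'replace': replace_loop,
    'regex': lambda t: re.sub(r'[^A-Za-z]+', ' ', t),
}


def benchmark(text, number=10_000):
    """Return total seconds taken by each method over `number` runs."""
    return {
        name: timeit.timeit(lambda: fn(text), number=number)
        for name, fn in methods.items()
    }


# Hypothetical sample: roughly 1 KB of text with mixed punctuation and digits.
sample = ("Hello, world! It's 2021... " * 40)[:1024]

results = benchmark(sample, number=1_000)
for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
    print(f'{name:<16} {seconds:.4f} s')
```

Absolute timings depend on the machine, the Python version, and the makeup of the text, so the ordering is more informative than the raw numbers. The methods also do not all produce identical output, so this compares speed only.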

The test shows that translate takes much less time than the other methods! On the other hand, join with a comprehension seems to be the least efficient way to remove selected characters. Translate is the fastest and most versatile option of all those reviewed today.

If you have any other method, please leave a comment and I will add the test result to the post!
