Getting Started with Natural Language Processing in Python

A significant portion of the data generated today is unstructured. Unstructured data includes social media comments, browsing history, and customer feedback. Have you ever had a pile of textual data to analyze and no idea how to proceed? Natural language processing in Python can help.

The objective of this tutorial is to enable you to analyze textual data in Python using concepts from Natural Language Processing (NLP). You will first learn how to tokenize your text into smaller chunks, then normalize words to their root forms, and finally remove any noise from your documents to prepare them for further analysis.

Let’s get started!

Prerequisites

In this tutorial, we will use Python’s nltk library to perform all NLP operations on the text. At the time of writing this tutorial, we used version 3.4 of nltk. To install the library, you can use the pip command on the terminal:

pip install nltk==3.4

To check which version of nltk you have in the system, you can import the library into the Python interpreter and check the version:

import nltk
print(nltk.__version__)
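
If you would rather check the version without importing the library (for example, to confirm whether it is installed at all), the following is a minimal sketch using the standard library's importlib.metadata, which assumes Python 3.8 or later. The helper name installed_version is hypothetical and not part of nltk.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string if nltk is installed, otherwise None.
print(installed_version("nltk"))
```

A result of None suggests the pip command above has not been run in the current environment.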

To perform certain actions within nltk in this tutorial, you may have to download specific resources. We will describe each resource as and when required.

However, if you would like to avoid downloading individual resources later in the tutorial and grab them now in one go, run the following command:

python -m nltk.downloader all

Step 1: Convert into Tokens

A computer system cannot find meaning in natural language by itself. The first step in processing natural language is to convert the original text into tokens. A token is a sequence of contiguous characters that carries some meaning. It is up to you to decide how to break a sentence into tokens. For instance, an easy method is to split a sentence on whitespace to break it into individual words.
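
As a quick illustration of the whitespace approach, here is a minimal sketch using only Python's built-in str.split(). The sample sentence is chosen for illustration. Note that punctuation stays attached to the neighboring word, which is one reason a dedicated tokenizer is usually preferable.

```python
sentence = "Hi, this is a nice hotel."

# split() with no arguments splits on any run of whitespace.
tokens = sentence.split()

print(tokens)
# ['Hi,', 'this', 'is', 'a', 'nice', 'hotel.']
```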

In the NLTK library, you can use the word_tokenize() function to convert a string to tokens. However, you will first need to download the punkt resource. Run the following in the Python interpreter:

import nltk
nltk.download('punkt')

Next, you need to import word_tokenize from nltk.tokenize to use it.

from nltk.tokenize import word_tokenize
print(word_tokenize("Hi, this is a nice hotel."))

The output of the code is as follows:

['Hi', ',', 'this', 'is', 'a', 'nice', 'hotel', '.']
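
To see why the punctuation marks come out as separate tokens, it can help to compare with a rough stand-in written with the standard re module. This is only a sketch for illustration: the function name simple_tokenize and its pattern are assumptions, and it is not how word_tokenize() is implemented. It will also differ from NLTK on contractions, abbreviations, and other harder cases.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+       matches a run of letters, digits, or underscores
    # [^\w\s]   matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Hi, this is a nice hotel."))
# ['Hi', ',', 'this', 'is', 'a', 'nice', 'hotel', '.']
```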