Machine Learning Example: A Simple Text Analysis Example Using Machine Learning Methods

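This article provides a simple machine learning text analysis example that shows how machine learning methods can be used to understand text.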


Machine Learning Example

Information Extraction vs Information Retrieval

To begin, let's get some terminology under our belt. Information Retrieval (IR) is the process of finding a document within a repository. It is about document search: Google and other search engines perform IR over websites on the Internet. Information Extraction (IE) is the process of extracting relevant information from within documents. IR follows its own set of algorithms and design patterns, such as search engines and search optimization. Here we will concentrate on IE.
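To make the distinction concrete, here is a minimal sketch using only the Python standard library. The toy repository `DOCS`, the helper names, and the e-mail pattern are all assumptions made for illustration; real IR and IE systems are far more involved.

```python
import re
from collections import defaultdict

# Hypothetical toy repository: document id -> text.
DOCS = {
    "ticket-1": "Server db01 went down on 2021-03-04, contact ops@example.com",
    "ticket-2": "Password reset requested by alice@example.com",
    "ticket-3": "Disk usage on server web02 is above 90 percent",
}

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

def retrieve(index, query):
    """IR: return the ids of documents that contain every query word."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

def extract_emails(text):
    """IE: pull e-mail addresses out of a single document."""
    return re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)

index = build_index(DOCS)
matching_ids = sorted(retrieve(index, "server"))   # which documents?
emails = extract_emails(DOCS["ticket-1"])          # what is inside one?
```

`retrieve` answers "which documents mention this?", while `extract_emails` answers "what facts does this document contain?". That is the IR versus IE split described above.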

Text Hierarchy

The basic unit of work in text analytics is the token. In general a token corresponds to a word, although a tokenizer may split a contraction such as "isn't" into several tokens, for example "is", "'" and "nt". For all practical purposes, though, a token is a word. Tokens come together to form sentences, sentences combine to form a paragraph, and paragraphs come together to form a document. Documents belonging to a particular domain or source are generally stored in a document store, and multiple document stores together form a repository.
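The hierarchy can be sketched with a few naive regular-expression splitters. The function names and splitting rules below are assumptions made for illustration. Production tokenizers handle abbreviations, contractions, and punctuation much more carefully, and they may split "isn't" differently from this sketch.

```python
import re

def split_paragraphs(document):
    """Split a document into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

def split_sentences(paragraph):
    """Naive sentence split: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]

def tokenize(sentence):
    """Naive tokenizer: runs of letters/digits, or single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", sentence)

document = "The service was slow. It isn't fixed yet!\n\nPlease advise."

# document -> paragraphs -> sentences -> tokens
hierarchy = [
    [tokenize(sentence) for sentence in split_sentences(paragraph)]
    for paragraph in split_paragraphs(document)
]
```

Here `hierarchy` is a list of paragraphs, each a list of sentences, each a list of tokens, mirroring the token, sentence, paragraph, document structure.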

Note: In NLP, a document does not mean a file with multiple paragraphs, a table of contents, and so on. A document is a single complete unit of information. For example, a piece of feedback or a review constitutes a document, and so does a novel.

Document Classification

This is one of the most common tasks in NLP and is generally the first task performed in IE. There are many methods for document classification.

Types of Document Classification and Techniques

Supervised Document Classification
Unsupervised Document Classification

Supervised Document Classification: In supervised classification we use human-annotated (classified) documents as input. This is used mainly in tasks where we already have a well-defined business process. For example, in ITOps ticket classification, if we already have a series of labelled tickets (buckets), we would use supervised classification.
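As a rough illustration of the supervised setting, the sketch below trains a classifier on a handful of made-up labelled tickets. The ticket texts, the bucket names, and the choice of a TF-IDF plus logistic regression pipeline are assumptions made for this example, not a prescribed method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled tickets: (text, bucket).
tickets = [
    ("cannot log in password expired", "access"),
    ("reset my password please", "access"),
    ("account locked after failed login", "access"),
    ("disk full on database server", "infrastructure"),
    ("server cpu usage very high", "infrastructure"),
    ("database server not responding", "infrastructure"),
]
texts = [text for text, _ in tickets]
labels = [label for _, label in tickets]

# TF-IDF turns each ticket into a weighted bag-of-words vector;
# logistic regression learns which words indicate which bucket.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Classify unseen tickets into one of the known buckets.
predicted = model.predict(["password reset needed", "server disk is full"])
```

The key point is that the labels come from people: the model can only assign tickets to buckets that already appear in the annotated data.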

Unsupervised Document Classification: This is also known as clustering and is done with no specified target labels. It is used when we want to group similar documents together to study how the documents cluster.
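A minimal clustering sketch, assuming four made-up documents and k-means with two clusters. Both the data and the choice of k-means are illustrative assumptions, since the text above does not name a specific clustering algorithm.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical unlabelled documents.
documents = [
    "password reset login account",
    "login password expired account",
    "server disk database full",
    "database server disk slow",
]

# Vectorize the text, then group the vectors into two clusters.
vectors = TfidfVectorizer().fit_transform(documents)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(vectors)
```

No labels are supplied here. The cluster ids are arbitrary integers, and it is up to the analyst to inspect each group and decide what it represents.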

So let us begin with a simple technique for document classification.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack
fro