the bag of words representation —— Python Data Science CookBook

SnailDove

Categories: NLTK, python. Tags: NLTK, python

Article link: https://blog.csdn.net/you1314520me/article/details/54974270

In order to do machine learning on text, we need to convert the text to numerical feature vectors. In the bag of words representation, the text is converted to numerical vectors whose column names are the underlying words and whose values can be any of the following:

Binary, which indicates whether the word is present/absent in the given document
Frequency, which indicates the count of the word in the given document
TFIDF, which is a score that we will cover subsequently

Bag of words is the most common way of representing text. As the name suggests, the order of words is ignored, and only the presence or absence of words is key to this representation. It is a two-step process, as follows:
1. For every word in the documents of the training set, we assign an integer and store this mapping as a dictionary.

2. For every document, we create a vector. The columns of the vector are the actual words themselves; they form the features. The value of each cell is binary, frequency, or TFIDF.
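
The two steps above can be sketched in plain Python. This is only an illustration under simple assumptions: words are split on whitespace and lowercased, and the toy documents and the helper names `build_vocabulary` and `vectorize` are hypothetical, not part of the recipe.

```python
from collections import Counter


def build_vocabulary(documents):
    """Step 1: map every distinct word in the training documents to an integer."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def vectorize(document, vocabulary, binary=False):
    """Step 2: build one vector per document, one cell per vocabulary word."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        count = counts.get(word, 0)
        # Binary cells record presence/absence; frequency cells record the count.
        vector[index] = int(count > 0) if binary else count
    return vector


# Hypothetical toy documents, not the text used in this recipe.
docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
frequency_vectors = [vectorize(d, vocab) for d in docs]
binary_vectors = [vectorize(d, vocab, binary=True) for d in docs]

print(vocab)
print(frequency_vectors)
print(binary_vectors)
```

Words that do not appear in the vocabulary are simply ignored by `vectorize`, which mirrors the idea that only words seen in the training set become features.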

Tip

Depending on your application, the notion of a document can change. In this case, each sentence is considered a document. In some cases, we can also treat a paragraph as a document. In web page mining, a single web page can be treated as a document, or the parts of the web page separated by <p> tags can each be treated as a document. In our case, we have 5 sentences, that is, 5 documents.

Example

In step 3 of the source code, we import CountVectorizer from the sklearn.feature_extraction.text package. It converts a collection of documents (in this case, a list of sentences) to a matrix, where the rows are sentences and the columns are the words in these sentences. The counts of these words are the values of the cells. Count_v is a CountVectorizer object. We mentioned in the introduction that we need to build a dictionary of all the words in the given text. The vocabulary_ attribute of the CountVectorizer object provides us with the words and their associated IDs, or feature indices.
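
As a minimal sketch of the CountVectorizer behaviour described here, the snippet below fits it on two short sentences. The sentences and the names `sentences`, `tdm`, and `binary_tdm` are assumptions made for illustration; it uses the default tokenizer and no stop word removal, so it is not the recipe's exact code.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-sentence corpus; each sentence is one document.
sentences = [
    "Text mining derives information from text.",
    "Text analysis involves information retrieval.",
]

# Frequency values: each cell holds the count of a word in a sentence.
count_v = CountVectorizer()
tdm = count_v.fit_transform(sentences)

# vocabulary_ maps each word to its column (feature) index.
print(count_v.vocabulary_)
print(tdm.toarray())

# Binary values: each cell only records presence (1) or absence (0).
binary_v = CountVectorizer(binary=True)
binary_tdm = binary_v.fit_transform(sentences)
print(binary_tdm.toarray())
```

With the default settings the words are lowercased and the column indices follow the alphabetical order of the vocabulary, so "text" occupies the last column and has a count of 2 in the first row of the frequency matrix.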

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
@author: snaildove
"""
# Load Libraries
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
# 1. Our input text, we use the same input which we had used in stop word removal recipe.
text = "Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. Highquality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output.  'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).Text analysis involves information retriev
