Three Methods for Comparing Text Data to Instantly Boost the Impact of Your Analysis


(The data and full Jupyter notebook walk-through can be found here.)

If you’re looking for a job as a data analyst or scientist, and trying to learn NLP at the same time, get ready to kill two birds with one stone! For this article, I’ll be using 1,000 job postings for “Data Analyst” on Indeed.com as my set of documents (corpus). I chose data analyst since there will be a larger variety of postings and requirements compared to data scientist. The columns are pretty self-explanatory:

[Figure: preview of the job postings dataset]

Before we dive into code, let’s stop and think about why we’re analyzing text data to begin with. For me, it’s usually for one of two reasons:

  1. I want to find the area of greatest overlap in my corpora, i.e., what are the relationships and variations between documents

  2. I’m looking for a set of keywords or values in a subset of my corpora, i.e., what is the main subject or points of discussion across documents

In both cases, the point is to help me narrow my scope of research and save me the time of reading through everything myself. Of course, these methods also help as a precursor to building a machine learning pipeline, since the analysis may surface errors or biases in your text data before you start training your model.

The three methods I’ll be reviewing are:

  1. Euclidean Distance of Tokenized Vectors with PCA and t-SNE
  2. Normal and Soft Cosine Similarities of a Set of Corpora
  3. Extracting Structured Relationships from Unstructured Text Data

Euclidean Distance

Let’s start with the basics, tokenizing the data using nltk methods:
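A minimal sketch of this step; the sample postings and the df["description"] column name are illustrative stand-ins for the scraped data:

```python
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

# Illustrative stand-in for the 1,000 scraped postings
df = pd.DataFrame({"description": [
    "Analyze sales data and build Excel dashboards for stakeholders.",
    "Maintain SQL pipelines and deliver ad hoc reports to management.",
]})

stop_words = set(stopwords.words("english"))

def tokenize(text):
    # Lowercase, split into word tokens, keep alphabetic non-stopwords
    tokens = word_tokenize(text.lower())
    return [t for t in tokens if t.isalpha() and t not in stop_words]

df["tokens"] = df["description"].apply(tokenize)
```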

And then apply CountVectorizer (think of this as one-hot encoding) as well as the Tfidf transformer (think of this as weighting the encoding, so words that are frequent across documents get lower weight).
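A rough sketch of the vectorization, continuing from the DataFrame above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = df["description"]  # raw posting texts from the step above

# Raw token counts (the one-hot-style encoding)
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(docs)

# Re-weight counts so terms common across documents count for less
tfidf = TfidfTransformer()
weighted = tfidf.fit_transform(counts)  # sparse (n_docs, n_terms) matrix
```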

Ultimately, we can fit it into a PCA model and plot the first two components using Seaborn’s KDE plot:
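A sketch of the projection and plot, assuming the weighted matrix from the previous step (on the full 1,000 postings this produces the density plot below):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

# Project the TF-IDF matrix onto its first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(weighted.toarray())  # densify for PCA

# Density plot of the documents in PCA space
sns.kdeplot(x=components[:, 0], y=components[:, 1], fill=True)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```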

[Figure: KDE density plot of the first two principal components]
Density plots are better than scatter plots if you aren’t looking into categories yet, especially when working with large corpora.

This is great for getting an initial idea of the lay of the land, based on word usage alone. However, since we aren’t layering any categories such as industry or salary on top of this, we won’t get much more insight from this step.

Normal and Soft Cosine Similarities

For PCA based on Euclidean distance over large corpora, you usually will not find strong relationships or clusters. This is because the semantic meaning of words has not yet been considered, on top of the fact that some postings are two lines long while others run multiple paragraphs. We can solve both of these issues with Cosine Similarity, the measurement of the angle between word vectors.

There are two main options for cosine similarity. The first is to take the cosine similarity of the word vectors after applying all the same tokenization steps as before, and create a matrix. Simple enough:
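A sketch, again assuming the weighted TF-IDF matrix from earlier:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity over the TF-IDF document vectors
cosine = cosine_similarity(weighted)  # (n_docs, n_docs) similarity matrix

# Scan it visually, much like you would a pairplot
sns.heatmap(cosine, cmap="viridis")
plt.show()
```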

[Figure: heatmap of the pairwise cosine similarity matrix]
Think of this kind of like Seaborn’s pairplot.

The second option goes a step further, taking an inner dot product with L2 normalization between a pre-trained word embedding similarity matrix and the normal similarity matrix we just calculated, to get a soft cosine matrix (essentially trying to better capture the semantic meaning of the words in the descriptions). We can do this easily using gensim word2vec or FastText pre-trained word embedding models. If you are unfamiliar with embedding models, think of them as giving two words a similarity score based on how similar their contexts of use are (i.e., the words directly before and after them).
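A sketch of this step using gensim’s downloader API; the choice of the small GloVe model is illustrative, and the pre-trained word2vec or FastText models work the same way:

```python
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import (
    SoftCosineSimilarity,
    SparseTermSimilarityMatrix,
    WordEmbeddingSimilarityIndex,
)

# Token lists from the nltk step; small model keeps the download light
tokenized_docs = df["tokens"].tolist()
w2v = api.load("glove-wiki-gigaword-50")

dictionary = Dictionary(tokenized_docs)
bow = [dictionary.doc2bow(doc) for doc in tokenized_docs]
tfidf = TfidfModel(dictionary=dictionary)

# Term-term similarity matrix derived from the word embeddings
termsim_matrix = SparseTermSimilarityMatrix(
    WordEmbeddingSimilarityIndex(w2v), dictionary, tfidf
)

# Soft cosine similarity of every posting against every other posting
index = SoftCosineSimilarity(tfidf[bow], termsim_matrix)
soft_cosine = index[tfidf[bow]]  # (n_docs, n_docs) array
```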

[Figure: heatmap of the soft cosine similarity matrix]
If you compare the values of this soft cosine with the normal cosine above, you’ll see large differences.

We get an average similarity of 0.80 if we take cosine.mean().mean(), which makes sense given these are all data analyst postings. Taking the same PCA and density plot gives us:
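Roughly, assuming the soft_cosine array from the sketch above:

```python
import pandas as pd
from sklearn.decomposition import PCA

# Average pairwise similarity across all postings (~0.80 on the full set)
cosine = pd.DataFrame(soft_cosine)
print(cosine.mean().mean())

# Re-run the two-component PCA, this time on the soft cosine matrix
components = PCA(n_components=2).fit_transform(cosine)
```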

[Figure: KDE density plot of the first two principal components of the soft cosine matrix]

With the added standardization and semantic layering, we can now dig into the densest areas to see which posts are clustering there.
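A minimal sketch of what such a search function might look like; the name search_area and its signature are hypothetical, and gensim’s summarize requires gensim below 4.0 (the summarization module was removed in 4.x):

```python
# gensim's summarize was removed in gensim 4.x; pin gensim<4 for this step
from gensim.summarization import summarize

# Hypothetical helper: select postings whose first two PCA coordinates
# fall inside a bounding box (assumes components and df from earlier)
def search_area(components, df, xmin, xmax, ymin, ymax):
    mask = (
        (components[:, 0] > xmin) & (components[:, 0] < xmax)
        & (components[:, 1] > ymin) & (components[:, 1] < ymax)
    )
    return df[mask]

cluster = search_area(components, df, -1, 1, -0.5, -0.1)
print(len(cluster), "postings found")
print(summarize(" ".join(cluster["description"]), word_count=200))
```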

Using the above search function on the area of the plot x(-1, 1) and y(-0.5, -0.1), then applying gensim’s summarize, gives us:

Manage and consolidate all Operational Spreadsheets. Work with management to prioritize business and information needs. Locate and define new process improvement opportunities. Collaborating with the executive team to understand the analytical needs of our multichannel operations and developing data-driven insights that are both strategic and operational. Formulating and championing insights on specific business tactics such as inventory forecasting and gaps analysis to drive those insights into action. What You Need to Succeed: - to years of experience working in Inventory Analyst role or Warehouse Master Data Management. - Bachelor's degree or higher education in Computer Science, Information Technology, Statistics, Business Administration or a closely related field. - Sound knowledge of inventory control practices & supply chain - Advanced Excel spreadsheet knowledge: Pivot Table, VLOOKUP, Various Formula - Familiarity with SAP and Warehouse Management System is a plus. Other Skills: - Exceptional written and oral communication skills. - Eye for detail and have a strong analytical skills. - Ability to priorities work effectively. - Able to balance/challenge conflicting objectives. - Able to see the bigger picture beyond the scope of their role. - Able to challenge the current status quo and being open minded towards new and different ways of doing things. - Tackles new problems in a timely manner. Pharmapacks, is a leading e-commerce company with a proprietary tech platform that empowers Brands with a complete and cost-effective logistics, fulfillment, marketing and sales solution.

While a summary of 81 postings may not read entirely coherently, you can see that this area is focused on management of data and Excel spreadsheets in a logistics/inventory environment. Since I already stored the URLs, I could go on and apply to just these 81 postings if I was interested!

Structured Relationships and Data from Text

A lot of useful data is hidden inside reports and articles, instead of tables. For example, job postings often list requirements and salaries on top of a brief description of the role.

I’ll be focusing this analysis on the skills and tasks a job posting is asking for. We can do this by deconstructing sentences into parts of speech, then labeling the main subject and object as well as the relationship between them.

For example,

“This position will be the first of its kind at (Company Name) and as such, the person in this role must be able and excited to take full responsibility for building out and maintaining internal tracking and reporting procedures”

can be broken down into parts of speech and dependencies using spaCy’s English model:
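A sketch of the parse (requires downloading the small English model first with python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp(
    "This position will be the first of its kind at (Company Name) and "
    "as such, the person in this role must be able and excited to take "
    "full responsibility for building out and maintaining internal "
    "tracking and reporting procedures"
)

# Part-of-speech tag, dependency label, and head word for each token
for token in doc:
    print(f"{token.text:<15} {token.pos_:<6} {token.dep_:<10} {token.head.text}")

# In a notebook, displacy draws the dependency tree:
# from spacy import displacy; displacy.render(doc)
```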

[Figure: spaCy dependency parse of the example sentence]
See the Jupyter notebook for a better visualization of this dependency tree.

By leveraging functions that create entity pairs as well as relationships (based on this helpful article), we can get a better picture of what types of tasks are being asked of data analysts. Taking value_counts() of the relationships found gives us:
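A simplified sketch in the spirit of that article, using each sentence’s root verb as the relation; the full version also extracts subject/object entity pairs:

```python
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

def get_relation(sent):
    # Use the root verb of the sentence as the "relation"
    doc = nlp(sent)
    for token in doc:
        if token.dep_ == "ROOT":
            return token.lemma_.capitalize()
    return None

# Illustrative sentences standing in for those pulled from the postings
sentences = [
    "The analyst will provide ad hoc reports to stakeholders.",
    "You will provide action plans for the executive team.",
    "Create dashboards that track inventory levels.",
]
relations = pd.Series([get_relation(s) for s in sentences])
print(relations.value_counts())
```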

[Figure: bar chart of the top relations found]
Top Relations Found

Filtering for “Provide” and creating a network graph gives us:
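A sketch of the graph step; the pairs DataFrame here is illustrative, standing in for the entity pairs extracted above:

```python
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

# Illustrative entity pairs; in practice these come from the extraction step
pairs = pd.DataFrame({
    "source": ["analyst", "analyst", "analyst"],
    "target": ["business risks", "action plans", "ad hoc reports"],
    "relation": ["Provide", "Provide", "Provide"],
})

# Keep only the "Provide" relation and build a directed graph from it
provide = pairs[pairs["relation"] == "Provide"]
G = nx.from_pandas_edgelist(
    provide, "source", "target", edge_attr="relation",
    create_using=nx.DiGraph()
)

nx.draw(G, pos=nx.spring_layout(G), with_labels=True, node_color="lightblue")
plt.show()
```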

[Figure: network graph of entity pairs for the relation “Provide”]
You can see that there is a very diverse set of asks just for “Provide”.

The knowledge graph works well for “Provide” since there are only 100 mentions, but results can vary for relationships with larger counts. We can see that analysts are expected to provide everything from potential business risks to action plans and ad hoc reports.

Conclusion

I hope this has helped you understand how to better approach text data. With these three tools, we’ve quickly identified niche types of roles within “data analyst,” as well as some of the abilities and tasks you should prepare for. If you want to see deeper analysis using these techniques, check out this article!

As mentioned earlier, the data and full Jupyter notebook walk-through can be found here.

Translated from: https://towardsdatascience.com/three-methods-for-comparing-text-data-to-instantly-boost-the-impact-of-your-analysis-99efb99c6f40
