5 Professional Projects Every Data Scientist Should Know


Table of Contents

  1. Introduction
  2. Customer Segmentation
  3. Text Classification
  4. Sentiment Analysis
  5. Time Series Forecasting
  6. Recommendation Systems
  7. Summary

Introduction

The goal of this article is to outline projects that a professional Data Scientist will eventually perform, or should perform. I have taken a lot of bootcamps and educational courses in Data Science. While they have all been useful in some way, I find that some forget to highlight real-world applications of Data Science. It is beneficial to know what to expect as you transition from studying Data Science to practicing it professionally. Customer segmentation, text classification, sentiment analysis, time series forecasting, and recommender systems can all help your employer tremendously. I will take a deeper look at each, explain why these five projects come to mind, and hopefully motivate you to apply them where you work.

Customer Segmentation

Customer segmentation is a form of Data Science in which an unsupervised clustering technique is used to develop groups, or segments, of a human population or of observations in data. The goal is to create groups that are distinct from one another while the members within each group share closely related features. The technical terms for this separation and cohesion are:

Between-groups sum of squares (BGSS)

  • how different the unique groups are from one another

Within-group sum of squares (WGSS)

  • how closely related the features within each group are

K-means clustering. Image by Author [2].
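To make the two quantities concrete, here is a minimal sketch that computes BGSS and WGSS for a small labeled dataset. It assumes the usual definitions (squared Euclidean distances, with BGSS weighted by group size); the points, labels, and function names are invented for illustration and are not taken from the author's code.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sums_of_squares(points, labels):
    """Return (BGSS, WGSS) for points grouped by their cluster labels."""
    overall = centroid(points)
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    bgss = wgss = 0.0
    for members in groups.values():
        center = centroid(members)
        # separation: group centroid vs. overall centroid, weighted by size
        bgss += len(members) * sq_dist(center, overall)
        # cohesion: each member vs. its own group centroid
        wgss += sum(sq_dist(p, center) for p in members)
    return bgss, wgss


# Hypothetical toy data: two tight, well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels = [0, 0, 0, 1, 1, 1]

bgss, wgss = sums_of_squares(points, labels)
total = sum(sq_dist(p, centroid(points)) for p in points)
print(f"BGSS={bgss:.3f}  WGSS={wgss:.3f}  total={total:.3f}")
```

A good segmentation has a large BGSS relative to WGSS. The two always add up to the total sum of squares, so for a fixed dataset, improving one necessarily improves the other.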

As you can see in the image above, these groups are well separated (high BGSS) and tightly centered (low WGSS). This example is ideal. Think of each cluster as a group that you will target with a specific marketing advertisement: ‘we want to appeal to recent college graduates by marketing our company product as young-professional centered’. Some useful clustering algorithms are:

  • DBSCAN
  • K-means
  • Agglomerative Hierarchical Clustering
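As a rough illustration of how one of these algorithms works, the sketch below implements K-means (Lloyd's algorithm) in plain Python. The customer features (age, monthly spend) and all names are hypothetical; in practice you would more likely use a library implementation and scale the features first.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(n_iter):
        # assignment step: each point joins its nearest centroid
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean(members)
    return labels, centroids


# Hypothetical customers described by (age, monthly spend).
customers = [(22, 15), (25, 18), (23, 20), (45, 80), (48, 85), (50, 78)]
labels, centroids = kmeans(customers, k=2)
print(labels)
print(centroids)
```

Each resulting label is a segment: here the younger, low-spend customers end up in one group and the older, high-spend customers in the other, which is the kind of split a marketing team could then target separately.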

What happens with customer segmentation results?

  • finding insights about specific groups
  • marketing towards specific groups
  • defining groups in the first place
  • tracking metrics about certain groups

This type of Data Science project is broadly used, but it is most useful in the marketing industry.

Text Classification

Feature and target of text classification example. Code by Author [3].

Text classification falls under the umbrella of Natural Language Processing (NLP), a set of techniques for ingesting and modeling text data. You can think of this kind of project as a way to assign category labels to observations by using text features (optionally along with numeric features).

Here [4] is a simple example of using both text and numeric features for text classification. Instead of a single word, your text feature may contain hundreds of words, and you will need NLP techniques such as part-of-speech tagging, stop word removal, tf-idf, and count vectorizing. A common Python library that Data Scientists use for this is nltk. The goal of these techniques is to clean your text data and create the best representation of it, so as to eliminate noise.
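Three of the techniques named above (stop word removal, count vectorizing, and tf-idf) can be sketched in a few lines of plain Python. This is a minimal illustration, not the example linked in [4]: the stop word list and documents are made up, and the weighting is the unsmoothed textbook formula, whereas libraries typically apply smoothed variants.

```python
import math
from collections import Counter

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "is", "to", "and", "my", "for", "was"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (t.strip(".,!?") for t in text.lower().split())
    return [t for t in tokens if t and t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    counts = [Counter(tokenize(d)) for d in docs]  # count vectorizing
    n_docs = len(docs)
    # number of documents each term appears in
    doc_freq = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append(
            {
                term: (n / total) * math.log(n_docs / doc_freq[term])
                for term, n in c.items()
            }
        )
    return weights


docs = [
    "The invoice for my loan payment is late.",
    "My loan application was approved.",
    "The library book is overdue.",
]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

Words that appear in many documents ("loan") receive a lower weight than words specific to one document ("invoice"), which is how tf-idf reduces noise before the features reach a classifier.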

What happens with text classification results?

  • automatic categorization of observations
  • scores associated with each suggested category

You can also categorize text documents that would otherwise take hours upon hours to read manually.
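The two results listed above, a suggested category and a score for each category, can be illustrated with a very small multinomial naive Bayes classifier. Naive Bayes is my choice for this sketch, not something the article prescribes, and the training sentences and labels ("billing", "technical") are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict_scores(model, text):
    """Return {label: probability} using Laplace-smoothed naive Bayes."""
    word_counts, label_counts, vocab = model
    n_examples = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / n_examples)  # class prior
        for token in text.lower().split():
            if token in vocab:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][token] + 1) / (total_words + len(vocab))
                )
        log_scores[label] = score
    # turn log scores into probabilities that sum to 1
    top = max(log_scores.values())
    exp_scores = {lab: math.exp(s - top) for lab, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {lab: v / norm for lab, v in exp_scores.items()}


examples = [
    ("refund charged twice on card", "billing"),
    ("invoice amount is wrong", "billing"),
    ("app crashes on login", "technical"),
    ("cannot reset password in app", "technical"),
]
model = train(examples)
scores = predict_scores(model, "charged wrong amount on invoice")
best = max(scores, key=scores.get)
print(best, scores)
```

The score attached to each category is useful in its own right: low-confidence predictions can be routed to a person for review instead of being categorized automatically.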

This type of project is useful in the finance industry and in historical or library work.

Sentiment Analysis

Sentiment analysis is also under the umbrella of NLP. It is a way to assign sentiment scores to text, or more specifically, polarity and subjectivity scores. Sentiment analysis is beneficial when you have plenty of text data and want to digest it into levels of good or bad sentiment. If your company already has a rating system in place, it may seem redundant, but people often leave reviews whose text does not match their numerical score. Another benefit of sentiment analysis is that you can flag certain keywords or phrases that you want to highlight in order to make your product better. Aligning keywords with sentiment lets you aggregate metrics and visualize what your product is lacking and where improvements could be made.
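The mismatch between review text and numerical score described above can be sketched with a simple lexicon-based polarity score. The word lists, thresholds, and reviews below are invented for illustration; a real project would rely on a much larger lexicon or a trained model, and this sketch covers polarity only, not subjectivity.

```python
# Tiny hypothetical sentiment lexicons.
POSITIVE = {"great", "love", "excellent", "fast", "good"}
NEGATIVE = {"bad", "slow", "broken", "terrible", "late"}


def polarity(text):
    """Score in [-1, 1]: (positive hits - negative hits) / total hits."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    hits = pos + neg
    return 0.0 if hits == 0 else (pos - neg) / hits


def mismatched(reviews, high=4, low=2):
    """Flag reviews whose text polarity contradicts the star rating."""
    flagged = []
    for text, stars in reviews:
        score = polarity(text)
        if (stars >= high and score < 0) or (stars <= low and score > 0):
            flagged.append((text, stars, score))
    return flagged


reviews = [
    ("Great product, fast delivery!", 5),
    ("Arrived late and the box was broken.", 5),
    ("Terrible support, slow replies.", 1),
]
flagged = mismatched(reviews)
for text, stars, score in flagged:
    print(f"{stars} stars but polarity {score:+.1f}: {text}")
```

The flagged reviews are the ones a numerical rating system alone would miss, and the negative keywords that triggered the flag ("late", "broken") point at what to improve.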

What happens with sentiment analysis results?

  • product improvements
  • sentiment flagging for customer service

This type of project is useful in plenty of industries, especially e-commerce, entertainment, or anywhere that includes text reviews.

Time Series Forecasting

Photo by Sonja Langford on Unsplash.

Time series forecasting can be applied to many parts of various industry sectors. Most of the time, its ultimate use is to allocate funds or resources for the future. If you have a sales team, they would benefit from your forecast, as would investors, who can see where your company is going (hopefully toward increasing sales). More directly, if you have a forecasted target for a given day, you can decide how many employees to assign in general and where to place them. A popular example would be Amazon, or any similar company where consumers buy frequently and the company needs to allocate factories, drivers, and different locations that all have to work together.

What happens with time series forecasting results?

  • allocation of resources
  • awareness of future sales

Some popular models for time series forecasting are ARIMA and LSTM.
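A full ARIMA or LSTM model is beyond a short example, so the sketch below fits only the simplest autoregressive case, AR(1) with a constant, which corresponds to ARIMA(1, 0, 0). It uses ordinary least squares in plain Python. The sales figures are hypothetical and were constructed to follow the recursion exactly so that the fit is easy to check; real data is noisy and would normally be handled with a dedicated library such as statsmodels.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = c + phi * y[t-1]."""
    x = series[:-1]  # previous values
    y = series[1:]   # next values
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    phi = cov / var
    c = mean_y - phi * mean_x
    return c, phi


def forecast(series, steps):
    """Roll the fitted recursion forward from the last observed value."""
    c, phi = fit_ar1(series)
    preds = []
    last = series[-1]
    for _ in range(steps):
        last = c + phi * last
        preds.append(last)
    return preds


# Hypothetical daily sales that level off toward a plateau.
daily_sales = [100.0, 110.0, 115.0, 117.5, 118.75, 119.375]
c, phi = fit_ar1(daily_sales)
preds = forecast(daily_sales, steps=3)
print(f"c={c:.3f} phi={phi:.3f}")
print(preds)
```

The forecasted values are what would feed the allocation decisions described above, for example how much stock or staff to plan for the next few days.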

This type of project is useful in plenty of industries as well, but most often in sales or supply management.

Recommender Systems

Photo by Simon Bak on Unsplash [6].

While you may or may not be designing Netflix’s next recommendation algorithm, you may find yourself applying similar techniques to several parts of your business. Think of this type of project as a way to ultimately sell more products to users. As a consumer, if you are buying certain products or groceries and you see some recommended ones at the end of your cart checkout, you may be inclined to quickly buy one of those recommendations. Expand this result to every user and you can make your company millions.

Here are some common ways to approach recommendation systems in Data Science.

Collaborative filtering: alternating least squares (matrix factorization)

  • finds people who are similar to you and recommends what they like

Content-based filtering: cosine similarity

  • uses the attributes or features of a product you already bought to recommend a similar product in the future
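The content-based approach can be sketched with cosine similarity over item feature vectors. The catalog, the binary features, and the function names below are hypothetical; collaborative filtering with alternating least squares needs a user-item rating matrix and is not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical item features: [organic, snack, beverage, fruit]
catalog = {
    "organic apples": [1, 0, 0, 1],
    "organic apple juice": [1, 0, 1, 1],
    "organic granola": [1, 1, 0, 0],
    "potato chips": [0, 1, 0, 0],
}


def recommend(purchased, catalog, top_n=2):
    """Rank the other catalog items by similarity to the purchased item."""
    target = catalog[purchased]
    scored = [
        (cosine(target, features), name)
        for name, features in catalog.items()
        if name != purchased
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_n]]


recs = recommend("organic apples", catalog)
print(recs)
```

Because the ranking depends only on product attributes, this approach can recommend items that no user has bought yet, which is one reason to combine it with collaborative filtering.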

This type of project is useful in plenty of industries as well, but most often in e-commerce and entertainment.

Summary

I hope that highlighting these key projects, which you may already use often or will use as a professional Data Scientist, gave you some inspiration. Machine Learning education sometimes focuses on obtaining the best accuracy, but the focus of Data Science in the professional sense is to help your company improve its product, help people, and save or make more money.

To summarize, here are five popular professional projects to practice:

  • customer segmentation
  • text classification
  • sentiment analysis
  • time series forecasting
  • recommender systems

I hope you enjoyed my article. Thank you for reading! Please feel free to comment below and suggest other professional Data Science projects you have encountered, so that we can all improve our professional Data Science portfolios.

Translated from: https://towardsdatascience.com/5-professional-projects-every-data-scientist-should-know-e89bf4e7e8e1
