A Complete Guide to Learn Data Science in 100 Days

Are you interested in learning Data Science but not sure where to start? If so, you have come to the right place.

I have come across many people who were very passionate about learning Data Science but quit within just a few weeks. I wondered how someone could be so passionate about a field and yet not pursue it. By talking to some of them, I understood that the main reasons people dropped out were:

  • They were overwhelmed by the number of topics they had to learn to become a Data Scientist
  • They came across gatekeepers who claim that to become a Data Scientist one needs to be a talented programmer, an expert in mathematics, a master of applied statistics, and highly skilled in pandas, NumPy, and other Python libraries

These claims are enough to scare even an experienced Data Scientist, so it is no wonder they made people attempting to learn Data Science quit. Each of the above topics is an ocean in itself, and anyone who tries to master them all quickly gets frustrated and abandons the learning journey. The truth is that you only need to know just enough of each of these topics to become a successful Data Scientist, or to get hired as one.

Show me the Path to Learn Data Science

Photo by Joshua Earle on Unsplash

To become a Data Scientist, one needs to learn just enough of the following topics:

  • Basics of Python or R programming
  • If you choose Python, libraries like Pandas and NumPy
  • Visualization libraries like ggplot, Seaborn, and Plotly
  • Statistics
  • SQL programming
  • Mathematics, especially Linear Algebra and Calculus

In the video below, I walk through a step-by-step guide to learning Data Science and explain the depth of knowledge required to reach different levels of expertise in Data Science.

How to plan the learning? Which topics should be covered first?

Let me clearly lay out the plan to learn Data Science in 100 days. Below is a day-by-day plan for learning Data Science using Python; it spans 100 days and requires spending at least an hour each day.

Day 1: Installation of Tools

Just ensure that the required tools are installed and get comfortable with the tool you are going to use for the next few weeks or months. If you choose Python, install Anaconda, which also installs the IDEs Jupyter Notebook and Spyder. If you choose R, install RStudio. Play around with the IDE and get comfortable using it: learn how to install packages/libraries, execute a portion of the code, clear the memory, and so on.
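
If you want a quick sanity check that the environment is ready, a minimal sketch like the one below can be run in a fresh Jupyter Notebook cell (the versions printed will differ on your machine):

```python
# Confirm the core data science libraries shipped with Anaconda are importable
import sys

import matplotlib
import numpy as np
import pandas as pd

print("Python:", sys.version.split()[0])
print("pandas:", pd.__version__)
print("NumPy:", np.__version__)
print("matplotlib:", matplotlib.__version__)
```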

Day 2 to Day 7: Basic Programming for Data Science

The next step is to learn basic programming. Below are some of the topics that should be covered (a short sketch follows the list):

  • Creation of variables
  • The String data type and the operations commonly performed on strings
  • Numeric data types, Booleans, and operators
  • Collection data types: List, Tuple, Set, and Dictionary (it is important to understand the uniqueness of and differences between each of them)
  • If-Then-Else conditions, For loop and While loop implementations
  • Functions and lambda functions: the benefits of each and their differences
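
A minimal sketch, with made-up values, of how the four collection types differ and what a lambda looks like next to a regular function:

```python
# Four core Python collection types and their key differences
scores_list = [88, 92, 88, 75]          # list: ordered, mutable, allows duplicates
point_tuple = (3.5, 7.2)                # tuple: ordered, immutable
unique_tags = {"python", "pandas"}      # set: unordered, unique elements only
ages_dict = {"alice": 30, "bob": 25}    # dict: key/value mapping

scores_list.append(60)                  # lists can grow in place
unique_tags.add("python")               # adding a duplicate to a set changes nothing

# A regular function and its lambda equivalent
def square(x):
    return x * x

square_lambda = lambda x: x * x

print(square(4), square_lambda(4))      # 16 16
print(scores_list, point_tuple, unique_tags, ages_dict["alice"])
```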

Day 8 to Day 17: Pandas Library

Learn the Pandas library. Some of the topics that one needs to cover in Pandas are:

  • Creating a data frame, reading data from a file, and writing a data frame to a file
  • Indexing and Selection of data from a data frame
  • Iteration and Sorting
  • Aggregation and Group By
  • Missing Values and handling of missing values
  • Renaming and Replacing in Pandas
  • Concatenating, Merging, and Joining in a data frame
  • Summary Analysis, Cross Tabulation, and Pivot
  • Date, Categorical and Sparse Data

Spend a good 10 days learning the above topics thoroughly, as they will be very useful when you perform exploratory data analysis. While covering them, go into the granular details, such as the differences between merge and join, or between crosstab and pivot; that way you not only learn each operation but also know when and where to use it.
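
As a small illustration of a few of these operations, here is a minimal sketch on a made-up DataFrame (in a real project the data would come from pd.read_csv or a database; the column names here are hypothetical):

```python
import pandas as pd

# Build a tiny DataFrame in code; in practice you might use pd.read_csv("sales.csv")
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "product": ["a", "a", "b", "b"],
    "revenue": [100, 80, None, 120],
})

sales["revenue"] = sales["revenue"].fillna(0)        # handle the missing value
totals = sales.groupby("region")["revenue"].sum()    # aggregation with group by
print(totals)

# Merge with a lookup table on a shared key
regions = pd.DataFrame({"region": ["north", "south"], "manager": ["Ann", "Raj"]})
merged = sales.merge(regions, on="region", how="left")
print(merged)
```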

Why should I learn Pandas? Any Data Science project begins with exploratory data analysis to understand the data better, and the topics you cover in Pandas will come in handy there. Pandas also helps in reading data from different sources and formats, is fast and efficient, and provides easy ways to perform various operations on a dataset.

Day 18 to Day 22: NumPy Library

Having learned Pandas, the next important library to learn is NumPy. The reason to learn NumPy is that its arrays are very fast compared with Python lists. The topics to cover in NumPy include:

  • Creation of an Array
  • Indexing and Slicing
  • Data Types
  • Joining and splitting
  • Searching and Sorting
  • Filtering required data elements

Why is it important to learn NumPy? NumPy enables fast and efficient scientific operations on data. It supports the efficient matrix operations commonly used in machine learning algorithms, and the Pandas library itself uses NumPy extensively.
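
A minimal sketch of the kind of array, slicing, and matrix operations NumPy makes convenient (the values are made up):

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])   # create a 2x2 array
b = np.arange(4).reshape(2, 2)            # another 2x2 array: [[0, 1], [2, 3]]

print(a[:, 0])                  # slicing: first column of a
print(a @ b)                    # matrix multiplication, common in ML algorithms
print(a[a > 2.0])               # boolean filtering of elements
print(np.sort(a, axis=None))    # sort all elements into a flat array
```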

Day 23 to Day 25: Visualizations

Now it's time to spend some quality time understanding and using some of the key visualization libraries like ggplot, Plotly, and Seaborn. Use a sample dataset and try different visualizations such as bar charts, line/trend charts, box plots, scatter plots, heatmaps, pie charts, histograms, bubble charts, and other interesting or interactive visualizations.
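
For example, a minimal Seaborn sketch using its bundled "tips" sample dataset (fetched over the network on first use); the same idea extends to the other chart types:

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")            # small sample dataset bundled with Seaborn

sns.histplot(data=tips, x="total_bill")    # histogram of bill amounts
plt.show()

sns.boxplot(data=tips, x="day", y="tip")   # box plot of tips per day
plt.show()

sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")  # scatter plot
plt.show()
```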

Photo by Luke Chesser on Unsplash

The key to a Data Science project is communicating insights to the stakeholders, and visualizations are a great tool for that purpose.

Day 26 to Day 35: Statistics, Implementation, and Use-cases

The next important topic to cover is statistics. Explore the commonly used descriptive statistics such as mean, median, mode, range analysis, standard deviation, and variance.

Then cover slightly deeper techniques, such as identifying outliers in a dataset and measuring the margin of error.
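
As an illustration, here is a short sketch that computes a few of these descriptive statistics and flags outliers with the common 1.5 x IQR rule (the data values are made up):

```python
import numpy as np

data = np.array([12, 15, 14, 10, 8, 12, 15, 11, 102])  # 102 is an obvious outlier

print("mean:", np.mean(data))
print("median:", np.median(data))
print("std dev:", np.std(data, ddof=1))
print("variance:", np.var(data, ddof=1))

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print("outliers:", outliers)
```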

As a final step, start exploring the various statistical tests listed below and understand how each of them is applied in real life (a short sketch follows the list):

  • F-test
  • ANOVA
  • Chi-Squared Test
  • T-Test
  • Z-Test
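
Most of these tests are available in scipy.stats; for instance, a minimal two-sample t-test on made-up samples looks like this:

```python
import numpy as np
from scipy import stats

group_a = np.array([5.1, 4.9, 5.4, 5.0, 5.2])   # e.g. a metric for variant A
group_b = np.array([5.8, 6.1, 5.9, 6.0, 5.7])   # e.g. the same metric for variant B

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# A small p-value suggests the two group means differ significantly
```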

Day 36 to Day 40: SQL for Data Analysis

Now it's time to learn some SQL. This is important because in most corporate use-cases the data is stored in a database, and knowing SQL will greatly help in querying the required data for analysis.

You can start by installing an open-source database such as MySQL; it comes with some default sample databases, so just play around with that data and learn SQL. It will be good if you can focus on learning the following (a short sketch follows the list):

  • Selecting data from a table
  • Joining data from different tables based on a key
  • Performing Group by and Aggregation functions on data
  • Use of case statements and filter conditions
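
Here is a self-contained sketch of those statements using Python's built-in sqlite3 module (the table and column names are made up; with MySQL you would connect to the server instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database for practice
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 200.0);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south');
""")

query = """
    SELECT c.region,
           COUNT(*)      AS num_orders,
           SUM(o.amount) AS total_amount,
           CASE WHEN SUM(o.amount) > 150 THEN 'high' ELSE 'low' END AS segment
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE o.amount > 50          -- filter condition
    GROUP BY c.region
"""
for row in conn.execute(query):
    print(row)
```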

Day 41 to Day 50: Exploratory Data Analysis (EDA)

In any Data Science project, about 80% of the time is spent on this activity, so it is best to learn this topic thoroughly. There isn't a specific set of functions or topics to cover for exploratory data analysis; the dataset and the use-case drive the analysis. Hence it is best to take a sample dataset from a competition hosted on Kaggle and learn to perform exploratory analysis on it.

Another way to learn exploratory data analysis is to write down your own questions about the dataset and try to answer them from the data. For example, with the popular Titanic dataset, try to answer questions such as which gender, age group, or deck had a higher probability of dying. Your ability to perform a thorough analysis will improve with time, so be patient and learn slowly and confidently.
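
A minimal sketch of this question-driven style of EDA, using the Titanic sample dataset bundled with Seaborn (the column names follow that copy of the data, which is fetched over the network on first use):

```python
import seaborn as sns

titanic = sns.load_dataset("titanic")

# Question: did survival differ by gender?
print(titanic.groupby("sex")["survived"].mean())

# Question: did survival differ by passenger class?
print(titanic.pivot_table(values="survived", index="class", aggfunc="mean"))

# Question: how does age relate to survival?
print(titanic.groupby("survived")["age"].describe())
```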

By now you have learned all the core skills required of a Data Scientist, and you are ready to learn algorithms.

What happened to Mathematics?

Yes, it is important to know Linear Algebra and Calculus, but I would prefer not to spend dedicated time learning mathematics concepts up front; instead, refer back and brush up on them as and when they are required. A high-school level of mathematics is sufficient. For example, when you are learning about Gradient Descent, that is the time to study the mathematics behind it. Trying to learn all the important concepts in mathematics first is very time consuming, and you would end up learning far more than what is actually required; learning as and when needed gives you just enough for the task at hand.
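
As an illustration of this just-in-time approach, the only calculus needed to follow a basic gradient-descent example is the derivative of the function being minimized; here is a tiny sketch for f(x) = (x - 3)^2:

```python
# Minimize f(x) = (x - 3)^2 with plain gradient descent
def gradient(x):
    return 2 * (x - 3)           # derivative of (x - 3)^2

x = 0.0                          # arbitrary starting point
learning_rate = 0.1
for step in range(50):
    x -= learning_rate * gradient(x)

print(round(x, 4))               # converges towards the minimum at x = 3
```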

Day 51 to Day 70: Supervised Learning and Project Implementation

Spend the first 10 days getting to know some of the key algorithms in supervised learning and the math behind them, then spend the next 10 days learning by developing a project. Some of the algorithms that should be covered in this period are:

  • Linear Regression and Logistic Regression
  • Decision Tree / Random Forest
  • Support Vector Machine (SVM)

In the first 10 days, the focus should be on understanding the theory behind the algorithms you have chosen. Then spend some time understanding the scenarios where each algorithm is more suitable than the others; for example, Decision Trees work best when the dataset has many categorical attributes.

Then pick a solved example on Kaggle; you will find plenty of them. Try to re-execute them, but carefully understand each and every line of the code and the reason behind it. By now you will have good theoretical knowledge as well as working knowledge from the solved examples.

As a final step, pick a project and implement a supervised learning algorithm end to end: data collection, exploratory analysis, feature engineering, model building, and model validation. There will definitely be a lot of questions and issues along the way, but by the time you complete the project you will have a very good understanding of the algorithm and the methodology.
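
A minimal end-to-end sketch of that workflow with scikit-learn, using its bundled breast-cancer dataset so it runs without any downloads (a real project would add proper exploratory analysis and feature engineering):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)             # "data collection"
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)               # hold out data for validation

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)                             # model building

predictions = model.predict(X_test)                     # model validation
print("accuracy:", accuracy_score(y_test, predictions))
```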

Day 71 to Day 90: Unsupervised Learning and Project Implementation

Now it's time to focus on unsupervised learning. Similar to the approach used for supervised learning, spend the initial days understanding the concepts behind the algorithms you have chosen, and then learn by implementing a project.

The algorithms that should be covered here are (a short clustering sketch follows the list):

  • Clustering Algorithm — Used to identify Clusters in the dataset
  • Association Analysis — Used to identify patterns in the data
  • Principal Components Analysis — Used to reduce the number of attributes
  • Recommendation System — Used to identify similar users/products and to make recommendations
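
As a small example of the first item, here is a minimal k-means clustering sketch with scikit-learn on synthetic data (the number of clusters and the data are made up for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic dataset with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
print("cluster centers:\n", kmeans.cluster_centers_)
```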

In the initial days, the focus should be on understanding each of the above algorithms and techniques, their purpose, and the scenarios where they can be used. For example, principal components analysis is generally used for dimensionality reduction when the dataset you are working with has a very large number of columns and you want to reduce them while still retaining most of the information. Recommendation systems are popular in e-commerce, where items a customer is likely to be interested in can be recommended, based on their purchase patterns, to increase sales.

When you are comfortable with the theory and the scenarios where these techniques apply, it is time to pick a solved example and learn by reverse engineering it, that is, understanding each and every line of code and re-executing it.

As a final step, pick a use-case and implement it based on your learnings so far. On completing the project/use-case you will have learned a lot, and you will have gained a much deeper understanding of these algorithms that will stay with you.

Day 91 to Day 100: Natural Language Processing Basics

Use this time to focus on analysis and use-cases for unstructured/text data. A few things worth spending time on here include (a short sketch follows the list):

  • Learn to use APIs to fetch data from public sources
  • Perform a few basic sentiment analyses: data from the Twitter API can be used to extract tweets for a particular hashtag, and then the sentiment and the emotions behind those tweets can be computed
  • Topic modelling: this is useful when there are a large number of documents and you want to group them into different categories
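
A minimal sentiment-analysis sketch using NLTK's VADER analyzer on a couple of made-up example tweets (pulling real tweets would additionally require Twitter API credentials, which are omitted here):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time download of the VADER lexicon

tweets = [
    "Loving the new release, the dashboard is so much faster!",   # made-up examples
    "Terrible update, the app keeps crashing on my phone.",
]

analyzer = SentimentIntensityAnalyzer()
for tweet in tweets:
    scores = analyzer.polarity_scores(tweet)
    print(f"{scores['compound']:+.2f}  {tweet}")   # compound > 0 means positive
```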

That's it! You have now covered all the important concepts and are ready to apply for Data Science jobs. I have started this journey of learning Data Science in 100 days on my YouTube channel; if you are interested, please join me and start your journey to learn Data Science here.

Start your journey here

Frequently Asked Questions (FAQs)

Can anyone become a Data Scientist in 100 Days?

Yes. Just as anyone can learn to swim in a few days, anyone can learn Data Science in 100 days or even less. But just as in swimming one becomes an elite or Olympic swimmer only through hard work and continuous practice, the same goes for Data Science: with practice and hard work you can become an expert.

If I follow this journey, how much will I have learned?

By the end of this journey, you will have enough knowledge to work on a typical Data Science project. You will also have broken through the learning barrier, and with minimal effort and support you will be able to continue learning advanced topics in Data Science.

Final Message before Sign-Off

At first, things might look too complicated. Don't get overwhelmed; just take one step at a time and continue your learning journey. It might take some time, but you will definitely reach your destination.

Translated from: https://medium.com/swlh/a-complete-guide-to-learn-data-science-in-100-days-8c6557154102
