# A Free Learning Path for Becoming and Advancing as a Data Analyst

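• This is a heavily technical curriculum covering big data processing, programming, and statistics. It is not about Excel sheets, PowerPoint, or business-consulting-style market analysis. If your goal is to work as a general Business Analyst or BI consultant, you do not need this curriculum.
• It targets the processing and analysis of big data (1 TB+). If your data is just a few Excel sheets, skip this.
• All course material is in English, and you may need a VPN to get around the firewall (at your own risk).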

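• It is all free!
• It helps complete beginners get started quickly (it teaches basic statistics and programming; no background is required, only common sense).
• It teaches the key ideas, methods, and tools of the whole big data analysis process, from data collection and analysis through to final visualization.
• Time required: 310+ hours.
• Newbie: That long? Isn't that too slow?
• Answer: What? You have no background at all, so how fast did you expect it to be? You studied English for 9 years and still needed 3 months at New Oriental to prepare for the GRE.
• Newbie: I have already studied some of this.
• Answer: Then skip those parts, newbie.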

• exploratory and predictive statistics
• basic Python
• advanced computer program design
• an introduction to algorithms
• R for statistical analysis
• practical machine learning techniques
• Unix
• data visualization best practices

--------------------------------------------------------------------------------

Exploratory and Predictive Statistics - Introductory Statistics

1. Statistics - Udemy (12 hours)

Optional: a complete introductory course sequence (strongly recommended if you have the time)
2.1 Introduction to Statistics: Descriptive Statistics (50 hours)
The focus of Stat2.1x is on descriptive statistics. The goal of descriptive statistics is to summarize and present numerical information in a manner that is illuminating and useful. The course will cover graphical as well as numerical summaries of data, starting with a single variable and progressing to the relation between two variables. Methods will be illustrated with data from a variety of areas in the sciences and humanities.
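As a rough illustration of the numerical summaries described above, the sketch below summarizes a single variable and then measures the relation between two variables, using only Python's standard library. The data values and the helper name `pearson_r` are made up for illustration and do not come from the course.

```python
import statistics

# Hypothetical data: hours studied and exam scores for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 70, 74, 78]


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Single-variable summaries.
summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),  # sample standard deviation
}

# Relation between two variables.
r = pearson_r(hours, scores)

print(summary)
print(f"correlation between hours and scores: {r:.3f}")
```

A correlation close to 1 here only says that the two made-up columns rise together; it says nothing about cause.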

2.2 Introduction to Statistics: Probability (50 hours)
The focus of Stat2.2x is on probability theory: exactly what is a random sample, and how does randomness work? If you buy 10 lottery tickets instead of 1, does your chance of winning go up by a factor of 10? What is the law of averages? How can polls make accurate predictions based on data from small fractions of the population? What should you expect to happen "just by chance"? These are some of the questions we will address in the course.
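The lottery question above can be checked with a little arithmetic. The sketch below assumes, purely for illustration, that each ticket wins independently with the same probability `p` (for example, tickets bought for separate draws). The value of `p` and the function name are hypothetical.

```python
def chance_of_at_least_one_win(p, n):
    """Probability that at least one of n independent tickets wins."""
    return 1 - (1 - p) ** n


p = 0.01  # assumed chance that a single ticket wins
one_ticket = chance_of_at_least_one_win(p, 1)
ten_tickets = chance_of_at_least_one_win(p, 10)

print(f"1 ticket:   {one_ticket:.4f}")
print(f"10 tickets: {ten_tickets:.4f}")
print(f"ratio:      {ten_tickets / one_ticket:.2f}")
```

Under this independence assumption the ratio comes out slightly below 10, because more than one ticket can win at once. If instead the ten tickets carry different numbers in a single draw with one winning number, the chance is exactly ten times larger.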

2.3 Introduction to Statistics: Inference (50 hours)
The focus of Stat2.3x is on statistical inference: how to make valid conclusions based on data from random samples. At the heart of the main problem addressed by the course will be a population (which you can imagine for now as a set of people) connected with which there is a numerical quantity of interest (which you can imagine for now as the average number of MOOCs the people have taken).
We will discuss good ways to select a subset of that population (yes, at random); how to estimate the numerical quantity of interest, based on what you see in your sample; and ways to test hypotheses about numerical or probabilistic aspects of the problem.
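A minimal sketch of that idea: draw a random sample from a population, estimate the population average from the sample, and attach a rough 95% interval. The population here is simulated, and the multiplier 1.96 comes from the usual normal approximation; neither is taken from the course itself.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulated population: number of MOOCs taken by 10,000 people.
population = [random.randint(0, 12) for _ in range(10_000)]

# Select the subset at random, without replacement.
sample = random.sample(population, 200)

# Point estimate and its standard error.
estimate = statistics.mean(sample)
std_error = statistics.stdev(sample) / len(sample) ** 0.5

# Approximate 95% confidence interval (normal approximation).
low = estimate - 1.96 * std_error
high = estimate + 1.96 * std_error

print(f"sample mean: {estimate:.2f}")
print(f"approx. 95% interval: ({low:.2f}, {high:.2f})")
print(f"population mean: {statistics.mean(population):.2f}")
```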

Basic Python

1. Intro to Python (3 - 5 hours), for absolute beginners

This is a great place to start if you have no programming background at all or want to brush up. If you have programming experience but have never seen Python, you may still want to skim through these lessons. You’ll learn basic programming techniques, such as loops, lists and dictionaries, functions, classes, and file input/output.
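To give a feel for those basics, here is a small sketch that touches each of them once: a loop, a list, a dictionary, functions, a class, and file input/output. The class and function names are invented for this example and are not part of the course.

```python
import os
import tempfile


class Inventory:
    """Tracks item counts in a dictionary."""

    def __init__(self):
        self.counts = {}

    def add(self, item, quantity=1):
        self.counts[item] = self.counts.get(item, 0) + quantity

    def total(self):
        return sum(self.counts.values())


def save_lines(path, lines):
    """Write each string in lines to path, one per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


def load_lines(path):
    """Read path back into a list of strings without newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


inventory = Inventory()
for item in ["apple", "pear", "apple"]:  # loop over a list
    inventory.add(item)

# Write the counts to a temporary file and read them back.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "inventory.txt")
    save_lines(path, [f"{name},{count}" for name, count in inventory.counts.items()])
    restored = load_lines(path)

print(inventory.counts)
print(restored)
```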

1.1 Bonus: Complete the Python Statistics Problem Set (0.5 hours)

2. Videos and Problem Sets of Design of Computer Programs (20 - 30 hours)
This class will teach you to write elegant and efficient code. This will be essential in order to manipulate data effectively and write code that is reusable and easy for others to understand. You will also learn about some of the more sophisticated Python techniques, such as generator functions and list comprehensions.
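For readers who have not met the two techniques named above, here is a short sketch of each. The function name `read_in_chunks` is hypothetical; the example is not taken from the class.

```python
def read_in_chunks(values, size):
    """Generator function: yield successive chunks of `size` items."""
    chunk = []
    for value in values:
        chunk.append(value)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # yield any leftover items
        yield chunk


# List comprehension: squares of the even numbers below 10.
even_squares = [n * n for n in range(10) if n % 2 == 0]

# The generator produces chunks lazily, one at a time.
chunks = list(read_in_chunks(range(7), 3))

print(even_squares)
print(chunks)
```

The generator never holds more than one chunk in memory, which is why the pattern is handy when the input is too large to load at once.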

Optional: a complete introductory course in computer programming and Python

2. Introduction to Computer Science and Programming Using Python (135 hours)
This course focuses on breadth rather than depth. The goal is to provide students with a brief introduction to many topics so they will have an idea of what is possible when they need to think about how to use computation to accomplish some goal later in their career.
• A notion of computation
• The Python programming language
• Some simple algorithms
• Testing and debugging
• An informal introduction to algorithmic complexity
• Data structures
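Several of the topics above (simple algorithms, testing, and algorithmic complexity) can be seen together in one small sketch. It counts how many steps a linear search and a binary search need on the same sorted list. The step counters are added only for illustration.

```python
def linear_search(items, target):
    """Return (index, steps); checks items one by one: O(n)."""
    steps = 0
    for index, item in enumerate(items):
        steps += 1
        if item == target:
            return index, steps
    return -1, steps


def binary_search(items, target):
    """Return (index, steps) for a sorted list: O(log n)."""
    low, high = 0, len(items) - 1
    steps = 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if items[mid] == target:
            return mid, steps
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1, steps


data = list(range(1_000_000))  # already sorted
target = 999_999

linear_index, linear_steps = linear_search(data, target)
binary_index, binary_steps = binary_search(data, target)

# Simple tests: both searches must agree on the answer.
assert linear_index == binary_index == target
assert binary_search(data, -5)[0] == -1

print(f"linear search steps: {linear_steps}")
print(f"binary search steps: {binary_steps}")
```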

SQL and JSON

1. Introduction to Database (10 hours; you only need the basics in the early sections)
Watch the videos on Relational Databases, JSON Data, Relational Algebra, and SQL, and complete the exercises for those sections.
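To show how JSON data and SQL fit together, the sketch below parses a JSON string, loads it into an in-memory SQLite table, and runs one aggregate query. The table name, column names, and records are all hypothetical, and SQLite is used here only because it ships with Python; the course may use a different database.

```python
import json
import sqlite3

# Hypothetical JSON data, as it might arrive from a web API.
raw = (
    '[{"name": "Ann", "dept": "ops", "salary": 50}, '
    '{"name": "Bob", "dept": "ops", "salary": 70}, '
    '{"name": "Cid", "dept": "dev", "salary": 90}]'
)
records = json.loads(raw)

# Load the records into an in-memory relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (:name, :dept, :salary)", records
)

# SQL query: average salary per department.
rows = conn.execute(
    "SELECT dept, AVG(salary) FROM staff GROUP BY dept ORDER BY dept"
).fetchall()
conn.close()

# Convert the result back to JSON.
result = json.dumps([{"dept": d, "avg_salary": a} for d, a in rows])
print(result)
```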

Algorithm Basics

1. Introduction to Algorithms (SMA 5503) (15 hours; you only need the basics in the early sections)
This course teaches techniques for the design and analysis of efficient algorithms, emphasizing methods useful in practice. Topics covered include: sorting; search trees, heaps, and hashing; divide-and-conquer; dynamic programming; amortized analysis; graph algorithms; shortest paths; network flow; computational geometry; number-theoretic algorithms; polynomial and matrix calculations; caching; and parallel computing.
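Of the topics listed, divide-and-conquer sorting is probably the easiest to try at home. The following is a plain merge sort written as a sketch; it is one standard way to do it, not necessarily the version presented in the lectures.

```python
def merge_sort(items):
    """Sort a list with divide-and-conquer; returns a new list."""
    if len(items) <= 1:
        return list(items)
    middle = len(items) // 2
    left = merge_sort(items[:middle])    # divide: sort each half
    right = merge_sort(items[middle:])
    return merge(left, right)            # combine the sorted halves


def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


unsorted = [5, 2, 9, 1, 5, 6]
sorted_items = merge_sort(unsorted)
print(sorted_items)
```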

1. Unix Basics [4:20] (1 hour)

Watch:
• Lecture 3: Linux and Server-Side Javascript
• Lecture 4a: The Linux Command Line

2. Try Git (1 hour)
Git is a version control system. It enables programmers to work together on large projects without overwriting each other’s work. It also saves old versions of code in case you make a mistake and need to revert. A Git repository can also serve as a portfolio of your programming and analysis projects to show potential employers.

Data Visualization Best Practices

1. Introduction to Infographics and Data Visualization (5 hours)
These videos are enjoyable and they make a nice break from the more technically challenging courses in this path. However, while the material in the course may be easy to understand, data visualization is a deeper topic than it seems. These examples should help illuminate what makes a good visualization and give ideas for some more creative ways to display information. You will also learn general principles of graphic design and visual perception.

Optional: Information Dashboard Design: The Effective Visual Communication of Data by Stephen Few, a classic book on dashboard design

Data Analysis with Python

Python has many libraries for statistics and data analysis. Commonly used ones include Pandas, SciPy, NumPy, and scikit-learn.
1. Introduction to Pandas (1 hour)
2. Explore the SciPy and NumPy libraries (5 hours)

Machine learning aims to extract knowledge from data, relying on fundamental concepts in computer science, statistics, probability and optimization. Learning algorithms enable a wide range of applications, from everyday tasks such as product recommendations and spam filtering to bleeding edge applications like self-driving cars and personalized medicine. In the age of ‘Big Data,’ with datasets rapidly growing in size and complexity and cloud computing becoming more pervasive, machine learning techniques are fast becoming a core component of large-scale data processing pipelines.

1. Introduction to Big Data with Apache Spark (30 hours, with Python)
This course teaches students how to use PySpark (part of Apache Spark) to put their data to work for decision support and to build data-intensive products and services, such as recommendation, prediction, and diagnostic systems.
• Learn how to use Apache Spark to perform data analysis
• How to use parallel programming to explore data sets
• Apply Log Mining, Textual Entity Recognition and Collaborative Filtering to real world data questions

2. Scalable Machine Learning (35 hours, with Python and Spark)
This course introduces the underlying statistical and algorithmic principles required to develop scalable real-world machine learning pipelines. We present an integrated view of data processing by highlighting the various components of these pipelines, including exploratory data analysis, feature extraction, supervised learning, and model evaluation. You will gain hands-on experience applying these principles using Apache Spark, a cluster computing system well-suited for large-scale machine learning tasks. You will implement scalable algorithms for fundamental statistical models (linear regression, logistic regression, matrix factorization, principal component analysis) while tackling key problems from various domains: online advertising, personalized recommendation, and cognitive neuroscience.
• The underlying statistical and algorithmic principles required to develop scalable real-world machine learning pipelines
• Exploratory data analysis, feature extraction, supervised learning, and model evaluation
• Application of these principles using Apache Spark
• How to implement scalable algorithms for fundamental statistical models
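To make "supervised learning and model evaluation" a bit more concrete, here is a single-machine sketch of the simplest model on that list, linear regression with one feature, fitted by ordinary least squares. It deliberately does not use Spark, the data is made up, and the function names are hypothetical; the course's own implementations are distributed and will look different.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Model evaluation: average squared prediction error."""
    errors = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Hypothetical training data lying close to y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(train_x, train_y)
mse = mean_squared_error(train_x, train_y, slope, intercept)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, mse={mse:.4f}")
```

Note that the error here is measured on the training data itself, which flatters the model; a fair evaluation would use data held out from fitting.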

Optional: Statistical Learning (30 hours, with R)
This is an introductory-level course in supervised learning, with a focus on regression and classification methods. The syllabus includes: linear and polynomial regression, logistic regression and linear discriminant analysis; cross-validation and the bootstrap, model selection and regularization methods (ridge and lasso); nonlinear models, splines and generalized additive models; tree-based methods, random forests and boosting; support-vector machines. Some unsupervised learning methods are discussed: principal components and clustering (k-means and hierarchical).
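Cross-validation, mentioned in the syllabus above, is easy to picture once you see how the data is split. The course itself uses R; the sketch below uses Python only to stay consistent with the rest of this path, and the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first few folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(f"train={train} test={test}")
```

Each observation lands in the test set exactly once. In practice the rows are usually shuffled before splitting, which this sketch leaves out.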

1. Try R (5 hours)
R is a tool for statistics and data modeling. The R programming language is elegant, versatile, and has a highly expressive syntax designed around working with data. R is more than that, though — it also includes extremely powerful graphics capabilities. If you want to easily manipulate your data and present it in compelling ways, R is the tool for you.
This course will teach you the basics of R: data types, summary statistics, functions, and control structures.

2. The Analytics Edge (100 hours)
• An applied understanding of many different analytics methods, including linear regression, logistic regression, CART, clustering, and data visualization
• How to implement all of these methods in R
• An applied understanding of mathematical optimization and how to solve optimization models in spreadsheet software