
Data Mining Knowledge Framework


Starting from a Diagram

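Let's start with a diagram I came across on moozhi. It lays out in detail the knowledge that the field of data mining requires, divided into several modules, each of which covers a number of topics.
The article where I originally found it is 在MOOC上的数据科学家养成计划 路线图 Roadmap ("A Data Scientist Training Plan on MOOCs: Roadmap"). That article explains how to study data mining through MOOC video courses.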

Knowledge framework diagram

Unlike the article cited above, this post approaches these topics from the perspective of the literature and books.

Fundamentals

1. Matrices & Linear Algebra Fundamentals
2. Hash Functions, Binary Tree, O(n) (time complexity)
3. Relational Algebra, DB Basics
4. Inner, Outer, Cross, Theta Join (database joins)
5. CAP Theorem
6. Tabular Data
7. Entropy
8. Data Frames & Series
9. Sharding
10. OLAP
11. Multidimensional Data Model 多维数据模型
12. ETL
13. Reporting Vs BI Vs Analytics 
14. JSON & XML
15. NoSQL
16. Regex 正则表达式
17. Vendor Landscape
18. Env Setup
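
As a small illustration of the Entropy item in the list above, here is a minimal sketch of Shannon entropy computed over a list of category labels. The function name `shannon_entropy` and the sample labels are illustrative assumptions, not something taken from the diagram.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Return the Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("values must not be empty")
    # H = sum(p * log2(1 / p)) over the observed categories, with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical sample data: two equally likely labels, then a single label.
    print(shannon_entropy(["a", "a", "b", "b"]))
    print(shannon_entropy(["a", "a", "a", "a"]))
```

Two equally frequent labels carry one bit of uncertainty, while a column with only one label carries none.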

Statistics

1. Pick a Dataset (UCI Repo)
2. Descriptive Statistics (mean, median, range, SD, Var)
3. Exploratory Data Analysis
4. Histograms
5. Percentiles & Outliers
6. Probability Theory
7. Bayes Theorem
8. Random Variables
9. Cumul Dist Fn (CDF)
10. Continuous Distributions (Normal, Poisson, Gaussian)
11. Skewness
12. ANOVA
13. Prob Den Fn(PDF)
14. Central Limit Theorem
15. Monte Carlo Method
16. Hypothesis Testing
17. p-Value
18. Chi2 Test
19. Estimation
20. Confid Int(CI)
21. MLE
22. Kernel Density Estimate
23. Regression
24. Covariance
25. Correlation
26. Pearson Coeff
27. Causation
28. Least2 Fit (least squares)
29. Euclidean Distance
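
The descriptive statistics named in item 2 of the list above (mean, median, range, SD, Var) can be sketched with Python's standard `statistics` module. The helper name `describe` and the sample values are illustrative assumptions. The sketch uses the sample (n - 1) versions of variance and standard deviation.

```python
import statistics


def describe(data):
    """Return basic descriptive statistics for a sequence of at least two numbers."""
    values = sorted(data)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": values[-1] - values[0],           # max minus min
        "variance": statistics.variance(values),   # sample variance (divides by n - 1)
        "sd": statistics.stdev(values),            # sample standard deviation
    }


if __name__ == "__main__":
    # Hypothetical sample; any numeric column from a UCI dataset would work the same way.
    sample = [2, 4, 4, 4, 5, 5, 7, 9]
    for name, value in describe(sample).items():
        print(f"{name}: {value}")
```

For the population versions, `statistics.pvariance` and `statistics.pstdev` could be swapped in.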

Programming

1. Python Basics
2. Working in Excel
3. R Setup, RStudio
4. R Basics
5. IBM SPSS
6. Rapid Miner
7. Variables
8. Vectors
9. Matrices
10. Arrays
11. Factors
12. Lists
13. Data Frames
14. Reading CSV Data
15. Reading Raw Data
16. Subsetting Data
17. Manipulate Data Frames
18. Functions
19. Factor Analysis
20. Install Pkgs

Machine Learning

1. What is ML?
2. Numerical Var
3. Categorical Var
4. Supervised Learning
5. Unsupervised Learning
6. Concepts, Inputs & Attributes
7. Training & Test Data
8. Classifier
9. Prediction
10. Lift
11. Overfitting
12. Bias & Variance
13. Trees & Classification
14. Classification Rate
15. Decision Trees
16. Boosting
17. Naive Bayes Classifiers
18. K-Nearest Neighbour
19. Logistic Regression
20. Ranking
21. Linear Regression
22. Perceptron
23. Hierarchical Clustering
24. K-means Clustering
25. Neural Networks
26. Sentiment Analysis
27. Collaborative Filtering
28. Tagging

Text Mining / NLP

1. Corpus
2. Named Entity Recognition
3. Text Analysis
4. UIMA
5. Term Document Matrix
6. Term Document Matrix
7. Term Frequency & Weight
8. Support Vector Machines
9. Association Rules
10. Market Basket Analysis
11. Feature Extraction
12. Using Mahout
13. Using Weka
14. Using NLTK
15. Classify Text
16. Vocabulary Mapping

Visualization

1. Data Exploration in R(Hist, Boxplot etc)
2. Uni, Bi & Multivariate Viz
3. ggplot2
4. Histogram & Pie(Uni)
5. Tree & Tree Map
6. Scatter Plot (Bi)
7. Line Charts (Bi)
8. Spatial Charts
9. Survey Plot
10. Timeline
11. Decision Tree
12. D3.js
13. infoVis
14. IBM ManyEyes
15. Tableau

Big Data

1. Map Reduce Fundamentals
2. Hadoop Components
3. HDFS
4. Data Replication Principles
5. Setup Hadoop (IBM/Cloudera/HortonWorks)
6. Name & Data Nodes
7. Job & Task Tracker
8. M/R Programming
9. Sqoop: Loading Data in HDFS
10. Flume, Scribe: For Unstruct Data
11. SQL with Pig
12. DWH with Hive
13. Scribe, Chukwa For Weblog
14. Using Mahout
15. Zookeeper, Avro
16. Storm: Hadoop Realtime
17. RHadoop, RHIPE
18. rmr
19. Cassandra
20. MongoDB, Neo4j

Data Ingestion

1. Summary of Data Formats
2. Data Discovery
3. Data Sources & Acquisition
4. Data Integration
5. Data Fusion
6. Transformation & Enrichment
7. Data Survey
8. Google OpenRefine
9. How much Data
10. Using ETL

Data Munging

1. Dimensionality & Numerosity Reduction
2. Normalization
3. Data Scrubbing
4. Handling Missing Values
5. Unbiased Estimators
6. Binning Sparse Values
7. Feature Extraction
8. Denoising
9. Sampling
10. Stratified Sampling
11. Principal Component Analysis
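
To make the Normalization item in the list above concrete, here is a minimal sketch of min-max scaling, which is only one common reading of "normalization" (z-score standardization is another). The function name `min_max_normalize` and the sample values are illustrative assumptions.

```python
def min_max_normalize(values):
    """Rescale a non-empty sequence of numbers linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Every value is identical, so there is no spread to rescale; map all to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    # Hypothetical feature column.
    print(min_max_normalize([10, 20, 30]))
```

Mapping a constant column to 0.0 is a design choice made for this sketch; some pipelines drop such columns instead.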

Toolbox

1. MS Excel w/ Analysis Toolpak
2. Java, Python
3. R, RStudio, Rattle
4. Weka, Knime, RapidMiner
5. Hadoop Dist of Choice
6. Spark, Storm
7. Flume, Scribe, Chukwa
8. Nutch, Talend, Scraperwiki
9. Webscraper, Flume, Sqoop
10. tm, RWeka, NLTK
11. RHIPE
12. D3.js, ggplot2, Shiny
13. IBM Languageware
14. Cassandra, MongoDB