Topic | Content | Key points | Reference | |
DB/OLTP & DW/OLAP | Database/OLTP basics | The relational model, SQL, index/secondary index, inner join/left join/right join/full join, transaction/ACID | Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems. | |
Database internals & implementation | Architecture, memory management, storage/B+ tree, query parsing/optimization/execution, hash join/sort-merge join | |||
Distributed and parallel database | Sharding, database proxy | |||
Data warehouse/OLAP | Materialized views, ETL, column-oriented storage, reporting, BI tools | |||
Basic programming | Programming language | Java, Python (Pandas/NumPy/SciPy/scikit-learn), SQL, Functional programming, R/SAS/SPSS | Wes McKinney. Python for Data Analysis: Agile Tools for Real World Data. | |
OS | Linux | |||
DB & DW systems | MySQL/Hive/Impala | |||
Text formats and processing | JSON/XML, regex | |||
Tools | Git/SVN, Maven | |||
Distributed system & Hadoop ecosystem & NoSQL | Distributed system principles and theory | CAP theorem, RPC (Protocol Buffers/Thrift/Avro), ZooKeeper, metadata management (HCatalog) | ||
Distributed storage & computing framework & resource management | Hadoop/HDFS/MapReduce/YARN | Tom White. Hadoop: The Definitive Guide; Donald Miner and Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. | ||
SQL on Hadoop | Data (log) acquisition/integration/fusion, normalization, feature extraction | Sqoop, Flume/Scribe/Chukwa, SerDe | Edward Capriolo, Dean Wampler, Jason Rutherglen. Programming Hive. | |
Query & In-database analytics | Hive, Impala, UDF/UDAF | |||
Large-scale data mining & machine learning framework | Spark/MLbase, MR/Mahout | |||
Stream processing | Storm | |||
NoSQL | HBase/Cassandra (column oriented database) | Lars George. HBase: The Definitive Guide. | ||
MongoDB (document database) | ||||
Neo4j (graph database) | ||||
Redis (cache) | ||||
Data mining & Machine learning | DM & ML basics | Numerical/categorical variables, training/test data, overfitting, bias/variance, precision/recall, tagging | ||
Statistics | Data exploration (mean, median/range/standard deviation/variance/histogram), probability distributions (normal/Gaussian, Poisson), covariance, correlation coefficient, distance and similarity computation, Bayes' theorem, Monte Carlo method, hypothesis testing | |||
Supervised learning | Classification, boosting, prediction, regression analysis | Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. | ||
Unsupervised learning | Clustering, deep learning | |||
Collaborative filtering | Item-based CF, user-based CF | |||
Algorithm | Classifier | Decision trees, KNN (k-nearest neighbors), SVM (support vector machines), SVD (singular value decomposition), naïve Bayes classifiers, neural networks | ||
Regression | Linear regression, logistic regression, ranking, perceptron | |||
Clustering | Hierarchical clustering, k-means clustering, spectral clustering | |||
Dimensionality reduction | PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), MDS (Multidimensional Scaling) | |||
Text mining & Information retrieval | Corpus, term-document matrix, term frequency & weighting, association rules, market basket analysis, vocabulary mapping, sentiment analysis, tagging, PageRank, VSM (Vector Space Model), inverted index | Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. |
Knowledge Map for Big Data Engineers