A Knowledge Map for Big Data Engineers

What do you actually need to know to work on big data in an enterprise?

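I think this has to be viewed from two angles: technology and business. The technical side mainly involves probability and mathematical statistics, computer systems, algorithms, and programming; the business side varies with each company's line of business. Big data engineers need to learn to apply data mining methods, with the help of computer systems and programming tools, to solve real problems. Only then can they mine drivers of business growth out of massive data sets and create more value for the company in a fiercely competitive market.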

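Business varies from company to company, but the technical points are shared. Here I briefly summarize the technical knowledge that big data engineers need to master, mainly the fundamentals of databases, data warehouses, programming, distributed systems, the Hadoop ecosystem, data mining, and machine learning. Of course, what I list here is what a whole team should possess collectively; each person will emphasize different areas depending on their role in the team. I offer this only as a starting point for discussion, and comments are welcome.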

Topic

Content

Key points

Reference

DB/OLTP & DW/OLAP

Database/OLTP basics

The relational model, SQL, index/secondary index, inner join/left join/right join/full join, transaction/ACID

Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems.

Database internals & implementation

Architecture, memory management, storage/B+ tree, query parsing/optimization/execution, hash join/sort-merge join

Distributed and parallel database

Sharding, database proxy

Data warehouse/OLAP

Materialized views, ETL, column-oriented storage, reporting, BI tools

Basic programming

Programming language

Java, Python (NumPy/scikit-learn), SQL

 

OS

Linux

DB & DW system

MySQL/Hive/Impala

Text formats and processing

JSON/XML, regex

Tool

Git/SVN, Maven

Distributed system & Hadoop ecosystem & NoSQL

Distributed system principles and theory

CAP theorem, RPC (Protocol Buffers/Thrift/Avro), ZooKeeper, metadata management (HCatalog)

 

Distributed storage & computing framework & resource management

Hadoop/HDFS/MapReduce/YARN

Tom White. Hadoop: The Definitive Guide.

Donald Miner, Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems.

SQL on Hadoop

Data (log) acquisition/integration/fusion, normalization, feature extraction

Sqoop, Flume/Scribe/Chukwa, SerDe

Edward Capriolo, Dean Wampler, Jason Rutherglen. Programming Hive.

Query & In-database analytics

Hive, Impala, UDF/UDAF

Large scale data mining & machine learning framework

Spark/MLbase, Mahout

 

Stream processing

Storm

 

NoSQL

HBase/Cassandra (column oriented database)

Lars George. HBase: The Definitive Guide.

MongoDB (document database)

Neo4j (graph database)

Redis (cache)

Data mining & Machine learning

DM & ML basics

Numerical/categorical variables, training/test data, overfitting, bias/variance, precision/recall, tagging

 

Statistics

Data exploration (mean/median/range/standard deviation/variance/histogram), probability distributions (Normal/Gaussian, Poisson), covariance, correlation coefficient, distance and similarity computation, Bayes' theorem, Monte Carlo method, hypothesis testing

 

Supervised learning

Classifier, boosting, prediction, regression analysis

Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques.

 

Unsupervised learning

Clustering

Collaborative filtering

Item-based CF, user-based CF

 

Algorithm

Classifier

Decision trees, KNN (k-nearest neighbors), SVM (support vector machines), SVD (singular value decomposition), naïve Bayes classifiers, neural networks

Regression

Linear regression, logistic regression, ranking, perceptron

Clustering

Hierarchical clustering, k-means clustering, spectral clustering

Dimensionality reduction

PCA (principal component analysis), LDA (linear discriminant analysis), MDS (multidimensional scaling)

Text mining

Corpus, term-document matrix, term frequency & weighting, association rules, market basket analysis, vocabulary mapping, sentiment analysis, tagging

Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce.
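
To make one of the key points in the table concrete, here is a minimal sketch of precision and recall (listed under DM & ML basics) in plain Python. The function name `precision_recall` and the sample labels are made up for illustration; in practice a library such as scikit-learn would normally be used instead.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for a single positive class.

    precision = TP / (TP + FP): of everything predicted positive, how much was right.
    recall    = TP / (TP + FN): of everything actually positive, how much was found.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

With these sample labels there are 3 true positives, 1 false positive, and 1 false negative, so both values come out to 0.75.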
