A Knowledge Map for Big Data Engineers

http://yanbohappy.sinaapp.com/?cat=32


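What do you actually need to know to work on big data in an enterprise? I think the question has two sides: technology and business. On the technology side, the main areas are probability and mathematical statistics, computer systems, algorithms, and programming. The business side varies with each company's line of business. Big data engineers need to learn to apply data mining methods, with the help of computer systems and programming tools, to solve real problems. Only then can they dig the drivers of business growth out of massive data and create more value for the company in a fiercely competitive market.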

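Business differs from company to company, but the technical points are shared. Here I briefly summarize the technical knowledge that big data engineers need to master. It mainly covers the fundamentals of databases, data warehouses, programming, distributed systems, the Hadoop ecosystem, data mining, and machine learning. Of course, what I list here is what the members of a team should have collectively; each person's emphasis will differ with their role on the team. This is only a starting point meant to draw out better ideas, and comments are welcome.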

| Topic | Content | Key points | Reference |
| --- | --- | --- | --- |
| DB/OLTP & DW/OLAP | Database/OLTP basics | The relational model, SQL, index/secondary index, inner join/left join/right join/full join, transaction/ACID | Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. |
| | Database internals & implementation | Architecture, memory management, storage/B+ tree, query parsing/optimization/execution, hash join/sort-merge join | |
| | Distributed and parallel databases | Sharding, database proxy | |
| | Data warehouse/OLAP | Materialized views, ETL, column-oriented storage, reporting, BI tools | |
| Basic programming | Programming languages | Java, Python (Pandas/NumPy/SciPy/scikit-learn), SQL, functional programming, R/SAS/SPSS | Wes McKinney. Python for Data Analysis: Agile Tools for Real World Data. |
| | OS | Linux | |
| | DB & DW systems | MySQL/Hive/Impala | |
| | Text formats and processing | JSON/XML, regex | |
| | Tools | Git/SVN, Maven | |
| Distributed system & Hadoop ecosystem & NoSQL | Distributed system principles and theory | CAP theorem, RPC (Protocol Buffer/Thrift/Avro), ZooKeeper, metadata management (HCatalog) | |
| | Distributed storage & computing framework & resource management | Hadoop/HDFS/MapReduce/YARN | Tom White. Hadoop: The Definitive Guide.; Donald Miner, Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. |
| | SQL on Hadoop: data (log) acquisition/integration/fusion, normalization, feature extraction | Sqoop, Flume/Scribe/Chukwa, SerDe | Edward Capriolo, Dean Wampler, Jason Rutherglen. Programming Hive. |
| | SQL on Hadoop: query & in-database analytics | Hive, Impala, UDF/UDAF | |
| | Large-scale data mining & machine learning frameworks | Spark/MLbase, MR/Mahout | |
| | Stream processing | Storm | |
| | NoSQL | HBase/Cassandra (column-oriented database) | Lars George. HBase: The Definitive Guide. |
| | | MongoDB (document database) | |
| | | Neo4j (graph database) | |
| | | Redis (cache) | |
| Data mining & Machine learning | DM & ML basics | Numerical/categorical variables, training/test data, overfitting, bias/variance, precision/recall, tagging | |
| | Statistics | Data exploration (mean, median/range/standard deviation/variance/histogram), probability distributions (normal/Gaussian, Poisson), covariance, correlation coefficient, distance and similarity computing, Bayes' theorem, Monte Carlo method, hypothesis testing | |
| | Supervised learning | Classifier, boosting, prediction, regression analysis | Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. |
| | Unsupervised learning | Clustering, deep learning | |
| | Collaborative filtering | Item-based CF, user-based CF | |
| | Algorithm: classifier | Decision trees, KNN (k-nearest neighbors), SVM (support vector machines), SVD (singular value decomposition), naïve Bayes classifiers, neural networks | |
| | Algorithm: regression | Linear regression, logistic regression, ranking, perceptron | |
| | Algorithm: clustering | Hierarchical clustering, K-means clustering, spectral clustering | |
| | Algorithm: dimensionality reduction | PCA (principal component analysis), LDA (linear discriminant analysis), MDS (multidimensional scaling) | |
| | Text mining & information retrieval | Corpus, term-document matrix, term frequency & weight, association rules, market basket analysis, vocabulary mapping, sentiment analysis, tagging, PageRank, VSM (vector space model), inverted index | Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. |
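
The table only names its key points. As one small illustration, the precision/recall entry under the DM & ML basics can be sketched in plain Python. The function name `precision_recall`, the `positive` parameter, and the sample labels below are assumptions made for this example and do not come from the original post.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for one class (hypothetical helper).

    precision = TP / (TP + FP): of the items predicted positive, how many are right.
    recall    = TP / (TP + FN): of the truly positive items, how many were found.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Return 0.0 when a denominator is zero, to avoid division by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(precision_recall(y_true, y_pred))  # (0.75, 0.75)
```

In practice a library routine such as those in scikit-learn, which the table already lists, would normally be used instead of a hand-written helper.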
This entry was posted in Data Mining, Data Warehouse, Database, Hadoop, HBase, Hive, Impala, Machine Learning, NewSQL, NoSQL, PostgreSQL and tagged BigData, Data Mining, Data Warehouse, Database, Hadoop, HBase, Machine Learning on November 5, 2013 by ybliang8@gmail.com.
