The following are documents I used while learning data analysis:
I. Hadoop:
0. Hadoop at ten years, a retrospective and development forecast: http://www.infoq.com/cn/articles/hadoop-ten-years-interpretation-and-development-forecast
1. Setting up a Hadoop cluster: http://blog.csdn.net/weixuehao/article/details/15813681
2. Architecture and design of the HDFS distributed file system: https://hadoop.apache.org/docs/r1.0.4/cn/hdfs_design.html
3. Hadoop fs shell commands: https://hadoop.apache.org/docs/r1.0.4/cn/hdfs_shell.html
4. How the MapReduce computation framework works:
5. Shuffle and sort, the core of MapReduce: http://langyu.iteye.com/blog/992916
6. Hadoop Streaming: https://hadoop.apache.org/docs/r1.2.1/streaming.html
7. Sqoop user guide: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
8. Multiple output files with Hadoop Streaming: http://stackoverflow.com/questions/18541503/multiple-output-files-for-hadoop-streaming-with-python-mapper
II. Hive:
1. How Hive SQL is compiled into MapReduce: http://tech.meituan.com/hive-sql-to-mapreduce.html
2. Hive data storage model: http://www.iteblog.com/archives/866
3. Hive managed (internal) tables and external tables: http://www.aboutyun.com/thread-7458-1-1.html
4. Differences between left join, left outer join, and left semi join in Hive: http://www.crazyant.net/1470.html
5. Hive regexp_extract: http://blog.csdn.net/lxpbs8851/article/details/39202735
6. Usage of get_json_object, lateral view, and related functions: http://my.oschina.net/leejun2005/blog/120463
III. 瓦利哥's column:
http://zhuanlan.zhihu.com/sangwf (covers how Baidu's big data architecture evolved from 0 to 1)
IV. Spark:
1. RDD, the core of Spark: http://www.infoq.com/cn/articles/spark-core-rdd