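The links below cover various aspects of big data applications. They are a bit disorganized but all fairly useful, and I will update the list from time to time:
1. AMPLab published the results of a workload-based benchmark that compares SQL query performance across Hive, Impala, Tez, Shark, and Redshift in scan, aggregation, join, and external script scenarios.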
https://amplab.cs.berkeley.edu/benchmark/
2. The page above led me to an Intel Hadoop benchmark tool, HiBench: This benchmark suite contains 9 typical Hadoop workloads (including micro benchmarks, HDFS benchmarks, web search benchmarks, machine learning benchmarks, and data analytics benchmarks).
https://github.com/intel-hadoop/HiBench
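3. (From hashjoin's Weibo) Tresata today released a real-time big data mining solution for the finance and insurance industries. Backed by a Rackspace founder, the company has built relationship mining and risk analysis applications on Spark and has become a new favorite on Wall Street. Hadoop brought the industry cheap big data storage; the next generation of big data companies should focus on how to extract value from that stored data.
4. Facebook made some performance improvements to HDFS for its HBase workload, apparently by adding a flash cache; this needs a closer read: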
http://research.cs.wisc.edu/wind/Publications/fbmessages-fast14.pdf
5. Apache Hadoop 2.3.0 released:
With this release, there are two significant enhancements to HDFS:
• Support for Heterogeneous Storage Hierarchy in HDFS (HDFS-2832)
• In-memory Cache for data resident in HDFS via Datanodes (HDFS-4949)
In YARN, we are very excited to see that ResourceManager Automatic Failover (YARN-149) is nearly complete, even though it isn't ready for prime time yet. We expect it to land by the next release, i.e. hadoop-2.4. Furthermore, a number of key operational enhancements have been driven into YARN such as better logging, error-handling, diagnostics etc.
On the MapReduce side of the house, a key enhancement is MAPREDUCE-4421; with this we now no longer need to install MapReduce binaries on every machine and can just use a MapReduce tarball via the YARN DistributedCache by copying it into HDFS.
http://hortonworks.com/blog/apache-hadoop-2-3-0-released/
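To make the MAPREDUCE-4421 item above more concrete, here is a minimal Python sketch that builds the two `mapred-site.xml` properties involved in a tarball-based deployment: `mapreduce.application.framework.path` and `mapreduce.application.classpath`. The HDFS location, the `mrframework` alias, and the jar directories inside the archive are all assumptions for illustration; the real classpath depends on how your tarball is laid out and on whether it also bundles the common, HDFS, and YARN jars.

```python
import xml.etree.ElementTree as ET


def framework_properties(tarball_uri, alias, jar_dirs):
    """Return the mapred-site.xml properties for a tarball-based deployment.

    tarball_uri: HDFS URI of the uploaded MapReduce tarball (hypothetical).
    alias:       link name under which YARN localizes the unpacked archive.
    jar_dirs:    directories inside the archive that hold jars (assumed layout).
    """
    classpath = ["$HADOOP_CONF_DIR"]
    for jar_dir in jar_dirs:
        # Jars are resolved relative to the container working directory ($PWD).
        classpath.append(f"$PWD/{alias}/{jar_dir}/*")
    return {
        # The "#alias" fragment names the directory the archive is linked as.
        "mapreduce.application.framework.path": f"{tarball_uri}#{alias}",
        "mapreduce.application.classpath": ",".join(classpath),
    }


def to_xml(props):
    """Render the properties as a <configuration> fragment."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    props = framework_properties(
        "hdfs:///mapred/framework/hadoop-mapreduce-2.3.0.tar.gz",  # hypothetical path
        "mrframework",
        ["share/hadoop/mapreduce", "share/hadoop/mapreduce/lib"],  # assumed layout
    )
    print(to_xml(props))
```

The tarball itself still has to be copied into HDFS first (for example with `hdfs dfs -put`), at a location every job submitter can read.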
6. Apache Hadoop 2.4.0 released:
Hadoop 2.4.0 continues that momentum, with additional enhancements to both HDFS & YARN:
- Support for Access Control Lists in HDFS (HDFS-4685)
- Native support for Rolling Upgrades in HDFS (HDFS-5535)
- Smooth operational upgrades with protocol buffers for HDFS FSImage (HDFS-5698)
- Full HTTPS support for HDFS (HDFS-5305)
- Support for Automatic Failover of the YARN ResourceManager (YARN-149) (a.k.a Phase 1 of YARN ResourceManager High Availability)
- Enhanced support for new applications on YARN with Application History Server (YARN-321) and Application Timeline Server (YARN-1530)
- Support for strong SLAs in YARN CapacityScheduler via Preemption (YARN-185)
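As a rough illustration of the HDFS ACL support (HDFS-4685) listed above, the following Python sketch assembles `hdfs dfs -setfacl` and `hdfs dfs -getfacl` command lines. It only builds the argument lists and does not run them; the user, group, and path are made-up examples, and the sketch assumes ACLs have been enabled on the NameNode (`dfs.namenode.acls.enabled`).

```python
import shlex

VALID_KINDS = ("user", "group", "other", "mask")


def acl_entry(kind, name, perms):
    """Format one ACL entry such as 'user:alice:r-x'.

    perms must be three characters in rwx order, using '-' for a missing bit.
    """
    if kind not in VALID_KINDS:
        raise ValueError(f"unknown ACL entry type: {kind}")
    if len(perms) != 3 or any(c not in (bit, "-") for c, bit in zip(perms, "rwx")):
        raise ValueError(f"permissions must look like 'r-x', got: {perms}")
    return f"{kind}:{name}:{perms}"


def setfacl_command(path, entries):
    """Build the argv for modifying (-m) the ACL entries of a path."""
    return ["hdfs", "dfs", "-setfacl", "-m", ",".join(entries), path]


def getfacl_command(path):
    """Build the argv for displaying the ACL of a path."""
    return ["hdfs", "dfs", "-getfacl", path]


if __name__ == "__main__":
    entries = [
        acl_entry("user", "alice", "r-x"),       # hypothetical user
        acl_entry("group", "analysts", "r--"),   # hypothetical group
    ]
    print(shlex.join(setfacl_command("/data/reports", entries)))
    print(shlex.join(getfacl_command("/data/reports")))
```

Returning argument lists rather than a single string keeps quoting out of the picture if the commands are later passed to `subprocess.run`.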