Weekly Big Data Paper (1): Data-intensive applications, challenges, techniques and technologies: A survey on Big Data

I find that reading foreign-language papers and summarizing their new ideas, algorithms, and processing methods broadens my horizons and helps me spot open problems and opportunities for innovation. Writing the notes down also reminds me what each paper covered, so I can find the relevant material again quickly, and lets me share the papers with others. If you need this paper, leave your email in the comments and I will send it to you.


Source: Information Sciences
Authors: C.L. Philip Chen, Chun-Yang Zhang

“This paper is aimed to demonstrate a close-up view about Big Data, including Big Data applications, Big Data opportunities and challenges, as well as the state-of-the-art techniques and technologies we currently adopt to deal with the Big Data problems. We also discuss several underlying methodologies to handle the data deluge, for example, granular computing, cloud computing, bio-inspired computing, and quantum computing.” (Abstract)
This survey runs to 33 pages, which is fairly long for the genre. The paper is divided into 7 parts:
1. Introduction
2. Big Data problems
3. Big Data opportunities and challenges
4. Big Data tools: techniques and technologies
5. Principles for designing Big Data systems
6. Underlying technologies and future researches
7. Conclusion


Part 1
The paper opens by citing the international consensus on Big Data, giving a definition and several salient characteristics; the rest of the part is general discussion. The definition of Big Data: “Big Data are high-volume, high-velocity, and/or high-variety information assets. More generally, a data set can be called Big Data if it is formidable to perform capture, curation, analysis and visualization on it at the current technologies.” The 4V characteristics of Big Data: volume, velocity, variety, and value.
Part 2
This part has little substantive content; it draws on various domains, including finance, commerce, and government, to argue that Big Data problems have become urgent ones to solve.
Part 3
“Challenges in Big Data analysis include data inconsistence and incompleteness, scalability, timeliness and data security.” Data inconsistency and incompleteness, timeliness, scalability, and data security are the main challenges Big Data faces. The authors then briefly review the current techniques, and the problems still to be solved, from the perspectives of data mining, data storage, and data processing. In the subsection on integrated data processing, they introduce the currently mainstream NoSQL databases: “NoSQL database, also called ‘Not Only SQL’, is a current approach for large and distributed data management and database design. Its name easily leads to misunderstanding that NoSQL means ‘not SQL’. On the contrary, NoSQL does not avoid SQL.” They give a brief overview and use Hadoop's HBase for further illustration. Overall this stays at the conceptual level, without typical worked examples. The concepts of data analysis and data visualization are also touched on briefly.
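To make the wide-column data model behind HBase concrete, here is a minimal sketch in plain Python (a toy stand-in of my own, not the actual HBase client API): every cell is addressed by a row key, a column family, and a column qualifier, and different rows need not share the same columns.

```python
# Toy wide-column store in the HBase style (not the real HBase API):
# row key -> column family -> column qualifier -> value.
from collections import defaultdict

class WideColumnTable:
    def __init__(self):
        # Nested maps stand in for HBase's sorted, versioned cells.
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value):
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key, family, qualifier=None):
        fam = self.rows.get(row_key, {}).get(family, {})
        return fam if qualifier is None else fam.get(qualifier)

table = WideColumnTable()
table.put("user#1", "info", "name", "Alice")
table.put("user#1", "info", "email", "alice@example.com")
table.put("user#2", "info", "name", "Bob")   # no email: rows are sparse
print(table.get("user#1", "info", "email"))  # alice@example.com
print(table.get("user#2", "info"))           # {'name': 'Bob'}
```

The point of the model is visible even in the toy: rows are schema-less and sparse, which is what makes this layout convenient for large, heterogeneous data sets.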
Part 4
This is the part I got the most out of, and the one I read closely.
“We need tools (platforms) to make sense of Big Data. Current tools concentrate on three classes, namely, batch processing tools, stream processing tools, and interactive analysis tools. Most batch processing tools are based on the Apache Hadoop infrastructure, such as Mahout and Dryad. The latter is more like necessary for real-time analytic for stream data applications. Storm and S4 are good examples for large scale streaming data analytic platforms. The interactive analysis processes the data in an interactive environment, allowing users to undertake their own analysis of information. The user is directly connected to the computer and hence can interact with it in real time. The data can be reviewed, compared and analyzed in tabular or graphic format or both at the same time. Google's Dremel and Apache Drill are Big Data platforms based on interactive analysis. In the following sub-sections, we'll discuss several tools for each class. More information about Big Data tools can be found in Appendix C.”
Current Big Data tools fall into three classes: batch processing tools, stream processing tools, and interactive analysis tools. Among batch processing tools, the best-known and most widely adopted internationally are those based on the Apache Hadoop infrastructure, such as Mahout (literally an elephant driver) and Dryad (literally a wood nymph). Stream processing tools include Storm and S4. Interactive analysis tools include Google's Dremel (literally a rotary tool) and Apache Drill (literally a drill bit). The paper later walks through each tool with an introductory explanation and comparison; this part is very useful for newcomers who want to understand the basic frameworks of Big Data processing.
“Apache Hadoop and map/reduce
Apache Hadoop is one of the most well-established software platforms that support data-intensive distributed applications. It implements the computational paradigm named Map/Reduce. Apache Hadoop (see Fig. 6) platform consists of the Hadoop kernel, Map/Reduce and Hadoop distributed file system (HDFS), as well as a number of related projects, including Apache Hive, Apache HBase, and so on.
Map/Reduce [43], which is a programming model and an execution for processing and generating large volume of data sets, was pioneered by Google, and developed by Yahoo! and other web companies. Map/Reduce is based on the divide and conquer method, and works by recursively breaking down a complex problem into many sub-problems, until these sub-problems are scalable for solving directly. After that, these sub-problems are assigned to a cluster of worker nodes, and solved in separate and parallel ways. Finally, the solutions to the sub-problems are then combined to give a solution to the original problem. The divide and conquer method is implemented by two steps: Map step and Reduce step. In terms of Hadoop cluster, there are two kinds of nodes in Hadoop infrastructure. They are master nodes and worker nodes. The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes in Map step. Afterwards, the master node collects the answers to all the sub-problems and combines them in some way to form the output in Reduce step.”
(Fig. 6, reproduced from the paper)
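To make the Map step / Reduce step description concrete, here is a self-contained Python sketch of the same divide-and-conquer flow on a word-count job (a toy stand-in for the framework, not the Hadoop API): the input is split, the map step runs on the splits in parallel worker processes, a shuffle groups the intermediate pairs by key, and the reduce step combines each group.

```python
# Toy Map/Reduce word count (plain Python standing in for Hadoop):
# split -> map in parallel -> shuffle/group by key -> reduce.
from collections import defaultdict
from multiprocessing import Pool

def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(item):
    """Reduce step: combine all counts emitted for one key."""
    word, counts = item
    return word, sum(counts)

def map_reduce(splits):
    with Pool() as pool:                       # worker nodes, loosely speaking
        mapped = pool.map(map_phase, splits)   # master distributes the splits
    groups = defaultdict(list)                 # shuffle: group values by key
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    return dict(map(reduce_phase, groups.items()))

if __name__ == "__main__":
    splits = ["big data is big", "data about data"]
    print(map_reduce(splits))  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```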

Dryad
“Dryad [75] is another popular programming model for implementing parallel and distributed programs that can scale up capability of processing from a very small cluster to a large cluster. It bases on dataflow graph processing [101]. The infrastructure for running Dryad consists of a cluster of computing nodes, and a programmer uses the resources of a computer cluster to run their programs in a distributed way.”
“Dryad provides a large number of functionality, including generating the job graph, scheduling the processes on the available machines, handling transient failures in the cluster, collecting performance metrics, visualizing the job, invoking user-defined policies and dynamically updating the job graph in response to these policy decisions, without awareness of the semantics of the vertices [101]. Fig. 9 schematically shows the implementation schema of Dryad. There is a centralized job manager to supervise every Dryad job. It uses a small set of cluster services to control the execution of the vertices on the cluster.”
(Fig. 9, reproduced from the paper)
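Dryad's real programming interface is not shown in the paper excerpt, so the following toy Python sketch only illustrates the core idea of a job as a dataflow graph: vertices are computations, edges are data channels, and a scheduler (playing the job manager's role) runs each vertex once all of its inputs are ready. All names here are made up for the illustration.

```python
# Toy dataflow-graph execution in the Dryad spirit (not Dryad's API):
# a job is a DAG; vertices compute, edges carry data between them.
from graphlib import TopologicalSorter  # Python 3.9+

def run_dataflow(vertices, edges, sources):
    """vertices: name -> function taking a list of inputs.
    edges: (upstream, downstream) pairs, i.e. data channels.
    sources: name -> initial input for vertices with no predecessors."""
    preds = {v: [] for v in vertices}
    for up, down in edges:
        preds[down].append(up)
    results = {}
    # Topological order stands in for the job manager's scheduling.
    order = TopologicalSorter({v: set(p) for v, p in preds.items()})
    for v in order.static_order():
        inputs = [results[p] for p in preds[v]] or [sources[v]]
        results[v] = vertices[v](inputs)
    return results

# Usage: two parallel vertices feeding one downstream merge vertex.
out = run_dataflow(
    vertices={
        "split_a": lambda ins: [w.upper() for w in ins[0]],
        "split_b": lambda ins: [w.upper() for w in ins[0]],
        "merge":   lambda ins: sorted(ins[0] + ins[1]),
    },
    edges=[("split_a", "merge"), ("split_b", "merge")],
    sources={"split_a": ["big", "data"], "split_b": ["dry", "ad"]},
)
print(out["merge"])  # ['AD', 'BIG', 'DATA', 'DRY']
```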
“Apache mahout
The Apache Mahout [74] aims to provide scalable and commercial machine learning techniques for large-scale and intelligent data analysis applications. Many renowned big companies, such as Google, Amazon, Yahoo!, IBM, Twitter and Facebook, have implemented scalable machine learning algorithms in their projects. Many of their projects involve with Big Data problems and Apache Mahout provides a tool to alleviate the big challenges. Mahout's core algorithms, including clustering, classification, pattern mining, regression, dimension reduction, evolutionary algorithms and batch based collaborative filtering, run on top of Hadoop platform via the Map/Reduce framework [46,47]. These algorithms in the libraries have been well-designed and optimized to have good performance and capabilities. A number of non-distributed algorithms are also contained. The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases. The business users need to purchase Apache software license for Mahout. More detailed content can be found on the web site: http://mahout.apache.org/.”
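Mahout's algorithms are Java libraries run on Hadoop, so rather than guess at its API, here is a toy Python illustration of how one of its core algorithms, k-means clustering, decomposes onto Map/Reduce: the map step assigns each point to its nearest centroid, and the reduce step averages each cluster's points into a new centroid.

```python
# Toy k-means expressed as repeated map/reduce rounds (plain Python,
# not Mahout's Java API). Map assigns points; reduce recomputes centroids.
from collections import defaultdict

def kmeans_mapreduce(points, centroids, iterations=10):
    for _ in range(iterations):
        # Map: emit (centroid index, point) for the nearest centroid.
        clusters = defaultdict(list)
        for x, y in points:
            idx = min(range(len(centroids)),
                      key=lambda i: (x - centroids[i][0]) ** 2
                                  + (y - centroids[i][1]) ** 2)
            clusters[idx].append((x, y))
        # Reduce: average each cluster's points into a new centroid.
        for idx, members in clusters.items():
            centroids[idx] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centroids

pts = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
print(kmeans_mapreduce(pts, centroids=[(0, 0), (10, 10)]))
# The two centroids converge near (1.25, 1.5) and (8.5, 8.5).
```

Each iteration is one Map/Reduce job; Mahout-style implementations chain such jobs until the centroids stop moving.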
“Storm
A Storm cluster consists of two kinds of working nodes. As illustrated in Fig. 11, they are only one master node and several worker nodes. The master node and worker nodes implement two kinds of daemons: Nimbus and Supervisor respectively. The two daemons have functions similar to those of the JobTracker and TaskTracker in the Map/Reduce framework. Nimbus is in charge of distributing code across the Storm cluster, scheduling work by assigning tasks to worker nodes, and monitoring the whole system. If there is a failure in the cluster, the Nimbus will detect it and re-execute the corresponding task. The Supervisor complies with tasks assigned by Nimbus, and starts or stops worker processes as necessary based on the instructions of Nimbus. The whole computational topology is partitioned and distributed to a number of worker processes, and each worker process implements a part of the topology. How can Nimbus and the Supervisors work swimmingly and complete the job fast? Another kind of daemon called Zookeeper plays an important role in coordinating the system. It records all states of the Nimbus and Supervisors on local disk.”
(Fig. 11, Storm; reproduced from the paper)
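To show what the "computational topology" means operationally, here is a toy Python sketch of the spout/bolt model (my own illustration, not Storm's Java API): a spout emits an unbounded stream of tuples, and each bolt transforms the stream one tuple at a time, so results update continuously instead of arriving at the end of a batch.

```python
# Toy streaming topology in the Storm spirit (not the Storm API):
# spout -> split bolt -> count bolt, processing one tuple at a time.
from collections import Counter

def sentence_spout():
    """Spout: stand-in source for an endless stream of sentence tuples."""
    for line in ["big data stream", "data never stops", "big stream"]:
        yield line

def split_bolt(stream):
    """Bolt 1: split each sentence tuple into word tuples."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Bolt 2: keep a running count, emitting updated state per tuple."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]

# Wire the topology and run it: results are emitted continuously.
for word, running_total in count_bolt(split_bolt(sentence_spout())):
    print(f"{word}: {running_total}")
```

In real Storm the topology is partitioned across worker processes and coordinated by Nimbus, the Supervisors, and Zookeeper as described above; the generators here only model the dataflow, not the distribution or fault tolerance.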
I only skimmed the rest of the paper: the principles to follow when designing Big Data systems, a brief run-through of emerging technologies, and the outlook, so I won't paste those parts here. Some of the English technical terms are worth noting, for example "the divide and conquer method".
In short, this survey is introductory and should be of some help to newcomers; I hope you find it useful.
