Overview of Big Data Benchmarks (Paper Notes)

Paper: Benchmarking big data systems

Part 1: Notes

Part 2: Challenges

 

Part 1:

1. Three important aspects of benchmarking:

(1) Workloads:

- workload generation techniques
- workload implementation techniques
- run-time execution modes

 

(2) Workload input data generation techniques:

- ready-made datasets
- data generators based on synthetic distributions
- data generators based on real-world data
- hybrid data generators
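To make the second technique concrete, a data generator based on synthetic distributions simply draws records from chosen statistical distributions. A minimal sketch in Python (the distributions, key space, and field names here are illustrative assumptions, not from the paper):

```python
import random

def synthetic_records(n, seed=42):
    """Generate n (key, value) records from synthetic distributions:
    keys follow a Zipf-like rank distribution, values are Gaussian."""
    rng = random.Random(seed)  # fixed seed makes the dataset reproducible
    ranks = list(range(1, 101))
    weights = [1.0 / r for r in ranks]  # Zipf-like: weight of rank r is 1/r
    records = []
    for _ in range(n):
        key = "key%03d" % rng.choices(ranks, weights=weights)[0]
        value = rng.gauss(mu=100.0, sigma=15.0)  # synthetic numeric payload
        records.append((key, value))
    return records

data = synthetic_records(1000)
```

Because the seed is fixed, the generator is deterministic, which is what lets a benchmark replay the same input across systems.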

 

(3) Metrics used to assess systems:

- performance metrics
- price metrics
- energy consumption metrics

 

2. Big data systems:

(1) Hadoop and its related systems: have become the solution for the majority of big data applications;

(2) database management systems and NoSQL data stores: are widely used in online transactional and analytical applications;

(3) specialized systems in the big data domain: driven by the specific processing requirements of connected graphs, continuous streams, and complex scientific data.

 

3. Features of big data: high volume, high velocity, and high variety (the 3Vs)

 

4. Benchmarks : 

Micro benchmarks: evaluate either individual system components or specific system behaviors (functions of code).

 

End-to-end benchmarks: evaluate entire systems using typical application scenarios; each scenario corresponds to a collection of related workloads.

Benchmark suites: combinations of different micro and/or end-to-end benchmarks.
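To make the micro-benchmark idea concrete: a micro benchmark just times one isolated operation in a repeatable way. A minimal sketch (the measured operation and the repetition counts are arbitrary illustrative choices):

```python
import timeit

def sort_workload():
    """One isolated system behavior to measure: sorting a fixed list."""
    data = list(range(10_000, 0, -1))
    data.sort()

# Repeat the timed loop several times and keep the best (least noisy) run.
runs = timeit.repeat(sort_workload, number=100, repeat=3)
best_seconds = min(runs)
per_call_ms = best_seconds / 100 * 1000
```

Taking the minimum over repeats is the usual convention with `timeit`, since the fastest run is the one least perturbed by background load.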

 

5. High-level languages on Hadoop

The first category of systems is developed with two objectives. 

The first objective is to simplify the development of Hadoop MapReduce jobs and to automatically optimize their execution, thus allowing developers to focus on the programming logic (e.g., Pig).

The second objective is to meet the increasing demand for extracting value from high-velocity data in real time.

E.g.: Traditionally, the batch processing offered by Hadoop MapReduce usually takes minutes or hours to complete; hence the purpose of adding high-level languages on Hadoop is to enable fast data processing.
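What a language like Pig hides is the map/reduce plumbing itself. The classic word-count job, which Pig Latin expresses in a few lines, looks roughly like this when the map, shuffle, and reduce phases are written out by hand (a simplified in-memory sketch, not actual Hadoop API code):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big benchmarks"])))
# counts == {"big": 2, "data": 1, "benchmarks": 1}
```

A high-level language generates this pipeline (and its optimized execution plan) from a declarative script, which is exactly the "focus on the programming logic" objective above.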

 

The second category of systems adds SQL interfaces to Hadoop.

 

6. Data stores

NewSQL databases (e.g., HStore) are another class of relational DBMSs, designed for high-throughput online transaction processing (OLTP) while still maintaining the ACID properties.

Many big data applications may not require strict ACID compliance; they favor performance properties such as low latency and high throughput over consistency and reliability. Hence a variety of NoSQL data stores following the Basically Available, Soft state, Eventually consistent (BASE) properties have been developed as alternatives to SQL data stores. For handling big data, the three commonly used categories of NoSQL data stores are key-value stores (e.g., Amazon Dynamo, Cassandra, and LinkedIn Voldemort), column-oriented databases (e.g., BigTable), and document-oriented stores (e.g., CouchDB, MongoDB).
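The BASE trade-off can be illustrated with a toy replicated key-value store that accepts writes locally and reconciles replicas later with a last-write-wins rule (a deliberately simplified sketch; real stores such as Dynamo use vector clocks and quorums rather than a single logical timestamp):

```python
class Replica:
    """A toy key-value replica: each value carries a logical timestamp."""

    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def put(self, key, value, timestamp):
        self.store[key] = (timestamp, value)

    def get(self, key):
        ts_value = self.store.get(key)
        return ts_value[1] if ts_value else None

    def merge(self, other):
        """Anti-entropy: adopt the other replica's newer writes (last write wins)."""
        for key, (ts, value) in other.store.items():
            if key not in self.store or ts > self.store[key][0]:
                self.store[key] = (ts, value)

a, b = Replica(), Replica()
a.put("user:1", "alice", timestamp=1)  # write accepted locally at replica a
b.put("user:1", "bob", timestamp=2)    # later concurrent write at replica b
a.merge(b)  # replicas converge eventually; a now holds the newer value
```

Between the writes and the merge, the two replicas disagree; that temporary divergence is precisely the consistency the BASE model gives up in exchange for low-latency, always-available writes.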

 

7. Specialized systems

Graph data: In a connected world, graphs naturally model complicated data structures such as social networks, protein interaction networks, and natural networks. Graph data is widely used in many application domains such as online retail, social applications, and bioinformatics. To address the challenges brought by the increasing size and complexity of graph data, two types of graph systems have been developed:

(1) graph databases such as Neo4j;

(2) distributed graph processing systems.

When evaluating big data applications, the large data volume and the high diversity of graph computations give rise to new challenges that require new benchmarks.
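A representative elementary graph computation of the kind such benchmarks exercise is breadth-first traversal over an adjacency list (a minimal single-machine sketch; distributed engines partition the same computation across workers):

```python
from collections import deque

def bfs_distances(graph, source):
    """Compute hop distances from source over an adjacency-list graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in dist:  # first visit = shortest hop count
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

# A tiny social-network-like graph as an adjacency list.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
distances = bfs_distances(graph, "a")
# distances == {"a": 0, "b": 1, "c": 1, "d": 2}
```

The benchmark-relevant point is that the same traversal becomes communication-bound once the adjacency list no longer fits on one machine, which is why graph workloads stress systems differently from tabular ones.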

Stream data: In these systems, data arrives continuously and is processed within short time windows. Traditional stream data is mostly numerical data coming from sensor networks and financial transactions. Today, a majority of stream data is text and media data generated by Web 2.0 applications and Internet of Things (IoT) devices. Hence more powerful stream data processing is needed to provide high availability and fast recovery when handling high data volume and velocity.

Developing stream benchmarks requires addressing several challenges, including generating data with semantic validity; producing correct answers for continuous query results; and designing a querying language standard.

Stream data is a sequence of data items that arrive in order, in large volume, rapidly, and continuously; in general, a data stream can be viewed as a dynamic data collection that grows without bound over time. (from Baidu Baike)
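Processing "within short time windows," as described above, typically means assigning arriving events to fixed (tumbling) windows and aggregating each window independently. A minimal sketch (the window size and events are illustrative):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Assign (timestamp, value) events to fixed windows and count per window."""
    windows = defaultdict(int)
    for timestamp, _value in events:
        # Each event belongs to the window starting at the nearest multiple
        # of window_seconds at or below its timestamp.
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start] += 1
    return dict(windows)

events = [(1, "x"), (3, "y"), (6, "z"), (12, "w")]  # (seconds, payload)
counts = tumbling_window_counts(events, window_seconds=5)
# counts == {0: 2, 5: 1, 10: 1}
```

Real stream engines do the same grouping incrementally and must additionally handle late and out-of-order events, which is one source of the correctness challenges mentioned above.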

 

Scientific data: Modern research instruments can generate massive amounts of scientific data, in which both data volume and velocity grow exponentially. The increasing demand for efficiently capturing, storing, analyzing, and aggregating these vast datasets in current and future scientific data infrastructures stimulates the development of many scientific big data systems.

 

Although there are diverse types of scientific big data systems, few benchmarks are available at present. 

8. Criteria for benchmarks (most benchmarks cannot satisfy all of them):

(1) Relevance: workloads should capture the typical behaviors of the system under test;

(2) Portability: workloads should be executable on different software systems and architectures;

(3) Scalability: workloads should scale up or down so that systems of different sizes can be tested.

 

Part 2

Summary of challenges:

In the big data era, ever-growing data volume, the demand for fast data processing (data velocity), and the diversity of data types, structures, and sources pose entirely new challenges, and traditional data management systems can no longer cope.

I. Data types

The growing size and complexity of graph data bring new challenges. When evaluating big data applications, the large data volume and the diversity of graph computations give rise to new challenges that call for new benchmarks.

Traditional stream data mostly comes from sensor networks and financial transactions. Today, the majority of stream data is text and media data generated by Web 2.0 applications and Internet of Things (IoT) devices. Handling high-volume, high-velocity streams requires more powerful stream processing that provides high availability and fast recovery. To meet this challenge, stream analytics systems from the open-source community (e.g., Apache Storm [15], Spark Streaming [14], and Samza [11]) and from industry (e.g., IBM InfoSphere Streams [32] and TIBCO [54]) have been proposed.

The growing demand in current and future scientific data infrastructures for efficiently capturing, storing, analyzing, and aggregating these massive datasets stimulates the development of many scientific big data systems. Although diverse scientific big data systems exist, few benchmarks are available for them at present.

II. Benchmark challenges

Relevance:

The diversity and rapid evolution of big data systems make it very challenging to develop representative workloads that cover different application scenarios.

-> One feasible solution is to abstract a general method from the system-independent behaviors of various workloads, decomposing a whole workload into a number of elementary operations (EOs) and their combination patterns. A sufficient abstraction not only allows new workloads to be developed from existing EOs with little or no change to the EOs themselves, but also leaves room for different system implementations.

-> Applying this abstraction to gain insight into EOs and their combination patterns requires solving three challenging problems:

(1) In most big data applications (e.g., Internet services and multimedia), workloads run on semi-structured and unstructured data, and their operations and patterns are complex and diverse.

(2) The breadth of big data domains (e.g., machine learning, databases, natural language processing, and computer vision) means that studying the abstractions of all their applications is very time-consuming.

(3) The variety of algorithms across different big data software stacks (e.g., Hadoop, Spark, Flink, Kudu) and libraries (e.g., Mahout, MLlib, AstroML) increases the difficulty of abstraction.
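The EO decomposition above can be pictured as composing a workload from a small library of elementary operations. The sketch below is purely illustrative: the operation names and the composition helper are hypothetical, not the paper's actual EO set.

```python
# Hypothetical elementary operations (EOs): small, system-independent steps.
def eo_filter(records, predicate):
    return [r for r in records if predicate(r)]

def eo_project(records, field):
    return [r[field] for r in records]

def eo_aggregate(values):
    return sum(values)

def compose(*steps):
    """Combination pattern: chain EOs into a full workload."""
    def workload(data):
        for step in steps:
            data = step(data)
        return data
    return workload

# One workload built from the EO library: total spend of premium users.
records = [{"tier": "premium", "spend": 10}, {"tier": "free", "spend": 3},
           {"tier": "premium", "spend": 7}]
workload = compose(
    lambda rs: eo_filter(rs, lambda r: r["tier"] == "premium"),
    lambda rs: eo_project(rs, "spend"),
    eo_aggregate,
)
total = workload(records)
# total == 17
```

The point of the abstraction is that a new workload is just a new composition of the same EOs, while each system remains free to implement `eo_filter` or `eo_aggregate` however it likes.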

Scalability:

Existing benchmarks either support adjusting the workload scale through parameters without considering realism, or execute and replay real workloads but cannot adjust their scale dynamically.

III. Dataset challenges

The first problem is that existing benchmarks can build models that capture the characteristics of real datasets for certain data types (e.g., table, text, and graph data) [172], but they pay little attention to other data types such as stream, video, and scientific data. Moreover, some data types have diverse data sources, which require differentiated modeling techniques.

The second, and more challenging, problem is how to evaluate the level of realism of the generated synthetic data. A proper evaluation method not only allows benchmark users to assess the reliability of their test results, but also helps validate the models built into the data generators.
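One common way to quantify such realism is to compare the empirical distributions of a real sample and a synthetic sample, e.g. with the Kolmogorov-Smirnov statistic. A hand-rolled sketch (the sample data is illustrative, and a real evaluation would compare many features, not one scalar):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Maximum gap between two samples' empirical CDFs (the K-S statistic)."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

real = [1, 2, 2, 3, 3, 3, 4, 5]
good_synthetic = [1, 2, 2, 3, 3, 4, 4, 5]
bad_synthetic = [10, 11, 12, 13, 14, 15, 16, 17]

d_good = ks_statistic(real, good_synthetic)
d_bad = ks_statistic(real, bad_synthetic)
# d_bad is much larger: the second generator's output is far from the real data.
```

A statistic like this gives benchmark users a concrete number for "how realistic" a generator's output is, rather than relying on visual inspection.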

 
