The Top 10 Challenges in Extreme-Scale Visual Analytics

Pak Chung Wong,

Pacific Northwest National Laboratory

Han-Wei Shen,

Ohio State University

Christopher R. Johnson,

University of Utah

Chaomei Chen, and

Drexel University

Robert B. Ross

Argonne National Laboratory

Abstract

In this issue of CG&A, researchers share their R&D findings and results on applying visual analytics (VA) to extreme-scale data. Having surveyed the literature and other R&D in the field, we identify what we consider the top challenges of extreme-scale VA. To serve the magazine's diverse readership, our discussion evaluates challenges across the full spectrum of the field, including algorithms, hardware, software, engineering, and social issues.

Background

The September/October 2004 issue of CG&A introduced the concept of visual analytics to computer science. In 2005, an international, multidisciplinary panel collectively defined the newly established field as the science of analytical reasoning facilitated by interactive visual interfaces. The meaning and goals of VA have since evolved and extended to cover scientific and nonscientific data of different types, shapes, sizes, domains, and applications. As extreme-scale datasets began to revolutionize our everyday working lives, researchers looked to VA as a solution to big-data problems.

 

Today's extreme-scale VA applications typically combine high-performance computers for computation, high-performance database applications or cloud services for data storage and management, and desktop machines for human-computer interaction. The extreme-scale data usually comes from models or observations arising in a variety of scientific, engineering, social, and Web applications. Although many petabyte- and terabyte-scale analysis problems remain unsolved, scientists have begun to analyze exabyte-scale data.

 

 

The Top 10 Challenges

Addressing these 10 challenges has significant, long-term implications: it will not only meet critical scientific and technical needs but also facilitate the transfer of solutions to a much broader community. We therefore assess the problems from both technical and social perspectives. The order in which the challenges appear below does not reflect their relative importance, but their contents are interrelated.

 

1. In Situ Analysis

The traditional approach of storing data on disk and analyzing it afterward might not keep up with future exascale requirements. Instead, in situ VA tries to perform as much analysis as possible while the data is still in memory. This approach can greatly reduce I/O costs and maximize the ratio of data used to data moved to and from disk. It introduces, however, an array of design and implementation challenges involving interactive analysis, algorithms, memory, I/O, workflow, and threading.
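To make the contrast with post hoc processing concrete, here is a minimal sketch (our own illustration, not code from the article; the names `in_situ_histogram` and `run_simulation` are hypothetical) of an analysis routine invoked while the simulated data is still resident in memory, so that only compact summaries ever touch the I/O system.

```python
import numpy as np

def in_situ_histogram(field, bins=64):
    """Reduce a full-resolution field to a small histogram while it is still in memory."""
    counts, _edges = np.histogram(field, bins=bins)
    return counts

def run_simulation(steps=20, n=1_000_000, analyze_every=5):
    rng = np.random.default_rng(0)
    for step in range(steps):
        field = rng.normal(size=n)          # stand-in for one timestep of simulation output
        if step % analyze_every == 0:
            counts = in_situ_histogram(field)
            # Only ~64 numbers are written, not the million-element field.
            np.save(f"hist_{step:04d}.npy", counts)
        # The raw field goes out of scope; it is never written to disk.

if __name__ == "__main__":
    run_simulation()
```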

Some of these technical challenges can arguably be solved today, at least in principle. However, the potential solutions might require radical changes in how the HPC community operates, manages, and sets policy, as well as in the systems and engineering support that commercial hardware vendors provide. We return to this issue later.

 

2. Interaction and User Interfaces

Interaction and UI issues are becoming increasingly prominent as VA moves to the extreme-scale frontier. While data sizes continue to grow rapidly, human cognitive abilities remain constant. We conducted an in-depth study focusing on the interaction and UI challenges of extreme-scale VA; the sidebar summarizes our findings.

 

Top Interaction and UI Challenges in Extreme-Scale Visual Analytics

 

Human-computer interaction has long been a substantial barrier in many computer science development areas. Visual analytics (VA) is no exception. Significantly increasing data size in traditional VA tasks inevitably compounds the existing problems. We identified 10 major challenges regarding interaction and user interfaces in extreme-scale VA. The following is a brief revisit of our study of the topic.1

 

1. In Situ Interactive Analysis

In situ VA tries to perform as much analysis as possible while the data is still in memory. The main challenges are sharing the cores of the hardware execution units effectively and mitigating the disruption that human interaction introduces into the overall workflow.

2. User-Driven Data Reduction

When data becomes enormous, traditional data reduction through compression becomes ineffective. One challenge is to develop a flexible mechanism that lets users easily control the granularity of the data they collect according to their analysis needs.

3. Scalability and Multilevel Hierarchy

A popular approach to many VA scalability problems is a multilevel hierarchy. As data sizes grow, however, the hierarchy also grows deeper and more complex. How to navigate a very deep multilevel hierarchy and find the optimal resolution is a challenge for scalable analytics.

 

4. Representing Evidence and Uncertainty

In a VA environment, evidence synthesis and uncertainty quantification are usually unified through visualization, which is where human understanding begins. The challenge is how to clearly represent evidence and uncertainty in extreme-scale data through visualization without introducing significant bias.

 

5. Heterogeneous-Data Fusion

Many extreme-scale data problems are highly heterogeneous. Due attention must be paid to analyzing the intrinsic relationships among these diverse data objects or entities. The challenge is how to extract the right amount of semantics from extreme-scale data and fuse the data interactively for VA.

 

6. Data Summarization and Triage for Interactive Query

Analyzing an entire dataset might be neither possible nor necessary once the data size exceeds the petabyte range. Data summarization and triage let users target data with particular attributes. The challenge is to make the underlying I/O components work well with the results of data summarization and triage, so that interactive queries over extreme-scale data become possible.

7. Analytics of Temporally Evolved Features

An extreme-scale time-varying dataset is often long in time but comparatively narrow in spectrum (or in space, depending on the data type). The key issues are how to develop effective VA techniques for particular time streams and how to exploit human perception to track the data dynamically.

 

8. The Human Bottleneck

Experts predict that all major high-performance computing (HPC) components (power, memory, storage, bandwidth, concurrency, and so on) will improve by large factors, whereas human cognitive abilities will remain constant. Finding alternative ways to compensate for the limits of human cognition is a challenge.

 

9. Design and Engineering Development

Community-wide API and framework support on HPC platforms is still not an option for system developers. The HPC community needs to establish design standards and engineering resources for interaction and UI development on HPC systems.

 

10. The Renaissance of Conventional Wisdom

Perhaps the most significant challenge is to spark a renaissance of conventional wisdom in VA applied to extreme-scale data. Well-proven principles (for example, Ben Shneiderman's information-visualization mantra of overview, zoom, filter, and details on demand) become inadequate when applied to the challenges just described. Successfully returning to the reliable principles behind conventional wisdom and discovering how to apply them to extreme-scale data will most likely foster solutions to many of the problems we have described.

 

References

1. Wong, PC.; Shen, H-W.; Chen, C. Top Ten Interaction Challenges in Extreme-Scale Visual Analytics. In: Dill, J., et al., editors. Expanding the Frontiers of Visual Analytics and Visualization. Springer; 2012. p. 197–207.

2. Ashby, S., et al. The Opportunities and Challenges of Exascale Computing—Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee. US Dept. of Energy Office of Science; 2010. http://science.energy.gov/~/media/ascr/ascac/pdf/reports/Exascale_subcommittee_report.pdf

 

The human-centered interaction and UI challenges are far-reaching, multifaceted, and in places overlapping. Machine-based, automated systems might not resolve the challenges that involve humans' natural biases in the analytical process. Other challenges are rooted in human cognition, and the limits they place on human performance might never be fully overcome.

 

3. Large-Data Visualization

In VA, this challenge centers on fundamental data representations, including visualization techniques and the visual display of information. Recent R&D in abstract visualization, highly scalable data projection, dimension reduction, high-resolution displays, and power-wall displays has helped overcome aspects of this challenge.

However, more data projection and dimension reduction in visualization also means more abstract representations. Such representations require additional insight and explanation for those performing visual analytical reasoning and information foraging. Furthermore, although we can build ever larger visual displays with ever greater resolution, human visual acuity limits how effective very large screens can be for VA. Given both the human and the machine limits, how to visualize extreme-scale data will remain a significant problem for the foreseeable future.
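As a rough illustration of the projection-based abstraction described above, the sketch below (an assumption of ours, not code from the authors; it presumes scikit-learn is available) reduces a high-dimensional dataset to two dimensions with a chunked principal component analysis, so the full data never has to sit in memory at once before it can be shown as a scatterplot overview.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA  # assumes scikit-learn is installed

def project_in_chunks(chunks, n_components=2):
    """Learn a 2D projection over a stream of data blocks, then project each block."""
    ipca = IncrementalPCA(n_components=n_components)
    for chunk in chunks:                              # first pass: fit incrementally
        ipca.partial_fit(chunk)
    return [ipca.transform(chunk) for chunk in chunks]  # second pass: project

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    chunks = [rng.normal(size=(100_000, 50)) for _ in range(5)]  # stand-in for disk-resident blocks
    projected = project_in_chunks(chunks)
    print(projected[0].shape)  # (100000, 2): points ready for an abstract scatterplot view
```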

 

4. Databases and Storage

The advent of cloud services and applications has profoundly affected the extreme-scale database and storage field. The popular Apache Hadoop framework supports applications that work on exabytes of data stored in public clouds. Many online vendors, including Facebook, Google, eBay, and Yahoo, have developed extreme-scale data applications based on Hadoop.

The exabyte-data challenge is real, and it demands attention. A cloud-based solution might not meet the needs of data analysis challenges such as those posed by datasets from the US Department of Energy (DOE) scientific-discovery community. The cost per petabyte of cloud storage remains significantly higher than that of hard-drive storage on a private cluster. In addition, the latency and throughput of cloud databases are still limited by network bandwidth.

Finally, not all cloud storage systems support the ACID (atomicity, consistency, isolation, and durability) requirements of distributed storage; with Hadoop, these requirements must be addressed at the software application layer. Extreme-scale databases are thus both a hardware and a software problem.
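One common application-layer workaround when the store itself gives no atomicity guarantee is to stage a write to a temporary object and rename it into place. The sketch below is our own illustration of that pattern, using the local filesystem as a stand-in for an HDFS-style store; readers see either the old record or the new one, never a partially written file.

```python
import json
import os
import tempfile

def atomic_write_json(path, record):
    """Publish a record atomically: stage to a temp file, then rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(record, f)
            f.flush()
            os.fsync(f.fileno())      # make the bytes durable before publishing
        os.replace(tmp_path, path)    # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

if __name__ == "__main__":
    atomic_write_json("summary.json", {"rows": 1_000_000, "status": "complete"})
```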

 

5. Algorithms

Traditional VA algorithms were generally not designed with scalability in mind. Consequently, many of them are either computationally expensive or unable to produce clear information that people can easily digest. Moreover, most algorithms assume a postprocessing model in which all the data is readily available in memory or on a local disk. We must develop algorithms that address both data size and visualization efficiency, and we need to introduce new visual representations and user interactions. In addition, user preferences must be incorporated into automated learning so that the visual output is highly adaptive.

When a visualization algorithm has an extremely large search space of control parameters, automated algorithms that can organize and narrow that search space will be decisive in minimizing the effort of data analysis and exploration.
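To show what dropping the all-in-memory assumption can look like in practice, here is a minimal streaming sketch (our own, not from the article). Summary statistics are accumulated one block at a time, so memory use stays roughly constant no matter how large the dataset grows; the helper `iter_blocks` is hypothetical and stands in for whatever chunked reader a real system would provide.

```python
import numpy as np

def iter_blocks(path, block_rows=1_000_000, dtype=np.float32, cols=8):
    """Yield fixed-size blocks of a flat binary file without loading it whole (hypothetical reader)."""
    data = np.memmap(path, dtype=dtype, mode="r").reshape(-1, cols)
    for start in range(0, data.shape[0], block_rows):
        yield np.asarray(data[start:start + block_rows])

def streaming_summary(blocks):
    """Single pass over the blocks; only a handful of per-column scalars stay in memory."""
    count, total, lo, hi = 0, None, None, None
    for block in blocks:
        s = block.sum(axis=0)
        total = s if total is None else total + s
        lo = block.min(axis=0) if lo is None else np.minimum(lo, block.min(axis=0))
        hi = block.max(axis=0) if hi is None else np.maximum(hi, block.max(axis=0))
        count += block.shape[0]
    return total / count, lo, hi   # per-column mean, min, max
```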

 

 

6. Data Movement, Data Transport, and Network Infrastructure

As the cost of computing power continues to fall, data movement is quickly becoming the most expensive component of the VA pipeline. As data sources become more geographically dispersed and the data itself becomes extremely large, applications will need to move data more often, making the problem even more challenging.

Computational-science simulation has led the way in using HPC systems to tackle extreme-scale problems. One challenge on HPC systems such as parallel computers is using the network efficiently.

The computational-science community has devoted itself to this challenge: the Message Passing Interface (MPI) standard, together with high-quality implementations of it, forms the basis of extreme-scale simulation codes. Similarly, as VA computations execute on ever larger systems, we must develop algorithms and software that use network resources effectively and provide convenient abstractions that let VA experts explore their data productively.
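The MPI style mentioned above can be mimicked for analytics: each rank reduces its local partition to a compact summary, and only those summaries cross the network. Below is a minimal sketch (ours, assuming mpi4py and an MPI launcher are available); the raw arrays never move.

```python
# Run with, for example: mpiexec -n 4 python global_histogram.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank owns a local partition of the data (a random stand-in here).
local = np.random.default_rng(rank).normal(size=1_000_000)

# Reduce locally first: 64 bin counts instead of a million samples.
counts, _edges = np.histogram(local, bins=64, range=(-5.0, 5.0))

# Only the small summaries travel over the network.
global_counts = np.zeros_like(counts)
comm.Reduce(counts, global_counts, op=MPI.SUM, root=0)

if rank == 0:
    print("total samples:", int(global_counts.sum()))
```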

 

7. Uncertainty Quantification

Uncertainty quantification has long been important in many science and engineering disciplines, dating back to when most data came from experimental measurements. Understanding the sources of uncertainty in data is essential for decision-making and risk analysis. As data sizes continue to increase, our ability to process entire datasets will be limited. Many analysis tasks will rely on subsampling the data to overcome real-time constraints, which introduces even more uncertainty.

Uncertainty quantification and visualization will be essential in future data analysis tools. We must develop analytical techniques that can accommodate incomplete data.

Many algorithms must be redesigned to take data distributions into account.

Novel visualization techniques must provide intuitive views that help users understand the risks, so that they can choose appropriate parameters and minimize the chance of producing misleading results. Uncertainty quantification and visualization will likely become a core element of nearly every VA task.
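As a concrete example of quantifying the extra uncertainty that subsampling introduces, the sketch below (ours, not the authors') repeats the estimate over many small subsamples and reports an interval that a visualization could encode, for instance as error bars or a shaded band.

```python
import numpy as np

def subsample_estimate_with_uncertainty(data, sample_frac=0.01, n_repeats=200, seed=0):
    """Estimate the mean from small subsamples and report a 95% interval across repeats."""
    rng = np.random.default_rng(seed)
    k = max(1, int(len(data) * sample_frac))
    estimates = np.array([rng.choice(data, size=k, replace=False).mean()
                          for _ in range(n_repeats)])
    lo, hi = np.percentile(estimates, [2.5, 97.5])
    return estimates.mean(), (lo, hi)

if __name__ == "__main__":
    full = np.random.default_rng(42).lognormal(size=5_000_000)
    est, (lo, hi) = subsample_estimate_with_uncertainty(full)
    print(f"estimate = {est:.4f}, 95% interval = [{lo:.4f}, {hi:.4f}]")
```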

 

 

8. Parallelism

To cope with the sheer size of data, parallel processing can effectively reduce the turnaround time for visual computing and hence enable interactive data analytics. Future computer architectures will likely have significantly more cores per processor. In the meantime, the amount of memory per core will shrink, and the cost of moving data within the system will increase. Large-scale parallelism will likely be available even on desktop and laptop computers.

To fully exploit the upcoming pervasive parallelism, many VA algorithms must be completely redesigned. Not only is a larger degree of parallelism likely, but also new data models will be needed, given the per-core memory constraint. The distinction between task and data parallelism will be blurred; a hybrid model will likely prevail. Finally, many parallel algorithms might need to perform out-of-core if their data footprint overwhelms the total memory available to all computing cores and they can't be divided into a series of smaller computations.
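A small sketch of the hybrid, chunked style described above (our illustration; `analyze_chunk` is a hypothetical per-block kernel): the data is split into pieces that respect a per-core memory budget, a process pool applies the kernel to each piece in parallel (data parallelism), and a final merge step combines the partial results as a task of its own.

```python
import numpy as np
from multiprocessing import Pool

def analyze_chunk(chunk):
    """Per-block kernel: reduce one chunk to a small partial result (hypothetical analysis)."""
    return chunk.min(), chunk.max(), chunk.sum(), chunk.size

def merge(partials):
    """Task that combines per-chunk partial results into a global answer."""
    lo = min(p[0] for p in partials)
    hi = max(p[1] for p in partials)
    mean = sum(p[2] for p in partials) / sum(p[3] for p in partials)
    return lo, hi, mean

if __name__ == "__main__":
    data = np.random.default_rng(0).normal(size=8_000_000)
    chunks = np.array_split(data, 32)          # keep each piece within per-core memory
    with Pool(processes=4) as pool:            # data parallelism across cores
        partials = pool.map(analyze_chunk, chunks)
    print(merge(partials))                     # final merge task
```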


 

 

9. Domain and Development Libraries, Frameworks, and Tools

The lack of affordable resource libraries, frameworks, and tools hinders the rapid R&D of HPC-based VA applications. These problems are common in many application areas, including UIs, databases, and visualization, which all are critical to VA system development. Even software development basics such as post-C languages or debugging tools are lacking on most, if not all, HPC platforms. Unsurprisingly, many HPC developers are still using printf() as a debugging tool.

The lack of resources is especially frustrating for scientific-domain users. Many popular visualization and analytics software tools for desktop computers are too costly or unavailable on HPC platforms. Developing customized software is always an option but remains costly and time-consuming. This is a community-wide challenge that can be addressed only by the community itself, which brings us to the final challenge.

 

10. Social, Community, and Government Engagements

Two major communities in the civilian world are investing in R&D for extreme-scale VA. The first is government, which supports the solution of scientific-discovery problems through HPC. The second is online-commerce vendors, who are trying to use HPC to tackle their increasingly difficult online data management problems.

The final challenge is for these two communities to jointly provide leadership to disseminate their extreme-scale-data technologies to society at large. For example, they could influence hardware vendors to develop system designs that meet the community's technical needs—an issue we mentioned earlier. They must engage the academic community to foster future development talent in parallel computation, visualization, and analytics technologies. They must not just develop problem-solving technologies but also provide opportunities for society to access these technologies through, for example, the Web. These goals require the active engagement of the technical community, society, and the government.

 

Discussion

The previous top-10 list echoes challenges previously described in a 2007 DOE workshop report.6 Compared to that report, our list concentrates particularly on human cognition and user interaction issues raised by the VA community. It also pays increased attention to database issues found in both public clouds and private clusters. In addition, it focuses on both scientific and nonscientific applications with data sizes reaching exabytes and beyond.

Although we face daunting challenges, we're not without hope. The arrival of multithreaded desktop computing is imminent. Significantly more powerful HPC systems that require less cooling and consume less electricity are on the horizon. Although we accept the reality that there is no Moore's law for human cognitive abilities, opportunities exist for significant progress and improvement in areas such as visualization, algorithms, and databases. Both industry and government have recognized the urgency and invested significant R&D in various extreme-scale-data areas. Universities are expanding their teaching curricula in parallel computation to educate a new generation of college graduates to embrace the world of exabyte data and beyond.

The technology world is evolving, and new challenges are emerging even as we write. But we proceed with confidence that many of these challenges can be either solved or alleviated significantly in the near future.

 

Acknowledgments

The article benefited from a discussion with Pat Hanrahan. We thank John Feo, Theresa-Marie Rhyne, and the anonymous reviewers for their comments. This research has been supported partly by the US Department of Energy (DOE) Office of Science Advanced Scientific Computing Research under award 59172, program manager Lucy Nowell; DOE award DOE-SC0005036, Battelle Contract 137365; DOE SciDAC grant DE-FC02-06ER25770; the DOE SciDAC Visualization and Analytics Center for Enabling Technologies; DOE SciDAC grant DEAC02-06CH11357; US National Science Foundation grant IIS-1017635; and the Pfizer Corporation. Battelle Memorial Institute manages the Pacific Northwest National Laboratory for the DOE under contract DEAC06-76R1-1830.

 

References

1. Wong PC, Thomas J. Visual Analytics. IEEE Computer Graphics and Applications. 2004; 24(5):20–21. [PubMed: 15628096]

2. Thomas, JJ.; Cook, KA., editors. Illuminating the Path—the Research and Development Agenda for Visual Analytics. IEEE CS; 2005.

3. Swanson, B. The Coming Exaflood. The Wall Street J. Jan 20, 2007. www.discovery.org/a/3869

4. Wong, PC.; Shen, H-W.; Chen, C. Top Ten Interaction Challenges in Extreme-Scale Visual Analytics. In: Dill, J., et al., editors. Expanding the Frontiers of Visual Analytics and Visualization. Springer; 2012. p. 197–207.

5. ASCR Research: Scientific Discovery through Advanced Computing (SciDAC). US Dept. of Energy; Feb 15, 2012. http://science.energy.gov/ascr/research/scidac

6. Johnson, C.; Ross, R. Visualization and Knowledge Discovery: Report from the DOE/ASCR Workshop on Visual Analysis and Data Exploration at Extreme Scale. US Dept. of Energy; Oct 2007. http://science.energy.gov/~/media/ascr/pdf/program-documents/docs/Doe_visualization_report_2007.pdf

 

Biographies

Pak Chung Wong is a project manager and chief scientist in the Pacific Northwest National Laboratory's Computational and Statistical Analytics Division. Contact him at pak.wong@pnnl.gov.

Han-Wei Shen is an associate professor in Ohio State University's Computer Science and Engineering Department. Contact him at hwshen@cse.ohio-state.edu.

Christopher R. Johnson is a Distinguished Professor of Computer Science and the director of the Scientific Computing and Imaging Institute at the University of Utah. Contact him at crj@sci.utah.edu.

Chaomei Chen is an associate professor in Drexel University's College of Information Science and Technology. Contact him at chaomei.chen@drexel.edu.

Robert B. Ross is a senior fellow of the Computation Institute at the University of Chicago and Argonne National Laboratory. Contact him at rross@cs.anl.gov.

 
