Information Visualization

HUT_Tyne265

Last modified: 2024-01-14 02:13:08

Tags: information visualization, artificial intelligence, machine learning

First published: 2023-05-05 20:19:36

Article link: https://blog.csdn.net/weixin_52297290/article/details/130496955

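The ability to take data (to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it) is going to be a hugely important skill in the next decades, … because now we really do have essentially free and ubiquitous data. So the scarce factor is the ability to understand that data and extract value from it.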
Vocabulary: geometric（几何的）, blueprint（蓝图）, seismographs（地震仪）

Overview

The nature of data（数据的本质）: data types, tasks and basic depictions
Design spaces
Bertin and the Semiology（符号学） of Graphics
Perception, design and evaluation
The visualization pipeline
Designs for graphs, trees, and multidimensional data
Historical examples of good and bad design

Five different data types

• item: an object
• link: relationship between items
• attribute: property of an item
• position: a location in 2D or 3D space
• grid: regular sampling of continuous data

“Running” example

Hill running in Scotland
Runners take part in races
Races are held annually
• item: a runner
• link: two runners train together (“run-buddies”)
• attribute: a runner belongs to a club
• position: the start point of a race
• grid: a runner’s heartbeat sampled every 30s
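
As a rough illustration (not from the lecture), the five data types in the running example could be modelled in Python as below. The class names, field names and every value are made-up assumptions; only the 30 s sampling interval comes from the notes.

```python
from dataclasses import dataclass, field

SAMPLE_INTERVAL_S = 30  # grid spacing from the notes: one heartbeat sample every 30 s


@dataclass
class Runner:  # item: an object
    name: str
    club: str  # attribute: a property of the item
    heartbeat: list[int] = field(default_factory=list)  # grid: regular samples


@dataclass
class Race:
    name: str
    start: tuple[float, float]  # position: a location in 2D space (hypothetical x, y)


# All names and numbers below are invented for illustration.
runners = [
    Runner("Runner A", "Club X", heartbeat=[120, 134, 141]),
    Runner("Runner B", "Club Y"),
]
run_buddies = [(0, 1)]  # link: runners 0 and 1 train together
race = Race("Example Hill Race", start=(12.5, 48.0))


def sample_time_s(index: int) -> int:
    """Time of the index-th heartbeat sample, in seconds from the start."""
    return index * SAMPLE_INTERVAL_S
```

Here the link is stored as a pair of indices into the item list, which is one simple way to keep relationships separate from the items themselves.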

Four different data set types

• A data set type is a method for collecting data together
– table: rows and columns (2D or multidimensional（多维）)
– networks and trees: relationships between items
– fields: continuous data (conceptually there is an infinite number of measurements you could take, so sampling and extrapolation are necessary)
– geometry: spatial data

Summary

• Data types: nature of the data (5)
– items, attributes, links, positions, grids（网格）
• Data set types: how the data is arranged (4)
– tables, networks, fields, geometry（几何学）
• When the data is available (2)
– static, dynamic
• Attributes: properties of the data (2)
– categorical, ordered (ordinal（序数）, quantitative（定量）)
• Direction: ways of ordering (3)
– sequential（顺序的）, diverging（发散的）, cyclic（循环的）

Three different actions

• Given a visualisation of a data set, a user can:
– Analyse:
• consume or produce
– Search:
• location/target is known/unknown?
– Query:
• find specific information

The Analyse action

• Consuming: user simply accesses the data using the visualisation
• to discover information not known before
• to present information to another person
• enjoy and have fun
• Producing: user actively creates something
• annotations（注释） of the data or the visualisation
• a persistent record of a visualisation (or aspects thereof)
• derive new data based on existing data

Running example: Analyse

• Discover: did anyone win both the TBHR and the Highland Fling in 2010?
• Present: here are the first Swandling club finishers for the TBHR in 2012
• Enjoy: gosh - I had no idea that so many people liked running up and down hills!
• Annotate: Mary Smith is the same person as Mary Bernados
• Record: this chart on my wall shows how much faster I have become in the Ben Lomond race over the past ten years
• Derive（衍生，推导）: calculate the percentage of active women in each club in each year

The Search action

Locating targets of interest in the visualisation
• Lookup: target known & location known (where and what)
• Browse: target unknown & location known (where)
• Locate: target known & location unknown (what)
• Explore: target unknown & location unknown
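
The four search kinds are just the four combinations of two yes/no questions, which a small sketch makes explicit. The function name `search_kind` is an assumption for illustration.

```python
def search_kind(target_known: bool, location_known: bool) -> str:
    """Name the search action from what the user already knows."""
    if target_known and location_known:
        return "lookup"   # knows what and where
    if location_known:
        return "browse"   # knows where, not what
    if target_known:
        return "locate"   # knows what, not where
    return "explore"      # knows neither
```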

Running example: Search

• Lookup: what position did John Thomas (JT) come in? (4)
• Browse: who won the race? (SB)
• Locate: did CG run this year? (no)
• Explore: is there any noticeable pattern? (no)

The Query action

Once you have found the data you are interested in, what will you do with it?
– Identify: get all the information about it
– Compare: differences between more than one data item
– Summarise: produce an overview of more than one data item

Running example: Query

• Identify: What club was the TBHR 2015 winner from?
• Compare: Was ND faster than DF?
• Summarise: Of the first ten finishers, three were women

Targets

• Targets are the things of interest in a visualisation
• Targets are not necessarily just the individual data points (although this is common)
– for all data: trends, outliers（异常值）, features
– for attributes: distributions, dependencies, correlations, similarities
– for network data: topology（拓扑学）, paths
– for spatial data: shape

Depicting Quantitative Data

Dimensionality: data about running clubs

• Univariate（单变量）: only one variable describes the data
– number of members in each club
• Bivariate（双变量）: two variables describe the data
– number of male and female members in each club
• Tri-variate（三变量）: three variables describe the data
– number of men, women, average race finishing position for the club
• Multivariate（多变量）: more than three variables
– number of men, women, membership fees, colour, founding year, average race finishing position

The data

Club name: categorical
although note that an alphabetic ordering may be imposed, making the data ordered (ordinal)
Number of members: ordered quantitative
Number of women: ordered quantitative
Number of men: ordered quantitative
Membership fees: ordered quantitative
Colour: categorical
Founding year: ordered quantitative
Average race finishing position: ordered quantitative
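
A minimal sketch tying the attribute types above to the dimensionality terms: the dictionary copies the list from the notes, while the names `ATTRIBUTE_TYPES` and `dimensionality` are assumptions for illustration.

```python
# Attribute types as listed in the notes for the running-club data.
ATTRIBUTE_TYPES = {
    "club name": "categorical",
    "number of members": "quantitative",
    "number of women": "quantitative",
    "number of men": "quantitative",
    "membership fees": "quantitative",
    "colour": "categorical",
    "founding year": "quantitative",
    "average race finishing position": "quantitative",
}


def dimensionality(variables: list[str]) -> str:
    """Name the dimensionality from the number of variables describing the data."""
    n = len(variables)
    if n < 1:
        raise ValueError("need at least one variable")
    names = {1: "univariate", 2: "bivariate", 3: "trivariate"}
    return names.get(n, "multivariate")  # more than three variables
```

For example, `dimensionality(["number of men", "number of women"])` gives `"bivariate"`.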

Tri-variate: Heat maps

• Typically two (independent) categorical variables, and a quantitative variable
• The categories are on the two axes
• The quantitative value is represented by change in colour value
– typically: ‘darker’ = ‘more’
• The order of the categories on each axis can be changed (and may be important for identification of patterns)
• Each cell has only one value
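
The heat-map rules above can be sketched without any plotting library: two categorical variables index the cells, and the quantitative value becomes a grey level where darker means more. The function name, the grey-level range (0 = black, 255 = white) and the finisher counts are assumptions for illustration.

```python
def heatmap_grid(records, rows, cols):
    """Return a grid of grey levels (0 = black, 255 = white); darker = more.

    records: iterable of (row_category, col_category, value) triples.
    rows, cols: the category order along each axis (reordering them changes
    which patterns are visible, but not the cell values).
    """
    values = {(r, c): v for r, c, v in records}  # each cell holds exactly one value
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    return [
        [round(255 * (1 - (values[(r, c)] - lo) / span)) for c in cols]
        for r in rows
    ]


# Hypothetical counts of finishers per club and race.
records = [
    ("Club X", "TBHR", 6), ("Club X", "Ben Lomond", 10),
    ("Club Y", "TBHR", 0), ("Club Y", "Ben Lomond", 2),
]
grid = heatmap_grid(records, rows=["Club X", "Club Y"], cols=["TBHR", "Ben Lomond"])
```

With these numbers the largest count (10) maps to black (0) and the smallest (0) to white (255).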

Multivariate: Parallel（平行） coordinates（座标）

• Each vertical axis is a dimension, with its values equally spaced along it
• The dimensions are arranged, equally spaced, horizontally
• A single data point is a line that joins its values on each dimension
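
To make the construction concrete, here is a small sketch that turns one data row into the polyline it would trace across the axes. The function name, the drawing size and the value ranges are assumptions; real tools add axis labels, ordering controls and so on.

```python
def polyline(row, dims, ranges, width=300.0, height=100.0):
    """Return the (x, y) points of one data row in a parallel-coordinates plot.

    dims: dimension names in left-to-right axis order.
    ranges: {dimension: (min, max)} used to scale each vertical axis.
    """
    gap = width / (len(dims) - 1) if len(dims) > 1 else 0.0  # equal axis spacing
    points = []
    for i, dim in enumerate(dims):
        lo, hi = ranges[dim]
        t = (row[dim] - lo) / (hi - lo) if hi != lo else 0.5  # 0 = bottom, 1 = top
        points.append((i * gap, t * height))
    return points


# Hypothetical club described by three of the variables from the notes.
dims = ["men", "women", "fees"]
ranges = {"men": (0, 100), "women": (0, 100), "fees": (0, 50)}
club = {"men": 50, "women": 25, "fees": 50}
line = polyline(club, dims, ranges)
```

Changing the order of `dims` changes which pairs of axes sit next to each other, and therefore which relationships are easy to see.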

Design space

Design is all about making decisions
“What is common among design spaces is that they make design decisions explicit, summarize what is possible, and what is under-explored.”

Design space: definition

• Each decision is a dimension
• Each dimension has a range of values
• Each design is a point in n-dimensional space
• Dimensions may interact with each other
• Constraints may indicate that some of the space is not available
• Some areas of the space might be preferable to others
Design justification explains why one particular point has been chosen instead of another
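
One way to picture this definition is to enumerate a tiny design space in code. The dimensions, their values and the constraint below are all invented for illustration; a real design space would have many more dimensions.

```python
from itertools import product

# Each decision is a dimension with a range of values (hypothetical examples).
dimensions = {
    "chart": ["bar", "line", "pie"],
    "order": ["alphabetical", "by value"],
    "colour": ["single hue", "one hue per category"],
}


def allowed(design: dict) -> bool:
    """Illustrative constraint only: rule out pie charts ordered by value."""
    return not (design["chart"] == "pie" and design["order"] == "by value")


# Each design is one point in the n-dimensional space.
designs = [dict(zip(dimensions, combo)) for combo in product(*dimensions.values())]
available = [d for d in designs if allowed(d)]
```

The full space has 3 × 2 × 2 = 12 points, and the constraint removes two of them. Design justification is then the argument for picking one of the remaining points.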

Design is all about choices

Choices
– which data to present
– which visualisation method to use
– what order to present the data categories in the visualisation
– what colours, what fonts, size…

Multiple dimensions and rationale（理由）

• In any design there will be a very large number of decisions to make
• Each decision represents a dimension in multi-dimensional space
• We can’t draw more than two dimensions!
• Parallel coordinates is a common visualisation method for high-dimensional data
• We can use it to visualise our design space…

Parallel coordinates

• Used for visualising multidimensional data
• Each dimension (decision) is represented as a vertical axis, with its values equally spaced along it
• The dimensions are arranged horizontally, equally spaced
• A single data point is a line that joins its values on each dimension

Questions, Options, Criteria (QOC notation)

A more formal way of representing the Design Choice and Design Rationale
– Questions: the key issues/choices of the design
– Options: possible answers to the questions
– Criteria: reasons for arguing for or against the options
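
A QOC record can be sketched as a question plus a table of options against criteria. The example question and criteria are assumptions chosen for illustration, and the +1/-1 tally is only one crude way to compare options; it is not part of the notation itself.

```python
question = "Which chart should show run time per test?"

# For each option: +1 if a criterion argues for it, -1 if it argues against it.
assessments = {
    "bar chart": {"shows magnitudes reliably": +1, "implies no false order": +1},
    "line chart": {"shows magnitudes reliably": +1, "implies no false order": -1},
    "pie chart": {"shows magnitudes reliably": -1, "implies no false order": +1},
}


def best_option(assessments: dict) -> str:
    """Pick the option with the highest net support across all criteria."""
    return max(assessments, key=lambda option: sum(assessments[option].values()))
```

In practice the criteria carry the design rationale, so the recorded reasons matter more than the numeric tally.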

Design process

• What are the design decisions?
• Which combinations are
– possible
– impossible
– relevant
– preferable
– under-explored (gap-detection)
• Which options best satisfy our criteria?

Jacques Bertin: The Semiology of Graphics

Semiotics (in brief)

• Visualisation facilitates communication between people
• Visualisation therefore is a visual language
• Like all languages, it has tokens (words, signs) and rules describing how the tokens can legitimately（合法地） be combined (syntax)
Semiotics is the study of signs and how they convey meaning

The nature of signs

Signs can be:
– symbols: there is no perceptual relationship between the object and what it is meant to represent (arbitrary（任意地）)
– icons: there is a clear perceptual（知觉的） relationship between the object and what it is meant to represent (non-arbitrary)
“An absolute boundary between symbols and icons is illusory（虚幻的） because as soon as a symbol’s meaning has been learned it will become a meaningful image”

• Bertin defined a set of “visual variables”
• The various ways a visual object can be displayed (and therefore perceived)
• Independent of each other
• Reducing the map/visualisation into its constituent（成分） graphical symbols, for critical analysis

Bertin’s Visual Variables

• Location variables (position, relative to a coordinate frame)
– e.g. horizontal and vertical axes on a scatterplot（散点图）; longitude and latitude on a map
– (so fundamental to presenting map information that these variables are often ignored in cartography)
• Retinal（视网膜） variables (perceptual properties)
– ways of representing differences between objects
– size, shape, colour (hue), colour (value), texture, orientation

This separation makes clear the difference between the spatial relationships between symbols and the perceptual properties of the symbols themselves
• Location variables
– fix a ‘graphic mark’ (symbol, visual object) on to a position on the plane
• Retinal variables
– ‘elevate’ that mark with a different ‘pattern of light’

The Six Retinal Variables

• Shape: (e.g. square, circle, star)
• Size: (e.g. measured in mm or pixels)
• Orientation: angle of the most prominent axis in the symbol to the coordinate axes (e.g. 36°, 218°)
• Texture: spacing between repeated elements of a symbol (e.g. fine, coarse)
• Hue（色调）: colour, as associated with wavelength (e.g. blue,green, turquoise)
• Value: depth of colour, as associated with ink density and represented by greyscale (e.g. red ink with low value will be perceived as pink)

Using the variables

Unordered (colour hue, orientation, shape, texture)
for nominal information: apples, oranges, pears
Ordered, non-quantitative (colour value)
for ordinal information: rainfall map
Ordered, quantitative (location, size)
for numerical information: electricity usage
(also good for non-quantitative and nominal information given their visual dominance)
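
The guidance above amounts to a lookup from the kind of information to the visual variables that suit it. The names `SUITABLE` and `suitable_variables` are assumptions; the table contents follow the notes.

```python
# Visual variables grouped by the level of information they suit (from the notes).
SUITABLE = {
    "nominal": ["colour hue", "orientation", "shape", "texture"],
    "ordinal": ["colour value"],
    "quantitative": ["location", "size"],
}


def suitable_variables(level: str) -> list[str]:
    """List the visual variables suited to nominal, ordinal or quantitative data."""
    variables = list(SUITABLE[level])
    if level != "quantitative":
        # Location and size are visually dominant, so they also work for
        # non-quantitative and nominal information.
        variables += SUITABLE["quantitative"]
    return variables
```

For example, `suitable_variables("ordinal")` returns colour value followed by location and size.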

Extensions to Bertin’s Visual Variables

• Morrison (1974)
– colour saturation（饱和）, arrangement
– particularly for cartographic（制图的） purposes
• MacEachren (1995)
– crispness, resolution, transparency
– variations enabled by digital manipulation (see Roth for details)

Perceptual model

Three levels to perceiving a scene:

Level 1: processing low-level properties (parallel)
Level 2: pattern recognition (sequential)
Level 3: target-oriented search (sequential)
• Level 1:
– rapid, parallel extraction of features
– e.g. edges, orientation, colour, texture, movement
– bottom-up, data-driven
– pre-attentive, held very briefly
• Level 2:
– slow, serial detection of patterns
– e.g. contours, regions
– combination of bottom-up and top-down
– needs attention, uses memory (working and long-term)
• Level 3:
– slow, serial identification of objects
– e.g. a handle to turn, a data point to focus on
– related to action, purpose, concentration
– uses memory

Topics in Visual Perception

• Level 1 (bottom-up)
– pre-attention
– colour
• Level 2 (bottom-up & top-down)
– pattern identification
– Gestalt laws
• Level 3 (top-down)
– object identification
• Interference between levels

Pre-attention experiments

• Stimulus:
– one unique target amongst several identical distractors
– the target represents a feature (or features) that is absent in the distractors
• Task: identify the target
• Data collected: response time
If the response time does not depend on the number of distractors, the feature is pre-attentive
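
The decision rule can be sketched as a slope test: fit a line to response time against the number of distractors and check whether it is close to flat. The 5 ms-per-item threshold and the response times are assumptions for illustration, not values from the lecture.

```python
def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def looks_preattentive(distractors, times_ms, max_ms_per_item=5.0):
    """True if response time barely grows with the number of distractors.

    max_ms_per_item is an assumed tolerance, chosen only for this sketch.
    """
    return slope(distractors, times_ms) <= max_ms_per_item


# Hypothetical measurements for two different features.
counts = [4, 8, 16, 32]
flat_times = [452, 448, 455, 450]      # roughly constant: parallel search
serial_times = [480, 600, 840, 1320]   # grows by 30 ms per extra distractor
```

Here `looks_preattentive(counts, flat_times)` is true, while the second feature fails the test because each extra distractor adds to the search time.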

Level 1: Colour

• Objective measures
– Hue
• the colour itself
– Saturation（饱和度）
• intensity of the colour
• intense vs dull
– Lightness/Value
• light vs dark
• varying amounts of black or white in the colour
• Subjective assessment
– Brightness (Luminance?)

Observations

• Only eight colours, plus white, consistently named
– green, yellow, orange, red, aqua（水色）, pink, purple, blue, white
• The pure monitor ‘red’ was named orange most of the time
• Data obtained with a black background; different results expected with white background

Some rules on colour

• Less is more!
• Don’t use blue for thin lines, rather use it for large areas
• Use red and green in the center of the field of view
• Use black, white, yellow in the periphery（周边）
• For large regions, don’t use highly saturated colours
• Don’t use adjacent colours that vary in the amount of blue
• Use colour for grouping and to assist search
• Use a neutral（中性的） tone to encode the number 0
• Positive and negative numbers should be encoded with the saturation of contrary colours (e.g. red/green; purple/yellow; blue/orange)
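
Two of the rules above (a neutral tone for 0, and saturation of contrary colours for sign) can be sketched with the standard `colorsys` module. The blue/orange hue angles, the fixed brightness and the function name are assumptions for illustration.

```python
import colorsys


def diverging_colour(x, limit, pos_hue=30 / 360, neg_hue=210 / 360, value=0.8):
    """Map x in [-limit, limit] to an (r, g, b) triple with 0-255 channels.

    0 maps to a neutral grey; saturation grows with |x|, using an orange hue
    for positive numbers and the opposite (blue) hue for negative numbers.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    saturation = min(abs(x) / limit, 1.0)
    hue = pos_hue if x >= 0 else neg_hue
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, value)
    return tuple(round(255 * channel) for channel in (r, g, b))


neutral = diverging_colour(0, limit=10)
positive = diverging_colour(10, limit=10)
negative = diverging_colour(-10, limit=10)
```

For production work a tested palette such as those from ColorBrewer.org is usually a safer choice than hand-picked hues.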
• Errors in contrast can be avoided by drawing boundaries around selected areas
• Also: see ColorBrewer.org

Level 2: Pattern identification

Level 2: ‘interim’（临时的） level using bottom-up and top-down processing
– bottom-up: uses the actual features that are physically perceived
– top-down: uses other contextual information – e.g. from the environment, from memory

Level 2: Gestalt laws

Rules describing how we see patterns in a visual display
In particular, how we see how visual objects form groups
– proximity（接近）
Elements that are physically close together are perceptually grouped together
– similarity
Similar elements tend to be grouped together
So: use different colours to encode rows/columns in a grid data set
– connectedness
Elements connected by lines form groups
So: use lines to show relationships between objects
– continuity
We perceive elements as smooth and continuous (rather than with abrupt changes in direction)
Consider continuity when showing overlapping objects
– symmetry
Symmetric elements tend to be grouped together
So: use symmetry to make pattern comparisons easier
– closure（闭合）
Contours with gaps tend to be perceptually ‘closed’
So: put related information in a closed contour（等高线） – defined by line, colour or texture
– figure and ground
Small areas tend to be seen as ‘figure’
Context may affect figure/ground interpretation
So: Use closure, symmetry, layout etc. to ensure objects will be perceived as figures, not ground.
– common fate
Things that move together are grouped together
• Ware (2021) provides example design principles for each

Level 3: Object identification

Top-down identification of objects
Often led by a query, task or intention
Supported by memory and context
Interaction between bottom-up and top-down processing

Summary

Types of interaction

• Filtering (dynamic queries): only show me the data I am interested in [F,Yi,Sh,K]
• Selecting (highlighting items): mark or track items I am interested in [F,Yi]
• Abstract & Elaborate（详细说明） (zoom): show me more or less detail [F,Yi,K]
“Filter by navigation” results in loss or gain of information
• Overview & Explore (distortion（失真）) / Focus & Context (exposing details（暴露细节）): overview first, zoom and filter, details on demand [F,Sh,K]
• Connect/Relate (multiple views（多个视图）; linking and brushing): show me how this data is related [F,Yi,Sh,K]
• Reconfigure (data choice/dimension order): show me a different arrangement of the data [F,Yi,K]
• Encode: show me a different representation of the data [F,Yi]
• Switch between views of the same data
– e.g. scatterplot to clustered bar chart
• Change visual variables
– e.g. colour, shape, line width
• Extraction of features: allow me to extract data that interests me [F,Sh]
• History: allow me to retrace the steps I take [F,Sh]
• Participation/Collaboration: allow me to contribute to the data [F]
• Gamification: show me the data in a more playful way [F]

Exercises

A programmer records their visualisation program’s performance with a number of random samples from an input data set, noting (for each test) the sample size, the metric used, the run time, and a unique identifier for the test, e.g. (10000, Manhattan, 23.3, 1) for the first test. How best to describe the recorded data?
A: The data could be treated as a table, with each row being a tuple of three items: two sequential quantitative items and one categorical item.
B: The data could be treated as a table, with each row being a tuple of four items: three sequential quantitative items, and one categorical item.
C: The data could be treated as a table, with each row being a tuple of four items: two sequential quantitative items, and two categorical items.
Explanation: Your answer is correct. Answer: C
Note that the example identifier (1) may look like a quantitative（定量的） value, but it could be anything later on, e.g. 1a, 1b. It is best treated as categorical. This also makes it a bad idea to use the identifier as a row index, as in the choice that mentions each row having three items.
The correct answer is: The data could be treated as a table, with each row being a tuple of four items: two sequential quantitative items and two categorical items.
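
A short sketch of the answer to the first exercise: each test is a four-item row, and the identifier is stored as text so that later identifiers such as 1a remain valid. The class name, field names and the second row are assumptions for illustration; the first row comes from the exercise.

```python
from typing import NamedTuple


class RunRecord(NamedTuple):
    sample_size: int  # sequential quantitative
    metric: str       # categorical
    run_time: float   # sequential quantitative
    test_id: str      # categorical: text, not a number, so "1a" is allowed


ATTRIBUTE_KINDS = {
    "sample_size": "quantitative",
    "metric": "categorical",
    "run_time": "quantitative",
    "test_id": "categorical",
}

table = [
    RunRecord(10000, "Manhattan", 23.3, "1"),   # row from the exercise
    RunRecord(20000, "Euclidean", 51.0, "1a"),  # hypothetical later test
]
```

Keeping `test_id` as an ordinary column, rather than using it as a row index, keeps all four items visible in every row.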

A programmer runs ten tests of their visualisation system, recording run time and data set size for each test (along with an identifier for each test). If they were to make a chart of the run time for each test, to look for patterns and outliers, which of the following would be best?
A: A line chart connecting 10 dots, with the dots spaced evenly along the x-axis, and run time shown as the y coordinate of each dot
B: A bar chart with identifiers spaced evenly along the x-axis, and run time as the height of the bar above each identifier
C: A pie chart, with ten segments, one segment per test. Each segment’s area matches the proportion of the total of all ten run times.
Explanation: Your answer is correct. Answer: B
Since identifier is not an ordered attribute (unlike data set size or run time), it does not make sense to use a line chart to show trend.
Area of a pie chart segment is not a reliable way to show magnitudes. (Angle is better, but not great.)
The correct answer is: A bar chart with identifiers spaced evenly along the x-axis, and run time as the height of the bar above each identifier

A programmer has collected data on run time and data set size from 100 tests of their program. They now plan to make a scatterplot with data set size as the x-axis, and run time as the y-axis. Which of the following best describes the design choice for the mark on the scatterplot, made for each test?
A: Each mark would be a small dot, and the identifier would be encoded as the colour hue of that dot
B: Each mark would be a small dot, and the identifier would be encoded as the colour value of that dot
C: It is not a good idea to encode the identifier for each test into a visual variable for each mark
Explanation: Your answer is correct. Answer: C
If there had been a small number of tests, then colour hue might be a good choice of variable to use, but there are too many tests.
Colour value is not a good choice, even for fairly small numbers, as identifier is a nominal value (and value implies an order that does not exist).
The correct answer is: It is not a good idea to encode the identifier for each test into a visual variable for each mark

A programmer runs 50 pairs of tests of their dimensional reduction program, comparing two algorithms (A and B) for each of 50 different data set sizes. This size (N) is increased in regular steps, e.g. 100, 200, 300, 400, etc. For each test, they record the run time and the error. Which of the following would be the best to show the results in a scatterplot of run time (x-axis) against error (y-axis)?
A: Each pair of tests is connected with fine lines. All data points from A have the same colour, which is clearly different to the colour used for the data points from B.
B: Each pair of tests is aligned vertically. All data points from A have the same colour, which is clearly different to the colour used for the data points from B.
C: Tests from A have a different hue to those from B. For each set of 50 data points from one program, colour saturation rises in 50 steps as N increases.
Explanation: Your answer is correct. Answer: A
Using fine connecting lines is not perfect, but will generally show pairs of tests for the same value of N clearly, as connectedness is a very powerful perceptual cue. It would be easy to add line highlighting on mouseover etc.
Using 50 levels of saturation will show rough trends with rising N but, with so many levels of saturation, it will also mean that it is hard to see which tests came from the same N. The programmer would have to hunt around to find matching tests, or to create some other additional tool to filter them.
Aligning pairs vertically means that the correct value for run time is not being shown, unless a pair of points has exactly the same run time (which is unlikely).