The nature of data: data types, tasks and basic depictions
Design spaces
Bertin and the Semiology of Graphics
Perception, design and evaluation
The visualization pipeline
Designs for graphs, trees, and multidimensional data
Historical examples of good and bad design


Five different data types

• item: an object
• link: relationship between items
• attribute: property of an item
• position: a location in 2D or 3D space
• grid: regular sampling of continuous data

“Running” example

Hill running in Scotland
Runners take part in races
Races are held annually
• item: a runner
• link: two runners train together (“run-buddies”)
• attribute: a runner belongs to a club
• position: the start point of a race
• grid: a runner’s heartbeat sampled every 30s
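The five data types in the running example can be sketched as simple record types. This is only an illustration: the class names, field names, club name, coordinates and heart-rate readings below are all made up and are not part of the original example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Runner:
    """Item: a runner. `club` is an attribute of that item."""
    name: str
    club: str
    # Grid: heart rate sampled at a regular 30 s interval (values are made up).
    heart_rate_bpm: List[int] = field(default_factory=list)


@dataclass(frozen=True)
class RunBuddies:
    """Link: a relationship between two runners."""
    first: str
    second: str


@dataclass(frozen=True)
class RaceStart:
    """Position: a location in 2D space (latitude, longitude)."""
    race: str
    latitude: float
    longitude: float


# Hypothetical data, for illustration only.
ann = Runner("Ann", "Hypothetical Harriers", heart_rate_bpm=[92, 118, 131])
bob = Runner("Bob", "Hypothetical Harriers")
buddies = RunBuddies(ann.name, bob.name)
start = RaceStart("Example Hill Race", 56.0, -4.0)

# With a regular grid, the sample index alone gives the time of each reading.
sample_times_s = [30 * i for i in range(len(ann.heart_rate_bpm))]
```

Because the grid is regular, the sampling times do not need to be stored; they follow from the index and the 30 s interval.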

Four different data set types

• A data set type is a method for collecting data together
– table: rows and columns (2D or multidimensional)
– networks and trees: relationships between items
– fields: continuous data (conceptually there are an infinite number of measurements you could take, so sampling and extrapolation are necessary)
– geometry: spatial data


• Data types: nature of the data (5)
– items, attributes, links, positions, grids
• Data set types: how the data is arranged (4)
– tables, networks, fields, geometry
• When the data is available (2)
– static, dynamic
• Attributes: properties of the data (2)
– categorical, ordered (ordinal, quantitative)
• Direction: ways of ordering (3)
– sequential, diverging, cyclic

Three different actions

• Given a visualisation of a data set, a user can:
– Analyse:
• consume or produce
– Search:
• location/target is known/unknown?
– Query:
• find specific information

The Analyse action

• Consuming: user simply accesses the data using the visualisation
• to discover information not known before
• to present information to another person
• enjoy and have fun
• Producing: user actively creates something
• annotations of the data or the visualisation
• a persistent record of a visualisation (or aspects thereof)
• derive new data based on existing data

Running example: Analyse

• Discover: did anyone win both the TBHR and the Highland Fling in 2010?
• Present: here are the first Swandling club finishers for the TBHR in 2012
• Enjoy: gosh - I had no idea that so many people liked running up and down hills!
• Annotate: Mary Smith is the same person as Mary Bernados
• Record: this chart on my wall shows how much faster I have become in the Ben Lomond race over the past ten years
• Derive: calculate the percentage of active women in each club in each year

The Search action

Locating targets of interest in the visualisation
• Lookup: target known & location known (where and what)
• Browse: target unknown & location known (where)
• Locate: target known & location unknown (what)
• Explore: target unknown & location unknown
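The four search actions are the four combinations of two yes/no questions, which a small lookup table makes explicit. The function name `search_kind` is made up for this sketch.

```python
# (target known?, location known?) -> name of the search action
SEARCH_KINDS = {
    (True, True): "lookup",
    (False, True): "browse",
    (True, False): "locate",
    (False, False): "explore",
}


def search_kind(target_known: bool, location_known: bool) -> str:
    """Classify a search by what the user already knows."""
    return SEARCH_KINDS[(target_known, location_known)]
```

For example, `search_kind(True, False)` returns `"locate"`: the user knows what they want but not where it is.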

Running example: Search

• Lookup: what position did John Thomas (JT) come in? (4)
• Browse: who won the race? (SB)
• Locate: did CG run this year? (no)
• Explore: is there any noticeable pattern? (no)

The Query action

Once you have found the data you are interested in, what will you do with it?
– Identify: get all the information about it
– Compare: differences between more than one data item
– Summarise: produce an overview of more than one data item

Running example: Query

• Identify: What club was the TBHR 2015 winner from?
• Compare: Was ND faster than DF?
• Summarise: Of the first ten finishers, three were women


• Targets are the things of interest in a visualisation
• Targets are not necessarily just the individual data points (although this is common)
– for all data: trends, outliers, features
– for attributes: distributions, dependencies, correlations, similarities
– for network data: topology, paths
– for spatial data: shape

Depicting Quantitative Data

Dimensionality: data about running clubs

• Univariate: only one variable describes the data
– number of members in each club
• Bivariate: two variables describe the data
– number of male and female members in each club
• Tri-variate: three variables describe the data
– number of men, women, average race finishing position for the club
• Multivariate: more than three variables
– number of men, women, membership fees, colour, founding year, average race finishing position

The data

Club name: categorical
although note that an alphabetic ordering may be imposed, making the data ordered ordinal
Number of members: ordered quantitative
Number of women: ordered quantitative
Number of men: ordered quantitative
Membership fees: ordered quantitative
Colour: categorical
Founding year: ordered quantitative
Average race finishing position: ordered quantitative

Tri-variate: Heat maps

• Typically two (independent) categorical variables, and a quantitative variable
• The categories are on the two axes
• The quantitative value is represented by change in colour value
– typically: ‘darker’ = ‘more’
• The order of the categories on each axis can be changed (and may be important for identification of patterns)
• Each cell has only one value
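A minimal sketch of the data behind a heat map, using only the standard library and no plotting. The club names, years and counts are invented, and `build_grid` and `grey_level` are hypothetical helper names. The sketch shows the two categorical axes, one value per cell, the "darker = more" mapping, and that reordering the categories only rearranges the rows.

```python
def build_grid(records, row_order, col_order):
    """Return a 2D list of values; a cell with no record holds None."""
    lookup = {(row, col): value for row, col, value in records}
    return [[lookup.get((row, col)) for col in col_order] for row in row_order]


def grey_level(value, vmin, vmax):
    """Map a value to a grey level 0..255 (0 = black), so darker means more."""
    if vmax == vmin:
        return 128  # arbitrary mid-grey when all values are equal
    fraction = (value - vmin) / (vmax - vmin)
    return round(255 * (1 - fraction))


# Hypothetical data: (club, year, number of finishers).
finishers = [
    ("Club A", "2011", 4), ("Club A", "2012", 9),
    ("Club B", "2011", 0), ("Club B", "2012", 6),
]
years = ["2011", "2012"]
grid = build_grid(finishers, ["Club A", "Club B"], years)
reordered = build_grid(finishers, ["Club B", "Club A"], years)
shades = [[grey_level(v, 0, 9) for v in row] for row in grid]
```

The largest count (9) maps to black and the smallest (0) to white. Trying different row or column orders is cheap, which matters because the ordering can reveal or hide patterns.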

Multivariate: Parallel coordinates

• Each vertical axis is a dimension, with its values equally spaced along it
• The dimensions are arranged, equally spaced, horizontally
• A single data point is a line that joins its values on each dimension
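The construction can be sketched as a function that turns each record into a polyline. Each dimension is scaled to the range 0 to 1 on its own axis, and the axes sit at x = 0, 1, 2, and so on. The per-axis min-max scaling is an assumption of this sketch, and the club figures are invented.

```python
def polylines(rows, dims):
    """Return one polyline per row: a list of (axis_index, scaled_value)."""
    low = {d: min(row[d] for row in rows) for d in dims}
    high = {d: max(row[d] for row in rows) for d in dims}

    def scale(d, value):
        span = high[d] - low[d]
        # If every row has the same value, place it mid-axis.
        return 0.5 if span == 0 else (value - low[d]) / span

    return [[(x, scale(d, row[d])) for x, d in enumerate(dims)] for row in rows]


# Hypothetical clubs, for illustration only.
clubs = [
    {"men": 10, "women": 20, "fee": 15},
    {"men": 30, "women": 10, "fee": 25},
]
lines = polylines(clubs, ["men", "women", "fee"])
```

Here the two polylines cross between each pair of neighbouring axes, because the first club has the lowest value for men and fee but the highest for women.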

Design space

Design is all about making decisions
“What is common among design spaces is that they make design decisions explicit, summarize what is possible, and what is under-explored.”

Design space: definition

• Each decision is a dimension
• Each dimension has a range of values
• Each design is a point in n-dimensional space
• Dimensions may interact with each other
• Constraints may indicate that some of the space is not available
• Some areas of the space might be preferable to others
Design justification explains why one particular point has been chosen instead of another
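As a rough sketch of this definition, a design space can be enumerated as the Cartesian product of the dimensions' values, with a constraint removing the part of the space that is not available. The three dimensions and the single constraint below are hypothetical examples, not a complete design space.

```python
from itertools import product

# Each decision is a dimension; each dimension has a range of values.
dimensions = {
    "chart": ["bar", "line", "pie"],
    "x_attribute": ["identifier", "data set size"],
    "colour": ["single hue", "hue per club"],
}


def allowed(design):
    """Hypothetical constraint: a line chart needs an ordered x attribute."""
    return not (design["chart"] == "line" and design["x_attribute"] == "identifier")


def design_space(dimensions, constraint):
    """Yield every design (one point in the space) that satisfies the constraint."""
    names = list(dimensions)
    for values in product(*dimensions.values()):
        design = dict(zip(names, values))
        if constraint(design):
            yield design


designs = list(design_space(dimensions, allowed))
```

Of the 3 × 2 × 2 = 12 points, the constraint removes the two that pair a line chart with the unordered identifier, leaving 10 candidate designs to choose between and justify.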

Design is all about choices

– which data to present
– which visualisation method to use
– what order to present the data categories in the visualisation
– what colours, what fonts, size…

Multiple dimensions and rationale

• In any design there will be a very large number of decisions to make
• Each decision represents a dimension in multi-dimensional space
• We can’t draw more than two dimensions!
• Parallel coordinates is a common visualisation method for high-dimensional data
• We can use it to visualise our design space…

Parallel coordinates

• Used for visualising multidimensional data
• Each dimension (decision) is represented as a vertical axis, with its values equally spaced along it
• The dimensions are arranged horizontally, equally spaced
• A single data point is a line that joins its values on each dimension

Questions Options Criteria (QOC notation)

A more formal way of representing the Design Choice and Design Rationale
– Questions: the key issues/choices of the design
– Options: possible answers to the questions
– Criteria: reasons for arguing for or against the options
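One way to write a QOC analysis down is as a small nested structure. QOC itself only records whether each criterion argues for or against an option, so the +1/-1 values and the summing in `rank_options` are assumptions of this sketch. The question, options and criteria are illustrative.

```python
qoc = {
    "question": "Which chart should show run time for each of ten tests?",
    "criteria": [
        "no false ordering of identifiers",
        "magnitudes are easy to read",
    ],
    # +1: the criterion argues for the option; -1: it argues against.
    "options": {
        "line chart": {
            "no false ordering of identifiers": -1,
            "magnitudes are easy to read": +1,
        },
        "bar chart": {
            "no false ordering of identifiers": +1,
            "magnitudes are easy to read": +1,
        },
        "pie chart": {
            "no false ordering of identifiers": +1,
            "magnitudes are easy to read": -1,
        },
    },
}


def rank_options(qoc):
    """Order options by how many criteria argue for them (assumed scoring)."""
    def score(option):
        return sum(qoc["options"][option][c] for c in qoc["criteria"])
    return sorted(qoc["options"], key=score, reverse=True)


ranking = rank_options(qoc)
```

Writing the assessments out this way records the rationale along with the choice. A real analysis might weight the criteria rather than count them equally.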

Design process

• What are the design decisions?
• Which combinations are
– possible
– impossible
– relevant
– preferable
– under-explored (gap-detection)
• Which options best satisfy our criteria?

Jacques Bertin: The Semiology of Graphics


Semiotics (in brief)

• Visualisation facilitates communication between people
• Visualisation therefore is a visual language
• Like all languages, it has tokens (words, signs) and rules describing how the tokens can legitimately be combined (syntax)
Semiotics is the study of signs and how they convey meaning

The nature of signs

Signs can be:
– symbols: there is no perceptual relationship between the object and what it is meant to represent (arbitrary)
– icons: there is a clear perceptual relationship between the object and what it is meant to represent (non-arbitrary)
“An absolute boundary between symbols and icons is illusory because as soon as a symbol’s meaning has been learned it will become a meaningful image”

• Bertin defined a set of “visual variables”
• The various ways a visual object can be displayed (and therefore perceived)
• Independent of each other
• Reducing the map/visualisation into its constituent graphical symbols, for critical analysis

Bertin’s Visual Variables

Location variables (position, relative to a coordinate frame)
– e.g. horizontal and vertical axes on a scatterplot; longitude and latitude on a map
– (so fundamental to presenting map information that these variables are often ignored in cartography)
Retinal variables (perceptual properties)
– ways of representing differences between objects
– size, shape, colour (hue), colour (value), texture, orientation

This separation makes clear the difference between the spatial relationships between symbols and the perceptual properties of the symbols themselves
• Location variables
– fix a ‘graphic mark’ (symbol, visual object) on to a position on the plane
• Retinal variables
– ‘elevate’ that mark with a different ‘pattern of light’

The Six Retinal Variables

• Shape: (e.g. square, circle, star)
• Size: (e.g. measured in mm or pixels)
• Orientation: angle of most prominent axis in the symbol to the coordinate axes (e.g. 36°, 218°)
• Texture: spacing between repeated elements of a symbol (e.g. fine, coarse)
• Hue: colour, as associated with wavelength (e.g. blue, green, turquoise)
• Value: depth of colour, as associated with ink density and represented by greyscale (e.g. red ink with low value will be perceived as pink)

Using the variables

Unordered (colour hue, orientation, shape, texture)
for nominal information: apples, oranges, pears
Ordered, non-quantitative (colour value)
for ordinal information: rainfall map
Ordered, quantitative (location, size)
for numerical information: electricity usage
(also good for non-quantitative and nominal information given their visual dominance)
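The guidance above can be written as a lookup from the kind of information to the visual variables that suit it. The function name `candidate_variables` is made up for this sketch. Location and size are listed last for nominal and ordinal data because the notes describe them as "also good" there.

```python
# Visual variables by the kind of information they suit, as listed above.
SUITABLE = {
    "nominal": ["colour hue", "orientation", "shape", "texture"],
    "ordinal": ["colour value"],
    "quantitative": ["location", "size"],
}


def candidate_variables(level: str) -> list:
    """Return the visual variables worth considering for a kind of information."""
    if level not in SUITABLE:
        raise ValueError(f"unknown level: {level!r}")
    candidates = list(SUITABLE[level])
    if level != "quantitative":
        # Location and size are visually dominant, so they also work here.
        candidates += SUITABLE["quantitative"]
    return candidates
```

For example, `candidate_variables("nominal")` never includes colour value, because value implies an order that nominal data does not have.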

Extensions to Bertin’s Visual Variables

• Morrison (1974)
– colour saturation, arrangement
– particularly for cartographic purposes
• MacEachren (1995)
– crispness, resolution, transparency
– variations enabled by digital manipulation (see Roth for details)

Perceptual model

Three levels to perceiving a scene:

Level 1: processing low-level properties (parallel)
Level 2: pattern recognition (sequential)
Level 3: target-oriented search (sequential)
• Level 1:
– rapid, parallel extraction of features
– e.g. edges, orientation, colour, texture, movement
– bottom-up, data-driven
– pre-attentive, held very briefly
• Level 2:
– slow, serial detection of patterns
– e.g. contours, regions
– combination of bottom-up and top-down
– needs attention, uses memory (working and long-term)
• Level 3:
– slow, serial identification of objects
– e.g. a handle to turn, a data point to focus on
– related to action, purpose, concentration
– uses memory

Topics in Visual Perception

• Level 1 (bottom-up)
– pre-attention
– colour
• Level 2 (bottom-up & top-down)
– pattern identification
– Gestalt laws
• Level 3 (top-down)
– object identification
• Interference between levels

Pre-attention experiments

• Stimulus:
– one unique target amongst several identical distractors
– the target represents a feature (or features) that is absent in the distractors
• Task: identify the target
• Data collected: response time
If the response time does not depend on the number of distractors, the feature is pre-attentive
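The decision rule can be sketched as a least-squares slope of response time against the number of distractors. The response times are invented, and the 5 ms-per-item threshold is an arbitrary assumption of this sketch, not a value from the notes.

```python
def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def looks_preattentive(n_distractors, response_ms, max_slope_ms_per_item=5.0):
    """True if response time barely changes as distractors are added.

    The threshold is an assumed cut-off, chosen only for illustration.
    """
    return abs(slope(n_distractors, response_ms)) <= max_slope_ms_per_item


counts = [4, 8, 16, 32]
flat_times = [450, 455, 448, 452]       # made-up data: roughly constant
rising_times = [500, 700, 1100, 1900]   # made-up data: 50 ms per extra item
```

With these numbers, the flat series has a slope close to zero and passes the test, while the rising series grows by 50 ms per distractor and fails it.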

Level 1: Colour

• Objective measures
– Hue
• the colour itself
– Saturation
• intensity of the colour
• intense vs dull
– Lightness/Value
• light vs dark
• varying amounts of black or white in the colour
• Subjective assessment
– Brightness (Luminance?)


• Only eight colours, plus white, consistently named
– green, yellow, orange, red, aqua, pink, purple, blue, white
• The pure monitor ‘red’ was named orange most of the time
• Data obtained with a black background;
different results expected with white background

Some rules on colour

• Less is more!
• Don’t use blue for thin lines, rather use it for large areas
• Use red and green in the center of the field of view
• Use black, white, yellow in the periphery
• For large regions, don’t use highly saturated colours
• Don’t use adjacent colours that vary in the amount of blue
• Use colour for grouping and to assist search
• Use a neutral tone to encode the number 0
• Positive and negative numbers should be encoded with the saturation of
contrary colours (e.g. red/green; purple/yellow; blue/orange)
• Errors in contrast can be avoided by drawing boundaries around selected areas
• Also: see
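As a rough illustration of two of the rules above (a neutral tone for 0, and the saturation of contrary colours for positive and negative numbers), here is a sketch using the standard `colorsys` module. The blue/orange pairing follows one of the example pairs in the list. The exact hue angles, the fixed brightness of 0.9 and the function name `diverging_rgb` are assumptions.

```python
import colorsys


def diverging_rgb(x, max_abs, pos_hue=240 / 360, neg_hue=30 / 360, value=0.9):
    """Encode x as an (r, g, b) tuple of ints in 0..255.

    0 maps to a neutral grey; saturation grows with |x|, in blue for
    positive numbers and orange for negative ones (assumed hues).
    """
    if max_abs <= 0:
        raise ValueError("max_abs must be positive")
    saturation = min(abs(x) / max_abs, 1.0)
    hue = pos_hue if x >= 0 else neg_hue
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, value)
    return tuple(round(255 * c) for c in (r, g, b))


zero = diverging_rgb(0, 10)
positive = diverging_rgb(10, 10)
negative = diverging_rgb(-10, 10)
```

At zero saturation the hue is irrelevant, so 0 comes out as the same grey on both sides of the scale.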

Level 2: Pattern identification

Level 2: ‘interim’ level using bottom-up and top-down processing
– bottom-up: uses the actual features that are physically perceived
– top-down: uses other contextual information – e.g. from the
environment, from memory

Level 2: Gestalt laws

Rules describing how we see patterns in a visual display
In particular, how we see how visual objects form groups
– proximity
Elements that are physically close together are perceptually grouped together
– similarity
Similar elements tend to be grouped together
So: use different colours to encode rows/columns in a grid data set
– connectedness
Elements connected by lines form groups
So: use lines to show relationships between objects
– continuity
We perceive elements as smooth and continuous (rather than with abrupt change in direction)
Consider continuity when showing overlapping objects
– symmetry
Symmetric elements tend to be grouped together
So: use symmetry to make pattern comparisons easier
– closure
Contours with gaps tend to be perceptually ‘closed’
So: put related information in a closed contour, defined by line, colour or texture
– figure and ground
Small areas tend to be seen as ‘figure’
Context may affect figure/ground interpretation
So: Use closure, symmetry, layout etc. to ensure objects will be perceived as figures, not ground.
– common fate
Things that move together are grouped together
• Ware (2021) provides example design principles for each

Level 3: Object identification

Top-down identification of objects
Often led by a query, task or intention
Supported by memory and context
Interaction between bottom-up and top-down processing



Types of interaction

• Filtering (dynamic queries): only show me the data I am interested in [F,Yi,Sh,K]
• Selecting (highlighting items): mark or track items I am interested in [F,Yi]
• Abstract & Elaborate (zoom): show me more or less detail [F,Yi,K]
“Filter by navigation” results in loss or gain of information
• Overview & Explore (distortion) / Focus & Context (exposing details): overview first, zoom and filter, details on demand [F,Sh,K]
• Connect/Relate (multiple views; linking and brushing): show me how this data is related [F,Yi,Sh,K]
• Reconfigure (data choice/dimension order): show me a different arrangement of the data [F,Yi,K]
• Encode: show me a different representation of the data [F,Yi]
• Switch between views of the same data
– e.g. scatterplot to clustered bar chart
• Change visual variables
– e.g. colour, shape, line width
• Extraction of features: allow me to extract data that interests me [F,Sh]
• History: allow me to retrace the steps I take [F,Sh]
• Participation/Collaboration: allow me to contribute to the data [F]
• Gamification: show me the data in a more playful way [F]
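Filtering by dynamic query is the simplest of these to sketch: the user sets a range per attribute and only the matching items stay visible. The function name `dynamic_query`, the attribute names and the test records are all hypothetical.

```python
def dynamic_query(rows, **ranges):
    """Keep rows whose value for every named attribute lies in [low, high]."""
    return [
        row for row in rows
        if all(low <= row[name] <= high for name, (low, high) in ranges.items())
    ]


# Made-up test results, for illustration only.
tests = [
    {"id": "1", "size": 100, "run_time": 1.2},
    {"id": "2", "size": 200, "run_time": 2.9},
    {"id": "3", "size": 300, "run_time": 9.5},
]
visible = dynamic_query(tests, size=(150, 300), run_time=(0.0, 5.0))
```

In an interactive tool the ranges would come from sliders, and the view would redraw from `visible` each time a slider moves.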


A programmer records their visualisation program’s performance with a number of random samples from an input data set, noting (for each test) the sample size, the metric used, the run time, and a unique identifier for the test, e.g. (10000, Manhattan, 23.3, 1) for the first test. How best to describe the recorded data?
A: The data could be treated as a table, with each row being a tuple of three items: two sequential quantitative items and one categorical item.
B: The data could be treated as a table, with each row being a tuple of four items: three sequential quantitative items, and one categorical item.
C: The data could be treated as a table, with each row being a tuple of four items: two sequential quantitative items, and two categorical items.
Explanation: Your answer is correct. Answer: C
Note that the example identifier (1) may look like a quantitative value, but it could be anything later on, e.g. 1a, 1b. It is best treated as categorical. This also makes it a bad idea to use the identifier as a row index, as in the choice that mentions each row having three items.
The correct answer is: The data could be treated as a table, with each row being a tuple of four items: two sequential quantitative items and two categorical items.
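The reasoning in this answer can be made concrete by typing the example tuple. The class and field names are made up for this sketch. Storing the identifier as a string is one way to stop it being treated as a number.

```python
from typing import NamedTuple


class RunRecord(NamedTuple):
    sample_size: int   # ordered, quantitative (sequential)
    metric: str        # categorical
    run_time: float    # ordered, quantitative (sequential); units not stated
    identifier: str    # categorical, even when it looks like a number


ATTRIBUTE_TYPES = {
    "sample_size": "quantitative",
    "metric": "categorical",
    "run_time": "quantitative",
    "identifier": "categorical",
}

first = RunRecord(10000, "Manhattan", 23.3, "1")
later = RunRecord(10000, "Manhattan", 24.1, "1a")  # hypothetical later test

n_quantitative = sum(1 for t in ATTRIBUTE_TYPES.values() if t == "quantitative")
n_categorical = sum(1 for t in ATTRIBUTE_TYPES.values() if t == "categorical")
```

Each row is a four-item tuple with two quantitative and two categorical attributes, and an identifier such as `"1a"` fits without any change to the schema.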

A programmer runs ten tests of their visualisation system, recording run time and data set size for each test (along with an identifier for each test). If they were to make a chart of the run time for each test, to look for patterns and outliers, which of the following would be best?
A: A line chart connecting 10 dots, with the dots spaced evenly along the x-axis, and run time shown as the y coordinate of each dot
B: A bar chart with identifiers spaced evenly along the x-axis, and run time as the height of the bar above each identifier
C: A pie chart, with ten segments, one segment per test. Each segment’s area matches the proportion of the total of all ten run times.
Explanation: Your answer is correct. Answer: B
Since identifier is not an ordered attribute (like data set size or run time), it does not make sense to use a line chart to show trend.
Area of a pie chart segment is not a reliable way to show magnitudes. (Angle is better, but not great.)
The correct answer is:
A bar chart with identifiers spaced evenly along the x-axis, and run time as the height of the bar above each identifier

A programmer has collected data on run time and data set size from 100 tests of their program. They now plan to make a scatterplot with data set size as the x-axis, and run time as the y-axis. Which of the following best describes the design choice for the mark on the scatterplot, made for each test?
A: Each mark would be a small dot, and the identifier would be encoded as the colour hue of that dot
B: Each mark would be a small dot, and the identifier would be encoded as the colour value of that dot
C: It is not a good idea to encode the identifier for each test into a visual variable for each mark
Your answer is correct. If there had been a small number of tests, then colour hue might be a good choice of variable to use, but there are too many tests.
Colour value is not a good choice, even for fairly small numbers, as identifier is a nominal value (and value implies an order that does not exist).
The correct answer is: It is not a good idea to encode the identifier for each test into a visual variable for each mark

A programmer runs 50 pairs of tests of their dimensional reduction program, comparing two algorithms (A and B) for each of 50 different data set sizes. This size (N) is increased in regular steps, e.g. 100, 200, 300, 400, etc. For each test, they record the run time and the error. Which of the following would be the best to show the results in a scatterplot of run time (x-axis) against error (y-axis)?
A: Each pair of tests is connected with fine lines. All data points from A have the same colour, which is clearly different to the colour used for the data points from B.
B: Each pair of tests is aligned vertically. All data points from A have the same colour, which is clearly different to the colour used for the data points from B.
C: Tests from A have a different hue to those from B. For each set of 50 data points from one program, colour saturation rises in 50 steps as N increases.
Your answer is correct.
Using fine connecting lines is not perfect, but will generally show pairs of tests for the same value of N clearly, as connectedness is a very powerful perceptual cue. It would be easy to add line highlighting on mouseover etc.
Using 50 levels of saturation will show rough trends with rising N but, with so many levels of saturation, it will also mean that it is hard to see which tests came from the same N. The programmer would have to hunt around to find matching tests, or to create some other additional tool to filter them.
Aligning pairs vertically means that the correct value for run time is not being shown, unless a pair of points has exactly the same run time (which is unlikely).





