This article is based on my college course, Data Visualization. I share my course notes along with the techniques and implementation methods I use in data visualization projects. On the technical side, I mainly work in R, analyzing CSV files with libraries such as ggplot2.

This article is still being updated.

Data Abstraction & Task Abstraction

  • “What”: What data are we visualizing? This concerns data structure and attribute types.
  • “Why”: Why are we visualizing it? This concerns the cognitive/analytical task we want viewers to complete.

Data abstraction (“What”)

  • Data is not only tabular; it can also be network data and spatial data.
  • Tabular data: each row is an item; each column is an attribute (dimension).
  • Network data: nodes and links (for example, social networks); trees are a special case of networks.
  • Spatial data: includes spatial locations and geometry (for example, geographic boundaries), common in geovis and scientific vis.
  • Attribute types affect both analysis and visual encoding:
  • Categorical: no inherent order (for example, city names).
  • Ordinal: ordered but not arithmetic (for example, S/M/L sizes).
  • Quantitative: numeric values where arithmetic operations are meaningful.
  • Special structures: cyclic/periodic, diverging (centered distribution), sequential (one-direction order).
  • Different attribute types require different analysis and chart choices. For example, means apply to quantitative data, while rank ordering applies to ordinal data.
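To make the last point concrete, here is a minimal sketch in plain Python (standard library only). The three toy columns and the `ordinal_median` helper are hypothetical and exist only for illustration; the point is that each attribute type supports a different kind of summary.

```python
from statistics import mean, mode

# Hypothetical toy columns, one per attribute type.
cities = ["Toronto", "Ottawa", "Toronto", "Montreal"]  # categorical
sizes = ["S", "L", "M", "M", "L"]                      # ordinal
prices = [12.5, 8.0, 15.25, 10.25]                     # quantitative

# The order of an ordinal attribute must be declared; it is not in the strings.
SIZE_ORDER = ["S", "M", "L"]

def ordinal_median(values, order):
    """Median of ordinal values: sort by rank and take the middle item."""
    ranked = sorted(values, key=order.index)
    return ranked[len(ranked) // 2]

most_common_city = mode(cities)                    # categorical: counting is meaningful
typical_size = ordinal_median(sizes, SIZE_ORDER)   # ordinal: ranking is meaningful
average_price = mean(prices)                       # quantitative: arithmetic is meaningful

print(most_common_city, typical_size, average_price)
```

Taking the mean of the size labels would be meaningless, which is why the attribute type has to be settled before choosing a summary or a chart.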

Task abstraction (“Why”)

  • The goal of visualization is to support cognition and analysis, not just to produce a visually pleasing chart.
  • Task intent should be explicit. Typical tasks include:
  • Compare
  • Find extrema
  • Characterize distributions
  • Identify outliers
  • Explore correlations
  • Discover trends
  • Derive values (for example, trade balance from imports/exports)
  • A practical convention is verb + target (for example, “compare trends”, “identify outliers”).
  • Task descriptions can be low-level and domain-agnostic, or high-level and domain-specific.

Important note on “derive” tasks

  • If a value can be derived (for example, balance), the visualization should compute and show it directly instead of forcing viewers to do mental arithmetic.
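As a small illustration of a derive task, the sketch below computes a trade balance before anything is plotted. The figures and the `with_balance` helper are made up for this example.

```python
# Hypothetical yearly trade figures (in billions); the values are invented.
trade = [
    {"year": 2021, "exports": 52.0, "imports": 48.5},
    {"year": 2022, "exports": 50.0, "imports": 55.0},
    {"year": 2023, "exports": 60.5, "imports": 60.0},
]

def with_balance(rows):
    """Return new rows with a derived `balance` column (exports minus imports)."""
    return [{**row, "balance": row["exports"] - row["imports"]} for row in rows]

derived = with_balance(trade)
for row in derived:
    print(row["year"], row["balance"])
```

Plotting `balance` directly lets viewers read surplus and deficit from one position instead of comparing two lines by eye.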

Design takeaway

  • Before designing a chart, first define what data/attributes you have, then define why (the analysis goal). Both determine the right visual design.
  • Good visualization minimizes cognitive load and helps users complete target tasks directly.

Marks & Channels

Basic concepts

  • Marks are geometric primitives: point (0D), line (1D), and area (2D). 3D volumetric marks are less common.
  • Channels are visual properties used to encode data attributes: position, color, size (length/area/volume), shape, orientation, and so on.

Roles

  • Marks typically correspond to data items (for example, one dot or one bar per item).
  • Channels map data attributes to visual form and carry the encoded meaning.

Channel categories and data matching

  • Identity channels express category membership (“what it is”), suitable for categorical data (for example, hue, shape).
  • Magnitude channels express amount (“how much”), suitable for ordinal and quantitative data (for example, position, length, luminance, area).

Two principles of visual encoding

  • Expressiveness principle: encode all and only the intended data semantics.
  • Effectiveness principle: map important attributes to more perceptually accurate channels.

Typical effectiveness ranking (high to low)

  • Position
  • Length
  • Angle / slope
  • Area
  • Luminance / saturation
  • Hue
  • Shape / texture
  • Lower-ranked channels are harder to compare precisely.

Perception notes

  • Human perception of length is close to linear; perception of area and brightness is more nonlinear.
  • These rankings are supported by perception studies (for example, Cleveland and McGill).

Design guidance

  • Use top-ranked channels for critical attributes whenever possible.
  • Avoid requiring precise comparisons on weak channels.
  • Example:
  • If charting product profit over time, map profit to position, time to another position axis, and product category to color.

Discriminability

  • Discriminability is how many distinct values users can reliably distinguish in one channel.
  • Every channel has limits (for example, too many colors become hard to differentiate).
  • Exceeding limits causes visual clutter and identification errors.

What affects discriminability?

  • Physical limits of the channel itself
  • Data cardinality (number of categories/levels)
  • Spatial arrangement (ordered vs. scattered)
  • Mark size
  • Context effects (nearby similar colors/sizes reduce distinguishability)

How to improve discriminability

  • Use alternative channels (for example, spatial grouping instead of only color).
  • Merge or cluster categories when appropriate.
  • Filter to keep only relevant categories.
  • Use faceting (small multiples).
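Merging categories can be sketched in a few lines. The example below keeps the most frequent categories and relabels the rest as "Other"; the `lump_categories` helper, the cutoff of three, and the sample labels are assumptions made for illustration.

```python
from collections import Counter

def lump_categories(labels, keep=3, other="Other"):
    """Keep the `keep` most frequent categories; relabel the rest as `other`."""
    top = {name for name, _ in Counter(labels).most_common(keep)}
    return [label if label in top else other for label in labels]

# Hypothetical transport modes with two rare categories.
labels = ["bus"] * 5 + ["car"] * 4 + ["bike"] * 3 + ["ferry", "tram"]
lumped = lump_categories(labels, keep=3)
print(Counter(lumped))
```

With five categories reduced to four, a color legend stays within what viewers can reliably tell apart.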

Visual Separability

  • Separability refers to whether viewers can attend to one channel on its own when multiple channels are combined.
  • Separable channels (such as position + color): can be attended to independently without interfering with each other.
  • Integral channels (such as height combined with width): tend to be perceived as a whole and are hard to attend to individually.

Separability design recommendations

  • If viewers need to analyze attributes independently, choose channel combinations with high separability.
  • If viewers should perceive attributes jointly as one whole pattern, choose integral channels.

Summary points

  • Marks are geometric elements; channels are the visual properties that encode data.
  • Visual encoding should match channels to attribute types and assign the most effective channels to the most important attributes.
  • Pay attention to the discriminability and separability of channels to avoid clutter and confusion.

Lesson 6

Story & Narrative

  • Story: all the events in the narrative, i.e. facts and figures. It is the “What”: the raw data points, people, places, and actions.
  • Narrative: the way these events are told. It is the “How”: how the author organizes, orders, paces, and shapes the events so that they are clear and engaging for the audience.
  • Traditional narrative guides (e.g. for literature or film) are insufficient for data- and visualization-based narratives. Therefore, this chapter proposes a set of “Narrative Design Patterns” to help creators design data-driven narratives.

Narrative Design Patterns

  • A narrative pattern is a low-level narrative device that serves a specific intent. Patterns can be used individually or in combination to give form to a story.
  • Patterns are not concrete implementations or code. They are abstract concepts, decoupled from the final medium (e.g. web, video, print).

18 patterns in five categories (Figure 5.2)

1. Argumentation

  • Intent: Persuade and convince.
  • Examples:
  • Compare: Present two or more data sets or visualizations side by side.
  • Concretize: Use concrete visual objects (such as the pictogram figures in ISOTYPE) to represent abstract data points or aggregate statistics.
  • Repetition: Show a phenomenon multiple times to build narrative rhythm or emphasize changes and differences.

2. Process (Flow)

  • Intent: Structure the sequencing, rhythm, and pace of information and arguments.
  • Examples:
  • Gradual Reveal: Reveal data elements step by step, eventually leading to the full picture and the final argument.
  • Speed-up/Slow-down: Change the narrative pace. Slowing down gives the audience time to think; speeding up can create an overwhelming sense of impact.

3. Framing

  • Intent: Determine how facts and events are perceived and understood; play with expectations.
  • Examples:
  • Familiarization: Create a familiar entry point that the audience can identify with (for example, the OECD’s Regional Well-Being website lets users look at their own region first).
  • Make-a-guess: Invite the audience to guess data or results (e.g. “You Draw it Yourself” from The New York Times).
  • Defamiliarization: Present familiar things in unexpected ways (for example, turning a map upside down).
  • Silent Data: Deliberately hide data, like a rest in music, so that the audience infers the missing information.

4. Empathy and Emotion

  • Intent: Enhance understanding, shift the reader’s perspective, or provoke a call for action.
  • Examples:
  • Breaking-the-fourth-wall: The narrator talks directly to the audience, breaking the “wall” between the stage and the audience.
  • Humans-behind-the-dots: Reveal the personal stories (names, photos, experiences) behind the data points through details-on-demand (for example, on click).

5. Engagement

  • Intent: Make the audience feel that they are part of the story or in control of it.
  • Examples:
  • Rhetorical Question: Ask the audience directly (such as “What if I told you…”) to stimulate thinking.
  • Call-to-action: A clear call to action at the end of the narrative.
  • Exploration: Typically at the end of a linear narrative, allowing the user to freely interact and explore the data.

Key discussion points

  • How to use them: The patterns can be used generatively (to help conceive and create new stories) and analytically (to help deconstruct and critique existing stories).
  • Three concepts of time: The authors distinguish three types of time:
  1. Authoring-time: the time when the story and narrative are created.
  2. Presentation-time: the time when the story is presented to or consumed by the audience.
  3. Data-time: the time represented by the data itself.
  • Audience: The choice of pattern depends heavily on the target audience (background knowledge, culture, willingness to interact, etc.).

L6: Exploration vs. Explanation


Feature | Exploration | Explanation
Centered on | Data | Humans
Philosophy | More is more | Less is more
Audience | Experts | Non-experts
Outputs | Insights | Messages
Environment | Lab setting | In the wild
Style | Lengthy, fuzzy | To the point, precise

  1. Data-driven Story vs. Narrative

This section reiterates and expands the definitions from the paper:

  • Data-driven story: facts, insights, and information. It also includes the underlying processes, such as data transformation, selection, and aggregation.
  • Data-driven narrative:
  • Provides context (e.g. people, importance, problem).
  • Explains the visualization itself.
  • Converses with the audience.
  • Conveys the core (take-home) message.

Classic story structure

The courseware presents the classic three-act dramatic structure, along with the goal of each stage:

  • Beginning: Introduction, background, questions.
  • Audience response: Curiosity.
  • Middle: Events, facts, relationships, surprises, discoveries, insights.
  • Audience response: Understanding.
  • End: Conclusion, solution, core message.
  • Audience response: Action (e.g. a call-to-action).

Courseware Part 2 (L6-storytelling-Part2.pdf)

This section delves into the specific structures, patterns, and genres that enable narrative.

  1. Narrative Structures Spectrum

The courseware (citing research by Segel & Heer) proposes a spectrum from “author-driven” to “reader-driven”:

  • Author-driven:
  • Heavy messaging.
  • No interactivity.
  • Linear.
  • Example: Traditional report or data video.
  • Reader-driven:
  • No messaging.
  • Free interactivity.
  • Example: Exploratory visualization tools.

  2. Three hybrid narrative structures

Most data stories fall somewhere between these two poles, and the courseware highlights three hybrid structures:

  1. Martini-glass Structure: Guide first (author-driven), then let the audience explore (reader-driven).
  2. Interactive Slideshow: An overall narrative structure (author-driven) that allows local exploration at each step (reader-driven).
  • Example: The “Climate Change Calculator” shown in the courseware.
  3. Drill-down Story: Largely reader-driven. Readers choose the parts they are interested in and explore them in depth.

Narrative Patterns

This part draws on the first PDF we looked at (the paper by Bach et al.), which treats patterns as tools for achieving narrative intent.

Examples given in the courseware include:

  • Contrast
  • Concretize, especially using ISOTYPE
  • Scales, which use references and analogies to make complex measurements understandable (such as showing the physical volume of a trillion dollars in $100 bills).
  • Repetition
  • Juxtaposition
  • Incorporating the audience

Seven storytelling genres

Finally, the courseware (citing again Segel & Heer) introduces seven common data storytelling “genres” or formats:

  1. Magazine Style
  2. Annotated Chart
  3. Partitioned Poster, also known as Infographic
  4. Flow Chart
  5. Comic Strip
  6. Slide Show
  7. Movie/Film/Video/Animation

The courseware also highlights data videos (such as “Inequality in America”), live presentations (such as those by Hans Rosling), and the emerging data comics as important genres.

Summary

These three documents (one paper and two sets of courseware) together provide a complete framework for data storytelling in STA313.

  1. Why: Exploration vs. Explanation (courseware 1)
  2. What: Story (data) vs. Narrative (telling) (courseware 1 and paper)
  3. How:
  • High-level structure: Author-driven vs. reader-driven, plus three hybrid structures
  • Underlying tools (patterns): 18 patterns in 5 groups (argumentation, flow, framing, empathy and emotion, engagement)
  • Final form (genre): Seven genres (magazine style, video, comics, etc.)

Lesson 7

Design for Information by Isabel Meirelles provides an in-depth exploration of the design principles and methods of thematic maps.

1. Core definition: Thematic Maps

  • Definition: A thematic map presents attribute data (quantitative and qualitative) on a base map.
  • Purpose: Its main purpose is not geographical navigation but the display of a specific theme, such as a social, political, economic, or cultural phenomenon, to reveal the patterns and frequencies of that phenomenon in geographic space.

2. Brief History

  • Thematic maps date from the second half of the 17th century.
  • The first thematic map: an isoline map drawn by the Englishman Edmond Halley in 1701, showing changes in the magnetic field.
  • The first modern statistical map: a choropleth map drawn by the Frenchman Charles Dupin in 1826, showing educational attainment in France.
  • The “Golden Age”: The mid-19th century was the golden age of innovation in graphical methods, driven largely by governments recognizing the importance of numerical information in planning for the welfare of the population.

3. Map Design: 3 Basic Areas

Making a data map involves three basic decisions:

  • projection
  • scale
  • symbolization

A. Projection

  • Definition: A mathematical transformation that converts the 3D surface of the Earth into a 2D plane.
  • Core problem: All projections distort angle, area, shape, distance, or direction.
  • Key trade-off: A projection cannot be both conformal (preserving angles and local shape) and equal-area (also called equivalent, preserving area) at the same time.
  • Example: The Mercator projection is conformal and suitable for navigation, but it severely exaggerates areas at high latitudes (for example, Alaska looks about the same size as Brazil, although Brazil is roughly 5 times larger), so it is poorly suited to comparing land areas.
  • Principle: On maps where densities over area need to be compared (such as dot distribution maps), using an equal-area projection is crucial.
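The size of the Mercator distortion can be estimated with a short calculation. On a spherical Mercator map, areas at latitude φ are inflated by a factor of 1/cos²(φ) relative to the equator. The sketch below evaluates that factor; the function name and the sample latitudes are illustrative assumptions, and real regions span a range of latitudes, so the numbers are only rough.

```python
import math

def mercator_area_inflation(lat_deg):
    """Area exaggeration relative to the equator on a spherical Mercator map."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2

for lat in (0, 30, 45, 60, 75):
    factor = mercator_area_inflation(lat)
    print(f"{lat:>2} deg: areas drawn about {factor:.1f}x too large")
```

At 60 degrees the factor is already 4, which is why high-latitude regions look so large next to equatorial ones.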

B. Scale

  • Definition: The degree of map reduction, i.e. the ratio of a distance on the map to the corresponding distance on Earth.
  • Conceptual distinction:
  • Large scale (e.g. 1:10,000): Shows more detail of a small area (such as a city street map).
  • Small scale (e.g. 1:100,000,000): Shows less detail of a large area (such as a world map).
  • Principle: The smaller the scale, the less physical space is available for visual marks and detail. The level of detail of the base map should match the scale.
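The two example scales can be turned into real-world distances with simple arithmetic. This is a minimal sketch; the `ground_distance_km` helper is a name introduced here for illustration.

```python
def ground_distance_km(map_cm, scale_denominator):
    """Real-world distance for a length measured on a 1:scale_denominator map."""
    return map_cm * scale_denominator / 100_000  # 100,000 cm per km

street_map = ground_distance_km(1, 10_000)       # large scale
world_map = ground_distance_km(1, 100_000_000)   # small scale
print(street_map, "km per map cm;", world_map, "km per map cm")
```

One centimetre covers a tenth of a kilometre on the street map but a thousand kilometres on the world map, which is why the world map has room for far less detail.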

C. Symbolization

  • Definition: Also known as visual encoding, symbolization is the process of matching the phenomenon to be visualized (the data set) to the most appropriate visual representation (graphical elements and visual attributes).

4. Data Considerations for Charting

  • Data type:
  • Nominal: Categorical data, used to distinguish (such as political parties).
  • Ordinal: Allows sorting, but without exact magnitude (such as small, medium, large).
  • Quantitative: A measurable value (such as population).
  • Data distribution:
  • Discrete: Consists of individual items (such as cities on a map).
  • Continuous: Exists continuously across space (such as temperature).
  • Data model: Data can be conceptualized along two spectra, from abrupt to smooth and from discrete to continuous. This classification helps in choosing the right map type.

5. Visual Variables

  • The system in this chapter is based on the theories of Jacques Bertin.
  • This theory associates basic graphic elements (points, lines, surfaces) with “visual variables” to convey data.
  • Key variables and their applicable data types:
  • Location (X, Y): Applicable to all data types.
  • Size / Value (the lightness or darkness of a color): Applicable to quantitative and ordinal data.
  • Color hue / Shape: Applicable to nominal (categorical) data.

6. Six main graphical methods

This chapter highlights six major thematic mapping methods.

1. Dot Distribution Maps

  • Method: Use point marks to reveal the spatial distribution of a phenomenon.
  • Types: “one-to-one” (one dot = one event, e.g. Dr. John Snow’s cholera map) or “one-to-many” (one dot = one aggregate value, e.g. one dot represents 10,000 people).
  • Advantages and disadvantages: Very good at showing relative density and clustering, but poor at showing absolute quantities.
  • Rule: An equal-area projection must be used.

2. Graduated Symbol Maps

  • Method: Use the visual variable size to represent magnitude proportionally.
  • Key feature: The size of a symbol is proportional to the data value and independent of the size of the geographic area in which it sits. This avoids the misleading emphasis on large areas that can occur in choropleth maps (see below).
  • Historical first: Charles Minard used graduated pie charts in 1858 to represent the meat supply to Paris.
  • Common mistake: Do not scale symbols by radius or diameter; scale them by area.
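The area rule is easy to get wrong in code, so here is a minimal sketch contrasting the two scalings. The function names and the maximum radius of 20 are assumptions for illustration.

```python
import math

def symbol_radius(value, max_value, max_radius=20.0):
    """Radius chosen so that the circle's AREA is proportional to value."""
    return max_radius * math.sqrt(value / max_value)

def naive_radius(value, max_value, max_radius=20.0):
    """Wrong: scaling the radius linearly makes area grow with value squared."""
    return max_radius * value / max_value

# A value of 100 next to a maximum of 400: the data ratio is 0.25.
area_ratio = (symbol_radius(100, 400) / symbol_radius(400, 400)) ** 2
naive_area_ratio = (naive_radius(100, 400) / naive_radius(400, 400)) ** 2
print(area_ratio, naive_area_ratio)
```

With square-root scaling the drawn areas keep the 0.25 ratio of the data; with linear radius scaling the smaller circle shrinks to one sixteenth of the larger one and understates the value.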

3. Choropleth Maps

  • One of the most popular techniques.
  • It uses area marks (usually administrative units such as states or counties) to display aggregated data.
  • Core principle (1): Use normalized data (e.g. densities, ratios, averages), never raw absolute counts (e.g. total population, total income).
  • Core principle (2): Use ordered visual variables such as color value (light to dark) or saturation, not unordered color hues (e.g. rainbow colors).
  • Color schemes (after Cynthia Brewer):
  • Sequential: Suitable for data running from low to high (such as population density).
  • Diverging: Emphasizes two extremes and a meaningful midpoint (such as the degree of electoral lean toward Democrats or Republicans).
  • Qualitative: Suitable for nominal (categorical) data.
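The first core principle can be illustrated with a toy calculation. The region names, populations, and areas below are invented; the sketch only shows how normalizing changes which region stands out.

```python
# Hypothetical regions: (name, population, area in square km). Values are made up.
regions = [
    ("North", 900_000, 450_000),
    ("Coast", 600_000, 3_000),
    ("Plains", 300_000, 60_000),
]

# Normalize the raw count by area to get a density suitable for a choropleth.
density = {name: pop / area for name, pop, area in regions}

largest_raw = max(regions, key=lambda r: r[1])[0]
densest = max(density, key=density.get)
print("highest raw population:", largest_raw)
print("highest density:", densest)
```

A map colored by raw population would highlight the large, sparse region, while a map colored by density highlights the small, crowded one.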

4. Isometric and Isopleth Maps

  • Method: Represent a continuous 3D surface through contour lines.
  • Isometric: Values can be referenced to specific points (e.g. temperature, altitude).
  • Isopleth: Values are derived from an area and cannot be referenced to a single point (e.g. population density).

5. Flow and Network Maps

  • Method: Depict linear phenomena, usually involving movement and connections between points (origins and destinations).
  • Key encoding: Line width expresses quantity; color hue expresses category.
  • Representative figure: Charles Minard was a pioneer in this field. He drew flow maps of grain transportation and cotton imports.

6. Area and Distance Cartograms

  • Method: Deliberately distort the shape of a geographic area in order to encode another variable (such as population) into spatial area
  • Types:
  • Contiguous: Maintain topology (i.e. adjacent areas are still adjacent), such as the “Pulse of the Nation” map
  • Noncontiguous: Replaces the original shape with a non-overlapping shape (such as a circle), such as the Dorling cartogram, or the New York Times’ Olympic medal map

Chapter 7: “Spatial Structures: Maps” (core points of the courseware)

This courseware builds on the Design for Information chapter discussed earlier. It provides a more structured framework (What, Why, How) and explores key challenges in depth, especially map projections and choropleth maps.

1. Core framework: What, Why, How

The courseware first proposes a framework for thinking about spatial visualization

  • What (data type): What kind of data are you plotting?
  • Locations: 0-dimensional point data
  • Trajectories: 1-dimensional line data
  • Areas: 2-dimensional area data
  • Why (task): What do you want your audience to understand from the data?
  • Location data tasks: View distribution, density, value, distance, or temporality.
  • Trajectory data tasks: Find common paths; view values, length, directionality, or timing.
  • Area data tasks: Compare values, find maxima and minima, and identify geographic trends and outliers.

2. Key Challenge (1): Map Projections

This is the most fundamental problem in map visualization.

  • Core problem: Flattening a 3D globe onto a 2D map will inevitably produce distortion.

  • Basic trade-off: Any projection can only retain 1-2 properties (such as shape, area, angle, distance), but not all properties

  • Common projections and their trade-offs:

  • Mercator projection: Preserves shape. This is also the source of its biggest problem: it severely distorts area, making high-latitude regions (such as Greenland) appear much larger than equatorial regions (such as Africa), when in fact Africa is about 14 times the size of Greenland.

  • Hobo-Dyer projection: Preserves area. This makes it suitable for comparing densities, but it distorts shapes.

  • Ginzburgh IV / Goode Homolosine projection: Compromise projections that try to balance area and shape distortion.

  • Effect on trajectory: On a 2D map (such as Mercator projection), the shortest path (great circle route) between two points on the earth will be displayed as a curve
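The curved appearance of great-circle routes can be checked numerically. The sketch below computes the midpoint of the shortest path between two points on a sphere; the function name and the sample coordinates are assumptions for illustration, and the method does not work for exactly antipodal points.

```python
import math

def great_circle_midpoint(lat1, lon1, lat2, lon2):
    """Midpoint of the shortest path on a sphere, in degrees."""
    def to_xyz(lat, lon):
        la, lo = math.radians(lat), math.radians(lon)
        return (math.cos(la) * math.cos(lo), math.cos(la) * math.sin(lo), math.sin(la))

    # Add the two unit vectors; the sum points at the midpoint of the arc.
    x, y, z = (a + b for a, b in zip(to_xyz(lat1, lon1), to_xyz(lat2, lon2)))
    lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    lon = math.degrees(math.atan2(y, x))
    return lat, lon

# Two points on the same parallel (45 N), 120 degrees of longitude apart.
mid_lat, mid_lon = great_circle_midpoint(45, -60, 45, 60)
print(round(mid_lat, 2), round(mid_lon, 2))
```

Both endpoints sit at 45 degrees north, yet the midpoint of the shortest route lies well north of that, so on a Mercator map the route is drawn as an arc bending toward the pole rather than a straight horizontal line.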

3. Key Challenge (2): Choropleth Maps

The courseware specifically states that this is the “most common…and most harmful” map type.

  • Core flaw: Choropleth maps fill each geographic region with color, so a region’s visual weight depends on its land area rather than its data. This creates a severe perceptual bias.
  • They overemphasize regions with a large geographic area (where population and data density are usually low).
  • They hide regions with a small geographic area (where population and data density are usually very high).
  • Example from the courseware (map of Canada): Northern Canada is sparsely populated but dominates the choropleth map visually, while the densely populated southern cities (such as Toronto and Vancouver) are almost invisible.

4. Alternatives to Choropleth

In order to solve the above problems, the courseware proposes several alternative methods:

  • Dot Maps: Use dots to represent density, which is more intuitive
  • “Bar” Maps: Place a bar (or graduated symbol) on each area.
  • Cartograms: Deliberately distort geographic areas so that their area is proportional to some data variable (such as population or parliamentary seats).
  • Tile Grid Maps / Equal Area Glyphs: One of the most effective alternatives. Each geographic unit (such as a province or state) is represented by an equal-sized shape (such as a square or hexagon), which eliminates geographic area bias.
  • Glyphs: Place a small chart (such as a pie chart, box plot, or Chernoff face) on each geographic unit (or tile) to display multidimensional data.

5. Other map types

  • Point Data:
  • Heatmaps / Isopleth Maps: Displays the smoothed density of point data
  • Elevation Maps: Use 3D height (instead of color) to represent density values. The advantage is that it can display huge numerical differences, but the disadvantage is that there may be occlusion.
  • Binning: Aggregates points into discrete grids (such as hexagons) and colors the grid, an alternative to heat maps
  • Trajectories (flows):
  • Encoding: A trajectory (line) can encode quantity with thickness, category with texture or hue, and speed with time steps.
  • Geo-Temporal Data:
  • Small Multiples: The most common method, which is to display a series of maps side by side, each map representing a point in time (for example, displayed by year)
  • Space-Time Cube: A 3D method where the (X, Y) axis is geographic space and the (Z) axis is time
  • Glyph Maps: Place a small time series chart (such as a line or area chart) over each geographic area
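The binning approach listed under point data can be sketched with a counter. For simplicity this example uses square cells rather than hexagons, which is a deliberate simplification; the `bin_points` helper and the sample coordinates are hypothetical.

```python
import math
from collections import Counter

def bin_points(points, cell_size):
    """Count points per square grid cell; keys are (column, row) indices."""
    return Counter(
        (math.floor(x / cell_size), math.floor(y / cell_size)) for x, y in points
    )

# Hypothetical projected coordinates.
points = [(0.2, 0.3), (0.8, 0.1), (1.5, 0.4), (1.7, 1.9), (-0.5, 0.5)]
counts = bin_points(points, cell_size=1.0)
for cell, n in sorted(counts.items()):
    print(cell, n)
```

Each cell count can then be mapped to a sequential color scale, giving a discrete alternative to a smoothed heatmap.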

Chapter 8 - Multidimensional Data Visualization

This lecture systematically introduces visualization techniques for data of different dimensionality (from low to high), focusing on design principles, the scenarios where common charts apply, and the statistical traps to watch for in data analysis.

Low dimensional data

Low-dimensional data (fewer than 3 dimensions) concerns a single variable or the relationship between two variables.

Univariate Data

  • Common chart types:
  • Barcode Plot and Data Plot.
  • Histogram and Density Plot.
  • Box Plot and Violin Plot.
  • Core warning: Do not dumb down your data.
  • Avoid “dynamite plots”: Bar charts with error bars should be avoided because they mask the true distribution of the data.
  • Anscombe’s Quartet: A classic statistics example showing that four data sets with nearly identical summary statistics (mean, variance, correlation coefficient) can look completely different when plotted. It underlines the need to visualize the full data, not just statistical summaries.
  • Datasaurus Dozen: Even when the box plots look the same, the shape of the underlying distribution (perhaps even a dinosaur) can be completely different, so the raw data distribution needs to be visualized.
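The claim about Anscombe’s Quartet can be verified directly. The sketch below uses the published quartet values and a hand-written Pearson correlation (the `pearson_r` helper is defined here, not taken from a library) to show that the four sets share almost the same mean and correlation.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {name: (mean(ys), pearson_r(xs, ys)) for name, (xs, ys) in quartet.items()}
for name, (m, r) in summary.items():
    print(f"{name:>3}: mean y = {m:.2f}, r = {r:.2f}")
```

The printed summaries are practically indistinguishable, yet a scatterplot of each set shows a line, a curve, a line with one outlier, and a vertical stack with one outlier.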

Bivariate Data

A. Quantitative x Quantitative

  • Scatterplot: The most basic and powerful tool for showing the relationship between two variables.
  • What to look for: clusters, trends, outliers, and correlation.
  • Scagnostics (scatterplot diagnostics): Associated with Leland Wilkinson, author of “The Grammar of Graphics”, these are a set of measures that characterize scatterplots to help identify data patterns. They include Outlying, Skewed, Clumpy, Sparse, Striated, Convex, Skinny, etc.
  • Simpson’s Paradox:
  • A trend that appears within groups (such as a negative correlation) may disappear or even reverse (such as a positive correlation) when the groups are combined.
  • Takeaway: When analyzing data, try splitting it into groups and do not be misled by the overall trend. Correlation does not equal causation.
  • Mekko Chart (a mosaic chart variant):
  • Suitable when two variables multiply to give a third (for example, GDP per capita $\times$ population = total GDP). The area of each rectangle represents the product, and its width and height represent the two factors.
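Simpson’s Paradox, described above, can be reproduced with six made-up points. In this sketch both groups have a perfectly negative trend, while the pooled data has a clearly positive one; the data and the `pearson_r` helper are constructed for illustration only.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical groups: within each, y falls as x rises.
group_a = ([1, 2, 3], [6, 5, 4])
group_b = ([6, 7, 8], [11, 10, 9])

r_a = pearson_r(*group_a)
r_b = pearson_r(*group_b)
# Pooling the groups hides the grouping variable and flips the sign.
r_all = pearson_r(group_a[0] + group_b[0], group_a[1] + group_b[1])
print(round(r_a, 2), round(r_b, 2), round(r_all, 2))
```

Coloring or faceting the scatterplot by group is what makes the within-group trend visible.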

B. Quantitative x Ordered/Categorical

  • Heatmaps: Suitable for matrix data (such as month x year), with values expressed through color intensity.
  • Swarm Plots / Beeswarm Plots: Suitable for one categorical variable plus one quantitative variable. Unlike a simple bar chart, they show the density of data points within each category while avoiding overlapping points.
  • Bertin Matrices:
  • Mainly used for categorical data. Rearranging the rows and columns of the matrix reveals patterns (e.g. clusters of countries with similar social attributes).
  • Design space for bar chart:
  • Select row or column layout based on data type.
  • Variations include: Mirror Bar, Small Multiples, Bullet Chart/Benchmark Bar, etc.

High dimensional data

As dimensions increase, presenting data on a two-dimensional screen becomes more challenging.

3D Plots

  • Options include the 3D scatterplot and the 3D bar chart.
  • Disadvantages: Although intuitive, 3D plots suffer from occlusion and poor depth perception, which makes values hard to read accurately, so they usually work poorly.

Scatterplot Matrix / SPLOM

  • Principle: Plot every pair of variables against each other and arrange the plots in a matrix.
  • Usage:
  • Diagonal: Usually shows the univariate distribution of each variable (histogram or density plot), since plotting a variable against itself is uninformative ($Y=X$).
  • Off-diagonal: Shows a scatterplot of two variables.
  • Advantages: Scalable, provides an overview, is easy to interpret, and shows the pairwise relationships between all variables.
  • Disadvantages: As the number of dimensions grows, the individual plots become very small and hard to read.
  • Interaction technique: ScatterDice switches between dimensions through interactive navigation, which addresses the problem of oversized matrices when there are many dimensions.
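The scaling problem of the SPLOM follows from simple counting: d variables produce d(d-1)/2 distinct pairs. The sketch below enumerates them; the column names are hypothetical.

```python
from itertools import combinations

def splom_pairs(variables):
    """All unordered variable pairs, one per off-diagonal panel above the diagonal."""
    return list(combinations(variables, 2))

variables = ["mpg", "horsepower", "weight", "acceleration"]  # hypothetical columns
pairs = splom_pairs(variables)
print(len(pairs), "distinct scatterplots for", len(variables), "variables")
print(len(splom_pairs(range(10))), "distinct scatterplots for 10 variables")
```

Four variables need 6 panels, but ten already need 45, which is why each panel shrinks so quickly as dimensions are added.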

Parallel Coordinates Plot (PCP)

  • Principle: Each dimension is a vertical axis, and each data item is a polyline passing through all axes.
  • Pattern recognition (how to interpret):
  • Parallel lines: indicate positive correlation.
  • Crossing lines (X-shaped): indicate negative correlation.
  • Converging or bundled lines: indicate clusters or groups.
  • Common pitfalls:
  1. Axis order: Only adjacent axes can be compared visually for correlation. If two related variables are far apart, the relationship will not be visible, so axis order has a huge impact on readability.
  2. Axis scales: Units and ranges can differ completely between dimensions (such as horsepower vs. weight), which invites misreading.
  3. Truncated axes: Axes may not start at 0, which visually magnifies small differences and can mislead.
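One common way to handle the axis-scale pitfall is to rescale every dimension to the same range before drawing the polylines. The sketch below uses min-max scaling, which is one option among several and not necessarily the one used in the courseware; the car values are invented.

```python
def min_max_scale(values):
    """Rescale values linearly to [0, 1]; constant columns map to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical cars: very different units and ranges per dimension.
cars = {"horsepower": [70, 110, 150], "weight": [1800, 2600, 3400]}
scaled = {dim: min_max_scale(vals) for dim, vals in cars.items()}
print(scaled)
```

After scaling, each axis spans its own minimum to maximum, so line slopes between axes reflect relative position rather than raw units. Note that this is itself a form of axis truncation, so the axis labels should still show the original ranges.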

Glyphs

  • Principle: Encode a single data point into a small graph, with different visual variables of the graph (such as shape, size, color) representing different dimensions.
  • Typical case:
  • Star Glyphs: The axes radiating out from the center represent different variables.
  • Chernoff Faces: Use facial features (such as mouth curvature, eye size) to represent data dimensions.
  • Flower Glyphs: For example, the OECD Better Life Index uses the length and width of petals to represent a country’s well-being indicators.
  • Dear Data project (Giorgia Lupi): Hand-drawn, creative data glyphs that demonstrate extremely rich ways of encoding data.
  • Applicable scenarios: Viewing the overall profile of individual items and comparing specific objects (such as the composite indicators of two cities).
  • Disadvantages: It is difficult to see global correlations between variables.
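As a small illustration of star glyphs, base R ships a `stars()` function in the graphics package. The sketch below assumes the built-in `mtcars` dataset; the choice of four cars and five attributes is arbitrary, and `spoke_length` is just an illustrative name for the rescaled values that determine the length of each spoke.

```r
# Four cars (items) and five attributes (dimensions).
cars <- mtcars[c("Mazda RX4", "Fiat 128", "Cadillac Fleetwood", "Ferrari Dino"),
               c("mpg", "hp", "wt", "qsec", "disp")]

# Per-column rescaling to [0, 1], the same idea stars() applies with scale = TRUE.
spoke_length <- apply(cars, 2, function(x) (x - min(x)) / diff(range(x)))

# One glyph per car; each spoke is one attribute.
stars(cars, scale = TRUE, main = "One star glyph per car")
```

Each glyph makes it easy to compare the overall profile of two cars, but the plot says little about how, for example, `hp` and `wt` relate across the whole dataset.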

Dimensionality Reduction

  • Dimensionality reduction lowers the number of dimensions in the data through a mathematical transformation. The course does not cover specific algorithms in depth and focuses on the visualization side.
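The course does not name a specific algorithm, so the following is only an illustration using principal component analysis (PCA), one common technique, via base R’s `prcomp()`. It assumes the built-in `mtcars` dataset and projects its 11 numeric columns onto two dimensions that can be drawn as an ordinary scatterplot.

```r
# PCA on standardized columns; each car gets a coordinate on every component.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

coords <- pca$x[, 1:2]                           # keep the first two components
var_explained <- pca$sdev^2 / sum(pca$sdev^2)    # share of variance per component

plot(coords, xlab = "PC1", ylab = "PC2", pch = 19,
     main = "mtcars projected onto two dimensions")
text(coords, labels = rownames(coords), pos = 3, cex = 0.6)
```

`var_explained[1:2]` indicates how much of the original variation survives in the 2D picture, which is worth checking before reading patterns from it.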

Summary: What should you pay attention to?

  • Beware of Statistical Pitfalls: When analyzing data, always remember Anscombe’s Quartet (don’t just look at the mean) and Simpson’s Paradox (watch for reversing trends in subgroups).
  • Select charts based on dimensions:
  • 2D: Scatterplots are the way to go.
  • Multidimensional overview: Use a scatterplot matrix (SPLOM).
  • Multidimensional correlation analysis: Use parallel coordinate plot (PCP), but be careful to adjust the axis order.
  • Individual comparison: Use Glyphs.
  • Avoid 3D: Try not to use 3D histograms or 3D scatter plots on a 2D screen unless the goal is only to show general patterns rather than to support precise analysis.
  • Visual encoding principle: Color, shape, and size can each add a dimension, but do not stack too many of them on one chart, or it becomes cluttered and difficult to read.
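The Anscombe’s Quartet warning is easy to reproduce, because R includes the data as the built-in `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`). A minimal sketch: compute the summary statistics per set, then plot the four sets.

```r
x_cols <- paste0("x", 1:4)
y_cols <- paste0("y", 1:4)

# Means and correlation for each of the four x/y sets.
summary_stats <- data.frame(
  set    = 1:4,
  mean_x = sapply(anscombe[x_cols], mean),
  mean_y = sapply(anscombe[y_cols], mean),
  cor_xy = mapply(cor, anscombe[x_cols], anscombe[y_cols]),
  row.names = NULL
)
print(round(summary_stats, 2))

# The same four sets as scatterplots.
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[x_cols[i]]], anscombe[[y_cols[i]]],
       xlab = x_cols[i], ylab = y_cols[i], pch = 19)
}
par(old_par)
```

The table rows are nearly identical, while the four scatterplots look completely different, which is why the summary statistics alone are not enough.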

Chapter 9- Hierarchical and Relational Structure

Part 1: Relational data (networks/graphs)

This part deals with the relationships between nodes and the connections (edges/links) between them. Depending on the visual encoding, there are two main categories: node-link diagrams, which use connection marks, and adjacency matrices, which use a matrix view.

1. Basic concepts and layout

  • Node-Link Diagram:
  • Definition: The most common network diagram uses point marks to represent nodes and line marks to represent connections.
  • Advantages: Ideal for topology tasks, such as tracing paths, finding the shortest path, or finding adjacent nodes.
  • Distance concept: Distance is usually measured as a discrete number of “hops” rather than as continuous distance on the plane.
  • Force-Directed Layout:
  • Principle: Simulate physical attraction and repulsion. Nodes repel each other (like magnets), while edges act like springs and pull connected nodes closer together.
  • Features: The algorithm usually starts from random positions and optimizes iteratively. This makes it nondeterministic: each run may produce a different layout, so viewers cannot rely on spatial memory.
  • Strengths: It can reveal clusters and outliers, and the algorithm is relatively easy to implement and understand.
  • Limitations: It easily gets stuck in local optima, and for graphs with more than a few hundred nodes the layout quickly becomes a messy “hairball”.
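To make the force-directed principle concrete, here is a deliberately simplified base R sketch in the spirit of spring-based layouts. It is not the algorithm of any particular library: the function name `force_layout`, the force formulas, and the cooling schedule are assumptions chosen for brevity.

```r
force_layout <- function(edges, n, iterations = 200, seed = 1) {
  set.seed(seed)
  pos  <- matrix(runif(2 * n, -1, 1), ncol = 2)  # random starting positions
  k    <- 1 / sqrt(n)                            # preferred edge length
  temp <- 0.1                                    # maximum step size, cooled each round

  for (iter in seq_len(iterations)) {
    disp <- matrix(0, nrow = n, ncol = 2)

    # Repulsion: every node pushes every other node away.
    for (i in seq_len(n)) {
      delta <- matrix(pos[i, ], n, 2, byrow = TRUE) - pos
      dist  <- pmax(sqrt(rowSums(delta^2)), 1e-6)
      disp[i, ] <- colSums(delta / dist * (k^2 / dist))
    }

    # Attraction: each edge pulls its two endpoints together like a spring.
    for (e in seq_len(nrow(edges))) {
      u <- edges[e, 1]
      v <- edges[e, 2]
      delta <- pos[u, ] - pos[v, ]
      pull  <- delta * sqrt(sum(delta^2)) / k
      disp[u, ] <- disp[u, ] - pull
      disp[v, ] <- disp[v, ] + pull
    }

    # Move each node along its net force, capped by the current temperature.
    len <- pmax(sqrt(rowSums(disp^2)), 1e-6)
    pos <- pos + disp / len * pmin(len, temp)
    temp <- temp * 0.97
  }
  pos
}

# Two triangles joined by a single bridge edge.
edges <- rbind(c(1, 2), c(2, 3), c(1, 3),
               c(4, 5), c(5, 6), c(4, 6),
               c(3, 4))
layout_a <- force_layout(edges, n = 6, seed = 1)
layout_b <- force_layout(edges, n = 6, seed = 2)

plot(layout_a, pch = 19, xlab = "", ylab = "", asp = 1)
segments(layout_a[edges[, 1], 1], layout_a[edges[, 1], 2],
         layout_a[edges[, 2], 1], layout_a[edges[, 2], 2])
```

`layout_a` and `layout_b` use the same graph and differ only in the random seed, yet the coordinates differ. This is the nondeterminism described above: the same node may land in a different place on every run.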

2. Core Challenge: Dense Networks

When a network has too many connections, it turns into an unreadable “hairball”. As a rule of thumb, force-directed layout fails when the number of connections exceeds approximately 4 times the number of nodes. The instructor highlighted several techniques for resolving crowding:

  • Multilevel Force-Directed Placement (sfdp), a supplementary technique
  • Principle: Build a derived cluster hierarchy, first lay out a simplified coarse-grained network, and then refine it step by step.
  • Function: Improves both speed and layout quality on large networks. It helps avoid local optima and can, to a certain extent, show the cluster structure of networks with thousands of nodes.
  • Motif Simplification:
  • Principle: Identify recurring substructures in the network (such as fans, connectors, and fully connected cliques) and replace them with simple glyphs.
  • Function: Reduce visual complexity and directly display structural features.
  • Adjacency Matrices (key content)
  • Principle: Convert the network into a derived table. Rows are source nodes, columns are target nodes, and the color or fill of a cell indicates a connection.
  • Advantages:
  • No line crossings: Eliminates the occlusion problem of node-link diagrams.
  • High scalability: Even very dense networks can be displayed; a single view can support about 1,000 nodes and 1 million edges.
  • Predictability and Stability: The screen space required is predictable, and adding new nodes does not drastically change the overall layout.
  • Node lookup: Finding a specific node (by label) in an ordered list is much faster than looking in a scattered force-directed graph.
  • Node degree estimation: You can quickly estimate the degree of a node by counting the number of filled cells in a row or column.
  • Disadvantages:
  • Difficulty in path tracing: Topology tasks are hard, such as tracing multi-hop paths (A connects to B, B connects to C).
  • Unfamiliar: Users often need training to interpret matrix views.
  • Pattern Recognition: Specific visual patterns can be discovered by reordering rows and columns.
  • Cliques (fully connected subgraphs): Appear as filled squares on the diagonal.
  • Clusters: Appear as dense blocks of filled cells, indicating groups of highly interconnected nodes.
  • Comparison conclusion: Node-link diagrams suit sparse networks, and matrices suit dense networks.
  • Edge Bundling:
  • Principle: Just like tying cables together, bundle edges that run in the same direction.
  • Advantages: Greatly reduces visual clutter and clearly shows macro structure (such as the overall flow direction).
  • Disadvantages: Introduces ambiguity. Once an edge enters a bundle, you cannot tell which edge leaving the bundle is the same one, so it is unsuitable for tracing individual connections.
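The adjacency matrix idea from the list above (a derived table, node degree from filled cells, and the 4x density rule of thumb) can be sketched in a few lines of base R. The five-node edge list is made up for illustration, and the network is treated as undirected.

```r
# A small, made-up undirected network as an edge list.
edges <- data.frame(from = c("A", "A", "B", "C", "D"),
                    to   = c("B", "C", "C", "D", "E"),
                    stringsAsFactors = FALSE)
nodes <- sort(unique(c(edges$from, edges$to)))

# Derived table: one row and one column per node, 1 where an edge exists.
adj <- matrix(0L, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))
adj[cbind(edges$from, edges$to)] <- 1L
adj[cbind(edges$to, edges$from)] <- 1L   # undirected: mirror every edge

# Node degree = number of filled cells in a row.
degree <- rowSums(adj)

# Rule of thumb from the lecture: more than about 4 edges per node means a hairball.
edge_node_ratio <- nrow(edges) / length(nodes)
likely_hairball <- edge_node_ratio > 4

# Draw the matrix with the first node at the top-left.
image(seq_along(nodes), seq_along(nodes), t(adj[rev(nodes), ]),
      col = c("white", "black"), axes = FALSE, xlab = "", ylab = "")
axis(3, at = seq_along(nodes), labels = nodes)
axis(2, at = seq_along(nodes), labels = rev(nodes), las = 1)
```

In this toy network, A, B, and C are mutually connected, so they form a filled block next to the diagonal (the diagonal itself stays empty because there are no self-loops). Finding the route from A to E, by contrast, requires hopping between rows and columns, which is the path-tracing weakness noted above.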

3. Special types of networks

  • Multivariate Networks: PivotGraph can be used when nodes have attributes (such as gender or department). It aggregates nodes that share an attribute value and displays only the connections between the resulting groups, which suits macro-level analysis.
  • Geographic Networks:
  • Problem: Strictly placing nodes at their map locations usually results in severe line occlusion.
  • Strategy: Abstract the geographic information instead of sticking to map locations. You can distort the map (cartograms), turn it into a chord diagram, or abstract it into squares to make room for connections.
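The aggregation step behind a pivot-style view of a multivariate network can be sketched with base R’s `table()`. The node and edge tables below are invented for illustration, and edges are counted in the direction they are listed.

```r
# Made-up network: six people, each with a department attribute.
nodes <- data.frame(id   = 1:6,
                    dept = c("Sales", "Sales", "Sales", "Eng", "Eng", "Ops"),
                    stringsAsFactors = FALSE)
edges <- data.frame(from = c(1, 1, 2, 3, 4, 5),
                    to   = c(2, 4, 3, 5, 5, 6))

# Replace each endpoint by its department, then count edges per department pair.
from_dept <- nodes$dept[match(edges$from, nodes$id)]
to_dept   <- nodes$dept[match(edges$to, nodes$id)]

pivot_edges <- as.data.frame(table(from = from_dept, to = to_dept),
                             stringsAsFactors = FALSE)
pivot_edges <- pivot_edges[pivot_edges$Freq > 0, ]

group_sizes <- table(nodes$dept)   # how many nodes each aggregated group contains
print(pivot_edges)
```

Six individual edges collapse into four weighted group-level connections, which is the macro view: who talks to whom at the department level, not at the person level.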

Part 2: Hierarchical data (trees)

This part deals with structures that have only parent-child relationships and no cycles. Visual encodings fall into two main categories: connection and containment.

1. Explicit Techniques (Connection)

  • Features: Use lines (connection marks) to explicitly connect parent and child nodes.
  • Typical chart:
  • Traditional tree diagram (Node-Link Tree): Usually laid out vertically or horizontally, using spatial position to show depth.
  • Radial Node-Link: The root is at the center, the depth is encoded as the distance from the center, and the connections are usually curves.
  • Phylogenetic tree (Phylogram/Dendrogram): Common in biology; branch length usually encodes how different two items are (the shorter, the more similar).
  • Disadvantages: Low space utilization. As the number of nodes grows, leaf nodes (the lowest nodes) end up drawn very small.

2. Implicit Techniques (Containment & Position)

  • Features: Instead of lines, use containment (nesting), overlap, or relative position to express the parent-child relationship.
  • Typical chart:
  • Icicle Plots: Like hanging icicles; vertical position and size show depth, and horizontal position shows sibling relationships.
  • Treemaps (key content)
  • Principle: Use nested rectangles. Each parent rectangle contains its child rectangles, and the areas of all children sum to the area of the parent.
  • Advantages: 100% space utilization. Very suitable for showing attribute values of leaf nodes (such as file size or GDP), especially for finding extreme values and outliers.
  • Disadvantages: Not suitable for showing topology or hierarchy depth.
  • Layout algorithm: The lecture compared Slice-and-Dice (tends to produce long, thin rectangles that are hard to compare) with Squarified (rectangles close to squares, easier to compare by size).
  • Voronoi Treemaps: Replace rectangles with more natural cell shapes, which have good visual effects but complex algorithms.
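The Slice-and-Dice treemap layout mentioned above is short enough to sketch in base R. This is an illustrative implementation under simple assumptions, not code from the course: the tree is a nested list in which leaves carry `size` and inner nodes carry `children`, and the names `node_size`, `slice_and_dice`, and the example tree `disk` are made up.

```r
# Size of a node: its own size for a leaf, the sum of its children otherwise.
node_size <- function(node) {
  if (is.null(node$children)) return(node$size)
  sum(vapply(node$children, node_size, numeric(1)))
}

# Split the rectangle among the children in proportion to their size,
# alternating between vertical cuts (even depth) and horizontal cuts (odd depth).
slice_and_dice <- function(node, x0, y0, x1, y1, depth = 0) {
  here <- data.frame(name = node$name, depth = depth,
                     leaf = is.null(node$children),
                     x0 = x0, y0 = y0, x1 = x1, y1 = y1,
                     stringsAsFactors = FALSE)
  if (is.null(node$children)) return(here)

  sizes <- vapply(node$children, node_size, numeric(1))
  cuts  <- cumsum(c(0, sizes)) / sum(sizes)
  parts <- list(here)
  for (i in seq_along(node$children)) {
    child <- node$children[[i]]
    parts[[i + 1]] <- if (depth %% 2 == 0) {
      slice_and_dice(child, x0 + cuts[i] * (x1 - x0), y0,
                     x0 + cuts[i + 1] * (x1 - x0), y1, depth + 1)
    } else {
      slice_and_dice(child, x0, y0 + cuts[i] * (y1 - y0),
                     x1, y0 + cuts[i + 1] * (y1 - y0), depth + 1)
    }
  }
  do.call(rbind, parts)
}

# Made-up disk usage tree: sizes in arbitrary units.
disk <- list(name = "root", children = list(
  list(name = "docs", children = list(
    list(name = "a.pdf", size = 30),
    list(name = "b.pdf", size = 10))),
  list(name = "video.mp4", size = 60)))

rects <- slice_and_dice(disk, 0, 0, 1, 1)
rects$area <- (rects$x1 - rects$x0) * (rects$y1 - rects$y0)

# Draw only the leaves; their rectangles tile the whole unit square.
leaves <- rects[rects$leaf, ]
plot(0, 0, type = "n", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "", ylab = "", axes = FALSE, asp = 1)
rect(leaves$x0, leaves$y0, leaves$x1, leaves$y1)
text((leaves$x0 + leaves$x1) / 2, (leaves$y0 + leaves$y1) / 2, leaves$name)
```

The `area` column confirms the containment rule: the two PDF rectangles add up to the `docs` rectangle, and the leaves together fill the entire canvas. With many children of very different sizes, the same procedure yields the long thin strips that motivate the Squarified variant.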

3. Polar Layouts

  • Problem addressed: In a traditional tree diagram, each node gets less space the further down you go. In a polar (circular) layout, the circumference grows toward the outside, so outer rings can hold more leaf nodes.
  • Typical chart: Sunburst chart. Essentially an icicle plot rolled into a circle, i.e. a radial space-filling layout.
  • Compound Networks:
  • GrouseFlocks: A hybrid view that combines connection (showing the network structure) and containment (showing the cluster hierarchy).
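The space argument for polar layouts can be checked with a little arithmetic. The sketch below is a back-of-the-envelope comparison under stated assumptions: a full binary tree of depth 6, an icicle plot that is 1 unit wide, and a sunburst that fits into the same 1 x 1 square (so its outermost ring has radius 0.5). All variable names are illustrative.

```r
max_depth <- 6
depth <- 1:max_depth
nodes_at_depth <- 2^depth               # full binary tree

# Icicle plot: every level is a row of fixed width 1.
icicle_per_node <- 1 / nodes_at_depth

# Sunburst: level d sits on a ring of radius 0.5 * d / max_depth,
# so the available length is the circumference 2 * pi * radius.
ring_length <- 2 * pi * (0.5 * depth / max_depth)
sunburst_per_node <- ring_length / nodes_at_depth

ratio <- sunburst_per_node / icicle_per_node
print(round(data.frame(depth, icicle_per_node, sunburst_per_node, ratio), 4))
```

Under these assumptions the outermost level gets about pi (roughly 3.14) times as much length per node as in the icicle plot, while the innermost rings get less. The extra room appears exactly where the leaf nodes are.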

Comparison Table

| Task / Attribute | Best Idiom (According to Transcript) | Why? / Constraint |
| --- | --- | --- |
| Correlation (2 vars) | Scatterplot | Most accurate (position channel). |
| Correlation (Many vars) | Scatterplot Matrix (SPLOM) | Shows all pairwise. |
| Correlation (Specific Pair) | Parallel Coordinates (PCP) | ONLY IF axes are adjacent. |
| Distribution (1 var) | Histogram / Density Plot | AVOID Dynamite Plots. |
| Sparse Network | Node-Link Diagram | Intuitive for path following. |
| Dense Network | Adjacency Matrix | No occlusion; shows cliques. |
| Path Tracing | Node-Link Diagram | Hard in Matrices. |
| Tree Leaves (Size) | Treemap | Space-filling. |
| Tree Structure (Deep) | Sunburst (Polar) | More space on periphery. |

Summary: What do you need to pay attention to?

  • 1. Select charts based on data density:
  • Is the network sparse? Use Node-Link (suitable for topology tasks).
  • Is the network as dense as a hairball (number of connections > 4 times the number of nodes)? Use an Adjacency Matrix (good for overviews and lookups).
  • 2. Select charts based on the task:
  • Need to follow a path (how to get from A to E)? Use Node-Link.
  • Want to look at clustering (who is with whom) or estimate node degree? Use a Matrix.
  • Want to look at the macro flow? Use Edge Bundling.
  • 3. Choosing a tree visualization:
  • If your tree is very deep and the leaf nodes carry numerical sizes in addition to their level (such as disk usage per file), a Treemap is the best choice.
  • If there are too many leaf nodes and the outer layer is too crowded, consider using a Sunburst.
  • 4. Resource recommendation: treevis.net is a catalog of tree visualization techniques, which is very useful when you are looking for inspiration.