Justin Matejka, George Fitzmaurice (2017)
Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing...make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding. F. J. Anscombe, 1973
(and echoed in nearly all talks about data visualization...)
It can be difficult to demonstrate the importance of data visualization. Some people are of the impression that charts are simply "pretty pictures", while all of the important information can be divined through statistical analysis. An effective (and often used) tool used to demonstrate that visualizing your data is in fact important is Anscome's Quartet. Developed by F.J. Anscombe in 1973, Anscombe's Quartet is a set of four datasets, where each produces the same summary statistics (mean, standard deviation, and correlation), which could lead one to believe the datasets are quite similar. However, after visualizing (plotting) the data, it becomes clear that the datasets are markedly different. The effectiveness of Anscombe's Quartet is not due to simply having four different datasets which generate the same statistical properties, it is that four clearly different and visually distinct datasets are producing the same statistical properties. In contrast the "Unstructured Quartet" on the right in Figure 1 also shares the same statistical properties as Anscombe's Quartet, however without any obvious underlying structure to the individual datasets, this quartet is not nearly as effective at demonstrating the importance of visualizing your data.
While very popular and effective for illustrating the importance of visualizing your data, they have been around for nearly 45 years, and it is not known how Anscombe came up with his datasets. So, we developed a technique to create these types of datasets – those which are identical over a range of statistical properties, yet produce dissimilar graphics.
Recently, Alberto Cairo created the Datasaurus dataset which urges people to "never trust summary statistics alone; always visualize your data", since, while the data exhibits normal seeming statistics, plotting the data reveals a picture of a dinosaur. Inspired by Anscombe's Quartet and the Datasaurus, we present, The Datasaurus Dozen (download .csv):
These 13 datasets (the Datasaurus, plus 12 others) each have the same summary statistics (x/y mean, x/y standard deviation, and Pearson's correlation) to two decimal places, while being drastically different in appearance. This work describes the technique we developed to create this dataset, and others like it.
The key insight behind our approach is that while it is relatively difficult to generate a dataset from scratch with particular statistical properties, it is relatively easy to take an existing dataset, modify it slightly, and maintain those statistical properties. We do this by choosing a point at random, moving it a little bit, then checking that the statistical properties of the set haven't strayed outside of the acceptable bounds (in this particular case, we are ensuring that the means, standard deviations, and correlations remain the same to two decimal places.)
Repeating this subtle "perturbation" process enough times, results in a completely different dataset. However, as mentioned above, in order for these datasets to be effective tools for underscoring the importance of visualizing your data, they need to be visually distinct and clearly different. We accomplish this by biasing the random point movements towards a particular shape. In the animation below, we show the process of 200,000 iterations of perturbations towards a 'circle' shape:
To move the points towards a particular shape, we perform an additional check at each random perturbation. Besides checking that the statistical properties are still valid, we also check to see if the point has moved closer to the target shape. If both of those conditions are met, we "accept" the new position, and move to the next iteration. To mitigate the possibility of getting stuck in a locally-optimal solution, where other, more globally-optimal solutions closer to the target shape are possible, we use a simulated annealing technique which begins by accepting some solutions where the point moves away from the target in the early iterations, and reduces the frequency of such acceptances over time.
To generate the Datasaurus Dozen, we created 12 shapes to direct the dots towards. Each of the resulting plots has the same summary statistics as the original Datasaurus, and in fact, all of the intermediate frames do as well. The process of converting the Datasaurus into each of these shapes can be seen below. Of course, the technique is not limited to these shapes, any collection of line segments could be used as a target.
Iterating through the datasets sequentially, we can see how the data points morph from one shape to another, all the while maintaining the same summary statistical values to two decimal places throughout the entire process.
Besides the Datasaurus Dozen, we have created several other example datasets using our technique. They are explained in more detail in the paper, and can be downloaded for your own visualizations.
One interesting property of our technique is that it can work for visualizations other than 2D scatter plots, and statistical properties besides the standard summary statistics. In the example below each of the datasets start out as a normal distribution of points. The boxplot shown at the bottom is a standard "Tukey Boxplot" which shows the 1st quartile, median, and 3rd quartile values on the "box", and the "whiskers" showing the location of the furthest data points within 1.5 interquartile ranges from the 1st and 3rd quartiles. Boxplots are commonly used to show the distribution of a dataset, and are better than simply showing the mean or median value. However, here we can see as the distribution of points changes, the box-plot remains the same.
Another way to look at these 1D distributions is to consider a dataset with seven categories (Figure 8, below). The data in each category is shifting overtime, as can clearly be seen in the "Raw" data view, yet the boxplots remain static. Violin plots are a good method for presenting the distribution of a dataset with more detail than is available with a traditional boxplot. This is not to say that using a boxplot is never appropriate, but if you are going to use a boxplot, it is important to make sure the underlying data is distrubuted in a way that important information is not hidden.
The datasets presented on this page (and in the paper) are available for download. The Python source code is available for download here. I have tried to remove as many "extra" things from the code as possilbe, to make it more readable, but is still a bit rough... I would like to put it up on GitHub soon, and if there is enough interest, turn it into a real library. If you are a researcher and would like to see all of the original code (even though it might not run properly, and is qutie "researchy"), please contact me.
Amazingly, and without any work on my part, these datasets have been turned into an R package (GitHub, cran package). This effort has been led by Stephanie Locke and Lucy McGowan. Thanks!
A special thanks to Alberto Cairo for creating the Datasaurus. When I asked if he had saved the datapoints from his original tweet, he hadn't, but he very graciously (and quickly!) created a new (and even better) dinosaur drawing (using the fantasitc DrawMyData tool). Also thanks to Fraser Anderson for the idea to start with an existing dataset which has the desired statistical properties, rather than try to create one from scratch.
The following are links to some of the attention this article has received. If you have found additional coverage, I'd love to hear about it.
New #chi2017 paper is up. Don't trust statistics alone, visualize your data! https://t.co/amnbAYvsq1 pic.twitter.com/1s6vkge6dl
— Justin Matejka (@JustinMatejka) May 1, 2017
Be wary of boxplots! They might be obscuring important information.https://t.co/amnbAYvsq1 pic.twitter.com/7YxslPGp1n
— Justin Matejka (@JustinMatejka) August 9, 2017
A great demonstration of why we need to plot the data and never trust statistics tables! https://t.co/JyUb57v0or pic.twitter.com/hsivGZdpZ1
— Taha Yasseri (@TahaYasseri) May 1, 2017
All of these datasets have the same mean, standard deviation, and Pearson's correlation. #dataviz #DataSciencehttps://t.co/0sh6n8CsIA pic.twitter.com/NJKgg1K4NP
— Randy Olson (@randal_olson) May 2, 2017
Same means, standard deviations and correlation!https://t.co/6Ya5hpNquv pic.twitter.com/2dZkQE95ns by @JustinMatejka & Fitzmaurice
— Dina D. Pomeranz (@dinapomeranz) May 2, 2017
For more information, please see the research paper or watch the longer video embedded below.
For questions, please contact Justin Matejka through email Justin.Matejka@Autodesk.com or on Twitter @JustinMatejka.