The thirteen data sets in the Datasaurus Dozen, visualized and summarized
The thirteen data sets are labeled as follows:
away
bullseye
circle
dino
dots
h_lines
high_lines
slant_down
slant_up
star
v_line
wide_lines
x_shape
Like Anscombe's quartet, the Datasaurus Dozen was designed to further illustrate the importance of plotting a data set before analyzing it according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic data sets.[2][3][4][5][1][6]
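The point can be illustrated with a toy example. The sketch below does not use the Datasaurus data; it builds two hypothetical four-point data sets (the corners of a square and of the same square rotated by 45 degrees) and shows that they have matching means, standard deviations and Pearson correlation even though the points differ. The helper name `summary` is an assumption made for this illustration.

```python
import math

def summary(points):
    """Return (mean_x, mean_y, sd_x, sd_y, pearson_r) for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Population standard deviations.
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in points) / n)
    sy = math.sqrt(sum((y - my) ** 2 for _, y in points) / n)
    cov = sum((x - mx) * (y - my) for x, y in points) / n
    return mx, my, sx, sy, cov / (sx * sy)

# Illustrative data only: a square and the same square rotated by 45 degrees.
r2 = math.sqrt(2)
square = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
diamond = [(r2, 0), (-r2, 0), (0, r2), (0, -r2)]

print([round(v, 6) for v in summary(square)])
print([round(v, 6) for v in summary(diamond)])
```

Both calls report the same five statistics up to floating-point rounding, so only a plot reveals that the two point sets are arranged differently.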
Creation
The dinosaur data set created by Alberto Cairo that inspired the creation of the Datasaurus Dozen
The first data set, in the shape of a Tyrannosaurus, which inspired the rest of the "datasaurus" data sets, was constructed in 2016 by Alberto Cairo.[7][8] Maarten Lambrechts proposed that this data set also be called "Anscombosaurus".[7]
This data set was then accompanied by twelve other data sets created by Justin Matejka and George Fitzmaurice at Autodesk. Unlike Anscombe's quartet, for which it is not known how the data sets were generated,[9] the authors used simulated annealing to make these data sets. They made small, random changes to each point, biased towards the desired shape, and kept a change only if the statistics of the data set stayed similar enough to those of the seed data set. Each shape took 200,000 iterations of perturbations to complete.[1] The procedure can be summarized in pseudocode:
current_ds ← initial_ds
for x iterations, do:
    test_ds ← perturb(current_ds, temp)
    if similar_enough(test_ds, initial_ds):
        current_ds ← test_ds

function perturb(ds, temp):
    loop:
        test ← move_random_points(ds)
        if fit(test) > fit(ds) or temp > random():
            return test
where
initial_ds is the seed data set
current_ds is the latest version of the data set
fit() is a function used to check whether moving the points gets closer to the desired shape
temp is the temperature of the simulated annealing algorithm
similar_enough() is a function that checks whether the statistics for the two given data sets are similar enough
move_random_points() is a function that randomly moves data points
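The pseudocode above can be turned into a small runnable sketch. The following Python version is an illustration under stated assumptions, not the authors' implementation: the target shape is a circle, `fit()` is the negative total distance of the points from that circle, the temperature cools linearly from 0.4 to 0.01, one point is moved per perturbation with Gaussian noise, and "similar enough" means that each of five statistics differs by less than 0.01. The seed data set is random, and the iteration count is kept small so the example finishes quickly.

```python
import math
import random

def stats(ds):
    """Mean x, mean y, SD x, SD y and Pearson correlation of (x, y) pairs."""
    n = len(ds)
    mx = sum(x for x, _ in ds) / n
    my = sum(y for _, y in ds) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in ds) / n)
    sy = math.sqrt(sum((y - my) ** 2 for _, y in ds) / n)
    r = sum((x - mx) * (y - my) for x, y in ds) / (n * sx * sy)
    return mx, my, sx, sy, r

def similar_enough(a, b, tol=0.01):
    # Assumed tolerance: every statistic must differ by less than `tol`.
    return all(abs(p - q) < tol for p, q in zip(stats(a), stats(b)))

def fit(ds, cx, cy, radius):
    # Assumed fitness: higher is better, so negate the total distance
    # of the points from the target circle.
    return -sum(abs(math.hypot(x - cx, y - cy) - radius) for x, y in ds)

def move_random_points(ds, rng, step=0.1):
    # Assumed move: nudge a single randomly chosen point by Gaussian noise.
    moved = list(ds)
    i = rng.randrange(len(moved))
    x, y = moved[i]
    moved[i] = (x + rng.gauss(0, step), y + rng.gauss(0, step))
    return moved

def perturb(ds, temp, rng, shape):
    # Retry until a move improves the fit or is accepted by the temperature.
    while True:
        test = move_random_points(ds, rng)
        if fit(test, *shape) > fit(ds, *shape) or temp > rng.random():
            return test

def anneal(initial_ds, shape, iterations, rng, max_temp=0.4, min_temp=0.01):
    current_ds = list(initial_ds)
    for i in range(iterations):
        # Assumed linear cooling schedule from max_temp down towards min_temp.
        temp = max_temp - (max_temp - min_temp) * i / iterations
        test_ds = perturb(current_ds, temp, rng, shape)
        if similar_enough(test_ds, initial_ds):
            current_ds = test_ds
    return current_ds

rng = random.Random(0)
initial = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
mx, my, sx, sy, _ = stats(initial)
# A circle centred on the mean with this radius has the same SDs as the seed.
shape = (mx, my, math.sqrt(sx ** 2 + sy ** 2))
result = anneal(initial, shape, 2000, rng)

print("seed stats:  ", [round(v, 2) for v in stats(initial)])
print("result stats:", [round(v, 2) for v in stats(result)])
print("fit before/after:", round(fit(initial, *shape), 2), round(fit(result, *shape), 2))
```

Because a candidate is kept only when `similar_enough()` holds against the seed data set, the returned data set always stays within the chosen tolerance of the seed statistics, however far the points have moved.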