Recently, I’ve been introduced to the AAIA’s ‘ACES’ datasets. These datasets aim to describe the make, model, and configuration of pretty much every vehicle in recent history.
These datasets are a terrifying case study in how not to design datasets.
A Comedy of Errors
There are so many design mistakes in the three datasets that the AAIA provides – Qdb, PCdb, and VCdb – that I hardly know where to begin. Let’s start with the biggest / most obvious problems and proceed from there.
Four Kinds of
NULL, and Magic Values
It’s long been recognized that there can be different logical 'kinds’ of
NULL. Codd himself acknowledged “A-mark” (absence of information), and
“I-mark” (inapplicability of information). This is a commonplace and
potentially important distinction: I don’t have a value for a field because
it’s missing, or I don’t have it because it doesn’t make sense in this
An easy example of this distinction would be a patient chart with a
num_pregnancies column. If the
gender column has a value of
then it would be nonsense to have a non-
NULL value in
female then it could be
NULL because the patient hasn’t
filled that information out yet.
There are many many arguments about
NULL, and whether or not it should even
exist as a concept in relational databases at all:
- Three-valued logic is unintuitive and irritatingly complex.
NULLcolumns add needless complexity to SQL, and can be handled more gracefully through schema design.
- “A-mark” violates the closed-world principle, and
NULLin general violates other principles behind the relational model.
There have been some discussions of adding support for multiple – or even
NULL domains to SQL but ultimately all have fallen by the wayside
for a variety of reasons.
Keep all of that in mind as you consider the situation with ACES:
The AAIA has an interesting need to differentiate between several different kind of “A-mark”. When a piece of information is missing, they need to know what state they are in with respect to finding it: Has research begun, or not? If a piece of information is inapplicable, they need to know if it has been confirmed to be inapplicable, or is just plain not there yet.
This is all important and useful information but their choice of representation is just beyond bizarre and harkens back to the COBOL era.
They start simply enough: A table to represent the kinds of
NULL. Ok, easy
enough. But… the unique identifier for this table is a
one of several strings:
On an aside, this is one of only two places where they don’t use
FOREIGN KEY constraints – and while there are practical considerations that
come into play about when and whether those are useful, a lack of consistency
is the worst of all worlds.
But stop for a minute, and ask yourself: “How would I represent these kinds of
NULL values, if I were creating a schema?”
Unless you hate yourself, you probably thought “I’d add a
missing_<x>_reason_id column for each
NULL-able column, to refer to that
If you were a bit paranoid – a tactic I’ve often found prudent in these sorts
of situations – you may also have thought to yourself: “And maybe, through
CONSTRAINT or application logic, I will enforce that only one of each pair
missing_<x>_reason_id columns could be
NULL at a time.”
That is not what the AAIA did.
Instead, any column for which an absence must have this state tracked, is
coerced to be a
CHAR(3) column, with magic values to represent the key back
reasons table, and it is left to the developer to test values against
the set of keys in that table to determine if data is, or is not, actually
WHERE <x> IS NULL for you!
Watch Your Parsing, Bub.
The AAIA provides each dataset in a varity of formats, including SQL-dumps (for Access, SQL Server, and MySQL), and as a set of pipe-delimited-value files.
In the pipe-delimited-value files,
VARCHAR-typed columns have space-padded
values. Always. This is not true of the SQL dumps.
One line in each of nine files has a
\n terminator, whereas every other
Also, they provide no guidance or mechanism for clarification on the difference
between an empty string and a
NULL value in the places where that is
potentially ambiguous. (Yet another argument against
NULLs in general!)
Oh, wait, did you think that just because they were using magic values to
represent their four-kinds-of-
NULL that everything was
NOT NULL? Sadly,
not so much.
Keep an eye on what your loader/parser is doing if you use the pipe-delimited-value files!
Towards, Uh, 297th Normal Form?!
There’s a time to normalize, and a time not to. While many real-world datasets sit in-between normal forms, by having some things more properly normalized than others – often as a result of organic growth and a need for optimization – you’d think a dataset designed specifically for publication and consumption by others would have a fair bit of care put into its design.
You’d be wrong.
BaseVehicle table (they usually use some variation of
CamelCase in the
SQL dumps) has a
YearID column. This refers to the
Year table has precisely one column:
… Except When We Don’t Feel Like It!
The table which I load as
VehicleConfig in their
SQL dumps) has some 10 one-to-one associated tables, all directly
corresponding to columns representing a many-to-one relationship from
vehicle_configuration to some table representing a particular possible
configuration of a particular subsystem, such as body style, bed, brakes, etc.
The 10 or so tables are… interesting to say the least. They represent information about the data source that was used to conclude that a particular vehicle is associated with a particular subsystem configuration.
Do they include a
vehicle_configuration_id? No, of course not! That would
be silly! Just because records in these tables correspond
vehicle_configuration, doesn’t mean they should actually refer
to them, does it?? Well, apparently not – according to the AAIA.
Instead, they include a tuple of (
columns appear in
More interestingly, there is no instance I can find in which a
vehicle_configurations record exists, but a record in one of these
peripheral tables does not. What does occur though is “stranded” data –
records in the leaf tables that do not correspond to records in the
That may be very useful if you’re the AAIA and are in the perpetual process of tracking down information about sources – but as a consumer of the data, it just makes consuming it more annoying.
And then there’s the
Source column that occurs in each of these, and a few
other, tables: It’s a
CHAR(10) everywhere. Not a
source_id INT NOT NULL
FOREIGN KEY reference to a
Sources table – no no no! That would be
No, this is one place they have no problem simply duplicating values all over the place.
Year needs to be normalized out, despite the complete lack of any benefit,
Source? Not so much. Oh yes: I’ve found less than 20 distinct values
(And some of them appear to be errors. E.G.
OE, when the most common values
… and this is yet another place where they don’t bother to make a variable
length string into a
VARCHAR column. (There are lots of places where they
do – just not these ones.)
What’s In a Name?
Of a total of 88 tables:
- Number with singular names (excluding mapping tables –
- Number with plural names (also excluding mapping tables): 11
- Number with the dataset in their name: 2
- Number with spaces in their name: 1
- Number with names that collide across datasets: 5 tables, 2 distinct names
Additionally, many tables have columns that include the table name, or a
singular version of it, or an alternative name for the same thing (the
PartTerminology<Foo> columns, and
Parts is synonymous with
PartTerminology throughout the schema) – but of course, not all tables do.
In fact, sometimes you’ll see a mix of conventions within one table.
Hell, they can’t even be consistent about the casing in their camel-casing.
Subcategories has columns
Semantics? What Are Semantics?
In one table, for one column, they choose to export this in the MySQL dumps as
TIMESTAMP column. Despite clearly not wanting the still-default
auto-populate-with-current-time semantics, they don’t add a
DEFAULT 0 to the
Units? Figure it Out Yourself!
In many cases they have quantities representing physical measurements. They
tend to provide these in both US and metric versions. As
of course. And the values are not provided in enough precision to produce one
from the other without having a deviation, even when perfect precision is
possible without using some sort of
Rational type to represent things – but
that can be forgiven as mathematical accuracy may be less relevant than
reproducing what appears on a manufacturer’s datasheet.
That said, “metric” is not a unit. Is
BedLengthMetric in millimeters?
Centimeters? Meters? Is it an integer, or a decimal value (remember, the
columns are all
CHAR(<n>))? The only way to know is to look at the values –
in each column pair – and judge based on reality, or check the math versus the
US version and see if you need a multiplier of 2.54, or 25.4 to get close.
Oh, did I forget to mention that the units are inconsistent between the US and metric versions of column pairs in various tables? Sorry…
But my personal favorite has to be engine volume. They have three columns to
represent engine volume. And sometimes they just… don’t populate
Nevermind that the table has all the information needed to do the math…
Of course that’s not the only case where one value or another might be missing, but it’s by far the most commonly occurring.
The ACES datasets look like a COBOL programmer got really drunk, read a book on relational database normalization, had some sort of seizure, and puked all over the database.