Recently, I’ve been introduced to the AAIA’s ‘ACES’ datasets. These datasets aim to describe the make, model, and configuration of pretty much every vehicle in recent history.
These datasets are a terrifying case study in how not to design datasets.
A Comedy of Errors
There are so many design mistakes in the three datasets that the AAIA provides – Qdb, PCdb, and VCdb – that I hardly know where to begin. Let’s start with the biggest / most obvious problems and proceed from there.
Four Kinds of NULL
, and Magic Values
It’s long been recognized that there can be different logical ‘kinds’ of
NULL
. Codd himself acknowledged “A-mark” (absence of information), and
“I-mark” (inapplicability of information). This is a commonplace and
potentially important distinction: I don’t have a value for a field because
it’s missing, or I don’t have it because it doesn’t make sense in this
context.
An easy example of this distinction would be a patient chart with a
num_pregnancies
column. If the gender
column has a value of male
,
then it would be nonsense to have a non-NULL
value in num_pregnancies
, but
if gender
is female
then it could be NULL
because the patient hasn’t
filled that information out yet.
There are many many arguments about NULL
, and whether or not it should even
exist as a concept in relational databases at all:
- Three-valued logic is unintuitive and irritatingly complex.
NULL
columns add needless complexity to SQL, and can be handled more gracefully through schema design.- “A-mark” violates the closed-world principle, and
NULL
in general violates other principles behind the relational model.
There have been some discussions of adding support for multiple – or even
arbitrary NULL
domains to SQL but ultimately all have fallen by the wayside
for a variety of reasons.
Keep all of that in mind as you consider the situation with ACES:
The AAIA has an interesting need to differentiate between several different kind of “A-mark”. When a piece of information is missing, they need to know what state they are in with respect to finding it: Has research begun, or not? If a piece of information is inapplicable, they need to know if it has been confirmed to be inapplicable, or is just plain not there yet.
This is all important and useful information but their choice of representation is just beyond bizarre and harkens back to the COBOL era.
They start simply enough: A table to represent the kinds of NULL
. Ok, easy
enough. But… the unique identifier for this table is a CHAR(3)
containing
one of several strings: -
, N/A
, U/K
, or U/R
. WAT.
On an aside, this is one of only two places where they don’t use
FOREIGN KEY
constraints – and while there are practical considerations that
come into play about when and whether those are useful, a lack of consistency
is the worst of all worlds.
But stop for a minute, and ask yourself: “How would I represent these kinds of
NULL
values, if I were creating a schema?”
Unless you hate yourself, you probably thought “I’d add a
missing_<x>_reason_id
column for each NULL
-able column, to refer to that
reasons
table.”
If you were a bit paranoid – a tactic I’ve often found prudent in these sorts
of situations – you may also have thought to yourself: “And maybe, through
a CONSTRAINT
or application logic, I will enforce that only one of each pair
of <x>
and missing_<x>_reason_id
columns could be NULL
at a time.”
That is not what the AAIA did.
Instead, any column for which an absence must have this state tracked, is
coerced to be a CHAR(3)
column, with magic values to represent the key back
to the reasons
table, and it is left to the developer to test values against
the set of keys in that table to determine if data is, or is not, actually
missing. No WHERE <x> IS NULL
for you!
Watch Your Parsing, Bub.
The AAIA provides each dataset in a varity of formats, including SQL-dumps (for Access, SQL Server, and MySQL), and as a set of pipe-delimited-value files.
In the pipe-delimited-value files, VARCHAR
-typed columns have space-padded
values. Always. This is not true of the SQL dumps.
One line in each of nine files has a \n
terminator, whereas every other
record has \r\n
.
Also, they provide no guidance or mechanism for clarification on the difference
between an empty string and a NULL
value in the places where that is
potentially ambiguous. (Yet another argument against NULL
s in general!)
Oh, wait, did you think that just because they were using magic values to
represent their four-kinds-of-NULL
that everything was NOT NULL
? Sadly,
not so much.
Keep an eye on what your loader/parser is doing if you use the pipe-delimited-value files!
Towards, Uh, 297th Normal Form?!
There’s a time to normalize, and a time not to. While many real-world datasets sit in-between normal forms, by having some things more properly normalized than others – often as a result of organic growth and a need for optimization – you’d think a dataset designed specifically for publication and consumption by others would have a fair bit of care put into its design.
You’d be wrong.
The BaseVehicle
table (they usually use some variation of CamelCase
in the
SQL dumps) has a YearID
column. This refers to the Year
table.
The Year
table has precisely one column: YearID
.
WAT.
… Except When We Don’t Feel Like It!
The table which I load as vehicle_configurations
(VehicleConfig
in their
SQL dumps) has some 10 one-to-one associated tables, all directly
corresponding to columns representing a many-to-one relationship from
vehicle_configuration
to some table representing a particular possible
configuration of a particular subsystem, such as body style, bed, brakes, etc.
The 10 or so tables are… interesting to say the least. They represent information about the data source that was used to conclude that a particular vehicle is associated with a particular subsystem configuration.
Do they include a vehicle_configuration_id
? No, of course not! That would
be silly! Just because records in these tables correspond directly
to
records in vehicle_configuration
, doesn’t mean they should actually refer
to them, does it?? Well, apparently not – according to the AAIA.
Instead, they include a tuple of (vehicle_id
, and <subsystem>_id
). Both
columns appear in vehicle_configurations
.
More interestingly, there is no instance I can find in which a
vehicle_configurations
record exists, but a record in one of these
peripheral tables does not. What does occur though is “stranded” data –
records in the leaf tables that do not correspond to records in the
vehicle_configurations
table.
That may be very useful if you’re the AAIA and are in the perpetual process of tracking down information about sources – but as a consumer of the data, it just makes consuming it more annoying.
And then there’s the Source
column that occurs in each of these, and a few
other, tables: It’s a CHAR(10)
everywhere. Not a source_id INT NOT NULL
with a FOREIGN KEY
reference to a Sources
table – no no no! That would be
too easy!
No, this is one place they have no problem simply duplicating values all over the place.
Year
needs to be normalized out, despite the complete lack of any benefit,
but Source
? Not so much. Oh yes: I’ve found less than 20 distinct values
for Source
.
(And some of them appear to be errors. E.G. OE
, when the most common values
are OEM
and OEM+
…)
… and this is yet another place where they don’t bother to make a variable
length string into a VARCHAR
column. (There are lots of places where they
do – just not these ones.)
What’s In a Name?
Of a total of 88 tables:
- Number with singular names (excluding mapping tables –
<Foo>To<Bar>
): 66 - Number with plural names (also excluding mapping tables): 11
- Number with the dataset in their name: 2
- Number with spaces in their name: 1
- Number with names that collide across datasets: 5 tables, 2 distinct names
Additionally, many tables have columns that include the table name, or a
singular version of it, or an alternative name for the same thing (the Parts
table, has PartTerminology<Foo>
columns, and Parts
is synonymous with
PartTerminology
throughout the schema) – but of course, not all tables do.
In fact, sometimes you’ll see a mix of conventions within one table.
Hell, they can’t even be consistent about the casing in their camel-casing.
The table Subcategories
has columns SubCategoryID
, and SubCategoryName
.
Semantics? What Are Semantics?
In one table, for one column, they choose to export this in the MySQL dumps as
a TIMESTAMP
column. Despite clearly not wanting the still-default
auto-populate-with-current-time semantics, they don’t add a DEFAULT 0
to the
definition.
Units? Figure it Out Yourself!
In many cases they have quantities representing physical measurements. They
tend to provide these in both US and metric versions. As CHAR(<n>)
columns,
of course. And the values are not provided in enough precision to produce one
from the other without having a deviation, even when perfect precision is
possible without using some sort of Rational
type to represent things – but
that can be forgiven as mathematical accuracy may be less relevant than
reproducing what appears on a manufacturer’s datasheet.
That said, “metric” is not a unit. Is BedLengthMetric
in millimeters?
Centimeters? Meters? Is it an integer, or a decimal value (remember, the
columns are all CHAR(<n>)
)? The only way to know is to look at the values –
in each column pair – and judge based on reality, or check the math versus the
US version and see if you need a multiplier of 2.54, or 25.4 to get close.
Oh, did I forget to mention that the units are inconsistent between the US and metric versions of column pairs in various tables? Sorry…
But my personal favorite has to be engine volume. They have three columns to
represent engine volume. And sometimes they just… don’t populate cid
.
Nevermind that the table has all the information needed to do the math…
Of course that’s not the only case where one value or another might be missing, but it’s by far the most commonly occurring.
tl;dr
The ACES datasets look like a COBOL programmer got really drunk, read a book on relational database normalization, had some sort of seizure, and puked all over the database.