The AAIA ACES Datasets

Recently, I’ve been introduced to the AAIA’s ‘ACES’ datasets. These datasets aim to describe the make, model, and configuration of pretty much every vehicle in recent history.

These datasets are a terrifying case study in how not to design datasets.

A Comedy of Errors

There are so many design mistakes in the three datasets that the AAIA provides – Qdb, PCdb, and VCdb – that I hardly know where to begin. Let’s start with the biggest / most obvious problems and proceed from there.

Four Kinds of NULL, and Magic Values

It’s long been recognized that there can be different logical ‘kinds’ of NULL. Codd himself acknowledged “A-mark” (absence of information), and “I-mark” (inapplicability of information). This is a commonplace and potentially important distinction: I don’t have a value for a field because it’s missing, or I don’t have it because it doesn’t make sense in this context.

An easy example of this distinction would be a patient chart with a num_pregnancies column. If the gender column has a value of male, then it would be nonsense to have a non-NULL value in num_pregnancies, but if gender is female then it could be NULL because the patient hasn’t filled that information out yet.

There are many many arguments about NULL, and whether or not it should even exist as a concept in relational databases at all:

  • Three-valued logic is unintuitive and irritatingly complex.
  • NULL columns add needless complexity to SQL, and can be handled more gracefully through schema design.
  • “A-mark” violates the closed-world principle, and NULL in general violates other principles behind the relational model.

There have been some discussions of adding support for multiple – or even arbitrary NULL domains to SQL but ultimately all have fallen by the wayside for a variety of reasons.

Keep all of that in mind as you consider the situation with ACES:

The AAIA has an interesting need to differentiate between several different kind of “A-mark”. When a piece of information is missing, they need to know what state they are in with respect to finding it: Has research begun, or not? If a piece of information is inapplicable, they need to know if it has been confirmed to be inapplicable, or is just plain not there yet.

This is all important and useful information but their choice of representation is just beyond bizarre and harkens back to the COBOL era.

They start simply enough: A table to represent the kinds of NULL. Ok, easy enough. But… the unique identifier for this table is a CHAR(3) containing one of several strings: -, N/A, U/K, or U/R. WAT.

On an aside, this is one of only two places where they don’t use FOREIGN KEY constraints – and while there are practical considerations that come into play about when and whether those are useful, a lack of consistency is the worst of all worlds.

But stop for a minute, and ask yourself: “How would I represent these kinds of NULL values, if I were creating a schema?”

Unless you hate yourself, you probably thought “I’d add a missing_<x>_reason_id column for each NULL-able column, to refer to that reasons table.”

If you were a bit paranoid – a tactic I’ve often found prudent in these sorts of situations – you may also have thought to yourself: “And maybe, through a CONSTRAINT or application logic, I will enforce that only one of each pair of <x> and missing_<x>_reason_id columns could be NULL at a time.”

That is not what the AAIA did.

Instead, any column for which an absence must have this state tracked, is coerced to be a CHAR(3) column, with magic values to represent the key back to the reasons table, and it is left to the developer to test values against the set of keys in that table to determine if data is, or is not, actually missing. No WHERE <x> IS NULL for you!

Watch Your Parsing, Bub.

The AAIA provides each dataset in a varity of formats, including SQL-dumps (for Access, SQL Server, and MySQL), and as a set of pipe-delimited-value files.

In the pipe-delimited-value files, VARCHAR-typed columns have space-padded values. Always. This is not true of the SQL dumps.

One line in each of nine files has a \n terminator, whereas every other record has \r\n.

Also, they provide no guidance or mechanism for clarification on the difference between an empty string and a NULL value in the places where that is potentially ambiguous. (Yet another argument against NULLs in general!)

Oh, wait, did you think that just because they were using magic values to represent their four-kinds-of-NULL that everything was NOT NULL? Sadly, not so much.

Keep an eye on what your loader/parser is doing if you use the pipe-delimited-value files!

Towards, Uh, 297th Normal Form?!

There’s a time to normalize, and a time not to. While many real-world datasets sit in-between normal forms, by having some things more properly normalized than others – often as a result of organic growth and a need for optimization – you’d think a dataset designed specifically for publication and consumption by others would have a fair bit of care put into its design.

You’d be wrong.

The BaseVehicle table (they usually use some variation of CamelCase in the SQL dumps) has a YearID column. This refers to the Year table.

The Year table has precisely one column: YearID.

WAT.

… Except When We Don’t Feel Like It!

The table which I load as vehicle_configurations (VehicleConfig in their SQL dumps) has some 10 one-to-one associated tables, all directly corresponding to columns representing a many-to-one relationship from vehicle_configuration to some table representing a particular possible configuration of a particular subsystem, such as body style, bed, brakes, etc.

The 10 or so tables are… interesting to say the least. They represent information about the data source that was used to conclude that a particular vehicle is associated with a particular subsystem configuration.

Do they include a vehicle_configuration_id? No, of course not! That would be silly! Just because records in these tables correspond directly to records in vehicle_configuration, doesn’t mean they should actually refer to them, does it?? Well, apparently not – according to the AAIA.

Instead, they include a tuple of (vehicle_id, and <subsystem>_id). Both columns appear in vehicle_configurations.

More interestingly, there is no instance I can find in which a vehicle_configurations record exists, but a record in one of these peripheral tables does not. What does occur though is “stranded” data – records in the leaf tables that do not correspond to records in the vehicle_configurations table.

That may be very useful if you’re the AAIA and are in the perpetual process of tracking down information about sources – but as a consumer of the data, it just makes consuming it more annoying.

And then there’s the Source column that occurs in each of these, and a few other, tables: It’s a CHAR(10) everywhere. Not a source_id INT NOT NULL with a FOREIGN KEY reference to a Sources table – no no no! That would be too easy!

No, this is one place they have no problem simply duplicating values all over the place.

Year needs to be normalized out, despite the complete lack of any benefit, but Source? Not so much. Oh yes: I’ve found less than 20 distinct values for Source.

(And some of them appear to be errors. E.G. OE, when the most common values are OEM and OEM+…)

… and this is yet another place where they don’t bother to make a variable length string into a VARCHAR column. (There are lots of places where they do – just not these ones.)

What’s In a Name?

Of a total of 88 tables:

  • Number with singular names (excluding mapping tables – <Foo>To<Bar>): 66
  • Number with plural names (also excluding mapping tables): 11
  • Number with the dataset in their name: 2
  • Number with spaces in their name: 1
  • Number with names that collide across datasets: 5 tables, 2 distinct names

Additionally, many tables have columns that include the table name, or a singular version of it, or an alternative name for the same thing (the Parts table, has PartTerminology<Foo> columns, and Parts is synonymous with PartTerminology throughout the schema) – but of course, not all tables do. In fact, sometimes you’ll see a mix of conventions within one table.

Hell, they can’t even be consistent about the casing in their camel-casing. The table Subcategories has columns SubCategoryID, and SubCategoryName.

Semantics? What Are Semantics?

In one table, for one column, they choose to export this in the MySQL dumps as a TIMESTAMP column. Despite clearly not wanting the still-default auto-populate-with-current-time semantics, they don’t add a DEFAULT 0 to the definition.

Units? Figure it Out Yourself!

In many cases they have quantities representing physical measurements. They tend to provide these in both US and metric versions. As CHAR(<n>) columns, of course. And the values are not provided in enough precision to produce one from the other without having a deviation, even when perfect precision is possible without using some sort of Rational type to represent things – but that can be forgiven as mathematical accuracy may be less relevant than reproducing what appears on a manufacturer’s datasheet.

That said, “metric” is not a unit. Is BedLengthMetric in millimeters? Centimeters? Meters? Is it an integer, or a decimal value (remember, the columns are all CHAR(<n>))? The only way to know is to look at the values – in each column pair – and judge based on reality, or check the math versus the US version and see if you need a multiplier of 2.54, or 25.4 to get close.

Oh, did I forget to mention that the units are inconsistent between the US and metric versions of column pairs in various tables? Sorry…

But my personal favorite has to be engine volume. They have three columns to represent engine volume. And sometimes they just… don’t populate cid.

Nevermind that the table has all the information needed to do the math…

Of course that’s not the only case where one value or another might be missing, but it’s by far the most commonly occurring.

tl;dr

The ACES datasets look like a COBOL programmer got really drunk, read a book on relational database normalization, had some sort of seizure, and puked all over the database.

data 1717 words, est. time: 343 seconds.

« Know What You're Actually Testing Fast GeoIP Queries in MySQL »

Comments

Copyright © 2022 - Jon Frisby - Powered by Octopress