Fast GeoIP Queries in MySQL

Many moons ago, I tried to use the GeoIP database, via MySQL – and discovered that MySQL was doing the wrong thing with the queries. Naturally, I found some tricks to make it faster.

Of course, MaxMind has long provided their own custom data file and code for reading it efficiently, but for a variety of reasons I did not want to use it. The problem was that the obvious query for the table turns out to be rather horrendously inefficient – on the order of a couple seconds per lookup. Oy!

As it turns out, there’s a couple ways to fool MySQL’s query planner into doing a much more efficient lookup. I discovered today that even in MySQL 5.6, these tricks are still needed. Despite all the lovely improvements in query planning, MySQL still wears its pants squarely on its head when it comes to the type of query you want to do here.

For those of you who haven’t used MaxMind’s GeoIP database, the schema looks like this:

CREATE TABLE ip_addresses (
id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
ip_start BIGINT NOT NULL,
ip_end BIGINT NOT NULL,
location_id INTEGER NOT NULL,
KEY ip_start (ip_start),
KEY ip_end (ip_end)
) ENGINE=InnoDB;

The obvious query you’d want to do is going to look like this:

SELECT location_id
FROM ip_addresses
WHERE ip_start <= 2979311012
AND ip_end >= 2979311012
LIMIT 1;

Unfortunately, even on my Haswell-based laptop, with an SSD, this winds up taking about 1.5 seconds to run.

Being Smart on Behalf of MySQL

As I said above, there’s a couple ways to trick to the query planner. But first, let’s look at the EXPLAIN output for the above query:

+----+-------------+--------------+-------+-----------------+----------+---------+------+--------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+-----------------+----------+---------+------+--------+------------------------------------+
| 1 | SIMPLE | ip_addresses | range | ip_start,ip_end | ip_start | 8 | NULL | 884289 | Using index condition; Using where |
+----+-------------+--------------+-------+-----------------+----------+---------+------+--------+------------------------------------+

There are some 1.7 million rows in the table, and MySQL believes it will need to look at about half of them. The Using index condition is related to one of the new optimizations in MySQL 5.6: Index Condition Pushdown.

Judging by the performance of this query it seems like MySQL is, in fact, looking at all those rows. Ugh.

So, how can we make this better?

Trick 1: Coercing MySQL via Query Section

Consider this query and its corresponding EXPLAIN plan:

SELECT location_id
FROM ip_addresses
WHERE ip_start <= 2979311012
AND ip_end >= 2979311012
ORDER BY ip_end ASC
LIMIT 1;
+----+-------------+--------------+-------+-----------------+--------+---------+------+--------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+-----------------+--------+---------+------+--------+------------------------------------+
| 1 | SIMPLE | ip_addresses | range | ip_start,ip_end | ip_end | 8 | NULL | 884289 | Using index condition; Using where |
+----+-------------+--------------+-------+-----------------+--------+---------+------+--------+------------------------------------+

So, the only difference here is that now we’re using the ip_end index, instead of the ip_start index. Interestingly though, this query is so fast that MySQL just reports the execution time as 0.00 seconds! MySQL is clearly being much smarter and is obviously able to early-out quickly instead of actually looking at all those rows – even though it doesn’t realize it will be able to do so when preparing the execution plan.

Running this query 100,000 times on my latop takes about 12.6 seconds – or about 0.000126 seconds per query. That’s excellent! So, we’re done, right? Well… No, not quite.

While this query will always return the same results as the original, it just feels a bit ugly – and, one has to wonder if perhaps we can’t do a bit better in terms of performance.

Trick 2: FORCE INDEX, To The Rescue!

Consider:

SELECT location_id
FROM ip_addresses FORCE INDEX(ip_end)
WHERE ip_start <= 2979311012
AND ip_end >= 2979311012
LIMIT 1;
+----+-------------+--------------+-------+---------------+--------+---------+------+--------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+---------------+--------+---------+------+--------+------------------------------------+
| 1 | SIMPLE | ip_addresses | range | ip_end | ip_end | 8 | NULL | 884289 | Using index condition; Using where |
+----+-------------+--------------+-------+---------------+--------+---------+------+--------+------------------------------------+

So, no difference in query plan, but the use of FORCE INDEX makes it explicit and obvious that we are coercing the query planner behavior, which certainly helps with the maintainability of things. But what about the performance?

Running this query 100,000 times on my latop takes about 9 seconds – or about 0.00009 seconds per query. Fantastic! We’ve actually improved the already very fast query above by about 30%, while improving the clarity of the code!

What About Different Indexes?

In this case, it turns out that a different choice of indexes doesn’t help.

Strictly speaking, a UNIQUE INDEX on (ip_start, ip_end) should give MySQL more opportunity to optimize its execution strategy, and a covering index (adding location_id as the last column of the index) ought to reduce the number of random I/Os involved by letting MySQL return data directly from the index without going back to the actual table to fetch the row in order to obtain the location_id value.

Unfortunately, it seems the increased physical size of the indexes outweighs the gains.

Wrapping Up

So what have we learned? Just the obvious:

  1. MySQL’s query planner is still not terribly bright.
  2. What MySQL thinks it will need to do doesn’t always wind up being what it must do.
  3. Benchmark, benchmark, benchmark.

data, mysql 902 words, est. time: 180 seconds.

« The AAIA ACES Datasets An Unusual Case For Cucumber »

Comments

Copyright © 2016 - Jon Frisby - Powered by Octopress