Techniques for unique, correct and fast geo-queries

Posted on |

UPDATE: Better solution here.

Here’s the scenario: You have a lot (a million+) of geotagged addresses in a database and you want to show them as markers on Google Maps. Some further requirements are:

  • You only want to show some constant (10) amount at any given time, since too many markers clutters the map and kills the browser
  • You want the markers shown for a given frame to be selected randomly so that a user looking at some particular area with many geocoded addresses is not always shown the same 10 markers
  • You want the markers not to be on top of each other, so that even though many addresses in the database have the same coordinates (i.e. “Copenhagen, Denmark”), you only return one result per coordinate (this is to avoid placing two or more markers in the same spot)
  • You want it to run at interactive speed on a crummy box

Imagine you have a Documents table with columns including geoLng, geoLat representing geographical coordinates and intid, a unique integer identifier. @maxcount is the maximum number of rows desired while @maxLat, @minLat, @maxLng and @minLng define the corners of the map viewport. This query will do the job:

select *
from Documents dd
where dd.intid in
(
    select top (@maxcount) max(intid)
    from Documents d
    where d.geoLat < @maxLat
        and d.geoLat > @minLat
        and
        (
            ((@maxLng > @minLng) and (d.geoLng < @maxLng and d.geoLng > @minLng))
            or
            ((@maxLng < @minLng) and
            (
                (d.geoLng > @minLng and d.geoLng < 180) or
                (d.geoLng > -180 and d.geoLng < @maxLng))
            )
        )
    group by d.geoLng, d.geoLat
    order by newid()
)

“What’s this monkeying around with the longitudes?” I hear your cry? Well, if the map viewport straddles the International Dateline (which is somewhere in the Pacific), Google Maps will feed you viewport corners that “wrap around” and that has to be handled. If-elses in SQL is a mess, so it’s better to cook up some pure boolean voodoo like above. “Who the hell looks at Google Maps of the International Dateline?” you ask. Good point, but it’s actually not that uncommon. Looking at Japan at low zoom-levels will often provoke this behaviour, as will looking at French Polynesia. Note that this trick is not needed for latitudes because the maps don’t wrap vertically

The slowest bit will usually be the order by newid() part, which gives each row in the prospective result a new (random) guid, and then sorts all the rows based on this column. It can be replaced with tablesample calls which are much faster, but also a lot more erratic.

There’s a very slight problem with this query in that the max(intid) will always cause the same row to be returned for any given Lat/Lng coordinate. The use of max(intid) is completely arbitrary and min(intid) would work too. Had there been a rand(intid) aggregate, the problem would have been solved. I haven’t figured out an elegant solution to this problem yet. Note that max()doesn’t work on guids produced by newid().

To get this to perform, the tables in question are organised around a clustered (geoLat, geoLng, intid) index. You can see somewhat more elaborate versions of this query doing their thing at the TEDBot website.

Comments

Kasper Videbæk on

“Had there been a rand(intid) aggregate, the problem would have been solved.”

Maybe you could make an “aggregate” UDF, using this technique: http://weblogs.sqlteam.com/jeffs/articles/1490.aspx

Don’t know if it’d be what you call an elegant solution, though…

Reply

friism on

@Kasper: It’s not a bad bet, but user defined functions cannot call routines that alter state, and rand() counts as a state-altering routine…

Reply

Leave a Reply to friism Cancel reply

Your email address will not be published. Required fields are marked *