Invisible Places: Road Network
Points of Interest (POIs), such as stores and restaurants, are everywhere, yet from a data science standpoint they remain invisible because they are hard to find, predict, and quantify. In this series, we examine relevant datasets to estimate how many POIs exist in the world.
With all the advances in data science and predictive analysis, there is still no universal approach to measuring the quality and coverage of POI datasets. It’s a difficult problem: the dataset is ever changing and not easily observable from satellite imagery or street-view photography. Unfortunately, without an objective and repeatable mechanism for analyzing this critical dataset, there is no way to compare existing data offerings or gauge whether progress is being made in a specific region. It is also impossible to compute the true value of a dataset. As data consumers, companies are left to wonder what they are really paying for when purchasing large sets of POI data. With only raw numbers, such as the total number of records in the set, it’s hard to gauge data freshness, saturation, or validity. A comprehensive scoring system is needed to allow comparison across different datasets, across regions within the same dataset, or of the same region over time.
StreetCred is laying the groundwork for establishing this POI data scoring system, which will become the driving force for the StreetCred network. We are building analysis models and visualizations that will empower the community with insights about their collected data. With these tools in hand, consumers and contributors alike will be able to direct their valuable resources towards the areas in which they will have the most impact: consumers can see areas where the data is exceptional, and contributors can see where to anticipate the greatest earning potential.
Build the Scale
To score something, a scale must exist. We need to identify the worst and best possible outcomes. In the case of POI data coverage, the bottom of the scale is obviously no data whatsoever. Of course, the opposite end of the scale should be all the data. Unfortunately, there is currently no way to know what “all the data” actually means. Is it a million POIs? Is it a billion? Can we compute this more granularly, for, say, a city? When Google claims over 100M places, how can we tell if that’s everything? If that’s the total for the entire world, what’s the total for a country, a city, a neighborhood? Consumers of this type of data have grown accustomed to judging it by these absolute numbers and by anecdotal experiences (search for a Starbucks where you know one to exist).
It doesn’t have to be this way! We just need to begin predicting the number of POIs across the globe. So that’s exactly what we’ve started doing. This will be an iterative process with many layers of data interwoven together to produce the most accurate estimates of POI counts, but our initial computations are encouraging!
Let’s Hit the Streets
It’s safe to assume that POI counts correlate with a few key factors in a given region. We decided to start with roads and see how the road network can help estimate POI counts in a few large cities across the world. We were fortunate to have access to one of the most complete road network datasets in existence: OpenStreetMap (OSM)! According to a recent analysis, OSM has significant global coverage and “provides the only global-level, openly licensed source of geospatial road data” (1).
Our assumption is relatively simple: a POI can’t exist without a road that leads patrons to it. Imagine a convenience store, barber shop, gym, or hospital without a road, street, or even footpath leading visitors to it. So we set off to analyze roads in a few cities to see what they potentially reveal about POIs.
Now is probably a good time to show you some pretty maps. If you’d like to know more about the method we used to derive them, read on below.
We started with New York City, because it’s home (and it’s got many roads)! According to our analysis, there are at least 185K POIs just waiting to be discovered.
NEW YORK, NY
What's With the Rectangles?
You’ll notice the map uses rectangles to break up the area. These are geohashes (play around with them here, thanks missinglink!). They provide a convenient mechanism for tiling the world in such a way that each resulting tile has a unique code to identify it. The number of characters in the code, referred to as the precision, dictates the size of the rectangle. We’ve chosen to go with a 6-character geohash for the purposes of this analysis, which results in rectangles of roughly 1.22 km × 0.61 km. Performing this analysis over smaller uniform pieces allows us to drill into the map and see variation in POI presence, instead of seeing just an absolute number representing the final total.
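To make the tiling concrete, here is a minimal sketch of the standard geohash encoding in plain Python. It is not the code used for the maps in this post; the function names `geohash_encode` and `cell_size_degrees` are made up for illustration, and the sample coordinate is just a commonly used test point.

```python
# Standard geohash alphabet (base 32, without the letters a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def cell_size_degrees(precision=6):
    """Return (width, height) of a geohash cell in degrees of longitude and latitude."""
    total_bits = 5 * precision
    lon_bits = (total_bits + 1) // 2  # longitude gets the extra bit when the total is odd
    lat_bits = total_bits // 2
    return 360.0 / 2 ** lon_bits, 180.0 / 2 ** lat_bits


tile = geohash_encode(57.64911, 10.40744, precision=6)
width_deg, height_deg = cell_size_degrees(6)
print(tile, width_deg, height_deg)
```

At precision 6 each cell spans about 0.011° of longitude by 0.0055° of latitude, which works out to roughly 1.22 km × 0.61 km at the equator. The east-west width shrinks as you move toward the poles.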
Time for another map break! This time, let’s check out Berlin, Germany, weighing in at 127K POIs. It’s interesting to note that Berlin has many large parks, which our analysis correctly predicts have few or no POIs. These sparse tiles bring the average POI count per tile to only 59.6, roughly half of New York’s 111. The densest tiles are comparable, though, at 391 for Berlin and 473 for New York.
Let’s keep this map party rolling with London, United Kingdom! It’s the largest city we analyzed, at 1,570 km², and falls very much in line with Berlin’s stats, with an average POI count of 74.4 per tile and the densest tile coming in at 370. As expected given the sheer size of the city, the total estimated number of POIs is the highest yet, at almost 271K.
Finally, let’s have a look at Seoul! We’ve saved this one for last because here the road network proved to be the densest, resulting in the highest maximum and average POI estimates per geohash tile. The total is 162K, which is surprisingly close to New York’s, given the difference in area: 605 km² for Seoul versus 789 km² for New York.
Here’s a breakdown of the stats for the cities we’ve analyzed in this post.
|City||Estimated POIs||Average POIs per tile||Max POIs per tile||Area|
|New York||185K||111||473||789 km²|
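One way to compare cities of different sizes is to divide each estimated total by the city’s area. The short sketch below does that using only the rounded totals and areas quoted in this post. Berlin is left out because its area isn’t given here, and the `poi_density` helper is a name made up for illustration.

```python
# Rounded figures quoted in this post: (estimated POIs, area in km²).
cities = {
    "New York": (185_000, 789.0),
    "London": (271_000, 1570.0),  # the post says "almost 271K"
    "Seoul": (162_000, 605.0),
}


def poi_density(total_pois, area_km2):
    """Estimated POIs per square kilometer."""
    return total_pois / area_km2


densities = {name: poi_density(total, area) for name, (total, area) in cities.items()}

# Print the cities from densest to sparsest.
for name, density in sorted(densities.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {density:.0f} estimated POIs per km²")
```

With these rounded inputs, Seoul comes out densest, followed by New York and then London, which is consistent with Seoul’s total being close to New York’s despite its smaller area.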
The Scenic Route
As promised, we’d like to share more about our computations in the hopes of generating discussion amongst our fellow POI lovers.
We’re assuming that POIs mostly exist along roads. But we don’t just assume uniform co-location; we associate the frequency at which they can be found with the road type. OpenStreetMap contributors tag roads using the highway tag, based on their importance in the network and accessibility to pedestrians. We used this classification to group the roads by type. Once grouped, we summed the lengths of the roads of each type to arrive at the total length per road type within each geohash tile. We then picked a few representative roads from each type and computed the frequency with which POIs show up on them; some good ol’ fashioned walking around with a measuring wheel did the trick. Not all road types are assumed to have a POI presence. For example, footways were mostly found within parks, where we wouldn’t expect to find many, if any, POIs.
These are the numbers we arrived at and subsequently used in further computations.
|Highway Type||POI per km|
With those coefficients in hand, all that was left was some math. The total length computed earlier for each highway type is multiplied by a coefficient representing the average number of POIs per kilometer for that type. Summing up the results for all the road types gives us the total estimate for each geohash. Summing up the totals of all geohashes covering the city gives us the total estimate for the city.
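The computation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our actual pipeline. The coefficients in `POI_PER_KM` are hypothetical placeholders rather than the measured values, and the tile codes and segment lengths are made-up sample data. The function names are invented for this example.

```python
from collections import defaultdict

# HYPOTHETICAL coefficients (POIs per km) keyed by OSM highway type.
# These are placeholders for illustration, not the measured values.
POI_PER_KM = {
    "primary": 20.0,
    "secondary": 15.0,
    "residential": 5.0,
    "footway": 0.0,  # road types assumed to have no POI presence get zero
}


def length_by_tile_and_type(segments):
    """Sum road length (km) per geohash tile and highway type.

    `segments` is an iterable of (geohash, highway_type, length_km) tuples.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for tile, highway, length_km in segments:
        totals[tile][highway] += length_km
    return totals


def estimate_tile(lengths_km, coefficients):
    """Estimate POIs in one tile: sum of length × POIs-per-km over road types."""
    return sum(
        length * coefficients.get(highway, 0.0)  # unknown types contribute nothing
        for highway, length in lengths_km.items()
    )


def estimate_city(segments, coefficients):
    """Return (per-tile estimates, city total) for a collection of road segments."""
    per_tile = {
        tile: estimate_tile(lengths, coefficients)
        for tile, lengths in length_by_tile_and_type(segments).items()
    }
    return per_tile, sum(per_tile.values())


# Made-up sample data: (geohash tile, highway type, length in km).
segments = [
    ("dr5ru7", "primary", 1.5),
    ("dr5ru7", "residential", 2.0),
    ("dr5ru7", "footway", 0.8),
    ("dr5ru6", "secondary", 1.0),
    ("dr5ru6", "primary", 0.5),
]

per_tile, city_total = estimate_city(segments, POI_PER_KM)
print(per_tile)    # estimate per geohash tile
print(city_total)  # estimate for the whole sample
```

Keeping the coefficients in a plain dictionary makes it easy to swap in different values per road type, or per city, as more ground measurements become available.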
The Road Ahead
We know this is a very naive analysis, and on its own it doesn’t represent the final answer. But it’s a good starting point. Next steps will involve layering in additional data, such as building height, known addresses (lookin’ at you, OpenAddresses), population, and much, much more. We’d love to hear suggestions from the community on what other datasets might correlate with POI count so we can experiment with all of them and share our findings with you. As we build a pipeline to perform this analysis on the entire world, and not just a few cities as we’ve done so far, we’ll be doing so in the open and invite your contributions!
1. Christopher Barrington-Leigh, Adam Millard-Ball, The world’s user-generated road map is more than 80% complete, August 10, 2017, http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0180698