# Invisible Places

**Diana Shkolnikov, Lily He**
- August 1, 2018

*Points of Interest (POIs), such as stores and restaurants, are everywhere, yet from a data science standpoint they remain invisible due to various factors that make them hard to find, predict, and quantify. In this series, we examine relevant datasets to estimate how many POIs exist in the world.*

With all the advances in data science and predictive analysis, there is still no universal approach to measuring the quality and coverage of POI datasets. It's a difficult problem: the dataset is ever-changing and not easily observable from satellite imagery or street-view photography. Unfortunately, without an objective and repeatable mechanism for analyzing this critical dataset, there is no way to compare existing data offerings or gauge whether progress is being made in a specific region. It is also impossible to compute the true value of a dataset. As data consumers, companies are left to wonder what they are really paying for when purchasing large sets of POI data. Raw numbers, such as the total number of records in the set, say little about data freshness, saturation, or validity. A comprehensive scoring system is needed to allow comparison across different datasets, across regions within the same dataset, or across the same region over a period of time.

StreetCred is laying the groundwork for establishing this POI data scoring system, which will become the driving force for the StreetCred network. We are building analysis models and visualizations that will empower the community with insights about their collected data. With these tools in hand, consumers and contributors alike will be able to direct their valuable resources towards the areas in which they will have the most impact: consumers can see areas where the data is exceptional, and contributors can see where to anticipate the greatest earning potential.

## Build the Scale

In order to score something, a scale must exist. We need to identify the worst and best possible outcomes. In the case of POI data coverage, the bottom of the scale is obviously no data whatsoever. Of course, the opposite end of the scale should be all the data. Unfortunately, there is currently no way to know what "all the data" actually means: is it a million POIs? Is it a billion? Can we compute this more granularly, say for a city? When Google claims over 100M places, how can we tell if that's everything? If that's the total for the entire world, what's the total for a country, a city, a neighborhood? Consumers of this type of data have grown accustomed to relying on these absolute numbers and on anecdotal checks (search for a Starbucks where you know one to exist).

It doesn't have to be this way! We just need to begin predicting the number of POIs across the globe. So that's exactly what we've started doing. This will be an iterative process with many layers of data interwoven together to produce the most accurate estimates of POI counts, but our initial computations are encouraging!
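To make the idea of a scale concrete, here is a minimal sketch of what a coverage score could look like once an estimate of "all the data" exists. The function name `coverage_score`, the clamping to 1.0, and the example numbers are all our own assumptions for illustration, not a finished scoring system.

```python
def coverage_score(observed_pois: int, estimated_pois: float) -> float:
    """Return a score between 0.0 (no data) and 1.0 (all the estimated data).

    Hypothetical helper: `observed_pois` is the number of POIs a dataset
    contains for a region, `estimated_pois` is the predicted total for
    that same region.
    """
    if observed_pois < 0:
        raise ValueError("observed_pois must not be negative")
    if estimated_pois <= 0:
        raise ValueError("estimated_pois must be positive")
    # Clamp at 1.0: a dataset can exceed an estimate that is too low.
    return min(observed_pois / estimated_pois, 1.0)


# Made-up numbers: a dataset with 600 POIs in a region estimated to hold 1,000.
example_score = coverage_score(600, 1000)
print(f"coverage: {example_score:.0%}")
```

Because the score is a ratio, it can be compared across datasets, across regions, or for one region over time, which is the kind of comparison raw record counts don't allow.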

## Let's Hit the Streets

It's safe to assume that POI counts correlate with a few key factors in a given region. We decided to start with roads and see how the road network can help estimate POI counts in a few large cities across the world. We were fortunate to have access to one of the most complete road network datasets in existence: OpenStreetMap (OSM)! According to a recent analysis, OSM has significant global coverage and "provides the only global-level, openly licensed source of geospatial road data" (1).

Our assumption is relatively simple: **a POI can't exist without a road that leads patrons to it**. Imagine a convenience store, barber shop, gym, or hospital without a road, street, or even footpath leading visitors to it. So we set off to analyze roads in a few cities to see what they potentially reveal about POIs.

Now is probably a good time to show you some pretty maps. If you'd like to know more about the method we used to derive them, read on below.

We started with New York City, because it's home (and it's got many roads)! According to our analysis, there are at least 185K POIs just waiting to be discovered.

## What's With the Rectangles?

You'll notice the map uses rectangles to break up the area. These are geohashes (play around with them here, thanks missinglink!). They provide a convenient mechanism for tiling the world in such a way that each resulting tile has a unique code to identify it. The number of characters in the code, referred to as the precision, dictates the size of the rectangle. We've chosen a 6-character geohash for this analysis, which results in rectangles of roughly 1.22 km × 0.61 km at the equator. Performing this analysis over smaller uniform pieces lets us drill into the map and see variation in POI presence, instead of seeing just an absolute number representing the final total.
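For readers who haven't worked with geohashes, here is a small, self-contained sketch of the standard encoding and of how the tile size follows from the precision. The function names and the 111.32 km-per-degree approximation are our own choices for illustration; a production pipeline would more likely use an existing geohash library.

```python
import math

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)
KM_PER_DEGREE = 111.32  # approximate length of one degree at the equator


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a coordinate by repeatedly halving the lon/lat ranges."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def tile_size_km(precision: int, lat: float = 0.0) -> tuple:
    """Approximate (width, height) in km of a geohash tile at a latitude."""
    total_bits = 5 * precision
    lon_bits = (total_bits + 1) // 2  # longitude gets the extra bit, if any
    lat_bits = total_bits // 2
    width = 360.0 / 2 ** lon_bits * KM_PER_DEGREE * math.cos(math.radians(lat))
    height = 180.0 / 2 ** lat_bits * KM_PER_DEGREE
    return width, height


print(geohash_encode(57.64911, 10.40744))  # a commonly used reference point
print(tile_size_km(6))                     # tile size at the equator
print(tile_size_km(6, lat=40.7))           # narrower at New York's latitude
```

Note that the tile height stays constant while the width shrinks with the cosine of the latitude, so tiles covering the cities in this post are somewhat narrower than tiles at the equator.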

Time for another map break! This time, let's check out Berlin, Germany, weighing in at 127K POIs. It's interesting to note that Berlin has many large parks, which our analysis correctly predicts have few to no POIs. These sparse tiles bring the average POI count per tile down to only 59.6, roughly half of New York's 111. The densest tiles are comparable, though, at 391 and 473 for Berlin and New York, respectively.

*Berlin, DE*

Let's keep this map party rolling with London, United Kingdom! It's the largest city we analyzed, at 1,570 km², and falls very much in line with Berlin's stats, with an average POI count of 74.4 and the densest tile coming in at 370. As expected given the sheer size of the city, the total estimated number of POIs is the highest yet, at almost 271K.

Finally, let's have a look at Seoul! We've saved this one for last because here the road network proved to be the densest, resulting in the highest maximum and average POI estimates per geohash tile. The total is 162K, which is surprisingly close to New York's given the difference in area: 605 km² for Seoul versus 789 km² for New York.

Here's a breakdown of the stats for the cities we've analyzed in this post.

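| City | Total POIs | Avg per tile | Max per tile | Area |
| --- | --- | --- | --- | --- |
| New York | 185K | 111 | 473 | 789 km² |
| Berlin | 127K | 60 | 391 | 892 km² |
| Seoul | 162K | 140 | 498 | 605 km² |
| London | 271K | 75 | 370 | 1,570 km² |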
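The totals are easier to compare once they are normalized by area. The snippet below is a quick sketch that derives an estimated POI density per square kilometer from the rounded figures quoted in this post; the names `CITY_STATS` and `poi_density_per_km2` are ours, and because the inputs are rounded, the results are only approximate.

```python
# (estimated total POIs, area in km²), taken from the rounded stats in this post
CITY_STATS = {
    "New York": (185_000, 789),
    "Berlin": (127_000, 892),
    "Seoul": (162_000, 605),
    "London": (271_000, 1570),
}


def poi_density_per_km2(total_pois: float, area_km2: float) -> float:
    """Estimated POIs per square kilometer."""
    return total_pois / area_km2


densities = {
    city: poi_density_per_km2(total, area)
    for city, (total, area) in CITY_STATS.items()
}

# Print cities from densest to sparsest.
for city, density in sorted(densities.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{city:<10} {density:6.1f} POIs per km²")
```

Seen this way, Seoul comes out ahead of New York despite its smaller total, which is consistent with its denser road network.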

## The Scenic Route

As promised, we'd like to share more about our computations in the hopes of generating discussion amongst our fellow POI lovers.

We're assuming that POIs mostly exist along roads. But we don't just assume uniform co-location; we tie the frequency at which POIs can be found to the road type. OpenStreetMap contributors tag roads using the `highway` tag, based on their importance in the network and accessibility to pedestrians. We used this classification to group the roads by type. Once grouped, we summed the lengths of the roads of each type to arrive at the total length of each road type within each geohash tile. We then picked a few representative roads from each category and computed the frequency with which POIs show up on them; some good ol' fashioned walking around with a measuring wheel did the trick. Not all road types are assumed to have a POI presence: footways, for example, were mostly found within parks, where we wouldn't expect to find many POIs, if any.
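The grouping-and-summing step can be sketched roughly as follows. This is an illustration under simplifying assumptions, not our actual pipeline: each road is assumed to be already assigned to a single geohash tile (real roads cross tile boundaries and would need to be split), lengths are measured with the haversine formula, and the tile code and coordinates in `sample_ways` are made up.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def way_length_km(coords):
    """Length of a road given as an ordered list of (lat, lon) points."""
    return sum(haversine_km(p, q) for p, q in zip(coords, coords[1:]))


def length_by_tile_and_type(ways):
    """Sum road length per (tile, highway type).

    `ways` is an iterable of (tile, highway, coords) tuples, where `tile`
    is the geohash the road was assigned to (assumed precomputed).
    """
    totals = defaultdict(float)
    for tile, highway, coords in ways:
        totals[(tile, highway)] += way_length_km(coords)
    return dict(totals)


# Made-up sample: two residential roads and one primary road in one tile.
sample_ways = [
    ("dr5ru7", "residential", [(40.7500, -73.9900), (40.7510, -73.9900)]),
    ("dr5ru7", "residential", [(40.7500, -73.9890), (40.7505, -73.9890)]),
    ("dr5ru7", "primary", [(40.7500, -73.9880), (40.7520, -73.9880)]),
]
for (tile, highway), km in length_by_tile_and_type(sample_ways).items():
    print(tile, highway, round(km, 3), "km")
```

The result is one total length per road type per tile, which is the quantity the POI frequencies are later applied to.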

These are the POI-per-kilometer frequencies we arrived at and subsequently used in further computations.

| Highway type | POIs per km |
| --- | --- |
| `primary` | 74 |
| `secondary` | 46 |
| `tertiary` | 35 |
| `pedestrian` | 8 |
| `residential` | 8 |
| `unclassified` | 3 |

With those coefficients in hand, all that was left was some math. The total length computed earlier for each highway type is multiplied by the coefficient representing the average number of POIs per kilometer for that type. Summing the results for all the road types gives us the estimate for each geohash tile. Summing the estimates of all the tiles covering the city gives us the city's total.
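The arithmetic described above fits in a few lines. The sketch below uses the coefficients from this post; the function names, the tile labels, and the road lengths in `sample_tiles` are made up for illustration, and road types without a coefficient (such as footways) are assumed to contribute zero POIs.

```python
# Average POIs per kilometer for each OSM highway type (from this post).
POI_PER_KM = {
    "primary": 74,
    "secondary": 46,
    "tertiary": 35,
    "pedestrian": 8,
    "residential": 8,
    "unclassified": 3,
}


def estimate_tile_pois(length_km_by_type):
    """Estimate POIs in one tile from its road length (km) per highway type."""
    return sum(
        POI_PER_KM.get(highway, 0) * km  # unknown types contribute nothing
        for highway, km in length_km_by_type.items()
    )


def estimate_city_pois(tiles):
    """Sum the per-tile estimates for every tile covering the city."""
    return sum(estimate_tile_pois(lengths) for lengths in tiles.values())


# Made-up road lengths for two tiles.
sample_tiles = {
    "tile_a": {"primary": 1.0, "residential": 2.5, "footway": 4.0},
    "tile_b": {"secondary": 0.5, "unclassified": 2.0},
}

for tile, lengths in sample_tiles.items():
    print(tile, estimate_tile_pois(lengths))
print("city total:", estimate_city_pois(sample_tiles))
```

Keeping the coefficients in a single lookup table makes it easy to swap in refined values as more roads are surveyed.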

## The Road Ahead

We know this is a very naive analysis that doesn't, on its own, represent the final answer. But it's a good starting point. Next steps will involve layering in additional data, such as building height, known addresses (lookin' at you, OpenAddresses), population, and much, much more. We'd love to hear suggestions from the community on what other datasets might correlate with POI count so we can experiment with them and share our findings with you. As we build a pipeline to perform this analysis on the entire world, and not just a few cities as we've done so far, we'll be doing so in the open and invite your contributions!

1. Christopher Barrington-Leigh, Adam Millard-Ball, *The world's user-generated road map is more than 80% complete*, August 10, 2017, http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0180698