Moment of (Ground) Truth

Chris Shughrue - August 7, 2020

Point-of-interest data can be unreliable, suffering from inaccurate sourcing and staleness. StreetCred’s real-time crowdsourced collection infrastructure seeks to solve these problems by collecting a continuous stream of new information. In this blog post, we’ll put the accuracy of StreetCred’s validation system to the test using approximately 1,500 ground-truthed place records. We’ll also show how we use record credibility metrics and predictive modeling to raise accuracy further and meet our partners’ quality standards.

Best foot forward

In conventional point-of-interest datasets, it’s difficult to tell good records from bad because record-level sourcing information is limited. By contrast, StreetCred has developed a probabilistic credibility network that continuously assesses the quality of each record based on the strength of evidence available throughout the entire dataset. Records with multiple contributions, and those from highly trusted users, receive high quality scores. Other records start out with a low quality score, which improves over time as contributing players build up their credibility and discrepancies are cleared up.
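To make the intuition concrete, here is a minimal sketch of how independent contributions might combine into a record-level score. It uses a noisy-OR rule, which is an illustrative stand-in and not a description of StreetCred’s actual credibility network; the function name and the trust values are hypothetical.

```python
from math import prod

def record_quality(contributor_trust):
    """Noisy-OR combination: probability that at least one contribution is reliable.

    contributor_trust is a list of per-contributor trust values in [0, 1].
    These inputs are hypothetical; the real network's inputs are not described here.
    """
    return 1.0 - prod(1.0 - t for t in contributor_trust)

# One contribution from a new, low-trust player.
single_new_user = record_quality([0.3])
# Three agreeing contributions from low-trust players.
three_new_users = record_quality([0.3, 0.3, 0.3])
# One contribution from a highly trusted player.
one_trusted_user = record_quality([0.9])

print(single_new_user, three_new_users, one_trusted_user)
```

Under this toy rule, a record's score rises either when more players corroborate it or when the players who touched it become more trusted, which matches the two routes to a high score described above.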

This innovation gives us a unique advantage. By including just the highest quality records, we can maximize the accuracy of any cut of data, while holding back more uncertain records until better evidence emerges.

Learning to be choosy

We use ground truth data to quantify the value added by this approach. We map the ground-truthed accuracy of each data quality sub-segment onto the full population of records. This lets us estimate the prevalence of accurate records throughout the data, rather than just in our sample.
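The mapping from sample to population can be sketched as a weighted average: each quality segment's sampled accuracy is weighted by that segment's share of all records. The segment names and every count below are made up for illustration and are not StreetCred figures.

```python
def population_accuracy(segments):
    """Weight each segment's ground-truthed accuracy by its population size."""
    total_population = sum(s["population_count"] for s in segments)
    weighted_correct = 0.0
    for s in segments:
        segment_accuracy = s["sample_correct"] / s["sample_total"]
        weighted_correct += segment_accuracy * s["population_count"]
    return weighted_correct / total_population

# Hypothetical quality segments: ground-truthed sample results plus full-population sizes.
segments = [
    {"name": "high", "sample_correct": 95, "sample_total": 100, "population_count": 2000},
    {"name": "medium", "sample_correct": 80, "sample_total": 100, "population_count": 5000},
    {"name": "low", "sample_correct": 60, "sample_total": 100, "population_count": 3000},
]

estimate = population_accuracy(segments)

# Unweighted accuracy of the pooled sample, for comparison.
naive = sum(s["sample_correct"] for s in segments) / sum(s["sample_total"] for s in segments)

print(round(estimate, 4), round(naive, 4))
```

In this toy data the sample has equal counts from each segment while the population does not, so the pooled sample accuracy overstates the population estimate. Weighting by segment size corrects for that.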

Accuracy vs sample size
Figure 1. Optimally ordered sample accuracy versus relative size of sample. Baseline accuracy (orange) represents records included in a sample at random, analogous to a conventional dataset without record-level quality. Optimal ordering using credibility network validation (black) increases accuracy by more than 5% for large samples of the dataset. Ordering boosted by predictive modeling increases accuracy over baseline by about 8% for a majority cut of the data.

Probabilistic and predictive quality modeling enables us to hold back under-ripe records until they graduate to export-ready status (Figure 1). Our approach orders records by quality assessment and selects the best records for a given cut of data. Credibility network validation quality scores alone lead to an out-of-the-box improvement of 5% in accuracy for a large cut of data (more than 50% of all records).
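The order-then-cut idea can be sketched in a few lines. The records below are invented (quality score, ground-truth accuracy) pairs, and `accuracy_at_cut` is a hypothetical helper, not part of any StreetCred tooling.

```python
def accuracy_at_cut(records, fraction):
    """Accuracy of the top `fraction` of records when ranked by quality score.

    records is a list of (quality_score, is_accurate) pairs.
    """
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    k = max(1, round(fraction * len(ranked)))
    top = ranked[:k]
    return sum(1 for _, is_accurate in top if is_accurate) / k

# Made-up records: (quality score, whether ground truth confirmed the record).
records = [
    (0.95, True), (0.90, True), (0.85, True), (0.70, True), (0.65, False),
    (0.60, True), (0.40, False), (0.35, True), (0.20, False), (0.10, False),
]

# A random sample has, in expectation, the accuracy of the whole dataset.
baseline = accuracy_at_cut(records, 1.0)
half_cut = accuracy_at_cut(records, 0.5)
top_fifth = accuracy_at_cut(records, 0.2)

print(baseline, half_cut, top_fifth)
```

Sweeping `fraction` from small to large traces out a curve of the same kind as the one in Figure 1: tighter cuts are more accurate, and the full dataset converges to the baseline.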

We enhanced this approach using a predictive model based on hallmarks of accurate records such as number of unique contributors, time since creation, and data discrepancies. This model identifies accurate records to include in the data cut. By adding this layer of prediction on top of validation assessment, we increase accuracy of the large sample by 8%.
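As a rough illustration of this kind of model, the sketch below fits a small logistic regression on the three hallmarks named above. The post does not say which model type StreetCred uses, so logistic regression is an assumption here, and the training rows are invented.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit logistic regression with batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def predict(weights, bias, x):
    """Predicted probability that a record is accurate."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Invented rows: (unique contributors, age in hundreds of days, open discrepancies).
rows = [
    (1, 0.1, 2), (1, 0.5, 1), (2, 0.2, 2), (1, 0.2, 3), (2, 0.1, 3),
    (3, 1.0, 0), (4, 2.0, 0), (2, 1.5, 1), (5, 0.5, 0), (3, 0.3, 1),
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # 1 = confirmed accurate by ground truth

weights, bias = fit_logistic(rows, labels)

well_vetted = predict(weights, bias, (4, 1.0, 0))
weakly_supported = predict(weights, bias, (1, 0.1, 3))
print(well_vetted > weakly_supported)
```

The predicted probabilities can then serve as the ordering key when selecting a cut, in place of or alongside the validation score.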

For more conservative data needs, our models can segment on the most well-vetted records to increase accuracy above baseline by as much as 13%. These findings illustrate how StreetCred uses information about the process by which data is contributed to effectively segment and deliver the highest quality data in real time.

Better with age

In conventional static datasets, the quality of records declines immediately upon collection as places change. StreetCred’s data, however, gets better over time.

Our credibility validation network uses new information contributed by players to make existing records more accurate in real time. At the same time, each new contribution sharpens our understanding of quality throughout the dataset, because we can more accurately estimate which contributions are good and which need additional interaction.

This dynamic leads to a greater population of export-ready data, with even greater accuracy. For example, in two recent weeks, new data contributions expanded the quantity of export-ready records by 13% and improved accuracy by 2%.

Perpetually improving

Continuous data collection and validation enable StreetCred to create the highest quality datasets. This ground truth analysis illustrates how effective this advantage can be compared to static data lacking record-level quality metrics. With those metrics, we can segment our dataset to meet a partner’s quality standard.

We’re continuing to refine this effort by expanding our ground truth data and developing new machine learning methods to efficiently segment it. Experiments such as this illustrate how StreetCred is able to transform a raw stream of information into a data deliverable that improves itself in real time.