# Validating Validation

**Chris Shughrue**
- August 16, 2019

Core to the StreetCred data collection process is bringing together multiple, independent users to create and validate the existence of places, or points of interest (POIs). Validation is a multi-step process requiring a consensus among user submissions. We wanted to follow up on the intuition behind this validation system by assessing the accuracy of user-generated data and the combined accuracy of multiply validated POIs.

To address this issue, we built a probabilistic model to solve two kinds of problems:

- How do you assess data quality in the absence of ground truth training data?
- How can you aggregate submissions from the community, given that different users might have varying levels of attentiveness and overall quality?

The approach we take is: do both at once. The idea is to simultaneously assess the quality of our users and POI data. This plays out in an iterative three-step simulation (see How did we do this? below). We use the results of this validation approach to assess the accuracy of a subset of StreetCred POIs and users.

## How did we do?

Our results are provisional for now, reflecting a test dataset drawn from recent POI additions. The accuracy of users (described in detail at the end) represents the rate at which they submit correct data. From this test set, user accuracy averages around 80%. This is based on the accuracy assessment of 84.3% of the 674 users in the test data (excluding users who have only submitted non-validated data).

*Figure (scatter plot): Users produce similarly accurate data across experience level.*

It turns out that the community as a whole does a reliably great job (see scatter plot above). Users produce similarly accurate data across experience level, whether they are new to the app or are leaderboard champs.

The high accuracy rate of users translates into even higher accuracy of the POIs they create. This is the result of aggregating multiple corroborating data points from independent users. For example, if two users submit a matching data point for a place, the odds of both being wrong are lower than the odds of just one of them making a mistake.
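As a rough illustration, assume two independent users who are each correct 80% of the time (a round number close to the average reported above, not a measured value for any particular pair of users):

$$
P(\text{one given user is wrong}) = 1 - 0.8 = 0.2, \qquad P(\text{both users are wrong}) = 0.2 \times 0.2 = 0.04
$$

Both users submitting the *same* wrong value is a special case of both being wrong, so under these assumptions 0.04 is an upper bound on the chance that a matching pair of submissions is mistaken.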

*Figure: The average accuracy is 98.4% for approved POIs.*

Across all 42,000 POIs in the test data, estimated accuracy is above 75%, with the exception of one outlier (see figure above). These findings reflect the efficacy of our multi-user approach. By combining data across the community, StreetCred fundamentally improves the accuracy of the data it generates.

The distribution of POI accuracies reflects this underlying dynamic. The accuracy of POIs is bimodally distributed: one cluster corresponds to pending POIs (about 82% accuracy) and one highly accurate cluster corresponds to approved POIs (about 98% accuracy).

This can be further teased apart by looking at the distribution of accuracy by approved vs. pending status (see figure above). The average accuracy is 98.4% for approved POIs, and the distribution around the mean is extremely narrow, with most POI accuracies falling right around the mean. This suggests that validated POIs are nearly completely accurate.

For pending POIs, accuracy averages 84.8%, though with a wider distribution. Even pending POIs are generally quite accurate: more than one third of pending places have an estimated accuracy above 90%.

For now, we have used this model to get a better understanding of the data generated by the community. There are a number of caveats and assumptions built into this version of the model, and this will require further refinement before we can draw broader conclusions about the overall dataset. As a first pass, this probabilistic approach illustrates how the quality of user submissions translates into even better results at the community level.

## How did we do this?

The type of model we created is called an Expectation Maximization model and follows in the footsteps of other quantitative approaches for combining crowdsourced data.

*1. Most Likely POI Labels*

The first step is to build a consensus from all the user submissions about the true label for each data type (e.g., name, location, hours) of each POI. In practice, this is equivalent to the current method used for validation: for each type of data, take the label that at least two users have independently agreed on.

We add a slight twist to this approach by weighting votes, preferring answers from users who have historically provided accurate data (more on this in the next section). Specifically, the vote of user *i* is weighted by the log-odds of that user's accuracy rate, $p_i$:

$$
w_i = \log \frac{p_i}{1 - p_i}
$$

We choose the most likely label from this weighted voting routine as the tentative true label.
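The weighted vote can be sketched in a few lines of Python. This is a minimal illustration rather than the production routine: the function names, the `(user_id, label)` input shape, the clipping constant `eps`, and the fallback accuracy of 0.8 for unseen users are all assumptions made for the example.

```python
import math
from collections import defaultdict


def log_odds_weight(p, eps=1e-6):
    """Log-odds of an accuracy rate p, clipped away from 0 and 1 (eps is assumed)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def weighted_vote(submissions, accuracy, default_accuracy=0.8):
    """Return the label with the largest total log-odds weight.

    submissions: list of (user_id, label) pairs for one field of one POI.
    accuracy:    dict mapping user_id -> estimated accuracy rate p_i.
    Users missing from `accuracy` fall back to default_accuracy (an assumption).
    """
    totals = defaultdict(float)
    for user_id, label in submissions:
        p = accuracy.get(user_id, default_accuracy)
        totals[label] += log_odds_weight(p)
    # Ties resolve to whichever label was seen first.
    return max(totals, key=totals.get)


# Hypothetical submissions for the "name" field of a single POI.
submissions = [("ana", "Joe's Cafe"), ("ben", "Joes Coffee"), ("cy", "Joes Coffee")]

# One historically reliable user can outweigh two less reliable ones.
winner = weighted_vote(submissions, {"ana": 0.95, "ben": 0.6, "cy": 0.6})

# With equal accuracies the routine reduces to a plain majority vote.
majority = weighted_vote(submissions, {})
```

Note that a user with an accuracy below 0.5 receives a negative weight in this sketch, so their vote counts against the label they submitted.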

The user accuracy rate, *p*, represents the proportion of correct data out of all the data provided by a user. This interpretation of accuracy is closer to reliability, in that we are not comparing against a ground truth and cannot completely rule out a systematic bias. However, the typical user should be able to accurately report the real state of a POI, given the nature of the information being observed, so this type of bias is unlikely to affect our interpretation of the results.

*2. User Accuracy*

Using the tentative true labels from Step 1, we assess how well each user performed with the accuracy rating, p. We take p to be the mean of a Beta distribution updated with a running tally of correct and incorrect data for each user.

This parameter, *p*, is treated as a draw from a distribution because we can’t observe it directly. For example, if a user has submitted only one data point, it would be unrealistic to conclude that her accuracy is 100% or 0% from a single observation. Parameterizing a distribution on past performance lets us place more confidence in the estimate for highly active users.
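To make the Beta update concrete, here is a small sketch. The post does not state which prior is used, so the uniform Beta(1, 1) prior below is an assumption, as are the function and parameter names.

```python
def beta_mean_accuracy(correct, incorrect, prior_alpha=1.0, prior_beta=1.0):
    """Posterior mean of p after observing a tally of correct/incorrect data.

    Starts from a Beta(prior_alpha, prior_beta) prior (uniform by default,
    which is an assumption) and adds the tallies to its two parameters.
    """
    alpha = prior_alpha + correct
    beta = prior_beta + incorrect
    return alpha / (alpha + beta)


# A single correct submission moves the estimate to 2/3, not to 100%.
one_shot = beta_mean_accuracy(correct=1, incorrect=0)

# A long track record dominates the prior: 91/102, close to the raw 90% rate.
veteran = beta_mean_accuracy(correct=90, incorrect=10)
```

With no observations at all, the estimate is simply the prior mean (0.5 under the uniform prior assumed here), and it moves toward the observed ratio as the tally grows.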

*3. Repeat*

Recalculate the weighted majority vote in Step 1 using the updated user accuracies. Then repeat Step 2 with the updated labels to reassess user accuracies. This process continues until the labels and user accuracies stop changing. At that point, the labels represent the most likely true labels, and the user accuracies reflect how well each user's submissions agree with them.
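The three steps can be put together as a short loop. The sketch below is a hard-assignment variant (iterative weighted majority voting) written for illustration, not the model's actual implementation: the `(user_id, item_id, label)` input shape, the starting accuracy of 0.8, the Beta(1, 1) prior, and the stopping tolerance are all assumptions. It expects at least one submission.

```python
import math
from collections import defaultdict


def run_consensus(submissions, max_iters=50, prior=(1.0, 1.0), tol=1e-6):
    """Alternate between label voting and user-accuracy updates until stable.

    submissions: list of (user_id, item_id, label) triples, where an item is
                 one field of one POI (for example "poi1:name").
    Returns (labels, accuracy) as two dicts.
    """
    users = {u for u, _, _ in submissions}
    a, b = prior
    accuracy = {u: 0.8 for u in users}  # assumed starting point
    labels = {}
    for _ in range(max_iters):
        # Step 1: log-odds weighted vote for every item.
        totals = defaultdict(lambda: defaultdict(float))
        for u, item, label in submissions:
            p = min(max(accuracy[u], 1e-6), 1.0 - 1e-6)
            totals[item][label] += math.log(p / (1.0 - p))
        new_labels = {item: max(t, key=t.get) for item, t in totals.items()}

        # Step 2: Beta posterior mean from each user's agreement tally.
        correct, total = defaultdict(int), defaultdict(int)
        for u, item, label in submissions:
            total[u] += 1
            correct[u] += int(label == new_labels[item])
        new_accuracy = {u: (a + correct[u]) / (a + b + total[u]) for u in users}

        # Step 3: stop once labels and accuracies no longer move.
        shift = max(abs(new_accuracy[u] - accuracy[u]) for u in users)
        converged = new_labels == labels and shift < tol
        labels, accuracy = new_labels, new_accuracy
        if converged:
            break
    return labels, accuracy


# Hypothetical submissions: "cy" disagrees with the other users on two items.
data = [
    ("ana", "poi1:name", "A"), ("ben", "poi1:name", "A"), ("cy", "poi1:name", "B"),
    ("ana", "poi2:name", "C"), ("ben", "poi2:name", "C"), ("cy", "poi2:name", "D"),
    ("ana", "poi3:name", "E"), ("cy", "poi3:name", "E"),
]
labels, accuracy = run_consensus(data)
```

On this toy input the loop settles after two passes: the labels follow the two users who agree, and the dissenting user ends up with a lower estimated accuracy than the others.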

*Aggregate Accuracy*

Combining the labels of users into an ensemble produces a more accurate prediction overall. The theory to back up this intuition is that if two individuals independently provide the same answer to a question, and we know the rate at which they answer correctly (our parameter, p above), the accuracy of the combined answer, y*, is proportional to:

Here $\theta$ is the error rate $(1 - p)$ of the *k*th user, and takes a value of 0 if the *k*th label equals the majority label, $y_k = y^*$. We use an uninformative prior distribution for the prevalence of each candidate label, $P_0(y)$ (following Titov et al.). This is based on the assumption that the labels are conditionally independent of the actual underlying class, which is to say that the probability of submitting a correct answer is not affected by what the answer actually is.
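One common way to compute such a combined accuracy, shown here as a hedged sketch rather than the exact expression used in the model, is to score each candidate label by multiplying $p_k$ for every user who submitted it and $1 - p_k$ for every user who did not, and then normalize across candidates. The function name, the input shape, and the uniform prior over an explicit candidate list are assumptions. With exactly two candidate labels this matches a simple symmetric-error model, and with more candidates it is only an approximation.

```python
def label_posterior(votes, candidates):
    """Posterior probability of each candidate label under a uniform prior.

    votes:      list of (label, p) pairs, where p is the estimated accuracy
                of the user who submitted that label.
    candidates: list of labels to compare (assumed to be known up front).
    Assumes users err independently of one another.
    """
    scores = {}
    for y in candidates:
        score = 1.0  # the uniform prior is a constant factor and cancels out
        for label, p in votes:
            score *= p if label == y else (1.0 - p)
        scores[y] = score
    norm = sum(scores.values())
    return {y: s / norm for y, s in scores.items()}


# One user at 80% accuracy: the label is only as good as that user.
single = label_posterior([("A", 0.8)], candidates=["A", "B"])

# Two independent users at 80% who agree: 0.64 / (0.64 + 0.04).
pair = label_posterior([("A", 0.8), ("A", 0.8)], candidates=["A", "B"])
```

In this toy case a second agreeing user raises the estimated accuracy of label "A" from 0.8 to roughly 0.94, which is the aggregation effect described above.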

There are two major caveats underlying these provisional results. The first is that the test data may not reflect all of the disagreements in the data. Further refinement of the data set for this application is needed to bring it to scale. The second is that the model relies on some assumptions about the statistical independence of user data. The StreetCred platform inherently forces independence among users, but the potential for collaboration could weaken this assumption. This model represents a first step, and future refinements will increasingly support a robust accuracy assessment.