Taking the Bad with the Good: A Network Approach to Credible Crowdsourcing

Chris Shughrue - January 23, 2020

At StreetCred, data quality depends on our community of players. Crowdsourced data enables us to see the world through the collective eyes of thousands of people spread across the globe, but it can also pose challenges. We’ve developed analytics to ensure that our game is fair and our data is reliable.

In this post, we’ll walk through how we leverage our global community as a first line of defense against malicious users and bad data. Thanks to the credibility network model we’ve developed, we can simultaneously validate place data, assess the credibility of players, and identify bad actors.

Validation framework

We rely on players to provide a continuous pipeline of data about the places they map. Overlapping place data lets us validate the detailed attributes of each place, including its name, hours, and phone number. Each additional piece of data about a place paints a clearer picture of what’s going on.

By looking at this information as a network, we can validate place data and players at the same time. Learning how people play over time lets us weigh incoming data against what we know about each player’s reliability, based on their history and standing in the community. This helps us reconcile conflicting pieces of information about a place with more confidence.

Credibility network
Sample of StreetCred credibility network. Players (orange) are connected by overlapping data contributions (edges) to a common set of places (blue). We assess player credibility, estimate confidence in place data, and identify abusive users with probabilistic graphical models built on top of the network data.

Here we focus specifically on how our analysis of the credibility network and player reliability helps us identify malicious actors and bad data.

Place data confidence

For each attribute of a place (e.g., hours) our validation framework identifies the most likely value given every data point contributed. Our confidence in a particular value depends on the number of players who agree (or disagree) with the most likely value, as well as the historical reliability of those players.
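As a rough illustration of the idea, the sketch below picks the most likely value for one attribute with a reliability-weighted vote. This is a deliberately simple stand-in, not the probabilistic graphical model described in this post; the function name `most_likely_value`, the reliability scores, and the `default_reliability` for unknown players are all assumptions made for the example.

```python
from collections import defaultdict


def most_likely_value(observations, reliability, default_reliability=0.5):
    """Pick the value with the most reliability-weighted support.

    observations: list of (player_id, value) pairs for one attribute.
    reliability:  dict mapping player_id to a score in [0, 1] (hypothetical).
    Returns (value, confidence), where confidence is the winning value's
    share of the total weight.
    """
    weights = defaultdict(float)
    for player, value in observations:
        # Players we know nothing about get a neutral default weight.
        weights[value] += reliability.get(player, default_reliability)

    total = sum(weights.values())
    if total == 0:
        return None, 0.0

    best = max(weights, key=weights.get)
    return best, weights[best] / total


# Two reliable players agree on the hours; one less reliable player disagrees.
hours = [("ann", "9-5"), ("bob", "9-5"), ("cy", "10-6")]
scores = {"ann": 0.9, "bob": 0.8, "cy": 0.3}
value, confidence = most_likely_value(hours, scores)
print(value, round(confidence, 2))
```

In this toy version, confidence rises both when more players agree and when the agreeing players have a stronger track record, which mirrors the two factors named above.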

Low confidence place
Hypothetical example of significant disagreement between two users for all attributes of a single place. Typical places average >80% agreement among attributes. Occasional disagreement is expected, but significant disagreement might indicate low-quality data submissions.

We expect occasional disagreements due to typos and subjective differences (is Starbucks a cafe or a coffee shop?). But when players are doing their best, errors are relatively uncommon and essentially random. These errors do not significantly affect the average confidence in a place’s attributes because most attributes will agree with the consensus. Suspicious behavior is indicated by unusually low average confidence across the attributes of a place.
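To make the averaging concrete, here is a minimal sketch that rolls per-attribute confidence up to a place-level score and flags places that fall below a cutoff. The function names and the `threshold` of 0.6 are hypothetical choices for illustration, not values from our production system.

```python
def place_confidence(attribute_confidences):
    """Average the per-attribute confidence values for one place."""
    if not attribute_confidences:
        raise ValueError("a place needs at least one attribute")
    return sum(attribute_confidences.values()) / len(attribute_confidences)


def is_suspicious(attribute_confidences, threshold=0.6):
    """Flag a place whose average confidence is unusually low.

    The threshold is an assumed cutoff chosen for this example.
    """
    return place_confidence(attribute_confidences) < threshold


# One noisy attribute barely moves the average of an otherwise agreed place.
typical = {"name": 0.95, "hours": 0.6, "phone": 0.9}
# Disagreement on every attribute drags the whole place down.
contested = {"name": 0.5, "hours": 0.4, "phone": 0.45}

print(is_suspicious(typical), is_suspicious(contested))
```

A single contested attribute leaves the place well above the cutoff, while disagreement across the board does not.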

Player credibility

We apply the same logic described under Place data confidence to evaluate the behavior of specific players over time. Players act as a check on each other. As their data contributions build up, overlapping observations lead to a more strongly interconnected credibility network.

Low confidence user
Example of a player who consistently contributes data that is checked and rejected (red) by other StreetCred players. When averaged across multiple places, the hypothetical player has a significantly lower credibility score than others in the network.

Our credibility network is effective because most players are meticulous data creators. Our confidence in each player grows as they contribute more information that is then validated by the credibility network. If a player’s contributions are consistently rejected by the network as incorrect, their credibility score decreases. This allows us to minimize the adverse effects of poor data contributions on the overall dataset. We can also use this pattern to identify players who may be up to no good. An anomalously low player credibility score is a strong signal of unusual behavior and triggers additional scrutiny.
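One simple way to picture a score that firms up with more validated contributions is a smoothed acceptance rate. The sketch below is an assumption for illustration only: it uses a Beta-style prior and a z-score cutoff, neither of which is claimed to be the scoring used in our network model, and the names `credibility`, `anomalously_low`, and `z_cutoff` are hypothetical.

```python
from statistics import mean, pstdev


def credibility(accepted, rejected, prior_accept=1.0, prior_reject=1.0):
    """Smoothed share of a player's contributions accepted by the network.

    With no history the score sits at the prior (0.5 by default) and moves
    toward the observed acceptance rate as contributions accumulate.
    """
    return (accepted + prior_accept) / (
        accepted + rejected + prior_accept + prior_reject
    )


def anomalously_low(scores, z_cutoff=1.5):
    """Return players whose score is far below the community average.

    z_cutoff is an assumed value chosen for this example.
    """
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    if sigma == 0:
        return []
    return sorted(p for p, s in scores.items() if (s - mu) / sigma < -z_cutoff)


# Hypothetical (accepted, rejected) counts per player.
history = {
    "ann": (18, 2), "bob": (19, 1), "cy": (17, 3),
    "dee": (20, 0), "eli": (18, 2), "fay": (2, 18),
}
scores = {player: credibility(a, r) for player, (a, r) in history.items()}
print(anomalously_low(scores))
```

Flagged players would then be candidates for the additional scrutiny mentioned above rather than being penalized automatically.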

Not playing well with others

Not playing with the community at all is its own kind of anomaly. Community is central to how StreetCred works, and it serves as a check on player behavior. The more confident we are in a player, the more we trust their data. Our network model prefers data from validated, credible players over data contributed by not-yet-validated players or players with a poor track record.

Isolated user
Example of a player isolated from the community. It is unexpected for a player not to overlap with contributions from the rest of the community. Our confidence in the quality of their data is limited because we cannot assess their behavior. This pattern is atypical usage, and it is distinct from data flagged as anomalous for poor quality.

We analyze the structure of the credibility network to identify players whose data is isolated from the rest of the network, which may suggest malicious activity by a user attempting to avoid validation. Graph patterns such as this help us identify anomalous behavior that might not otherwise be detected.
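A minimal sketch of this structural check, assuming contributions are available as (player, place) pairs: a player counts as isolated when none of the places they mapped has a contribution from anyone else. The function name and the toy data are hypothetical, and a real analysis of the network would likely consider more than this one pattern.

```python
from collections import defaultdict


def isolated_players(contributions):
    """Find players who share no place with any other player.

    contributions: iterable of (player_id, place_id) pairs, i.e. the edges
    of the player-place network.
    """
    players_at = defaultdict(set)  # place -> players who contributed there
    places_of = defaultdict(set)   # player -> places they contributed to
    for player, place in contributions:
        players_at[place].add(player)
        places_of[player].add(place)

    return sorted(
        player
        for player, places in places_of.items()
        # Isolated: every one of this player's places has only them on it.
        if all(len(players_at[place]) == 1 for place in places)
    )


edges = [
    ("ann", "cafe"), ("bob", "cafe"),
    ("bob", "bank"), ("cy", "bank"),
    ("dee", "shed"), ("dee", "barn"),
]
print(isolated_players(edges))  # ['dee']
```

Here "dee" has mapped two places, but nobody else has touched either one, so the community has had no chance to validate that data.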

Content patterns

The StreetCred community is the first line of defense against malicious users. We’re also developing methods to directly analyze the content of data. As our database grows, so does our understanding of relationships in the data. For example, a restaurant’s hours might be related to its name. Patterns, distributions, and clusters that exist in the data can further help us quickly identify unusual and malicious behavior.

Keeping things honest

Weeding out bad actors is essential to keeping the game fair and to making the data as meaningful as possible. Network analytics and pattern analysis give us a head start, and we are working continuously to limit the opportunities for abuse. As our credibility network grows, so does the power of our analytics to separate the good from the bad.