As researchers are mining Facebook and Twitter data to learn about online and offline human behaviour, a new study warns them to be wary of serious pitfalls that arise when working with huge social media data sets.
Erroneous results drawn from such data can have huge implications, as thousands of research papers each year are now based on data gleaned from social media.
"Publicly available data feeds used in social media research do not always provide an accurate representation of the platform's overall data - and researchers are generally in the dark about when and how social media providers filter their data streams," explained Derek Ruths, assistant professor at McGill University in Montreal, Canada.
"A large number of spammers and bots, which masquerade as normal users on social media, get mistakenly incorporated into many measurements and predictions of human behaviour," Ruths said.
The design of social media platforms can dictate how users behave and, therefore, what behaviour can be measured.
"For instance, on Facebook the absence of a 'dislike' button makes negative responses to content harder to detect than positive 'likes'," added study co-author Jurgen Pfeffer of Carnegie Mellon University's Institute for Software Research.
Researchers often report results for groups of easy-to-classify users, topics and events - making new methods seem more accurate than they actually are.
For instance, efforts to infer the political orientation of Twitter users achieve barely 65 percent accuracy for typical users, even though studies focusing on politically active users have claimed 90 percent accuracy, the authors contended.
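To see how the choice of test group can shift a reported figure, consider a minimal sketch in Python. The records below are hypothetical counts chosen only to mirror the 90 percent and 65 percent figures quoted above; they are not data from the study, and the function names are illustrative rather than anything the authors used.

```python
def accuracy(pairs):
    """Fraction of (predicted, actual) pairs that agree."""
    return sum(predicted == actual for predicted, actual in pairs) / len(pairs)


def accuracy_by_group(records):
    """Compute accuracy separately for each user group."""
    groups = {}
    for group, predicted, actual in records:
        groups.setdefault(group, []).append((predicted, actual))
    return {group: accuracy(pairs) for group, pairs in groups.items()}


# Hypothetical toy records: (user group, predicted label, actual label).
# Counts are assumptions picked to match the percentages in the article.
records = (
    [("politically_active", "left", "left")] * 90
    + [("politically_active", "left", "right")] * 10
    + [("typical", "left", "left")] * 65
    + [("typical", "left", "right")] * 35
)

per_group = accuracy_by_group(records)
overall = accuracy([(predicted, actual) for _, predicted, actual in records])

print(per_group)
print(overall)
```

In this toy case, a paper that tested only the politically active group would report the higher number, while the same method applied to typical users would do far worse. Reporting accuracy per group, alongside any pooled figure, makes that gap visible.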
"The common thread in all these issues is the need for researchers to be more acutely aware of what they are actually analysing when working with social media data," Ruths concluded.
The article appeared in the journal Science.