Predicting Twitter User Demographics using Distant Supervision from Website Traffic Data

Type Journal Article - Journal of Artificial Intelligence Research
Title Predicting Twitter User Demographics using Distant Supervision from Website Traffic Data
Author(s)
Volume 55
Publication (Day/Month/Year) 2016
Page numbers 389-408
URL http://www.jair.org/media/4935/accepted-4935-jair.pdf
Abstract
Understanding the demographics of users of online social networks has important applications
for health, marketing, and public messaging. Whereas most prior approaches
rely on a supervised learning approach, in which individual users are labeled with demographics
for training, we instead create a distantly labeled dataset by collecting audience
measurement data for 1,500 websites (e.g., 50% of visitors to gizmodo.com are estimated
to have a bachelor’s degree). We then fit a regression model to predict these demographics
from information about the followers of each website on Twitter. Using patterns derived
both from textual content and the social network of each user, our final model produces an
average held-out correlation of .77 across seven different variables (age, gender, education,
ethnicity, income, parental status, and political preference). We then apply this model
to classify individual Twitter users by ethnicity, gender, and political preference, finding
performance that is surprisingly competitive with a fully supervised approach.
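
As a rough illustration of the distant-supervision setup the abstract describes, the sketch below fits a regression from site-level follower features to a site-level demographic percentage and reports the held-out correlation. It is not the authors' implementation: the data are synthetic, and the ridge penalty `alpha`, the feature definition, and the helper names `fit_ridge` and `predict` are all assumptions made for this example.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept.

    Hypothetical helper for illustration; the record does not specify
    which regression variant the paper uses.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def predict(X, w, b):
    """Predict the demographic value for each row (website) of X."""
    return X @ w + b

# Synthetic stand-in data, NOT taken from the paper.
# Row i = one website; column j = assumed feature, e.g. the fraction of
# that site's Twitter followers who also follow account j.
rng = np.random.default_rng(0)
n_sites, n_features = 200, 20
X = rng.random((n_sites, n_features))
true_w = rng.normal(size=n_features)
# Target = site-level demographic estimate (e.g. share with a degree),
# here simulated as a noisy linear function of the features.
y = X @ true_w + rng.normal(scale=0.1, size=n_sites)

# Hold out the last 50 sites to measure generalization.
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

w, b = fit_ridge(X_train, y_train, alpha=1.0)
held_out_r = np.corrcoef(predict(X_test, w, b), y_test)[0, 1]
print(f"held-out correlation: {held_out_r:.2f}")
```

In this framing the labels come from aggregate audience statistics per website rather than from individually annotated users, which is what makes the supervision "distant"; classifying an individual user would then mean building the same kind of feature row for that user and passing it to `predict`.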
