I submitted this paper for my “Python for Social Data Science” class during my MSc at Oxford. Final grade: 70
Benford’s Law, or the law of anomalous numbers, is a concept that was forgotten for decades but has recently gained popularity in the digital age. The concept is simple: naturally occurring collections of numbers follow a diminishing frequency distribution of leading digits, i.e., the digit 1 appears as the leading digit more often than any of the digits 2-9, the digit 2 appears more often than 3-9, and so on. This law has been shown to apply to numerous datasets, and is used to identify anomalies in finance, elections, addresses, and populations, among others. With the advent of big data analytics, Benford’s Law has seen a resurgence through its wide application to online datasets. For instance, Benford’s Law is used in bot detection and fake user identification on social media platforms such as Twitter, Facebook, and Google Plus (Alberto Perez-Melian et al., 2017; Golbeck, 2015).
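To make the idea concrete: the expected share of each leading digit d under Benford’s Law is log10(1 + 1/d). Below is a minimal Python sketch of that formula, plus a helper that tallies the leading digits of a list of counts. The function names are my own illustrations for this post and are not taken from the project notebook.

```python
import math
from collections import Counter


def benford_expected():
    """Expected leading-digit probabilities under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(n):
    """First significant digit of an integer (returns 0 for zero)."""
    return int(str(abs(int(n)))[0])


def leading_digit_shares(values):
    """Observed share of each leading digit 1-9, skipping zero values."""
    digits = [leading_digit(v) for v in values if int(v) != 0]
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}
```

Under this formula the digit 1 is expected to lead about 30.1% of the time, while the digit 9 leads only about 4.6% of the time.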
My aim is to contribute to this evolving discourse by investigating whether Benford’s Law applies to a sub-category of internet data: online collaboration platforms. Github and Wikipedia both utilize the principle of “collaborative editing,” though for different purposes: one collaborates on computer code, while the other collaborates on articles for an encyclopedia. By plotting the distribution of author contributions, I test whether Benford’s Law can be used to identify unnatural behaviour on these platforms. I hypothesize that Github is less likely to have bot collaborators than Wikipedia, and that Github’s distribution is therefore more likely to follow Benford’s. In what follows, I briefly describe the underlying theory and intuition behind the law, and cite examples from the literature. I then describe my data sources and my methods for data scraping, cleaning, wrangling, visualization, and analysis. Finally, I offer a brief discussion of my results and a conclusion.
The results of my analysis suggest that Github, being the more “technical” platform, closely mimics the Benford distribution, while Wikipedia also exhibits a similar trend; statistical tests demonstrate a strong goodness-of-fit. Further investigation of both datasets shows that Benford’s Law correctly directed suspicion toward user accounts that are bots or system-influenced.
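This summary does not name the specific tests, so the following is only an illustration of one common choice, Pearson’s chi-square goodness-of-fit test, and not necessarily what the notebook does. The sketch assumes the input is a mapping from leading digit (1-9) to observed count; the function names and the 5% significance level are assumptions made for this example.

```python
import math

# Chi-square critical value for 8 degrees of freedom at alpha = 0.05
# (nine digit categories minus one).
CHI2_CRITICAL_DF8 = 15.507


def benford_expected_counts(total):
    """Expected leading-digit counts under Benford's Law for `total` items."""
    return {d: total * math.log10(1 + 1 / d) for d in range(1, 10)}


def benford_chi_square(counts):
    """Pearson chi-square statistic of observed digit counts vs Benford."""
    total = sum(counts.get(d, 0) for d in range(1, 10))
    if total == 0:
        raise ValueError("need at least one observation")
    expected = benford_expected_counts(total)
    return sum(
        (counts.get(d, 0) - expected[d]) ** 2 / expected[d]
        for d in range(1, 10)
    )


def fits_benford(counts):
    """True if the statistic stays below the 5% critical value."""
    return benford_chi_square(counts) < CHI2_CRITICAL_DF8
```

A statistic below the critical value means the test does not reject the Benford distribution. Very large samples can push the statistic over the threshold even for small deviations, so it helps to read the result alongside the plotted distribution.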
Code: https://github.com/carlaint/OII_Python_DataWrangling/blob/master/Final%20project.ipynb
Full version of the paper: email me on LinkedIn!