Paper (accepted to ML4P'18).

The dataset was extracted from Public Git Archive (PGA) and consists of:

  1. identifiers per language - 1 GB, same processing as (1) but extracted from specific programming language files: Python, JavaScript, C, C++, PHP, Ruby, C#, Java, Shell, Go, Objective-C.


CSV, columns:

  • num_files - number of files where the identifier was found

  • num_occ - number of times the identifier was found overall

  • num_repos - number of repositories in which the identifier was found

  • token - the value of the identifier

  • token_split - the identifier split into parts using the sourced-ml heuristics

All the stats correspond to the HEAD revision of each repository in PGA.
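As a minimal sketch of how the columns above might be read, the following Python snippet parses rows with the standard `csv` module. The sample rows are invented for illustration, and the code assumes that the file has a header row and that the parts in token_split are separated by spaces; neither assumption is confirmed by this description, so check them against the actual files.

```python
import csv
import io
from typing import Iterator, List, NamedTuple, TextIO


class Identifier(NamedTuple):
    num_files: int          # number of files containing the identifier
    num_occ: int            # total number of occurrences
    num_repos: int          # number of repositories containing the identifier
    token: str              # the identifier itself
    token_split: List[str]  # the identifier split into parts


def read_identifiers(stream: TextIO) -> Iterator[Identifier]:
    """Yield one Identifier per CSV row, converting counts to int."""
    for row in csv.DictReader(stream):
        yield Identifier(
            num_files=int(row["num_files"]),
            num_occ=int(row["num_occ"]),
            num_repos=int(row["num_repos"]),
            token=row["token"],
            # Assumption: parts are space-separated inside the column.
            token_split=row["token_split"].split(),
        )


# Hypothetical sample data, not taken from the real dataset.
SAMPLE = """num_files,num_occ,num_repos,token,token_split
3,7,2,getValue,get value
1,1,1,i,i
"""

rows = list(read_identifiers(io.StringIO(SAMPLE)))
most_frequent = max(rows, key=lambda r: r.num_occ)
print(most_frequent.token, most_frequent.token_split)
```

To read a real file, pass an open file object instead of the `io.StringIO` wrapper, for example `open(path, newline="", encoding="utf-8")`, where `path` points to one of the downloaded CSV files.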

Code examples

  • Jupyter notebook which reads the per-language identifiers (2) and plots the statistics.
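The snippet below is not the notebook itself. It is a rough sketch of one statistic that could be derived from the columns: how often identifiers with a given number of parts occur, weighted by num_occ. The rows are invented, and the space-separated form of token_split is an assumption.

```python
from collections import Counter

# Hypothetical rows as (token, token_split, num_occ); not real dataset values.
rows = [
    ("getValue", "get value", 7),
    ("i", "i", 12),
    ("parse_http_header", "parse http header", 2),
]


def parts_histogram(rows):
    """Sum occurrences per number of parts in token_split."""
    hist = Counter()
    for _token, token_split, num_occ in rows:
        # Assumption: parts are space-separated inside token_split.
        hist[len(token_split.split())] += num_occ
    return hist


hist = parts_histogram(rows)
# Text-only rendering so the sketch needs no plotting library.
for n_parts in sorted(hist):
    print(f"{n_parts} part(s): {'#' * hist[n_parts]} {hist[n_parts]}")
```

The same counts could be passed to a plotting library to produce a bar chart.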


Open Data Commons Open Database License (ODbL)