The dataset was extracted from Public Git Archive (PGA) and consists of:
1. 49 million distinct identifiers - 1 GB
2. Identifiers per language - 1 GB; same processing as (1), but extracted only from files in specific programming languages: Python, JavaScript, C, C++, PHP, Ruby, C#, Java, Shell, Go, Objective-C.
num_files - number of files where the identifier was found
num_occ - number of times the identifier was found overall
num_repos - number of repositories in which the identifier was found
token - the value of the identifier
token_split - the parts of the identifier after splitting it with the sourced-ml heuristics
All the stats correspond to the HEAD revision of each repository in PGA.
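The fields described above can be read and ranked with a few lines of Python. This is only a minimal sketch: it assumes the data is stored as CSV with a header row naming the five fields, and that token_split holds space-separated parts; neither is confirmed here. The sample rows are invented for illustration and are not taken from the dataset, and the helper names `load_identifiers` and `most_widespread` are hypothetical.

```python
import csv
import io

# Assumed layout: CSV with a header row naming the five fields.
# The rows below are invented sample values, not real dataset entries.
SAMPLE = """num_files,num_occ,num_repos,token,token_split
3,12,2,getUserName,get user name
5,40,4,i,i
1,1,1,parse_http_header,parse http header
"""


def load_identifiers(stream):
    """Read identifier records, converting the three counters to int."""
    records = []
    for row in csv.DictReader(stream):
        records.append({
            "token": row["token"],
            # Assumption: the split parts are separated by single spaces.
            "token_split": row["token_split"].split(),
            "num_files": int(row["num_files"]),
            "num_occ": int(row["num_occ"]),
            "num_repos": int(row["num_repos"]),
        })
    return records


def most_widespread(records, n=10):
    """Return the n identifiers found in the most repositories."""
    return sorted(records, key=lambda r: r["num_repos"], reverse=True)[:n]


records = load_identifiers(io.StringIO(SAMPLE))
top = most_widespread(records, n=2)
for r in top:
    print(r["token"], r["num_repos"], r["token_split"])
```

To run this against a real file, replace `io.StringIO(SAMPLE)` with an opened file handle, adjusting the delimiter and field names if the actual layout differs.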
The dataset also includes a Jupyter notebook that reads the per-language identifiers (2) and plots the statistics.