5.7 million source code identifiers, extracted in October 2016 from all the repositories we cloned (10 million after de-duplication). Standard processing (splitting, stemming) was applied, as described in the paper. Document frequency here is counted per repository: each repository is treated as one document.
from sourced.ml.models import DocumentFrequencies
df = DocumentFrequencies().load("f64bacd4-67fb-4c64-8382-399a8e7db52a")
print("Number of tokens:", len(df))
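To make the per-repository counting concrete, here is a minimal sketch on toy data. It assumes that the document frequency of a subtoken is the number of repositories in which it appears at least once. The helpers `split_identifier` and `document_frequencies` and the toy repositories are hypothetical and are not part of `sourced.ml`; the splitting rule is a simple approximation, and stemming is left out to keep the example dependency-free.

```python
import re
from collections import Counter


def split_identifier(identifier):
    """Split on non-alphanumerics and camelCase boundaries, then lowercase.

    Hypothetical helper: an approximation of identifier splitting,
    not the exact procedure used to build the model.
    """
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", identifier):
        # Acronym run | optionally capitalized word | digit run
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", chunk))
    return [p.lower() for p in parts]


def document_frequencies(repos):
    """Count, for each subtoken, how many repositories contain it.

    `repos` maps a repository name to the list of identifiers found in it.
    Assumption: each repository counts as one document.
    """
    freqs = Counter()
    for identifiers in repos.values():
        subtokens = set()  # a set, so each repository counts a subtoken once
        for identifier in identifiers:
            subtokens.update(split_identifier(identifier))
        freqs.update(subtokens)
    return freqs


# Toy data, invented for illustration only.
toy_repos = {
    "repo_a": ["readFile", "read_file", "parseJSON"],
    "repo_b": ["writeFile", "file_path"],
    "repo_c": ["parse_args"],
}
toy_df = document_frequencies(toy_repos)
print("Number of tokens:", len(toy_df))
print("Repositories containing 'file':", toy_df["file"])
```

In this sketch `file` occurs twice in `repo_a` but is counted only once for that repository, which is what distinguishes a document frequency from a raw occurrence count.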
Data collection date
Number of (sub)tokens
Number of repositories