f64bacd4-67fb-4c64-8382-399a8e7db52a


5.7 million source code identifiers, extracted in October 2016 from all the repositories we cloned (10 million after de-duplication). Standard processing was applied: splitting and stemming, as described in the paper. Documents here are repositories, so the document frequency of an identifier is the number of repositories in which it occurs.
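To make the splitting step concrete, here is a minimal sketch of how an identifier might be broken into lowercase sub-tokens. The helper name `split_identifier` and the exact splitting rules are assumptions for illustration only; the processing used to build this model is the one described in the paper and may differ. Stemming is not shown.

```python
import re

# Hypothetical tokenizer, NOT the sourced.ml implementation.
# Matches, in order of preference:
#   - an acronym followed by a capitalized word ("HTTP" in "HTTPResponse")
#   - an optionally capitalized lowercase word ("get", "Response")
#   - a trailing run of capitals ("XML")
#   - a run of digits
_SUBTOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_identifier(identifier):
    """Split an identifier on camelCase and non-alphanumeric boundaries.

    Returns the sub-tokens in order, lowercased.
    """
    return [part.lower() for part in _SUBTOKEN.findall(identifier)]


if __name__ == "__main__":
    print(split_identifier("getHTTPResponse"))
    print(split_identifier("snake_case_name"))
```

Under these rules, `getHTTPResponse` yields `get`, `http`, `response`, and underscores act purely as separators.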

Example:

from sourced.ml.models import DocumentFrequencies
df = DocumentFrequencies().load("f64bacd4-67fb-4c64-8382-399a8e7db52a")
print("Number of tokens:", len(df))
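Document frequencies are typically turned into inverse document frequency (IDF) weights. The sketch below assumes the plain `log(N / df)` form, with `N` taken from the "Number of repositories" field of this model; the function name is hypothetical, and the weighting actually used by sourced.ml may differ.

```python
import math

# Number of repositories reported in this model's metadata.
NUM_REPOSITORIES = 112273


def inverse_document_frequency(doc_freq, num_docs=NUM_REPOSITORIES):
    """Return the plain IDF weight log(num_docs / doc_freq).

    doc_freq is the number of repositories containing the token.
    This is an illustrative formula, not necessarily the one sourced.ml uses.
    """
    if not 0 < doc_freq <= num_docs:
        raise ValueError("doc_freq must be in the range 1..num_docs")
    return math.log(num_docs / doc_freq)


if __name__ == "__main__":
    # A token present in every repository carries no weight;
    # a token seen in a single repository carries the maximum weight.
    print(inverse_document_frequency(NUM_REPOSITORIES))
    print(inverse_document_frequency(1))
```

Rare tokens get larger weights than common ones, which is why the per-repository counts stored in this model are useful for weighting identifiers.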

References

ID: f64bacd4-67fb-4c64-8382-399a8e7db52a

Uploaded: 2017-06-19 09:59:14.766638

Version: 1.0.0

File: https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fdocfreq%2Ff64bacd4-67fb-4c64-8382-399a8e7db52a.asdf

Size: 24.3 MB

Data collection date: October 2016

Number of (sub)tokens: 5,720,096

Number of repositories: 112,273

License