Generated from 140,000 most starred projects on GitHub in October 2016. Legacy pipeline, no splitting and stemming, later converted with quality loss.
from sourced.ml.models import Id2Vecid2vec = Id2Vec().load("92609e70-f79c-46b5-8419-55726e873cfc")print("Number of tokens:", len(id2vec))
Data collection date
Number of (sub)tokens
Number of repositories