A little under 1M identifier embeddings, generated for identifiers extracted from half of PGA (Public Git Archive) in June 2018. A new pipeline was used, which splits and stems the identifiers; the full description can be found in the "Algorithms" section of the sourced.ml repository.
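To give a rough idea of what "splitting and stemming" means for an identifier, here is a minimal sketch. It is not the sourced.ml implementation: the helper names `split_identifier`, `toy_stem` and `process` are hypothetical, and the suffix stripper is a deliberately naive stand-in for a real stemmer. The authoritative description remains the "Algorithms" section of the sourced.ml repository.

```python
import re

# Hypothetical helpers for illustration only; not part of the sourced.ml API.
_NON_LETTERS = re.compile(r"[^a-zA-Z]+")
_CAMEL_PARTS = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+")


def split_identifier(identifier):
    """Split on non-letter characters, then on camelCase boundaries."""
    parts = []
    for chunk in _NON_LETTERS.split(identifier):
        # Empty chunks (e.g. from a leading underscore) yield no matches.
        parts.extend(m.group(0).lower() for m in _CAMEL_PARTS.finditer(chunk))
    return parts


def toy_stem(token):
    """Naive suffix stripping; an assumption standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short tokens stay intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def process(identifier):
    """Split an identifier into sub-tokens and stem each one."""
    return [toy_stem(token) for token in split_identifier(identifier)]


print(process("getHTTPResponse_codes"))  # ['get', 'http', 'response', 'code']
```

The point of such preprocessing is that many distinct identifiers collapse onto a much smaller vocabulary of sub-tokens, which is what gets embedded.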
from sourced.ml.models import Id2Vec
id2vec = Id2Vec().load("3467e9ca-ec11-444a-ba27-9fa55f5ee6c1")
print("Number of tokens:", len(id2vec))
Data collection date
Number of tokens
Size of each embedding