The video embeddings in the paper are learned purely from observing what users co-watch within sessions. In that sense, they can be thought of as the latent factors of more traditional collaborative filtering approaches. When we inspect them, nearby vectors show a surprising amount of semantic similarity.
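That kind of inspection is just a nearest-neighbor lookup over the learned table. Here is a rough numpy sketch (the `embeddings` matrix and `video_ids` list are hypothetical stand-ins for whatever a trained model produced, not anything from the paper's code):

```python
import numpy as np

def nearest_videos(embeddings: np.ndarray, video_ids: list[str],
                   query_idx: int, k: int = 5) -> list[tuple[str, float]]:
    """Return the k videos whose embeddings are closest (cosine) to the query video."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]   # cosine similarity of every video to the query
    order = np.argsort(-sims)           # most similar first
    neighbors = [i for i in order if i != query_idx][:k]
    return [(video_ids[i], float(sims[i])) for i in neighbors]

# Usage (assuming embeddings has shape (num_videos, dim), ids aligned by row):
# print(nearest_videos(embeddings, video_ids, query_idx=0))
```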
Features about the videos such as titles and tags, as well as features derived from audio and video, are introduced in the ranking phase.
word2vec did inspire earlier iterations of the model, but the key insight is that embeddings are learned jointly with all other model parameters. There is no separate source of embeddings. This way, the embeddings are specialized for the specific task.
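To make the joint-learning point concrete, here is a minimal PyTorch sketch (this is not the paper's implementation; the class, layer sizes, and vocabulary size are illustrative assumptions): the video-embedding table is just another trainable parameter inside the model, updated by the same backward pass as the dense layers, rather than loaded from a separately trained word2vec-style model.

```python
import torch
import torch.nn as nn

class CandidateGenerator(nn.Module):
    def __init__(self, num_videos: int, embed_dim: int = 256, hidden: int = 512):
        super().__init__()
        # Learned jointly with everything else; no pretrained vectors are loaded.
        self.video_embedding = nn.Embedding(num_videos, embed_dim)
        self.tower = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )
        # Score every video as the possible "next watch" (softmax classes).
        self.output = nn.Linear(embed_dim, num_videos)

    def forward(self, watch_history: torch.Tensor) -> torch.Tensor:
        # watch_history: (batch, n_watches) of video ids; average their embeddings.
        user_vec = self.video_embedding(watch_history).mean(dim=1)
        return self.output(self.tower(user_vec))

model = CandidateGenerator(num_videos=10_000)
history = torch.randint(0, 10_000, (4, 8))   # 4 users, 8 watched videos each
target = torch.randint(0, 10_000, (4,))      # held-out next watch
loss = nn.functional.cross_entropy(model(history), target)
loss.backward()   # gradients flow into the embedding table as well as the dense layers
```

The contrast with a "separate source" is that nothing here is frozen or imported: the embedding rows move during training to whatever representation best predicts the next watch.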
In general, what could a separate source of embeddings be? Also, how do these embeddings compare against traditional CF-based latent factors? (I ask in terms of recommender metrics, not complexity.)