The video embeddings in the paper are learned purely based on observing what users co-watch in sessions. In this sense, they can be thought of as latent factors in more traditional collaborative filtering approaches. When we inspect them, nearby vectors have a surprising amount of semantic similarity.
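
To make the co-watch idea concrete, here is a minimal sketch (not the system described in the paper, where embeddings are trained jointly with the model; see below) of the analogous word2vec-style approach: treat each session as a "sentence" and each video ID as a "word", and skip-gram training places co-watched videos near each other in embedding space. The session data here is a hypothetical toy example.

    # Minimal sketch: word2vec skip-gram over co-watch sessions.
    from gensim.models import Word2Vec

    # Hypothetical data: each session is the ordered list of video IDs a user watched.
    sessions = [
        ["vid_a", "vid_b", "vid_c"],
        ["vid_b", "vid_c", "vid_d"],
        ["vid_a", "vid_d"],
    ]

    model = Word2Vec(
        sentences=sessions,
        vector_size=64,   # embedding dimension
        window=5,         # co-watch context window within a session
        min_count=1,      # keep rare videos for this toy example
        sg=1,             # skip-gram
    )

    # Nearest neighbors in embedding space tend to be co-watched,
    # and in practice often semantically related, videos.
    print(model.wv.most_similar("vid_b"))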

Features about the videos such as titles and tags, as well as features derived from audio and video, are introduced in the ranking phase.
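
One plausible way to picture this (a sketch with hypothetical dimensions and feature names, not the paper's architecture): at ranking time, the candidate's co-watch embedding is concatenated with vectors derived from title/tag text and from audio/video content, and the combined vector feeds a scoring network.

    import torch
    import torch.nn as nn

    EMBED_DIM, TITLE_DIM, AV_DIM = 64, 32, 128  # hypothetical sizes

    ranker = nn.Sequential(
        nn.Linear(EMBED_DIM + TITLE_DIM + AV_DIM, 256),
        nn.ReLU(),
        nn.Linear(256, 1),  # scalar ranking score per candidate
    )

    video_emb = torch.randn(16, EMBED_DIM)   # co-watch embedding from candidate generation
    title_feat = torch.randn(16, TITLE_DIM)  # e.g., featurized titles and tags
    av_feat = torch.randn(16, AV_DIM)        # e.g., features derived from audio/video

    scores = ranker(torch.cat([video_emb, title_feat, av_feat], dim=-1))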



Are you using a model similar to word2vec to obtain the video embeddings?


word2vec did inspire earlier iterations of the model, but the key insight is that embeddings are learned jointly with all other model parameters. There is no separate source of embeddings. This way, the embeddings are specialized for the specific task.
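
A minimal PyTorch sketch of what "jointly learned" means here (a toy illustration, not the paper's architecture): the embedding table is just another parameter tensor inside the model, updated by the same backward pass as every other layer, rather than imported from a separate word2vec run.

    import torch
    import torch.nn as nn

    NUM_VIDEOS, EMBED_DIM = 100_000, 64  # hypothetical sizes

    class CandidateModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.video_emb = nn.Embedding(NUM_VIDEOS, EMBED_DIM)  # learned jointly
            self.mlp = nn.Sequential(
                nn.Linear(EMBED_DIM, 256), nn.ReLU(),
                nn.Linear(256, EMBED_DIM),
            )
            self.output = nn.Linear(EMBED_DIM, NUM_VIDEOS)  # softmax over all videos

        def forward(self, watch_history):  # (batch, history_len) of video IDs
            user_vec = self.video_emb(watch_history).mean(dim=1)  # average watch embeddings
            return self.output(self.mlp(user_vec))  # logits for the next watch

    model = CandidateModel()
    history = torch.randint(0, NUM_VIDEOS, (32, 10))  # toy batch of watch histories
    labels = torch.randint(0, NUM_VIDEOS, (32,))      # toy next-watch labels
    loss = nn.functional.cross_entropy(model(history), labels)
    loss.backward()  # gradients flow into video_emb along with every other layer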


In general, what could serve as a separate source of embeddings? Also, how do these embeddings compare against traditional CF-based latent factors? (I ask in terms of recommender metrics, not complexity.)



