this post was submitted on 18 Feb 2024
Futurology

> With V-JEPA, we mask out a large portion of a video so the model is only shown a little bit of the context. We then ask the predictor to fill in the blanks of what’s missing—not in terms of the actual pixels, but rather as a more abstract description in this representation space.

Hmm, it looks like it aims to do for video what chatbot LLMs do for text, or what content-aware fill does for images. A useful tool, to be sure, but the link to AGI still seems tenuous to me.
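For what it's worth, the objective in the quote can be sketched in a few lines. This is only a toy illustration of "predict missing patches in representation space, not pixel space", not Meta's actual V-JEPA code; every name, shape, and the linear encoder/predictor here are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a "video" as a sequence of patch vectors, plus a linear
# encoder and predictor. All shapes and names are illustrative assumptions.
n_patches, patch_dim, embed_dim = 16, 32, 8
video_patches = rng.normal(size=(n_patches, patch_dim))

W_enc = rng.normal(size=(patch_dim, embed_dim)) / np.sqrt(patch_dim)
W_pred = rng.normal(size=(embed_dim, embed_dim)) / np.sqrt(embed_dim)

def encode(x):
    # Map raw patches to an abstract representation (not pixels).
    return x @ W_enc

# Mask out a large portion of the patches; the model only sees a little context.
mask = np.zeros(n_patches, dtype=bool)
mask[: int(0.75 * n_patches)] = True          # True = hidden from the model

context = encode(video_patches[~mask])         # visible patches only
targets = encode(video_patches[mask])          # what the predictor must fill in

# Predict the missing representations from a pooled summary of the context.
summary = context.mean(axis=0)
predictions = np.tile(summary @ W_pred, (mask.sum(), 1))

# The key point: the loss lives in representation space, not pixel space.
loss = np.mean((predictions - targets) ** 2)
print(f"latent-space loss: {loss:.3f}")
```

The contrast with content-aware fill is that last line: nothing here ever reconstructs pixels, it only has to match the encoder's abstract description of the hidden patches.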