this post was submitted on 18 Feb 2024

Futurology


With V-JEPA, we mask out a large portion of a video so the model is only shown a little bit of the context. We then ask the predictor to fill in the blanks of what’s missing—not in terms of the actual pixels, but rather as a more abstract description in this representation space.

Hmm, it looks like it aims to do for videos what chatbot LLMs do for text, or what content-aware fill does for images. A useful tool, to be sure, but the link to AGI seems a bit tenuous.
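
To make the quoted description a bit more concrete, here's a rough sketch of what "filling in the blanks in representation space" could look like. This is not Meta's actual V-JEPA code; the module names, sizes, and the simple transformer blocks below are placeholder assumptions, only meant to illustrate predicting masked patch representations instead of pixels.

```python
# Toy sketch of masked prediction in representation space (not pixel space).
# NOT Meta's V-JEPA implementation; all names and sizes here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 256
NUM_TOKENS = 512          # flattened video patches (space x time)
MASK_RATIO = 0.75         # hide most of the video, keep only a little context

def make_encoder():
    layer = nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=4)

context_encoder = make_encoder()   # sees only the visible tokens
target_encoder = make_encoder()    # sees the full video; not updated by gradients
predictor = make_encoder()         # fills in the masked tokens' representations

patch_embed = nn.Linear(3 * 16 * 16, EMBED_DIM)          # toy patch embedding
mask_token = nn.Parameter(torch.zeros(1, 1, EMBED_DIM))  # stand-in for hidden patches

def training_step(video_patches):
    # video_patches: (batch, NUM_TOKENS, 3*16*16) flattened patches
    tokens = patch_embed(video_patches)

    # Choose which patches to hide.
    num_masked = int(MASK_RATIO * NUM_TOKENS)
    perm = torch.randperm(NUM_TOKENS)
    masked_idx, visible_idx = perm[:num_masked], perm[num_masked:]

    # Targets come from the full video, as representations (no pixel reconstruction).
    with torch.no_grad():
        targets = target_encoder(tokens)[:, masked_idx]

    # The context encoder only ever sees the small visible portion.
    context = context_encoder(tokens[:, visible_idx])

    # The predictor fills in the blanks for the hidden positions.
    queries = mask_token.expand(tokens.size(0), num_masked, EMBED_DIM)
    predicted = predictor(torch.cat([context, queries], dim=1))[:, -num_masked:]

    # The loss compares representations, not pixels.
    return F.l1_loss(predicted, targets)

loss = training_step(torch.randn(2, NUM_TOKENS, 3 * 16 * 16))
loss.backward()
```

The design point the quote is getting at: the loss compares predicted embeddings against a target encoder's embeddings, so the model never has to reproduce every pixel of the missing region, only an abstract description of it.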