Time Is Not in the Pixels
Duration blindness in vision-language-action models: an information-theory proof and the structural gap it reveals.
Most of This Video Is Boring
Event-Driven Encoding Makes Video Efficient for Decision-Making