The first task we scoped for a robot deployment was not dexterous — it was temporal. The robot needed to wait for water to boil before pouring it. We could get the arm to grasp the kettle. We could not get any model to understand that the water was not ready yet, or to wait before doing the simplest pick and place. This led us to a question we could not find an answer to anywhere in the literature: do any of the current foundation models for robotics have a representation of duration at all?
The models get better at the parts of robotics that have always been hard: dexterous manipulation, tool use, language conditioning, and ultimately long-horizon sequencing. They do not get better at the parts humans find trivially easy — knowing when the microwave is done before opening the door, staying interruptible during a long pour, remembering when something happened and not just that it did. The question is whether temporal reasoning deserves a first-class representation alongside vision and proprioception — one that lets models reason about duration, not just the sequencing of events.
Duration Blindness
We instructed five VLAs (π0.5, π0-FAST, GR00T N1.6, MemoryVLA, and V-JEPA 2 AC) to “wait N seconds, then pick up the block” and measured time to first meaningful action (TTFMA): the first moment the robot produces an action above a calibrated noise threshold. Each model ran 255 episodes.
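As a rough illustration of the metric, the sketch below computes TTFMA from a logged action sequence. It assumes a fixed control rate and uses the Euclidean norm of each action vector as its magnitude; the function names, the threshold value, and the toy episode are hypothetical and are not taken from the evaluation harness.

```python
import math
from typing import Optional, Sequence


def action_norm(action: Sequence[float]) -> float:
    """Euclidean norm of one action vector (an assumed notion of magnitude)."""
    return math.sqrt(sum(a * a for a in action))


def ttfma(
    actions: Sequence[Sequence[float]],
    control_hz: float,
    noise_threshold: float,
) -> Optional[float]:
    """Seconds until the first action whose norm exceeds the threshold.

    Returns None if no action in the episode ever exceeds the threshold.
    """
    for step, action in enumerate(actions):
        if action_norm(action) > noise_threshold:
            return step / control_hz
    return None


# Toy episode at 10 Hz: 30 near-zero steps (3 s of waiting), then motion.
episode = [[0.001, -0.002, 0.0]] * 30 + [[0.2, 0.1, -0.05]] * 20
print(ttfma(episode, control_hz=10.0, noise_threshold=0.01))  # 3.0
```

A policy that obeys “wait N seconds” should produce a TTFMA that grows with N; one that ignores the instruction returns roughly the same value for every N.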
The red bar, TTFMA under the instruction alone, does not move. Whether the instruction says one second or ten, every model begins acting within the first inference step. The blue bar, the same task with NOOP actions injected externally for N seconds, scales linearly, confirming that the measurement pipeline works. The model ignores the temporal content of the instruction entirely. On LIBERO, the temporal prefix degrades task success from 46.7% to 3.9%. Asking the model to wait does not just go unheeded; it actively breaks the policy.
Teleoperation data systematically filters out pauses, action chunking compresses idle time, and success metrics grade final world state rather than temporal fidelity — LIBERO-fine-tuned checkpoints contain zero NOOP outputs. The action vocabulary does not include “do nothing.”
The natural objection: this is a training data problem. Add synthetic waiting demonstrations — static frames labeled NOOP — and the model will learn. Consider what that data looks like: N seconds of identical frames, each labeled NOOP, followed by one labeled ACT. Every frame is drawn from the same distribution. The observation at t = 1 s is statistically indistinguishable from t = 9 s. The model has no basis for learning which NOOP is the last one.
This is not a sample efficiency problem. More data does not help when the data itself is uninformative. The signal is not faint, or hard to decode, or buried under noise — it is absent from the inputs entirely.
Why the Signal Cannot Be Recovered
The impossibility is formal.
Theorem. Let $X_1, X_2, \ldots, X_T \overset{\text{i.i.d.}}{\sim} P$. Then:
$$P(X_t = x \mid T = t) \;=\; P(X_t = x) \quad \forall\, t$$
$$\implies\; P(T = t \mid X_t = x) \;=\; P(T = t) \quad (\text{posterior} = \text{prior})$$
$$\implies\; I(T;\, X_t) \;=\; 0$$
Corollary. For any measurable $g$, $Z_t = g(X_t) \;\implies\; I(T;\, Z_t) \,\leq\, I(T;\, X_t) = 0$ (data processing inequality).
No learned representation recovers information absent from its inputs.
This holds for any architecture, no matter how large, how deep, or how expensively trained. A static scene produces i.i.d. frames, and no function of the current frame can tell how many frames preceded it without a counter. Scaling does not change this.
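The claim can be checked numerically on a toy discrete example. The sketch below uses a plug-in estimator of mutual information (an illustrative choice, not part of the formal argument) and compares a static scene, where the frame symbol is drawn independently of the time index, with a scene that contains a visible counter. The sample size, horizon, and symbol count are arbitrary assumptions.

```python
import math
import random
from collections import Counter


def plug_in_mi(pairs):
    """Plug-in estimate of I(T; X) in nats from a list of (t, x) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    t_marg = Counter(t for t, _ in pairs)
    x_marg = Counter(x for _, x in pairs)
    mi = 0.0
    for (t, x), c in joint.items():
        # p(t, x) * log( p(t, x) / (p(t) * p(x)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (t_marg[t] * x_marg[x]))
    return mi


rng = random.Random(0)
n_samples, horizon, n_symbols = 200_000, 10, 4

# Static scene: the frame is drawn from the same distribution at every step.
static = [(rng.randrange(horizon), rng.randrange(n_symbols)) for _ in range(n_samples)]

# Scene with a visible counter: the frame is a deterministic function of t.
clocked = [(t, t) for t, _ in static]

mi_static = plug_in_mi(static)
mi_clocked = plug_in_mi(clocked)
print(f"static:  {mi_static:.5f} nats")
print(f"clocked: {mi_clocked:.5f} nats (log {horizon} = {math.log(horizon):.5f})")
```

The static estimate sits at the estimator's small positive bias, effectively zero, while the clocked scene recovers close to the full entropy of the time index. Any representation computed from the static frames inherits the same ceiling.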
We probed the VLM backbone to see whether temporal magnitudes survive into the embedding space. VLAs typically split inference between a vision-language backbone that processes the instruction and an expert tower that generates actions — the question is whether the backbone encodes “5 seconds” differently from “10 seconds” and whether that difference reaches the expert.
The backbone preserves a numerical ordering (Mantel r = 0.77), but only at cosine distances on the order of 10⁻⁵, magnitudes so small that the expert tower treats all temporal prompts as interchangeable. Zhou, Masmanidis, and Buonomano (2022) showed why this matters structurally: recurrent networks develop smooth duration scaling only when the temporal signal is sustained throughout the interval, not when it arrives as a transient stimulus at t = 0. VLA language instructions are transient by construction: the model receives “wait 5 seconds” once, then runs on observations. The neuroscience predicts categorical timing at best.
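To make the probe concrete, here is a minimal sketch of the Mantel statistic: the Pearson correlation between pairwise numerical gaps |N_i - N_j| and pairwise cosine distances of the corresponding prompt embeddings. The embeddings below are synthetic stand-ins (a shared base vector nudged by a tiny amount proportional to N), so the output illustrates the pattern of ordered but minuscule distances and does not reproduce the reported r = 0.77. A full Mantel test would add a permutation test for significance, which is omitted here.

```python
import math
import random


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def mantel_r(durations, embeddings):
    """Return (Mantel r, largest pairwise cosine distance)."""
    num_d, emb_d = [], []
    for i in range(len(durations)):
        for j in range(i + 1, len(durations)):
            num_d.append(abs(durations[i] - durations[j]))
            emb_d.append(cosine_distance(embeddings[i], embeddings[j]))
    return pearson(num_d, emb_d), max(emb_d)


rng = random.Random(0)
dim = 64
base = [rng.gauss(0.0, 1.0) for _ in range(dim)]
direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
durations = list(range(1, 11))
# Synthetic stand-in: each prompt embedding is the shared base nudged by 1e-3 * N.
embeddings = [[b + 1e-3 * n * d for b, d in zip(base, direction)] for n in durations]

r, largest = mantel_r(durations, embeddings)
print(f"Mantel r = {r:.2f}, largest cosine distance = {largest:.1e}")
```

A high correlation and a negligible distance scale can coexist: the ordering is present, but a downstream module has almost nothing to separate one prompt from another.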
Three Kinds of Time
Current VLAs conflate three independent temporal quantities.
Position is the frame index — “step 7 of 80.” Every model has this via positional encoding. It says where you are in a sequence, nothing about what happened or how long it took.
Progress is a monotonically non-decreasing function of the task’s semantic state: concretely, the accumulated change in the model’s internal representation from frame to frame. It advances quickly during state transitions (reach → grasp, transport → place) and slowly during steady execution. A trajectory might spend 60% of its frames in transport yet accrue only 10% of its task-relevant progress there.
Duration is elapsed wall-clock time since episode start. Uniform, always advancing, semantics-free. A robot running at 30 Hz during a grasp and 5 Hz during a wait produces the same positional increment per frame but vastly different durations per step.
Each missing quantity produces a distinct failure. Without progress, the model allocates equal attention to every frame — transport and grasp are interchangeable. Without duration, the model cannot distinguish a one-second wait from a ten-second wait. And when progress stalls during a static scene, even an event-driven temporal signal flatlines. The three quantities require different solutions because they fail for different reasons.
These are genuinely independent. Position 50 could be early in a long task or late in a short one. Progress 0.8 could arrive at step 20 or step 200. Duration 5 seconds could mean a great deal happened or nothing at all. Collapsing them into a single positional encoding discards two of the three signals.
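The distinction can be made concrete on a toy trajectory. The sketch below computes all three quantities from per-frame timestamps and latent vectors. Defining progress as normalized cumulative L2 change in the latent is one possible reading of the description above, and the function name, the 1-D latents, and the frame rates are illustrative assumptions.

```python
import math
from typing import Sequence


def three_clocks(timestamps: Sequence[float], latents: Sequence[Sequence[float]]):
    """Return (position, progress, duration), one value per frame.

    position: integer frame index.
    progress: cumulative latent change, normalized to end at 1.0.
    duration: wall-clock seconds since the first frame.
    """
    position = list(range(len(timestamps)))
    duration = [t - timestamps[0] for t in timestamps]

    cumulative = [0.0]
    for prev, cur in zip(latents, latents[1:]):
        step = math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, cur)))
        cumulative.append(cumulative[-1] + step)
    total = cumulative[-1] or 1.0  # avoid dividing by zero on a fully static episode
    progress = [c / total for c in cumulative]
    return position, progress, duration


# Three grasp steps at 30 Hz, then four steps of a static wait at 5 Hz.
timestamps = [0.0]
for dt in [1 / 30] * 3 + [1 / 5] * 4:
    timestamps.append(timestamps[-1] + dt)
latents = [[0.0], [1.0], [2.0], [3.0], [3.0], [3.0], [3.0], [3.0]]

position, progress, duration = three_clocks(timestamps, latents)
for p, g, d in zip(position, progress, duration):
    print(f"position={p}  progress={g:.2f}  duration={d:.2f}s")
```

During the wait, position keeps incrementing, progress is flat, and duration accounts for most of the episode. No one of the three columns can be reconstructed from another.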
Towards an Arrow of Time
Most real scenes are not perfectly static — water roils before it boils, adhesive changes sheen as it cures, a microwave’s turntable rotates. Humans reason from vision too; we read the timer, we do not intuit 30 seconds directly.
V-JEPA’s 3D spatiotemporal representations encode motion patterns that vary with elapsed time. The question worth formalizing is whether a model can learn an internal arrow of time purely from the latent dynamics of pixel movements — a representation where the direction and magnitude of elapsed time emerge from learned structure rather than an injected signal.
But there is a more immediate observation. The i.i.d. impossibility holds for cameras. It does not hold for all sensors. A robot arm’s IMU gyroscope exhibits bias drift — a random walk whose variance grows linearly with time. Its force-torque sensor has thermal noise that shifts with temperature. These are not flaws to be filtered out. They are features.
A model that tracks the running variance of its own sensor drift has access to a signal that a camera-only model provably does not. This has been validated on synthetic random walks but not yet on real robot sensor data — the direction is theoretical, not demonstrated.
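In the same spirit as the synthetic validation, the sketch below simulates idealized random-walk drift and recovers elapsed time from its variance. It assumes a known drift coefficient sigma, a known zero reference at the start of the wait, and many independent channels to average over; all parameter values are hypothetical, and a single real sensor would give a far noisier estimate.

```python
import math
import random


def simulate_bias_drift(n_channels, elapsed_s, rate_hz, sigma, rng):
    """Final drift of independent random-walk channels that start at zero.

    Each step adds N(0, sigma^2 * dt), so Var[drift] = sigma^2 * elapsed_s.
    """
    dt = 1.0 / rate_hz
    n_steps = int(round(elapsed_s * rate_hz))
    step_std = sigma * math.sqrt(dt)
    drifts = []
    for _ in range(n_channels):
        b = 0.0
        for _ in range(n_steps):
            b += rng.gauss(0.0, step_std)
        drifts.append(b)
    return drifts


def estimate_elapsed(drifts, sigma):
    """Method-of-moments estimate: t_hat = mean(drift^2) / sigma^2."""
    return sum(b * b for b in drifts) / (len(drifts) * sigma ** 2)


rng = random.Random(0)
sigma = 0.01  # hypothetical drift coefficient, in units per sqrt(second)
for true_t in (1.0, 5.0, 10.0):
    drifts = simulate_bias_drift(2000, true_t, rate_hz=20.0, sigma=sigma, rng=rng)
    print(f"true {true_t:>4.1f} s  estimated {estimate_elapsed(drifts, sigma):.2f} s")
```

The estimate scales with the true elapsed time even though no channel carries a timestamp. Whether real gyroscope or force-torque noise is clean enough for this to work remains the open question.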
Where to Build It
Several current architectures already contain scalar conditioning pathways that could carry a temporal signal; they are simply aimed at the wrong axis. DiT-based VLAs condition every transformer layer’s gain and bias on a single scalar via AdaLN-Zero (GR00T) or AdaRMS (π0.5). This scalar is $t_{\text{denoise}}$, the flow-matching denoising step, which resets every inference call. The mechanism is exactly right. The axis is wrong.
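A schematic of this pathway, written in plain Python for readability, is sketched below. It follows the general AdaLN-Zero pattern (a scalar is embedded, then linearly mapped to a per-channel shift, scale, and gate, with the mapping zero-initialized so the block starts as the identity). It is not the GR00T or π0.5 implementation; the function names, dimensions, and the tanh sublayer are assumptions. The only change being illustrated is which scalar is fed in: elapsed wall-clock seconds rather than the denoising step.

```python
import math
import random


def sinusoidal_embedding(value, dim):
    """Sin/cos features of a scalar at geometrically spaced frequencies."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]


def layer_norm(x, eps=1e-6):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def adaln_block(x, cond_scalar, w_mod, sublayer, emb_dim=8):
    """Compute x + gate * sublayer(layer_norm(x) * (1 + scale) + shift).

    shift, scale, and gate are linear functions of the embedded scalar.
    """
    d = len(x)
    mod = linear(w_mod, sinusoidal_embedding(cond_scalar, emb_dim))
    shift, scale, gate = mod[:d], mod[d:2 * d], mod[2 * d:]
    h = [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]
    return [xi + g * yi for xi, g, yi in zip(x, gate, sublayer(h))]


def sublayer(h):
    # Stand-in for the attention or MLP sublayer.
    return [math.tanh(v) for v in h]


x = [0.5, -1.0, 2.0, 0.1]
rng = random.Random(0)
w_zero = [[0.0] * 8 for _ in range(3 * len(x))]  # zero-init: block is the identity
w_rand = [[rng.gauss(0.0, 0.1) for _ in range(8)] for _ in range(3 * len(x))]

print(adaln_block(x, 5.0, w_zero, sublayer) == x)  # True
print(adaln_block(x, 1.0, w_rand, sublayer))  # conditioned on 1 s elapsed
print(adaln_block(x, 9.0, w_rand, sublayer))  # conditioned on 9 s elapsed
```

With zero-initialized modulation weights the block passes its input through unchanged, so a pretrained policy is unaffected at the start of fine-tuning. Once the weights move, the same hidden state is transformed differently at 1 s and at 9 s.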
We ran ten experiments across injection modes, optimization strategies, and training paradigms. The clock signal is learnable: when forced to be the only optimization lever, temporal gates grow and reduce training loss. But the clock alone does not produce duration-specific behavior. The bottleneck is twofold: the language encoder collapses “5 seconds” and “10 seconds” into near-identical embeddings (cosine similarity > 0.999), and independent chunk training provides zero temporal gradient for 97.5% of training data. The clock is a watch without an appointment card: it tells the model what time it is, but the model cannot read what time to act from the prompt.
The question is not whether to give the robot a clock; every deployed robot will have one. The deeper question is whether temporal grounding should be bolted onto the side of a vision-language system or emerge from a physical process already present in the robot’s own sensors. The entropic signal from sensor drift suggests a middle path: not an external clock, but an existing physical process used as an informational pathway around the categorical timing limits of today’s video backbones.
The strongest version of our claim, that architecture rather than training data is the bottleneck, remains a falsifiable prediction. A Diffusion Policy trained on wait-augmented LIBERO data should have a TTFMA ratio below 0.2: it will learn the average wait duration but will not modulate based on the commanded duration, because the observations during the wait are i.i.d. regardless of the label. If the ratio exceeds 0.5, the architectural necessity claim is refuted and the problem is one of data, not structure. We have stated this prediction precisely so that it can be tested.
Duration is the first structural gap we have characterized. It will not be the last.