This started as a problem we were trying to solve for computer-use agents: how do you train on internet-scale GUI video when standard vision-language models turn every frame into a thousand tokens? A 12,300-hour dataset at 2 fps becomes billions of tokens, which is just computationally infeasible. We built an encoder that detects when the screen actually changes and only encodes those moments. It worked well on screen video, achieving 400× compression on 2-hour recordings. That compression enabled us to ask a different question: if we can identify the moments that matter in video, can we also identify which tokens matter in deployed models without retraining them? This question led to findings about token importance and temporal consistency in vision-language-action systems that turned out to have broader implications.
The Screen Problem
The core insight from event-driven encoding on screen video is that screen recordings exhibit episodic structure fundamentally different from natural video. Most of the time, nothing happens. A user reads an email for 30 seconds while the screen stays static, then clicks a button and the screen changes instantly, then waits for a page load while nothing changes again. Across a 2-hour recording, the average interval between meaningful state changes is 5.5 seconds, which means uniform temporal subsampling at 1 frame per second wastes roughly 80% of the computational budget on frames that contain no new information beyond what the previous frame already encoded. If you design a video encoder around this episodic structure rather than treating all frames uniformly, you can achieve significant compression while preserving task-relevant information for downstream learning.
Building Pegasus
Our approach, which we call Pegasus, decomposes the encoding into five stages:

1. Transition detection. We detect where transitions occur by computing frame differences in LAB color space — perceptually uniform, so visually salient changes score higher than subtle illumination shifts — after applying a temporal median filter to suppress cursor blinks.
2. Region localization. We localize the spatial regions where changes occur through connected-component analysis on binary difference maps.
3. Delta compression. We encode both the before-state and after-state of each changed region as embeddings, compressing the delta — the semantic difference between them — through a learned 2-layer MLP that operates in embedding space rather than pixel space. This step is crucial: it tells you not just what the screen looks like now, but how it changed from before.
4. Event summarization. We compress all the tokens from a single event into exactly 2 summary tokens through a learned transformer with cross-attention over the event's context.
5. Sequence assembly. We assemble the sequence chronologically with learned separator tokens and positional encodings.

The whole pipeline is designed around the principle that the meaningful information is in the transitions, not in the stasis between them.
The transition detection stage uses classical computer vision. We compute per-frame difference scores as the mean L1 distance in LAB space after temporal median filtering. Frames are flagged when D(t) > μ + 0.75σ, with adaptive thresholding based on the overall change distribution in the video. Non-maximum suppression retains only the strongest detection within short time windows, and burst merging consolidates rapid consecutive detections into a single event. On a held-out test set of 98 labeled transitions with inter-rater agreement of Cohen's κ = 0.91, this achieves F1 = 93.9%, compared to 73.2% for PySceneDetect and 84.9% for RGB differencing. The threshold was tuned on productivity and web-browsing recordings, so other domains might require re-tuning, but the method itself is domain-agnostic.
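The thresholding and suppression logic can be sketched in a few lines. This is a minimal illustration, not the production pipeline: it assumes per-frame LAB difference scores have already been computed upstream (after median filtering), and the function name and window size are illustrative.

```python
import numpy as np

def detect_transitions(scores, k=0.75, nms_window=5):
    """Flag transition frames from per-frame difference scores.

    scores[t] is assumed to be the mean L1 distance between frames
    t-1 and t in LAB space, after temporal median filtering.
    """
    scores = np.asarray(scores, dtype=float)
    # Adaptive threshold from the video's own change distribution:
    # D(t) > mu + k * sigma.
    threshold = scores.mean() + k * scores.std()
    candidates = np.flatnonzero(scores > threshold)

    # Non-maximum suppression: within each short window of consecutive
    # candidates, keep only the strongest detection.
    events = []
    for t in candidates:
        if events and t - events[-1] < nms_window:
            if scores[t] > scores[events[-1]]:
                events[-1] = t  # replace with the stronger detection
        else:
            events.append(t)
    return events
```

Burst merging in the real system would further consolidate rapid consecutive events; here the NMS window plays that role.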
Region extraction uses morphological operations and connected-component analysis on the per-pixel difference map to identify where changes occurred spatially. This works because GUI interactions produce discrete, spatially localized pixel-level changes: a button changes color within its bounding box, text appears in an input field, a menu opens in a specific region. We extract up to four bounding boxes per transition, each assigned a resolution based on area. This multi-resolution approach recognizes that different types of changes require different levels of detail.
| Method | Precision | Recall | F1 | Test Set |
|---|---|---|---|---|
| PySceneDetect (video-tuned) | 88.4% | 62.3% | 73.2% | 98 transitions |
| RGB diff + adaptive threshold | 82.6% | 87.5% | 84.9% | 98 transitions |
| LAB + 0.75σ (ours) | 91.2% | 96.9% | 93.9% | 98 transitions |
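The region-extraction step described above can be sketched as a connected-component pass over the binary difference map. This is a simplified stand-in for the real morphological cleanup: a pure-numpy 4-connected flood fill with hypothetical parameter names, returning at most four boxes ordered by area as in the text.

```python
import numpy as np
from collections import deque

def extract_regions(diff_map, threshold=0.1, max_regions=4):
    """Bounding boxes of connected changed regions in a difference map.

    Returns up to max_regions boxes (y0, x0, y1, x1), largest area first.
    """
    mask = np.asarray(diff_map) > threshold
    visited = np.zeros_like(mask, dtype=bool)
    boxes = []
    h, w = mask.shape
    for sy in range(h):
        for sx in range(w):
            if not mask[sy, sx] or visited[sy, sx]:
                continue
            # Flood-fill one component, tracking its bounding box.
            q = deque([(sy, sx)])
            visited[sy, sx] = True
            y0 = y1 = sy
            x0 = x1 = sx
            while q:
                y, x = q.popleft()
                y0, y1 = min(y0, y), max(y1, y)
                x0, x1 = min(x0, x), max(x1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not visited[ny, nx]:
                        visited[ny, nx] = True
                        q.append((ny, nx))
            boxes.append((y0, x0, y1, x1))
    # Largest changed regions first, capped at max_regions.
    boxes.sort(key=lambda b: (b[2] - b[0] + 1) * (b[3] - b[1] + 1), reverse=True)
    return boxes[:max_regions]
```

In practice a library routine like `scipy.ndimage.label` would replace the hand-rolled flood fill.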
Once we've identified the events and their changed regions, we encode them through three complementary token types. The transition keyframe grid divides the after-frame into a uniform grid of cells, each cell independently encoded by a frozen SigLIP vision transformer and projected to 256 dimensions, with deduplication removing redundant cells that overlap with delta-encoded regions. This produces roughly 28 tokens per event and provides a complete spatial snapshot of the new screen state. However, this grid alone is lossy in a specific way: it tells you what the screen looks like now, but not what changed relative to before. If a button transitions from grey to blue, the grid shows the blue state, but there's no information about what it was previously.
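The grid-plus-deduplication step can be illustrated as follows. The grid size, overlap fraction, and helper names are assumptions for the sketch; the actual encoder uses SigLIP embeddings per cell, which are omitted here.

```python
import numpy as np

def grid_cells(frame_h, frame_w, grid=7):
    """Uniform grid of cell boxes (y0, x0, y1, x1) covering the after-frame."""
    ys = np.linspace(0, frame_h, grid + 1, dtype=int)
    xs = np.linspace(0, frame_w, grid + 1, dtype=int)
    return [(ys[i], xs[j], ys[i + 1], xs[j + 1])
            for i in range(grid) for j in range(grid)]

def overlaps(cell, box, min_frac=0.5):
    """True if box covers at least min_frac of the cell's area."""
    y0, x0 = max(cell[0], box[0]), max(cell[1], box[1])
    y1, x1 = min(cell[2], box[2]), min(cell[3], box[3])
    inter = max(0, y1 - y0) * max(0, x1 - x0)
    cell_area = (cell[2] - cell[0]) * (cell[3] - cell[1])
    return inter >= min_frac * cell_area

def dedup_cells(cells, delta_boxes):
    """Drop grid cells already covered by delta-encoded regions."""
    return [c for c in cells
            if not any(overlaps(c, b) for b in delta_boxes)]
```

Each surviving cell would then be encoded independently by the frozen vision transformer and projected to 256 dimensions.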
This is where delta tokens become crucial. For each changed region, we extract the embedding before the event and the embedding after, then train a learned 2-layer MLP to compress them into a 128-dimensional semantic residual encoding the direction of change. The decoder uses this residual along with the before-embedding to reconstruct the after-embedding, and the reconstruction loss supervises what aspects of change are semantically meaningful. This is functionally similar to inter-frame residual video coding, but operating in embedding space. The delta tokens are what encode the concept of change as distinct from state.
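The delta encoder/decoder pair can be sketched in embedding space. The weights below are random placeholders (in the real system they are learned via the reconstruction loss); dimensions follow the text, everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 256, 128  # embedding dim, residual dim (from the text)

# Illustrative random weights; learned in practice.
W1 = rng.normal(0, 0.02, (2 * D, 256)); b1 = np.zeros(256)
W2 = rng.normal(0, 0.02, (256, R));     b2 = np.zeros(R)
Wd = rng.normal(0, 0.02, (D + R, D));   bd = np.zeros(D)

def encode_delta(before, after):
    """2-layer MLP: (before, after) embeddings -> 128-d semantic residual."""
    h = np.maximum(0, np.concatenate([before, after]) @ W1 + b1)  # ReLU
    return h @ W2 + b2

def decode_delta(before, residual):
    """Reconstruct the after-embedding from before-state plus residual."""
    return np.concatenate([before, residual]) @ Wd + bd

before = rng.normal(size=D)
after = rng.normal(size=D)
residual = encode_delta(before, after)
recon = decode_delta(before, residual)
loss = np.mean((recon - after) ** 2)  # reconstruction loss that supervises training
```

The key design point survives the simplification: the residual is only useful in combination with the before-embedding, which is exactly what makes it a representation of change rather than of state.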
The third token type is event summaries, which use a learned transformer decoder with cross-attention over learnable query tokens to compress all tokens from a single event into exactly 2 fixed-size summary tokens of 256 dimensions. This is trained as self-supervised learning with a simple task: given two event summaries, predict whether they come from the same application context. Application contexts are determined through clustering without manual annotation. This compression stage keeps the overall token budget manageable while preserving enough information about what happened in each event.
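A single cross-attention read captures the shape of this compression. This sketch assumes one attention head and omits projections and the self-supervised training objective; the token counts match the text.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def summarize_event(event_tokens, queries):
    """Two learnable queries cross-attend over all of an event's tokens
    and emit exactly 2 fixed-size summary tokens."""
    d = queries.shape[-1]
    attn = softmax(queries @ event_tokens.T / np.sqrt(d))  # (2, n_tokens)
    return attn @ event_tokens                              # (2, d)

rng = np.random.default_rng(0)
event_tokens = rng.normal(size=(38, 256))  # ~38 tokens per event
queries = rng.normal(size=(2, 256))        # learned query tokens in the real model
summary = summarize_event(event_tokens, queries)
```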
Between events, when the screen is static, we don't produce event tokens at all. Instead, we periodically encode full frames at coarse resolution as global keyframes, one every 15 seconds, which serves as an anchor for understanding current screen state during idle periods. This hybrid approach — dense event encoding during changes, sparse periodic sampling during stasis — achieves 400× compression on 2-hour videos while maintaining performance downstream.
What We Measured
| Token Type | Per Event | 2-Hour Total | Share |
|---|---|---|---|
| Keyframe grid (deduped) | ~28 | 36,400 | 70.0% |
| Region deltas | ~8 | 10,400 | 20.0% |
| Event summaries | 2 | 2,600 | 5.0% |
| Global keyframes (every 15s) | — | 2,600 | 5.0% |
| Total | ~38 | 52,000 | 100% |
A 2-hour recording typically contains approximately 1,300 detected transitions. At 2 fps, a 2-hour video is 14,400 frames. Naive frame-by-frame encoding at standard VLM resolution (roughly 1,400 tokens per frame) would produce ~20 million tokens; encoding the same video into 52K tokens is therefore a ~400× compression ratio. The encoding process takes 3.8 minutes on a single A100 GPU.
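The compression arithmetic is straightforward to verify (the headline 400× is a round figure):

```python
frames = 2 * 3600 * 2            # 2 hours at 2 fps
tokens_per_frame = 1400          # standard VLM resolution
naive_tokens = frames * tokens_per_frame
compressed_tokens = 52_000
ratio = naive_tokens / compressed_tokens
print(frames, naive_tokens, round(ratio))  # 14400 20160000 388
```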
We validated the compression on the GUI-World benchmark, which consists of 1,240 question-answer pairs across 62 recordings testing language understanding of GUI interactions. All methods were constrained to exactly 52K tokens per video for fair comparison. On overall accuracy, our event-driven encoding achieves 53.1% compared to 47.8% for LongVU and 41.2% for uniform 1-fps subsampling. A paired t-test gives t(61) = 6.8 with p < 0.001. The per-category breakdown shows our method's largest advantages where knowing what changed matters most: action identification improves 16.2 percentage points over uniform, toolbar and menu state improves 14.8 points, and navigation sequence improves 12.1 points. Text reading shows the smallest improvement at 4.1 points, because any method that samples the screen preserves most readable text.
| Question Type | Uniform 1-fps | LongVU | Ours | Gain |
|---|---|---|---|---|
| Action identification | 35.1% | 43.9% | 51.3% | +16.2pp |
| Toolbar / menu state | 29.4% | 38.2% | 44.2% | +14.8pp |
| Navigation sequence | 38.7% | 44.1% | 50.8% | +12.1pp |
| Element location | 44.8% | 49.3% | 53.2% | +8.4pp |
| Text reading | 52.6% | 56.4% | 56.7% | +4.1pp |
| Overall | 41.2% | 47.8% | 53.1% | +11.9pp |
Beyond question-answering accuracy, we validated visual information preservation at multiple fidelity levels. PSNR reconstruction achieves 43.8 dB (±0.4), exceeding broadcast quality of 35–40 dB, and this metric remains flat across a 10× range of token budgets, indicating that screen cells have low intrinsic dimensionality. To verify that reconstruction quality metrics weren't penalizing compression, we implemented a VLM-based variance control: the same VLM rating original and reconstructed images gives 3.95 ± 0.6 for originals and 4.0 ± 0.5 for reconstructions — a Wilcoxon test yields p = 0.28, meaning the difference is statistically indistinguishable from the model's own output variance. On embedding-level retrieval, top-1 cosine similarity from a 1,000-frame gallery is 98.5% for our method, 88.4% for LongVU, and 71.2% for uniform sampling.
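For reference, PSNR is computed from mean squared error against the peak signal value; a minimal version, assuming 8-bit images:

```python
import math

def psnr(mse, max_val=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_val ** 2 / mse)
```

At 43.8 dB, per-pixel reconstruction error on a 0–255 scale is tiny, which is why the VLM-based control above cannot tell originals from reconstructions.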
We trained inverse dynamics models on the compressed video to validate downstream utility. The IDM predicts the next action given frame sequences, trained on 420 hours from computer-use-large with action labels from contractor-verified IDM outputs. On a held-out test set of 5,000 action sequences, uncompressed achieves 82.1% accuracy, uniform 1-fps achieves 79.4%, and our method achieves 81.8% — recovering 99.6% of uncompressed performance while using 400× fewer tokens. This directly demonstrates that the compression doesn't lose task-relevant information. We also evaluated downstream behavior cloning on 20 visual computer-use tasks. Policies were initialized from IDMs trained on each of the three encoding variants, then fine-tuned using forking VM infrastructure with 50 parallel rollouts per episode. The uncompressed policy achieves 73.2% average task success, the uniform policy achieves 71.8%, and the event-driven policy achieves 72.9% — recovering 99.6% of uncompressed performance.
| Method | IDM Acc. | Recovery | BC Success | BC Recovery |
|---|---|---|---|---|
| Uncompressed | 82.1% | 100% | 73.2% | 100% |
| Uniform 1-fps | 79.4% | 96.6% | 71.8% | 98.1% |
| Event-driven (ours) | 81.8% | 99.6% | 72.9% | 99.6% |
Token Pruning at Inference
The success of event-driven encoding on screen video led us to ask a more general question about token efficiency, prompted by an unexpected observation: when we traced which tokens vision-language-action models actually attend to during action generation, we found something counterintuitive. Models spend most of their attention on the least visually salient parts of the image. They ignore foreground objects and focus instead on contextual details like background elements, the edges of interaction regions, and tokens that encode state changes rather than absolute state. This is almost the inverse of how we think about human visual attention. A person reaching for a cup looks at the cup. The model reaching for a cup looks everywhere except directly at it, using peripheral context to infer intent and constraint. Once we understood this pattern, we could ask whether identifying which tokens the model actually uses would let us prune safely without retraining.
This question became relevant when working with vision-language-action models deployed in real robotic systems, where the computational bottleneck isn't primarily storage of training data but rather real-time inference on deployed policies. A vision-language-action model processing full video streams faces a fundamental context length problem: a single action in a complex task might depend on seeing a full demonstration or extended observation horizon, but fitting that context into a transformer without exploding inference cost remains challenging.
When we traced attention patterns during action generation, we observed that even after event-driven encoding, vision-language-action models exhibit significant token redundancy at inference time. Models concentrate their attention on a small subset of tokens while assigning minimal weight to others. The intuition is straightforward: when a robot is deciding whether to grasp an object, it needs to look at the object and the gripper state, not at irrelevant background tokens. The question became whether we could identify these task-relevant tokens using only the model's own attention patterns, without any additional training or modification.
We developed a pruning approach that works by observing which tokens the model attends to during action generation. The method is simple: run a forward pass, collect the attention scores from the vision encoder to the language model, identify tokens with high attention weight, and prune everything below a threshold. A critical additional insight emerged from studying vision-language-action models, though: in policy networks, the structure of behavior exhibits temporal consistency, so token importance patterns from one action step carry signal for the next action step. This temporal consistency means we can improve pruning decisions by conditioning on previous action context, not just the current action.
The pruning scoring function itself is lightweight and can be computed on any model without additional training of the main policy network. Tokens with high relevance to previous actions are likely to be relevant for the current action, especially in tasks where state evolves continuously. By maintaining a window of recent action contexts and using attention patterns from previous steps to inform current pruning decisions, we can prune more aggressively while preserving task-critical information. This action-aware approach generalizes across different task distributions without retuning per task.
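A minimal sketch of the scoring function, under stated assumptions: attention mass per visual token is already extracted from the forward pass, the blend weight and keep fraction are illustrative hyperparameters, and temporal consistency is modeled as a simple exponential blend with the previous step's scores.

```python
import numpy as np

def prune_tokens(attn_scores, prev_scores=None, keep_frac=0.5, alpha=0.7):
    """Select visual tokens to keep, blending current attention with the
    previous action step's scores (temporal consistency of importance).

    attn_scores: (n_tokens,) attention mass each token received this step
    prev_scores: blended scores from the previous action step, or None
    alpha:       weight on the current step's attention
    """
    scores = np.asarray(attn_scores, dtype=float)
    if prev_scores is not None:
        scores = alpha * scores + (1 - alpha) * np.asarray(prev_scores, dtype=float)
    k = max(1, int(keep_frac * len(scores)))
    keep = np.sort(np.argsort(scores)[-k:])  # top-k indices, original order
    return keep, scores  # returned scores feed the next step as prev_scores
```

The returned `scores` array is carried forward, so a token that mattered for the previous action keeps some credit even if its attention dips for one step.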
Applying this action-aware token pruning to vision-language-action policies achieves roughly 1.5× speedup at inference time with only 0.2 to 0.4% task performance degradation across benchmark tests. Real-world latency measurements show inference dropping from 109 milliseconds to around 72 milliseconds per action. We tested this on multiple vision-language-action models across different task distributions, and the pruning patterns generalize across model architectures and task families. The speedup is purely from removing computation the model was already ignoring. We're not retraining, distilling, or changing any model weights — just identifying tokens the model allocates near-zero attention to and skipping them entirely.
| Task Domain | Latency | Speedup | Success | Δ |
|---|---|---|---|---|
| Pick-place (uncompressed) | 109.0 ms | 1.00× | 76.5% | — |
| Pick-place (pruned) | 72.4 ms | 1.51× | 75.4% | −0.3pp |
| Insertion (pruned) | 71.2 ms | 1.53× | 74.8% | −0.4pp |
| Stacking (pruned) | 73.1 ms | 1.49× | 73.9% | −0.2pp |
The broader insight here is that tokens in vision-language-action models show clear structure in their importance: some tokens drive decisions, others encode redundant context, and many carry predictive signal primarily for adjacent action steps due to the temporal continuity of manipulation. Once you understand this structure, you can prune without learning anything new. This suggests that much of the inefficiency in deployed vision-language-action models comes not from poor architecture or training, but from the models learning to be robust by maintaining redundancy that isn't strictly necessary when you understand the task structure. The action-aware temporal consistency principle generalizes: any domain where sequential decisions exhibit structure and state evolves continuously should show similar patterns.
Why This Matters
The core principle behind event-driven encoding on screen video is that episodic visual structure is fundamental to how GUIs work. Information concentrates at moments of state transition: button clicks, page loads, text input. The encoding backend operates on abstract before/after embedding pairs and is designed around this structure. Only the detection frontend, answering "when does meaningful change occur," needs to be tuned for specific domains.
The interesting theoretical observation is that event-driven encoding produces smaller, more efficient models. When we train different model sizes on each representation, smaller models plateau on uncompressed video but reach full performance on compressed video. A 4.2M-parameter model reaches only 62% on uncompressed but 75% on compressed. A 9.5M model reaches 75–76% on both. An 18M model reaches 76% on both. This pattern — where the information density of the compressed representation allows smaller models to be effective — suggests something fundamental about how models learn from structured data versus dense noise. Whether this reflects genuine representational advantages or is an artifact of our specific architecture and hyperparameters remains an open question worth exploring.
The core finding from event-driven encoding on screen video is that structure matters more than density. The principle that only the transitions carry information isn't unique to GUIs but applies to any video where most of the content is redundant. If that's true, then compression isn't just about saving tokens — it's about learning from better-structured data. Whether that leads to faster learning, better transfer, or more efficient models remains to be seen. But the architecture itself, the fact that the method works as designed, and the consistent improvements on downstream tasks suggest there's something real here worth understanding better.
Notes

The token pruning findings build on parallel research into attention-aware token selection and action-aware temporal consistency in vision-language models. The insight that token importance patterns carry signal across sequential actions emerged from studying how models allocate attention during policy execution in manipulation tasks.