Relling

We trained the first great models on text because it was cheap and everywhere.

Now the limits are obvious. High quality text is finite. Single modality models miss complex relations among objects, scenes, and abstract concepts.

Images help, but most image systems still need engineering overhead to think in 3D (boxes, masks, keypoints, etc). Useful, but expensive and task specific.

The next generation will not be text models. The next generation will not be image models.

It will be models trained on raw video experience.

Stack video plus depth, IMU, audio, force, and gaze to measure the event directly. You get robustness to blur, occlusion, and out-of-frame motion.

The closer the signal is to the event, the less the model guesses.

Train on high-fidelity measurements and you get systems that see, predict, and act.

That is how AI will close the loop with reality.

RELLING