Lighthouse AI Lab
Pioneering the convergence of Large Language Models and model-based Reinforcement Learning — Transformer world models with contrastive representations for autonomous agents that reason, plan, and act.
Intelligence requires
understanding both the World and the Word
We believe that true intelligence is not a single capability — it is the convergence of two fundamental forms of understanding.
Understanding the World — predicting consequences, learning from experience, building internal models of dynamic environments — is the domain of model-based reinforcement learning. Understanding the Word — language, reasoning, abstract knowledge — is where Large Language Models excel.
Neither alone is enough. A mind that has read every book but never experienced cause and effect is an LLM without world understanding. A being that can see and act but has no language to reason with is an RL agent without words. Intelligence emerges only at their intersection.
The Lighthouse Ventures AI Lab exists to build models that understand both — and in doing so, take a meaningful step toward general intelligence.
Language Understanding
LLMs excel at reasoning, instruction following, and semantic understanding — but lack grounded decision-making in dynamic environments.
Reinforcement Learning
RL agents learn optimal behaviors through trial and error — but struggle with generalization and sample efficiency in complex, open-ended tasks.
The Missing Link
Transformer-based world models with contrastive representations bridge this gap — learning rich temporal features that become the foundation for uniting language and action.
Transformer World Models
with Contrastive Representations
The architectural foundation for agents that build and reason about internal models of the world
Our research builds on the principle that intelligent agents need rich internal world models — not just reactive mappings from observations to actions, but deep representations of how environments evolve over time.
Traditional model-based RL predicts only the next state — like reading one word at a time without understanding the sentence. By combining Transformer architectures with action-conditioned Contrastive Predictive Coding (AC-CPC), we extend predictions up to 10 steps into the future, learning representations that capture the deep temporal structure of environments. This approach was validated at ICLR 2025 under the name TWISTER.
class TransformerWorldModel:
  # Transformer State-Space Model
  encoder        → z_t    # image → latent state
  transformer    → h_t    # temporal context
  dynamics       → ẑ_t    # next-state prediction
  decoder        → ô_t    # state → image
  # Action-Conditioned CPC
  representation → e_t^k  # future targets
  ac_cpc_predict → ê_t^k  # K=10 steps ahead
  # Agent Behavior
  actor          → π(a_t | s_t)  # policy
  critic         → V(s_t)        # value fn
The Key Insight
Predicting only the next state is not enough: adjacent frames are so similar that the Transformer can predict them trivially, without any deep understanding.
Predicting K=10 steps ahead with AC-CPC forces real understanding: distant states are genuinely different, so the model must learn meaningful temporal representations.
Encoder Network
A convolutional VAE with categorical latents (32 categorical distributions of 32 classes each) converts raw image observations into compact, discrete stochastic states z_t. This compressed representation captures the essential information from each frame.
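As an illustration of this discrete latent space, here is a minimal numpy sketch of the sampling step: 32 categorical distributions over 32 classes each, drawn as one-hot vectors and flattened. The real encoder is a trained network that uses straight-through gradients; the function name and shapes here are illustrative only.

```python
import numpy as np

def sample_categorical_latent(logits, rng):
    """Sample a discrete stochastic state z_t from encoder logits.

    logits: array of shape (32, 32) -- 32 categorical distributions,
    each over 32 classes. Returns a one-hot array per distribution,
    flattened into a single 1024-dim vector.
    """
    # Softmax per categorical distribution (rows)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Draw one class index per distribution, then one-hot encode
    classes = np.array([rng.choice(32, p=p) for p in probs])
    one_hot = np.eye(32)[classes]          # (32, 32), one 1 per row
    return one_hot.reshape(-1)             # flat 1024-dim z_t

rng = np.random.default_rng(0)
z_t = sample_categorical_latent(np.zeros((32, 32)), rng)
assert z_t.shape == (1024,) and z_t.sum() == 32.0
```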
Transformer World Model
A masked self-attention Transformer with relative positional encodings processes sequences of latent states and actions to produce hidden states h_t — building rich temporal context that carries historical information forward.
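To illustrate the masking, here is a minimal single-head causal self-attention in numpy. Learned projections, multiple heads, and the relative positional encodings are omitted, so this is a sketch of the attention pattern rather than the model itself.

```python
import numpy as np

def masked_self_attention(x):
    """Single-head causal self-attention over a sequence of states.

    x: (T, D) sequence. A triangular mask ensures step t attends only
    to steps <= t, so each output summarizes the past, never the future.
    """
    T, D = x.shape
    q, k, v = x, x, x                        # learned projections omitted
    scores = q @ k.T / np.sqrt(D)            # (T, T) attention logits
    mask = np.triu(np.ones((T, T)), k=1).astype(bool)
    scores[mask] = -np.inf                   # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                       # (T, D) hidden states h_t

h = masked_self_attention(np.random.default_rng(1).normal(size=(5, 8)))
assert h.shape == (5, 8)
```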
Action-Conditioned CPC
The core innovation: contrastive learning that maximizes mutual information between current model states and future stochastic states from augmented observations, conditioned on the sequence of future actions for reduced uncertainty.
Actor-Critic Agent
An actor network selects actions to maximize expected returns using REINFORCE with entropy regularization, while a critic network estimates state values using symlog cross-entropy loss and EMA stabilization.
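The symlog transform behind the critic's loss is simple to state: it compresses large magnitudes symmetrically while behaving like the identity near zero, which keeps value targets comparable across environments with very different reward scales. A small self-contained sketch (function names are ours):

```python
import numpy as np

def symlog(x):
    """Symmetric log transform: sign(x) * log(1 + |x|).
    Compresses large magnitudes, near-identity around zero."""
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):
    """Inverse of symlog, mapping predictions back to raw scale."""
    return np.sign(x) * np.expm1(np.abs(x))

vals = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
# Round-tripping recovers the original values
assert np.allclose(symexp(symlog(vals)), vals)
```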
Architecture
How the components of a Transformer world model work together in a unified learning pipeline
Training Objectives
The system optimizes a composite loss function that jointly trains all world model components:
MSE loss — trains the VAE encoder-decoder to learn faithful latent representations of image observations.
KL divergence — trains the Transformer to predict future latent states from context, with free-bits regularization.
Symlog cross-entropy — predicts environment rewards, handling scale variance across different task domains.
Binary cross-entropy — predicts episode termination signals for proper trajectory bootstrapping.
InfoNCE loss — the key innovation. Maximizes mutual information between model states and K=10 future augmented states, conditioned on actions.
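As a rough illustration of the InfoNCE term, the sketch below scores each prediction against its matching future embedding (positive) and the other embeddings in the batch (negatives). Action conditioning, augmentations, and the K-step structure are omitted; the function name and shapes are ours.

```python
import numpy as np

def info_nce_loss(pred, targets, temperature=1.0):
    """InfoNCE over a batch: pred[i] should match targets[i] against
    all other targets in the batch.

    pred, targets: (B, D) embeddings, standing in for the
    action-conditioned predictions ê_t^k and future states e_t^k.
    """
    logits = pred @ targets.T / temperature         # (B, B) similarities
    logits = logits - logits.max(axis=-1, keepdims=True)  # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # cross-entropy on diagonal

rng = np.random.default_rng(2)
e = rng.normal(size=(8, 16))
loss_matched = info_nce_loss(e * 5.0, e * 5.0)      # aligned pairs
loss_random = info_nce_loss(rng.normal(size=(8, 16)), e)
assert loss_matched < loss_random                   # alignment lowers the loss
```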
The Convergence of
LLMs & World Models
Where language understanding meets world modeling — building the next generation of autonomous agents
LLMs
- Semantic reasoning
- Instruction following
- Common-sense knowledge
- Natural language planning
AI Lab
World Models
- World modeling
- Temporal representation
- Action optimization
- Contrastive learning
Language-Grounded World Models
Augmenting the Transformer world model with natural language state descriptions. Instead of learning latent representations solely from visual input, we enrich the encoder with language embeddings — enabling the world model to reason about states using both perceptual and semantic information.
Hierarchical Planning with LLM Priors
Using LLMs as high-level planners that decompose complex tasks into sub-goals, while the Transformer world model handles low-level action execution. The LLM provides structured reward signals and goal specifications; the world model simulates and optimizes trajectories to achieve them.
Contrastive Language-Action Alignment
Extending the AC-CPC framework to align language descriptions with action sequences. By contrasting language-described outcomes with observed trajectories, we create a shared embedding space where instructions can be directly mapped to optimal behavior policies.
RL-Optimized Language Reasoning
Using reinforcement learning with verifiable rewards to fine-tune LLMs for improved world model reasoning. The contrastive representations provide dense reward signals that guide the LLM toward generating more accurate environment predictions and more effective action plans.
Proven Performance
Our approach — validated as TWISTER at ICLR 2025 — sets new records on the Atari 100k benchmark among methods without look-ahead search
Games with superhuman performance
Human-normalized median score
Environment interactions (~2 hours real-time)
Let's Build the Future Together
We're looking for collaborators, researchers, and visionaries who share our belief that the next breakthrough in AI lies at the intersection of language and action.
Our Location
Quarzweg 3
22395 Hamburg
Germany