Lighthouse AI Lab

Pioneering the convergence of Large Language Models and model-based Reinforcement Learning — Transformer world models with contrastive representations for autonomous agents that reason, plan, and act.

Our Vision

Intelligence requires
understanding both the World and the Word

We believe that true intelligence is not a single capability — it is the convergence of two fundamental forms of understanding.

Understanding the World — predicting consequences, learning from experience, building internal models of dynamic environments — is the domain of model-based reinforcement learning. Understanding the Word — language, reasoning, abstract knowledge — is where Large Language Models excel.

Neither alone is enough. A mind that has read every book but never experienced cause and effect is an LLM without world understanding. A being that can see and act but has no language to reason with is an RL agent without words. Intelligence emerges only at their intersection.

[Diagram: two overlapping circles, "The World" (World Models) and "The Word" (LLMs), with "Intelligence" at their intersection.]
The Lighthouse Ventures AI Lab exists to build models that understand both — and in doing so, take a meaningful step toward general intelligence.

Language Understanding

LLMs excel at reasoning, instruction following, and semantic understanding — but lack grounded decision-making in dynamic environments.

Reinforcement Learning

RL agents learn optimal behaviors through trial and error — but struggle with generalization and sample efficiency in complex, open-ended tasks.

The Missing Link

Transformer-based world models with contrastive representations bridge this gap — learning rich temporal features that become the foundation for uniting language and action.

Core Technology

Transformer World Models
with Contrastive Representations

The architectural foundation for agents that build and reason about internal models of the world

Our research builds on the principle that intelligent agents need rich internal world models — not just reactive mappings from observations to actions, but deep representations of how environments evolve over time.

Traditional model-based RL predicts only the next state — like reading one word at a time without understanding the sentence. By combining Transformer architectures with action-conditioned Contrastive Predictive Coding (AC-CPC), we extend predictions up to 10 steps into the future, learning representations that capture the deep temporal structure of environments. This approach was validated at ICLR 2025 under the name TWISTER.

world_model.py
class TransformerWorldModel:
    # Transformer State-Space Model
    encoder     → z_t   # image → latent state
    transformer → h_t   # temporal context
    dynamics    → ẑ_t   # next-state prediction
    decoder     → ô_t   # state → image

    # Action-Conditioned CPC
    representation → e_t^k  # future targets
    ac_cpc_predict → ê_t^k  # K = 10 steps ahead

    # Agent Behavior
    actor  → π(a_t | s_t)   # policy
    critic → V(s_t)         # value function

The Key Insight

Previous Approaches

Predict only the next state (t → t+1 → t+2 → …). Adjacent frames are too similar — the Transformer doesn't need deep understanding to predict trivially similar states.

Our Approach

Predict K = 10 steps ahead using AC-CPC (t → t+1 → … → t+10). Distant states are genuinely different, forcing the model to learn meaningful temporal representations.

01

Encoder Network

A convolutional VAE with categorical latents (32 categories × 32 classes) converts raw image observations into compact, discrete stochastic states z_t. This compressed representation captures the essential information from each frame.

Perception
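As an illustrative sketch (not the lab's implementation), the discrete latent can be drawn as 32 independent categorical samples, each a one-hot vector over 32 classes; `sample_latent` and its logits layout are hypothetical names:

```python
import math
import random

NUM_CATEGORICALS = 32  # independent categorical variables per latent state
NUM_CLASSES = 32       # classes per categorical

def softmax(logits):
    # numerically stable softmax over one categorical's logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_latent(logits):
    # logits: NUM_CATEGORICALS lists of NUM_CLASSES encoder outputs;
    # returns z_t as 32 one-hot vectors of length 32 (a sparse 1024-dim state)
    z = []
    for cat_logits in logits:
        probs = softmax(cat_logits)
        idx = random.choices(range(NUM_CLASSES), weights=probs, k=1)[0]
        one_hot = [0.0] * NUM_CLASSES
        one_hot[idx] = 1.0
        z.append(one_hot)
    return z
```

In practice a straight-through gradient estimator is layered on top so the discrete sample stays differentiable during training.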
02

Transformer World Model

A masked self-attention Transformer with relative positional encodings processes sequences of latent states and actions to produce hidden states h_t — building rich temporal context that carries historical information forward.

Temporal Reasoning
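"Masked" here means causal: position t may only attend to positions up to t. A minimal pure-Python sketch of the mask and the resulting attention weights (illustrative only, not the lab's code):

```python
import math

def causal_mask(seq_len):
    # 1 where attention is allowed (j <= i), 0 for future positions
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def masked_attention_weights(scores, mask):
    # row-wise softmax, with masked-out positions sent to -inf (weight 0)
    weights = []
    for row, mrow in zip(scores, mask):
        masked = [s if m else float("-inf") for s, m in zip(row, mrow)]
        mx = max(masked)
        exps = [math.exp(s - mx) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With uniform scores, position 0 attends only to itself while later positions spread attention evenly over their visible past.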
03

Action-Conditioned CPC

The core innovation: contrastive learning that maximizes mutual information between current model states and future stochastic states from augmented observations, conditioned on the sequence of future actions for reduced uncertainty.

Representation Learning
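For a single anchor, the InfoNCE objective behind AC-CPC is a softmax cross-entropy over similarities between the prediction and a batch of candidate future embeddings (one positive, the rest negatives). A minimal sketch with hypothetical names, using dot-product similarity:

```python
import math

def info_nce(pred, targets, pos_index, temperature=1.0):
    # similarity of the prediction with each candidate embedding
    sims = [sum(p * t for p, t in zip(pred, tgt)) / temperature for tgt in targets]
    # -log softmax probability assigned to the positive candidate
    mx = max(sims)
    log_z = mx + math.log(sum(math.exp(s - mx) for s in sims))
    return log_z - sims[pos_index]
```

The loss is small when the prediction is closest to the true future state and large when a negative wins, which is exactly the pressure that forces the representations to encode temporal structure.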
04

Actor-Critic Agent

An actor network selects actions to maximize expected returns using REINFORCE with entropy regularization, while a critic network estimates state values using symlog cross-entropy loss and EMA stabilization.

Decision Making
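The symlog squashing used by the reward and critic heads is simple to state; a minimal sketch (function names are mine):

```python
import math

def symlog(x):
    # sign(x) * ln(|x| + 1): compresses large magnitudes, near-identity around 0
    return math.copysign(math.log(abs(x) + 1.0), x)

def symexp(x):
    # inverse of symlog, used to decode predictions back to raw scale
    return math.copysign(math.exp(abs(x)) - 1.0, x)
```

Because symlog is symmetric and invertible, one set of head weights can handle reward scales that differ by orders of magnitude across games.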
System Design

Architecture

How the components of a Transformer world model work together in a unified learning pipeline

World Model Learning
Observations o_1:T → Encoder (Conv VAE) → latent states z_t ∈ 32×32 → Transformer (masked self-attention) → predictions ẑ, r̂, ĉ, ê^k

Contrastive Predictive Coding
Augmented views o'_t → representation targets q_k(z'_{t+k}); the AC-CPC predictor p_k(s_t, a_{t:t+k}) is trained against these targets with InfoNCE to maximize mutual information

Agent Behavior Learning
Imagination rollouts (H = 15 steps) → actor π_θ (REINFORCE) and critic V_ψ (λ-returns) → actions a_t ~ π(·|s_t)
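The λ-returns the critic is trained against can be computed by a backward recursion over an imagined trajectory. A minimal sketch; the default γ and λ values are DreamerV3-style assumptions, not published settings of this lab:

```python
def lambda_returns(rewards, values, continues, gamma=0.997, lam=0.95):
    # Backward recursion: G_t = r_t + gamma * c_t * ((1-lam) * V_{t+1} + lam * G_{t+1}),
    # where c_t is the predicted continuation flag.
    # `values` has one extra entry V_H used as the bootstrap at the horizon.
    horizon = len(rewards)
    returns = [0.0] * horizon
    next_return = values[horizon]
    for t in reversed(range(horizon)):
        returns[t] = rewards[t] + gamma * continues[t] * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        next_return = returns[t]
    return returns
```

λ interpolates between one-step bootstrapping (λ = 0) and full Monte-Carlo returns (λ = 1), trading bias against variance over the 15-step imagination horizon.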

Training Objectives

The system optimizes a composite loss function that jointly trains all world model components:

L_rec
Reconstruction

MSE loss — trains the VAE encoder-decoder to learn faithful latent representations of image observations.

L_dyn
Dynamics

KL divergence — trains the Transformer to predict future latent states from context, with free-bits regularization.

L_rew
Reward

Symlog cross-entropy — predicts environment rewards, handling scale variance across different task domains.

L_con
Continuation

Binary cross-entropy — predicts episode termination signals for proper trajectory bootstrapping.

L_cpc
Contrastive (AC-CPC)

InfoNCE loss — the key innovation. Maximizes mutual information between model states and K=10 future augmented states, conditioned on actions.
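Taken together, the composite objective is a weighted sum of the five losses. The sketch below uses unit weights and a free-bits threshold of 1 nat purely for illustration; the actual coefficients are not stated here:

```python
def total_world_model_loss(l_rec, kl_dyn, l_rew, l_con, l_cpc,
                           free_bits=1.0, weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    # Free-bits regularization: the dynamics KL is floored at `free_bits` nats,
    # so an already-small KL exerts no further pressure on the model.
    l_dyn = max(kl_dyn, free_bits)
    w_rec, w_dyn, w_rew, w_con, w_cpc = weights
    return (w_rec * l_rec + w_dyn * l_dyn + w_rew * l_rew
            + w_con * l_con + w_cpc * l_cpc)
```

All five terms are optimized jointly in a single backward pass, so the contrastive signal shapes the same latent space that the reconstruction and dynamics heads use.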

Research Frontier

The Convergence of
LLMs & World Models

Where language understanding meets world modeling — building the next generation of autonomous agents

LLMs

  • Semantic reasoning
  • Instruction following
  • Common-sense knowledge
  • Natural language planning

AI Lab
Focus

World Models

  • World modeling
  • Temporal representation
  • Action optimization
  • Contrastive learning

Language-Grounded World Models

Augmenting the Transformer world model with natural language state descriptions. Instead of learning latent representations solely from visual input, we enrich the encoder with language embeddings — enabling the world model to reason about states using both perceptual and semantic information.

Multimodal Encoding · Grounded Representations

Hierarchical Planning with LLM Priors

Using LLMs as high-level planners that decompose complex tasks into sub-goals, while the Transformer world model handles low-level action execution. The LLM provides structured reward signals and goal specifications; the world model simulates and optimizes trajectories to achieve them.

Task Decomposition · Goal-Conditioned RL

Contrastive Language-Action Alignment

Extending the AC-CPC framework to align language descriptions with action sequences. By contrasting language-described outcomes with observed trajectories, we create a shared embedding space where instructions can be directly mapped to optimal behavior policies.

Cross-Modal CPC · Instruction Following

RL-Optimized Language Reasoning

Using reinforcement learning with verifiable rewards to fine-tune LLMs for improved world model reasoning. The contrastive representations provide dense reward signals that guide the LLM toward generating more accurate environment predictions and more effective action plans.

RLVR · Dense Rewards
Benchmark Results

Proven Performance

Our approach — validated as TWISTER at ICLR 2025 — sets new records on the Atari 100k benchmark among methods without look-ahead search

Human-Normalized Mean Score (%) — Atari 100k

SimPLe     33%
TWM        96%
IRIS      105%
DreamerV3 112%
STORM     127%
Δ-IRIS    139%
TWISTER   162%
(Human level = 100%)
12/26 games with superhuman performance

77% human-normalized median score

100k environment interactions (~2 hours real-time)
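The human-normalized score behind these numbers follows the standard Atari 100k convention: 0% corresponds to a random policy and 100% to human play, using per-game baseline scores. A minimal sketch:

```python
def human_normalized(score, random_score, human_score):
    # 0% = random-policy baseline, 100% = human baseline, per game
    return 100.0 * (score - random_score) / (human_score - random_score)
```

Mean and median are then taken across the 26 games, which is why an agent can exceed 100% on the mean (driven by a few very strong games) while its median sits below human level.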

Get Involved

Let's Build the Future Together

We're looking for collaborators, researchers, and visionaries who share our belief that the next breakthrough in AI lies at the intersection of language and action.

Our Location

Quarzweg 3
22395 Hamburg
Germany