Lighthouse AI Lab
Pioneering the convergence of Large Language Models and model-based Reinforcement Learning — Transformer world models with contrastive representations for autonomous agents that reason, plan, and act.
Intelligence requires
understanding both the World and the Word
We believe that true intelligence is not a single capability — it is the convergence of two fundamental forms of understanding.
Understanding the World — predicting consequences, learning from experience, building internal models of dynamic environments — is the domain of model-based reinforcement learning. Understanding the Word — language, reasoning, abstract knowledge — is where Large Language Models excel.
Neither alone is enough. A mind that has read every book but never experienced cause and effect is an LLM without world understanding. A being that can see and act but has no language to reason with is an RL agent without words. Intelligence emerges only at their intersection.
The Lighthouse Ventures AI Lab exists to build models that understand both — and in doing so, take a meaningful step toward general intelligence.
Language Understanding
LLMs excel at reasoning, instruction following, and semantic understanding — but lack grounded decision-making in dynamic environments.
Reinforcement Learning
RL agents learn optimal behaviors through trial and error — but struggle with generalization and sample efficiency in complex, open-ended tasks.
The Missing Link
Transformer-based world models with contrastive representations bridge this gap — learning rich temporal features that become the foundation for uniting language and action.
Transformer World Models
with Contrastive Representations
The architectural foundation for agents that build and reason about internal models of the world
Our research builds on the principle that intelligent agents need rich internal world models — not just reactive mappings from observations to actions, but deep representations of how environments evolve over time.
Traditional model-based RL predicts only the next state — like reading one word at a time without understanding the sentence. By combining Transformer architectures with action-conditioned Contrastive Predictive Coding (AC-CPC), we extend predictions up to 10 steps into the future, learning representations that capture the deep temporal structure of environments. This approach was validated at ICLR 2025 under the name TWISTER.
class TransformerWorldModel:
  # Transformer State-Space Model
  encoder        → z_t    # image → latent state
  transformer    → h_t    # temporal context
  dynamics       → ẑ_t    # next-state prediction
  decoder        → ô_t    # state → image
  # Action-Conditioned CPC
  representation → e_t^k  # future targets
  ac_cpc_predict → ê_t^k  # K=10 steps ahead
  # Agent Behavior
  actor          → π(a_t | s_t)  # policy
  critic         → V(s_t)        # value fn
The Key Insight
Predicting only the next state is not enough: adjacent frames are so similar that the Transformer can predict them trivially, without any deep understanding.
Predicting K=10 steps ahead with AC-CPC forces real understanding: distant states are genuinely different, so the model must learn meaningful temporal representations.
Encoder Network
A convolutional VAE with categorical latents (32 categorical distributions of 32 classes each) converts raw image observations into compact, discrete stochastic states z_t. This compressed representation captures the essential information from each frame.
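As an illustration of this discrete latent space, here is a minimal numpy sketch of the sampling step: 32 categorical distributions over 32 classes each, drawn as one-hot vectors and flattened. The real encoder is a trained network that uses straight-through gradients; the function name and shapes here are illustrative only.

```python
import numpy as np

def sample_categorical_latent(logits, rng):
    """Sample a discrete stochastic state z_t from encoder logits.

    logits: array of shape (32, 32) -- 32 categorical distributions,
    each over 32 classes. Returns a one-hot array per distribution,
    flattened into a single 1024-dim vector.
    """
    # Softmax per categorical distribution (rows)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Draw one class index per distribution, then one-hot encode
    classes = np.array([rng.choice(32, p=p) for p in probs])
    one_hot = np.eye(32)[classes]          # (32, 32), one 1 per row
    return one_hot.reshape(-1)             # flat 1024-dim z_t

rng = np.random.default_rng(0)
z_t = sample_categorical_latent(np.zeros((32, 32)), rng)
assert z_t.shape == (1024,) and z_t.sum() == 32.0
```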
Transformer World Model
A masked self-attention Transformer with relative positional encodings processes sequences of latent states and actions to produce hidden states h_t — building rich temporal context that carries historical information forward.
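To illustrate the masking, here is a minimal single-head causal self-attention in numpy. Learned projections, multiple heads, and the relative positional encodings are omitted, so this is a sketch of the attention pattern rather than the model itself.

```python
import numpy as np

def masked_self_attention(x):
    """Single-head causal self-attention over a sequence of states.

    x: (T, D) sequence. A triangular mask ensures step t attends only
    to steps <= t, so each output summarizes the past, never the future.
    """
    T, D = x.shape
    q, k, v = x, x, x                        # learned projections omitted
    scores = q @ k.T / np.sqrt(D)            # (T, T) attention logits
    mask = np.triu(np.ones((T, T)), k=1).astype(bool)
    scores[mask] = -np.inf                   # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                       # (T, D) hidden states h_t

h = masked_self_attention(np.random.default_rng(1).normal(size=(5, 8)))
assert h.shape == (5, 8)
```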
Action-Conditioned CPC
The core innovation: contrastive learning that maximizes mutual information between current model states and future stochastic states from augmented observations, conditioned on the sequence of future actions for reduced uncertainty.
Actor-Critic Agent
An actor network selects actions to maximize expected returns using REINFORCE with entropy regularization, while a critic network estimates state values using symlog cross-entropy loss and EMA stabilization.
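The symlog transform behind the critic's loss is simple to state: it compresses large magnitudes symmetrically while behaving like the identity near zero, which keeps value targets comparable across environments with very different reward scales. A small self-contained sketch (function names are ours):

```python
import numpy as np

def symlog(x):
    """Symmetric log transform: sign(x) * log(1 + |x|).
    Compresses large magnitudes, near-identity around zero."""
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):
    """Inverse of symlog, mapping predictions back to raw scale."""
    return np.sign(x) * np.expm1(np.abs(x))

vals = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
# Round-tripping recovers the original values
assert np.allclose(symexp(symlog(vals)), vals)
```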
Architecture
How the components of a Transformer world model work together in a unified learning pipeline
Training Objectives
The system optimizes a composite loss function that jointly trains all world model components:
MSE loss — trains the VAE encoder-decoder to learn faithful latent representations of image observations.
KL divergence — trains the Transformer to predict future latent states from context, with free-bits regularization.
Symlog cross-entropy — predicts environment rewards, handling scale variance across different task domains.
Binary cross-entropy — predicts episode termination signals for proper trajectory bootstrapping.
InfoNCE loss — the key innovation. Maximizes mutual information between model states and K=10 future augmented states, conditioned on actions.
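As a rough illustration of the InfoNCE term, the sketch below scores each prediction against its matching future embedding (positive) and the other embeddings in the batch (negatives). Action conditioning, augmentations, and the K-step structure are omitted; the function name and shapes are ours.

```python
import numpy as np

def info_nce_loss(pred, targets, temperature=1.0):
    """InfoNCE over a batch: pred[i] should match targets[i] against
    all other targets in the batch.

    pred, targets: (B, D) embeddings, standing in for the
    action-conditioned predictions ê_t^k and future states e_t^k.
    """
    logits = pred @ targets.T / temperature         # (B, B) similarities
    logits = logits - logits.max(axis=-1, keepdims=True)  # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # cross-entropy on diagonal

rng = np.random.default_rng(2)
e = rng.normal(size=(8, 16))
loss_matched = info_nce_loss(e * 5.0, e * 5.0)      # aligned pairs
loss_random = info_nce_loss(rng.normal(size=(8, 16)), e)
assert loss_matched < loss_random                   # alignment lowers the loss
```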
The Convergence of
LLMs & World Models
Where language understanding meets world modeling — building the next generation of autonomous agents
LLMs
- Semantic reasoning
- Instruction following
- Common-sense knowledge
- Natural language planning
AI Lab
World Models
- World modeling
- Temporal representation
- Action optimization
- Contrastive learning
Language-Grounded World Models
Augmenting the Transformer world model with natural language state descriptions. Instead of learning latent representations solely from visual input, we enrich the encoder with language embeddings — enabling the world model to reason about states using both perceptual and semantic information.
Hierarchical Planning with LLM Priors
Using LLMs as high-level planners that decompose complex tasks into sub-goals, while the Transformer world model handles low-level action execution. The LLM provides structured reward signals and goal specifications; the world model simulates and optimizes trajectories to achieve them.
Contrastive Language-Action Alignment
Extending the AC-CPC framework to align language descriptions with action sequences. By contrasting language-described outcomes with observed trajectories, we create a shared embedding space where instructions can be directly mapped to optimal behavior policies.
RL-Optimized Language Reasoning
Using reinforcement learning with verifiable rewards to fine-tune LLMs for improved world model reasoning. The contrastive representations provide dense reward signals that guide the LLM toward generating more accurate environment predictions and more effective action plans.
Proven Performance
Our approach — validated as TWISTER at ICLR 2025 — sets new records on the Atari 100k benchmark among methods without look-ahead search
Games with superhuman performance
Human-normalized median score
Environment interactions (~2 hours real-time)
Let's Build the Future Together
We're looking for collaborators, researchers, and visionaries who share our belief that the next breakthrough in AI lies at the intersection of language and action.
Our Location
Quarzweg 3
22395 Hamburg
Germany