Munich Datageeks e.V.
Talk "Expanding our Horizon with Agents and Open World Models"
Christoph Mittendorf on stage

Felix Reuthlinger

Google engineer Christoph Mittendorf demonstrates AI use cases at Munich Datageeks: Lederhosen trend prediction using search data, running-form analysis against Eliud Kipchoge, and Genie 3 open world models for gaming and robotics applications.

The talk was presented at the 100th Munich Datageeks Meetup in October 2025.

Abstract

The presentation features personal use cases demonstrating how technology can solve real-world problems, emphasizing that learning through personal projects can lead to solutions applicable in professional settings. The talk covers the Gemini and Genie models, which had not been mentioned in previous talks at the event.

About the speaker

The speaker, Christoph Mittendorf, leads a customer engineering team at Google Germany that focuses on AI and machine learning problems.

Transcript summary

Google's Evolution and Technology Stack

Google originated as a research project in 1995, starting as a search engine. The company's iconic colors were inspired by Lego bricks, and one of the original servers, built from Lego, is now in a museum. In the early days, Google competed with market leaders like Yahoo, AltaVista, Excite, Lycos, and Infoseek. By 2008, Google Cloud emerged alongside other ventures.

Agent Development and Logo Recognition Challenge

The speaker introduced a challenge about identifying correct Google logo color combinations. Out of 12 logo variations shown, participants needed to identify which were correct. The answer is that all 12 are correct, though it becomes a trick question when considering October 1998, when Google briefly used a green G for one month.

To solve this computationally, the speaker used Google's Agent Development Kit (ADK), which sits between no-code solutions and complex frameworks like LangChain and LangGraph. The ADK provides a user interface for building agentic workflows and is less complex than traditional approaches while offering more control than no-code options.

A simple agent initially fails at this task because large language models struggle when presented with 12 logos simultaneously. The solution involves creating a parent agent that calls two specialized tool agents: one that identifies location and counts logos in the image, and another that checks color combinations. Without this decomposition, both ChatGPT and Gemini produce incorrect results with counting and color identification errors. The agent system can detect even subtle errors, such as identifying two identical logos in the set that the speaker had not noticed.
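The decomposition described above can be sketched in plain Python. This is an illustrative stand-in, not the actual ADK API: `count_logos_tool`, `check_colors_tool`, and `parent_agent` are hypothetical names, and a real build would register the two tools on an ADK agent rather than calling them directly.

```python
# Minimal sketch of the parent/tool-agent decomposition: one tool counts
# logos, another checks colors, and the parent merges both results.
# Each logo is modeled as its letter-color sequence.

def count_logos_tool(image_grid):
    """Tool agent 1: locate and count the logos in the image."""
    return {"count": len(image_grid)}

def check_colors_tool(image_grid):
    """Tool agent 2: validate each logo's letter-color sequence."""
    correct = ["blue", "red", "yellow", "blue", "green", "red"]  # G-o-o-g-l-e
    return [logo == correct for logo in image_grid]

def parent_agent(image_grid):
    """Parent agent: delegate to both tools, then merge their outputs."""
    count = count_logos_tool(image_grid)["count"]
    verdicts = check_colors_tool(image_grid)
    return {"logos_found": count,
            "correct": sum(verdicts),
            "incorrect_indices": [i for i, ok in enumerate(verdicts) if not ok]}

grid = [["blue", "red", "yellow", "blue", "green", "red"],   # correct logo
        ["red", "blue", "yellow", "blue", "green", "red"]]   # first two swapped
print(parent_agent(grid))
```

The point of the split mirrors the talk: neither tool has to solve the whole problem, so each subtask stays small enough for a model to answer reliably.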

Lederhosen Trend Prediction System

The main use case addresses buying optimal Lederhosen for Oktoberfest, which the speaker found surprisingly complex when moving to Bavaria two years ago. This represents a broader demand sensing and sales planning problem applicable to many industries.

The Data Challenge

Google processes approximately 17 million Lederhosen-related searches in Germany alone, with clear seasonality showing two-week peaks annually from 2004 to 2024. The challenge lies in understanding what trends are currently popular and, more importantly, what will be trendy during the actual Oktoberfest period. Search queries provide clues about consumer interests, and the real-time nature of this data is valuable because it shows what people search for on specific days and in specific locations.

Approximately 15 percent of daily searches Google receives have never been seen before, making rule-based systems impractical. These could be new car models, new songs, or any emerging topics. A capable machine learning model can identify thousands of product and service categories from this search data.

Technical Architecture

The system uses an agent-based approach with several components:

  1. Data Sources: Search data stored in BigQuery, Google's data warehouse. BigQuery also provides public datasets of search data available for free to anyone, which the speaker considers more valuable than traditional datasets for market prediction.
  2. Data Insights Agent: A pre-built agent that analyzes data in BigQuery. Using natural language instructions, it can train models like K-means clustering to identify important categories without requiring manual model development.
  3. Time Series Model: An ARIMA model for forecasting trends. The goal is not to know what is trending currently, but what will be trending during Oktoberfest.
  4. Creative Agent: Uses time series data to generate visual outputs showing how the trends would look on a person.
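The forecasting step (component 3) can be illustrated with a toy example. The talk uses an ARIMA model on BigQuery search data; to keep this sketch dependency-free, a seasonal-naive baseline stands in for ARIMA, and the weekly search-volume numbers are invented.

```python
# Seasonal-naive stand-in for the ARIMA forecasting step: predict each
# future week as the value observed exactly one season (52 weeks) earlier.

def seasonal_naive_forecast(history, season_length, horizon):
    """Forecast `horizon` points by repeating the last full season."""
    return [history[-season_length + (h % season_length)] for h in range(horizon)]

# Invented weekly search-volume index, two 52-week seasons, with the
# two-week Oktoberfest spike around weeks 37-38 each year.
weekly_volume = [10 + (90 if week % 52 in (37, 38) else 0)
                 for week in range(104)]

forecast = seasonal_naive_forecast(weekly_volume, season_length=52, horizon=52)
peak_week = max(range(52), key=lambda w: forecast[w])
print(peak_week)  # the forecast peak lands on the Oktoberfest weeks again
```

This captures the goal stated above: the question is not what trends now, but what the curve will look like when Oktoberfest actually happens.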

Analysis Results

The data, drawn from YouTube (the world's second-largest search engine) and Google Search, consists of plain-text search queries with absolute search volumes. Much of the data appears in English, suggesting significant tourist interest.

The K-means clustering model identified six critical Lederhosen attributes:

  • Size
  • Color
  • Fit
  • Material
  • Pattern
  • Shirt (which the speaker initially overlooked)

The time series forecast for Oktoberfest 2024 predicted: short, brown, slim fit, deerskin material with a floral pattern, and a white shirt. The creative agent takes these attributes and generates personalized images showing how someone would look wearing the predicted trends.
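The hand-off from the forecast to the creative agent can be sketched as prompt construction: the attribute values below are the ones from the talk, while `build_prompt`, its wording, and the `trend_factor` knob are illustrative assumptions rather than the real system.

```python
# Sketch: turn the forecast Lederhosen attributes into a text-to-image
# prompt for the creative agent. trend_factor > 1 exaggerates the trend
# ("twice as trendy"), as described later in the talk.

def build_prompt(attrs, trend_factor=1.0):
    """Compose an image-generation prompt from forecast attributes."""
    emphasis = "heavily " if trend_factor > 1.0 else ""
    return (f"A person at Oktoberfest wearing {attrs['fit']} {attrs['color']} "
            f"Lederhosen, {attrs['length']}, made of {attrs['material']}, "
            f"with a {emphasis}{attrs['pattern']} pattern and a "
            f"{attrs['shirt']} shirt")

forecast_2024 = {"length": "short", "color": "brown", "fit": "slim-fit",
                 "material": "deerskin", "pattern": "floral", "shirt": "white"}
print(build_prompt(forecast_2024))
print(build_prompt(forecast_2024, trend_factor=2.0))
```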

Personalization and Trend Setting

The system goes beyond simply following trends. Users can request to be twice as trendy, adding more floral patterns or other exaggerations. The system also supports personalization based on additional context, such as generating Lederhosen designs incorporating football club colors for fans from specific regions.

The system updates hourly based on changing search trends. As different patterns gain or lose popularity in search data, the model generates new creative images reflecting the latest predictions. The speaker used diffusion models (Imagen 3 for images, Veo for videos) to generate these personalized visualizations, even incorporating the user's face into the images.

SynthID Watermarking

An important consideration is bias and authenticity. Google implements SynthID watermarking for content generated by its models. This watermarking is imperceptible to humans but works across audio, images, and even text. For text, large language models can be configured to produce watermarked output. This allows later verification of whether content was AI-generated using Google technology. Practical applications include checking apartment rental applications to determine if they were AI-generated.

Running Efficiency Analysis with Computer Vision

The speaker is also a hobby triathlete who struggles particularly with running. The goal was to improve running efficiency by comparing his form against that of professional runner Eliud Kipchoge, who ran a marathon in under two hours.

Technical Approach

The analysis used TensorFlow Hub's pre-trained models, specifically PoseNet or MoveNet. These are the same models used in dance applications, breakdancing apps, or fitness apps that count push-ups. The model analyzes video frame-by-frame to capture movement patterns and posture, tracking how body parts move relative to each other over time (for example, left foot movement relative to right arm).

Kipchoge's running was established as the benchmark at 100 percent efficiency. The speaker filmed himself running on a treadmill and ran the same analysis. The initial result showed 70 percent similarity to Kipchoge's form. When presenting this to his coach Yim, the coach pointed out that the remaining 30 percent was precisely what needed improvement.
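One way to turn per-frame keypoints into a single percentage is sketched below. MoveNet does emit a fixed set of body keypoints per frame, but the scoring here (mean cosine similarity of mean-centered keypoint vectors, with tiny invented poses) is an illustrative assumption, not necessarily the speaker's exact metric.

```python
import math

# Sketch: compare two pose sequences frame by frame. Each frame is a list
# of (y, x) keypoints; mean-centering removes the overall position so only
# the body configuration is compared.

def frame_similarity(pose_a, pose_b):
    """Cosine similarity between two flattened, mean-centered poses."""
    def center(pose):
        flat = [c for keypoint in pose for c in keypoint]
        mean = sum(flat) / len(flat)
        return [c - mean for c in flat]
    a, b = center(pose_a), center(pose_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def run_similarity(frames_a, frames_b):
    """Average per-frame similarity across two aligned video clips."""
    scores = [frame_similarity(fa, fb) for fa, fb in zip(frames_a, frames_b)]
    return sum(scores) / len(scores)

# A tiny invented 3-keypoint pose; an identical pair scores 1.0 (= 100%).
pose = [(0.1, 0.5), (0.4, 0.45), (0.8, 0.55)]
print(round(run_similarity([pose], [pose]), 3))  # → 1.0
```

In this framing, Kipchoge's clip compared against itself defines the 100 percent benchmark, and the speaker's clip scores somewhere below it.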

Shoe Analysis

The speaker used Gemini to analyze the running footage and identify areas for improvement. Surprisingly, Gemini immediately identified that the shoes looked old and worn out. The speaker had purchased these shoes cheaply as a student and bought them in every color, but they were no longer state-of-the-art.

To test the shoe impact, three experiments were conducted:

  • Original old shoes (70% Kipchoge similarity)
  • Flip-flops (found in the basement)
  • Carbon fiber plate running shoes (second-best option available)

The visual difference was apparent even without machine learning models. Running at 20 kilometers per hour in flip-flops proved extremely difficult. The carbon fiber shoes produced a clear improvement in running form, achieving 76 percent similarity to Kipchoge.

As a humorous conclusion, the speaker used a diffusion model to generate an enhanced shoe design, which supposedly achieved 200 kilometers per hour, breaking the world record. The key takeaway is that an entire pipeline processed unstructured video material and extracted meaningful insights within one to two minutes.

Genie 3: Open World Models

Genie 3 represents a new category of model that the speaker finds particularly exciting. Approximately half the audience was already aware of this model, which is notable given the volume of research being published.

How Genie 3 Works

Genie 3 is classified as an open world model. Unlike traditional video generation, it creates interactive worlds. Users can input natural language prompts or images, and the model generates an explorable environment. The model learned how physics works by watching thousands of hours of YouTube videos.

The key innovation is interactivity. Users can perform actions within these generated worlds, such as:

  • Moving around
  • Opening car doors
  • Sitting down
  • Falling off cliffs

Dynamic World Generation

A particularly striking example from the blog post showed someone providing an image of a house interior. The person looked out a window, then moved left to a wall and began painting it blue. The critical insight is that the left side of the room did not exist until the person moved there—it was generated on the fly.

More remarkably, when the person turned back to look at the painted wall again, the blue paint persisted. Despite the model having millions or billions of parameters and constantly generating new content, actions taken in the world remain consistent.

Applications Beyond Gaming

While gaming is an obvious application, the implications extend to robotics. Current humanoid robots, including Google's, have limited capabilities because insufficient training data exists for all their degrees of freedom. With Genie 3, researchers can generate artificial worlds and create vast amounts of training data for robots to learn in simulated environments. This capability could significantly accelerate robot development.

Low Latency Models and Non-Deterministic Operating Systems

The presentation concluded with a demonstration of Gemini 2.5 Flash's capabilities. This research prototype showcases extreme low latency in action.

The Operating System Demo

The demonstration showed an operating system where Gemini writes the code for the UI and its contents in real-time based on previous screen context—specifically, in the time it takes to click a button. This creates a non-deterministic operating system where navigating to a folder and returning to it results in completely different contents each time.

The model processes content and generates responses in fractions of a second. While the practical applications are still unclear, this showcases the speed at which modern models can understand user actions and generate appropriate outputs. The speaker humorously noted that while this might confuse older users who struggle when desktop icons move, it demonstrates the power of these rapid response systems.