Munich Datageeks e.V.
Talk "AI-Assisted Requirements Engineering"

Felix Reuthlinger

BSH applies LLMs to systems engineering — automating requirements syntax checking, quality validation, and rewriting — while building a centralized AI platform to standardize RAG pipelines, observability, and governance across 400+ enterprise use cases.

The talk was presented at the Munich Datageeks February Edition 2026.

Abstract

This talk from a Munich Datageeks event features two speakers from BSH, a manufacturer of home appliances. The first speaker, Philip, presents how BSH is applying large language models to their systems engineering process — specifically to requirements engineering — as part of a Bavarian state-funded project. He walks through three concrete use cases: automated syntax checking, quality checking, and AI-assisted rewriting of requirements. The second speaker, Seif Elafi, zooms out to address the broader organizational challenge: how does a large global company like BSH build and govern a central AI platform capable of supporting hundreds of diverse use cases? He discusses the importance of standardization, RAG pipelines, observability, guard rails, and cost governance as foundational building blocks of a scalable enterprise AI platform.


About the Speaker

The talk features two speakers from BSH (BSH Hausgeräte GmbH). Philip works in product development with a focus on applying AI to systems engineering processes. Seif Elafi works on BSH's central AI platform, focusing on the organizational and architectural challenges of scaling AI use cases across a large global enterprise.


Transcript Summary

BSH and the Challenge of Complex Distributed Systems

BSH appliances are complex distributed systems that span embedded hardware, embedded software, cloud software (Home Connect Cloud), and mobile applications (Home Connect App). To develop such systems, BSH follows an industry-standard systems engineering approach — most notably the V-model. The V-model structures development across multiple layers of system definition down to implementation, with each layer paired with a corresponding verification or validation step and iterative feedback loops between them.

A key observation motivating this project is that nearly all tasks in the V-model involve the generation of language artifacts: requirements are language artifacts, source code is a language artifact, and test cases are language artifacts. This made LLMs a natural fit.

The AI in Systems Engineering Funding Project

Two years ago, BSH began exploring how to apply the LLM revolution to their systems engineering process, with the goals of increasing efficiency and reducing time to market. This work is carried out within a project funded through a Bavarian state funding program.

The project spans four main areas:

  • Requirements Management: Checking quality, finding similar requirements, generating requirements
  • Software Development: Generating embedded source code, unit tests, API documentation, and automating translation of UI strings into 30+ languages
  • Knowledge Base Generation: Using AI to build a company knowledge base to accelerate onboarding of new engineers
  • Validation & Verification: Generating test scenarios and test code from requirements, log file analysis, and error severity classification

Overarching all areas is the challenge of AI trust and hallucinations — ensuring that LLM outputs are always double-checked, especially when regulatory compliance is involved.

Focus Area: Requirements Engineering

Requirements in product development follow a strict syntax template that is different from the user stories more common in software teams. A requirement typically has a precondition, a subject system, the keyword "shall," and a defined task. This strict syntax ensures intersubjective comprehensibility — any engineer reading the requirement can understand it unambiguously.

After a requirement is written, it goes through a review process. The first step is always a syntax check, which verifies correct nomenclature, proper use of keywords, adherence to the template, correct preconditions, and the absence of spelling errors.

Use Case 1 – Syntax Checking

An LLM-powered tool checks a given requirement against a defined set of syntax rules. In a live demo, the model correctly identifies that a requirement is written in passive voice rather than active voice and provides a detailed explanation of the issue. Building this required manually annotating a large dataset of requirements with ground-truth labels, which is used to evaluate and fine-tune prompts and models.
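The evaluation setup around the annotated dataset might look roughly like the sketch below. `llm_syntax_check` is a stand-in for the real model call, and the rule names and example requirements are invented for illustration:

```python
# Sketch: evaluating an LLM syntax checker against a manually annotated
# ground-truth dataset. llm_syntax_check stands in for the real model
# call; rule names and example requirements are invented.

def llm_syntax_check(requirement: str) -> set[str]:
    """Placeholder for the LLM call: returns the set of violated rules."""
    findings = set()
    if " is " in requirement and " by " in requirement:
        findings.add("passive_voice")  # crude heuristic stand-in
    if "shall" not in requirement:
        findings.add("missing_shall")
    return findings

def evaluate(dataset: list[tuple[str, set[str]]]) -> dict[str, float]:
    """Micro-averaged precision/recall of predicted vs. annotated violations."""
    tp = fp = fn = 0
    for requirement, annotated in dataset:
        predicted = llm_syntax_check(requirement)
        tp += len(predicted & annotated)
        fp += len(predicted - annotated)
        fn += len(annotated - predicted)
    return {
        "precision": tp / (tp + fp) if tp + fp else 1.0,
        "recall": tp / (tp + fn) if tp + fn else 1.0,
    }
```

Scoring predictions against the annotated labels this way is what allows prompt and model variants to be compared objectively rather than by spot-checking.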

Use Case 2 – Quality Checking

Beyond syntax, requirements must also meet quality criteria, including:

  • Clarity – clearly formulated
  • Atomicity – one topic per requirement only
  • Completeness – no missing information
  • Internal Consistency – no contradictions within the requirement
  • Unambiguity – no room for multiple interpretations
  • Verifiability – the requirement must be testable
  • Implementation-freedom – the requirement should define what, not how

Some criteria can be checked against the requirement text alone. Others require checking against the full set of requirements in the system — such as consistency, traceability (linking requirement to implementation and test), and feasibility (whether BSH has the resources to implement it). The latter two are significantly harder to automate, as implementations and test cases often don't yet exist when requirements are written. This remains an active area of work.
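The split between per-requirement checks and whole-set checks could be modeled as a simple dispatch in the checking pipeline. The criterion names follow the talk; the grouping and function are an illustrative sketch, not BSH's implementation:

```python
# Sketch: criteria that need only the single requirement ("local") vs.
# criteria that need the full requirement set ("global"). The dispatch
# tells the pipeline how much context to load before checking.

LOCAL_CRITERIA = {"clarity", "atomicity", "completeness",
                  "internal_consistency", "unambiguity",
                  "verifiability", "implementation_freedom"}
GLOBAL_CRITERIA = {"consistency", "traceability", "feasibility"}

def required_context(criterion: str) -> str:
    """Return which scope of data a criterion's check must load."""
    if criterion in LOCAL_CRITERIA:
        return "single_requirement"
    if criterion in GLOBAL_CRITERIA:
        return "full_requirement_set"
    raise ValueError(f"unknown criterion: {criterion}")
```

The global criteria are the expensive ones: they require retrieving related requirements, implementations, and tests, which often do not exist yet.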

In a demo, the quality check correctly identifies that a requirement mentioning a D-Bus connector violates the implementation-freedom criterion, as it prescribes an implementation detail.

Use Case 3 – Automated Requirement Rewriting

To support engineers more proactively, BSH implemented a bounded refinement loop: the LLM takes a requirement, runs it through the syntax and quality checks, uses the check results as additional prompt context, and rewrites the requirement. This loop repeats until the requirement passes or a maximum number of iterations is reached — five for syntax checks, two to three for quality checks. If the requirement still fails after the maximum iterations, it is flagged as one the model cannot fix (e.g., requirements containing tables or incoherent content).
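The bounded refinement loop can be sketched as follows. `check` and `rewrite` stand in for the syntax/quality checks and the LLM rewrite; the iteration cap follows the talk (five for syntax, two to three for quality):

```python
# Sketch of the bounded refinement loop: run the checks, feed the
# findings back as context for a rewrite, and stop after a fixed
# iteration budget. check() and rewrite() are stand-ins.

def refine(requirement: str, check, rewrite, max_iterations: int = 5):
    """Rewrite until the checks pass or the iteration budget is spent."""
    for _ in range(max_iterations):
        findings = check(requirement)
        if not findings:
            return requirement, True  # passed all checks
        # the check results become additional prompt context for the rewrite
        requirement = rewrite(requirement, findings)
    return requirement, False  # flag as not fixable by the model
```

The `False` branch is what surfaces requirements the model cannot fix (e.g., ones containing tables or incoherent content) for manual review instead of looping forever.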

Integration into the Authoring Tool (Polarion)

Requirements engineers do not work in generic web UIs — they use a dedicated requirements management tool called Polarion, which has a frontend built on Apache Velocity (Java and JavaScript). The use cases described above are being integrated directly into Polarion's frontend.

Data architecture considerations include:

  • Mirroring the Polarion database and vectorizing it to avoid overloading the live system
  • Potentially creating an MCP server to access Polarion's data and functions
  • Deciding between daily updates vs. real-time data depending on the workflow
  • Enforcing data security and a need-to-know principle so engineers only access data relevant to their project
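The need-to-know principle from the last point could be enforced as a filter between the mirrored Polarion data and any AI feature. The project names, users, and data layout below are invented for illustration:

```python
# Sketch: need-to-know filtering over a mirrored requirements store.
# Requirements are tagged with a project, and a user's query only ever
# sees data from projects they are assigned to. All names are invented.

MIRRORED_REQUIREMENTS = [
    {"id": "REQ-1", "project": "oven-ng", "text": "..."},
    {"id": "REQ-2", "project": "dishwasher-x", "text": "..."},
]
PROJECT_MEMBERS = {"oven-ng": {"alice"}, "dishwasher-x": {"bob"}}

def visible_requirements(user: str) -> list[dict]:
    """Return only the mirrored requirements the user may access."""
    return [req for req in MIRRORED_REQUIREMENTS
            if user in PROJECT_MEMBERS.get(req["project"], set())]
```

Applying this filter before retrieval (rather than after generation) keeps out-of-scope data from ever entering the model's context.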

The AI Platform Challenge at Scale (Seif Elafi)

From Use Cases to Platform Thinking

While Philip's work covers a handful of use cases in requirements engineering, BSH as a whole faces a much larger challenge: managing 400+ AI use cases across multiple departments — digital platform services, product development, consumer feedback analysis, and more. This creates pressure to build fast while also governing effectively.

Governance and Business Justification

A central challenge is measuring the value of AI use cases. Teams are often excited about LLM-based ideas, but management demands clear revenue or profit impact, which is difficult to quantify for many productivity-enhancing tools. Sometimes competitive pressure alone — the fact that a competitor already has a feature — is sufficient justification.

The Case for Standardization

BSH learned that standardization is critical, even if it means not always using the latest technology. The AI landscape moves extremely fast — ChatGPT was dominant, then Gemini 2.5, then Claude Code — and organizations that integrate new tools too slowly fall behind. But chaotic adoption also causes problems: if 400 teams each build POCs using different stacks (e.g., Lovable, Vercel, or other tools), migrating all of them to production-grade, compliant systems becomes an enormous burden.

RAG as a Core Platform Primitive

Retrieval-Augmented Generation (RAG) is one of the first capabilities added to BSH's AI platform. LLMs are powerful in general, but BSH needs them grounded in company-specific data — product manuals, serial number-specific information, regional compliance data. Approaches like semantic chunking and graph-based knowledge bases are evaluated, but the key platform decision is to standardize on one approach so that each new RAG-based application doesn't require starting from scratch.
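The retrieval step at the heart of such a pipeline can be sketched with a toy bag-of-words similarity. A production platform would use learned embeddings and a vector store; the documents here are invented examples:

```python
import math
from collections import Counter

# Minimal RAG retrieval sketch: score chunks against the query with a
# bag-of-words cosine similarity, then ground the prompt in the top hits.

def vectorize(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    qv = vectorize(query)
    return sorted(chunks, key=lambda c: cosine(qv, vectorize(c)),
                  reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Ground the model in retrieved company data before answering."""
    context = "\n".join(retrieve(query, chunks, k=2))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Standardizing exactly this kind of chunk-score-assemble flow once, platform-wide, is what spares each of the 400+ use cases from rebuilding it.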

Running redundant indexing pipelines — each department spending large sums on document intelligence or self-hosted OCR — would make nearly every use case unprofitable. Shared infrastructure avoids this.

Key Platform Components

  • Observability: Tools like Langfuse are used to trace what LLMs are doing in production. Without this, deploying to production is too risky.
  • Analytics: Power BI dashboards track token usage and costs across teams — important because while individual tokens are cheap, running many parallel agents continuously adds up.
  • LLM Access Management: Rather than distributing individual API keys to thousands of employees, BSH uses a routing layer (such as LiteLLM) that provides a single key and routes requests to the appropriate model. This also provides flexibility when switching providers.
  • Data Ingestion Quality: AI use cases depend heavily on well-prepared data. The concept of "AI-ready data" is emphasized — plugging an agent into an unprepared system like SAP BW will not yield usable results. Data engineers have a significant role in making data fast and structured enough for agents to use.
  • Guard Rails: LLMs must be prevented from saying things that are legally or commercially problematic — for example, commenting on competitor products, or accidentally surfacing private user data like credit card numbers from a database.
  • Ground Truth & Evaluation: Automated metrics based on well-defined ground truth baselines are needed to catch undesired LLM outputs before they reach production and to continuously improve retrieval quality.
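An output guard rail of the kind described above can be sketched as a post-generation filter. The competitor blocklist entry is a placeholder, and a real guard rail would cover far more cases than card numbers and one blocklist:

```python
import re

# Sketch of an output guard rail: block responses that leak
# credit-card-like numbers or mention competitors. The blocklist
# entry is a placeholder name.

CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,19}\b")
COMPETITOR_BLOCKLIST = {"competitorx"}

def luhn_valid(digits: str) -> bool:
    """Luhn checksum, used to cut false positives on long numbers."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def violates_guard_rails(response: str) -> bool:
    """Return True if the response must be blocked before delivery."""
    for match in CARD_PATTERN.finditer(response):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):  # only card-like numbers, not any long number
            return True
    words = set(re.findall(r"\w+", response.lower()))
    return bool(words & COMPETITOR_BLOCKLIST)
```

The Luhn check illustrates why such rails need more than pattern matching: a serial number with 13 digits should pass, while a valid card number must never leave the system.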

Conclusion

The central message from the platform perspective is that a shared, standardized AI platform strategy pays off more than each department building independently. Reusability of components outweighs always chasing the latest tool or framework. The investment in platform governance, observability, and shared infrastructure is what makes AI use cases scalable, compliant, and ultimately profitable at enterprise scale.