Talk "Scaling from POC to Production"
Scaling from POC to Production by Nicolas Neudeck was presented at Munich Datageeks - February Edition 2025
Abstract
Turning an AI-powered proof of concept into a scalable, production-ready product is significantly more complex than simply rolling out the existing solution. POCs often lack robustness, scalability, and long-term maintainability. This talk explores the transformation process through a real-world example, covering key aspects such as cloud architecture, prompt engineering, and optimizing API performance. Critical challenges will be examined alongside technical solutions to ensure a seamless transition from prototype to production.
About the speaker
Nicolas Neudeck is an AI Engineer at BCG X, with previous experience at Amazon and E.ON. He holds a degree in Computer Science from the Technical University of Munich, specializing in Data Engineering and Machine Learning. Passionate about startups and innovation, Nicolas is focused on building scalable products that impact a wide audience.
Transcript summary
Overview
Nicolas, an AI engineer at BCG X (the data science department of Boston Consulting Group), shares his experience transitioning a generative AI proof of concept into a production-ready, scalable system. He emphasizes that building a PoC represents only about 20% of the actual work, with the real challenges emerging during the scalability and production phases.
The Project Context
The client was a meal kit delivery company similar to HelloFresh, operating a support hotline for customer inquiries about recipes, missing ingredients, and other issues. The goal was to build a copilot tool that would increase the number of calls each agent could handle per day by helping them process calls more efficiently.
The Solution
The team developed a web application displayed on agents' second monitors, providing both static and real-time information during calls:
- Static data: customer identity, subscription package details
- Real-time data: live call transcription, call intent identification, and suggested questions for agents to advance conversations smoothly
The focus of this talk was the question checklist feature, which automatically tracks which questions the agent has already asked, based on the live conversation transcript.
Demo Workflow
When a customer calls about a missing ingredient, the system:
- Identifies the call intent from the conversation
- Retrieves the appropriate question checklist from the company database
- Automatically checks off questions as agents ask them during the call (a rough sketch of this step follows the list)
- Helps agents stay on track and ensures all necessary information is gathered
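A rough sketch of the check-off step might look like the following; the function, prompt, and model ID are illustrative assumptions rather than the team's actual implementation, and an LLM reachable via the Bedrock converse API is assumed:

```python
# Illustrative only: ask an LLM which checklist questions have already been asked.
# The prompt, model ID, and JSON contract are assumptions, not the talk's actual code.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")  # assumes Bedrock access is configured

def checked_off(transcript: str, checklist: list[str]) -> list[bool]:
    prompt = (
        "Given the call transcript and the checklist below, return a bare JSON array "
        "of booleans, one per checklist item, true if the agent has already asked it.\n\n"
        f"Transcript:\n{transcript}\n\nChecklist:\n"
        + "\n".join(f"{i + 1}. {q}" for i, q in enumerate(checklist))
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model choice
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    # Assumes the model answers with a bare JSON array such as [true, false, false].
    return json.loads(response["output"]["message"]["content"][0]["text"])
```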
Initial PoC Development
The initial proof of concept was built in four weeks by a team of roughly 4.5 full-time equivalents: two engineers, two data scientists, and a front-end engineer for two of the four weeks. The rapid development cycle included agent interviews, call shadowing, data gathering, cloud architecture design, data pipeline construction, and front-end development. The PoC was tested with 20 agents in an A/B testing scenario.
While the fast development allowed for early results and quick feedback, many corners were cut to meet the four-week deadline, which later impacted scalability significantly.
Technical Architecture
The system architecture consisted of:
- Telephony platform connecting customers and agents
- Audio stream consumer receiving audio data
- Transcription engine converting audio to text
- Process transcript pipeline making LLM calls to analyze conversations
- Database storing transcripts and LLM responses
- Front-end API serving data to agents
The entire system was built on AWS using primarily serverless components.
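The talk does not go into the consumer's code, but one plausible shape for the audio path, assuming a Kinesis-triggered Lambda (the per-call partitioning and the downstream hand-off are hypothetical placeholders), is:

```python
# Hypothetical sketch of the audio stream consumer as a Kinesis-triggered Lambda.
import base64

def handler(event, context):
    for record in event["Records"]:
        call_id = record["kinesis"]["partitionKey"]            # assumes one partition key per call
        audio_bytes = base64.b64decode(record["kinesis"]["data"])
        forward_to_transcription(call_id, audio_bytes)          # hypothetical hand-off

def forward_to_transcription(call_id: str, audio_bytes: bytes) -> None:
    ...  # e.g. buffer the audio per call and pass it to the transcription engine
```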
Initial Latency Challenges
The initial data flow from when an agent finished asking a question to when that question appeared as checked off in the front-end took between 15 and 25 seconds:
- Telephony platform delay: 1 second
- Audio buffering: 10 seconds
- Transcription processing: 4 seconds
- LLM calls: 8 seconds
- Database operations: 0.5 seconds
- Front-end polling: 1-5 seconds
This latency made the tool less useful, as agents would finish asking questions and still see them as unchecked on their monitors.
Client Requirements for Production
After the successful initial test, the client had two hard requirements:
- Reduce latency to under 7 seconds for the entire data pipeline
- Scale the system to 200 call center agents (from the initial 20)
Meeting these requirements took more than twice as long as building the entire PoC, a substantial amount of unplanned work spent largely on resolving technical debt.
Latency Reduction: Transcription
The most significant improvement came from completely refactoring the transcription approach. The team shifted from file-based transcription to AWS Transcribe streaming:
- The original approach buffered audio for 10 seconds to improve transcription quality and avoid cutting words mid-sentence
- The new streaming approach opened a transcribe stream when calls began, sending audio bytes directly and asynchronously
- The stream returned individual words with about half a second delay
- Words were buffered until complete sentences formed
- Since measurement started at the end of agent questions, there was minimal buffering delay
This change alone eliminated most of the original delay but required over three weeks to implement. The complexity involved moving components from stateless services to isolated Lambda functions and implementing multi-threading in Lambda, which proved challenging. The speaker notes this complexity made it impractical for the PoC phase.
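A condensed sketch of the streaming pattern, following the amazon-transcribe Python SDK's documented usage (the region, sample rate, and the hand-off to the LLM pipeline are assumptions):

```python
# Streaming transcription sketch based on the amazon-transcribe SDK's documented pattern.
import asyncio
from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
from amazon_transcribe.model import TranscriptEvent

class SentenceHandler(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, transcript_event: TranscriptEvent):
        for result in transcript_event.transcript.results:
            if not result.is_partial:                           # a completed segment, not a partial update
                await process_sentence(result.alternatives[0].transcript)

async def process_sentence(sentence: str) -> None:
    ...  # hypothetical hand-off to the process-transcript pipeline

async def transcribe_call(audio_chunks):
    client = TranscribeStreamingClient(region="eu-central-1")   # region is an assumption
    stream = await client.start_stream_transcription(
        language_code="en-US", media_sample_rate_hz=8000, media_encoding="pcm",
    )

    async def send_audio():
        async for chunk in audio_chunks:                        # e.g. bytes taken off the Kinesis stream
            await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    handler = SentenceHandler(stream.output_stream)
    await asyncio.gather(send_audio(), handler.handle_events())
```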
Latency Reduction: LLM Optimization
The process transcript pipeline was reduced from 8 seconds to 4 seconds through multiple optimizations:
Structural improvements:
- Eliminating unnecessarily long LLM calls
- Parallelizing independent LLM calls (a sketch of this and the model routing follows this list)
- Classifying calls as simple (yes/no answers) or complex (summarization) and assigning appropriately sized models
- Replacing some LLM calls entirely with standard NLP packages like spaCy
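A minimal sketch of the parallelization and model routing, assuming the Bedrock converse API; the model IDs, task split, and prompts are illustrative:

```python
# Illustrative parallelization of independent LLM calls plus simple/complex model routing.
import asyncio
import boto3

bedrock = boto3.client("bedrock-runtime")
SMALL_MODEL = "anthropic.claude-3-haiku-20240307-v1:0"   # placeholder: fast, cheap model for yes/no tasks
LARGE_MODEL = "anthropic.claude-3-sonnet-20240229-v1:0"  # placeholder: larger model for summarization

def invoke(model_id: str, prompt: str) -> str:
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

async def analyze(transcript: str) -> dict:
    # boto3 is synchronous, so each independent call runs in a worker thread and the
    # calls are awaited together instead of sequentially.
    checklist, intent, summary = await asyncio.gather(
        asyncio.to_thread(invoke, SMALL_MODEL, f"Which checklist questions were asked?\n{transcript}"),
        asyncio.to_thread(invoke, SMALL_MODEL, f"What is the caller's intent?\n{transcript}"),
        asyncio.to_thread(invoke, LARGE_MODEL, f"Summarize the call so far.\n{transcript}"),
    )
    return {"checklist": checklist, "intent": intent, "summary": summary}
```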
Prompt optimization:
- Reducing input token size by shortening speaker labels ("agent" and "customer" became "a" and "c")
- Sending only relevant transcript portions (customer speech only for name identification, the last half of the call only for satisfaction checks); a sketch of these trims follows the list
- Achieving 5% token reduction through careful prompt engineering
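Illustrative helpers for the trims described above; the label mapping and the "last half" heuristic follow the talk's description, but the function names and the transcript format are assumptions:

```python
# Hypothetical token-trimming helpers; assumes a "speaker: text" line format.
def compress_labels(transcript: str) -> str:
    """Shorten speaker labels to save input tokens ("agent:" -> "a:", "customer:" -> "c:")."""
    return transcript.replace("agent:", "a:").replace("customer:", "c:")

def customer_lines_only(transcript: str) -> str:
    """For tasks such as name identification, send only what the customer said."""
    return "\n".join(
        line for line in transcript.splitlines() if line.startswith(("customer:", "c:"))
    )

def last_half(transcript: str) -> str:
    """For satisfaction checks, only the most recent part of the call matters."""
    lines = transcript.splitlines()
    return "\n".join(lines[len(lines) // 2:])
```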
Latency Reduction: Database
While database calls only took 0.5 seconds, optimization was crucial for scaling to 200 users with large tables containing hundreds of thousands of entries.
Key findings using AWS RDS Aurora PostgreSQL dashboard:
- The slowest queries came from a daily customer behavior analysis notebook that selected all data, then filtered in Pandas
- Missing indexes on commonly queried fields (email, partial phone numbers)
- Execution plan analysis revealed optimization opportunities
The improvements didn't significantly reduce the 0.5 second delay but proved critical for later scalability.
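The fixes can be pictured roughly as follows; the table and column names and the connection details are assumptions:

```python
# Sketch of the database fixes: add indexes on commonly queried fields and push the
# filtering into SQL instead of selecting everything and filtering in Pandas.
import psycopg2

conn = psycopg2.connect(host="...", dbname="...", user="...", password="...")  # placeholders
with conn, conn.cursor() as cur:
    cur.execute("CREATE INDEX IF NOT EXISTS idx_customers_email ON customers (email);")
    cur.execute("CREATE INDEX IF NOT EXISTS idx_customers_phone ON customers (phone_suffix);")

    # Verify via the execution plan that the index is actually used.
    cur.execute("EXPLAIN ANALYZE SELECT * FROM customers WHERE email = %s;", ("jane@example.com",))
    print("\n".join(row[0] for row in cur.fetchall()))
```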
Latency Reduction: Front-End
The team replaced the polling mechanism (5-second intervals checking the API) with WebSockets, which was surprisingly easy to implement using FastAPI. This reduced delay to an estimated 0.5 seconds.
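A minimal sketch of the WebSocket push with FastAPI; the endpoint path, payload shape, and event source are assumptions:

```python
# Minimal FastAPI WebSocket sketch replacing client-side polling with server push.
import asyncio
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/calls/{call_id}")
async def call_updates(websocket: WebSocket, call_id: str):
    await websocket.accept()
    while True:
        update = await next_update_for(call_id)   # hypothetical event source fed by the pipeline
        await websocket.send_json(update)

async def next_update_for(call_id: str) -> dict:
    await asyncio.sleep(0.1)                      # placeholder for e.g. reading from a queue
    return {"call_id": call_id, "checked_questions": []}
```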
The telephony platform's 1-second delay remained unchanged as the client was committed to that specific platform.
Final result: The team achieved a total latency of 6.3 seconds, meeting the under-7-seconds requirement.
Scaling Challenges: The Numbers
Scaling from 20 to 200 users is a 900% increase. With roughly 75% of agents on a call at any given time, around the clock, the system needed to handle:
- 150 simultaneous transcribe streams and Kinesis data streams
- 900 pipeline invocations per minute (6 sentences per minute per agent)
- 9,000 database transactions per minute
- 3,000+ LLM requests per minute
The speaker emphasizes that these 3,000 LLM requests per minute would be sufficient to handle over 100,000 users in a typical chatbot scenario, highlighting the intensive nature of real-time call center applications.
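The figures can be reproduced with a quick back-of-envelope calculation; the per-sentence multipliers for LLM calls and database transactions are inferred from the stated totals, not given in the talk:

```python
# Back-of-envelope load estimate; the multipliers marked as assumptions are inferred.
agents = 200
share_on_call = 0.75
sentences_per_agent_per_min = 6

concurrent_streams = int(agents * share_on_call)                          # 150 Transcribe/Kinesis streams
pipeline_invocations = concurrent_streams * sentences_per_agent_per_min   # 900 per minute
llm_requests = round(pipeline_invocations * 3.5)                          # ~3,150/min if each sentence triggers 3-4 LLM calls (assumption)
db_transactions = pipeline_invocations * 10                               # 9,000/min if each invocation touches the DB ~10 times (assumption)

print(concurrent_streams, pipeline_invocations, llm_requests, db_transactions)
```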
Scaling Challenges: Initial Failures
When the system was first tested with 200 users, it effectively collapsed: error alerts flooded in and agents were left staring at frozen interfaces.
AWS Quota Limitations: The most obvious issue was hitting AWS service quotas (25 concurrent transcribe streams, 75 Kinesis streams by default). While simple to identify, resolving this through AWS support became a significant blocker for continued development and testing.
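Quota usage can at least be inspected, and increase requests filed, programmatically via the Service Quotas API, as in this sketch (the service code is an example and the quota code must be looked up first):

```python
# Sketch: inspect service quotas and request an increase with boto3's Service Quotas API.
import boto3

quotas = boto3.client("service-quotas")

# List the quotas for a service (e.g. Amazon Transcribe) to find the relevant quota code.
for quota in quotas.list_service_quotas(ServiceCode="transcribe")["Quotas"]:
    print(quota["QuotaCode"], quota["QuotaName"], quota["Value"])

# Then request an increase; the quota code below is a placeholder, not a real value.
# quotas.request_service_quota_increase(
#     ServiceCode="transcribe", QuotaCode="L-XXXXXXXX", DesiredValue=200,
# )
```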
Scaling Challenges: ECS and FastAPI Workers
The process transcript pipeline and API, both hosted in an ECS container running a FastAPI server, experienced bottlenecks despite an auto-scaling configuration set to add capacity at 60% CPU load.
Root cause: The team ran FastAPI under Gunicorn with four workers. When three workers sat at 100% CPU while one remained idle, ECS could not detect the overload reliably: two scaling mechanisms, Gunicorn's in-container worker pool and ECS auto-scaling across containers, were clashing with each other.
Solution: Switching back to standard Uvicorn with a single worker allowed ECS to properly detect CPU load and scale accordingly.
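In practice that amounts to running one Uvicorn worker per container and letting ECS add tasks; a minimal sketch, with the module path as an assumption:

```python
# One Uvicorn worker per container, so the container's CPU metric reflects real load
# and ECS task auto-scaling is the only scaling mechanism.
# Before (simplified): gunicorn main:app -k uvicorn.workers.UvicornWorker --workers 4
import uvicorn

if __name__ == "__main__":
    uvicorn.run("main:app", host="0.0.0.0", port=8000, workers=1)  # "main:app" is an assumed module path
```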
Scaling Challenges: LLM Throttling
Despite assurances from the AWS Bedrock team that 3,000 LLM requests per minute could easily be handled, the system experienced soft throttling, with invoke latencies reaching 20 seconds. The issue required significant escalation involving senior personnel and was ultimately resolved only after additional compute capacity was physically installed in the data center. Resolution took longer than the original project timeline.
Scaling Challenges: Database Bottleneck
The database experienced high CPU loads and request timeouts despite optimization efforts.
Initial attempts:
- Scaling from two to eight cores (helped somewhat but not enough)
- Expanding from one read/one write instance to three read/one write instances with proper endpoint separation (also helped but insufficient)
Root cause discovery: The RDS dashboard revealed the database was handling not just 9,000 requests per minute but also 9,000 authentication attempts per minute. The stateless API opened and closed database connections for each request.
Solution: Adding a database proxy that maintained one stable open connection eliminated authentication overhead, as the system authenticated against the proxy instead of the database directly.
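The effect of the change can be sketched as follows, assuming an RDS Proxy endpoint and a pooled SQLAlchemy engine; the endpoint, credentials, and table names are placeholders:

```python
# Sketch: connect through the proxy endpoint with a long-lived pooled engine instead of
# opening (and authenticating) a fresh database connection on every request.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://app:PASSWORD@my-proxy.proxy-xxxxxxxx.eu-central-1.rds.amazonaws.com/calls",
    pool_size=5, pool_pre_ping=True,   # pool settings are illustrative
)

def get_transcript(call_id: str):
    # Borrows a pooled connection; authentication happens against the proxy, not per request.
    with engine.connect() as conn:
        result = conn.execute(text("SELECT * FROM transcripts WHERE call_id = :cid"), {"cid": call_id})
        return result.fetchall()
```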
Lessons and Conclusions
After all optimizations and fixes, the system successfully went live with 200 agents and was still running in production as of the talk.
Key takeaways:
- The refinement process took two months without adding any new features or improving output quality
- This work is often underappreciated by stakeholders focused on immediately visible outcomes
- Technical debt from rapid PoC development requires significant time to resolve
- Scaling challenges are difficult to anticipate during initial development
- The journey from PoC to market-ready product involves substantial engineering effort that project managers and clients need to understand and plan for
- Long-term product success depends on recognizing the importance of this production-readiness phase, not just for technical challenges but for overall product viability