Scaling AI Agents With Filesystems and Bash
Stop building agents like interns. Nicholas (Superglue) argues that fewer, general-purpose tools — terminals, CLIs, file systems — dramatically outperform large curated toolsets, backed by examples from AWS, Vercel, and Anthropic.
Presented at Munich Datageeks, March 2026
Abstract
In this talk, Nicholas from Superglue argues that AI agents are being systematically under-empowered by the engineers who build them. Using the analogy of hiring a senior engineer versus micromanaging an intern, he makes the case that the common practice of loading agents with dozens of highly specific tools — each with many optional parameters — bloats the context window and actually degrades agent performance. Instead, he advocates for giving agents fewer, more general-purpose tools (like a terminal, a CLI, or a file system) and trusting them to figure out the rest. To support this argument, he draws on real-world examples from AWS, Vercel, Anthropic, and his own company Superglue, all of which independently arrived at similar conclusions: less context pollution, more agent autonomy, and dramatically better results.
About the Speaker
Nicholas studied computer science at TU Munich, then spent two years in consulting building AI agents across various industries. For the past six months he has been working at Superglue, an AI-native integration platform based in Munich, where an AI agent forms the core of the product — helping users understand their system landscape and build and maintain integrations. He has been building and observing AI agents evolve for three years, from basic chatbots to capable autonomous systems.
The Current State of LLMs
Nicholas opens by polling the audience on how much they trust LLMs, using a scale from "intern" to "staff engineer." The audience skews toward "junior" to "senior engineer," which he considers a fair and even slightly optimistic assessment. His central thesis is that while LLMs have evolved significantly, the way engineers implement and constrain agents has not kept pace with that growth.
The Intern Problem: How Agents Are Being Over-Managed
To illustrate the problem, he walks through a concrete scenario: a company wants its internal chatbot to read a CSV and write it to a database. The natural engineering response is to build a set of specific tools:
- A read_file tool for the CSV
- A get_database_schema tool
- An execute_python_code tool
This seems reasonable at first, but requirements quickly accumulate. The CSV might be semicolon-delimited, use non-standard encoding for international characters, or be too large for the context window. The database may need a table listing tool. A rollback tool gets added for when the wrong file is uploaded. Each new edge case spawns a new tool or optional parameter — and every one of those adds tokens to every single LLM call, regardless of whether they are relevant to the task at hand.
The result: by the time the agent starts working, its context window is already substantially occupied by JSON schemas it may never need for this particular request. Since LLM output quality correlates with effective context management, this pre-pollution of the context degrades the quality of responses across the board.
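To make the cost concrete, here is a minimal sketch in Python of what such a toolset looks like as JSON schemas. The tool names and parameters are hypothetical stand-ins for the scenario above; the point is that every edge-case parameter is serialized into every single request, relevant or not.

```python
import json

# Hypothetical tool schemas of the kind described above. Each optional
# parameter added for an edge case grows the JSON sent with EVERY call.
TOOLS = [
    {
        "name": "read_file",
        "description": "Read a CSV file into memory.",
        "parameters": {
            "path": {"type": "string"},
            "delimiter": {"type": "string", "default": ","},     # semicolon CSVs
            "encoding": {"type": "string", "default": "utf-8"},  # non-standard encodings
            "max_rows": {"type": "integer", "default": 1000},    # oversized files
        },
    },
    {
        "name": "get_database_schema",
        "description": "Return column names and types for a table.",
        "parameters": {"table": {"type": "string"}},
    },
    {
        "name": "list_tables",
        "description": "List all tables in the target database.",
        "parameters": {},
    },
    {
        "name": "rollback_upload",
        "description": "Undo the last upload if the wrong file was loaded.",
        "parameters": {"upload_id": {"type": "string"}},
    },
]

# A crude proxy for token cost: every request pays for the full schema
# text, whether or not the task touches any of these edge cases.
schema_chars = len(json.dumps(TOOLS))
print(f"{len(TOOLS)} tools, ~{schema_chars} characters of schema per call")
```

Character count is only a rough proxy for tokens, but the scaling behavior is the same: each new tool or parameter taxes every call the agent ever makes.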
The Senior Engineer Alternative
The proposed solution is to stop making decisions for the agent and instead give it general-purpose tools that allow it to reason and navigate independently. Using the same CSV-to-database task, he demonstrates what an agent with a simple bash/terminal tool would do on its own:
- Run head -5 data.csv to inspect structure
- Run file data.csv to determine encoding
- Run a psql command to inspect the target table schema
- Write and execute a Python or JavaScript transformation as needed
The agent selects the right approach for the specific situation rather than following a predetermined script. This means one tool instead of six, a much shorter system prompt, and self-correcting behavior when a command fails — because LLMs are already trained on terminal interactions and understand error output natively.
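A minimal sketch of that single general-purpose tool, assuming a Python backend wrapping the shell via `subprocess` (the function name and timeout are illustrative, not from the talk). The key design choice is returning stderr alongside stdout: error output is exactly what lets the model self-correct.

```python
import subprocess

def execute_bash(command: str, timeout: int = 30) -> str:
    """The single general-purpose tool: run a shell command and return
    combined stdout/stderr so the model can read errors and retry."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    # Include stderr: a failed command's error text is what the agent
    # uses to diagnose and issue a corrected follow-up command.
    return result.stdout + result.stderr

# The agent would issue inspection commands like those listed above:
print(execute_bash("printf 'id;name;city\n' | head -1"))
```

In a real deployment this tool would run inside a sandbox with resource limits; the sketch omits that for brevity.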
Industry Examples
AWS and Amazon Q
Amazon Q, the AI assistant embedded in the AWS console, answers questions about AWS resources and configurations by using the AWS CLI as its primary tool. Rather than building hundreds of individual resource-specific tools, it issues CLI commands to fetch exactly the information needed. This is especially elegant at AWS's scale, where the number of resource types and configuration options would make a traditional tool-based approach unmanageable. Additionally, CLIs are self-documenting: even an agent encountering an unfamiliar CLI can call --help on specific subcommands to retrieve only the relevant documentation, filling its context progressively rather than all at once.
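The progressive self-documentation pattern can be sketched in a few lines: the agent fetches help text only for the subcommand it is about to use. The example below uses Python's own `json.tool` module as a stand-in CLI, since the pattern is identical for any CLI that supports --help.

```python
import subprocess
import sys

def cli_help(args: list[str]) -> str:
    """Fetch help text for one specific subcommand, instead of loading
    an entire CLI's documentation into context up front."""
    out = subprocess.run(args + ["--help"], capture_output=True, text=True)
    return out.stdout or out.stderr

# Stand-in for e.g. `aws s3 ls --help`: only this subcommand's docs
# enter the context, and only when the agent decides it needs them.
snippet = cli_help([sys.executable, "-m", "json.tool"])
print(snippet.splitlines()[0])  # the usage line
```

This is the "filling its context progressively" idea in miniature: documentation is pulled on demand per subcommand rather than preloaded wholesale.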
Vercel and d0
Vercel built an internal data analytics agent called d0 that sits in Slack and answers questions by querying an analytical database. Their development process was methodical: they created a test set of questions with expected answers and used it to benchmark the agent. They iteratively added tools — including entity join helpers, a catalog search to prevent table name hallucination, and various guardrails — eventually reaching 17 tools. Despite this, performance plateaued at around 80% accuracy.
In a pivotal decision, they stripped out every tool except a single execute_sql tool, provided the database schema and a Cube DSL file, and re-ran the benchmark. The results were striking:
- 17 tools → 1 tool
- Response time reduced from ~274 seconds to ~70 seconds (nearly 4× faster)
- 37% fewer tokens per query
- Success rate improved to 99%
The explanation: with less context pollution from tool definitions, the LLM could focus on the actual task and handle SQL errors naturally by reading the output and self-correcting.
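A minimal sketch of the single-tool design, using SQLite in place of Vercel's analytical database (the schema and wrapper are illustrative, not Vercel's actual implementation). The important detail is that SQL errors are returned as text rather than raised, so the model reads them and self-corrects.

```python
import sqlite3

def execute_sql(conn: sqlite3.Connection, query: str) -> str:
    """Single general-purpose SQL tool: run the query and return either
    the rows or the raw error text for the model to read and retry on."""
    try:
        rows = conn.execute(query).fetchall()
        return repr(rows)
    except sqlite3.Error as exc:
        return f"SQL error: {exc}"  # fed back to the model, not raised

# Toy analytical table standing in for the real warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.execute("INSERT INTO events VALUES ('a', 3.0), ('b', 7.0)")

print(execute_sql(conn, "SELECT SUM(amount) FROM events"))  # valid query
print(execute_sql(conn, "SELECT * FROM evnts"))             # typo -> readable error
```

Combined with the schema and DSL file placed in the prompt, this is the entire tool surface: one function, with recovery handled by the model rather than by guardrail tools.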
Anthropic and File System Navigation
Anthropic's engineering blog described a customer support agent they helped build. Traditional approaches either prefilled the system prompt with customer data (overfetching irrelevant information) or used a specific get_customer_info tool (risking underfetching, e.g. missing relevant error logs).
Anthropic's solution was to give the agent a file system: each customer had a dedicated folder containing their ticket history, usage logs, profile summary, and error stack traces. The agent navigates this structure using standard commands (ls, cd, cat, grep) and retrieves only what it needs for the specific question being asked. This reduces engineering overhead to a single tool interface while making the agent's responses significantly more precise.
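The per-customer folder idea can be sketched as follows; the folder layout and file names are hypothetical, and a Python function stands in for the agent running grep -r over the folder. The point is that only matching lines reach the context, not the whole customer record.

```python
import tempfile
from pathlib import Path

# Hypothetical per-customer folder of the kind described above.
root = Path(tempfile.mkdtemp())
cust = root / "customer_42"
cust.mkdir()
(cust / "profile.md").write_text("Plan: enterprise\nRegion: eu-central\n")
(cust / "error_logs.txt").write_text("2024-05-01 TimeoutError in sync job\n")
(cust / "tickets.txt").write_text("#881: sync job stuck\n")

def grep(folder: Path, needle: str) -> list[str]:
    """Minimal stand-in for the agent running grep in the folder:
    return only the matching lines, tagged with their file names."""
    hits = []
    for f in sorted(folder.iterdir()):
        for line in f.read_text().splitlines():
            if needle.lower() in line.lower():
                hits.append(f"{f.name}: {line}")
    return hits

# For a question about sync failures, only these lines enter the
# context -- the profile stays on disk, untouched.
print(grep(cust, "sync"))
```

This splits the difference between the two failure modes: nothing irrelevant is prefetched, and nothing relevant is structurally unreachable.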
Superglue's Own Journey
Nicholas describes three iterations of Superglue's own workflow-building agent:
V1 — Context Micromanagement
The agent had two tools (build workflow, save workflow). Superglue's backend constructed what it believed was the perfect prompt: fixed proportions of Salesforce documentation, Snowflake API docs, customer context, workflow-building instructions, and OAuth2 guidance. It worked, but the approach was too rigid and presumptuous about what the agent actually needed.
V2 — Context Gathering Jungle
Recognizing that the agent might know better than the engineers what it needs, they gave it around 13 tools including web search, documentation search, and intent clarification. The agent gathered its own context — but over-compensated. Tool definitions were enormous, the system prompt was fragmented, and the agent had to decide between 13 options on every single turn. Context bloat returned in a different form.
V3 — Skills-Based Architecture (Current)
The team moved most of the detailed system prompt content and tool documentation into approximately 20 Markdown "skills files" stored in a dedicated folder. The system prompt was reduced by around 80%. The agent now starts each task by deciding which skills it needs, reads only those files, and then proceeds with a much cleaner context. Workflow construction was also moved directly into the conversation rather than delegated to a backend process.
Additionally, skills files now handle knowledge that previously required hardcoded system prompt sections — such as pricing information or questions about supported integrations — without that content polluting every single request. The result has been a significant improvement in both workflow complexity and success rate, though exact metrics were not shared.
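The skills-file mechanic can be sketched in a few lines; the skill names below are hypothetical examples, not Superglue's actual files. The system prompt carries only the cheap index of skill names, and full content is read from disk only when a task calls for it.

```python
import tempfile
from pathlib import Path

# Hypothetical skills folder: detailed instructions live on disk as
# Markdown rather than in the system prompt.
skills_dir = Path(tempfile.mkdtemp())
(skills_dir / "oauth2.md").write_text("# OAuth2\nHow to refresh tokens...\n")
(skills_dir / "pricing.md").write_text("# Pricing\nPlan tiers...\n")
(skills_dir / "snowflake.md").write_text("# Snowflake\nConnection setup...\n")

def list_skills(folder: Path) -> list[str]:
    """Cheap first step: the agent sees only the skill names."""
    return sorted(p.stem for p in folder.glob("*.md"))

def load_skill(folder: Path, name: str) -> str:
    """Second step: read only the files the current task needs."""
    return (folder / f"{name}.md").read_text()

print(list_skills(skills_dir))           # tiny index in the prompt
print(load_skill(skills_dir, "oauth2"))  # full detail only on demand
```

A pricing question loads pricing.md and nothing else; an OAuth2 workflow loads oauth2.md. This is how the same knowledge stops taxing every unrelated request.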
A V4 is anticipated, with current exploration focused on organizing third-party API documentation into file trees accessible to the agent — a natural extension of the file-system-navigation philosophy.
Conclusion
Nicholas closes with three takeaways:
- Agents are senior engineers, not interns. LLMs have evolved, and how we deploy them should reflect that.
- Empower agents, don't think for them. Give them general-purpose tools and let them navigate toward the solution rather than pre-packaging every possible answer.
- Build a CLI for your product. As agents become the primary interface layer for many tools, CLIs will be how agents interact with products — they are self-documenting, established, and well-represented in LLM training data. Tools like OpenClaw already demonstrate this direction.