
March 12, 2026

Why Building Enterprise AI Right Is Mostly About the Edge Cases

Author: Bartłomiej Grasza, Principal AI Engineer

Reading time: 11 minutes


Anyone can wire up a large language model and demo it on clean data. I’ve seen it done in an afternoon. The hard part — the part that takes months — is making it work on data that was never designed to be searched, on systems that were never designed to interoperate, inside an organization that was never designed with AI in mind.

This is what I learned building an intelligent AI platform for one of the largest heavy-duty engineering manufacturers in Europe.

Key Takeaways

  • Enterprise AI systems succeed or fail at the data ingestion and parsing layer, not at the model layer.
  • Handling edge cases in real-world documents and systems is the core engineering challenge in production AI.
  • Domain-specific architecture choices, such as structured document extraction and agentic reasoning, significantly improve reliability.
  • The most valuable enterprise AI solutions come from deep understanding of business workflows and data structures, not generic AI capabilities.


It Looks Simple Until It Doesn’t

From a high level, the premise is straightforward: employees should be able to ask natural language questions and get accurate answers from internal documents.

No more digging through folders. No more reading 30-page PDFs to find one sentence.

Simple concept. Brutal execution.

The company’s knowledge base is what you’d expect from decades of industrial operation: thousands of PDFs, spreadsheets, engineering diagrams, HR policies, damage reports — scattered across departments, stored in formats that range from clean structured text to scanned images of hand-annotated engine schematics.

Some documents are three columns wide with embedded diagrams at unusual orientations. Some are pure tables stretching across five pages. Some are a mix of everything at once.

Standard document parsers fail silently on these. They return something that looks like text, but the column order is wrong, the diagram is dropped, the table is truncated. And here’s the thing about silent failures: any misinformation introduced at the parsing layer propagates through every layer above it. If the retrieval system is built on corrupted input, the answers it returns are corrupted too — just fluently worded.

So we started at the foundation.

Parsing: The Work Nobody Talks About

Most teams treat document ingestion as a solved problem. Upload PDF, chunk text, embed, store. We treated it as the most critical engineering surface of the entire project.

Every document type required explicit handling. Multi-column layouts needed to be linearized in reading order, not scan order. Markdown formatting was preserved to maintain heading hierarchies, which in turn informed how we split text into chunks — by logical section, not by character count. Tables needed to stay intact, or at minimum have their truncation points carefully managed.
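
To make the chunking idea concrete, here is a minimal sketch of heading-aware splitting in Python. It is a simplification of the approach described above, not the production parser, and it assumes the document has already been converted to Markdown:

```python
import re
from dataclasses import dataclass


@dataclass
class Section:
    heading_path: list[str]  # e.g. ["2. Maintenance", "2.3 Oil system"]
    text: str


def split_by_headings(markdown: str) -> list[Section]:
    """Split a Markdown document into logical sections keyed by heading hierarchy."""
    sections: list[Section] = []
    heading_stack: list[tuple[int, str]] = []  # (level, title)
    buffer: list[str] = []

    def flush() -> None:
        if buffer:
            path = [title for _, title in heading_stack]
            sections.append(Section(heading_path=path, text="\n".join(buffer).strip()))
            buffer.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match:
            flush()
            level = len(match.group(1))
            # Pop headings at the same or deeper level before pushing the new one,
            # so the path always reflects the current position in the hierarchy.
            while heading_stack and heading_stack[-1][0] >= level:
                heading_stack.pop()
            heading_stack.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return [section for section in sections if section.text]
```

Chunking by logical section like this keeps each heading path attached to its text, which is exactly what the retrieval layer needs to stay anchored to the document's own structure.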

The long tail of edge cases here is genuinely non-trivial. You think you’ve covered everything, and then someone uploads a presentation where every slide is a full-page image, or a legacy report where all text is embedded in a watermarked scan. Every one of these cases had to be catalogued and explicitly addressed — because in production, a parser that works 95% of the time is a parser that quietly corrupts 5% of your knowledge base.

Images as First-Class Data

Most retrieval systems ignore images entirely. At best, they extract any text overlaid on an image and index that. The image itself — its structure, its meaning, its relationship to surrounding context — is invisible to search.

For a company whose core technical documentation is full of engine diagrams, flowcharts, Gantt charts, and annotated schematics, this is a significant problem. If an employee searches for a specific engine fault and the relevant information lives in a diagram rather than a paragraph, a text-only system simply won’t find it.

We built a dedicated image processing module that classifies images into approximately 20 distinct categories before extracting information from them. The category determines the extraction strategy. A flowchart is read directionally — following the logical sequence of nodes, not the spatial top-to-bottom order a scanner would produce. An annotated engine diagram is described in terms of its labeled components and their relationships. A photo of a damaged part is described in terms of what is visible, what is damaged, and in what context.
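
A simplified sketch of that routing logic is shown below. The vision-model calls are left as injectable callables, and the category names and prompts are illustrative rather than the production set of roughly 20:

```python
from typing import Callable

# Roughly 20 categories in the real system; three are shown here for illustration.
EXTRACTION_PROMPTS: dict[str, str] = {
    "flowchart": (
        "Read this flowchart by following the arrows from the start node. "
        "List each step in its logical order, including decision branches."
    ),
    "engine_diagram": (
        "Describe this diagram in terms of its labeled components and how "
        "they are connected to each other."
    ),
    "damage_photo": (
        "Describe what is visible, which part appears damaged, and the "
        "surrounding context."
    ),
}


def extract_image_content(
    image_bytes: bytes,
    classify: Callable[[bytes, list[str]], str],  # vision model: image -> category label
    describe: Callable[[bytes, str], str],        # vision model: image + prompt -> description
) -> dict[str, str]:
    """Classify the image first, then apply the category-specific extraction prompt."""
    category = classify(image_bytes, list(EXTRACTION_PROMPTS))
    prompt = EXTRACTION_PROMPTS.get(category, "Describe this image in detail.")
    return {"category": category, "description": describe(image_bytes, prompt)}
```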

The extracted text is then stored alongside and linked to the surrounding document context — so retrieval works across both text and visual content simultaneously. When a user finds a relevant result, they see the original image alongside the extracted description, not just a block of text that summarizes it.

This took a carefully labeled dataset, a lot of prompt engineering per category, and significant iteration. It’s the kind of work that’s invisible to the end user and essential to the product actually functioning.

The Single-Chunk Strategy for Damage Reports

Standard RAG systems split documents into small chunks — typically a few hundred tokens each. This is a reasonable default, but it breaks down for a specific and important class of documents: structured reports where the meaningful unit of information is the whole document, not a paragraph.

Engine damage reports are a good example. A typical report might be 30 pages. The engine type is mentioned on page one. The root cause of the fault is described on page ten. The affected component list is on page twenty-two. If you chunk this document and a user asks “what damage has been reported for engine type X where the fault involved the pistons,” no single chunk contains the answer. The retrieval system finds fragments. The language model synthesizes from fragments. The result is unreliable.

Our approach for this document class was different: rather than chunking, we parse each report into a single richly structured unit that captures all the key fields — engine type, fault description, root cause, affected components, date, author — in a consistent extractable format. Each report becomes one searchable object.

This means that when a user asks a broad question across thousands of historical reports — something like “have we ever seen oil leakage under the piston in this engine class?” — the retrieval system is matching against complete, context-preserving records, not against isolated fragments that may or may not be near the relevant information.
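
In code, the heart of this approach is just a well-defined record type per report. The fields below mirror the ones listed above; the production schema is richer, and this is a sketch of the shape of the data, not of the extraction pipeline itself:

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DamageReport:
    """One damage report stored as a single searchable unit rather than many chunks."""
    report_id: str
    engine_type: str
    fault_description: str
    root_cause: str
    affected_components: list[str] = field(default_factory=list)
    report_date: date | None = None
    author: str | None = None
    source_path: str = ""  # link back to the original PDF

    def as_search_text(self) -> str:
        """Flatten the record into one context-preserving string for indexing."""
        return (
            f"Engine type: {self.engine_type}\n"
            f"Fault: {self.fault_description}\n"
            f"Root cause: {self.root_cause}\n"
            f"Affected components: {', '.join(self.affected_components)}\n"
            f"Date: {self.report_date}  Author: {self.author}"
        )
```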

The result is that employees can now search across years of damage history in seconds, with genuinely useful results. That’s the business value. The architecture decision that enables it is the single-chunk strategy.

Agentic RAG: When Simple Retrieval Isn’t Enough

The basic RAG pattern is well understood: take a user query, retrieve relevant chunks, pass them to a language model, return an answer. It works for simple lookups. It falls apart when queries are complex.

Users don’t always ask simple questions. They ask things like: “Compare the fault rates for engine models A, B, and C over the last five years.” Or: “Show me all documents where the root cause involved a component from supplier X.” Or they ask questions that contain domain acronyms that the retrieval system has never seen defined.

For queries like these, a single retrieval pass followed by a single generation step isn’t sufficient. You need a system that can decompose the query, retrieve multiple times, check whether it has enough information, and synthesize across sources — without ever falling back on the model’s own pre-trained knowledge, which may be stale, averaged, or simply wrong in this specific context.

That’s what we built with the agentic reasoning module. The system works roughly as follows: when a query arrives, an orchestrating agent first decomposes it into a structured plan of sub-queries. Each sub-query is routed to the appropriate tool — document retrieval, document summarization, full-document extraction, acronym expansion, or mathematical calculation. Results are collected. The agent evaluates whether the information gathered is sufficient to answer the original question. If not, it retrieves again. Only when the agent determines it has verified, retrieved content covering the full query does it pass everything to the language model for final synthesis.
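
Stripped of the prompt engineering, the orchestration loop can be sketched roughly as follows. The decomposition, sufficiency check, and synthesis steps stand in for LLM calls, the tool registry is illustrative, and the real system refines its plan between rounds rather than re-running the same one:

```python
from typing import Callable


def answer_query(
    question: str,
    tools: dict[str, Callable[[str], str]],             # e.g. {"retrieve": ..., "calculate": ...}
    decompose: Callable[[str], list[tuple[str, str]]],  # question -> [(tool_name, sub_query), ...]
    is_sufficient: Callable[[str, list[str]], bool],    # do we have enough evidence yet?
    synthesize: Callable[[str, list[str]], str],        # answer from retrieved evidence only
    max_rounds: int = 3,
) -> str:
    """Decompose, retrieve via tools, check sufficiency, and only then synthesize."""
    evidence: list[str] = []
    for _ in range(max_rounds):
        for tool_name, sub_query in decompose(question):
            # Unknown tool names fall back to plain document retrieval.
            tool = tools.get(tool_name, tools["retrieve"])
            evidence.append(tool(sub_query))
        if is_sufficient(question, evidence):
            # Final synthesis is restricted to retrieved evidence, never parametric knowledge.
            return synthesize(question, evidence)
    return "Not enough information in the knowledge base to answer this question."
```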

The model’s own parametric knowledge is treated as inadmissible. If the answer isn’t in the retrieved content, the system says so. This is a hard constraint, not a preference — in an industrial context where a wrong answer about a damage report could have real operational consequences, hallucination isn’t a product quality issue. It’s a safety issue.

One concrete example: if a user asks “what was the population of [city] in 1947, and what is the difference compared to 2025?” — the agent won’t answer from training data. It will issue two separate retrieval queries, collect the figures from source documents, and only then invoke the calculation tool to compute the difference. The model doesn’t guess. It reasons from evidence.

The acronym expansion tool is another practical example. Engineers write in domain shorthand. A query containing an acronym that doesn’t appear in the knowledge base will return nothing useful. Rather than fail silently, the agent detects unrecognized acronyms before searching, attempts to resolve them, and if it can’t, returns to the user and asks for clarification before proceeding. A small thing that makes a significant difference in day-to-day usability.
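
The detection step itself is mechanically simple. A minimal sketch, assuming a maintained glossary of known acronyms, might look like this:

```python
import re


def find_unresolved_acronyms(query: str, glossary: dict[str, str]) -> list[str]:
    """Return uppercase shorthand tokens in the query that are not in the known glossary."""
    candidates = re.findall(r"\b[A-Z]{2,6}\b", query)
    return [token for token in candidates if token not in glossary]


def expand_query(query: str, glossary: dict[str, str]) -> str | None:
    """Expand known acronyms in place; return None if the query contains unknown ones,
    signalling the agent to ask the user for clarification before searching."""
    if find_unresolved_acronyms(query, glossary):
        return None
    for acronym, expansion in glossary.items():
        query = re.sub(rf"\b{acronym}\b", f"{acronym} ({expansion})", query)
    return query
```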

Excel and Cross-Document Reasoning

Structured data in spreadsheets presents different challenges than unstructured text in PDFs — and combining the two is harder still. General-purpose tools often struggle with multi-sheet Excel files where data is distributed across tabs, with varying schemas, requiring joins and aggregations before anything meaningful can be said.

We built a dedicated module for this. It operates as a sub-agent within the broader agentic framework — the orchestrating agent delegates to it when it detects that structured data is involved. The Excel module handles multi-sheet analysis, cross-file comparison, and aggregation, and returns a synthesized result to the orchestrating agent, which can then combine it with information pulled from unstructured documents.
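
As a rough sketch of the first thing that Excel sub-agent does, the snippet below loads every sheet of a workbook and produces a compact schema summary for the orchestrating agent to reason over. It uses pandas and is illustrative rather than a description of the production module:

```python
import pandas as pd


def load_workbook(path: str) -> dict[str, pd.DataFrame]:
    """Load every sheet of an Excel workbook into its own DataFrame."""
    return pd.read_excel(path, sheet_name=None)


def describe_schema(sheets: dict[str, pd.DataFrame]) -> str:
    """Build a compact schema summary the orchestrating agent can reason over
    before deciding which joins or aggregations to request."""
    lines = []
    for name, df in sheets.items():
        columns = ", ".join(f"{col} ({df[col].dtype})" for col in df.columns)
        lines.append(f"Sheet '{name}': {len(df)} rows; columns: {columns}")
    return "\n".join(lines)
```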

A user can, for example, select a PDF containing engineering specifications and an Excel file containing supplier performance data, and ask a question that requires reading from both. The agent figures out what to get from where. The user just asks the question.

The Bigger Lesson: Solve Real Problems, Not General Ones

Looking back at the architecture decisions that made the most difference, the pattern is consistent: the value came from going narrow and deep on specific business problems, not from building a general-purpose document chat interface.

The damage report module is the clearest illustration of this. Most enterprise AI platforms would handle those reports the same way they handle everything else — chunk, embed, retrieve, generate. It works adequately. The single-chunk structured extraction approach we built works dramatically better, because it was designed around the specific shape of the problem: thousands of long-form reports, each a self-contained record, queried by users who need cross-document pattern recognition, not document-level summarization.

The same logic applies to image processing, to acronym expansion, to Excel integration. Each of these is solving a specific problem that the general-purpose approach handles poorly or not at all. Each required time to understand the business use case before writing a line of code.

This is, I think, the thing that separates enterprise AI work that creates genuine value from enterprise AI work that produces a demo. The general capability — retrieval-augmented generation, agent orchestration, multimodal processing — is accessible to everyone. What isn’t accessible is the understanding of how a specific business stores its knowledge, where that knowledge gets lost, and what it would mean to make it genuinely findable.

The Real Work

The hardest part of this project wasn’t the AI. It was understanding the business well enough to know what to build.

It meant spending time with real documents — not clean examples but the actual PDFs that broke every assumption our parser was built on. It meant cataloguing edge cases rather than papering over them. It meant pushing back on feature requests and lobbying for the use cases that would create real value, not just impressive demos.

The modular architecture we ended up with wasn’t a technical preference — it was a strategic one. We knew that what the client needed on day one would look different from what they’d need in year two. Building each capability as an independently deployable service meant we could extend the platform as new departments started using it and new use cases emerged, without rearchitecting from scratch.

That’s the job. Not writing AI code. Understanding how a company thinks, where its knowledge lives, and building something that survives contact with that reality.


FAQ


Why are edge cases so important when building enterprise AI systems?


Enterprise data rarely follows clean or consistent structures. Documents, spreadsheets, diagrams, and legacy formats introduce many irregularities. If these edge cases are not handled correctly, they can corrupt the retrieval pipeline and lead to unreliable outputs.


What is retrieval-augmented generation (RAG)?


Retrieval-augmented generation is an AI architecture where a system retrieves relevant information from a knowledge base and provides it to a language model to generate answers grounded in that data rather than relying only on the model’s training.


Why can standard document chunking fail for some enterprise documents?


In long structured documents such as technical reports, the relevant information may be distributed across many sections. Splitting these documents into small chunks can break the logical context, making it harder for retrieval systems to return accurate results.


What is agentic RAG and how does it improve reliability?


Agentic RAG introduces an orchestration layer that can break complex questions into smaller tasks, retrieve information multiple times, and use specialized tools such as calculations or acronym resolution before generating the final answer.


Why is multimodal processing important in enterprise knowledge systems?


Many industries rely heavily on visual information such as diagrams, charts, or technical schematics. Multimodal processing allows AI systems to analyze both text and images, ensuring that valuable information stored in visual formats is not lost during retrieval.


How can AI systems work with both documents and spreadsheets?


Dedicated modules or agents can analyze structured data in spreadsheets while other components process unstructured documents. An orchestrating agent can then combine results from both sources to answer complex cross-document queries.




Categories: Data Engineering, Artificial Intelligence