Martin Fowler - Exploring Generative AI

Building Reliable Agentic AI Systems

This paper presents the Preclinical Information Center (PRINCE), a cloud-hosted platform developed by Bayer AG with Thoughtworks to address pharmaceutical industry challenges in drug development. PRINCE leverages Agentic Retrieval-Augmented Generation (RAG) and Text-to-SQL to integrate decades of safety study reports. The system evolved from keyword-based search to an intelligent research assistant capable of answering complex questions and drafting regulatory documents. Key engineering decisions include context engineering—shaping and routing information between specialized agents—and harness engineering—orchestration, recovery, and observability around the models to maintain control and reliability. The system prioritizes trust through transparency, explainability, and human-in-the-loop integration, significantly improving data accessibility and research efficiency while ensuring governance and compliance.


PRINCE evolved through Search, Ask, and Do phases using Agentic RAG and Text-to-SQL.
Multi-agent architecture: Clarify Intent, Think & Plan, Researcher, Reflection, and Writer agents.
Context engineering defines what information each agent receives; harness engineering ensures reliability.
Hybrid retrieval: RAG for unstructured PDF reports and Text-to-SQL for structured metadata.
Transparency, explainability, and human-in-the-loop integration build trust.
Built with LangGraph, FastAPI, OpenSearch, Athena, PostgreSQL, and internal GenAI platforms.


agentic AI / RAG / pharmaceutical / drug development / LangGraph / multi-agent / Text-to-SQL / context engineering / harness engineering / Bayer / PRINCE / production AI

Full Article

Building Reliable Agentic AI Systems

A case study in building production-ready agentic AI systems

This paper presents the Preclinical Information Center (PRINCE), a cloud-hosted platform

developed by Bayer AG with Thoughtworks to address pharmaceutical industry challenges in drug

development. PRINCE leverages Agentic Retrieval-Augmented Generation

and Text-to-SQL to integrate decades of safety study reports. We describe PRINCE's evolution

from keyword-based search to an intelligent research assistant capable of answering complex

questions and drafting regulatory documents. We reflect on key engineering decisions through

the lens of context engineering—how information was shaped and routed between specialized

agents—and harness engineering—how orchestration, recovery, and observability were built

around the models to maintain control and reliability. The system prioritizes trust through

transparency, explainability, and human-in-the-loop integration. PRINCE demonstrates AI's

transformative potential in pharmaceuticals, significantly improving data accessibility and

research efficiency while ensuring governance and compliance.

16 June 2026

Sarang Sanjay Kulkarni

Sarang Kulkarni is a Principal Consultant at Thoughtworks, working at the intersection of

software engineering, data platforms, and applied AI. He focuses on building

production-grade GenAI systems, particularly Retrieval-Augmented Generation (RAG) and

multi-agent workflows, and helps teams take these systems from early ideas to real-world

use. Sarang also contributes to Thoughtworks’ Global AI Service Development team and teaches

an O’Reilly

course on building production-ready RAG applications.

Contents

The Challenge: Navigating the Preclinical Data Maze
The Solution: PRINCE - An Evolutionary Platform
System Architecture: Engineering a Reliable Agentic RAG System
The Agentic RAG System
Clarify User Intent
Think & Plan: Process Reflection
The Researcher Agent
The Reflection Agent: Data Validation and Sufficiency
The Writer Agent: Answer Synthesis and Formatting
Building Trust in a Production LLM System
Transparency and Explainability
Evaluation
Monitoring
Engineering for Resilience: Error Handling and Recovery
Enhancing Data Quality: Named Entity Recognition and Annotation
The Journey Continues: Iterative Development
Conclusion

Preclinical drug discovery is inherently complex and data-intensive.

Researchers face the significant challenge of efficiently accessing and

analyzing vast volumes of information generated during this critical phase.

Traditional keyword-based search methods, often reliant on rigid Boolean

logic, frequently fall short when confronted with the nuanced and intricate

nature of preclinical research questions.

The advent of Large Language Models (LLMs) has presented a transformative opportunity. By

combining the generative power of LLMs with the precision of information retrieval systems, Retrieval-Augmented Generation (RAG) has emerged as a promising technique.

This approach holds the potential to revolutionize preclinical data access, enabling

researchers to pose complex questions in natural language and receive accurate, context-rich

answers grounded in proprietary data.

Recognizing this potential early, Bayer committed to exploring how these technologies could address longstanding challenges in preclinical research.

In this post, we share that journey: how Bayer's early investment in generative AI

has resulted in PRINCE, an agentic AI system built on Agentic RAG. This case study

explores the technical architecture, engineering decisions, and lessons

learned in transforming preclinical data retrieval from a challenging maze

into an intuitive conversational experience.

Many of the engineering decisions behind PRINCE can now be understood through the lens of context

engineering and harness engineering, although when the system was first designed we did not use these terms. Context engineering shaped what information each model

received, what it did not receive, and how context moved between specialized steps such as

research, reflection, and writing. Harness engineering shaped the scaffolding around the

models: orchestration, tool boundaries, state persistence, retries, fallbacks, validation,

reflection loops, observability, and human review.

While this post focuses on the technical architecture and engineering challenges, our paper

published in Frontiers in Artificial Intelligence covers the

product evolution and business impact in more detail.

The Challenge: Navigating the Preclinical Data Maze

The preclinical research landscape at Bayer, like many large

pharmaceutical organizations, is characterized by a diverse and extensive

array of data. This includes highly structured datasets from various studies, alongside vast

amounts of unstructured

information embedded within text documents such as study reports,

publications, and regulatory submissions. Researchers frequently

encountered significant hurdles in accessing and analyzing this

information effectively:

Data Silos: information was fragmented and scattered across numerous disparate systems and repositories, making it exceedingly difficult to gain a comprehensive, holistic view of preclinical data related to a specific compound or study.

Limited Search Capabilities: traditional keyword-based search engines struggled with the complexity and variability of preclinical terminology and research questions, often yielding irrelevant, incomplete, or overwhelming results.

Time-Consuming Manual Analysis: extracting specific insights or compiling information across multiple documents required considerable manual effort, diverting valuable researcher time away from core scientific activities.

These inherent challenges highlighted a clear need for a more

efficient, intelligent, and integrated approach to preclinical data

retrieval and analysis.

The Solution: PRINCE - An Evolutionary Platform

To address these challenges, Bayer developed the Preclinical

Information Center (PRINCE) platform. PRINCE was conceived as a unified

gateway to preclinical data, initially focusing on consolidating

previously siloed structured study metadata and exposing it in a searchable form.

This initial phase allowed users to apply advanced filters and retrieve

information primarily from structured study metadata.

However, a significant portion of Bayer's valuable preclinical

knowledge resides within unstructured PDF study reports accumulated over

decades. Due to numerous system migrations over the years, the structured

metadata associated with these reports could be incomplete, missing, or

even contain incorrect annotations. Crucially, the authoritative “gold

standard” information was consistently present within the approved PDF

study reports.

The emergence of Generative AI, particularly RAG, provided the key to

unlocking this wealth of unstructured data. By integrating RAG

capabilities, PRINCE began to shift the paradigm from a filter-based

'search' tool to a natural language 'ask' system, enabling researchers to

query the content of these study reports directly.

This evolution reflects PRINCE's progression through three distinct phases:

Search: the initial phase focused on creating a unified gateway to thousands of nonclinical study reports, consolidating multiple in-house data silos from various preclinical domains into a searchable format, primarily leveraging structured metadata.

Ask: this phase introduced an AI-powered question-answering system utilizing Retrieval-Augmented Generation (RAG). This enabled researchers to derive insights directly from unstructured data, including scanned PDFs from historical reports, by posing questions in natural language.

Do: the current phase positions PRINCE as an active research assistant capable of executing complex tasks. This is achieved through the integration of multi-agent systems, allowing the platform to handle intricate queries, orchestrate workflows, and support activities like drafting regulatory documents.

This deliberate evolution from Search to Ask to Do represents a strategic

response to the industry's need for greater efficiency and innovation in

preclinical development. By providing researchers with increasingly powerful

tools to access, analyze, and act upon preclinical data, PRINCE aims to enable

faster data-driven decision-making, reduce the need for unnecessary experiments,

and ultimately accelerate the development of safer, more effective

therapies.

System Architecture: Engineering a Reliable Agentic RAG System

The system functions as an interactive conversational UI, powered by a robust backend

infrastructure. Its architecture, designed for handling complex queries and delivering

accurate, context-rich answers, is orchestrated using LangGraph and served via a

FastAPI application.

Figure 1 provides the system context (UI, backend, data stores, LLM fallbacks, and observability), while Figure 2 zooms into how the system coordinates its specialized agents.

Figure 1: System context and supporting platforms.

User Request: the process begins when a user submits a request through the Conversational UI, which is built with React.

Orchestration: the user's request is routed to a LangGraph-based orchestration layer in

the backend. This workflow engine coordinates a multi-stage process that progresses through clarifying user intent, thinking and planning, conducting research (using RAG and Text-to-SQL), validating data completeness, and finally generating a response through the Writer agent. The workflow includes deliberate pause points and feedback loops to ensure data completeness before proceeding. (We explore the details of this agentic workflow in a dedicated section later.)

Data Retrieval and State Management: the Researcher agents interact with a comprehensive and distributed data ecosystem:

Vector representations of all study reports are stored in OpenSearch, forming the core knowledge base for information retrieval.

Curated structured data, resulting from various ETL and harmonization processes, is accessed via Athena.

The state of the agent's execution is meticulously tracked. After each logical step (a LangGraph node execution), the corresponding state is persisted in PostgreSQL using a LangGraph checkpointer.

Broader application-level state is managed in DynamoDB.

The system leverages internal GenAI platforms that host models from OpenAI, Anthropic,

Google, and open-source providers. These platforms expose all models via a unified

OpenAI-compatible endpoint, making it easy to swap models and choose the best tool for

each task. They also manage the control plane, enforcing rate limits and other safeguards

to prevent abuse.

Resilience and Error Handling: robustness is a critical design principle, with multiple fallback mechanisms in place:

If a specific LLM fails, the system automatically retries the request several times before falling back to an alternative model or platform to ensure service continuity.

To recover quickly from transient failures, retries are implemented at both the individual LLM call level and the logical node level (i.e., an entire step in the agent's plan).

Agents are also given the context of the errors so that they can chart a different trajectory or an alternative plan of action in response.

Observability and Evaluation: the entire system is monitored for performance and reliability:

General system health and metrics are tracked using CloudWatch.

Langfuse serves as the primary observability tool, providing detailed traces of all production traffic. This allows for in-depth debugging of issues. Furthermore, evaluation datasets are stored and managed within Langfuse, making it easier to analyze performance scores and diagnose specific failures. Evaluation is done using the RAGAS evaluation framework. Live traffic is evaluated daily, while the dataset evaluation is run whenever significant changes are made to the core workflow, prompts, or underlying models.

Final Response: once the agents have processed the request and generated a satisfactory response, it is sent back to the Conversational UI to be presented to the user.

A design principle running through this architecture is context discipline. Larger context

windows did not remove the need to be selective about what each agent sees. In early

iterations, putting too much information into the context made the system harder to steer

and harder to evaluate. PRINCE therefore avoids treating the prompt as one large container

for all available information. Instead, different stages receive different context: planning

context for Think & Plan, retrieval context for the Researcher Agent, evidence context

for the Reflection Agent, and synthesis context for the Writer Agent. This reduces context

pollution and makes the system easier to debug, evaluate, and improve.

These steps ensure that the system can provide reliable and contextually relevant answers to a wide range of complex queries by leveraging a multi-agent architecture and a diverse set of tools and data sources.

The Agentic RAG System

PRINCE incorporates an agentic RAG system (Figure 2) to handle complex user requests that require multiple steps, reasoning, and interaction with different tools or data sources. This setup, implemented using LangGraph, orchestrates the overall workflow and leverages the Researcher Agent, Writer Agent, and Reflection Agent for specific tasks. The system is designed to be robust and reliable, with multiple fallback mechanisms in place so that it can continue to function even if some of the components fail.

Figure 2: The research workflow.

Clarify User Intent

The Clarify User Intent step serves as the first line of defense against

ambiguity. As the system scaled to include diverse domains like toxicology and

pharmacology, simple user queries often became ambiguous, making it difficult to

automatically select the right tools. Rather than relying on expensive trial-and-error

across all data sources, the system proactively asks clarifying questions to pinpoint the

specific domain or data type.

This ensures the system enhances the query with the necessary constraints to target the

correct tools. We are also optimizing this by developing domain-level selection in

the UI, which will allow users to pre-filter valid tools upfront. To further reduce

friction, the system also provides AI-assisted source recommendations: when a user has not

selected any data source — or has selected several without a clear focus — the model

analyzes the intent behind the user's query and suggests the most relevant sources. The

user retains full control and can accept, adjust, or override the recommendation, ensuring

domain expertise always has the final say. This “fail-fast” mechanism prevents wasted

execution on vague queries, while careful tuning ensures the system remains unobtrusive

when the intent is already clear.

From a context engineering perspective, this step is the first assembly decision in the

workflow: it constrains which tools, domains, and data sources will be in scope before any

retrieval begins, ensuring subsequent agents receive a focused rather than open-ended

problem.

Think & Plan: Process Reflection

The Think & Plan step is responsible for devising a strategy to fulfill the user's request. This critical component gives the system a dedicated space to reason about the next steps before taking action, a technique inspired by Anthropic's Think tool. Importantly, this step performs process reflection: evaluating whether the agent is making the right progress toward its end goal and is on the right trajectory, rather than evaluating the data itself.

In multi-step agentic workflows, particularly those involving many sequential actions,

process reflection is essential. Consider a scenario where the system needs to execute 50

steps to complete a complex task. At each juncture, the system must ask: Am I taking these

steps in the right manner? Am I making the progress I'm supposed to make? Is the current

trajectory leading toward the user's goal? The Think & Plan step provides this

metacognitive capability, allowing the system to reflect on its own workflow and adjust

its strategy accordingly.

This “thinking space” has proven particularly valuable in scenarios involving multiple tool calls. When PRINCE was initially developed, it had only a couple of tools: one for RAG-based retrieval and another for Text-to-SQL queries. However, as we integrated more data sources to expand the system's capabilities, the number of available tools grew significantly. With this explosion of tools came an inherent challenge: overlapping concerns and domain boundaries across different tools.

For example, multiple tools might serve similar but subtly different purposes: querying structured metadata versus unstructured reports, or retrieving study summaries versus detailed experimental data. When presented with tools that belong to similar domains but handle slightly different data, the LLM would sometimes struggle to select the most appropriate tool for a given query. By introducing a dedicated thinking step, the system can explicitly reason about which tool best matches the user's intent, evaluate the characteristics of each available tool, and make a more informed decision. This approach led to a dramatic improvement in the accuracy of tool selection.

Beyond tool selection, the Think & Plan step is essential for orchestrating

multi-step processes. Many complex queries in PRINCE require a series of tool calls where

the output of one tool must be analyzed before determining the next action. For instance,

the system might first query structured metadata to identify relevant studies, then use

those study IDs to retrieve detailed information from unstructured reports, and finally

synthesize the findings. Without a dedicated space for process reflection, the system

would attempt to execute these steps linearly without evaluating whether each step is

bringing it closer to the goal. With the thinking step in place, the system can pause,

assess its progress in the workflow, and intelligently plan the subsequent tool calls

needed to complete the user's request.

The Researcher Agent

The Researcher Agent serves as the system's primary information gatherer. As we onboard new scientific domains onto PRINCE, we consistently observe that data falls into two primary categories: structured and unstructured. While specific implementation techniques may vary across domains (for instance, using Snowflake Cortex Analyst for Text-to-SQL on pharmacology queries versus more custom methods for toxicology), the fundamentals behind these retrieval strategies remain consistent.

As PRINCE expands across multiple preclinical domains, a single Researcher agent with a flat tool list becomes increasingly hard to manage. Many tools operate on similar concepts (“studies”, “findings”, “assays”) but point to different underlying datasets, schemas, and regulatory

interpretations depending on the domain. For example, when a user refers to “the study”,

the relevant context might be a repeat‑dose toxicology study, a cardiovascular safety

pharmacology package, or a particular assay in aggregated mass‑data tables, each with its

own preferred sources of truth.

To avoid one monolithic agent juggling overlapping tools and subtly different data contracts, we are actively evolving the Researcher capability into a hierarchy of domain-specific sub-agents. In this proposed architecture, each domain agent will own its own toolset (for example, toxicology RAG + tox metadata SQL, or pharmacology RAG + assay-level SQL) along with tailored prompt instructions that encode how that domain’s data model works, which tables or indices are authoritative, and how to interpret key concepts. We anticipate this will keep responsibilities coherent, reduce accidental cross-domain leakage, and make it easier to reason about and test retrieval behaviour per domain.

To effectively harvest insights from this diverse landscape, the Researcher Agent employs a hybrid retriever approach focused on two distinct patterns:

Retrieval-Augmented Generation (RAG): for processing unstructured data, primarily PDF reports.

Text-to-SQL: for querying structured data housed in Amazon Athena.

This dual strategy allows the system to bridge the gap between narrative scientific reports and quantitative experimental data.

In this updated vision, the top-level Researcher Agent is designed to act as a coordinator rather than a single all-knowing component. Given the clarified user intent and any explicit domain selection from the UI, it will route the query to the appropriate domain sub-agent, which can then decide how to combine RAG and Text-to-SQL within its own boundary. This pattern aims to preserve the simplicity of “one researcher” from the user’s perspective, while internally allowing each domain to evolve its own tools, schemas, and retrieval recipes without destabilizing the rest of the system.

Retrieval-Augmented Generation (RAG) for Unstructured Data

Given the vast repository of thousands of preclinical study reports and other

unstructured documents, RAG is essential for extracting relevant insights by grounding

LLM responses in this specific knowledge base. The RAG pipeline comprises a

comprehensive ingestion process and a sophisticated

query-time architecture.

Ingestion Process: preclinical study reports, mostly PDFs spanning decades and often including scanned documents with complex tables, are first centralized into an S3 data lake and passed through an extraction pipeline tuned for this corpus. The extracted text is normalized into structured JSON and then chunked using a strategy that preserves enough scientific context while keeping chunks efficient for retrieval.

Each chunk is enriched with study- and section-level metadata from Amazon Athena (for example study ID, compound, species, route, page, and parent section), which later enables precise metadata filtering in the RAG layer. Finally, these annotated chunks are embedded and indexed in Amazon OpenSearch Service, forming the vector store that backs semantic and metadata-aware retrieval over both the historical corpus and the daily deltas as new or updated reports arrive.

Query-Time RAG Pipeline: when a user submits a query, the system initiates a

multi-stage retrieval process. This pipeline is engineered to effectively retrieve the

most relevant and trustworthy information from the vector database to ground the LLM's

response.

To illustrate this pipeline, consider the example query: “Were any of the following clinical findings observed in study T123456-2: piloerection, ataxia, eyes partially closed, and loose faeces?” The system processes this query through the following steps:

Keyword Extraction: the user's natural language query is first analyzed by an LLM. Through careful prompt engineering, the model is instructed to extract keywords highly relevant for keyword search within our document corpus (e.g., “piloerection”, “ataxia”, “eyes partially closed”, “loose faeces”).

Metadata Filter Generation: concurrently, the LLM generates a metadata filter based on the query. For example, a filter eq(study_id, T123456-2) is extracted to narrow the search space. This filter is dynamically generated using few-shot prompting, with examples covering various permutations and combinations of filters, ensuring the model can handle diverse filtering requests.

Query Expansion: to ensure comprehensive retrieval and account for variations in phrasing and terminology, query expansion (multi-query or query rewrite) is performed by a smaller, faster model. This generates n=5 semantically similar queries based on the original question. For the example query, this might include variations like:

“Clinical symptoms reported in research T123456-2, including goosebumps, lack of coordination, semi-closed eyelids, or diarrhea.”

“Recorded observations in experiment T123456-2 regarding hair standing on end, unsteady movement, eyes not fully open, or watery stools.”

“What were the clinical observations noted in trial T123456-2, particularly regarding the presence of hair bristling, impaired balance,

partially shut eyes, or soft bowel movements?”

Hybrid Retriever: information retrieval from the vector database (Amazon OpenSearch Service) utilizes a Hybrid Search approach that combines metadata filtering, semantic vector similarity search (kNN), and keyword-based retrieval. This process is executed as follows:

Metadata Filtering: the metadata filter generated in the previous step (e.g., eq(study_id, T123456-2)) is applied directly to the vector database query. This pre-filters the search space based on the structured metadata from Amazon Athena that was attached to the chunks during ingestion, ensuring that only chunks associated with the specified study ID (or other relevant metadata) are considered. This significantly reduces the search space from millions of vectors to a more manageable range of tens to hundreds, improving efficiency and relevance.

Parallel Hybrid Search Execution: for each of the n=5 expanded queries, a single hybrid search query is executed in parallel against the filtered Amazon OpenSearch Service vector database. This query combines both semantic vector similarity search (kNN) and keyword-based search, leveraging OpenSearch's capabilities for efficient multi-vector and text search.

Weighted Result Scoring: within each individual hybrid search executed in parallel, a weighted approach is applied to the results. A weight of 0.7 is given to the semantic search results and 0.3 to the keyword search results to balance contextual understanding and precise term matching. This weighting was determined through experimentation to optimize retrieval effectiveness for our data.

Result Aggregation and Initial Ranking: the results (sets of relevant chunks with their weighted scores) from all 5 parallel hybrid search executions are aggregated. Unique chunks from all search results are pooled, and their highest weighted score across the parallel searches is used to determine an initial ranking. This step initially retrieves a larger set of potential context chunks (k≈20) based on these aggregated and weighted scores.

Reranking: the initial set of retrieved chunks (k≈20) is then refined using a Rerank step. A cross-encoder model (bge-reranker-large)

evaluates the relevance of each retrieved chunk against the original question, selecting the top k=7 most relevant chunks to be used as context for the LLM. This reranking step is crucial for ensuring that the most pertinent information, even if not the highest in initial semantic similarity or keyword match, is prioritized for the final response generation.

Final LLM Prompt Generation: the refined context (k=7 chunks) is then combined with the original question to form the final LLM prompt. This prompt is carefully constructed to guide the LLM in generating a focused and accurate response based on the provided context, minimizing the risk of hallucination.

Response Generation with Citation: a state-of-the-art reasoning model then processes the final prompt and the provided context to generate a response with citations. The LLM synthesizes the information from the context to formulate a coherent and accurate answer. Crucially, the response automatically includes citations linking back to the specific chunks in the original document(s) that support the generated answer.

Monitoring: the entire Query-Time RAG process, from initial query to final response generation, is continuously monitored using Langfuse for observability, performance, and quality analysis.

Text-to-SQL for Structured Data

While RAG excels at unstructured data, queries requiring precise filtering,

aggregation, or comparison of structured data points are better suited for Text-to-SQL.

Examples include “Give me 50 example studies done on RAT” or retrieving specific

numerical assay results including dosage groups. As shown in Figure 3, the Researcher Agent can intelligently decide to hand over such queries to the Text-to-SQL tool.

Figure 3: Text-to-SQL tool

The process for converting a natural language question into an executable SQL query and retrieving results involves several key steps:

Query Analysis and Intent Recognition: the user's natural language query is analyzed to understand the user's intent and identify the specific data points and filters being requested from the structured metadata.

Schema Understanding and Relevant Schema Selection: to accurately generate a

SQL query, the LLM requires an understanding of the relevant database schema. For

large and complex schemas, only the necessary schema components relevant to the user's

query are dynamically injected into the LLM's context. This reduces the complexity for

the model and improves the accuracy of the generated SQL.

Dynamic Few-Shot Prompting for SQL Generation: converting complex natural language queries into a precise SQL dialect (in our case, Athena) can be challenging for

LLMs. To address this, we employ dynamic few-shot prompting. A collection of carefully

hand-picked examples, representing various complex query patterns and their

corresponding correct SQL translations in the Athena dialect, is stored in a separate

collection within our vector database. Based on the user's query, relevant examples

are retrieved from this “semantic layer” using vector similarity search and included

in the prompt to the LLM. This provides the LLM with in-context learning examples,

guiding it to generate accurate SQL queries in the correct dialect. Continuous

addition of new examples based on encountered challenges further improves the system's

performance over time.

SQL Query Generation and Validation: a model with strong code generation capabilities,

conditioned on the relevant schema information and dynamic few-shot examples, generates the

corresponding SQL query. To ensure the LLM can accurately process the results and

identify the correct rows for subsequent synthesis, certain essential columns, such as

study ID and study title, are always included in the generated SELECT query. The

generated query is then validated to ensure it adheres to allowed operations (e.g.,

only SELECT queries are permitted; DELETE, INSERT, or UPDATE queries are explicitly

blocked for data integrity and security). Notably, an earlier iteration of this

process included an LLM review step for generated SQL queries; however, this step was

later removed as it was found that the reviewing LLM sometimes incorrectly flagged

valid queries as erroneous, hindering efficiency without a commensurate gain in

accuracy.

Query Execution and Result Limiting: the validated SQL query is executed

against the structured metadata database in Amazon Athena. To prevent data flooding

and manage response size, the system enforces a limit, fetching not more than 50

records at a time.

Error Handling and Iteration: if the SQL query execution is successful, the

retrieved results (up to the specified limit) are returned and integrated into the

overall response generation process. If the query fails due to syntax errors, schema

issues, or other execution errors, the error message from the database, along with the

generated query and the original context, is passed back to the same model.

The LLM analyzes the error and the context to generate a corrected SQL query.
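
The validation, result limiting, and error-feedback steps described above can be sketched roughly as follows. This is a minimal illustration, not PRINCE's actual implementation: `validate_select_only`, `run_with_retries`, the keyword blocklist, and the `generate_sql` and `execute` callables are all hypothetical names introduced here. Only the SELECT-only rule, the 50-record cap, and the limit of three attempts come from the text.

```python
import re
from typing import Callable, List, Optional, Sequence

MAX_ATTEMPTS = 3  # retry budget stated in the text
ROW_LIMIT = 50    # result cap stated in the text

# Hypothetical blocklist. Keyword matching is naive: it would also reject a
# SELECT whose string literal happens to contain one of these words.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|merge|truncate)\b",
    re.IGNORECASE,
)


def validate_select_only(sql: str) -> str:
    """Return the statement without a trailing semicolon, or raise ValueError."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"select\b", statement, re.IGNORECASE):
        raise ValueError("only SELECT queries are allowed")
    if FORBIDDEN.search(statement):
        raise ValueError("write or DDL keyword found in query")
    return statement


def run_with_retries(
    question: str,
    generate_sql: Callable[[str, Optional[str]], str],  # assumed LLM wrapper
    execute: Callable[[str], Sequence[tuple]],          # assumed database client
) -> List[tuple]:
    """Generate, validate, and execute SQL, feeding errors back to the generator."""
    feedback: Optional[str] = None
    for _ in range(MAX_ATTEMPTS):
        sql = generate_sql(question, feedback)
        try:
            rows = execute(validate_select_only(sql))
            return list(rows[:ROW_LIMIT])  # never return more than the cap
        except Exception as exc:
            # Pass the failed query and the error text into the next attempt.
            feedback = f"Previous query:\n{sql}\nError:\n{exc}"
    raise RuntimeError(f"no valid query after {MAX_ATTEMPTS} attempts")
```

In this sketch, validation failures and database errors share one feedback path, so a blocked statement is corrected in the same way as a syntax error. A production system would more likely push the row cap into the query itself rather than slicing results in memory.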

This iterative process of generating and executing SQL queries is attempted up to 3

times before the tool gives up and reports a failure, potentially indicating an

unresolvable query or a limitation in the model's ability to handle the specific

request.

The Reflection Agent: Data Validation and Sufficiency

While the Think & Plan step provides process reflection, the Reflection

Agent performs a complementary but distinct type of reflection: data reflection.

This crucial component evaluates whether the data retrieved from various tools is

sufficient and relevant to answer the user's question—a fundamentally different concern

from whether the workflow itself is progressing correctly.In multi-step agentic workflows, these two types of reflection serve different but

equally important

purposes. Process reflection (Think & Plan) ensures the agent is taking the right

steps and making

appropriate progress toward the goal. Data reflection (Reflection Agent) ensures that the

information

gathered through those steps is adequate to fulfill the user's request. Both are

essential: an agent

might execute a perfectly valid workflow (good process) but still retrieve insufficient

data to answer

the question, or conversely, might have access to sufficient data but fail to progress

effectively

through the workflow.As illustrated in the research workflow diagram ( Figure 2 ), after initial information retrieval and 'think

& plan' loops, the Reflection Agent is invoked when Think & Plan step

thinks that the process has progressed well enough and is ready to evaluate the data.

'Reflection Agent' evaluates the sufficiency and relevance of the collected data by

comparing the retrieved context against the user's original query and identifying

potential gaps or missing information. If the gathered information is deemed insufficient

to provide a complete response, the Reflection Agent generates specific follow-up

questions designed to acquire the necessary missing information. These follow-up questions

are then handed back to the Think & Plan step, which initiates further

retrieval steps to obtain more comprehensive results. This iterative process of data

validation and subsequent information retrieval, driven by the Reflection Agent's

generated questions, demonstrates the system's ability to refine its search strategy based

on the initial results. If the information is sufficient, the workflow proceeds to the

next step.

The Writer Agent: Answer Synthesis and Formatting

Once the Researcher Agent has collected the relevant evidence from RAG and Text-to-SQL,

the Writer Agent is responsible for turning that raw material into the final answer

shown to the user. Its job is not to “discover” new information, but to synthesize the

retrieved context, respect user instructions, and enforce PRINCE's quality constraints

during generation.

The Writer Agent operates with a few non-negotiable rules. It must ground every claim in

the supplied context and attach accurate citations back to the underlying chunks and study

IDs, since verifiability is critical in a regulated environment. It is also responsible

for honoring user-level formatting requirements (for example, tables, bullet points, or

specific section structures) and for aligning with domain-specific answer standards used

by the preclinical scientists.

For more complex responses, such as multi-section summaries or partially filled regulatory

templates, the architecture supports extending the Writer Agent with a short internal

review loop. In this pattern, the Writer would first draft an answer, then a reviewing

step would check for missing sections, inconsistent tables, or gaps relative to the

original question, and may send targeted instructions back to the Writer to revise

specific parts. This design enables a lightweight form of reflection focused on answer

completeness and presentation, complementing the Reflection Agent's focus on data sufficiency

earlier in the workflow. Importantly, all outputs from these regulatory drafting workflows

are intended for expert review; final submissions are authored and approved by qualified

personnel.

This gives PRINCE three complementary reflection loops. Process reflection checks whether

the workflow is on the right path and helps catch bad trajectory, wrong tool choice, or

poor sequencing. Data reflection checks whether the gathered evidence is sufficient and

helps catch thin evidence, missing context, or gaps in coverage. Draft reflection checks

whether the generated output is complete and helps catch missing sections, incomplete

tables, or synthesis gaps.

Together, these agents form a practical context engineering pattern. The system does not

simply keep adding more information to the prompt. It routes the right context to the right

capability at the right time: planning context for Think & Plan, retrieval context for

the Researcher, evidence context for the Reflection Agent, and synthesis context for the

Writer. This plays out in concrete decisions throughout the system: the Text-to-SQL step

injects only the schema components relevant to the current query rather than the full

database schema; the Reflection Agent receives the original question alongside collected

evidence to assess gaps, not the full workflow history; and the Writer Agent receives curated

chunks with citation constraints, not raw retrieval output. Moving from a monolithic agent

to this structured workflow meant each agent could be evaluated, debugged, and improved in

isolation.

Building Trust in a Production LLM System

Building and maintaining user trust is paramount for the successful

adoption of any AI system, particularly in a critical environment like

preclinical drug discovery where decisions have significant implications. For

a production LLM application, trust is not just about accuracy; it's also

about reliability, transparency, and the ability for users to verify the

information provided. Several mechanisms are integrated into PRINCE

to achieve this:

Transparency and Explainability

Ensuring transparency and explainability is a critical aspect of PRINCE's

design, fostering user trust and enabling verification of the

generated responses. The system incorporates several mechanisms to achieve

this:

Intermediate Steps and Transparency: given the iterative nature of the workflow

and the potential time required to generate a final answer, maintaining transparency is

crucial. The intermediate steps executed by the system during query processing,

information retrieval, and reflection, including the queries formulated and the tools

utilized, are displayed to the user. This provides visibility into the system's

reasoning process and allows users to follow the steps taken to arrive at the final

answer. Additionally, when relevant context (chunks) is identified, links to these

source materials are presented on the screen, allowing users to see precisely which

information was shortlisted and used to formulate the final response.

Factuality Verification through Citation: the system facilitates user

verification of factuality through a robust citation mechanism. The generated answer is

consistently accompanied by citations referencing the original source documents and

structured metadata. These citations are directly linked to the context displayed to the

user, enabling them to easily verify the accuracy of the claims made in the response and

trace the information back to its origin. Users can hover over any sentence in the

generated response to see the corresponding citation, which provides a link to the

PRINCE and to the source document, including the page number and the exact quote from

the report used to support that part of the answer. This granular level of citation

significantly enhances the credibility and trustworthiness of the system's output and

simplifies the human review process.

Evaluation

Rigorous evaluation is fundamental to building and maintaining a reliable

LLM application. PRINCE's performance and reliability are assessed

through a combination of two types of evaluations: Dataset Evaluations and

Live Traffic Evaluations.

Dataset Evaluations: conducted whenever significant changes are made to the core

workflow, prompts, or underlying models, these evaluations utilize curated datasets with

pre-defined reference answers, meticulously prepared by subject matter experts and

stored in Langfuse. A custom evaluation script processes each question and compares the

generated response against the reference answer, yielding quantitative metrics such as

Faithfulness (degree to which the answer is supported by context), Answer

Relevancy (how well the answer addresses the query), Context Relevancy

(relevance of retrieved chunks), Answer Accuracy (comparison to ground truth),

and Semantic Similarity with Reference (semantic similarity to reference answer). Given the

agentic nature of the system, applying appropriate evaluation metrics at different

workflow stages, analogous to a testing pyramid, is crucial in addition to evaluating

overall end-to-end performance.

Live Traffic Evaluations: performed daily as a batch job on real user queries

from the live environment (without pre-defined reference answers), these evaluations

provide valuable insights into real-world performance. Metrics such as Faithfulness and

Answer Relevancy can still be assessed. Live traffic evaluations are essential for

monitoring system behavior, identifying potential issues like hallucinations in

production, and understanding performance on diverse live queries.MonitoringContinuous monitoring of the system's performance and outputs is essential

for proactive identification and resolution of issues in a production

environment. Using platforms like Langfuse, we continuously monitor

PRINCE to identify potential biases, errors, or areas for improvement,

ensuring the reliability and safety of the system's responses.

Engineering for Resilience: Error Handling and Recovery

Given the complexity of the multi-step workflow inherent in PRINCE,

robust error handling and recovery mechanisms are critical to ensure

the system's reliability and provide a seamless user experience. The system is

engineered to recover gracefully from failures at various stages without

requiring a complete restart of the entire workflow.

Key aspects of our error handling and recovery approach include:

State Persistence: the state of the entire workflow graph is persistently stored,

enabling the system to resume execution directly from the failed node. This is achieved by

storing the Agent State, representing the progress of the agents through the

workflow, in Postgres. Other aspects of the application state, such as logs, intermediate

steps, and citations, are stored in DynamoDB. This separation and persistence of state are

crucial for achieving robustness in a stateful agentic system.

Built-in Retries: the system is configured with built-in retries at various steps

in the workflow. If a particular step encounters a transient failure, the system will

automatically attempt to re-execute it a predefined number of times before signaling a

more permanent error.

User-Initiated Retries: in addition to automated retries, users have the option

to manually retry a failed query through the interface. When a user initiates a retry, the

system leverages the persisted state to continue the workflow directly from the point of

failure, intelligently skipping the steps that were successfully completed in the previous

attempt. This significantly improves user experience and saves computational resources.

Framework-Level Support: the error recovery mechanisms are significantly

supported by the underlying framework, LangGraph, which offers solid built-in capabilities

for managing workflow state and handling errors within the graph structure. This provides

a robust foundation for building resilient agentic workflows.

LLM Fallbacks: to enhance reliability and mitigate issues related to model

availability or performance, the system incorporates custom LLM fallback handling. If a

call to a primary LLM provider or a specific model fails after a few retries, the system

automatically falls back to an alternative LLM from a different provider. This mechanism

is crucial for maintaining system availability and responsiveness, especially as platform

downtimes for external services are outside of our direct control.

This comprehensive approach to error handling and recovery minimizes the

impact of transient failures, reduces the need for users to restart complex

queries from scratch, and contributes to cost and latency savings by avoiding

redundant execution of successful steps and LLM calls, all of which are

essential for a production-ready system.

These mechanisms are harness engineering in practice. The LangGraph workflow acts as

the control layer around the agents: it defines which component can act, which tools it can

use, where the workflow can pause, how failures are retried, how state is persisted, and

when the system should move from research to reflection to writing. This harness makes the

system less opaque and more reliable than an unconstrained autonomous agent. It gives the

application clear control points for recovery, inspection, evaluation, and human

intervention.

Enhancing Data Quality: Named Entity Recognition and Annotation

The accuracy and completeness of the structured metadata in Amazon Athena

are critical for the performance of the Text-to-SQL component and overall data

discoverability within PRINCE. Due to historical data migrations and varied

annotation practices across different laboratories and systems over Bayer's

extensive operational history, the metadata can sometimes be incomplete,

missing, or incorrect.

To address this challenge and continuously enhance the quality of the

structured metadata, we have developed a utility system that employs Named

Entity Recognition (NER) to extract and create accurate annotations directly

from the study PDFs. This system is designed to read the textual content of

the preclinical reports and identify key entities and associated information

that should be represented in the structured metadata.

The process involves:

Processing study PDFs to extract text and identify relevant entities (e.g.,

study IDs, compound names, species, routes of administration, dosage

information, clinical findings, etc.).

Generating structured annotations based on the identified entities and their

relationships within the text.

We are actively working on integrating this utility system into our data

pipelines to automatically correct and enrich the data within the Amazon

Athena database. The system's performance in generating accurate annotations

has been evaluated against curated datasets, demonstrating promising results.

To manage the integration of these annotations into the production database,

we are developing an evaluation system that provides a confidence score for

each extracted field. Fields with a high confidence score will be

automatically used to update the corresponding entries in Amazon Athena.

Fields with lower confidence scores will be quarantined and flagged for human

review and intervention, ensuring data accuracy while leveraging automation.
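
As a rough illustration of the confidence-based routing described above, the sketch below splits extracted fields into an auto-apply set and a human-review set. The text describes this evaluation system as still under development, so everything here is an assumption made for illustration: the `ExtractedField` type, the `route_fields` function, and especially the 0.9 threshold, for which the source gives no value.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical cut-off; the source does not state what counts as "high" confidence.
AUTO_APPLY_THRESHOLD = 0.9


@dataclass(frozen=True)
class ExtractedField:
    """One NER-derived annotation proposed for the structured metadata."""
    study_id: str
    name: str          # e.g. "species" or "route_of_administration"
    value: str
    confidence: float  # assumed to be in the range 0.0 to 1.0


def route_fields(
    fields: Iterable[ExtractedField],
    threshold: float = AUTO_APPLY_THRESHOLD,
) -> Tuple[List[ExtractedField], List[ExtractedField]]:
    """Split fields into (auto_apply, needs_review) by confidence score."""
    auto_apply: List[ExtractedField] = []
    needs_review: List[ExtractedField] = []
    for field in fields:
        if field.confidence >= threshold:
            auto_apply.append(field)
        else:
            needs_review.append(field)  # quarantined for human review
    return auto_apply, needs_review
```

Keeping the threshold as a parameter would allow it to be tuned per field type once evaluation data shows where automatic updates are safe.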

This approach aims to continuously improve the quality of the structured

metadata, making it a more reliable source of information for PRINCE

and other downstream applications.

The Journey Continues: Iterative Development

PRINCE has been available to end-users since early 2024, with the agentic

integration introduced later that year. This has been crucial for gathering real-world feedback

and driving iterative development. A key principle guiding our development

has been the understanding that building a production-ready LLM application is

an iterative process; we don't wait for features to be absolutely perfect

before seeking user feedback. Instead, we prioritize delivering value

early and continuously refining the system based on real-world usage.

In the initial stages, our focus was squarely on achieving the desired

accuracy and performance for core functionalities, even if it meant incurring

higher costs. We recognized that optimizing for cost prematurely could

compromise the system's effectiveness and hinder user adoption. Only after

achieving the desired level of accuracy and performance did we begin to focus

on cost optimization, ensuring that efficiency gains did not negatively impact

the user experience or the quality of the results.

The development of PRINCE follows a continuous, iterative

process. User feedback, ongoing monitoring data, and insights from expert

scientists are continuously fed back into the development cycle, leading to

refinements in the architecture, retrieval techniques, agent behaviors, and

user interface to enhance performance, usability, and ultimately, scientific

impact.

Conclusion

Building a production-ready LLM application in a complex enterprise

environment like preclinical drug discovery is a journey marked by significant

technical and engineering challenges. The PRINCE case study

demonstrates that by combining robust data infrastructure, sophisticated

information retrieval techniques like RAG and Text-to-SQL, and an intelligent

multi-agent orchestration system, it is possible to unlock valuable insights

from vast, previously inaccessible data repositories.

Our experience highlights the critical importance of focusing on

engineering for reliability, including robust error handling, state

persistence, and LLM fallbacks. Furthermore, building user trust is paramount,

achieved through transparency in the workflow, clear explainability via

granular citations, and continuous evaluation and monitoring of the system's

performance.

PRINCE has already shown promising results in enhancing data

accessibility and research efficiency at Bayer, transforming how scientists

interact with preclinical information. This is not the end of the journey, but

rather a significant step towards creating truly intelligent research

assistants.

The broader lesson from PRINCE is that production-ready agentic AI is not only about better

models or better prompts. Reliability comes from engineering both the context the model sees

and the harness within which the model acts. Context engineering helped ensure that each

model had the right information, and only the right information, at the right stage of the

workflow. Harness engineering helped ensure that the workflow remained bounded, observable,

recoverable, and suitable for a regulated research environment.

As model capabilities improve, some parts of today's harness may become thinner or move

into native model capabilities. But in enterprise research systems, especially where trust,

traceability, and reviewability matter, explicit control over context, workflow state,

recovery, reflection, and verification remains essential.

We hope this overview provides valuable insights into the practical

considerations and technical depth required to build and productionise LLM

applications in a regulated and data-rich domain.

Acknowledgments

The author gratefully acknowledges the invaluable contributions of Adam Zalewski, Annika

Kreuchwig, Carlos Henrique Vieira-Vieira, Jobst Löffler, and Jonas Münch from the Bayer

team.

The author also thanks Bala Hari, Balu Saravanan, Bernice Mercy Sharon M, Deril X, Jigar

Jani, Manibalan Baskaran, Nafis Aslam, Priyalakshmi R, Rohit Bansal, Sai Prabhanj Turaga,

Saksham Srivastava, Shivam Sehgal, Sowmya Adimoulame, and

Subhashini Rajamani from the Thoughtworks team for their contributions to this work.

The author used AI assistance during the writing of this article. AI tools were used for

brainstorming ideas, creating outlines, and reviewing drafts to polish language and

improve clarity.

Disclaimer

All activities described conform to Bayer's information classification, data governance,

and external communication policies, and do not constitute claims regarding regulatory

decision‑making or product performance.

Significant Revisions

16 June 2026: published
