DDQ Automation Platform | Grayscale Investments

Built an AI-powered DDQ platform that drafts cited, confidence-scored answers from a governed knowledge base.

Claude, MCP, OpenSearch, RAG, AWS Step Functions, Anthropic Citations, AWS Bedrock, S3, TypeScript

Overview

The DDQ (Due Diligence Questionnaire) Automation Platform is an AI-powered system I built at Grayscale to automate what was previously a fully manual process for the Operations team.

Institutional investors and partners regularly send due diligence questionnaires as part of their evaluation process. These questionnaires often run to hundreds of questions, many of them repeated across submissions. Before this platform, the Operations team had to manually search through source documents, verify accuracy, and draft responses by hand every time. It was repetitive, time-consuming, and prone to inconsistency.

I embedded directly with the Operations team to understand their workflow, then worked with engineering to rebuild it as an AI system that compresses a multi-hour manual process into something far faster and more consistent.

Why Generic Retrieval Falls Short

DDQ answering looks like a simple search problem until you try it. A few failure modes show up fast with naive vector search:

Chunking can fragment answers. Splitting complete, human-reviewed answers into small fragments means the retriever might surface only half of a vetted response.
Good answers often require synthesis, not just lookup. A question like "Describe your approach to third-party risk management" can pull from vendor assessments, SOC 2 controls, procurement policies, and security reviews at once. You have to reason across sources, not just grab the single closest match.
Semantic similarity can mislead. "How do you handle data deletion?" and "How do you handle data retention?" look almost identical to an embedding model, but the correct answers are completely different.

The takeaway wasn't to abandon retrieval. It was to engineer it so each of these failure modes gets handled deliberately.

How It Works

The platform runs each questionnaire through a consistent loop: parse the blank document, triage anything uncertain to a human, answer each question from a governed knowledge base, assemble a companion Word document, and feed reviewer corrections back in.

Answers come from a governed corpus in OpenSearch, not from the model's training memory. For each question, retrieval blends three signals: keyword search for precise wording, semantic (vector) search for meaning, and a learned-sparse signal that catches aliases and acronyms the other two miss. Those rankings are fused and reranked for relevance. That hybrid approach is what handles the "retention versus deletion" problem: meaning-only search confuses them, but the keyword and sparse signals keep them apart.
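The fusion step can be illustrated with reciprocal rank fusion, the method named in the architecture section. This is a minimal sketch, not the platform's code: it assumes each retriever returns an ordered list of document IDs, the constant k = 60 is a commonly used default rather than a confirmed setting, and the document IDs are hypothetical.

```typescript
// Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank of d).
// k = 60 is a common default, assumed here for illustration only.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}

// Hypothetical rankings for "How do you handle data deletion?"
const keywordRanking = ["deletion-policy", "retention-policy"];
const vectorRanking = ["retention-policy", "deletion-policy"]; // meaning-only search confuses the two
const sparseRanking = ["deletion-policy"];

const fused = reciprocalRankFusion([keywordRanking, vectorRanking, sparseRanking]);
```

In this toy case the vector ranking puts the wrong policy first, but the keyword and sparse rankings both agree on the deletion policy, so it wins after fusion.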

Previously approved answers are preserved as whole units rather than chopped into fragments, and when a new question closely matches an approved one, that precedent is reused directly. Source documents are split into pieces with a short generated summary prepended to each, so a bare line like "It supports SSO" carries the context of where it came from and stays retrievable. Together these address the fragmentation and synthesis problems above.
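The two indexing rules described above can be sketched as follows. The type and function names are hypothetical, and the summary is assumed to be generated elsewhere (for example by a model call) before indexing.

```typescript
// Hypothetical shape of a unit stored in the knowledge base.
interface IndexedUnit {
  id: string;
  kind: "approved_answer" | "source_chunk";
  text: string; // the text that gets indexed and embedded
}

// Approved answers are indexed whole, never split into fragments.
function indexApprovedAnswer(id: string, question: string, answer: string): IndexedUnit {
  return { id, kind: "approved_answer", text: `Q: ${question}\nA: ${answer}` };
}

// Source documents are split into pieces, and each piece gets its
// generated summary prepended so it keeps its context when retrieved alone.
function indexSourceChunks(
  docId: string,
  pieces: { summary: string; body: string }[]
): IndexedUnit[] {
  return pieces.map((piece, i): IndexedUnit => ({
    id: `${docId}#${i}`,
    kind: "source_chunk",
    text: `${piece.summary}\n\n${piece.body}`,
  }));
}

// Hypothetical example: a bare line that would be meaningless on its own.
const chunks = indexSourceChunks("vendor-portal-guide", [
  { summary: "From the vendor portal guide, section on authentication.", body: "It supports SSO." },
]);
```

With the summary attached, a query about vendor portal authentication has something to match even though the original line never names the portal.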

Then comes the part that matters most for compliance work: trust. Claude writes each answer grounded only in the retrieved evidence, with Anthropic Citations enabled. With that feature, every citation is a passage extracted from a document we actually passed in the request, so a citation pointing at a source we never supplied is ruled out at the API level, not just discouraged by instructions. Every answer carries a confidence score, and a coverage check refuses to answer at all when the evidence is too thin, flagging the question for a human instead of guessing.
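A rough sketch of how the coverage check and the citation-enabled request might fit together. The document block follows the shape in Anthropic's Citations documentation for plain-text sources; the Evidence type, the decide function, and both thresholds are assumptions made for illustration, not the platform's actual logic.

```typescript
// Hypothetical shape of a retrieved piece of evidence.
interface Evidence {
  id: string;
  title: string;
  text: string;
  score: number; // relevance score after reranking, assumed to be in 0..1
}

// Document content block with citations enabled (plain-text source).
function toDocumentBlock(e: Evidence) {
  return {
    type: "document" as const,
    source: { type: "text" as const, media_type: "text/plain" as const, data: e.text },
    title: e.title,
    citations: { enabled: true },
  };
}

type Decision =
  | { action: "answer"; content: unknown[] }
  | { action: "flag_for_human"; reason: string };

// Hypothetical coverage gate: refuse when too little usable evidence remains.
// minScore and minDocs are illustrative values only.
function decide(question: string, evidence: Evidence[], minScore = 0.5, minDocs = 1): Decision {
  const usable = evidence.filter((e) => e.score >= minScore);
  if (usable.length < minDocs) {
    return { action: "flag_for_human", reason: "insufficient evidence" };
  }
  // User-message content: the evidence documents followed by the question.
  return {
    action: "answer",
    content: [...usable.map(toDocumentBlock), { type: "text", text: question }],
  };
}
```

The content array would go into the user message of a Messages API request. Because the model is handed only these documents, any citation it returns has to point into one of them.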

People stay in the loop at two points. Before answering, the system separates questions it's confident are real from ambiguous ones, and an operator confirms the uncertain set in plain language through their chat client. After answering, anything the guardrails flag (a citation that can't be matched, or thin coverage) lands in a queue a steward works through. Those corrections then feed back into the knowledge base, so the system gets more accurate with every questionnaire it processes.
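The two review points can be sketched as a pair of routing functions. Everything here is hypothetical: the field names, the 0.8 threshold, and the idea that a classifier supplies a confidence between 0 and 1 are assumptions for illustration.

```typescript
// Hypothetical output of parsing a blank questionnaire.
interface ParsedItem {
  text: string;
  isQuestionConfidence: number; // 0..1, e.g. from a lightweight classifier
}

// Before answering: confident questions proceed, ambiguous items go to an operator.
function triage(items: ParsedItem[], threshold = 0.8) {
  const confirmed: ParsedItem[] = [];
  const needsOperator: ParsedItem[] = [];
  for (const item of items) {
    if (item.isQuestionConfidence >= threshold) {
      confirmed.push(item);
    } else {
      needsOperator.push(item);
    }
  }
  return { confirmed, needsOperator };
}

// Hypothetical guardrail results attached to a drafted answer.
interface DraftAnswer {
  questionId: string;
  unmatchedCitations: number; // citations that could not be matched to a source
  coverageOk: boolean; // false when the evidence was too thin
}

// After answering: anything a guardrail flagged lands in the steward queue.
function needsStewardReview(answer: DraftAnswer): boolean {
  return answer.unmatchedCitations > 0 || !answer.coverageOk;
}
```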

Technical Architecture

The system runs on AWS and integrates with Claude through a custom-built MCP server:

TypeScript MCP server over Streamable HTTP, so the platform is driven directly from the Claude.ai interface. Operators work through chat, and the agent calls the platform's tools on their behalf.
OpenSearch as the governed knowledge base, with hybrid retrieval (BM25 keyword search, vector k-NN over Amazon Titan Text Embeddings V2, and an optional learned-sparse signal) combined with reciprocal rank fusion and reranked with Cohere Rerank.
Claude Sonnet for answer synthesis, grounded in retrieved evidence with Anthropic Citations. Claude Haiku handles lighter classification and routing steps.
AWS Step Functions with a Lambda fan-out for durable, long-running fills, so large questionnaires process in parallel and a job can't get stranded if something fails mid-run.
S3 for document and corpus storage, with the public MCP edge served by Bedrock AgentCore.
Okta for authentication, with role-based access (operator, steward, admin) so each user only sees their own work.
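As an illustration of the Step Functions fan-out, here is a sketch of an Amazon States Language definition written as a TypeScript object. The state names, the Lambda function name, the concurrency limit, and the retry settings are all hypothetical; only the overall pattern (a Map state that invokes a Lambda per question, with retries and a catch so one failure does not strand the job) reflects the description above.

```typescript
// Amazon States Language definition expressed as a TypeScript object.
// All names and numeric settings below are hypothetical.
const fillStateMachine = {
  StartAt: "AnswerQuestions",
  States: {
    AnswerQuestions: {
      Type: "Map", // fan out: one iteration per question
      ItemsPath: "$.questions",
      MaxConcurrency: 10,
      ItemProcessor: {
        ProcessorConfig: { Mode: "INLINE" },
        StartAt: "AnswerOne",
        States: {
          AnswerOne: {
            Type: "Task",
            Resource: "arn:aws:states:::lambda:invoke",
            Parameters: { FunctionName: "ddq-answer-question", "Payload.$": "$" },
            // Retry transient failures with exponential backoff.
            Retry: [
              { ErrorEquals: ["States.TaskFailed"], IntervalSeconds: 2, MaxAttempts: 3, BackoffRate: 2 },
            ],
            // If retries are exhausted, mark the question instead of failing the run.
            Catch: [{ ErrorEquals: ["States.ALL"], ResultPath: "$.error", Next: "FlagForReview" }],
            End: true,
          },
          FlagForReview: {
            Type: "Pass",
            Result: { status: "needs_review" },
            ResultPath: "$.outcome",
            End: true,
          },
        },
      },
      Next: "AssembleDocument",
    },
    // Placeholder for the step that builds the companion Word document.
    AssembleDocument: { Type: "Pass", End: true },
  },
};

// The JSON form is what would be supplied as the state machine definition.
const definitionJson = JSON.stringify(fillStateMachine, null, 2);
```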

My Role

I owned requirements definition for the platform, working closely with our lead engineer on architecture planning and ideation. I embedded with the Operations team to map their workflow, facilitated testing with business stakeholders, and continue to refine the knowledge base and data structures to improve answer quality.

The project also served as an internal proof point for AI adoption at Grayscale, showing how AI systems can take on real operational work.

Additional Thoughts

I'm proud of how this one came together. The easy paths were to buy an off-the-shelf tool or stand up generic vector search and call it done. Taking the time to understand the problem alongside the Operations team led somewhere better: a system that treats trust as a first-class feature.

The decisions I'd carry forward are the ones around grounding and honesty. Citations enforced at the API level mean every answer traces back to a real, approved source. A system that refuses to answer when the evidence is thin is far more useful in a compliance setting than one that always has something to say. And a feedback loop that learns from reviewer corrections means the platform gets better the more it's used, instead of going stale. That combination (grounded answers, honest uncertainty, and compounding quality) is what I'm most proud of.

Meeting Assistant | Personal Project