Evaluation

Golden Dataset

What Is a Golden Dataset?

A golden dataset is a curated set of question-and-answer pairs that represents the queries your system should be able to answer correctly. It serves as the ground truth for evaluating retrieval quality — the benchmark against which Vedana’s responses are measured. You can use the golden dataset we prepared at: Grist

Or you can prepare a golden dataset of your own.

Each entry in the golden dataset contains a question as a user would ask it, and the expected correct answer. During evaluation, Vedana sends each question through the full retrieval pipeline and compares the result against the expected answer.

A well-constructed golden dataset should:

Cover the main question types your users will ask
Include both structured questions (specific values, dates, names) and open-ended questions (explanations, summaries, policies)
Reflect real or realistic phrasing, not idealized input
Include edge cases and questions that are likely to stress retrieval boundaries

The golden dataset is not a test of the language model’s general ability. It is a test of whether the graph is correctly structured, whether the right data was ingested, and whether the retrieval tools are selecting the right information.

How to Start a Test

1. Prepare the golden dataset

Create a table with two columns: question and expected_answer. Each row is one evaluation pair.

Example:

question	expected_answer
Who likes Quokkas?	Geneva Durben
What are Geneva Durben’s interests?	Quokkas, Slide Rules, Mosaic, Eating Disorders, Tantric, Marrakesh
Which people are interested in Joshua Trees?	Flo Zaugg, Nathen Saadia

Keep expected answers concise and factual. For structured questions the answer should match the exact value stored in the graph. For document-based questions the answer should capture the key information a correct response would contain.

2. Upload the golden dataset to Grist

Upload your golden dataset table to Grist during the initial setup step (see step 5 of the Quickstart). It should be added as its own table, separate from your domain data and data model.

3. Run ETL

The golden dataset is loaded into the evaluation pipeline during ETL.

Navigate to the ETL section > main tab, and run the pipeline by clicking ‘Run Selected’.
Ensure ETL has completed successfully before running a test, including data model load, data load, and embedding generation.

4. Trigger an evaluation run

Open the Backoffice at http://localhost:8000, navigate to the ETL section, and trigger a metrics run. Open eval tab in ETL section and start evaluation pipeline by clicking ‘Run Selected’. The pipeline will iterate over each question in the golden dataset, send it to the chat endpoint, and record the response alongside the expected answer.

How to Evaluate Metrics

Once ETL is complete and the chat endpoint is responding, run the evaluation pipeline from the Backoffice. Navigate to the Evaluation (Eval) section and trigger a metrics run:

In Golden QA Dataset window select questions.
Check Judge Configuration and Pipeline Configuration.
Refresh Data Model.
Start metrics run by clicking ‘Run Selected’.

The pipeline will:

Iterate over each question in the golden dataset
Send it to the chat endpoint
Compare the response against the expected answer
Compute retrieval metrics

Hit Rate

The primary metric is Hit Rate — the proportion of questions for which the correct answer was retrieved. It is expressed as a value between 0 and 1, where 1.0 means every question returned the correct answer.

A hit rate above 0.8 is generally considered a good baseline for a well-modeled domain. Below 0.6 indicates systematic retrieval problems that need to be addressed before the system is production-ready.

Per-question breakdown

In addition to the aggregate score, the evaluation output includes a per-question breakdown showing which questions passed, which failed, and what the system returned. Use this to identify patterns in failures rather than treating hit rate as a single number to optimize.

Common failure patterns and what they indicate:

Pattern	Likely cause
Structured questions failing (names, dates, values)	Anchor or attribute missing from data model, or ETL did not complete
Document questions failing	Poor chunking, missing embeddings, or wrong playbook tool selection
A whole category of questions failing	Playbook routing the wrong intent to the wrong tool
Inconsistent results across similar questions	Embed threshold too low or too high for semantic search fields

Iterating on results

Evaluation is most useful as an iteration loop, not a one-time check. After identifying failure patterns:

Adjust the data model, playbook, or chunking configuration in Grist
Re-run ETL
Re-run the evaluation
Compare hit rate before and after

Each iteration should be targeted — change one thing at a time so the effect of each adjustment is measurable. Adding questions to the golden dataset over time, particularly questions derived from real user queries, will make the benchmark progressively more representative of production behavior.