Adding Documents
Vedana ingests documents through the same Anchor / Link / Attribute mechanism as any other entity — there is no special “document” code path. The convention in this guide (and in test fixtures) is to declare anchors document and document_chunk and a link between them, then point an embeddable content attribute at the chunk text.
Read first: Documents and Chunks. There is no built-in chunking step in the default ETL (prepare_nodes is a pass-through). You either pre-chunk the document text before loading it into Grist, or you add a custom step in your own ETL.
1. Prepare the files
Supported:
- PDF, DOCX, TXT, Markdown, HTML, exported Google Docs, CSV (as text).
Before uploading:
- check that the text is extracted correctly (especially from PDF — many parsers mangle tables and columns);
- remove boilerplate pages (cover pages, tables of contents) if they hurt semantic search;
- split very large files into logical sections if they’re too heterogeneous.
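The extraction check above can be partly automated. The following is a minimal sketch, not part of Vedana: the function name `extraction_report` and the chosen signals are assumptions for illustration. It computes a few rough indicators that extracted text was mangled (undecodable characters, many very short lines from flattened tables, letter-spaced words).

```python
import re


def extraction_report(text: str) -> dict:
    """Return rough quality signals for extracted document text (hypothetical helper)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    words = text.split()
    # U+FFFD is the replacement character decoders emit for undecodable bytes.
    replacement_chars = text.count("\ufffd")
    # Many lines of one or two words often mean a table or column layout was flattened.
    short_lines = sum(1 for ln in lines if len(ln.split()) <= 2)
    # Five or more single-character words in a row usually come from letter-spaced PDF text.
    single_letter_runs = len(re.findall(r"(?:\b\w\b\s+){5,}", text))
    return {
        "words": len(words),
        "replacement_chars": replacement_chars,
        "short_line_ratio": short_lines / len(lines) if lines else 0.0,
        "single_letter_runs": single_letter_runs,
    }


clean = "Returns are accepted within 14 days.\nContact support for exchanges."
print(extraction_report(clean))
```

Which values count as "bad" depends on your documents; treat the numbers as a prompt to open the file and look, not as a pass/fail gate.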
2. Upload to Grist > Data > Anchor_document
GristDataProvider discovers anchor data by table-name prefix: every table named Anchor_<noun> is treated as the data for the matching anchor (vedana_core/data_provider.py:69). So for a document anchor, create a table called Anchor_document with the columns that map to the anchor’s attributes:
| id | title | source_url | content |
|---|---|---|---|
| doc-001 | Returns and exchanges | https://acme.example.com/policy/refund | (full text) |
| doc-002 | Warranty policy 2026 | https://acme.example.com/policy/warranty | (full text) |
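To make the naming convention concrete, here is a small sketch of prefix-based discovery. It is an illustration only, not the code in vedana_core/data_provider.py: the function `anchors_from_tables` is hypothetical, and case-sensitive matching is an assumption.

```python
ANCHOR_PREFIX = "Anchor_"  # the table-name prefix described in this guide


def anchors_from_tables(table_names: list[str]) -> dict[str, str]:
    """Map anchor noun -> table name for every table that carries the prefix."""
    found = {}
    for name in table_names:
        # Assumption: matching is case-sensitive and the noun must be non-empty.
        if name.startswith(ANCHOR_PREFIX) and len(name) > len(ANCHOR_PREFIX):
            found[name[len(ANCHOR_PREFIX):]] = name
    return found


tables = ["Anchor_document", "Anchor_document_chunk", "Notes", "anchor_faq"]
print(anchors_from_tables(tables))
# {'document': 'Anchor_document', 'document_chunk': 'Anchor_document_chunk'}
```

The practical takeaway is that a typo in the table name (for example a lowercase prefix) would most likely leave the anchor without data, so check table names first when documents fail to load.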
The content field is the full extracted text. You are responsible for splitting it into chunks before storing — either by pre-chunking and writing rows into a separate Anchor_document_chunk table, or by adding a chunking step to your custom ETL.
Alternatively, if there are many documents:
- store them in an S3 bucket and put the link in source_url, while extracting content in custom ETL;
- keep the texts in another DB and load them through a custom ETL step.
3. Configure chunking (if needed)
There is no built-in chunking step in the default ETL — prepare_nodes returns the input DataFrame unchanged. Recommended chunk sizes (300–800 tokens, with 0–50 token overlap for documents where context across paragraphs matters) are a target for your own pre-processing or a custom Datapipe step you add via Custom ETL.
When to tune:
- very short documents (FAQ-style) → smaller chunks, no overlap;
- very long structured documents (contracts, regulations) → more overlap so heading terms appear in detail chunks.
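As a starting point for your own pre-processing, the sketch below splits text into overlapping windows and produces rows for an Anchor_document_chunk table. Several things here are assumptions: whitespace-separated words stand in for model tokens (real token counts will be somewhat higher), and the row fields `id`, `document_id`, and `content` are hypothetical column names that you should match to the attributes you declared.

```python
def chunk_tokens(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in the range [0, size)")
    tokens = text.split()  # approximation: one whitespace-separated word = one token
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def chunk_rows(doc_id: str, text: str, size: int = 500, overlap: int = 50) -> list[dict]:
    """Build one row per chunk; the field names are hypothetical."""
    return [
        {"id": f"{doc_id}-chunk-{i:03d}", "document_id": doc_id, "content": chunk}
        for i, chunk in enumerate(chunk_tokens(text, size, overlap), start=1)
    ]


sample = " ".join(f"t{i}" for i in range(10))
for row in chunk_rows("doc-001", sample, size=4, overlap=1):
    print(row["id"], "|", row["content"])
```

For FAQ-style documents you would call this with a small `size` and `overlap=0`; for contracts or regulations, a larger `overlap` keeps heading terms inside the detail chunks.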
4. Run ETL
Backoffice → ETL → Run Selected for:
- data_model_steps (if you changed the default model);
- grist_steps (load documents);
- default_custom_steps (chunk them, if you added a chunking step; the default ETL does not chunk);
- memgraph_steps (load into the graph + build embeddings).
5. Verify in Memgraph Lab
// edge label below depends on the `sentence` you declared in Grist > Links.
// The recommended form is ANCHOR1_verb_ANCHOR2 — e.g. DOCUMENT_has_DOCUMENT_CHUNK.
// If you declared it differently, substitute your label here.
MATCH (d:document)-[:DOCUMENT_has_DOCUMENT_CHUNK]-(c:document_chunk)
RETURN d.title, count(c) AS num_chunks
ORDER BY num_chunks DESC
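Each loaded document should appear with its chunk count. A document missing from the result has no linked chunks, because MATCH only returns documents that have at least one.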
MATCH (c:document_chunk) RETURN c.content LIMIT 3
The chunk content should be human-readable.
6. Verify in chat
Ask a document question:
“What does our return policy say about returns after 14 days?”
In Details, a tool call vector_text_search(label="document_chunk", property="content", text="...") should appear. The assistant’s answer should be grounded in the retrieved chunks.
7. If answers are bad
| Symptom | What to fix |
|---|---|
| The assistant doesn’t find a document that exists | embed_threshold too high → lower to 0.55–0.65 for chunk content. |
| The assistant finds a lot of irrelevant material | embed_threshold too low → raise it. |
| The assistant gets facts confused | Chunks are too big — chunk smaller. |
| Context is lost between chunks | Add overlap (10–20% of chunk size). |
| It doesn’t call vector search at all | Playbook problem — add a “document question” scenario. |
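The first two rows of the table can be illustrated with a toy example. This sketch assumes the threshold acts as a minimum cosine similarity between the query embedding and each chunk embedding; that is consistent with the table but is not taken from Vedana's code. The two-dimensional vectors and the `search` helper are made up for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, chunks, threshold):
    """Return (score, text) pairs at or above the threshold, best first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    return sorted([(s, t) for s, t in scored if s >= threshold], reverse=True)


# Toy embeddings: real ones have hundreds of dimensions.
chunks = [
    ("refund policy", [0.9, 0.1]),
    ("warranty terms", [0.6, 0.8]),
    ("office hours", [0.1, 0.9]),
]
query = [1.0, 0.0]

for threshold in (0.75, 0.55, 0.05):
    hits = [text for _, text in search(query, chunks, threshold)]
    print(threshold, hits)
```

Lowering the threshold admits the partially related chunk, and lowering it much further admits the unrelated one as well, which is the trade-off the table describes.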
8. Source URLs / citations
To let the assistant cite sources, in the playbook (Queries) write:
3) Format the answer as: "<answer text> (Source: <document.title>, <document.source_url>)"
With this instruction, the LLM should append the document title and link to each answer; spot-check a few answers to confirm the format is followed.
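If you want to check the citation format automatically (for example when reviewing saved answers), a regular expression is enough. The sketch below assumes the exact format from the playbook line above, an http(s) URL, and a title without commas or parentheses; `extract_citation` is a hypothetical helper, not a Vedana function.

```python
import re

# Matches a trailing "(Source: <title>, <url>)" at the end of an answer.
CITATION = re.compile(
    r"\(Source: (?P<title>[^,()]+), (?P<url>https?://[^\s()]+)\)\s*$"
)


def extract_citation(answer: str):
    """Return (title, url) if the answer ends with a citation, else None."""
    match = CITATION.search(answer.strip())
    if match is None:
        return None
    return match["title"].strip(), match["url"]


answer = (
    "Returns after 14 days need manager approval. "
    "(Source: Returns and exchanges, https://acme.example.com/policy/refund)"
)
print(extract_citation(answer))
```

Titles that contain commas will not match this pattern; if your documents have such titles, loosen the title group or change the separator in the playbook format.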
Best practices
- Always pair documents with FAQ. Users ask basic questions — let FAQ answer them deterministically. Documents stay for deeper / specific questions.
- Don’t dump the whole knowledge base into one file. Better to have dozens of documents with meaningful titles — improves vector search results.
- Run the golden dataset on document questions regularly — you’ll quickly notice if a new document broke existing scenarios.
What’s next
- Tuning Embeddings — how to choose thresholds.
- Adding FAQ Entries — for canonical answers.
- Adding Structured Data — hybrid approach (document + structured attributes).