Monitoring & Metrics

Vedana exposes Prometheus metrics and OpenTelemetry traces. This document covers the key metrics, recommended dashboards, and the alerts to set.

See also: Architecture → Observability.

Key metrics

LLM (cost & throughput)

MetricWhat it shows
llm_calls_total{model}number of LLM calls per model
llm_usage_prompt_tokens_total{model}how many prompt tokens were used
llm_usage_completion_tokens_total{model}completion tokens

Dashboard:

  • request rate per minute per model;
  • USD cost (via a recording rule: tokens × $/1k_tokens);
  • share of cached_tokens (when the provider supports it);
  • top-5 models by traffic.

Alerts:

  • rate(llm_usage_prompt_tokens_total[1h]) > X — sudden cost spike;
  • a >80% drop over 1h — your provider may have died or you’re missing traffic.

Pipeline (latency & reliability)

MetricWhat it shows
jims_pipeline_runs_total{status,pipeline}number of pipeline runs
jims_pipeline_run_duration_seconds_bucket{status,pipeline}duration histogram

Dashboard:

  • p50 / p95 / p99 latency for RagPipeline (success);
  • error rate: rate(jims_pipeline_runs_total{status="failure"}[5m]);
  • breakdown by pipeline type (RagPipeline, StartPipeline, custom ones).

Alerts:

  • error_rate > 1% over 5 minutes;
  • p95_latency > 15s over 10 minutes;
  • success_rate == 0 for 1 minute (zero traffic — something fell over).

Database

postgres_exporter and memgraph_exporter are recommended optional add-ons, not part of Vedana’s repo or apps/vedana/docker-compose.yml. To get the metrics below you need to install and scrape them yourself.

Standard Postgres metrics through postgres_exporter (e.g. prometheus-community/postgres_exporter):

  • connections;
  • query duration;
  • size of thread_events, rag_anchor_embeddings;
  • bloat and autovacuum activity.

Memgraph through a Memgraph metrics exporter (if used):

  • query duration;
  • memory usage;
  • vector / text index size.

Application logs

All components write to stdout via loguru / standard logging. In production, aggregate through Loki / ELK / Datadog.

Levels:

  • INFO — normal operation.
  • WARNING — something suspicious (data model filtering fell back to full DM, tool-call iteration limit hit).
  • ERROR / EXCEPTION — something failed.

Useful patterns for grep / Loki queries:

  • RagPipeline — every main pipeline run;
  • Data model filter selection — what filtering picked;
  • Reached tool call iteration limit — the pipeline hit 5 iterations;
  • Error executing tool — a tool call failed;
  • vts_fn(on=...) / cypher_fn(...) — debug for tool calls.

OpenTelemetry traces

A typical trace flow (LLM spans below are illustrative; in Jaeger/Tempo you’ll see span names produced by openinference.instrumentation.litellm — typically litellm.completion — not the llm.chat_completion_* labels):

jims.run_pipeline_with_context (12.4s)
├── memgraph.execute_ro_cypher_query (0.1s)
├── litellm.completion (1.2s) [filtering, structured]
├── litellm.completion (3.5s) [iter 1, with tools]
│   ├── memgraph.execute_ro_cypher_query (0.2s)
│   └── pgvector.vector_search (0.1s)
├── litellm.completion (4.1s) [iter 2, with tools]
│   └── memgraph.execute_ro_cypher_query (0.3s)
└── litellm.completion (3.0s) [iter 3 - final answer]

This shows where time is actually spent. The bottleneck is usually LLM calls; sometimes a slow Cypher.

Use Jaeger / Tempo / Datadog APM for visualisation.

Business metrics

In addition to the technical metrics, it’s worth counting:

  • rate of unique users per day (COUNT(DISTINCT contact_id) in threads);
  • average thread length (COUNT(thread_events) / COUNT(threads));
  • share of rag.error events — quality drops or LLM provider failures;
  • share of smalltalk vs real questions — helps assess whether the assistant is being used as intended;
  • CSAT — if you give the user a “useful/not” button, log it as comm.user_feedback. Note: this is not a built-in event type — Vedana doesn’t produce comm.user_feedback itself. You add the rating capture in your interface (widget / Telegram / your API) and write it through ctx.send_event("comm.user_feedback", {...}) or ctl.store_event_dict(...). Pick any event_type you like; comm.user_feedback is just a convention.

Sampling real conversations

The most valuable thing for the product team — once a week, take a random sample of 20–50 threads and do manual review:

  • did the assistant answer correctly?
  • did it use the right tool?
  • was the answer useful for the user?

This ritual gives qualitative feedback that the golden dataset can’t.

Alerting

A minimal alert set for production:

AlertSeverityCondition
Pipeline error rate > 1%highrate(jims_pipeline_runs_total{status="failure"}[5m]) > 0.01
Pipeline p95 > 15smediumhistogram_quantile(0.95, jims_pipeline_run_duration_seconds_bucket) > 15
LLM tokens spike (>2x baseline)mediumrate(llm_usage_prompt_tokens_total[15m]) > 2 * baseline
Postgres connections > 80%highpg_stat_activity_count > 0.8 * max_connections
Memgraph memory > 80%highprocess_resident_memory_bytes > 0.8 * limit
API healthcheck downcriticalup{job="vedana-api"} == 0 for 1 minute
Sentry: rate of new errors > 10/minmediumvia Sentry alert rules

Dashboards (recommendations)

Ops dashboard

  • request rate (per service);
  • latency p50/p95/p99 (per service);
  • error rate;
  • DB connections / query latency;
  • memory / CPU.

LLM dashboard

  • tokens per minute (per model);
  • cost per minute (per model);
  • cache hit ratio;
  • top scenarios by cost.

Product dashboard

  • DAU / MAU;
  • average thread length;
  • rag.error share;
  • smalltalk share;
  • CSAT (if available).

What’s next