Skip to content

NL→SQL planner

NL→SQL planner

services/nl_to_sql.py. Turns a natural-language question into validated, scope-predicated SQL.

Pipeline

question
▼ embed
▼ schema retrieval — top-K relevant tables via cosine over
│ schema_column_embeddings (per-connection)
▼ LLM planner call — task=`planner` via AIProvider task routing
│ System prompt includes:
│ - the retrieved table DDL
│ - the connection's dialect (postgres / mysql / bq / sf / ...)
│ - row-cap target
│ - scope predicates declared on the connection
│ - examples per dialect
▼ raw SQL
▼ sqlglot parse + safety check (SELECT-only, no DROP/DELETE/TRUNCATE)
▼ scope predicates AND-injected (planner can't omit them)
▼ row-cap LIMIT injected (caps the result set)
▼ EXPLAIN (FORMAT JSON) preflight — block if estimated rows > threshold
▼ validated SQL
▼ driver execute
▼ result + auto-inferred chart spec

Schema retrieval

Each connection’s tables + columns + types + comments are embedded once at introspection time into schema_column_embeddings. When a question lands, we embed the question and pull the top-K most- similar columns. The planner sees only those tables — never the full warehouse schema.

This is what makes NL→SQL tractable on 1000-table warehouses.

LLM provider routing

The planner uses the AIProvider with task_routing.planner = true. Configure under AI providers. You typically want a deterministic model here (temp 0, low-creativity) — wrong SQL is worse than no SQL.

Why sqlglot

sqlglot parses + AST-walks SQL across 20+ dialects. We use it to:

  • Reject DROP / DELETE / TRUNCATE / GRANT / etc. (SELECT-only).
  • Inject scope predicates as additional WHERE clauses.
  • Inject LIMIT clauses when the SQL has none.
  • Detect cross-connection references (which we reject).
  • Format SQL for display.

EXPLAIN preflight

Before executing, we run EXPLAIN (FORMAT JSON) (or the dialect equivalent for non-Postgres) and read the estimated row count. If the estimate exceeds the configured threshold (db.explain.max_rows in settings, default ~1e9), we block with a clear error: “this query would scan ~12B rows; refine your filters.”

pg_class.reltuples is consulted as a fallback signal for tables where the planner statistics are stale.

What the planner does NOT do

  • Doesn’t auto-JOIN across connections.
  • Doesn’t invent columns. If the question requires a column that doesn’t exist in the retrieved tables, the planner answers “no matching column” instead of fabricating.
  • Doesn’t write back. Any SQL the planner produces that’s not a SELECT is rejected by the validator before it touches the driver.