NL→SQL planner
NL→SQL planner
services/nl_to_sql.py. Turns a natural-language question into
validated, scope-predicated SQL.
Pipeline
question │ ▼ embed │ ▼ schema retrieval — top-K relevant tables via cosine over │ schema_column_embeddings (per-connection) │ ▼ LLM planner call — task=`planner` via AIProvider task routing │ System prompt includes: │ - the retrieved table DDL │ - the connection's dialect (postgres / mysql / bq / sf / ...) │ - row-cap target │ - scope predicates declared on the connection │ - examples per dialect │ ▼ raw SQL │ ▼ sqlglot parse + safety check (SELECT-only, no DROP/DELETE/TRUNCATE) │ ▼ scope predicates AND-injected (planner can't omit them) │ ▼ row-cap LIMIT injected (caps the result set) │ ▼ EXPLAIN (FORMAT JSON) preflight — block if estimated rows > threshold │ ▼ validated SQL │ ▼ driver execute │ ▼ result + auto-inferred chart specSchema retrieval
Each connection’s tables + columns + types + comments are embedded
once at introspection time into schema_column_embeddings. When a
question lands, we embed the question and pull the top-K most-
similar columns. The planner sees only those tables — never the
full warehouse schema.
This is what makes NL→SQL tractable on 1000-table warehouses.
LLM provider routing
The planner uses the AIProvider with task_routing.planner = true.
Configure under AI providers. You typically
want a deterministic model here (temp 0, low-creativity) — wrong
SQL is worse than no SQL.
Why sqlglot
sqlglot parses + AST-walks SQL across 20+ dialects. We use it to:
- Reject DROP / DELETE / TRUNCATE / GRANT / etc. (SELECT-only).
- Inject scope predicates as additional WHERE clauses.
- Inject LIMIT clauses when the SQL has none.
- Detect cross-connection references (which we reject).
- Format SQL for display.
EXPLAIN preflight
Before executing, we run EXPLAIN (FORMAT JSON) (or the dialect
equivalent for non-Postgres) and read the estimated row count. If
the estimate exceeds the configured threshold (db.explain.max_rows
in settings, default ~1e9), we block with a clear error: “this
query would scan ~12B rows; refine your filters.”
pg_class.reltuples is consulted as a fallback signal for tables
where the planner statistics are stale.
What the planner does NOT do
- Doesn’t auto-JOIN across connections.
- Doesn’t invent columns. If the question requires a column that doesn’t exist in the retrieved tables, the planner answers “no matching column” instead of fabricating.
- Doesn’t write back. Any SQL the planner produces that’s not a SELECT is rejected by the validator before it touches the driver.