Natural Language to Reproducible Pandas Code
A data agent is useful only if its natural-language answer leaves behind executable, inspectable code.
Site connection
The Data Cleaning Assistant turns plain-English CSV cleaning requests into reproducible pandas scripts and cleaned outputs.
The Translation Problem
A user says, 'make hired a boolean, remove bad rows, and graph followers against following.' None of that is fully specified. The agent has to infer which columns are meant, their data types, the missing-value rules that define a "bad" row, the expected outputs, and whether a change should mutate the file or create a derived column.
A robust data-cleaning agent should produce three things:
- A cleaned dataset the user can download.
- A script that can reproduce the transformation.
- A short audit trail explaining assumptions and rows affected.
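As a rough illustration of those three outputs, the example request might translate into something like the sketch below. The sample data, the `clean` function, the reading of "bad rows" as rows with missing values, and the yes/no encoding of `hired` are all assumptions made for this example. The plotting step is left out.

```python
import pandas as pd

# Hypothetical sample data; column names are taken from the example request.
raw = pd.DataFrame({
    "hired": ["yes", "no", "YES", None],
    "followers": [120, 45, None, 300],
    "following": [80, 60, 10, 150],
})

def clean(df):
    """Return a cleaned copy plus an audit trail of assumptions and row counts."""
    audit = []
    out = df.copy()  # work on a copy so the caller's frame is untouched

    # Assumption: "bad rows" means rows missing any field the request touches.
    required = ["hired", "followers", "following"]
    before = len(out)
    out = out.dropna(subset=required)
    audit.append(
        f"Dropped {before - len(out)} of {before} rows with missing values in {required}"
    )

    # Assumption: only 'yes'/'no' (any case) are valid; anything else fails loudly.
    mapping = {"yes": True, "no": False}
    normalized = out["hired"].str.strip().str.lower()
    unknown = sorted(set(normalized) - set(mapping))
    if unknown:
        raise ValueError(f"Unrecognized values in 'hired': {unknown}")
    out["hired"] = normalized.map(mapping).astype(bool)
    audit.append("Mapped 'hired' yes/no to bool")
    return out, audit

cleaned, audit = clean(raw)
for line in audit:
    print(line)
```

The cleaned frame could then be written out with `cleaned.to_csv(...)`, while the script and the audit lines are returned to the user alongside it.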
Reproducibility as the Product
The script is not a side effect; it is the trust object. A cleaned CSV without code is hard to audit and hard to reuse when the source file changes.
Generated code should be small, dependency-light, and deterministic. It should avoid hidden state, unseeded randomness, and guessing at column names when the schema is unclear.
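One possible shape for such a script is sketched below: the expected schema is declared up front, a missing column raises an error instead of being guessed at, and the transformation is a pure function with a stable sort. `REQUIRED_COLUMNS`, `load_and_validate`, `transform`, and the derived `follow_ratio` column are hypothetical names chosen for this example.

```python
import io
import pandas as pd

REQUIRED_COLUMNS = ["hired", "followers", "following"]  # hypothetical schema

def load_and_validate(source):
    """Read a CSV and fail fast if an expected column is absent."""
    df = pd.read_csv(source)
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}. Found: {list(df.columns)}")
    return df

def transform(df):
    """Pure function: no globals, no randomness, input frame left unchanged."""
    out = df.copy()
    # Derived column instead of overwriting an existing one.
    out["follow_ratio"] = out["followers"] / out["following"]
    # Stable sort keeps the row order reproducible when values tie.
    return out.sort_values("followers", kind="stable").reset_index(drop=True)

# In-memory CSV stands in for the user's uploaded file.
sample = io.StringIO("hired,followers,following\nyes,120,80\nno,45,60\n")
result = transform(load_and_validate(sample))
print(result)
```

Because the script depends only on pandas and its own constants, rerunning it against an updated source file either reproduces the same steps or stops with a clear schema error.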
Execution Safety
Running generated code needs guardrails: whitelisted imports, file sandboxing, timeouts, memory limits, and clear separation between code generation and execution.
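A minimal sketch of the import-whitelisting step, using Python's standard `ast` module to inspect generated code before anything runs. The `ALLOWED_IMPORTS` and `BLOCKED_CALLS` sets and the `check_generated_code` function are illustrative assumptions, not a complete policy.

```python
import ast

ALLOWED_IMPORTS = {"pandas", "numpy", "re", "math"}  # hypothetical whitelist
BLOCKED_CALLS = {"eval", "exec", "open", "__import__", "compile"}  # illustrative

def check_generated_code(source):
    """Return a list of policy violations found by static inspection."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        # Collect module names from both `import x` and `from x import y`.
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for name in modules:
            if name.split(".")[0] not in ALLOWED_IMPORTS:
                violations.append(f"import of '{name}' is not whitelisted")
        # Flag direct calls to a few dangerous builtins.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                violations.append(f"call to '{node.func.id}' is blocked")
    return violations

print(check_generated_code("import pandas as pd\ndf = pd.DataFrame()"))
print(check_generated_code("import os\nos.system('ls')"))
```

A static check like this is only a first filter and can be bypassed by determined code. It does not replace running the script in an isolated process with a restricted filesystem, a timeout, and a memory limit.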
The user should be able to inspect the code before trusting the output, and the system should show shape changes, null counts, and modified columns after execution.
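The post-execution summary could be assembled along these lines. `change_report` is a hypothetical helper, and the sketch assumes the cleaning step keeps unique index labels so that surviving rows can be matched between the two frames.

```python
import pandas as pd

def change_report(before, after):
    """Summarize shape, null counts, and column changes between two frames.

    Assumption: index labels are unique and preserved by the cleaning step.
    """
    shared_rows = before.index.intersection(after.index)
    shared_cols = [c for c in before.columns if c in after.columns]
    # A column counts as modified if values or dtype differ on surviving rows.
    modified = [
        c for c in shared_cols
        if not before.loc[shared_rows, c].equals(after.loc[shared_rows, c])
    ]
    return {
        "shape_before": before.shape,
        "shape_after": after.shape,
        "nulls_before": {c: int(n) for c, n in before.isna().sum().items()},
        "nulls_after": {c: int(n) for c, n in after.isna().sum().items()},
        "added_columns": [c for c in after.columns if c not in before.columns],
        "removed_columns": [c for c in before.columns if c not in after.columns],
        "modified_columns": modified,
    }

# Hypothetical before/after pair: drop incomplete rows, then convert 'hired'.
before = pd.DataFrame({"hired": ["yes", "no", None], "followers": [1.0, None, 3.0]})
after = before.dropna().assign(hired=lambda d: d["hired"] == "yes")
report = change_report(before, after)
for key, value in report.items():
    print(f"{key}: {value}")
```

Comparing only the rows that survive keeps a row drop from marking every column as modified, so the report separates "rows removed" from "values changed."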
| Risk | Guardrail |
|---|---|
| Bad column assumption | Preview the schema, then ask the user or infer conservatively |
| Destructive mutation | Work on a copy and keep original available |
| Unsafe code | Whitelist imports and block arbitrary system calls |
| Silent row loss | Report before/after row counts |
| Non-reproducibility | Export the exact script |
Common Pitfalls
- Returning a verbal answer without code.
- Silently dropping rows.
- Using in-place mutations that hide intermediate states and make debugging harder.
- Conflating data cleaning with statistical modeling.
- Giving generated code broad access to the network or filesystem.
Quick check
Why should the agent export the pandas script?
- So the transformation can be audited and rerun
- To make the UI slower
- To avoid cleaning the data
- To hide assumptions
Answer: so the transformation can be audited and rerun. The script is the reproducible record of the cleaning operation.