Does your lab have a dataset that has been sitting in a shared folder for
months because the cleanup is so painful nobody wants to start? Maybe the
column names changed halfway through the study, someone hand-edited rows in
Excel without documenting it, or the units are so inconsistent that fixing
one column breaks assumptions in another.
Now, you can point an AI model at the file and tell it everything you know
about the data's history. It proposes a cleaning strategy before touching
anything: what it will standardise, what it will infer, and what it will flag
for you to decide. Then it writes and runs a pipeline that does the actual
work. Every transformation gets documented in a change log: the original value,
the new value, and why it was changed.
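What goes into such a change log is simple to pin down. A minimal sketch in Python (the column layout and helper name are illustrative, not a prescribed format):

```python
import csv
from datetime import datetime, timezone

def log_change(log_path, row_id, column, original, new, reason):
    """Append one transformation to the change-log CSV:
    timestamp, row, column, original value, new value, and the reason."""
    with open(log_path, "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([
            datetime.now(timezone.utc).isoformat(),
            row_id, column, original, new, reason,
        ])
```

Because every row carries both values and a reason, the log doubles as the audit trail for your methods section.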
Ambiguous cases get handled carefully. A unit that could be milligrams or
micrograms gets checked against known values in the literature so you can make
an educated fix. Columns where the meaning shifted mid-study get flagged for
your review instead of silently guessed. You make the judgment calls on the
hard ones. It handles everything else.
You end up with a cleaned dataset, a citable change log for your methods
section, and a list of edge cases it could not resolve on its own. The change
log means your cleaning process is reproducible and auditable. The whole thing
takes an afternoon instead of the week you have been putting off.
This workflow really needs the power-user setup: locked-down raw data, plan mode, and a reusable skill.
Lock down raw data before the agent sees it
The first move is to make it impossible for the agent to corrupt
your ground truth. Ask Claude to lock down your raw-data folder
as read-only at the harness level, so no matter what happens
mid-run, the model can't edit, overwrite, or delete anything
under it. Claude works out the right config files and rules and
shows you what it set before you start cleaning.
Prompt: lock down raw data
Before we clean anything, I want you unable to edit, overwrite, or delete my raw data, no matter what I ask later. Set up whatever CLAUDE.md and permission rules you need so the harness blocks any destructive action against my raw-data folder. Propose a sensible folder layout for raw versus processed versus AI-generated files. When it's in place, show me what you set and confirm the blocks are active.
Enter plan mode and propose a cleaning strategy
In the Claude Code panel, press Shift+Tab twice
to enter plan mode. The agent can read files but cannot edit or run
anything until you approve. Ask it to read your messy CSV and write a
plan grouped into three buckets: what it will standardise automatically,
what it will infer with the inference logged, and what it will flag for
you to decide. You read, push back, then approve.
Prompt: propose a plan before any code runs
I want to clean the messy CSV in data/raw/. Before you write or run anything, read the file and propose a cleaning plan. Group your proposal into three buckets:
1. What you will standardise automatically (column renames, whitespace, date formats, obvious typos).
2. What you will infer with the inference logged (units that shifted mid-study, columns whose meaning changed, suspected hand-edits).
3. What you will flag for me to decide rather than guess.
For every bucket, name the columns and rows you are touching and the rule you would apply. Do not write code yet. Wait for me to approve or edit the plan.
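The three-bucket plan is worth treating as a structured artifact rather than free text, because you will want to diff it against what actually ran. A minimal sketch of one way to represent it (the class and field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class PlanItem:
    column: str   # which column the rule touches
    rule: str     # the transformation or inference, stated precisely
    rows: str     # scope, e.g. "all" or "rows 210 onward"

@dataclass
class CleaningPlan:
    standardise: list[PlanItem] = field(default_factory=list)  # applied automatically
    infer: list[PlanItem] = field(default_factory=list)        # applied, inference logged
    flag: list[PlanItem] = field(default_factory=list)         # held for human judgement
```

Anything the agent cannot place confidently in the first two buckets belongs in `flag`; the plan is only approved once that bucket looks right to you.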
The lock-down from the previous step is what makes this safe. Even
if the agent decides mid-execution that it just needs to tweak one
value in the raw file, the harness blocks the write before it
happens.
Promote the working pattern into a reusable skill
Once the plan-then-execute loop produces a clean batch you trust,
capture it as a reusable skill so the next batch is a single
command. Ask Claude to turn the pattern you just ran into a
self-contained skill in your project. From then on the cleaning
routine enforces itself every time a new file arrives.
Prompt: save the workflow as a skill
The cleaning pattern we just worked out is going to repeat every time a new batch lands. Turn it into a reusable skill I can invoke with one command on future files. The skill should enforce the same discipline we just used: read the file first, propose a plan grouped into standardise / infer / flag, wait for my approval before touching anything, then write out the cleaned file, a change log, and a list of ambiguous rows that need human judgement. When you're done, test it on the current file and confirm it works end to end.
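For reference, a Claude Code skill is a folder containing a SKILL.md whose frontmatter tells the agent when to use it. A minimal sketch of what the generated skill might contain, assuming the standard `.claude/skills/clean-batch/SKILL.md` layout (the name, path, and wording are illustrative; Claude's actual output will differ):

```markdown
---
name: clean-batch
description: Clean a new raw-data batch using the lab's plan-then-execute discipline
---

When invoked on a file in data/raw/:

1. Read the file. Never modify it; data/raw/ is read-only.
2. Propose a plan grouped into standardise / infer / flag, and wait for approval.
3. After approval, write the cleaned file to data/processed/, append every
   transformation to the change log, and list ambiguous rows for human review.
```

The point of baking the discipline into the skill body is that future batches get the same plan-first, approval-gated treatment without you restating it.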
Build canonical lookups against a reference you trust
The genuinely hard part of any cleanup is the lookup table that
maps every variant of a value to its one canonical form. For
tasks like these, we often don't need to invent the workflow
ourselves. For example, in his
Claude Code for computational biology
write-up, Brian Naughton documented the extract-reimplement-test
pattern: he rebuilt a protein scoring function inside a new
codebase by diffing against the original until every value
matched. Same idea here: clone the source, extract the relevant
slice, reimplement the lookup in your script, then diff every
mapping back against the reference until the values match exactly.
Prompt: extract, reimplement, diff against the reference
The hardest part of this cleanup is the canonical lookup: the table that maps every variant of a value to its one trusted form. Instead of inventing one, build it from a reference I already trust. I'll point you at the source (a repo, a public dataset, or a file on a shared drive). Extract the relevant slice, save it clearly labelled as AI-generated, and wire the lookup into our cleaning script so it reads from that file rather than your guesses. Then diff every distinct raw value in our file against the reference, row by row. Report any mismatch and stop before overwriting. Fold the resulting lookup into the cleaning skill so future batches use it automatically.
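The diff step at the heart of that prompt is mechanical enough to sketch. Assuming the reference has been extracted to a two-column CSV of raw and canonical values (file layout and function names are illustrative):

```python
import csv

def load_lookup(path):
    """Read a two-column CSV (raw_value, canonical_value) into a dict."""
    with open(path, newline="") as f:
        return {row["raw_value"]: row["canonical_value"]
                for row in csv.DictReader(f)}

def diff_against_reference(ours, reference):
    """Return (raw, ours, reference) triples wherever our mapping
    disagrees with the trusted reference. Empty list means a clean diff."""
    mismatches = []
    for raw, canonical in ours.items():
        expected = reference.get(raw)
        if expected is not None and expected != canonical:
            mismatches.append((raw, canonical, expected))
    return mismatches
```

The cleaning script should refuse to overwrite anything until `diff_against_reference` comes back empty, which is the same stop-on-mismatch discipline the prompt asks for.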
If the reference data lives on a shared lab drive rather than a
repo, tell Claude where it is and ask it to give itself read
access to the shared path. The lookup script then pulls from the
canonical source directly, with no copies drifting across
machines.