Does your lab have a dataset that has been sitting in a shared folder for
months because the cleanup is so painful nobody wants to start? Maybe the
column names changed halfway through the study, someone hand-edited rows in
Excel without documenting it, or the units are so inconsistent that fixing
one column breaks assumptions in another.
Now, you can point an AI model at the file and tell it everything you know
about the data's history. It proposes a cleaning strategy before touching
anything: what it will standardise, what it will infer, and what it will flag
for you to decide. Then it writes and runs a pipeline that does the actual
work. Every transformation gets documented in a change log: the original value,
the new value, and why it was changed.
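What goes into such a change log is simple to pin down. A minimal sketch in Python (the column layout and helper name are illustrative, not a prescribed format):

```python
import csv
from datetime import datetime, timezone

def log_change(log_path, row_id, column, original, new, reason):
    """Append one transformation to the change-log CSV:
    timestamp, row, column, original value, new value, and the reason."""
    with open(log_path, "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([
            datetime.now(timezone.utc).isoformat(),
            row_id, column, original, new, reason,
        ])
```

Because every row carries both values and a reason, the log doubles as the audit trail for your methods section.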
Ambiguous cases get handled carefully. A unit that could be milligrams or
micrograms gets checked against known values in the literature so you can make
an educated fix. Columns where the meaning shifted mid-study get flagged for
your review instead of silently guessed. You make the judgment calls on the
hard ones. It handles everything else.
You end up with a cleaned dataset, a citable change log for your methods
section, and a list of edge cases it could not resolve on its own. The change
log means your cleaning process is reproducible and auditable. The whole thing
takes an afternoon instead of the week you have been putting off.
This workflow really needs the power-user setup: locked-down raw data, plan mode, and a reusable skill.
Lock down raw data before the agent sees it
The first move is to make it impossible for the agent to corrupt
your ground truth. Ask Claude to lock down your raw-data folder
as read-only at the harness level, so no matter what happens
mid-run, the model can't edit, overwrite, or delete anything
under it. Claude works out the right config files and rules and
shows you what it set before you start cleaning.
Prompt: lock down raw data
Before we clean anything, I want you unable to edit, overwrite, or delete my raw data, no matter what I ask later. Set up whatever CLAUDE.md and permission rules you need so the harness blocks any destructive action against my raw-data folder. Propose a sensible folder layout for raw versus processed versus AI-generated files. When it's in place, show me what you set and confirm the blocks are active.
Enter plan mode and propose a cleaning strategy
In the Claude Code panel, press Shift+Tab twice
to enter plan mode. The agent can read files but cannot edit or run
anything until you approve. Ask it to read your messy CSV and write a
plan grouped into three buckets: what it will standardise automatically,
what it will infer with the inference logged, and what it will flag for
you to decide. You read, push back, then approve.
Prompt: propose a plan before any code runs
I want to clean the messy CSV in data/raw/. Before you write or run anything, read the file and propose a cleaning plan. Group your proposal into three buckets:
1. What you will standardise automatically (column renames, whitespace, date formats, obvious typos).
2. What you will infer with the inference logged (units that shifted mid-study, columns whose meaning changed, suspected hand-edits).
3. What you will flag for me to decide rather than guess.
For every bucket, name the columns and rows you are touching and the rule you would apply. Do not write code yet. Wait for me to approve or edit the plan.
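The three-bucket plan is worth treating as a structured artifact rather than free text, because you will want to diff it against what actually ran. A minimal sketch of one way to represent it (the class and field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class PlanItem:
    column: str   # which column the rule touches
    rule: str     # the transformation or inference, stated precisely
    rows: str     # scope, e.g. "all" or "rows 210 onward"

@dataclass
class CleaningPlan:
    standardise: list[PlanItem] = field(default_factory=list)  # applied automatically
    infer: list[PlanItem] = field(default_factory=list)        # applied, inference logged
    flag: list[PlanItem] = field(default_factory=list)         # held for human judgement
```

Anything the agent cannot place confidently in the first two buckets belongs in `flag`; the plan is only approved once that bucket looks right to you.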
The lock-down from the previous step is what makes this safe. Even
if the agent decides mid-execution that it just needs to tweak one
value in the raw file, the harness blocks the write before it
happens.
Promote the working pattern into a reusable skill
Once the plan-then-execute loop produces a clean batch you trust,
capture it as a reusable skill so the next batch is a single
command. Ask Claude to turn the pattern you just ran into a
self-contained skill in your project. From then on the cleaning
routine enforces itself every time a new file arrives.
Prompt: save the workflow as a skill
The cleaning pattern we just worked out is going to repeat every time a new batch lands. Turn it into a reusable skill I can invoke with one command on future files. The skill should enforce the same discipline we just used: read the file first, propose a plan grouped into standardise / infer / flag, wait for my approval before touching anything, then write out the cleaned file, a change log, and a list of ambiguous rows that need human judgement. When you're done, test it on the current file and confirm it works end to end.
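For reference, a Claude Code skill is a folder containing a SKILL.md whose frontmatter tells the agent when to use it. A minimal sketch of what the generated skill might contain, assuming the standard `.claude/skills/clean-batch/SKILL.md` layout (the name, path, and wording are illustrative; Claude's actual output will differ):

```markdown
---
name: clean-batch
description: Clean a new raw-data batch using the lab's plan-then-execute discipline
---

When invoked on a file in data/raw/:

1. Read the file. Never modify it; data/raw/ is read-only.
2. Propose a plan grouped into standardise / infer / flag, and wait for approval.
3. After approval, write the cleaned file to data/processed/, append every
   transformation to the change log, and list ambiguous rows for human review.
```

The point of baking the discipline into the skill body is that future batches get the same plan-first, approval-gated treatment without you restating it.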
Build canonical lookups against a reference you trust
The genuinely hard part of any cleanup is the lookup table that
maps every variant of a value to its one canonical form. For
tasks like these, we often don't need to invent the workflow
ourselves. For example, in his
Claude Code for computational biology
write-up, Brian Naughton documented the extract-reimplement-test
pattern: he rebuilt a protein scoring function inside a new
codebase by diffing against the original until every value
matched. Same idea here: clone the source, extract the relevant
slice, reimplement the lookup in your script, then diff every
mapping back against the reference until the values match exactly.
Prompt: extract, reimplement, diff against the reference
The hardest part of this cleanup is the canonical lookup: the table that maps every variant of a value to its one trusted form. Instead of inventing one, build it from a reference I already trust. I'll point you at the source (a repo, a public dataset, or a file on a shared drive). Extract the relevant slice, save it clearly labelled as AI-generated, and wire the lookup into our cleaning script so it reads from that file rather than your guesses. Then diff every distinct raw value in our file against the reference, row by row. Report any mismatch and stop before overwriting. Fold the resulting lookup into the cleaning skill so future batches use it automatically.
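The diff step at the heart of that prompt is mechanical enough to sketch. Assuming the reference has been extracted to a two-column CSV of raw and canonical values (file layout and function names are illustrative):

```python
import csv

def load_lookup(path):
    """Read a two-column CSV (raw_value, canonical_value) into a dict."""
    with open(path, newline="") as f:
        return {row["raw_value"]: row["canonical_value"]
                for row in csv.DictReader(f)}

def diff_against_reference(ours, reference):
    """Return (raw, ours, reference) triples wherever our mapping
    disagrees with the trusted reference. Empty list means a clean diff."""
    mismatches = []
    for raw, canonical in ours.items():
        expected = reference.get(raw)
        if expected is not None and expected != canonical:
            mismatches.append((raw, canonical, expected))
    return mismatches
```

The cleaning script should refuse to overwrite anything until `diff_against_reference` comes back empty, which is the same stop-on-mismatch discipline the prompt asks for.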
If the reference data lives on a shared lab drive rather than a
repo, tell Claude where it is and ask it to give itself read
access to the shared path. The lookup script then pulls from the
canonical source directly, with no copies drifting across
machines.