Harvard Agentic Science

Clean a dataset that seems uncleanable

Does your lab have a dataset that has been sitting in a shared folder for months because the cleanup is so painful nobody wants to start? Maybe the column names changed halfway through the study, someone hand-edited rows in Excel without documenting it, or the units are inconsistent in a way that means fixing one column breaks assumptions in another.

Now you can point an AI model at the file and tell it everything you know about the data's history. It proposes a cleaning strategy before touching anything: what it will standardise, what it will infer, and what it will flag for you to decide. Then it writes and runs a pipeline that does the actual work. Every transformation gets documented in a change log: the original value, the new value, and why it was changed.
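A change log of that shape is simple to sketch. The snippet below is a minimal illustration, not the actual pipeline the model would write: the column name, alias table, and rule are hypothetical, and a real run would include many more rules.

```python
import pandas as pd

def clean_with_log(df):
    """Apply cleaning rules, recording every change as
    (row, column, original, new, reason) for the audit trail.

    A minimal sketch; the "status" column and its aliases
    are hypothetical examples.
    """
    log = []

    def record(row_idx, column, old, new, reason):
        log.append({"row": row_idx, "column": column,
                    "original": old, "new": new, "reason": reason})

    # Example rule: standardise inconsistent labels in one column.
    aliases = {"Y": "yes", "N": "no", "Yes": "yes", "No": "no"}
    for idx, value in df["status"].items():
        if value in aliases:
            record(idx, "status", value, aliases[value],
                   "standardised status label")
            df.at[idx, "status"] = aliases[value]

    return df, pd.DataFrame(log)
```

Because every edit passes through `record`, the returned log is itself a table you can export alongside the cleaned data for your methods section.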

Ambiguous cases get handled carefully. A unit that could be milligrams or micrograms gets checked against known values in the literature so you can make an educated fix. Columns whose meaning shifted mid-study get flagged for your review instead of being silently guessed. You make the judgment calls on the hard ones. It handles everything else.
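The unit check can be thought of as a plausibility test: which candidate unit puts every observed value inside a range the literature considers sensible? A sketch under that assumption, with made-up range numbers:

```python
def infer_unit(values, plausible_ranges):
    """Guess a column's unit by testing observed values against
    plausible ranges for each candidate unit.

    plausible_ranges maps unit name -> (low, high), with bounds
    taken from domain knowledge (the numbers used in tests here
    are illustrative, not real literature values).

    Returns the single unit whose range contains every value,
    or None when zero or multiple units fit -- i.e. flag the
    column for human review instead of guessing.
    """
    candidates = [unit for unit, (lo, hi) in plausible_ranges.items()
                  if all(lo <= v <= hi for v in values)]
    return candidates[0] if len(candidates) == 1 else None
```

The deliberate design choice is the `None` branch: when the evidence is consistent with more than one unit, the tool surfaces the ambiguity rather than resolving it silently.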

You end up with a cleaned dataset, a citable change log for your methods section, and a list of edge cases it could not resolve on its own. The change log means your cleaning process is reproducible and auditable. The whole thing takes an afternoon instead of the week you have been putting off.