Finance & Accounting

Expense categorization and anomaly flags

Build an owned script that reads your card and expense exports, assigns a GL category to each line with a confidence score, and flags the odd ones (duplicates, out-of-policy amounts, unusual vendors) for a human to check.

7 min read · 2026-06-17 · Human in the loop · Medium-sensitivity data

Ease: 4/5
Impact: 4/5
Risk: 3/5

Tools you'll use

Claude Code, Codex, Claude Cowork

Expense categorization and anomaly flagging is the work of reading your card and expense exports, assigning a general-ledger category to each line, and pulling the odd ones — duplicates, out-of-policy amounts, unfamiliar vendors — into a short list for a human to check. Here, it's an owned script over your own exports: it suggests a GL category for each transaction, attaches a confidence score, and returns high-confidence lines ready to post while routing low-confidence and unusual ones to review with a plain-English reason.

The manual version is slow and the close waits on it. Someone codes each line to an account, then hunts for the receipt submitted twice or the dinner that's three times the usual. It's mostly pattern-matching with a few real judgment calls mixed in. The errors are not rare: the GBTA Foundation found that 19% of expense reports contain mistakes or missing information, and each one costs an extra $52 and 18 minutes to fix (GBTA, in partnership with HRS).

A categorization-and-flagging tool does the pattern-matching so the human only confirms a handful of exceptions instead of coding hundreds of lines. Because your team owns it, you can tighten rules as the chart of accounts changes and add a policy check the week a new policy lands, with no waiting on a vendor. The human stays on the exceptions.

Moriva's take

Gate 1, real work: yes — someone codes and reviews expenses on a fixed cadence, usually weekly or at close. Gate 2, owned: yes — this is a script over your own exports that your team can read, run, and change without us. Gate 3, measured: easy — track hours per close and the count of anomalies caught before posting. It's a GO with a human kept on the exceptions; the categorization itself is reversible and internal, so the main thing to govern is keeping financial data in tools you control.

How do you set up expense categorization and anomaly flags?

1. Gather a few months of already-coded expenses
Export 3 to 6 months of card and expense data that a person has already categorized correctly. This is your ground truth: it teaches the tool how YOU code, not how some generic model thinks expenses should be coded. Include the columns that carry signal: merchant name, merchant category code (MCC) if you have it, amount, date, employee or cost center, and the final GL account. Drop these files in a folder the tool can read.

2. Point Claude Code or Codex at the folder and describe the job
In plain English, tell the tool: read these exports, learn how each line was coded, then suggest a GL category and a confidence score for new transactions. Ask it to write a script you can run again next month, not a one-off answer. It will build the categorizer using your chart of accounts and your past coding as the reference, and it will explain each suggestion so you can check its reasoning.

3. Run it in shadow mode against a month you've already closed
Before it touches anything live, have the tool code a month you already finished by hand, then compare its output to your actual coding line by line. This tells you the real accuracy on YOUR data: expect roughly 80% agreement early, climbing as you correct it. Where it disagrees, you'll see whether it's the tool that's wrong or your old coding that was inconsistent. Do not let it auto-post anything until you trust this report.

4. Add the anomaly checks
Ask the tool to flag the lines that don't fit the pattern: likely duplicates (same vendor and amount within a few days, even with a tweaked invoice number; use fuzzy matching), amounts well outside the normal range for that category, out-of-policy items, and first-time vendors. Each flag should carry a one-line reason a non-accountant can read. These flags are the high-value output: a missed duplicate is real money.

5. Set the confidence threshold and the review list
Decide the line above which a suggestion is trustworthy enough to accept with a glance, and below which it goes to a human. Start conservative (route more to review than you think you need) and loosen it as the shadow reports earn your trust. The output is two lists: confident suggestions ready to confirm in bulk, and an exceptions list with the tool's guess plus the reason attached.

6. Use Claude Cowork for the policy and review side
The non-coding work (writing the categorization rules in words, drafting the expense policy the anomaly checks enforce, and turning the exceptions list into a clean note for the reviewer or the employee who needs to explain a charge) fits Claude Cowork. An operator who doesn't touch the script can keep the rules and the policy current, then hand the updated wording to whoever maintains the code.

7. Schedule it and feed corrections back in
Have the tool run on your cadence (weekly, or each time a new export lands) and produce the two lists automatically. When a reviewer corrects a suggestion, save that correction into your ground-truth folder so next month's run is smarter. This feedback loop is what moves accuracy from the low 80s into the 90s over a few months. Your team owns the loop; no retraining contract required.
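
To make the step 5 split concrete, here is a minimal Python sketch of routing suggestions into the two lists. The function name `route`, the field names (`id`, `account`, `confidence`), and the 0.9 starting threshold are all assumptions for illustration; the script your tool writes may be organized differently.

```python
def route(suggestions, threshold=0.9, flagged_ids=frozenset()):
    """Split suggestions into (confident, exceptions).

    A line goes to exceptions if an anomaly check flagged it or if its
    confidence is below the threshold. The 0.9 default is an assumed,
    deliberately conservative starting point.
    """
    confident, exceptions = [], []
    for s in suggestions:
        if s["id"] in flagged_ids:
            exceptions.append({**s, "reason": "anomaly flag"})
        elif s["confidence"] < threshold:
            reason = f"confidence {s['confidence']:.2f} below threshold {threshold:.2f}"
            exceptions.append({**s, "reason": reason})
        else:
            confident.append(s)
    return confident, exceptions


# Hypothetical suggestions for three lines.
suggestions = [
    {"id": 1, "account": "6200 Travel", "confidence": 0.97},
    {"id": 2, "account": "6300 Meals", "confidence": 0.62},
    {"id": 3, "account": "6500 Software", "confidence": 0.95},
]
confident, exceptions = route(suggestions, threshold=0.9, flagged_ids={3})
print(len(confident), "ready to confirm;", len(exceptions), "for review")
```

Loosening the threshold later is a one-argument change, which keeps the decision visible to whoever reviews the script.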

What could go wrong (and how to handle it)

Financial data leaving systems you control. Card and expense data is sensitive — vendor names, amounts, employee spending.

Keep the data and the tool inside your own environment. Work from local exports or your own warehouse rather than pasting statements into third-party services. Treat the script and its files like any other finance system: access-controlled and logged.

Quiet miscoding. The tool confidently assigns the wrong GL account and it posts without anyone looking.

Never auto-post low-confidence lines. Run shadow mode until the agreement rate is high, keep a human confirming the exceptions, and spot-check a sample of the auto-accepted lines every close for the first few months.
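
The spot-check can be as simple as a random draw from the auto-accepted lines each close. The sketch below assumes a 5% sampling rate with a floor of 10 lines; both numbers and the name `spot_check_sample` are illustrative, not recommendations from the source.

```python
import random


def spot_check_sample(accepted_lines, rate=0.05, minimum=10, seed=None):
    """Pick a random subset of auto-accepted lines for a human to re-check.

    Samples `rate` of the lines but never fewer than `minimum`, and never
    more than the number of lines available.
    """
    if not accepted_lines:
        return []
    target = max(minimum, round(len(accepted_lines) * rate))
    k = min(len(accepted_lines), target)
    return random.Random(seed).sample(accepted_lines, k)


accepted = [{"id": i} for i in range(400)]  # hypothetical accepted lines
sample = spot_check_sample(accepted, seed=2026)
print(len(sample), "lines to spot-check")
```

Passing a fixed `seed` makes the draw reproducible, which helps if an auditor later asks which lines were sampled.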

Over-automation. The team stops looking entirely and trusts the green checkmarks.

Keep the exceptions list short on purpose, so reviewing it stays a real task, not a rubber stamp. Audit a random sample of confident lines periodically. The tool is the first pass; sign-off stays human.

Missed or false duplicate flags. Real duplicates slip through, or legitimate recurring charges get flagged every month.

Tune the fuzzy-match window against your actual history, and let reviewers mark known-recurring vendors so they stop alerting. Track both misses and false alarms in the shadow reports before going live.

Garbage in, garbage out. Inconsistent past coding teaches the tool bad habits.

Clean and standardize your chart of accounts before you start, and use a vetted period of correctly-coded data as the reference. Where the shadow report shows the tool disagreeing with old coding, check whether the old coding was the error.

Compliance and audit trail gaps. Auditors need to see why each line was coded the way it was.

Have the tool record its reasoning and confidence for every line, plus any human override. That trail is often clearer than manual coding, but only if you keep it. Confirm the approach with your controller or external auditor before it touches the books.
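
One way to keep that trail is an append-only log with one JSON record per line. This is a sketch under assumptions: the field names, the JSON Lines layout, and the helper names `audit_record` and `append_audit` are hypothetical, and your auditor may want different fields.

```python
import json
from datetime import datetime, timezone


def audit_record(line_id, suggested, confidence, reason, final=None, reviewer=None):
    """Build one audit entry: the tool's suggestion plus any human override."""
    return {
        "line_id": line_id,
        "suggested_account": suggested,
        "confidence": round(confidence, 4),
        "reason": reason,
        "final_account": final or suggested,
        "overridden": final is not None and final != suggested,
        "reviewer": reviewer,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def append_audit(path, record):
    """Append the record as one JSON line; earlier entries are never rewritten."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


# Hypothetical example: a reviewer moved a line from Travel to Meals.
rec = audit_record("2026-03-0042", "6200 Travel", 0.71,
                   "coded to 6200 Travel in 5 of 7 past lines",
                   final="6300 Meals", reviewer="j.doe")
print(json.dumps(rec, indent=2))
```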

Prompts to get started

Build the categorizer

In this folder are six months of credit card and expense exports that we've already coded to GL accounts. Read them and learn how we categorize. Then write a script I can run each month that takes a new export and, for every line, suggests a GL account plus a confidence score, using merchant name, MCC, amount, date, and employee. For each suggestion, include a one-line reason. Don't post anything — just produce a file of suggestions I can review.
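
For orientation, the simplest version of what this prompt asks for is a lookup that learns from past coding by majority vote. The Python sketch below is an illustration only, not what Claude Code or Codex will necessarily produce: it uses the merchant name alone (a real script would also weigh MCC, amount, date, and employee), and the GL account labels and field names are made up.

```python
from collections import Counter, defaultdict


def normalize(merchant: str) -> str:
    """Lowercase and collapse whitespace so 'UBER  *TRIP' and 'Uber *Trip' match."""
    return " ".join(merchant.lower().split())


def learn(history):
    """Count how often each merchant was coded to each GL account."""
    votes = defaultdict(Counter)
    for row in history:
        votes[normalize(row["merchant"])][row["gl_account"]] += 1
    return votes


def suggest(votes, merchant):
    """Return (gl_account, confidence, reason) for one new transaction."""
    counts = votes.get(normalize(merchant))
    if not counts:
        return None, 0.0, "merchant not seen in coded history"
    account, hits = counts.most_common(1)[0]
    total = sum(counts.values())
    return account, hits / total, f"coded to {account} in {hits} of {total} past lines"


# Hypothetical coded history.
history = [
    {"merchant": "Uber *Trip", "gl_account": "6200 Travel"},
    {"merchant": "UBER  *TRIP", "gl_account": "6200 Travel"},
    {"merchant": "Uber *Trip", "gl_account": "6300 Meals"},
    {"merchant": "Adobe", "gl_account": "6500 Software"},
]
votes = learn(history)
print(suggest(votes, "uber *trip"))
print(suggest(votes, "New Vendor LLC"))
```

Here the confidence is just the share of past lines that agree, so inconsistent history shows up directly as a lower score.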

Shadow-mode accuracy check

Run the categorizer against last month's export, which I already coded by hand. Compare your suggested GL account to the actual one for every line and give me an accuracy report: overall agreement rate, the categories where you disagree most, and a list of the specific lines where we differ so I can see who's right.
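
The report this prompt describes boils down to a line-by-line comparison. A minimal sketch, assuming each row already carries both the hand-coded account (`actual`) and the tool's guess (`suggested`); those field names and `shadow_report` are hypothetical.

```python
from collections import Counter


def shadow_report(rows):
    """Compare suggested vs. actual GL accounts for an already-closed month."""
    total = len(rows)
    diffs = [r for r in rows if r["suggested"] != r["actual"]]
    disputed = Counter(r["actual"] for r in diffs)
    return {
        "agreement_rate": (total - len(diffs)) / total if total else 0.0,
        "most_disputed": disputed.most_common(3),  # categories with most disagreements
        "differences": diffs,                      # the lines to look at by hand
    }


# Hypothetical closed-month lines.
rows = [
    {"line_id": 1, "actual": "6200 Travel", "suggested": "6200 Travel"},
    {"line_id": 2, "actual": "6300 Meals", "suggested": "6200 Travel"},
    {"line_id": 3, "actual": "6500 Software", "suggested": "6500 Software"},
    {"line_id": 4, "actual": "6200 Travel", "suggested": "6200 Travel"},
]
report = shadow_report(rows)
print(f"agreement: {report['agreement_rate']:.0%}")
```

A disagreement does not say who is right; the `differences` list is there so a person can decide whether the tool or the old coding was off.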

Add anomaly flags

Add anomaly detection to the script. Flag likely duplicates using fuzzy matching on vendor and amount within a 5-day window even if the invoice number differs, amounts more than 3x the median for that category, expenses that break our policy (I'll paste the policy), and first-time vendors. Output a separate exceptions list with a plain-English reason for each flag.
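
As a rough picture of three of these checks, the sketch below uses Python's standard-library `difflib.SequenceMatcher` for the fuzzy vendor match. The 0.85 similarity cutoff, the field names, and the sample data are assumptions; category medians are expected to come from your coded history, and policy checks are left out because they depend on your policy text. Amounts are floats here for brevity.

```python
from datetime import date
from difflib import SequenceMatcher


def similar(a: str, b: str, cutoff: float = 0.85) -> bool:
    """Fuzzy string match; the 0.85 cutoff is an assumed starting value."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff


def flag_anomalies(lines, category_medians, known_vendors, window_days=5, multiple=3.0):
    """Return (line_id, reason) pairs for duplicates, outliers, and new vendors."""
    flags = []
    for i, ln in enumerate(lines):
        # Likely duplicate: same amount, similar vendor, within the date window.
        for prev in lines[:i]:
            if (ln["amount"] == prev["amount"]
                    and abs((ln["date"] - prev["date"]).days) <= window_days
                    and similar(ln["vendor"], prev["vendor"])):
                flags.append((ln["id"], f"possible duplicate of line {prev['id']}"))
                break
        # Amount well above the historical median for its category.
        med = category_medians.get(ln["category"])
        if med is not None and ln["amount"] > multiple * med:
            flags.append((ln["id"], f"amount {ln['amount']:.2f} is over "
                                    f"{multiple:g}x the category median {med:.2f}"))
        # Vendor that resembles nothing in the coded history.
        if not any(similar(ln["vendor"], v) for v in known_vendors):
            flags.append((ln["id"], "first-time vendor"))
    return flags


# Hypothetical new export, medians, and vendor list.
lines = [
    {"id": 1, "vendor": "Delta Air Lines", "amount": 412.50, "date": date(2026, 3, 2), "category": "Travel"},
    {"id": 2, "vendor": "DELTA AIRLINES", "amount": 412.50, "date": date(2026, 3, 4), "category": "Travel"},
    {"id": 3, "vendor": "Steakhouse 55", "amount": 480.00, "date": date(2026, 3, 5), "category": "Meals"},
    {"id": 4, "vendor": "Staples", "amount": 35.00, "date": date(2026, 3, 6), "category": "Supplies"},
]
medians = {"Travel": 400.0, "Meals": 60.0, "Supplies": 40.0}
flags = flag_anomalies(lines, medians, known_vendors=["Delta Air Lines", "Staples"])
for line_id, reason in flags:
    print(line_id, reason)
```

The pairwise duplicate scan is fine for a month of card lines; for much larger exports you would want to group by amount first.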

Turn exceptions into a review note

Here is this week's exceptions list from our expense tool. Group it by reviewer, write a short, neutral note for each flagged line explaining what looks off and what we need confirmed, and draft a one-line message I can send to the employee for the three charges that need their explanation.

FAQ

Will this replace our accounting or expense software?

No. It sits next to it. You still book entries in your accounting system and approve expenses where you do today. This tool does the coding pass and the anomaly hunt on your exports, then hands you a clean suggestions list and a short exceptions list. Think of it as the assistant that preps the work, not the system of record.

How accurate is it really?

On your own data, expect roughly 80% agreement with your manual coding early on, climbing into the 90s over a few months as you correct it and feed those corrections back. That's why you run it in shadow mode first — you measure the real number on your transactions before trusting it, instead of taking a vendor's claim on faith.

We don't have engineers. Can we actually own this?

Yes. Claude Code and Codex build the script from a plain-English description and you run it by pointing it at a folder. The non-coding parts (the rules, the policy, the review notes) live in Claude Cowork, which an operator handles without code. When your chart of accounts or policy changes, you describe the change and Claude Code or Codex updates the script. No standing consultant.

Is it safe to put expense data through an AI tool?

It can be, if you keep the data in an environment you control rather than pasting statements into public services. Work from local exports or your own warehouse, control who can access the files, and keep the audit trail. The sensitivity here is medium — internal financial data, not customer PII or regulated records — but it still deserves real handling.

What if it codes something wrong and it ends up on the books?

That's why low-confidence lines never post automatically and a human signs off the exceptions. Categorization is also reversible — a miscoded line is a journal correction, not an irreversible event. The bigger win is the anomaly flags: catching a duplicate payment before it goes out is worth more than the coding time you save.

Sources

19 percent of expense reports contain errors or missing information, costing an additional $52 and 18 minutes to correct each report — GBTA Foundation (in partnership with HRS), 2015
