We automated our own GST return with an AI agent. Here is how it actually works.

Bi-monthly GST is the recurring tax-admin tax on every NZ owner-operator. For a small business with a few bank accounts, some personal-card business spend, a stack of invoice PDFs in email, and home-office deductions, it is usually a half-day to a full day of unpaid work every two months. Across six returns, that is three to six days of owner time a year.

We use our own GST return as a real test bed for agentic automation. Below is the pipeline we built and run on our own books: a multi-phase agent that does the GST cycle end-to-end and hands a draft return to a human for approval. The architecture is bank-agnostic and accounting-platform-agnostic, and it works for any NZ owner-operator who runs a similar pattern: tradies, professional-services firms, anyone billing through a personal entity with a personal card in the mix.

This is not a magic "just press the button" story. It is a careful five-phase pipeline with a verification step after every action that matters. Read it as a reference for what a properly engineered agent looks like, not a sales pitch.

The shape of the problem

Most NZ owner-operators have the same picture at GST time:

Multiple bank accounts. A business operating account (often split into sub-accounts: income, expenses, GST withholding, tax withholding, net), plus one or two personal accounts where home-office bills get paid (power, internet, rates, sometimes fuel and parking).
A pile of invoices in email. Supplier invoices, contractor RCTIs, subscription receipts. Some are PDFs, some are web links. They need attaching to the corresponding bank transaction in the accounting system so the GST claim has evidence.
Reconciliation rules in the accounting system that pre-suggest a code for each transaction, but get it wrong often enough that you cannot just rubber-stamp them.
Manual journals that have to be hand-built every period for home-office deductions, because the spending happened on a personal card.
A GST return that has to be cross-checked against the period's revenue, the bank balances and historical norms before filing.

The agent that does this end-to-end has to be reliable across all of the above, with no silent mistakes, because every silent mistake is a $50-500 cleanup the next period.

The five-phase pipeline

Each phase has a single, narrow job. Each phase ends in a checkpoint that a human can approve, modify or reject. Nothing destructive happens without that human checkpoint.

Phase 1: Pre-flight

The agent pulls live state and asserts the period under work matches the human's intent. Specifically:

Queries the accounting system's API for current bank balances and items-to-reconcile counts per account.
Reads the chart of accounts live, not from a hardcoded map. Account UUIDs drift; never trust yesterday's IDs.
Confirms the GST period from the previous filing date, presents start and end dates, waits for confirmation.

Why this is here: it stops the most expensive class of error before any work happens, namely the agent operating on the wrong period or the wrong account. It costs maybe two minutes of human time and saves the periodic disaster.
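
As an illustration only, the period proposal could be sketched like this. It assumes two-monthly periods that end on the last day of a calendar month; the function name next_gst_period is hypothetical and not part of any accounting platform's API. The result is only a proposal that the human confirms.

```python
import calendar
from datetime import date, timedelta


def next_gst_period(previous_period_end: date, months: int = 2) -> tuple[date, date]:
    """Return (start, end) of the period that follows previous_period_end.

    Assumption: every period ends on the last day of a calendar month.
    """
    start = previous_period_end + timedelta(days=1)
    # The end month is (months - 1) months after the start month.
    month_index = (start.month - 1) + (months - 1)
    end_year = start.year + month_index // 12
    end_month = month_index % 12 + 1
    last_day = calendar.monthrange(end_year, end_month)[1]
    return start, date(end_year, end_month, last_day)


if __name__ == "__main__":
    start, end = next_gst_period(date(2026, 1, 31))
    # The agent would show these dates and wait for an explicit yes.
    print(f"Proposed GST period: {start} to {end}. Confirm? [y/N]")
```

Deriving the dates from the previous filing, rather than from today's date, is what lets the human spot a skipped or duplicated period at a glance.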

Phase 2: Bank-feed aggregation and classification

The agent pulls every transaction from every relevant account across the period:

Business accounts via the accounting platform's bank feed.
Personal accounts via a separate banking aggregator API (we use one; any with a clean read API will do).

Every transaction gets classified into one of four buckets, with a confidence label:

Bucket	Meaning	Next action
Business income	Revenue into a business account	Code to revenue, attach RCTI if applicable
Business expense	Business spend on either business or personal card	Code to the matching expense GL with GST treatment, attach invoice
Personal	Personal spend on personal accounts	Ignore for GST
Transfer	Inter-account movement	Reconcile both legs without GST treatment

The classifier uses three signals: the merchant pattern, the existing reconciliation rule the accounting system has learned, and the agent's own LLM-based read of the transaction memo. When all three agree, it is "confident". When any disagree, it is "needs review".

Crucially, the classifier is never the actor. It only proposes. The next phases act, and only after a human checkpoint.
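
The agreement rule is simple enough to sketch. The snippet below is a minimal illustration, not a reference implementation: the bucket identifiers, the Proposal type and the propose_bucket function are hypothetical names, and each of the three signals is assumed to have already been reduced to a bucket name (or None when it has no opinion).

```python
from dataclasses import dataclass
from typing import Optional

BUCKETS = {"business_income", "business_expense", "personal", "transfer"}


@dataclass(frozen=True)
class Proposal:
    bucket: Optional[str]  # None when the signals disagree
    status: str            # "confident" or "needs review"
    signals: tuple         # the raw votes, kept for the human reviewer


def propose_bucket(merchant_pattern: Optional[str],
                   platform_rule: Optional[str],
                   llm_read: Optional[str]) -> Proposal:
    """Combine the three signals. A missing or unknown signal counts as disagreement."""
    votes = (merchant_pattern, platform_rule, llm_read)
    unanimous = votes[0] in BUCKETS and len(set(votes)) == 1
    if unanimous:
        return Proposal(bucket=votes[0], status="confident", signals=votes)
    return Proposal(bucket=None, status="needs review", signals=votes)
```

Note that the function returns a proposal and touches nothing else, which mirrors the point above: classification and action are separate steps.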

Phase 3: Invoice retrieval and matching

For every transaction the classifier marked as a business expense or income, the agent goes hunting for the corresponding invoice or RCTI:

Searches the email account over the GST period for messages from suppliers (contractor RCTIs, subscription receipts, supplier invoices, anything matching a known business contact).
For PDF attachments, downloads and OCR-extracts the invoice number, date, amount, GST treatment and description.
For invoices that arrive as web links (e.g. a "view your invoice" link rather than an attachment), follows the link, signs in if needed, and pulls the PDF.
Matches each invoice to a bank transaction using invoice number → reference, then amount + date, then supplier + amount.

The output is a three-part match table:

MATCHED:
  Bank line               Amount     Invoice                       Confidence
  3 Mar  CONTRACTOR INC   $5,880.00  RCTI-8994017.pdf (12 Mar)     Ref + amount
  10 Mar SUPPLIER LTD     $4,704.00  inv-2026-2287.pdf (15 Mar)    Ref + amount
  11 Mar FUEL STATION     $69.17     receipt-20260311.pdf          Date + amount

UNMATCHED:
  9 Mar  SOFTWARE SUB     $24.39     No receipt email found
  16 Mar PROPERTY MGR     $810.00    No receipt email found

EXTRA INVOICES:
  invoice-target-mar.pdf  $37.66     Possibly matches Target Co.

The human reviews the table, fixes the unmatched and the speculative matches, and only then approves the agent to attach the PDFs to the corresponding bank transactions in the accounting system.
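
The matching cascade that produces this table can be sketched as follows. This is an illustration under stated assumptions: the BankLine and Invoice types, the match_invoices function and the 14-day date window are all hypothetical, and the first rule also checks the amount, mirroring the "Ref + amount" label in the table.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass
class BankLine:
    when: date
    payee: str
    amount: Decimal
    reference: str


@dataclass
class Invoice:
    filename: str
    number: str      # empty string when OCR found no invoice number
    supplier: str
    issued: date
    amount: Decimal


def match_invoices(lines, invoices, window_days=14):
    """Return (matched, unmatched_lines, extra_invoices).

    Rules run strictest first, and each invoice is used at most once.
    """
    def by_ref(line, inv):
        return bool(inv.number) and inv.number in line.reference and inv.amount == line.amount

    def by_date(line, inv):
        return inv.amount == line.amount and abs((inv.issued - line.when).days) <= window_days

    def by_supplier(line, inv):
        return inv.amount == line.amount and inv.supplier.lower() in line.payee.lower()

    rules = [("Ref + amount", by_ref), ("Date + amount", by_date), ("Supplier + amount", by_supplier)]
    matched, open_lines, open_invoices = [], list(lines), list(invoices)
    for label, rule in rules:
        still_open = []
        for line in open_lines:
            hit = next((inv for inv in open_invoices if rule(line, inv)), None)
            if hit is None:
                still_open.append(line)
            else:
                open_invoices.remove(hit)
                matched.append((line, hit, label))
        open_lines = still_open
    return matched, open_lines, open_invoices
```

Running the strict rule over every line before trying a looser one stops a weak amount-only match from claiming an invoice that another line could have matched by reference. Ambiguous cases still land in front of the human, which is the point of the checkpoint.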

Phase 4: Reconciliation with mechanical verification

This is the phase where most agentic systems quietly break. Reconciliation feels like rubber-stamping but it is the step that does irreversible damage to the books if the agent gets sloppy.

We learned this the hard way: the agent built a confident batch of 21 reconciliation suggestions, all of which looked correct on a quick read. We approved the batch. The 21st was a transfer line where the accounting system's learned rule pointed at the wrong destination account, a withholding sub-account that happened to also pattern-match the transaction memo. The agent confirmed it, the books absorbed a wrong transfer, and the cleanup took ninety minutes the next day.

The fix was mechanical, not heuristic. Every transfer line carries a destination-account suffix encoded in its bank reference (a common pattern for sub-account transfers). Before flagging anything as confident, the agent now programmatically asserts, on every single transfer line, that the suggested destination account matches the suffix in the statement-line text. Twenty correct suggestions do not earn the 21st any trust; the 21st suggestion gets the same mechanical check as the 1st.
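
A check of this kind might look like the sketch below. The layout of the reference is an assumption made for illustration: the statement text is taken to contain exactly one account number in the NZ bank-branch-account-suffix form, and the names ACCOUNT_RE, TransferMismatch and assert_destination are hypothetical. A real bank feed may encode the suffix differently, in which case only the regular expression changes.

```python
import re

# Assumed layout: bank-branch-account-suffix, e.g. "01-0001-0123456-03".
ACCOUNT_RE = re.compile(r"\b\d{2}-\d{4}-\d{7}-(\d{2,3})\b")


class TransferMismatch(Exception):
    """Raised when a transfer suggestion cannot be mechanically verified."""


def suffix_of(text: str) -> str:
    """Extract the account suffix, normalising '03' and '003' to '3'."""
    found = ACCOUNT_RE.findall(text)
    if len(found) != 1:
        raise TransferMismatch(
            f"expected exactly one account number in {text!r}, found {len(found)}"
        )
    return found[0].lstrip("0") or "0"


def assert_destination(statement_text: str, suggested_account_number: str) -> None:
    """Raise unless the suggested account's suffix matches the statement line."""
    expected = suffix_of(statement_text)
    suggested = suffix_of(suggested_account_number)
    if expected != suggested:
        raise TransferMismatch(
            f"statement line points at suffix {expected}, suggestion is suffix {suggested}"
        )
```

The check raises rather than returning a score. A line with no parseable account number fails just as loudly as a mismatched one, so nothing reaches "confident" by default.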

Two more rules baked in from the same incident:

"The UI confirmed" is not "the action succeeded". After every reconciliation that is not a pure rule-confirm, the agent queries the accounting system's API for the created transaction by date and amount, and verifies it landed against the right account. The visible UI confirmation alone is treated as a hypothesis; the API read is the proof.
Stop on the first failure. Batch loops track a seen set and abort the moment a row would be clicked twice. An earlier version of our agent once clicked the same OK button 57 times because it verified at end-of-loop rather than after each action. The rule now is: verify the effect after every action, and abort if the effect did not happen.
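
The two rules combine into one small loop. The sketch below is illustrative only: run_batch and BatchAborted are hypothetical names, act stands in for whatever performs the reconciliation, and verify stands in for the independent API read that confirms the effect.

```python
from typing import Callable, Hashable, Iterable


class BatchAborted(Exception):
    """Raised the moment a batch can no longer be trusted."""


def run_batch(rows: Iterable[Hashable],
              act: Callable[[Hashable], None],
              verify: Callable[[Hashable], bool]) -> list:
    """Act on each row once, verifying the effect before moving on."""
    seen = set()
    done = []
    for row in rows:
        if row in seen:
            raise BatchAborted(f"row {row!r} would be actioned twice; {len(done)} rows completed")
        seen.add(row)
        act(row)
        # Verify immediately, not at the end of the loop.
        if not verify(row):
            raise BatchAborted(f"effect of row {row!r} not found; {len(done)} rows completed")
        done.append(row)
    return done
```

With per-action verification, a broken click handler fails on the first row instead of the 57th, and the exception message says exactly how far the batch got.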

Once these rules are in, reconciliation runs reliably. The agent proposes a plan; the human approves; the agent executes; the agent verifies. No silent mistakes.

Phase 5: Manual journals, GST report, human approval

The home-office deduction journal is the last phase before the GST return.

The agent scans the personal accounts (already pulled in Phase 2) for utility bills and other business-deductible spend (power, internet, rates, parking, fuel, accommodation, office supplies), correlates each line with a deduction rate (e.g. internet 100%, power 60%, rates 40%), then drafts the manual journal in the accounting system's expected layout:

Line amount type set to GST-inclusive
One line per GL code (collapse multiple items of the same type)
Descriptive line items ("Power utility, Feb & Mar invoices") not generic categories
A balancing entry on the owner's-funds account
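
As a rough sketch of that layout, the drafting step could look like this. The deduction rates are the examples quoted above; the GL codes, the dictionary shape and the draft_home_office_journal function are made up for illustration, and the debit/credit orientation (expense debited, owner's funds credited) is an assumption to check against your own chart of accounts.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

# Example rates from the text; the GL codes are hypothetical placeholders.
DEDUCTION = {
    "internet": ("GL-INTERNET", Decimal("1.00")),
    "power": ("GL-POWER", Decimal("0.60")),
    "rates": ("GL-RATES", Decimal("0.40")),
}
OWNER_FUNDS = "GL-OWNER-FUNDS"


def draft_home_office_journal(bills):
    """bills: iterable of (category, description, gst_inclusive_amount)."""
    totals = defaultdict(Decimal)
    notes = defaultdict(list)
    for category, description, amount in bills:
        gl_code, rate = DEDUCTION[category]
        totals[gl_code] += (amount * rate).quantize(CENT, rounding=ROUND_HALF_UP)
        notes[gl_code].append(description)
    # One line per GL code, with a descriptive rather than generic label.
    lines = [
        {"gl": gl, "description": ", ".join(notes[gl]), "debit": total}
        for gl, total in totals.items()
    ]
    # Balancing entry on the owner's-funds account.
    lines.append({
        "gl": OWNER_FUNDS,
        "description": "Home-office deductions paid personally",
        "credit": sum(totals.values(), Decimal("0")),
    })
    return {"status": "DRAFT", "line_amount_type": "GST inclusive", "lines": lines}
```

Amounts are kept as Decimal throughout so that the balancing entry equals the sum of the other lines to the cent.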

The journal goes into the system as DRAFT, not posted. The agent then pulls the GST return from the accounting system's reporting API, prints the headline figures with a comparison to the previous two periods, runs sanity checks (GST collected = GST-inclusive revenue × 3/23, GST withholding balance covers the return amount, P&L for the period aligns with the GST figures), and surfaces any variance over a threshold.

GST RETURN: [Period]
  Total sales and income (incl GST):  $XX,XXX
  GST collected:                      $X,XXX
  Total purchases (incl GST):         $X,XXX
  GST credit on purchases:            $X,XXX
  ────────────────────────────────
  NET GST TO PAY:                     $X,XXX

  Compare to previous periods:
    Last period:   $X,XXX
    Period before: $X,XXX
    Average:       $X,XXX

  Sanity checks:
  ✓ GST collected matches revenue × 3/23
  ✓ Withholding balance covers payable amount
  ✓ P&L for period aligns
  ⚠ Power deduction up 20% vs last period, confirm utility increase
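
Two of these checks are plain arithmetic and can be sketched directly. The 3/23 fraction is the GST component of a GST-inclusive amount at 15%, so the first check assumes all sales in the period are standard-rated. The function name, the one-dollar tolerance and the 20% variance threshold are illustrative assumptions.

```python
from decimal import Decimal

# 15% GST expressed as a fraction of a GST-inclusive amount: 0.15 / 1.15 = 3/23.
GST_FRACTION = Decimal(3) / Decimal(23)


def sanity_checks(sales_incl_gst, gst_collected, net_gst_to_pay, withholding_balance,
                  previous_net, tolerance=Decimal("1.00"),
                  variance_threshold=Decimal("0.20")):
    """Return a list of (passed, message) pairs for the human to read."""
    findings = []

    expected = sales_incl_gst * GST_FRACTION
    findings.append((abs(expected - gst_collected) <= tolerance,
                     "GST collected matches revenue × 3/23"))

    findings.append((withholding_balance >= net_gst_to_pay,
                     "Withholding balance covers payable amount"))

    # Compare against the average of the previous periods, if there is one.
    if previous_net:
        average = sum(previous_net, Decimal("0")) / len(previous_net)
        if average != 0:
            variance = abs(net_gst_to_pay - average) / abs(average)
            findings.append((variance <= variance_threshold,
                             "Net GST within threshold of recent average"))
    return findings


if __name__ == "__main__":
    for passed, message in sanity_checks(
        Decimal("23000"), Decimal("3000"), Decimal("1800"),
        Decimal("2000"), [Decimal("1700"), Decimal("1900")],
    ):
        print("✓" if passed else "⚠", message)
```

A failed check does not block anything by itself; it simply turns a tick into a warning on the printed return so the human looks before filing.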

The agent never files. Filing is a single human action on the tax authority's site, with the agent's draft journal and the printed sanity-checked return in front of the human. The agent's job is to make that human action take two minutes instead of two days.

What makes this agent reliable

Three rules that matter more than the prompt-engineering:

Every action that touches money is reversible and approval-gated. The agent reads freely. It only writes after a human says yes, and writes are bounded, one batch at a time, with a verify step after each action.
The agent does not invent. It reads live state and proposes against live state. Hardcoded IDs, cached account maps and remembered context all rot; the agent re-pulls each session.
Verification is mechanical, not LLM. The "did the action succeed" check is an API query, not the agent reading the UI back and saying "looks good". UIs lie about success more often than humans assume.

These are not specific to GST. They are the rules that make any agentic automation that touches money or records survive contact with reality.

Tool / bank / accounting-platform agnostic

The pipeline above is described in terms of generic roles, not specific products. We built our version against the stack we use, but the architecture moves cleanly to any combination of:

Banks: any with a read API or aggregator support. The major NZ banks (ANZ, ASB, BNZ, Westpac, Kiwibank, TSB) all aggregate via the same two or three commercial APIs.
Accounting platforms: Xero, MYOB and QuickBooks all expose the same primitives: bank feeds, transactions, attachments, manual journals, GST reports.
Email systems: Gmail, Microsoft 365, even plain IMAP. The agent only needs message search and attachment download.
Job-management systems (for tradies): ServiceM8, Tradify, Fergus, simPRO all expose invoice exports that feed the same pipeline.

The right agent for a tradie running ServiceM8 + Xero + Gmail + a TSB business account looks identical in shape to ours: five phases, the same checkpoints, the same verification rules. The plumbing changes; the architecture does not.

What this is worth in real terms

For our own books, this took GST cycle time from about a day per bi-monthly return to roughly thirty to sixty minutes of human review. The maths is not subtle:

Manual GST cycle: 6 to 8 hours every two months, 36 to 48 hours a year.
Agent-assisted: 30 to 60 minutes every two months, 3 to 6 hours a year.

Thirty-plus hours a year saved on a process that previously absorbed six to eight hours of owner-operator time every cycle. At a real owner-operator opportunity cost of $150-300/hr, that is roughly $5,000-10,000 a year, on a single process, on a single business. Multiply across the dozen recurring administrative cycles a typical small business runs and the case for agentic automation as an actual business decision (not a hype trade) is straightforward.

If you want it built for you

This is exactly the kind of agentic automation we build for NZ professional-services firms and, increasingly, for larger trade firms on retainers. We are tool-agnostic, accounting-platform-agnostic and bank-agnostic: the pipeline above runs on whatever you already have. The first build is usually a fixed $5,000-15,000 depending on scope, paid back inside a year on a single process. See business automation for the wider picture, or practical AI automation for the architectural side of how we build agents that do not silently break things.