Data warehousing · a field guide

Your reports stopped agreeing. Now what?

Most data warehouse projects don't fail on the tech. They fail on four decisions made — or skipped — in the first two weeks. Here's what those decisions are.

The moment you actually need a warehouse

There's a specific day this gets real. Two people pull "revenue last month" from two reports and get two different numbers. Both are correct. They're just counting differently — one includes refunds, the other doesn't, and nobody wrote that rule down.

That's the signal. Not data volume, not the fancy tool — disagreement. A warehouse exists to settle those arguments: one place where revenue means one thing, defined once, and every report reads from it. Get that and the dashboards mostly look after themselves. Skip it and you'll spend Mondays reconciling spreadsheets for as long as the company exists.
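To make the refund disagreement concrete, here is a minimal Python sketch. The order amounts and the function names (`gross_revenue`, `net_revenue`) are made up for illustration. The point is only that both numbers are defensible until one of them is written down as the definition.

```python
# Hypothetical orders for one month: (order_id, amount, refunded_amount).
orders = [
    ("A1", 100.0, 0.0),
    ("A2", 250.0, 250.0),  # fully refunded
    ("A3", 80.0, 20.0),    # partially refunded
]

def gross_revenue(rows):
    """Report 1: counts every order at its full amount, ignoring refunds."""
    return sum(amount for _, amount, _ in rows)

def net_revenue(rows):
    """Report 2: subtracts refunds from each order."""
    return sum(amount - refunded for _, amount, refunded in rows)

print(gross_revenue(orders))  # 430.0
print(net_revenue(orders))    # 160.0
```

Neither function is wrong. A warehouse picks one, names it, and makes every report call the same definition.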

The four decisions that set the ceiling

Almost everything that goes wrong later traces back to four early calls.

1. Grain

What does one row in your fact table mean — one order, one order line, one shipment? Pick wrong and every measure downstream fights you. This is the one people rush, then regret.
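Here is a small sketch of how grain changes what a measure means, assuming a hypothetical order with two lines and one order-level shipping fee. The field names are illustrative, not taken from any particular system.

```python
# Hypothetical source data: one order with two lines and a single shipping fee.
order = {
    "order_id": "A1",
    "shipping_fee": 10.0,
    "lines": [
        {"sku": "PEN", "qty": 2, "price": 3.0},
        {"sku": "INK", "qty": 1, "price": 9.0},
    ],
}

# Grain = one row per order: the shipping fee appears exactly once.
order_grain = [{
    "order_id": order["order_id"],
    "item_total": sum(l["qty"] * l["price"] for l in order["lines"]),
    "shipping_fee": order["shipping_fee"],
}]

# Grain = one row per order line: copying the order-level fee onto
# every line looks harmless until someone sums the column.
line_grain = [
    {
        "order_id": order["order_id"],
        "sku": l["sku"],
        "line_total": l["qty"] * l["price"],
        "shipping_fee": order["shipping_fee"],
    }
    for l in order["lines"]
]

shipping_at_order_grain = sum(r["shipping_fee"] for r in order_grain)
shipping_at_line_grain = sum(r["shipping_fee"] for r in line_grain)
print(shipping_at_order_grain)  # 10.0
print(shipping_at_line_grain)   # 20.0, the fee is counted once per line
```

Line grain is not the wrong choice here. It just means an order-level measure such as the shipping fee has to be allocated across lines or kept in a separate order-level table.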

2. Conformed dimensions

If "customer" means one thing in sales and another in support, you can't join them reliably: rows get dropped or duplicated, and you never get a clean cross-team view. Agreeing on shared dimensions early is dull work that saves months.
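As a rough illustration, a conformed customer dimension can be thought of as one shared key that every source system maps onto. The sketch below assumes two hypothetical systems and assumes that a normalized email address is enough to match customers. In practice, that matching rule is exactly what the teams have to agree on.

```python
# Hypothetical system-specific identifiers for the same real-world customers.
sales_customers = {"S-17": "acme@example.com", "S-18": "bolt@example.com"}
support_customers = {"T-901": "ACME@example.com", "T-902": "zeta@example.com"}

def build_customer_dimension(*sources):
    """Assign one shared key per customer, matching on normalized email.

    Matching on email is an assumption made for this sketch only.
    """
    key_by_email = {}
    source_id_to_key = {}
    for source in sources:
        for source_id, email in source.items():
            normalized = email.strip().lower()
            # Reuse the existing key if this email was seen before.
            key = key_by_email.setdefault(normalized, len(key_by_email) + 1)
            source_id_to_key[source_id] = key
    return source_id_to_key

customer_key = build_customer_dimension(sales_customers, support_customers)
print(customer_key["S-17"] == customer_key["T-901"])  # True
```

Once sales and support facts both carry the shared key, a cross-team report joins on that key instead of on two incompatible IDs.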

3. History

Do you overwrite a customer's old address, or keep both? Slowly changing dimensions sound academic, until finance asks for last year's numbers using last year's customer records and you've already overwritten them.
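The "keep both" option is commonly called a Type 2 slowly changing dimension: instead of overwriting, you close the old row and add a new one with validity dates. Below is a minimal in-memory sketch. The column names (`valid_from`, `valid_to`) and helper functions are assumptions for illustration, not a prescribed layout.

```python
from datetime import date

# Hypothetical customer dimension with one current row.
customer_dim = [
    {"customer_id": 42, "address": "1 Old Road",
     "valid_from": date(2023, 1, 1), "valid_to": None},
]

def change_address(dim, customer_id, new_address, effective):
    """Type 2 change: close the current row and append a new one."""
    for row in dim:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            if row["address"] == new_address:
                return  # nothing changed, keep the current row open
            row["valid_to"] = effective
    dim.append({"customer_id": customer_id, "address": new_address,
                "valid_from": effective, "valid_to": None})

def address_as_of(dim, customer_id, as_of):
    """Return the address that was valid on a given date, or None."""
    for row in dim:
        if (row["customer_id"] == customer_id
                and row["valid_from"] <= as_of
                and (row["valid_to"] is None or as_of < row["valid_to"])):
            return row["address"]
    return None

change_address(customer_dim, 42, "9 New Street", date(2024, 6, 1))
print(address_as_of(customer_dim, 42, date(2023, 12, 31)))  # 1 Old Road
print(address_as_of(customer_dim, 42, date(2024, 7, 1)))    # 9 New Street
```

With an overwrite (often called Type 1), the first lookup would also return the new address, and last year's view would be gone.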

4. Load pattern

Batch or near-real-time, full reload or incremental. This depends on volume and how fresh the data has to be, and it's the easiest of the four to change later. So don't over-engineer it on day one.
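For a feel of the difference between a full reload and an incremental load, here is a small sketch. It assumes the source exposes an `updated_at` value per row and that the loader remembers a "watermark" (the latest value it has already processed). Both are assumptions for this example.

```python
# Hypothetical source rows; ISO date strings compare correctly as text.
source = [
    {"id": 1, "amount": 10, "updated_at": "2024-05-01"},
    {"id": 2, "amount": 20, "updated_at": "2024-05-02"},
    {"id": 3, "amount": 30, "updated_at": "2024-05-03"},
]

def full_reload(source_rows):
    """Rebuild the target from scratch on every run."""
    return {row["id"]: dict(row) for row in source_rows}

def incremental_load(target, source_rows, watermark):
    """Upsert only rows changed after the watermark; return the new watermark."""
    changed = [r for r in source_rows if r["updated_at"] > watermark]
    for row in changed:
        target[row["id"]] = dict(row)
    return max((r["updated_at"] for r in changed), default=watermark)

# Pretend an earlier run already loaded the first two rows.
target = full_reload(source[:2])
watermark = "2024-05-02"

watermark = incremental_load(target, source, watermark)
print(sorted(target))  # [1, 2, 3]
print(watermark)       # 2024-05-03
```

The strict `>` comparison would miss a row that arrives later with the same `updated_at` as the watermark, which is one reason incremental loads need more care than full reloads. Since the target ends up the same either way, swapping one pattern for the other later is usually manageable.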

The first three of those are the ones you live with for years. That is the case for bringing in data warehouse consulting services early: changing a grain or history decision after reports depend on it usually costs more than getting advice up front. The schema you draw in week two is often the schema you're still maintaining in year three.

ETL or ELT — the short, honest version

You'll hear both terms thrown around. The only real difference is where the transformation happens.

ETL — transform first

Clean and reshape the data before it lands in the warehouse. The older pattern. Still a good fit when storage is pricey or the source is messy enough that you want it tidied before loading.

ELT — load first

Drop the raw data in, then transform inside the warehouse. Suits cloud platforms where compute is cheap and you'd rather keep the raw around to re-transform later.

For most teams on a modern cloud warehouse, ELT wins on flexibility. But "raw first" only helps if someone actually governs the raw layer — otherwise it's a swamp with a nicer name.
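The two patterns can be shown side by side with Python's built-in `sqlite3` module standing in for a warehouse. The table names, the sample rows, and the cleaning rule (keep only amounts that start with a digit) are assumptions made for this sketch.

```python
import sqlite3

# Hypothetical raw feed: amounts arrive as text, and one row is unusable.
raw_rows = [("A1", " 100.50 "), ("A2", "n/a"), ("A3", "80")]

def clean(amount_text):
    """Parse an amount, returning None when the text is not a number."""
    try:
        return float(amount_text.strip())
    except ValueError:
        return None

con = sqlite3.connect(":memory:")

# ETL: transform in application code, load only the cleaned result.
con.execute("CREATE TABLE orders_etl (order_id TEXT, amount REAL)")
cleaned = [(oid, clean(a)) for oid, a in raw_rows if clean(a) is not None]
con.executemany("INSERT INTO orders_etl VALUES (?, ?)", cleaned)

# ELT: load the raw text untouched, then transform with SQL in the database.
con.execute("CREATE TABLE orders_raw (order_id TEXT, amount_text TEXT)")
con.executemany("INSERT INTO orders_raw VALUES (?, ?)", raw_rows)
con.execute("""
    CREATE VIEW orders_elt AS
    SELECT order_id, CAST(TRIM(amount_text) AS REAL) AS amount
    FROM orders_raw
    WHERE TRIM(amount_text) GLOB '[0-9]*'
""")

etl = con.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
elt = con.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
raw_count = con.execute("SELECT COUNT(*) FROM orders_raw").fetchone()[0]

print(etl == elt)   # True
print(raw_count)    # 3
```

Both routes produce the same clean rows. The difference is that the ELT side still holds the rejected "n/a" row in `orders_raw`, so the rule can be changed and re-run later. That is also why the raw table needs an owner.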

Questions people actually ask

Can't a spreadsheet do this?

For a while, yes. The break point is usually when more than one person edits the source of truth, or when "last quarter" needs to mean something stable. A spreadsheet enforces no grain and keeps no dependable history. That's the wall you hit.

How big does our data need to be?

Size matters less than people think. A 50 MB dataset feeding ten teams needs a warehouse more than a 2 TB dataset feeding one analyst. The driver is shared definitions, not row count.

Build it ourselves, or get help?

If you've shipped a star schema before and have time to maintain it, build. If it's your first one and it's load-bearing for the business, the cheap mistakes get expensive fast — and that's the case for outside help.

Start with one sentence, not a platform

Before you choose a database, write down what one row of your most important table means. If three people give three answers, you've found your first task — and it has nothing to do with which tool you buy.