The perfect data ingestion API design
If you ask me, this is pretty much perfect.

The perfect data ingestion API design... does not exist 🙂.
I used the title to catch your attention, but I do think I’ve designed something close to perfect. Check it out and tell me what you'd change.
Easy to use
You can push data with any programming language in a few lines of code.
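For instance, a minimal Python sketch (the endpoint URL, token, and name parameter are made up for illustration, not any real API):

```python
# Minimal sketch: push one JSON event over HTTPS.
# The endpoint, token, and "name" parameter are hypothetical.
import json
import requests

resp = requests.post(
    "https://api.example.com/v0/events",                    # hypothetical endpoint
    params={"name": "app_events"},                          # hypothetical target table
    headers={"Authorization": "Bearer YOUR_INGESTION_TOKEN"},
    data=json.dumps({"timestamp": "2024-01-01T12:00:00Z",
                     "user_id": 42, "action": "signup"}),
)
print(resp.status_code, resp.text)
```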
A format for the web
It accepts NDJSON and JSON. Maybe I'd add support for Parquet, but I think compressed NDJSON is good enough.
Being web-compatible means you can point almost any kind of webhook at it, or send data straight from a JavaScript snippet.
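Here's a sketch of how a compressed NDJSON payload is typically built on the client side; whether the endpoint accepts a gzip Content-Encoding header is an assumption, so check the docs of whatever you're sending to:

```python
# Sketch: serialize a list of events as NDJSON (one JSON object per line)
# and gzip it before sending. Gzip support on the endpoint is an assumption.
import gzip
import json

events = [
    {"page": "/pricing", "load_ms": 123},
    {"page": "/docs", "load_ms": 87},
]

ndjson = "\n".join(json.dumps(e) for e in events).encode("utf-8")
body = gzip.compress(ndjson)
headers = {"Content-Type": "application/x-ndjson", "Content-Encoding": "gzip"}
# POST `body` with `headers` as in the snippet above.
```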
Schema >>> schemaless
When working with a lot of data, schemaless is a waste of money and resources, both in storage and processing. The API transforms event attributes into columns (stored with the right type in a columnar database), leading to 10x-100x improvements in both.
You can always store the raw data and process it later but, in general, that's a bad idea.
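To make that concrete, here's roughly what the transformation looks like. The column type names are my own illustration, not the API's actual output:

```python
# Illustration: a JSON event's attributes become typed columns instead of a
# single schemaless blob. The type names below are assumptions for the example.
event = {
    "timestamp": "2024-01-01T12:00:00Z",
    "user_id": 42,
    "action": "signup",
    "revenue": 9.99,
}

derived_columns = {
    "timestamp": "DateTime",   # parsed from the ISO 8601 string
    "user_id": "UInt64",
    "action": "String",
    "revenue": "Float64",
}
# Typed columns compress and scan far better than raw JSON, which is where
# the 10x-100x storage and processing savings come from.
```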
ACK
The API sends you an ack when the data has been received and safely stored. After that you can forget about it; you know it will eventually be written to the database.
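In client terms the contract is simple: got the ack, drop your local copy; no ack, keep the batch. A sketch (the endpoint, token, and status handling are assumptions):

```python
# Sketch of the ack contract: a 2xx response means the batch is safely
# stored, so the caller can forget it. Endpoint and token are hypothetical.
import json
import requests

def send_batch(events: list[dict]) -> bool:
    payload = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    resp = requests.post(
        "https://api.example.com/v0/events",
        params={"name": "app_events"},
        headers={"Authorization": "Bearer YOUR_INGESTION_TOKEN"},
        data=payload,
    )
    return resp.ok  # ack received: data will eventually reach the database

batch = [{"user_id": 42, "action": "signup"}]
if send_batch(batch):
    batch.clear()   # safe to forget the local copy
```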
Failing gracefully
Things fail, and this is the most interesting part. If an insert fails, you want to know with 100% certainty. And if your app dies while you are pushing data, should you retry?
The API is idempotent. You can retry within a 5-hour window, and if the data was already inserted, it won't be inserted again as long as you send the same batch (the API uses a hash of the data to know whether it was already written).
The first layer of the API is so simple that, even if something fails internally, in almost every case the data is at least buffered.
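So a safe client just retries the exact same bytes; if they already made it in, the server-side hash deduplicates them. A sketch with a naive backoff (the retry policy here is arbitrary, the endpoint hypothetical):

```python
# Sketch: retry the *same* serialized batch so the server-side hash matches
# and duplicates are dropped. Backoff values are arbitrary; endpoint and
# token are hypothetical.
import json
import time
import requests

def send_with_retries(events: list[dict], attempts: int = 5) -> bool:
    payload = "\n".join(json.dumps(e) for e in events).encode("utf-8")  # fixed bytes
    for attempt in range(attempts):
        try:
            resp = requests.post(
                "https://api.example.com/v0/events",
                params={"name": "app_events"},
                headers={"Authorization": "Bearer YOUR_INGESTION_TOKEN"},
                data=payload,
                timeout=10,
            )
            if resp.ok:
                return True
        except requests.RequestException:
            pass  # network error or app died mid-request: safe to retry
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    return False
```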
Buffering
Speaking of buffering... the API does buffer data. This is generally good performance hygiene for an ingestion API, but it's also critical if you have an analytical database (as we do). These databases aren't built to accept streaming inserts; they need to ingest data in batches, otherwise it's too expensive (both in CPU and in S3 write operations).
This buffer layer also works as a safety net when things fail. For example, it's quite easy to overload a database; the buffer helps you absorb that without even noticing.
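The same principle is easy to picture on the client side too. Here's a toy buffer that flushes either when it has enough events or when enough time has passed (the thresholds and the flush callback are arbitrary, just to illustrate the batching idea):

```python
# Toy illustration of the batching idea: accumulate events and flush them
# in batches instead of doing one insert per event. Thresholds are arbitrary.
import time
from typing import Callable

class EventBuffer:
    def __init__(self, flush: Callable[[list[dict]], None],
                 max_events: int = 1000, max_seconds: float = 4.0):
        self.flush = flush
        self.max_events = max_events
        self.max_seconds = max_seconds
        self.events: list[dict] = []
        self.last_flush = time.monotonic()

    def add(self, event: dict) -> None:
        self.events.append(event)
        too_many = len(self.events) >= self.max_events
        too_old = time.monotonic() - self.last_flush >= self.max_seconds
        if too_many or too_old:
            self.flush(self.events)
            self.events = []
            self.last_flush = time.monotonic()
```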
Scale
You can throw 1,000 QPS at it with one event per request, or 200 QPS with 50 MB payloads. Even if you have a lot of data, that covers at least 99% of use cases.
Real time
Even with some buffering, it works in real time. It usually takes no more than 4 seconds for the data to be available to query in the database, and even that can be reduced to close to a second.
And in general, it just works.
Try it
What do you think? Is it the perfect data ingestion API? Try it out and let me know.