This is a basic concept in accounting. The general ledger is an immutable log of transactions. Other accounting documents are constructed from the general ledger, and can, if necessary, be rebuilt from it. This is the accepted way to do money-related things.
Synchronization is called "reconciliation" in accounting terminology.
The computer concept is that we have a current state, and changes to it come in. The database with the current state is authoritative. This is not suitable for handling money.
The real question is, do you really care what happened last month? Last year? If yes, a log-based approach is appropriate.
I’ve always concurred with the Helland/Kleppmann observation mentioned, viz. that the transaction log of a typical RDBMS is the canonical form and all the rows & tables merely projections.
It’s curious that over those projections, we then build event stores for CQRS/ES systems, ledgers etc, with their own projections mediated by application code.
But look underneath too. The journaled filesystem on which the database resides also has a log representation, and under that, a modern SSD is using an adaptive log structure to balance block writes.
It’s been a long time since we wrote an application event stream linearly straight to media, and although I appreciate the separate concerns that each of these layers addresses, I’d probably struggle to justify them all from first principles to even a slightly more Socratic version of myself.
This is similar to the observation that memory semantics in most any powerful machine since the late 60s are implemented using messaging, and then applications go ahead and build messaging out of memory semantics. Or the more general observation that every layer of information exchange tends towards implementing packet switching if there's sufficient budget (power/performance/cost) to support doing so.
> It’s curious that over those projections, we then build event stores for CQRS/ES systems, ledgers etc, with their own projections mediated by application code.
The database only supports CRUD. So while the CDC stream is the truth, it's very low level. We build higher-level event types (as in event sourcing) for the same reason we build any higher-level abstraction: it gives us a language in which to talk about business rules. Kleppmann makes this point in his book and it was something of an aha moment for me.
You're correct on all points. Some additional refining points regarding accounting concepts:
- General ledgers are formed by way of transactions recorded as journal entries. Journal entries are where two or more accounts from the general ledger are debited & credited such that total debits equal total credits. For example, a sale will involve a journal entry which debits cash or accounts receivable and credits revenue.
- The concept of debits always needing to equal credits is the most important and fundamental control in accounting. It is the core idea around which all of double-entry bookkeeping is built.
- Temporally ordered journal entries are what form a log from which a general ledger can be derived. That log of journal entries is append-only and immutable. If you make a mistake with a journal entry, you typically don't delete it; you just make another adjusting (i.e. correcting) entry.
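Those three rules translate almost directly into code. Here's a minimal Python sketch (all names are illustrative, not from any real accounting library): journal entries that refuse to exist unless debits equal credits, an append-only log, balances derived from the log, and a correction made by appending rather than deleting.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JournalEntry:
    # postings: ((account, signed_amount), ...) -- positive = debit, negative = credit
    postings: tuple
    memo: str = ""

    def __post_init__(self):
        # The fundamental control: total debits must equal total credits.
        if sum(amount for _, amount in self.postings) != 0:
            raise ValueError("journal entry does not balance: debits != credits")

class GeneralLedger:
    """A derived view over an append-only log of journal entries."""
    def __init__(self):
        self._log = []  # append-only; entries are never mutated or deleted

    def append(self, entry: JournalEntry):
        self._log.append(entry)

    def balances(self):
        # The ledger can be rebuilt from the log at any time.
        accounts = {}
        for entry in self._log:
            for account, amount in entry.postings:
                accounts[account] = accounts.get(account, 0) + amount
        return accounts

# A sale: debit cash, credit revenue.
gl = GeneralLedger()
gl.append(JournalEntry((("cash", 100), ("revenue", -100)), "sale"))
# Made a mistake? Don't delete -- append an adjusting entry that reverses it.
gl.append(JournalEntry((("cash", -100), ("revenue", 100)), "correcting entry"))
```

The append-only discipline is what makes the ledger auditable: every balance is explainable as a replay of the log.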
Having a traditional background in accounting as a CPA, as a programmer I have written systems built around a log of temporally ordered transactions that can be used to construct state across time. Colleagues who didn't have that background found it interesting but very strange as an idea (which led to a lot of really interesting discussions!). It was totally strange to me that they found it odd, because it was the most comfortable & natural way for me to think about many problems.
Could you recommend some resource to understand this view of accounting better?
This is why EG-Walker is so important, diamond types adoption and a solid TS port can't come soon enough for distributed systems.
You over estimate ERP and accounting systems.
This post makes a great case for how universal logs are in data systems. It was strange to me that there was no log-as-service with the qualities that make it suitable for building higher-level systems like durable execution: conditional appends (as called out by the post!), support for very large numbers of logs, high throughput with strict ordering, and generally a simple serverless experience like object storage. This led to https://s2.dev/ which is now available in preview.
It was interesting to learn how Restate links events for a key, with key-level logical logs multiplexed over partitioned physical logs. I imagine this is implemented with a leader per physical log, so you can consistently maintain an index. A log service supporting conditional appends allows such a leader to act like the log is local to it, despite offering replicated durability.
Leadership can be an important optimization for most systems, but shared logs also allow for multi-writer systems pretty easily. We blogged about this pattern https://s2.dev/blog/kv-store
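To make "conditional appends" concrete, here is a toy in-process model of the primitive (this is illustrative, not the s2 or Restate API): an append carries the tail position the writer expects, and fails if another writer got there first. That check is what lets a leader treat a replicated log as if it were local, and what makes multi-writer patterns safe.

```python
import threading

class SharedLog:
    """Toy log with conditional appends: an append succeeds only if the
    caller's expected tail offset matches the log's actual tail."""
    def __init__(self):
        self._entries = []
        self._lock = threading.Lock()

    def append_if(self, expected_next_offset: int, record) -> bool:
        with self._lock:
            if len(self._entries) != expected_next_offset:
                return False  # another writer appended first; caller must re-read and retry
            self._entries.append(record)
            return True

    def read_from(self, offset: int):
        return list(self._entries[offset:])

# A writer reads to the tail, decides, then appends conditionally --
# effectively an optimistic-concurrency compare-and-swap on the log tail.
log = SharedLog()
assert log.append_if(0, {"op": "put", "key": "a", "value": 1})
assert not log.append_if(0, {"op": "put", "key": "a", "value": 2})  # lost the race
```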
> It was strange to me that there was no log-as-service with the qualities that make it suitable for building higher-level systems like durable execution
There are several services like that, but they are mostly kept behind the scenes as a competitive advantage when building distributed systems. AWS uses one behind the scenes for many services, as mentioned here by Marc Brooker https://brooker.co.za/blog/2024/04/25/memorydb.html. Facebook has similar systems like LogDevice https://logdevice.io/, and more recently Delos https://research.facebook.com/publications/log-structured-pr...
> log as a service
very exciting. this is the future. i am working on a very similar concept. every database is a log at its core, so the log, which is the highest performance part of the system, is buried behind many layers of much lower performing cruft. edge persistence with log-per-user application patterns opens up so many possibilities.
I just want a recognized standard format for write ahead logs. Start with replicating data between OLTP and OLAP databases with minimal glue code, and start moving other systems to a similar structure, like Kafka, then new things we haven’t thought of yet.
Any blockchain that is built upon PBFT derivative is such a system.
What about journalctl?
A short summary:
Complex distributed coordination and orchestration is at the root of what makes many apps brittle and prone to inconsistencies.
But we can mitigate much of this complexity with a neat trick, building on the fact that every system (database, queue, state machine) is effectively a log under the hood. By implementing interactions with those systems as (conditional) events on a shared log, we can build amazingly robust apps.
If you have come across “Turning the Database Inside Out” (https://martin.kleppmann.com/2015/11/05/database-inside-out-...), you can think of this a bit like “Turning the Microservice Inside Out”
The post also looks at how this can be used in practice, given that our DBs and queues aren't built like this, and how to strike a sweet-spot balance between this model with its great consistency, and maintaining healthy decoupling and separation of concerns.
Is this summary AI generated?
Since we’re on the subject of logs and embarrassingly parallel distributed systems, I know someone, also in NYC, who’s been building a project exactly along these lines. It’s called gossiplog and it uses Prolly trees to achieve some interesting results.
https://www.npmjs.com/package/@canvas-js/gossiplog
Joel Gustafson started this stuff at MIT and used to work at Protocol Labs. It’s very straightforward. By any chance sewen do you know him?
I first became aware of his work when he posted “Merklizing the key/value store for fun and profit” or something like that. Afterwards I looked at log protocols, including the SLEEP protocol for Dat/Hypercore/Pear, and time-travel DBs that track diffs, including Dolt and even Quadrable.
https://news.ycombinator.com/item?id=36265429
Gossiplog’s README says exactly what this article says— everything is a log underneath and if you can sync that (using prolly tree techniques) people can just focus on business logic and get sync for free!
Never encountered it before, but it looks cool.
I think they are trying to solve a related problem. "We can consolidate the work by making a generic log that has networking and syncing built-in. This can be used by developers to make automatically-decentralized apps without writing a single line of networking code."
At first glance, I would say that Gossiplog is a bit more low-level, targeting developers of databases and queues, to save them from re-building a log every time. But then there are elements of sharing the log between components. Worth a deeper look, but it seems like a lower-level abstraction.
There’s also OrbitDB https://github.com/orbitdb/orbitdb which to my understanding has been a pioneer for p2p logs, databases and CRDTs.
Thank you @EGreg for sharing this.
Some clarification on what "one log" means here:
- It means using one log across different concerns like state a, communication with b, lock c. Often that is in the scope of a single entity (payment, user, session, etc.) and thus the scope for the one log is still small. You would have a lot of independent logs still, for separate payments.
- It does _not_ mean that one should share the same log (and partition) for all the entities in your app, like necessarily funneling all users, payments, etc. through the same log. That actually goes beyond the proposal here - it has some benefits of its own, but it has a hard time scaling.
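The "many small logical logs over a few physical logs" idea can be sketched in a few lines of Python (a simplification of my own, not how any particular system implements it): each entity key hashes to a partition, and the entity's logical log is just the subsequence of that physical log tagged with its key.

```python
import hashlib

NUM_PARTITIONS = 8
# One physical log per partition; each holds interleaved (key, event) records.
physical_logs = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(key: str) -> int:
    # Stable hash, so a given entity always lands on the same physical log
    # (and thus keeps a strict order for its own events).
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

def append(key: str, event) -> None:
    physical_logs[partition_for(key)].append((key, event))

def logical_log(key: str):
    # The key-level logical log is a filtered view of one physical log.
    return [event for k, event in physical_logs[partition_for(key)] if k == key]

append("payment-17", "created")
append("payment-42", "created")
append("payment-17", "captured")
```

Each payment gets its own small, ordered history, while the system only has to operate a fixed number of physical logs.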
Haven't formed thoughts on the content yet, but happy to see a company launching something non-AI for a change.
A notable example of a large-scale app built with a very similar architecture is ATproto/Bluesky[1].
"ATProto for Distributed Systems Engineers" describes how updates from users end up in their own small databases (each called a PDS) and then in a replicated log. What we traditionally think of as an API server (called a view server in ATProto) is simply one among the many materializations of this log.
I personally find this model of thinking about dataflow in large-scale apps pretty neat and easy to understand. The parallels are unsurprising since both the Restate blog and ATProto docs link to the same blog post by Martin Kleppmann.
This architecture seems to be working really well for Bluesky; they clearly aced multiple 10x events very recently.
[1]: https://atproto.com/articles/atproto-for-distsys-engineers
That blog post is a great read as well. Truly, the log abstraction [1] and "Turning the DB inside out" [2] have been hugely influential.
In a way, the article here suggests extending that:
(1) from a log that represents data (upserts, CDC, etc.) to a log of coordination commands (update this, acquire that lock, journal that step)
(2) having a way to link the events related to a broader operation (handler execution) together
(3) making the log aware of handler execution (better yet, putting it in charge), so you can automatically fence outdated executions
[1] https://engineering.linkedin.com/distributed-systems/log-wha...
I’ve been doing a similar thing, although I called it “append only transaction ledgers”. Same idea as a log. A few principles:
- The order of log entries does not matter.
- Users of the log are peers. No client / server distinction.
- When appending a log entry, you can send a copy of the append to all your peers.
- You can ask your peers to refresh the latest log entries.
- When creating a new entry, it is a very good idea to have a nonce field. (I use nano IDs for this purpose along with a timestamp, which is probabilistically unique.)
- If you want to do database style queries of the data, load all the log entries into an in memory database and query away.
- You can append a log entry containing a summary of all log entries you have so far. For example: you’ve been given 10 new customer entries. You can create a log entry of “We have 10 customers as of this date.”
- When creating new entries, prepare the entry or list of entries in memory, allow the user to edit/revise them as a draft, then when they click “Save”, they are in the permanent record.
- To fix a mistake in an entry, create a new entry that “negates” that entry.
A lot of parallelism / concurrency problems just go away with this design.
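Several of these principles (nonce IDs, order-independent merging between peers, negation entries) fit in a short Python sketch. This is one way it could look, with all names invented for illustration:

```python
import time
import uuid

def new_entry(kind, data):
    # Nonce + timestamp makes entries probabilistically unique, so peers
    # can exchange and merge logs without a client/server distinction.
    return {"id": uuid.uuid4().hex, "ts": time.time(), "kind": kind, "data": data}

def merge(local, remote):
    # Order of entries doesn't matter: sync is just a union, deduplicated
    # by nonce, in whatever order the entries arrive.
    seen = {e["id"] for e in local}
    return local + [e for e in remote if e["id"] not in seen]

def live_entries(entries, negation_kind="negation"):
    # A negation entry names the id of the entry it cancels; to fix a
    # mistake you append a negation rather than deleting anything.
    negated = {e["data"] for e in entries if e["kind"] == negation_kind}
    return [e for e in entries
            if e["id"] not in negated and e["kind"] != negation_kind]
```

For database-style queries, `live_entries(...)` is the set you'd load into an in-memory database; summaries are just another entry kind appended on top.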
Isn't this just trading consistency for availability? From what I understand the single log is single node. What happens when throughput is not enough at scale? "We will distribute the log." you say. Well, then we are back to square one.
> Restate is open source and you can download it at...
https://github.com/restatedev/restate/blob/main/LICENSE#L1
> Business Source License 1.1
https://spdx.org/licenses/BUSL-1.1.html
> The Business Source License (this document, or the “License”) is not an Open Source license.
Suggest exploring e.g. https://github.com/dbos-inc/dbos-transact-py
Using one-log-only for an entire system does have its upsides, but it will kill performance. It would be like building a CRUD system with a single mutex for everyone to share.
sewen (et al)
This is lovely and I'm itching to try it. One question:
We have a use case where a location gets cut off completely from the internet at large. In that case, it makes sense for the local hardware (typically Android and/or iOS tablets or equivalent) to take over as a log owner: even though you're cut off, if you're willing to swallow the risk (and hence cost) of offline payments, you should be able to create orders, fulfill them, pay for them, close them out, send tickets to the kitchen to cook the food or to the warehouse to fetch the tractor, etc.
Does restate include something that covers that use-case? In the noodling/daydreaming a colleague and I have done, we ended up with something very close to restate (I imagined just using Kafka), except that additionally many operations would have a CRDT nature: eg. you should _always_ be allowed to add a payment to an order, because presumably a real-life payment happened.
I've also noodled with the idea of logs whose canonical ownership can be transferred. That covers cases where you start offline and then reconnect, but doesn't work so well for transactions that start out connected (and thus owned in the datacenter) and need to continue offline.
One could also imagine ensuring that > n/2 consensus members are always located inside the restaurant/hardware store/etc., so if you go offline, you can still proceed. It might even be possible to recognize disconnection and then take one full vote to further subdivide that pool of consensus members so if one dies it doesn't halt progress. This feels like it would be getting very tricksy…
Excuse me for sounding rough, but - isn't this reinventing comp-sci, one step at a time?
I learned about distributed, incrementally-monotonic logs back in the late 90s, along with many other ways to do guaranteed transactional database actions. And I'm quite certain these must have been invented in the 50s or 60s, as these are the problems that early business computer users had: banking software. These are the techniques that were buried in legacy COBOL routines and needed to be slowly replaced by robust Java core services.
I'm sure the Restate designers will have learned terribly useful insights in how to translate these basic principles into a working system with the complexities of today's hardware/software ecosystem.
Yet it makes me wonder if young programmers are only being taught the "move fast and break things" mentality, and whether there are no longer SW engineers able to build these guarantees into their systems from the beginning, standing on the shoulders of the ancients who invented our discipline, so that their lore is actually used in practice. Or am I just missing something new in the article that describes some novel twist?
My takeaway from this article is that the proposed solution for distributed app coordination is a shared, centralized log. What did I miss?
> Having a single place (the one log) that forces a linear history of events as the ground truth and owns the decision of who can add to that ground truth, means we don’t have to coordinate much any more.
Well, yes, but then you've backed into CAP again because you only have one log.
I wonder how this compares, conceptually, to Temporal? While Temporal doesn't talk about a single centralized log, I feel the output is the same: your event handlers become durable and can be retried without re-executing certain actions with outside systems. Both Restate and Temporal feel, as a developer coding these event handlers, like a framework where they handle a lot of the "has this action been performed yet?" and such for you.
Though to be fair I've only read Temporal docs, and this Restate blog post, without much experience in either. Temporal may not have as much on the distributed locking (or concept of) side of things that Restate does, in this post.
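The shared mechanic behind both systems' "has this action been performed yet?" behavior is, as I understand it, a journal of completed steps that gets replayed on retry. A minimal sketch of that idea in Python (my own simplification, not the Temporal or Restate API):

```python
class DurableHandler:
    """Replay-style durable execution: each side-effecting step records its
    result in a journal; on retry, completed steps return the journaled
    result instead of re-executing the side effect."""
    def __init__(self, journal=None):
        self.journal = journal if journal is not None else {}

    def step(self, name, fn):
        if name in self.journal:      # already performed on a previous attempt
            return self.journal[name]
        result = fn()                 # perform the side effect exactly once
        self.journal[name] = result   # a real system persists this durably
        return result

calls = []  # track real side effects, to show they don't repeat on retry

def create_order():
    calls.append("create")
    return "order-1"

def charge_card():
    calls.append("charge")
    return "charged"

def handler(h):
    order_id = h.step("create_order", create_order)
    h.step("charge_card", charge_card)
    return order_id

h = DurableHandler()
handler(h)
# A retry with the same journal replays results instead of re-charging.
handler(DurableHandler(h.journal))
```

In real systems the journal lives in the (replicated) log rather than a dict, which is exactly where this thread's "everything is a log" framing comes back in.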
I am a huge fan of append-only logs as a fundamental architectural principle. The Log [1] should be required reading for any CS undergraduate.
[1]: https://engineering.linkedin.com/distributed-systems/log-wha...
It sounds like they have just re-discovered Distributed Transactions with a Distributed Transaction Coordinator.
But DTs have a huge problem: What happens if the owner of the lock netsplits?
Either the DTC waits (potentially forever?) for the owner of the lock to get back in touch and release the lock, or a timeout is applied and now the owner of the lock (who may be unaware of the netsplit) will be out of sync with the system.
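One classic mitigation for the "owner is unaware it lost the lock" half of that dilemma is fencing tokens (described in Kleppmann's writing on distributed locks): the lock service issues a monotonically increasing token with each grant, and the protected resource rejects writes carrying a stale token. A toy sketch, with invented names:

```python
class FencedResource:
    """A resource that rejects writes from stale lock holders.

    The lock service hands out an increasing fencing token with each
    grant; after a timeout re-grants the lock, the old holder's token
    is lower than the new one, so its late writes are refused."""
    def __init__(self):
        self.highest_token_seen = 0
        self.value = None

    def write(self, token: int, value) -> bool:
        if token < self.highest_token_seen:
            return False  # stale holder (e.g. back from a netsplit): fenced off
        self.highest_token_seen = token
        self.value = value
        return True

res = FencedResource()
assert res.write(1, "from original holder")       # granted with token 1
assert res.write(2, "from new holder")            # lock re-granted after timeout
assert not res.write(1, "late write after split") # old holder returns, rejected
```

This doesn't remove the need for timeouts, but it turns "out of sync with the system" into a detectable, rejected write rather than silent corruption. Conditional appends on a log give you the same effect for free: the outdated writer's append simply fails.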
This is very compelling, nice work. I'm going to spend some quality time on this.
This is exactly this example from Temporal: https://github.com/temporal-sa/temporal-order-fulfill-demo
> If everything’s in one log, there’s nothing to coordinate #
On the contrary. Everything becomes coordinated.
The entire "log" becomes a giant ass mutex lock. Good luck scaling it.
Great post! At pico we've been spending a lot of time thinking about logs and a distributed system that can read and respond to events from logs. This is being driven in part by building out global services and a need for centralized logs for monitoring.
The end result is https://pipe.pico.sh which provides authenticated, networked *nix pipes over SSH. Since it relies on stdin/stdout via SSH, it's one of the easiest pubsub systems we've used and we keep finding its ergonomics powerful. We have a centralized log-drain, metric-drain, and cache-clearing-drain all using `pipe`.
The diagrams are remarkably neat, with a feeling of both Excalidraw and draw.io. Anyone know what tool was used to create those?
What's your take on handling log compaction to prevent unbounded growth, especially in systems with high write throughput?
Or, every database is bitemporal, some just don’t know it yet
looks like a new generation is ready to discover Paxos, Zab, Raft
How would you compare this to the actor model or to temporal?
Is there a hosted offering? (Or plans to offer one?)
This is basically CSP, no?