The history storage problem

Most blockchains don’t incentivize storage. They incentivize validators to participate honestly in consensus and assume that non-consensus nodes, once they have verified newly added blocks, will store those blocks in part or in full. With Ethereum’s history reaching over 10TB despite its limited throughput and short lifespan, it’s evident that the assumption that non-consensus nodes will continue to store and serve this data won’t hold forever.

Note that history is distinct from the state. History consists of the complete set of transactions that have ever occurred on the network. The state is a snapshot of the network as it currently stands: account balances, smart contract balances, and validator set info. Luckily, the storage assumption for history is a weak one. It is a 1-of-N assumption, where only a single copy of the data needs to persist somewhere on the internet.
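To make the 1-of-N assumption concrete, here is a minimal sketch. It assumes, purely for illustration, that each of N holders keeps its copy independently with the same probability p, so the chance that at least one copy survives is 1 - (1 - p)^N. The function name and the independence assumption are not from any protocol; real holders are likely correlated.

```python
def survival_probability(n_copies: int, p_keep: float) -> float:
    """Probability that at least one of n_copies independent copies persists.

    Assumption (illustrative only): every holder keeps its copy
    independently with the same probability p_keep.
    """
    if n_copies < 0:
        raise ValueError("n_copies must be non-negative")
    if not 0.0 <= p_keep <= 1.0:
        raise ValueError("p_keep must be between 0 and 1")
    # The data is lost only if every single holder drops its copy.
    return 1.0 - (1.0 - p_keep) ** n_copies
```

Even when each holder is unreliable, the probability of losing every copy shrinks exponentially as the number of holders grows, which is why the assumption is considered weak.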

Aside from altruistic nodes, other entities have an incentive to store all the data. For example, block explorers store history so that they can serve it to users who search for past transactions. Indexing protocols like The Graph may also store the entire history to serve API queries for past data.

Applications may choose to store the history of transactions for their users if they want stronger guarantees for retrieval. The same applies to rollups, and even to individual users: if the data is important enough and they require stronger guarantees, they may simply store it themselves. Unlike state, history is cheap to store, since inexpensive hard drives are all that’s required.

Incentivizing a solution

There are multiple ways for blockchains to provide explicit guarantees for the persistent storage of history. An easy fix is to put all the history onto a data storage blockchain like Arweave or Filecoin, which specifically incentivizes storage and retrieval. That’s the route Solana has taken, with history growth currently estimated at 60TB per year.

The other solution is to directly incentivize storage at the protocol level. Nodes would then receive payments corresponding to the amount of data and backups they store. Further incentives may be needed to ensure that nodes continue to serve requests for their data, because paying only for storage does not guarantee that a node will provide data to those who request it. Some data storage blockchains incentivize both.
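As a rough sketch of how such a scheme might combine the two incentives, the hypothetical function below pays a node in proportion to the data and backups it stores, then scales that payment by the fraction of retrieval requests it actually served. The name `epoch_reward`, its parameters, and the pricing rule are all assumptions made for illustration; they do not describe any existing protocol.

```python
def epoch_reward(
    gb_stored: float,
    replicas: int,
    served: int,
    requested: int,
    rate_per_gb: float = 1.0,
) -> float:
    """Hypothetical per-epoch payout for a storage node.

    Storage payment grows with data size and number of backups;
    the retrieval factor penalizes nodes that ignore requests.
    """
    if gb_stored < 0 or replicas < 0 or rate_per_gb < 0:
        raise ValueError("storage inputs must be non-negative")
    if served < 0 or requested < 0 or served > requested:
        raise ValueError("served must be between 0 and requested")

    storage_payment = gb_stored * replicas * rate_per_gb
    if requested == 0:
        # Nobody asked for data this epoch, so no retrieval penalty applies.
        return storage_payment
    serve_ratio = served / requested
    return storage_payment * serve_ratio
```

Under this rule a node that stores data but refuses every request earns nothing, which captures the point that paying for storage alone does not guarantee retrieval.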

Data storage and retrieval differ from data availability in that data availability only ensures that data is available when a new block is proposed. Once the block is appended to the tip of the chain and its data is propagated through the network, the weak storage and retrieval assumptions kick in.

Syncing from genesis

To obtain the most trustless guarantee about the current state of the blockchain, a node can sync the full history of the network from genesis up to the current tip of the chain. This may become problematic if the history of the chain grows exorbitantly large or part of the history is lost. However, since PoS introduces the concept of weak subjectivity, a node need only sync from a checkpoint provided by a trusted peer. With a checkpoint, the network collectively agrees upon a block at a given point in time before which the history cannot be changed, so that history will always be part of the canonical chain. This prevents long-range attacks, which occur when validators unbond and sell their private keys, allowing an adversary to build a longer chain and overtake the current canonical chain.
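The idea of syncing from a checkpoint rather than from genesis can be sketched as follows. This is a simplified illustration only: the `Block` type, its fields, and `verify_from_checkpoint` are hypothetical, and the check covers hash linkage alone. A real PoS client would also verify validator signatures and finality.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    height: int
    parent_hash: str
    payload: str

    def hash(self) -> str:
        # Toy block hash over all fields (illustrative encoding).
        data = f"{self.height}|{self.parent_hash}|{self.payload}".encode()
        return hashlib.sha256(data).hexdigest()


def verify_from_checkpoint(trusted_hash: str, blocks: list[Block]) -> bool:
    """Check that blocks form an unbroken chain starting at the trusted checkpoint.

    blocks[0] must be the checkpoint block itself; nothing before it is needed.
    """
    if not blocks or blocks[0].hash() != trusted_hash:
        return False
    prev = blocks[0]
    for block in blocks[1:]:
        # Each block must point at its predecessor and extend the height by one.
        if block.parent_hash != prev.hash() or block.height != prev.height + 1:
            return False
        prev = block
    return True
```

The node never touches blocks older than the checkpoint, so the cost of joining the network depends on how far the tip is from the checkpoint rather than on the total length of the history.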

While syncing from a checkpoint does introduce further trust assumptions, if we expect these blockchains to survive for decades to come, requiring the ability to sync from genesis is excessive, as it only serves as a bottleneck for scalability and sustainability.