We’re facing the end of the cloud. It’s a bold statement, I know, and maybe it even sounds a little mad. But bear with me.
The conventional wisdom about running server applications, be it web apps or mobile app backends, is that the future is in the cloud. Amazon, Google, and Microsoft are adding layers of tools to their cloud offerings to make running server software more and more easy and convenient, so it would seem that hosting your code in AWS, GCP, or Azure is the best you can do — it’s convenient, cheap, easy to fully automate, you can scale elastically … I could keep going. So why am I predicting the end of it all?
A few reasons:
It can’t meet long-term scaling requirements. Building a scalable, reliable, highly available web application, even in the cloud, is pretty difficult. And if you do it right and make your app a huge success, the scale will cost you both money and effort. Even if your business is really successful, you eventually hit the limits of what the cloud, the web itself can do: The compute speed and storage capacity of computers are growing faster than the bandwidth of the networks. Ignoring the net neutrality debate, this may not be a problem for most (apart from Netflix and Amazon) at the moment, but it will be soon. The volumes of data we’re pushing through the network are growing massively as we move from HD, to 4k to 8k, and soon there will be VR datasets to move around.
This is a problem mostly because of the way we’ve organized the web. There are many clients that want to get content and use programs and only a relatively few servers that have those programs and content. When someone posts a funny picture of a cat on Slack, even though I’m sitting next to 20 other people who want to look at that same picture, we all have to download it from the server where it’s hosted, and the server needs to send it 20 times.
As servers move to the cloud, i.e. onto Amazon’s or Google’s computers in Amazon’s or Google’s data centers, the networks close to these places need to have incredible throughput to handle all of this data. There also have to be huge numbers of hard drives that store the data for everyone and CPUs that push it through the network to every single person that wants it. This gets worse with the rise of streaming services.
All of that activity requires a lot of energy and cooling and makes the whole system fairly inefficient, expensive, and bad for the environment.
It’s centralized and vulnerable. The other issue with centrally storing our data and programs is availability and permanence. What if Amazon’s data center gets flooded, hit by an asteroid, or destroyed by a tornado? Or, less drastically, what if it loses power for a while? The data stored on its machines now can’t be accessed temporarily or even gets lost permanently.
We’re generally mitigating this problem by storing data in multiple locations, but that only means more data centers. That may greatly reduce the risk of accidental loss, but how about the data that you really, really care about? Your wedding videos, pictures of your kids growing up, or important public information sources, like Wikipedia. All of that is now stored in the cloud — on Facebook, in Google Drive, iCloud, or Dropbox and others. What happens to the data when any of these services go out of business or lose funding? And even if they don’t, it is pretty restricting that in order to access your data, you have to go to their services, and to share it with friends, they have to go through these services too.
It demands trust but offers no guarantees. The only way for your friends to trust that the data they get is the data you sent is by trusting the middleman and their honesty. This is okay in most cases, but websites and networks we use are operated by legal entities registered in nation states, and the governments of these nations have the power to force them to do a lot of things. While most of the time, this is a good thing and is used to help solve crime or remove illegal content from the web, there are also many cases where this power has been abused.
Just a few weeks ago, the Spanish government did everything in its power to stop an independence referendum in the Catalonia region, including blocking information websites telling people where to vote. Blocking inconvenient websites or secretly modifying content on its way to users has long been a standard practice in places like China. While free speech is probably not a high-priority issue for most Westerners, it would be nice to keep the internet as free and open as it was intended to be and have a built-in way of verifying that content you are reading is the content the authors published.
It makes us — and our data — sitting ducks. The really scary side of the highly centralized internet is the accumulation of personal data. Large companies that provide services we all need to use in one way or another are sitting on monumental caches of people’s data — data that gives them enough information about you to predict what you’re going to buy, who you’re going to vote for, when you are likely to buy a house, even how many children you’re likely to have. Information that is more than enough to get a credit card, a loan, or even buy a house in your name.
You may be ok with that. After all, they were trustworthy enough for you to give them your information in the first place, but it’s not them you need to worry about. It’s everyone else. Earlier this year, credit reporting agency Equifax lost data on 140 million of its customers in one of the biggest data breaches in history. That data is now public. We can dismiss this as a once in a decade event that could have been prevented if we’d been more careful, but it is becoming increasingly clear that data breaches like this are very hard to prevent entirely and too dangerous to tolerate. The only way to really prevent them is to not gather the data on that scale in the first place.
So, what will replace the cloud?
An internet powered largely by client-server protocols (like HTTP) and security based on trust in a central authority (like TLS), is flawed and causes problems that are fundamentally either really hard or impossible to solve. It’s time to look for something better — a model where nobody else is storing your personal data, large media files are spread across the entire network, and the whole system is entirely peer-to-peer and serverless (and I don’t mean “serverless” in the cloud-hosted sense here, I mean literally no servers).
I’ve been reading extensively about emerging technologies in this space and have become pretty convinced that peer-to-peer is where we’re inevitably going. Peer-to-peer web technologies are aiming to replace the building blocks of the web we know with protocols and strategies that solve most of the problems I’ve outlined above. Their goal is a completely distributed, permanent, redundant data storage, where each participating client in the network is storing copies of some of the data available in it.
Above: Source: Wikimedia Commons
If you’ve heard about BitTorrent, the following should all sound familiar. In BitTorrent, users of the network share large data files split into smaller blocks (each with a unique ID) without the need for any central authority. In order to download a file, all you need is a “magic” number — a hash — a fingerprint of the content. The BitTorrent client will then find peers that have pieces of the file and download them, until you have all the pieces.
The interesting part is how the peers are found. BitTorrent uses a protocol called Kademlia for this. In Kademlia, each peer on the network has a unique ID number, which is of the same length as the unique block IDs. It stores a block with a particular ID on a node whose ID is “closest” to the ID of the block. For random IDs of both blocks and network peers, the distribution of storage should be pretty uniform across the network. There is a benefit, however, to not choosing the block ID randomly and instead using a cryptographic hash — a unique fingerprint of the content of the block itself. The blocks are content-addressable. This also makes it easy to verify the content of the block (by re-calculating and comparing the fingerprint) and provides the guarantee that given a block ID, it is impossible to download any other data than the original.
The other interesting property of using a content hash for addressing is that by embedding the ID of one block in the content of another, you link the two together in a way that can’t be tampered with. If the content of the linked block is changed, its ID would change and the link would be broken. If the embedded link is changed, the ID of the containing block would change as well.
This mechanism of embedding the ID of one block in the content of another makes it possible to create chains of such blocks (like the blockchain powering Bitcoin and other cryptocurrencies) or even more complicated structures, generally known as Directed Acyclic Graphs, or DAGs for short. (This kind of link is called a Merkle link after the inventor Ralph Merkle. So if you hear someone talking about Merkel DAGs, you know roughly what they are.) One common example of a Merkle DAG is git repositories. Git stores the commit history and all directories and files as blocks in a giant Merkle DAG.
And that leads us to another interesting property of distributed storage based on content-addressing: It’s immutable. The content cannot change in place. Instead, new revisions are stored next to existing ones. Blocks that have not changed between revisions get reused, because they have, by definition, the same ID. This also means identical files cannot be duplicated in such a storage system, translating into efficient storage. So on this new web, every unique cat picture will only exist once (although in multiple redundant copies across the swarm).
Protocols like Kademlia together with Merkle chains and Merkle DAGs give us the tools to model file hierarchies and revision timelines and share them in a large-scale peer-to-peer network. There are already protocols that use these technologies to build a distributed storage that fits our needs. One that looks very promising is IPFS.
The problem with names and shared things
Ok, so with the above techniques, we can solve quite a few of the problems I outlined at the beginning: We get distributed, highly redundant storage on devices connected to the web that can keep track of the history of files and keep all the versions around for as long as they are needed. This (almost) solves the availability, capacity, permanence, and content verification problem. It also addresses bandwidth problems — peers send data to each other, so there are no major hotspots/bottlenecks.
We will also need a scalable compute resource, but this shouldn’t be too difficult: Everyone’s laptops and phones are now orders of magnitude more powerful than what most apps need (including fairly complex machine learning computations), and compute is generally pretty horizontally scalable. So as long as we can make every device do the work necessary for its user, there shouldn’t be a major problem.
So now that cat image I want to see on Slack can come from one of my co-workers sitting next to me instead of from the Slack servers (and without crossing any oceans in the process). In order to post a cat picture, though, I need to update a channel in place (i.e., the channel will no longer be what it was before my message, it will have changed). This fairly innocuous sounding thing turns out to be the hard part (Feel free to skip to the next section if this bit gets too technical).
The hard part: Updating in place
The concept of an entity that changes over time is really just a human idea to give the world some order and stability in our minds. We can also think about such an entity as just an identity — a name — that takes on a series of different values (which are static, immutable) as time progresses (Rich Hickey explains this really well in his talks Are we there yet? and The value of values). This is a much more natural way of modelling information in a computer, with more natural consequences. If I tell you something, I can no longer change what I told you, or make you unlearn it. Facts, e.g. who the President of the United States is, don’t change over time; they just get superseded by other facts referred to by the same name, the same identity. In the git example, a ref (branch or tag) can point to (hold an ID and thus a value of) a different commit at different times, and making a commit replace the value it currently holds. The Slack channel would also represent an identity whose values over time are growing lists of messages.
The real trouble is, we’re not alone in the channel. Multiple people try to post messages and change the channel, sometimes simultaneously, and someone needs to decide what the result should be.
In centralized systems, such as pretty much all current web apps, there is a single central entity deciding this “update race” and serializing the events. Whichever event reaches it first wins. In a distributed system, however, everyone is an equal, so there needs to be a mechanism that ensures the network reaches a consensus about the “history of the world.”
Consensus is the most difficult problem to solve for a truly distributed web supporting the whole range of applications we are using to today. It doesn’t only affect concurrent updates, but also any other updates that need to happen “in-place” — updates where the “one source of truth” is changing over time. This issue is particularly difficult for databases, but it also affects other key services, like the DNS. Registering a human name for a particular block ID or series of IDs in a decentralized way means everyone involved needs to agree about a name existing and having a particular meaning, otherwise two different users could see two different files under the same name. Content-based addressing solves this for machines (remember a name can only ever point to one particular piece of matching content), but not humans.
A few major strategies exist for dealing with distributed consensus. One of them involves selecting a relatively small “quorum” of managers with a mechanism for electing a “leader” who decides the truth (if you’re interested, look at the Paxos and Raft protocols). All changes then go through the manager. This is essentially a centralized system that can tolerate a loss of the central deciding entity or an interruption (a “partition”) in the network.
Another approach is a proof-of-work based system like Bitcoin blockchain, where consensus is ensured by making peers solve a puzzle in order to write an update (i.e. add a valid block to a Merkle chain). The puzzle is hard to solve but easy to check, and some additional rules exist to resolve a conflict if it still happens. Several other distributed blockchains use a proof-of-stake based consensus while reducing the energy demands required to solve a puzzle. If you’re interested, you can read about proof of stake in this whitepaper by BitFury.
Yet another approach for specific problems revolves around CRDTs — conflict-free replicated data types, which, for specific cases, don’t suffer from the consensus problem at all. The simplest example is an incrementing counter. If all the updates are just “add one,” as long as we can make sure each update is applied just once, the order doesn’t matter and the result will be the same.
There doesn’t seem to be a clear answer to this problem just yet and there may never be only one, but a whole lot of clever people are working on it, and there are already a lot of interesting solutions out there to pick from. You just need to select the particular trade-off you can afford. The trade-off generally lies in the scale of a swarm you’re aiming for and picking a property of the consensus you’re willing to let go of at least a little — availability or consistency (or, technically, network partitioning, but that seems difficult to avoid in a highly distributed system like the ones we’re talking about). Most applications seem to be able to favor availability over immediate consistency — as long as the state ends up being consistent in reasonable time.
Privacy in the web of public files
One obvious problem that needs addressing is privacy. How do we store content in the distributed swarm of peers without making everything public? If it’s enough to hide things, content addressed storage is a good choice, since in order to find something, you need to know the hash of its content (somewhat like private Gists on Github). So essentially we have three levels of privacy: public, hidden, and private. The answer to the third one, it seems, is in cryptography — strongly encrypting the stored content and sharing the key “out of band” (e.g. physically on paper, by touching two NFC devices, by scanning a QR code, etc.).
Relying on cryptography may sound risky at first (after all, hackers find vulnerabilities all the time), but it’s actually not that much worse than what we do today. In fact, it’s most likely better in practice. Companies and governments generally store sensitive data in ways that aren’t shareable with the public (including the individuals the data is about). Instead, it’s accessible only to an undisclosed number of people employed by the organizations holding the data and is protected, at best, by cryptography based methods anyway. More often than not, if you can gain access to the systems storing this data, you can have all of it.
But if we move instead to storing private data in a way that’s essentially public, we are forced to protect it (with strong encryption) so that it is no good to anyone who gains access to it. This idea is roughly the same as the one behind making security-related software open source so that anyone can look at it and find problems. Knowing how the security works shouldn’t help you break it.
An interesting property of this kind of access control is that once you’ve granted someone access to some data, they will have it forever for that particular revision of the data. You can always change the encryption key for future revisions, of course. This is also no worse than what we have today, even though it may not be obvious: Given access to some data, anyone can always make a private copy of it.
The interesting challenge in this area is coming up with a good system of establishing and verifying identities and sharing private data among a group of people that needs to change over time, e.g. a group of collaborators on a private git repository. It can definitely be done with some combination of private-key cryptography and rotating keys, but making the user experience smooth is likely going to be a challenge.
From the cloud to a … fog
Hard problems to solve notwithstanding, our migration away from the cloud will be quite an exciting future. First, on the technical front, we should get a fair number of improvements out of a peer-to-peer web. Content-addressable storage provides cryptographic verification of content itself without a trusted authority, hosted content is permanent (for as long as any humans are interested in it), and we should see fairly significant speed improvements, even at the edges in the developing world (or even on another planet!), far away from data centers.
At some point, even data centers may become a thing of the past. Consumer devices are getting so powerful and ubiquitous that computing power and storage (a computing “substrate”) is almost literally lying in the streets.
For businesses running web applications, this change should translate to significant cost savings and far fewer headaches building reliable digital products. Businesses will also be able to focus less on downtime risk mitigation and more on adding customer value, benefitting everyone. We are still going to be in need of cloud hosted servers, but they will only be one of many similar peers. We could also see heterogeneous applications, where not all the peers are the same — where there are consumer-facing peers and back office peers as part of the same application “swarm” and the difference in access is only in access level based on cryptography.
The other large benefit for both organizations and customers is in the treatment of customer data. When there’s no longer any need to centrally store huge amounts of customer information, there’s less risk of losing such data in bulk. Leaders in the software engineering community (like Joe Armstrong, creator of Erlang, whose talk from Strange Loop 2014 is worth a watch) have long argued that the design of the internet where customers send data to programs owned by businesses is backwards and that we should instead send programs to customers to execute on their privately held data that is never directly shared. Such a model seems much safer and doesn’t in any way prevent businesses from collecting useful customer metrics they need.
And nothing prevents a hybrid approach with some services being opaque and holding on to private data.
This type of application architecture seems a much more natural way to do large scale computing and software services — an Internet closer to the original idea of open information exchange, where anyone can easily publish content for everyone else and control over what can be published and accessed is exercised by consensus of the network’s users, not by private entities owning servers.