Nearly all of my casual conversations about decentralized network architecture for the past three months or so have quickly come to be (at least partly) a discussion about node discovery — usually around the 6-minute mark.

Node discovery has become a big topic again.

There are two natural (and in this case, I think, correct) conclusions to be drawn from the resurgence in interest in this topic:

  • People who are developing decentralized applications think that node discovery is important.
  • There is no single general purpose node discovery methodology that enjoys widespread confidence.

When I (during these early minutes of conversation) mention that we at NuCypher have rolled our own node discovery, my conversational partner(s) are often interested. But sometimes they are quite taken aback — wondering why we don’t use devp2p or something similar. This very thing happened at devcon4 this year when I described our “Learning Loop” to my good friend Piper Merriam, whose opinion on this topic (and, for that matter, on the future of peaceful coexistence of the human condition) is highly meaningful in my estimation.

Isn’t devp2p already ideal?

Maybe you thought that point #2 above is a dig of sorts at {lib|dev}p2p — it’s not. libp2p generally (and devp2p in particular) is a wonderful movement with great prospects for the very specific problem it seeks to solve. This post is not meant to deride this tech in the least!

But the problem that devp2p seeks to solve is, if I’m understanding the ecosystem correctly, properly described as having (among others) the following properties:

  • A high number of nodes (for the purposes of this post, I’ll use 20,000 as the distinction between a high and low number)
  • A difficult-to-predict pace of rollout for nodes
  • An externalized solution for Byzantine fault tolerance, like IPFS does with Merkle DAGs and Ethereum does with Proof-of-Work (and as of Serenity, Proof-of-Stake)
  • An externalized solution for crash fault tolerance, like IPFS does with distributed pinning and Ethereum does with Full Nodes (and as of Serenity, sharding)
  • (In most cases…) the nodes in question, and the applications for which they are suitable, are characterized at least in part by high bandwidth and high latency.

This is a great description for many of the coolest projects in the world, including of course IPFS. When I first heard the phrase “inter-planetary file system”, I *instantly* understood the problem space. It’s a perfect name for a phenomenal project.

So, we’ll call the above phenomena “inter-planetary networks.”

As you have perhaps already predicted, my assertion is that libp2p (and thus devp2p) is not a *general purpose* solution, but is specific to inter-planetary networks.

The underlying protocol of devp2p is essentially Kademlia (a DHT).
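
For readers who haven’t worked with Kademlia, here is a minimal sketch of the idea at its core: node IDs are large integers, and the "distance" between two nodes is the bitwise XOR of their IDs. The function names and the SHA-1 derivation of IDs are my own illustrative assumptions, not the API of devp2p or of (small-k) kademlia.

```python
import hashlib


def node_id(label: str) -> int:
    """Derive a 160-bit ID from an arbitrary label (illustrative assumption)."""
    return int.from_bytes(hashlib.sha1(label.encode()).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia's metric: the distance between two IDs is their bitwise XOR."""
    return a ^ b


def bucket_index(a: int, b: int) -> int:
    """Position of the highest bit in which a and b differ; -1 means a == b.

    Kademlia-style routing tables group peers into buckets by this index.
    """
    return xor_distance(a, b).bit_length() - 1


def closest(target: int, known: list[int], k: int = 3) -> list[int]:
    """Return the k known IDs nearest to target under the XOR metric."""
    return sorted(known, key=lambda n: xor_distance(n, target))[:k]
```

A lookup repeatedly asks the closest known peers for even closer ones, so each node only ever holds a partial view of the network. That is what lets the scheme scale to very large networks, and also where much of its complexity comes from.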

When we first broke ground on our proxy re-encryption network, our imaginations plugged Kademlia into nearly every NuCypher need. When we ran into head-scratching ordeals with this, we started opening Pull Requests to Brian Muller’s popular Python implementation of Kademlia, called (small-k) kademlia — and in so doing, had the wonderful opportunity to repeatedly interact with Muller, who is a very sporting open source maintainer and merged nearly all of our PRs.

The decision to stay on one planet

After a few rounds of iteratively working with (small-k) kademlia (and a brief implementation of devp2p, only to realize that it was exactly the same), an at-first depressing realization dawned on us:

Our network, like many emerging decentralized technologies, is not inter-planetary in nature.

I say “at-first depressing” because we were not only deeply inspired by IPFS (both its topological metaphor and its feature set) but we also viewed (and continue to view) NuCypher as an ideal match for IPFS, with both of them together combining to make a decentralized filesystem with access management and privacy features beyond those available on any centralized system.

(We all want to be inter-planetary, right?)

Building decentralized access management has taught us something (and this is the essence of this post, at least to me): inter-planetary federations (like IPFS) need intra-planetary support systems in order to be truly exciting (and truly decentralized).

(Let me make plain that “planetary” — in either the “inter” or “intra” metaphor — does not literally imply that the nodes do (or even can) work on different planets. It’s just a way of thinking about the styles of robustness and efficiency that make sense for these two paradigms.)

The intra-planetary paradigm

Since I have given some of the key features of “inter-planetary” above, I’ll now give some properties of our network, which I think make it “intra-planetary”:

  • A small number of nodes (probably fewer than 5,000, maybe far fewer)
  • A reasonably predictable rollout, both for node operators and application developers
  • Byzantine- and crash-fault tolerance within the protocol, using blockchain data but not blockchain nodes as transport
  • Node discovery as an integral, rather than bolted-on, mechanism
  • Low bandwidth required; low latency desired

The disadvantages of this paradigm are obvious: something like IPFS can’t easily be built atop this style of node discovery.

However, the advantages are substantial:

  • Nodes will likely learn about every node about which they need to know on the very first iteration (ie, at bootstrapping time).
  • The complexities introduced by sharding, XOR distance, and overlay topology are gone.
  • “Cheating” nodes (ie, those claiming to be someone they’re not) are never propagated in the first place.
  • Nodes can share with one another (and use as the basis of agreement) the “full picture” of the network, and know to stop trying to learn once they have it.
  • Everything is TLS, with no need for a CA or other authority: nobody can claim to be running a node without a certificate signed by the key behind her wallet address.

Moreover, intra-planetary networks are robust against problems elsewhere in the “galaxy.” Using devp2p and Ethereum-based node discovery for an intra-planetary network feels like calling your mother via a cell tower on Mars.

Once we realized this, it quickly became obvious that we wanted blockchain only for the Byzantine fault tolerance aspect of our node discovery, and that we did not want to rely on blockchain nodes for transport.
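
As one illustration of the “full picture” advantage listed above, here is a hedged sketch of how two nodes might cheaply check whether they already agree on the whole network. The fingerprint scheme (a SHA-256 digest over the sorted wallet addresses) and both function names are hypothetical; this is not a description of what our code actually does.

```python
import hashlib
from typing import Iterable


def fleet_fingerprint(wallet_addresses: Iterable[str]) -> str:
    """Hash the sorted, de-duplicated set of known node addresses.

    Two nodes that know exactly the same set produce the same digest,
    regardless of the order in which they learned about each node.
    """
    digest = hashlib.sha256()
    for address in sorted({a.lower() for a in wallet_addresses}):
        digest.update(address.encode())
        digest.update(b"\n")  # separator so adjacent addresses cannot run together
    return digest.hexdigest()


def has_full_picture(my_addresses: Iterable[str], teacher_fingerprint: str) -> bool:
    """True when our view of the network matches the one the teacher reports."""
    return fleet_fingerprint(my_addresses) == teacher_fingerprint
```

With only a few thousand nodes, comparing one short digest per round is enough to know when there is nothing left to learn. A partial-view design like a DHT has no equivalent stopping condition.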

What about existing intra-planetary solutions?

I’m not aware of any prior protocol or implementation expressly angling itself as “intra-planetary”, but there are some out there that probably reasonably qualify — PING, SWIM, Smudge, Gossip — and I’m sure there are yet others.

At this point, maybe I have convinced you that an intra-planetary solution is advantageous for a certain problem space. And maybe I have convinced you that Kademlia / devp2p, despite their awesomeness, don’t squarely address this need.

Of these existing solutions, we came closest to choosing Gossip, so I’ll just outline the reasons we didn’t do that:

  • We’d still need to roll our own verification scheme to integrate token-based Byzantine fault tolerance. There’s no obvious way to stop propagation using an integral token-based scheme. This is all the reason we need.
  • The Python implementation is wanting; it has only 2 commits and, while seemingly well thought-out, uses the socket and multiprocessing libraries directly instead of providing good places to hook in.
  • Since node discovery is a core feature of our network, we don’t want to force people to run anything other than a NuCypher process, except perhaps a blockchain node (ie, we don’t want to run Apache Gossip). There is no existing gossip solution that doesn’t require us to add substantially to our spin-up heft.

Despite these concerns, it’s still possible that The Learning Loop might eventually fully implement Gossip.

What we settled on

So far, we’ve only called our solution, which is (for the moment) tightly coupled to the NuCypher codebase, “The Learning Loop.”

It’s built on Twisted (which I’m not shy about describing as the highest quality open source project I’ve used) and hendrix, whose core developers include myself and fellow NuCypherino Kieran Prasch.

We’ve only put about 4 solid days of work into it (and a few more on touchups), so it’s really more of a prototype than a finalized solution. Nevertheless, it’s powering our current testnet, and seems to be kicking ass.

tl;dr — here’s how node discovery works at NuCypher:

  • A smart contract lists some “seed nodes”: triples consisting of a wallet address, a host, and a port. We call this a “seeder” contract.
  • To bootstrap, a node or user (a “Learner” in our parlance) chooses a known node, treating it as a “Teacher”. It can be, but does not need to be, one of the nodes listed in the seeder contract.
  • The Learner first checks to ensure that the Teacher is staking enough NU to be a NuCypher node in the first place.
  • The Learner connects to the Teacher and retrieves its TLS certificate.
  • Using this (so far unverified) cert, the Learner makes a TLS connection to the Teacher and asks for validation metadata.
  • The Teacher provides a signature showing that it has signed its TLS certificate with its wallet keypair. Here’s our token-based integral Byzantine fault tolerance. It also provides a list of all known nodes on the network, including TLS certificates signed by wallet keypairs for each node.
  • The Learner continues to check for new nodes every 5 seconds (with some optimizations to prevent having to download the entire list again).
  • After 10 rounds go by with no new nodes learned, the loop slows down so that learning only happens every 90 seconds (and speeds up again if a new node is discovered).
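
The steps above can be sketched roughly as follows. This is a simplified model, not the NuCypher implementation: the class and field names are hypothetical, the stake lookup and signature check are passed in as plain callables rather than real blockchain or cryptography calls, and the network transport (Twisted, TLS) is left out entirely. Only the 5-second, 90-second, and 10-round figures come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable

FAST_INTERVAL = 5        # seconds between rounds while still discovering nodes
SLOW_INTERVAL = 90       # seconds between rounds once the network looks stable
QUIET_ROUNDS_LIMIT = 10  # rounds with no new nodes before slowing down


@dataclass(frozen=True)
class NodeRecord:
    wallet_address: str
    host: str
    port: int
    certificate: bytes  # TLS certificate presented by the node
    signature: bytes    # wallet signature over that certificate


@dataclass
class Learner:
    is_staking: Callable[[str], bool]                 # hypothetical stake lookup
    signature_is_valid: Callable[[NodeRecord], bool]  # hypothetical signature check
    known_nodes: Dict[str, NodeRecord] = field(default_factory=dict)
    quiet_rounds: int = 0

    def accept(self, node: NodeRecord) -> bool:
        """Remember a node only if it is new, staking, and wallet-signed."""
        if node.wallet_address in self.known_nodes:
            return False
        if not self.is_staking(node.wallet_address):
            return False
        if not self.signature_is_valid(node):
            return False
        self.known_nodes[node.wallet_address] = node
        return True

    def learn_round(self, teacher: NodeRecord, reported: Iterable[NodeRecord]) -> int:
        """Process one round; return the number of seconds to wait before the next."""
        teacher_ok = self.is_staking(teacher.wallet_address) and self.signature_is_valid(teacher)
        if teacher_ok:
            new_nodes = sum(self.accept(node) for node in [teacher, *reported])
        else:
            new_nodes = 0  # an unverified teacher teaches us nothing
        self.quiet_rounds = 0 if new_nodes else self.quiet_rounds + 1
        if self.quiet_rounds >= QUIET_ROUNDS_LIMIT:
            return SLOW_INTERVAL
        return FAST_INTERVAL
```

In this sketch, every node a Teacher reports is checked with the same two tests as the Teacher itself, so a node that fails either test is never stored and therefore never passed along. Treating a round with an unverifiable Teacher as a quiet round is a choice I made for the sketch; a real loop would more likely just pick a different Teacher.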

We’re looking for feedback. Two main questions:

  • Are we crazy for thinking that this is an optimal solution for us and similarly positioned projects? We’re certainly open to going back to devp2p. Is it the case, for reasons that elude us at the moment, that this is the thing to do?
  • Can your project make use of something like The Learning Loop? Is it worth it for us to take the time to separate it, formalize it, and release it separately from NuCypher?

Development discussion:
GitHub Repo: