Access control for real-world data, generated by real-world humans

Case study: A network of on-the-ground Dataeum collectors continually generates valuable real-world data, which is sold on to location-based services (e.g. Yelp, Waze). A partner network of NuCypher proxies manages permissions, protects collector privacy, and enforces compensation.

How would you inform a fleet of autonomous vehicles that new construction work is blocking a major intersection? Relying entirely on lidar (real-time, laser-based ranging) might overload the sensors and processor. Better for the on-board computer to simply confirm the existence of road work, rather than working it out from scratch.

Fine, let’s ping a maps database to check for changes to the road layout. This resource must be updated very regularly, or it won’t help much. But we also know that a single organization would struggle to cover sufficient ground: you’d need hundreds of Google Street View Cars on permanent patrol in urban zones (plus a team to clean off the eggs thrown at them).

OK, let’s outsource the data collection to everyday drivers! Plenty of vehicles now come with built-in cameras and sensors. All willing contributors need to do is switch on the detection software when they drive. Everyone donates to a common resource: an unbeatable maps database that always contains the latest real-world data.

Hold on. Who owns the database? Who can see the raw data? After all, we’re sharing our timestamped, coordinate-by-coordinate movements. Shouldn’t those who gather greater quantities of new data earn more? For that matter, shouldn’t every contributor earn something?

Tokyo students trolling the Google Street View Car

The value of multi-contributor, continual, human-driven data streams is enormous. But, so are the hurdles. We need a framework that transparently takes care of permissions, privacy, and compensation — not to mention coordination, verification, and other logistical challenges.

Dataeum, a recent adopter of NuCypher technology, is one such framework through which this kind of crowdsourced, ‘real-world’ data may emerge. Initially, Dataeum will facilitate the collection of relatively simple datasets — data associated with places people visit (‘Points of Interest’) — department stores, sports centres, fruit markets, etc. Groups of dedicated ‘collectors’ visit Points of Interest and record contextual, often transitory attributes. For example: opening times, services offered, schedules, contact information, menus, product imagery — you get the idea. Competing data acquisition techniques, like those based on satellite imagery, may have transformed the way we understand our planet, but the argument is that they cannot see, access, or contextualize everything a human on the ground can. And attempts to leverage this uniquely human interpretation — for example, by farming out data entry to an application’s user base — have failed to sustain a compelling value proposition. How often do you write reviews on FourSquare?

Dataeum’s thesis is that a small crowd of roaming, purposeful and (relatively) tech-savvy individuals can collectively fulfil demand for real-world data — if they are carefully organized and reliably compensated. This data can be sold to consumer-facing mobile applications, civil authorities, market researchers, NGOs, and as explored above, the databases relied on by self-driving cars. Arguably, this model embodies a natural widening of cryptoeconomic schemes — utilized by many protocols to incentivize desirable miner behaviour — to the next concentric circle of economic participants. Real-world data collectors may not have the security chops of Ethereum shard validators, but they may represent the next wave of citizens in the decentralized world. Indeed, Dataeum has already proven that members of the public can be motivated to perform this service in a pre-crypto setting, with fiat rewards — despite the geographical constrictions and inflexibility with respect to economic design.

Challenges: Privacy, Availability & Licensing

To understand the relevance of NuCypher and decentralized access control, let’s dig into the requirements of a human-driven, real-world data acquisition system.

Data privacy

For contributions to be valid, collectors must physically visit every location they report on. Also in the interest of maximizing validity, collectors are incentivized to associate all the data they produce with a single user profile (or are outright prevented from creating multiple profiles). Profiles are pseudonymous, but the longer a single profile has existed, the greater the collector’s opportunity to develop a reputation for accurate data creation and be compensated appropriately. Together, these validity requirements give rise to a privacy challenge: participation in real-world data collection builds up a rich database of individual collectors’ timestamped whereabouts. Data of this type are certainly sensitive enough to warrant serious privacy measures that ensure real-world data are shared only with designated acquirers, and nobody else. Moreover, reliable privacy and security measures help prevent valuable data from being leaked to acquirers who have not paid licensing fees.

A Dataeum collector profile. Note that the quality score is tied to a single person.

Therefore, the transfer of data from collectors to acquirers would ideally be end-to-end encrypted — circumventing centralized service-providers (e.g. a key management system), and avoiding the risk of trusting a third-party custodian with sensitive, exploitable data.

Data availability

Given our first requirement for data transfers to be end-to-end encrypted, we might be tempted to use a straightforward form of public-key cryptography. Put simply, the data collector would encrypt their data contributions with each acquirer’s public key, so that only that acquirer’s private key could decrypt them. However, the manner in which real-world data tend to be utilized renders this approach impractical:

a. Data acquirers need to regularly update their databases with fresh data, to best reflect the continuously changing physical world.

b. The data created by collectors are most valuable in aggregate — i.e. collated with the contributions of other collectors. This is true both from a data verification perspective, and the more obvious reason that a single collector is unable to cover sufficient ground to be useful by themselves. From a trial in Barcelona, a city of 1.6 million inhabitants, Dataeum established that 1,000 active collectors, of which at least 10% were active 5x per week, were required to provide the minimum useful increment of aggregated real-world data.

If we leveraged basic public-key encryption, the process would begin with the data acquirer posting an access request (and/or bid) for a set of recently created data. Next, contributors to that dataset would asynchronously come online and approve access. Until all the collectors have responded (and/or decided to accept the bids), the dataset would be incomplete. At best, this delays updates to the map. For the most perishable data (e.g. a group of collectors reporting on a real-time traffic accident, sports event, or demonstration) this is entirely unacceptable. This workflow also implies a tiresome user experience for collectors, who would be repeatedly pinged to go online by every single acquirer who needs their individual contribution to the larger set.
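
To make the availability problem concrete, here is a minimal sketch of the per-collector approval flow described above. The function names and the response delays are hypothetical illustrations, not part of Dataeum or NuCypher; the only point is that the aggregate is gated by the slowest collector.

```python
def dataset_ready_after(response_delays_hours):
    """Hours until the aggregated dataset is complete.

    With per-collector approval, the dataset is only complete once the
    slowest collector has come online and approved access.
    """
    return max(response_delays_hours)


def fraction_available(response_delays_hours, deadline_hours):
    """Share of contributions the acquirer can read by a given deadline."""
    responded = sum(1 for d in response_delays_hours if d <= deadline_hours)
    return responded / len(response_delays_hours)


# Hypothetical delays (in hours) before four collectors next open the app.
delays = [0.5, 2.0, 8.0, 30.0]
```

With these made-up delays, an acquirer who needs the data within one hour can read a quarter of it, and the full set only arrives after more than a day.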

Hence, an ideal access control layer would be able to manage sharing on behalf of each collector, grant access based on specified conditions (e.g. the bid is above a predefined threshold — more on this later), and do so at a moment’s notice.

Data licensing

Given the cryptoeconomic grounding of this project, touched upon in the opening paragraphs, compensating data collectors in a transparent, efficient and trustless manner can be considered mission-critical. In general, this remuneration step involves a small group of data acquirers purchasing a license to access the collated contributions of a larger group of data collectors. Whatever the pricing model, which will vary from acquirer to acquirer (and even collector to collector), it further complicates attempts to use basic public-key cryptography, which is most effective for predictable, one-to-one sharing flows.

Another challenging, but highly desirable licensing requirement revolves around access revocation. For example, if the sharing relationship between a data acquirer and a set of collectors ends, the acquirer should no longer have access to any future data created by those collectors. The highly perishable nature of location-related data means reliably enforced revocation is a potent incentive for acquirers to renew their licenses.

Ideally, the access control mechanism would be programmable such that data is only shared when clear and immutable conditions are met, and revoked when they expire.
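
As a rough illustration of what "programmable" conditions could look like, the sketch below models a policy that grants access only when a bid meets a threshold and the policy has not yet expired. The class, its field names, and the token units are assumptions made for this example; they are not NuCypher's API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)  # frozen: conditions cannot be changed once set
class SharingConditions:
    min_bid: int          # smallest acceptable payment (hypothetical token units)
    expires_at: datetime  # from this moment on, access counts as revoked

    def grants_access(self, bid: int, now: datetime) -> bool:
        """Access requires a sufficient bid AND an unexpired policy."""
        return bid >= self.min_bid and now < self.expires_at


conditions = SharingConditions(min_bid=10, expires_at=datetime(2024, 6, 1))
```

Here `conditions.grants_access(12, datetime(2024, 5, 1))` is true, while the same bid on or after 1 June is refused, which is the revocation-on-expiry behaviour described above.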

Solution: decentralized access control

We’ve established that privacy, availability and licensing are interwoven, indispensable requirements. Concession of any one requirement compromises the collectors, devalues the output, or both. Yet, as we’ve also confirmed, neither centralized access control nor basic public key cryptography is up to the task.

Enter decentralized access control as a service — the NuCypher network. To understand how this approach simultaneously addresses each requirement, let’s walk through a hypothetical data acquisition workflow.

Note that the primary purpose of this workflow is to map NuCypher network capabilities to user objectives and requirements. Therefore, various architectural (e.g. data storage), security (e.g. key splitting) and authenticity (e.g. signing) notions are deliberately simplified or ignored. To get an accurate sense of those features, please read this code-oriented walkthrough.

Our hypothetical data acquirer is the developer of a sophisticated maps application. They seek highly contextual, human-collected data to enrich their databases and give their application a competitive edge. They promise users the most up-to-date, most useful information on Points of Interest in their city.

(1) Engaging the collector network with collection specifications
The data acquirer begins by issuing a licensing offer to the collector network. This specifies that they are willing to compensate collectors for additions/updates to a defined list of attributes, pertaining to a defined set of Points of Interest, situated within a defined group of urban zones. Attributes will be considered stale a fortnight after their last update — which means there’s an opportunity for a collector to re-update those attributes and earn fees accordingly.
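
The staleness rule can be sketched in a few lines. This assumes each attribute carries a last-updated timestamp and that an attribute counts as stale from exactly 14 days onward; the function and attribute names are hypothetical.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=14)  # "a fortnight", per the licensing offer


def stale_attributes(last_updated, now):
    """Return the names of attributes last updated at least a fortnight ago.

    `last_updated` maps attribute name -> datetime of its most recent update.
    """
    return sorted(
        name for name, ts in last_updated.items() if now - ts >= STALE_AFTER
    )


# Hypothetical attributes for one Point of Interest.
poi = {
    "opening_times": datetime(2024, 1, 10),
    "menu": datetime(2024, 1, 25),
    "contact": datetime(2024, 1, 17),
}
```

Evaluated on 31 January 2024, `opening_times` and `contact` are open for a paid re-update, while `menu` is still considered fresh.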

(2) Creating NuCypher-architected sharing policies & re-encryption keys
Next, local collectors are pinged with the acquirer’s offer. If they accept, each of their devices generates a sharing policy, architected to enable NuCypher network nodes to efficiently update the permissions associated with encrypted data they produce going forward. Sharing policies take the following inputs: the collector’s private key¹, the data acquirer’s public key, a directory to store the real-world data for the acquirer², and sharing conditions (in this case, only grant access to new data if payment has been transferred, plus a standard policy expiration date). In order for the NuCypher network to enforce each policy, a unique set of re-encryption keys is generated. These are issued to and held by participating nodes in the NuCypher network³.

(3) Collecting the real-world data
With a sharing policy established between each collector and the acquirer, and a subset of the NuCypher network holding onto corresponding re-encryption keys, the collectors set about visiting the specified Points of Interest and adding/updating required attributes. For example, most weeks Alice is the first collector to update the event schedule at three local community centres, and she also occasionally verifies other local collectors’ updates pertaining to an outdoor food market, gas station and a few bars. She provides verifications when her app notifies her that they’re required, and if she happens to be passing those locations. Alice, like most of the collectors gathering data for this acquirer, updates attributes sporadically, out of sync with her fellow collectors, and not every single week.

(4) Identifying new data and enforcing advance payments
The maps application developer needs access to fresh collector data at least every 24 hours, and more regularly during large events (e.g. a city marathon). The first step of this retrieval process is for the acquirer’s server to check which of the live sharing policies have had new data added since the previous access request⁴. Next, the acquirer pays the owners of those policies, then pings the NuCypher network with a request for access to their data. The NuCypher network autonomously verifies that the correct sums have been transferred into the correct wallets.
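
A simplified sketch of this step, assuming each policy exposes the time data was last added, an owner wallet, and a fee. All names are hypothetical, and the `transfers` dictionary stands in for whatever on-chain lookup the network would actually perform.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PolicyStatus:
    policy_id: str
    owner_wallet: str
    fee: int                   # amount owed per access (hypothetical units)
    last_data_added: datetime  # reported via the collector app's notifications


def policies_with_new_data(statuses, last_access):
    """Policies that received data after the acquirer's previous request."""
    return [s for s in statuses if s.last_data_added > last_access]


def payments_verified(due, transfers):
    """Check that every owner wallet received at least what it is owed.

    `transfers` maps wallet -> total amount received from the acquirer.
    """
    owed = {}
    for status in due:
        owed[status.owner_wallet] = owed.get(status.owner_wallet, 0) + status.fee
    return all(transfers.get(wallet, 0) >= amount for wallet, amount in owed.items())


statuses = [
    PolicyStatus("alice-poi", "0xA", 5, datetime(2024, 3, 2, 9)),
    PolicyStatus("bob-poi", "0xB", 7, datetime(2024, 2, 27, 18)),
]
due = policies_with_new_data(statuses, last_access=datetime(2024, 3, 1))
```

Only Alice's policy has fresh data here, so access should be granted once wallet `0xA` has received at least 5 units; Bob is neither paid nor queried.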

(5) Encrypting the data for re-encryption
With payment confirmed, the data is encrypted⁵. This generates a ciphertext associated with the new data, and unique to the sharing policy⁶. In other words, the output ciphertext (which, when decrypted, becomes a conduit to the plaintext data) is constructed to strictly follow the rules of the sharing policy: it cannot be used to grant access to any recipient other than the corresponding acquirer.

Note that this encryption process can occur entirely independently of the data owner.

(6) Granting access to the data acquirer
Next, the NuCypher network fulfils its most important responsibility: securely updating the permissions associated with each ciphertext, one for each set of new collector data. The nodes do this by operating on the ciphertexts with the re-encryption keys they already hold, transforming each one into a form decryptable by the data acquirer.

(7) Accessing the real-world data
The set of transformed ciphertexts is delivered to the data acquirer, who decrypts them with their private key (corresponding with the public key added to the sharing policy in step 2). The acquirer then uses the results of this decryption⁷ to access the sets of new real-world data and update their production databases. Users of the maps application are thrilled.
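
Steps 5 to 7 can be illustrated with a toy proxy re-encryption sketch. To be clear about the assumptions: this is the classic ElGamal-based scheme of Blaze, Bleumer and Strauss (1998), not the scheme NuCypher actually uses; the group parameters are far too small to be secure; the "symmetric cipher" is a throwaway SHA-256 keystream; and, unlike the workflow above (where the policy takes the acquirer's public key), this toy derives the re-encryption key from both private keys. Every function name is hypothetical. It requires Python 3.8+ for modular inverses via `pow`.

```python
import hashlib
import secrets

# Toy parameters: P = 2Q + 1 with P, Q prime; G generates the order-Q subgroup.
P, Q, G = 2039, 1019, 4


def keygen():
    sk = secrets.randbelow(Q - 1) + 1          # private key in [1, Q-1]
    return sk, pow(G, sk, P)                   # (private, public)


def encapsulate(owner_pk):
    """Pick random symmetric-key material and encrypt it to the data owner."""
    key_elem = pow(G, secrets.randbelow(Q - 1) + 1, P)
    r = secrets.randbelow(Q - 1) + 1
    capsule = (key_elem * pow(G, r, P) % P, pow(owner_pk, r, P))
    return key_elem, capsule


def rekey(owner_sk, recipient_sk):
    """Re-encryption key b/a mod Q (toy: needs both private keys)."""
    return recipient_sk * pow(owner_sk, -1, Q) % Q


def reencrypt(capsule, rk):
    """What a proxy node does: transform the capsule without decrypting it."""
    c1, c2 = capsule
    return c1, pow(c2, rk, P)


def decapsulate(capsule, sk):
    """Recover the symmetric-key material with the matching private key."""
    c1, c2 = capsule
    g_r = pow(c2, pow(sk, -1, Q), P)
    return c1 * pow(g_r, -1, P) % P


def xor_stream(key_elem, data):
    """Toy symmetric cipher (NOT secure): XOR with a SHA-256 keystream."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(f"{key_elem}:{counter}".encode()).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))
```

In this sketch the collector's data is encrypted once with `xor_stream`, the proxy calls `reencrypt` on the small capsule only, and the acquirer calls `decapsulate` followed by `xor_stream` to read the data. The proxy never handles the plaintext or the symmetric key, which mirrors the division of labour in the workflow.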

Concluding thoughts

At first glance, this workflow might appear slow or expensive. However, none of the ongoing access control steps requires meaningful computation; they generally comprise simple network requests. The exceptions are the initial creation of the re-encryption keys (step 2) and the payment verification performed by the NuCypher network (step 4). The former tends to be a few multiples slower than network speed, and the latter depends on the blockchain client used for the transaction (e.g. Ethereum). Continually re-encrypting new data is also very economical, as the cost of the NuCypher network is primarily dependent on the duration of the sharing policy, rather than the number of access requests.

We have also managed to achieve all our initial objectives:

  1. Preserving privacy. No entity other than the designated recipient is trusted with sensitive/valuable data.
  2. Maximizing data utility. Every data point is available to acquirers as soon as they need it, irrespective of whether the collector is online.
  3. Enforcing compensation. Access is not granted to the acquirer until they have verifiably transferred funds to each payee.

It’s also worth mentioning that this workflow involved a single data acquirer. In practice, there will almost certainly be many organizations bidding for real-world data, increasing the complexity of sharing flows and therefore the reliance on a scalable access control system like the NuCypher network.

Excitingly, though this workflow is fairly specific, its three objectives are not. We believe that decentralized access control can be combined with technologies such as mobile interfaces and cryptoeconomics, enabling a wide range of crowdsourcing use cases beyond maps data. These use cases have the potential to widen the adoption of decentralized technology, bestow end-users with applications of unprecedented utility, and enable more economically inclusive models than the blockchain world has offered thus far.


[1] The generation of the sharing policy and re-encryption key occurs in the client of each collector’s device/application, so their private key is not transferred away from their device.
[2] This ‘directory’ has a tag functionality: associating an unlimited set of data with a specific sharing policy (and therefore with a recipient, or recipients). Tagging data in this way does not automatically share it, but it does mean that the data can be shared with those recipients, at some point in the future.
[3] The NuCypher network has precisely zero access to the underlying real-world data, or any associated metadata. The only power granted to NuCypher network nodes is to update permissions associated with a given ciphertext to the owner(s) of the public key(s) specified at policy creation.
[4] This process involves each collector’s application sending a notification to the acquirer whenever new data is added (effectively a metadata update), but does not give the acquirer access to the underlying real-world data itself. This communication occurs independently of the NuCypher network. 
[5] The encryption step can be performed by any untrusted third party. They are given the power to write data to a sharing policy but are not given read permissions. This third party could be an always-online server run by the developer of the collector-facing application, which would allow the data owner to be offline when the data is encrypted for transfer. For more information on this NuCypher capability, please read the paragraph entitled Encrypting and sharing on Alice’s behalf with DataSource in this walkthrough.
[6] This is achieved by taking the policy’s public key as an input at the encryption step. This gives the NuCypher network the power to manage permissions associated with the new data, exactly as they have been able to for all previous data added to that same policy.
[7] The ciphertext decrypts to a symmetric key, which in turn can be used by the acquirer’s server to decrypt the underlying real-world data.