Privately Querying Privacy

13 May 2022 - Von Dr. Anna Hildebrandt

One crucial yet often neglected challenge when estimating a user’s privacy within an online database is shielding the privacy of both parties involved. Traditional approaches like differential privacy, k-anonymity, or its successors l-diversity and t-closeness aim at user privacy, but in a three-party scenario, i.e.

  • the data owner (say Alice), whose data is to be protected
  • the data user (say Bob), who poses queries to the database where Alice’s data is stored
  • the database owner, who helps securing Alice’s data against Bob’s curiosity by applying privacy techniques

While such approaches do help to protect Alice’s information against queries from Bob (or from a malicious attacker, say Eve, who eavesdrops on the communication channel), they require Alice to fully trust the database owner, and hence still provide Bob with the full wealth of Alice’s information.

But what if Alice wants to be protected against Bob’s curiosity, too? In this scenario, data user and database owner are the same person, again called Bob. Letting Alice make informed and independent decisions about sharing her data in this situation typically leads to a chicken-or-egg problem of mutual trust: either Alice trusts Bob with her data for the purpose of computing her, say, k-anonymity before she knows if she is willing to share her data with Bob (as this decision depends on the result of this computation), or Bob needs to trust Alice with internals of his database to let her compute her k-anonymity herself (obviously, this latter approach is unlikely to ever happen in practice). In the course of the DaSoMan-project, we have developed the PQP-protocol that resolves the problem for many practically relevant choices of privacy metrics.

To understand the idea behind PQP, imagine a platform where Alice can determine her privacy before sharing it with Bob, and Bob can answer her query without leaking information about his database content (apart from the value of the privacy metric, i.e., the result of the anonymity query, which Alice learns in the end). Obviously, a trusted third-party, acting as a man-in-the-middle, could perform this function. Here, however, both Alice and Bob would have to disclose their data to that party. The PQP protocol achieves the same goal without the need for a trusted third party.

The core of our protocol is the re-formulation of Alice’s privacy query (“is my data in your database better hidden than a threshold t?”, with t being similar to the ​ɛ​ in differential privacy), in a private way using established cryptographic primitives like oblivious transfer (OT), secure multiparty computation (SMC) and private set intersection (PSI).

Private z-score computation

Let’s start with a basic query, where the user Alice wants to check the z-Score of her age x within the database. This requires the knowledge of the mean μ and standard deviation σ from the age distribution in the database:

x = z(x) * σ + μ

But σ and μ are sensitive information for the database owner.
To protect them, we reformulate Alice’s query to “is the z-Score of my data larger than a threshold t”, i.e. z(x) > t.

Fortunately, such a setting, where the evaluation of a joined function with input values of both parties, yet the opponent party must not see the other values, can be addressed by applying an SMC protocol. By carefully designing the SMC approach, we can prevent Alice from learning μ, Bob from learning σ, and both from learning z(x). That even prevents the return of z(x). Problem solved. Well, nearly solved, as we still need a stable business ready SMC implementation.

Simple z-scores, however, only help in situations where we have univariate, and ideally unimodal, distributions to consider.

Realistic user data is typically quite complex, like age, gender, or zip code. One way to handle these diverse kinds of data is to privately query multidimensional histograms with a mixture of floating-point numbers, integer quantities, and categorical features. Instead of our z-Score question, the number of indistinguishable other users with the same or very close features in the database, i.e. the anonymity set, has to be computed privately.

Thus, when interested in anonymity set sizes, we have to look into histogram bucket sizes privately. Let’s break this down into three parts: first, we have to identify the relevant bucket, then to evaluate its size, and finally to pose the privacy threshold question.

Find the bucket - user privacy

The privacy of the user Alice can be easily protected if the database histogram bucket limits are available to Alice for each feature. Then, she can compute the bucket-indices for each feature relevant to her by herself. If she additionally knew the sizes of all database buckets, she could perform the thresholding test for the relevant bucket by herself. But sharing the dataset distribution violates Bob’s privacy concern.

Threshold flags in a cyclic shifted bucket order – database privacy

Similar to the above trick of reformulating the z-Score query, instead of providing the data distribution to the user, Alice now defines her threshold and is given the buckets thresholding flags. Thus, we shift the bucket-thresholding computation back to the database privacy domain.

Another sensitive property of the database is the index of a bucket, e.g. a small integer as index and its corresponding bucket size within the database allow conclusions about the database quantiles. We address that problem by applying a random cyclic index shift of all bucket indices. So basically, Alice needs to privately query 1 out of n choices, which leads us to Oblivious Transfer (OT), which allows a user to query one of multiple pieces of data from the database owner with the guarantee that the database owner does not know which piece was actually important to the user. Given these shortcuts, via an 1-out-of-n OT Alice can query the fitting bucket flag, by applying the bucket-limits-to-index trick.

Such oblivious transfers, however, require the transfer of large amounts of data. In our case, we would have to transfer encrypted versions of each threshold flag for each bucket, which grows exponentially with the number of features considered in the database. We solve the problem by relying on a private set inclusion (PSI) algorithm instead: we first let Bob compute all bucket indices such that the corresponding bucket would fulfill the privacy requirements. Alice then computes her own bucket index and intersects this set with the set of private buckets of Bob. The PSI algorithm guarantees that this neither leaks Bob’s set of private buckets, nor Alice’s bucket index. At the same time, the amount of data that needs to be transferred is much smaller than for the OT-approach described above.

Again, this solves the problem theoretically, but we still need a stable implementation. In our paper Privately Querying Privacy, we used the established ABY-library to implement the garbled circuits of the SMC protocols. On Intel Core i7-8650 CPUs at 1.9GHz with a Gb LAN connection, z-score evaluation took approximately 0.6s, while querying the privacy in a histogram with 100 dimensions and 100 buckets per dimension (a huge histogram of 100^100 buckets) took about 1.6s. In both cases, the code was not optimized for performance, and we believe that those running times can be significantly reduced by careful engineering.

Picture credits:
Adobe Stock

Dr. Anna Hildebrandt

Anna ist Project Manager und Data Scientist bei Mondata. Nach ihrer Informatik-Promotion im Bereich Machine Learning und Bioinformatik entwickelt Sie bei Mondata Lösungen zur Vereinbarkeit maschineller Lernverfahren mit hohen Datenschutzansprüchen.