Notes on Privacy through Anonymity

Anonymity: A Tool in the Privacy Kit

Every system has actors who perform particular actions. Anonymity seeks to "mask" the perpetrator of a given action within a "crowd" of actors. Anonymity stands in contrast to more rigorous definitions of privacy such as Information-Theoretic or Computational Privacy: while weaker than either, it affords simpler and more efficient solutions.

Framework for Anonymity Problems

The following roadmap can be used for reasoning about Anonymity-based solutions.

  1. Identify the domain D.

    Identify the actors, actions and entities in D. For example, in a Communication Channel domain, {senders, recipients, routers} are actors, {generate, receive, forward} are actions the actors can perform, and {messages, addresses, key pairs} are entities.

  2. Define the desired privacy P.

    Decide the particular actors whose actions are to be anonymized. For example, in a Communication Channel domain, we might wish to enable Sender Anonymity: a sender should be able to send a message to a recipient without other actors in the domain knowing the sender's identity.

  3. Propose a scheme S.

    Think.

  4. Specify the threat model A relevant to D.

    Identify the adversaries that are relevant in D and their abilities. The abilities may depend on the roles that adversaries can assume. For example, a network administrator can only look at messages on the wire, a database administrator can only look at data in tables, an insurance company can only query for data, and so on. Apart from such role-specific actions, adversaries can also be assumed to have "traits" as shown in the table below.

    Honest But Curious        :=  Passive,  Local,   Independent
    Deliberate And Malicious  :=  Active,   Global,  Colluding

  5. Reason about degree of anonymity assured by S in the face of A.

    The Crowds paper presents a probabilistic scale to quantify the degree of anonymity observed. The important probability ranges are shown below, with a special "Beyond Suspicion" label associated with probability 1/n, where n is the number of actors in D. A small sketch that classifies a probability on this scale follows the list.

    Probability p that the actor performed the action    Degree of anonymity
    ----------------------------------------------------------------------
    p = 0.0                                               Absolute Privacy
    p = 1/n                                               Beyond Suspicion
    0.0 < p < 0.5                                         Probable Innocence
    0.5 <= p < 1.0                                        Possible Innocence
    p = 1.0                                               Provable Exposure
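
A minimal sketch of this classification, assuming the boundary conventions shown in the table above (the cutoffs at 0.0, 0.5 and 1.0 come from the scale; the 1/n point for "Beyond Suspicion" comes from the note itself):

    def degree_of_anonymity(p: float, n: int) -> str:
        """Classify the adversary's probability p that a given actor
        performed the action, among n actors, on the scale above."""
        if p == 0.0:
            return "Absolute Privacy"
        if abs(p - 1.0 / n) < 1e-9:
            return "Beyond Suspicion"
        if p < 0.5:
            return "Probable Innocence"
        if p < 1.0:
            return "Possible Innocence"
        return "Provable Exposure"

    # In a crowd of 10 actors, probability 0.1 is beyond suspicion,
    # while 0.4 still gives only probable innocence.
    print(degree_of_anonymity(0.1, 10), "/", degree_of_anonymity(0.4, 10))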

Example 1: Anonymous Communication Channel

  1. D= Communication Channel.
  2. P= Sender Anonymity.
  3. S= Remailers {Type I, Type II, Type III}. Type I uses a trusted third party, Type II uses onion routing, Type III uses cover traffic. (A toy sketch of the layered-encryption idea behind onion routing follows this list.)
  4. Adversary= Honest but curious; Deliberate and malicious.
  5. Degree= Type I < Type II < Type III.
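
The following is only a toy sketch of the layered-encryption idea behind onion routing; the hop names, keys, and the throwaway XOR/SHA-256 "cipher" are hypothetical placeholders, and this is not how any real remailer encrypts. The point is the structure: each hop peels exactly one layer and learns only its successor.

    import hashlib
    import json

    def keystream_xor(key: bytes, data: bytes) -> bytes:
        # Throwaway symmetric "cipher": XOR data with a SHA-256-derived
        # keystream. Encryption and decryption are the same operation.
        out = bytearray()
        for i, start in enumerate(range(0, len(data), 32)):
            block = hashlib.sha256(key + i.to_bytes(4, "big")).digest()
            out.extend(b ^ k for b, k in zip(data[start:start + 32], block))
        return bytes(out)

    def wrap(message: bytes, route, recipient: str) -> bytes:
        """Build the onion: each layer names the next hop and hides the rest."""
        next_hop, blob = recipient, message
        for hop, key in reversed(route):
            layer = json.dumps({"next": next_hop, "body": blob.hex()}).encode()
            blob, next_hop = keystream_xor(key, layer), hop
        return blob                      # handed to the first hop on the route

    def peel(blob: bytes, key: bytes):
        """One hop strips its layer: it sees only the next hop and an opaque blob."""
        layer = json.loads(keystream_xor(key, blob))
        return layer["next"], bytes.fromhex(layer["body"])

    route = [("hopA", b"key-A"), ("hopB", b"key-B"), ("hopC", b"key-C")]
    blob = wrap(b"hello recipient", route, recipient="alice")
    for _, key in route:                 # each hop in turn peels one layer
        next_hop, blob = peel(blob, key)
    print(next_hop, blob)                # -> alice b'hello recipient'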

Example 2: Community Interactions [Huberman, Hogg and Franklin]

A moderator poses a question Q to a set of participants. The question Q has an enumerable set of answers (e.g., Yes/No). Each participant has an answer (opinion) to Q. Devise a scheme that allows the moderator to send a message m to only those participants who answered Q in a particular way (e.g., all participants who voted Yes).

The privacy desired is "Opinion Anonymity": it should not be possible for an adversary (the moderator or other participants) to learn that a particular participant answered a certain way.

The scheme relies on the assumption that the Discrete Logarithm problem is hard, and is inspired by Oblivious Transfer protocols.
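
The following is not the Huberman-Hogg-Franklin protocol, only a toy sketch of the flavour of discrete-log-based schemes for this problem: every participant publishes a single group element; a Yes-voter publishes one whose discrete log it knows, while everyone else publishes a random element and keeps no trapdoor. The two kinds of published values look alike, yet when the moderator ElGamal-encrypts m under each published element, only the Yes-voters can decrypt. All parameters and names are hypothetical, and the group is deliberately tiny and insecure.

    import secrets

    P = 2**127 - 1     # hypothetical toy prime (a Mersenne prime), not a secure choice
    G = 3              # hypothetical base

    def publish(opinion: str):
        """Yes -> h = G^x mod P with a remembered secret x.
        No  -> a random element of Z_P*, with no trapdoor kept.
        Either way, the published value is just 'some group element'."""
        if opinion == "yes":
            x = secrets.randbelow(P - 2) + 1
            return pow(G, x, P), x
        return secrets.randbelow(P - 2) + 2, None

    def encrypt(h: int, m: int):
        # Moderator ElGamal-encrypts m under a published element h.
        r = secrets.randbelow(P - 2) + 1
        return pow(G, r, P), (m * pow(h, r, P)) % P

    def decrypt(x: int, c1: int, c2: int) -> int:
        # Only someone who knows x with h = G^x can strip the blinding factor.
        return (c2 * pow(c1, -x, P)) % P

    message = int.from_bytes(b"meet at noon", "big")            # must stay below P
    votes = {"ann": "yes", "bob": "no", "cal": "yes"}           # hypothetical opinions
    published = {name: publish(v) for name, v in votes.items()}

    # The moderator sees only the published elements and encrypts m to all of them.
    mailbox = {name: encrypt(h, message) for name, (h, _) in published.items()}

    for name, (_, x) in published.items():
        if x is not None:                                       # only Yes-voters recover m
            c1, c2 = mailbox[name]
            print(name, decrypt(x, c1, c2).to_bytes(12, "big"))

Note that in this toy the moderator encrypts to everyone, so it never learns who can actually read m; the full protocol has to do more than this.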

Example 3: Exporting Medical Records [Latanya Sweeney]

A hospital has a relation R(A1, A2, ..., An) in which each tuple corresponds to a person. A subset of the attributes (A1, A2, ..., Ai) uniquely identifies the corresponding person. Devise a scheme that allows the hospital to export R in as precise a form as possible, without compromising the privacy of the persons ("Record Anonymity").

For example, R might be HealthRecords(SSN, Sex, YearOfBirth, Zip, Race, Symptoms) in which both {SSN} and {Sex, YearOfBirth, Zip, Race} uniquely identify a person.

The scheme relies on generalizing values of (A1,...,Ai) for each tuple such that there are at least k-1 other tuples with identical values of the i attributes.
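
A minimal sketch of this generalization step, assuming a hypothetical HealthRecords-style table and hand-written generalization levels (masking Zip digits, bucketing YearOfBirth into decades, finally suppressing Sex); the k-anonymity check simply counts how many tuples share each combination of identifying values:

    from collections import Counter

    # Hypothetical identifying columns: (Sex, YearOfBirth, Zip).
    rows = [
        ("F", 1971, "94305"),
        ("F", 1974, "94309"),
        ("M", 1968, "94301"),
        ("M", 1965, "94305"),
    ]

    def generalize(row, level):
        """One hypothetical generalization step per level: level 1 masks the
        last Zip digit, level 2 masks another digit and buckets YearOfBirth
        into a decade, level 3 also suppresses Sex."""
        sex, yob, zipc = row
        if level >= 1:
            zipc = zipc[:5 - level] + "*" * level
        if level >= 2:
            yob = f"{(yob // 10) * 10}-{(yob // 10) * 10 + 9}"
        if level >= 3:
            sex = "*"
        return (sex, yob, zipc)

    def is_k_anonymous(table, k):
        """Each tuple must share its identifying values with >= k-1 others."""
        counts = Counter(table)
        return all(counts[t] >= k for t in table)

    for level in range(4):
        table = [generalize(r, level) for r in rows]
        print(level, is_k_anonymous(table, k=2), table)
        # level 2 is the first level at which this toy table is 2-anonymous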

Precision for each value can be measured by the number of times the generalizing function was applied to it. Precision for a table is then the sum of precision values for all (row, column) cells.
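
A sketch of that bookkeeping, reusing the hypothetical levels from the previous sketch; here each cell's precision value is simply the number of generalization steps applied to it (so a larger total means a less precise table), which is one plausible reading of the note rather than an exact published metric:

    def table_precision(steps_per_cell):
        """Sum, over all (row, column) cells, of how many times the
        generalization function was applied to that cell's value."""
        return sum(sum(row) for row in steps_per_cell)

    # Hypothetical: 4 rows, 3 identifying columns, all rows generalized to
    # level 2 above, i.e. Sex untouched (0), YearOfBirth bucketed once (1),
    # Zip masked twice (2).
    steps = [[0, 1, 2] for _ in range(4)]
    print(table_precision(steps))   # -> 12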

Open Problems

  1. Can we design an optimal algorithm or is it NP-Hard? Meyerson and Williams show that generalized k-anonymity is NP-Hard. Sweeney, in an unpublished report cited by Meyerson and Williams, proposes an exact polynomial time algorithm when n < log(|R|), where n is the number of attributes and |R| is the number of tuples in R.

  2. The aim of generalizing values is to ensure that for each tuple in R, there are at least k-1 other tuples in R with the same values in the identifying attributes. Can we replace this with a more "general" requirement: each tuple in the actual R could be any of k tuples in the exported R?

  3. Instead of having a fixed/arbitrary set of generalization functions, can we incorporate the notion of looking at frequency distributions/histograms of domain values in a column to select generalizations that are data-driven? The notion of precision could also be made more fine-grained to incorporate distributions.

  4. More generally, can we be instance-driven in defining the notion of anonymity? That is, can we guarantee anonymity purely in terms of the provided instance, rather than selecting a set of columns that we believe have foreign-key relationships with external databases?

  5. Suppose e (>=2) entities wish to perform joins on their own relations. One possibility is to anonymize each table, and then perform a join. How bad would the precision of such a join result be? Is there an algorithm that joins exact tables but produces k-anonymized results? What is the degree of privacy for such an algorithm?

  6. An organization might have to periodically release R. Across time there might have been inserts, deletes and updates in R. How can the generalization algorithm be made sensitive to future/past releases?

  7. Instead of putting out a single k-anonymized R, can the hospital put out multiple versions of R, each generalizing different sets of attributes to different degrees? What are the gains from such an approach? What is the degree of privacy that can be enabled across all versions of R?

  8. The precision metric seems to be sensitive to data distributions in R. How much skew in the data makes exported R virtually unusable?


bawa@cs.stanford.edu
10-22-2003