Databases should include responsibility for privacy as a fundamental tenet. Efficiency is no longer a central focus.
Statistical Databases: These provide only aggregate information without revealing any sensitive private information. The two major methods are:
Query Restriction: Restrict the size of query results and control the overlap among successive queries.
Data Perturbation: Swap values, add noise, or replace values with samples from the same distribution. The query results themselves may also be perturbed, or only a sample of the result may be returned.
Secure Databases: These mainly incorporate access control of the database. In multilevel secure databases, there is a level associated with each attribute, and one associated with each query too. A query can access only those attributes allowed by its level.
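The level check in a multilevel secure database can be sketched as below. This is only a minimal illustration: the attribute names and numeric levels are hypothetical, and a real system would enforce the check inside the database engine.

```python
# Hypothetical sensitivity levels: a higher number means more sensitive.
ATTRIBUTE_LEVEL = {
    "name": 1,
    "shipping_address": 2,
    "credit_card": 3,
}


def visible_attributes(query_level: int) -> set:
    """Return the attributes a query at the given level may read."""
    return {attr for attr, level in ATTRIBUTE_LEVEL.items() if level <= query_level}


def check_query(query_level: int, requested: set) -> bool:
    """Allow the query only if every requested attribute is within its level."""
    return requested <= visible_attributes(query_level)
```

For instance, a level-2 query may read the name and shipping address but not the credit card number.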
The following are the principles on which Hippocratic databases should be built. They are mainly inspired by the U.S. Privacy Act of 1974 and similar acts in other countries, and are known as fair information practices. They apply to any information being collected about individuals:
Purpose Specification: The purpose for which the information is being collected should be specified.
Consent: The donor must have consented to the data collection.
Limited Collection: Only as much data as necessary should be collected.
Limited Use: Information should be used only for the purpose for which it is collected.
Limited Disclosure: The information should not be disclosed to third parties without the donor's consent.
Limited Retention: The information should be retained only for as long as required.
Accuracy: The information maintained about the individual should be accurate and up-to-date.
Safety: The collecting agency should ensure the safety of the data and protect against unauthorized access.
Openness: The donor must be able to see his information and edit it.
Compliance: The donor should be able to verify whether the privacy of the data is being respected.
Purpose is a fundamental concept on which this whole structure is built. Associated with each purpose (say P), we have the following:
External recipients: Which external organizations can be given the information collected for purpose P.
Retention Period: For how long shall information collected for P be retained.
Authorized users: Which users within the corporation have access to the data collected for P. For example, if some information, such as credit card details, is obtained only for completing the current transaction, then the sales department can have access to it, but the business analysis division cannot.
Further, each attribute of each table is collected for a certain set of purposes. Hence we have privacy metadata of the form:
Purpose | Table | Attribute | Recipients | Retention | Authorized Users |
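One way to picture this metadata is as a list of records, one per (purpose, table, attribute) combination. The sketch below is illustrative only: the class name, table and attribute names, user groups and retention values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrivacyMetadata:
    purpose: str
    table: str
    attribute: str
    recipients: frozenset        # external organizations allowed to receive it
    retention_days: int          # how long the attribute is kept for this purpose
    authorized_users: frozenset  # internal groups allowed to read it


# Hypothetical entries, following the credit card example above.
METADATA = [
    PrivacyMetadata("purchase", "orders", "credit_card",
                    frozenset(), 30, frozenset({"sales"})),
    PrivacyMetadata("purchase", "orders", "shipping_address",
                    frozenset({"delivery-company"}), 30,
                    frozenset({"sales", "shipping"})),
    PrivacyMetadata("recommendations", "orders", "book_title",
                    frozenset(), 3650, frozenset({"analysis"})),
]


def may_access(user: str, purpose: str, table: str, attribute: str) -> bool:
    """True if some metadata record authorizes this user for this purpose."""
    return any(
        m.purpose == purpose and m.table == table
        and m.attribute == attribute and user in m.authorized_users
        for m in METADATA
    )
```

With these entries, the sales department can read the credit card number for the purchase purpose, while the analysis division cannot.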
The W3C recommends P3P (Platform for Privacy Preferences) for companies to specify their privacy policy in a machine-readable XML-like format. Users can specify their privacy preferences in APPEL, which has a similar format.
The function of P3P is basically to specify the privacy metadata shown above, i.e., it specifies what information is being collected for which purpose, how long it will be retained and so on. It does so by having some predefined values for these fields:
Purpose (12 possible values): For example, current - for the current transaction; individual - for making personalized recommendations.
Recipient (6 possible values): For example, ourselves - the agency collecting the information; same - other organizations following the same practices.
Retention (5 possible values): For example, stated purpose - for only the current transaction; business-practice - retained but discarded according to a timetable.
Privacy preferences are specified in APPEL. They basically consist of a set of rules which can be matched against the privacy policy. These can be viewed as an XQuery over the policy specified in XML. Each rule consists of a behavior (either block or request, i.e. allow) and a body. When the body matches a portion of the policy, the behavior is triggered.
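The rule-matching idea can be sketched in a few lines. This is a deliberately simplified model and not the actual APPEL syntax or semantics: the policy is assumed to be already flattened into dictionaries, a rule body is a set of field/value pairs, rules are tried in order, and the default behavior when no rule fires is an assumption of this sketch.

```python
# A policy flattened into statements (field names and values are illustrative).
POLICY = [
    {"data": "email", "purpose": "current",
     "recipient": "ourselves", "retention": "stated purpose"},
    {"data": "purchase_history", "purpose": "individual",
     "recipient": "same", "retention": "business-practice"},
]


def body_matches(body: dict, statement: dict) -> bool:
    """A body matches a statement if every field it names has the same value."""
    return all(statement.get(field) == value for field, value in body.items())


def evaluate(rules: list, policy: list, default: str = "request") -> str:
    """Return the behavior of the first rule whose body matches part of the policy."""
    for rule in rules:
        if any(body_matches(rule["body"], stmt) for stmt in policy):
            return rule["behavior"]
    return default  # assumption: allow when no rule fires


# A user who refuses to have data shared with other organizations.
RULES = [{"behavior": "block", "body": {"recipient": "same"}}]
```

Here `evaluate(RULES, POLICY)` yields the block behavior, because the second policy statement names the recipient `same`.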
Implementations: Microsoft IE6, AT&T Privacy Bird
Policy Creation Tools: P3Pedit, IBM Tivoli Privacy Wizard
Checking or policy matching: Presently done using a proxy. Agrawal et al. propose the use of server-centric technology for matching preferences against policies. They argue that this allows much of the power of relational databases to be used. To use a relational database, the policy is first flattened into relations. The privacy preference is then converted into a SQL query over this database. This implementation works faster than a client-side implementation and is more extensible. Besides, a server-centric implementation is good for mobile clients.
If the preference matches the policy, data is collected. Each tuple is then tagged with the set of purposes for which it is collected. Audit trails of each successful match are stored for any legal challenges to compliance.
Queries are tagged with a purpose. These queries can access only certain columns of the tables depending on the privacy metadata. Further, only those tuples in the data that are tagged with that purpose are visible to the query.
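The two restrictions on purpose-tagged queries (allowed columns, and visible tuples) can be sketched as follows. The in-memory rows, column names and purposes are hypothetical; a real implementation would rewrite or filter the query inside the database rather than in application code.

```python
# Columns each purpose may read (would come from the privacy metadata).
ALLOWED_COLUMNS = {
    "purchase": {"name", "shipping_address", "credit_card"},
    "recommendations": {"name", "book_title"},
}

# Each tuple carries the set of purposes its donor consented to.
ROWS = [
    {"name": "Alice", "shipping_address": "1 Main St", "credit_card": "1111",
     "book_title": "Dune", "purposes": {"purchase", "recommendations"}},
    {"name": "Bob", "shipping_address": "2 Oak Ave", "credit_card": "2222",
     "book_title": "Emma", "purposes": {"purchase"}},
]


def run_query(purpose: str, columns: list) -> list:
    """Project the requested columns from tuples tagged with the query's purpose."""
    denied = set(columns) - ALLOWED_COLUMNS.get(purpose, set())
    if denied:
        raise PermissionError(f"purpose {purpose!r} may not read {sorted(denied)}")
    return [
        {col: row[col] for col in columns}
        for row in ROWS
        if purpose in row["purposes"]
    ]
```

A recommendations query sees only Alice's row, since Bob's data was collected for the purchase purpose alone, and it is refused outright if it asks for the credit card column.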
There should be a separate intrusion detection module to detect anomalous queries posed even by authorized users. For example, although the sales department is allowed to see the shipping address on an order-by-order basis, if a person in the sales department tries to steal all the addresses, such a query should be flagged as anomalous.
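A very crude version of such a check compares the size of a query result with what is typical for that kind of access. The function below is only a sketch under that assumption; the parameter names and the tolerance factor are hypothetical, and a practical detector would build a profile from past queries.

```python
def is_anomalous(rows_returned: int, typical_rows: int = 1, tolerance: int = 10) -> bool:
    """Flag a query whose result is far larger than the usual access pattern.

    typical_rows: the expected result size (one order at a time in the example).
    tolerance: hypothetical factor by which a query may exceed that size.
    """
    return rows_returned > typical_rows * tolerance
```

Looking up one shipping address passes the check, whereas a query returning every address in the table is flagged.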
A data item should be retained for the maximum retention period among all the purposes for which it has been collected. After this period, it should be deleted from all logs, checkpoints etc.
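The retention rule amounts to taking a maximum over the purposes attached to a data item. A minimal sketch, assuming retention periods are expressed in days (the purposes and values below are made up):

```python
from datetime import date, timedelta

# Hypothetical retention periods per purpose, in days.
RETENTION_DAYS = {"purchase": 30, "recommendations": 365}


def expiry_date(collected_on: date, purposes: set) -> date:
    """The item expires after the longest retention period among its purposes."""
    longest = max(RETENTION_DAYS[p] for p in purposes)
    return collected_on + timedelta(days=longest)
```

An item collected only for `purchase` expires after 30 days, while one collected for both purposes is kept for the full 365 days.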
A separate analysis module is responsible for checking whether a particular attribute is unused, or whether a particular authorized user has more privileges than required.
There are various challenges to developing such a system.
A good language is needed for specifying privacy policies. We need a language more general than P3P. For example, how should it handle a user who is willing to give away some information provided he or she receives a certain compensation?
There are traditional efficiency concerns in using this design. For example, if a customer gives information for the same purpose every time, the purpose tag can be kept once in the customer table rather than with every transaction.
To ensure limited collection, we can perform:
Access Analysis - see which attributes are not being used. The problem is that the use of an attribute can be conditional. For example, assets are required for a mortgage application only if the salary is low.
Granularity analysis - check that no more information than required is being collected. For example, if queries are only of the form numChildren>0 and numChildren==0, the exact number of children is finer-grained data than required.
A query may be rewritten in terms of minimal queries to reveal issues such as those found by the access analysis and granularity analysis above.
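Both analyses can be approximated by scanning a query log. The sketch below assumes each logged query is reduced to a list of (attribute, operator, constant) predicates; the attribute names and log contents are hypothetical. As noted above, conditional use of an attribute means that a log-based answer is only as good as the log.

```python
COLLECTED = {"salary", "assets", "numChildren", "fax"}

# Hypothetical query log: each query is a list of (attribute, operator, constant).
QUERY_LOG = [
    [("salary", "<", 50000), ("assets", ">", 100000)],
    [("numChildren", ">", 0)],
    [("numChildren", "==", 0)],
]


def unused_attributes(collected: set, query_log: list) -> set:
    """Access analysis: attributes that no logged query ever references."""
    used = {attr for query in query_log for attr, _, _ in query}
    return collected - used


def needs_only_zero_test(attribute: str, query_log: list) -> bool:
    """Granularity analysis: is the attribute only ever compared against zero?"""
    preds = [(op, const) for query in query_log
             for attr, op, const in query if attr == attribute]
    return bool(preds) and all(p in {(">", 0), ("==", 0)} for p in preds)
```

On this log, `fax` is reported as unused, and `numChildren` could be stored as a yes/no flag instead of an exact count.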
Data should be removed permanently after expiry, i.e. from all logs, checkpoints etc.
For safety, data on disk must be secured by encryption. However, this opens the very difficult problem of query-processing over encrypted data.
Universal logging: Each donor is provided with logs of accesses to his/her data. However, this is not scalable.
Tracking Breaches: Fingerprinting - Insert some false email accounts. If an email is received on one of those accounts, then a privacy breach has taken place. Fingerprints should not be assigned uniformly, but according to the distribution of the data, e.g., more in populated areas.
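Assigning fingerprints according to the distribution can be sketched as a proportional allocation. The region names and populations below are invented, and the largest-remainder rounding is just one reasonable choice for making the counts add up to the requested total.

```python
def allocate_fingerprints(population: dict, total: int) -> dict:
    """Split `total` fake accounts across regions in proportion to population."""
    whole = sum(population.values())
    quotas = {region: total * n / whole for region, n in population.items()}
    alloc = {region: int(q) for region, q in quotas.items()}
    # Hand out the remaining accounts to the largest fractional remainders.
    leftover = total - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda r: quotas[r] - alloc[r], reverse=True)
    for region in by_remainder[:leftover]:
        alloc[region] += 1
    return alloc


def is_breach(recipient: str, fingerprint_addresses: set) -> bool:
    """Mail arriving at a fake account means the address list has leaked."""
    return recipient in fingerprint_addresses
```

With populations of 900, 90 and 10 and ten fingerprints to place, nine go to the largest region, one to the middle region, and none to the smallest.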
A strawman design for the Hippocratic Databases paradigm and the major challenges therein are given above.