Abstract: Security when Collaborating
Panel presentation on “Trust and Security in Biological Databases”; Gio Wiederhold, Ph.D., Stanford University, CA
Traditional security mechanisms have focused on access control, assuming that we can distinguish the good guys from the bad guys and can label any data collection as being accessible to the good guys. If those assumptions hold, the technology is conceptually simple and is made hard only by technical faults. However, in many practical situations such sharp distinctions cannot be made, and the technologies developed for access control become inadequate. In medicine, but also in many commercial settings, we find unstructured data collections. Such data are collected and stored without the submitter being fully aware of their future use, and hence without the submitter being able to anticipate all future access needs. A complementary technology that augments access control is result filtering: inspecting the contents of documents before they leave the boundary of the protected system.
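To make the idea concrete, a result filter sits at the release boundary and inspects content on its way out; access control alone never sees what actually leaves. The following is a minimal Python sketch of that architecture, not a description of any fielded system; the function names and the two callables are my own illustrative assumptions.

    def release(requestor, document, may_access, approves_content):
        # Classic access control: is this requestor allowed in at all?
        if not may_access(requestor, document):
            return None
        # Result filtering: inspect what is actually about to leave.
        if not approves_content(document):
            return None          # withhold, or route to manual review
        return document

    # Example use: release only documents that contain no digits.
    doc = release("alice", "status report", lambda r, d: r == "alice",
                  lambda d: not any(c.isdigit() for c in d))

The point of the structure is that the second check operates on the document contents themselves, so it still works when the stored labels are stale or missing.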
I will briefly illustrate the issue in two settings, one simple and one more complex. Military documents have long been labeled with mandatory and discretionary classifications, and legitimate accessors are identified with respect to those categories. But when a new situation arises, the old labels are inadequate. When we had to share information with the Russians in Kosovo, no adequate labeling existed, and relabeling all stored documents was clearly impractical. A filter can instead be written to check the text for the limited, locally relevant content and make only that available. Any document containing unrecognized noun phrases would be withheld, or could be handed over to a security officer for manual processing.
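A minimal sketch of such a filter follows, assuming a locally maintained whitelist of releasable terms. The term list is hypothetical, and runs of capitalized words stand in crudely for the noun-phrase extraction a real filter would perform with a proper parser.

    import re

    RELEASABLE_TERMS = {"kosovo", "pristina", "refugee convoy"}  # hypothetical whitelist

    def noun_phrase_candidates(text):
        # Crude stand-in: runs of capitalized words approximate noun phrases.
        return {m.group(0).strip().lower()
                for m in re.finditer(r"(?:[A-Z][a-z]+\s?)+", text)}

    def screen_for_release(text):
        unknown = noun_phrase_candidates(text) - RELEASABLE_TERMS
        if unknown:
            return "withheld", unknown   # hand over to a security officer
        return "released", set()

The conservative default matters: anything the filter does not positively recognize stays inside the boundary.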
More complex situations occur when we have statistical data, such as a census or, as in bioinformatics, phenotypic and genomic data. We want to prevent the release of statistical summaries for cells that have fewer than, say, 10 instances, to reduce the likelihood of inference back to an individual. If we rely on access control, we have to precompute the minima over columns and rows and aggregate their categorizations before granting access, in order to prevent such releases. However, the distribution of counts over those cells is very uneven. If we instead check the actual contents at the time of release, we can allow much smaller categories to be used and only omit or aggregate the cells that are too small.
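A sketch of that release-time check appears below. The threshold of 10 comes from the text; the function names and the pooling of small cells into a combined "other" cell are my assumptions about one reasonable way to aggregate.

    from collections import Counter

    THRESHOLD = 10   # minimum releasable cell size

    def safe_counts(records, category_of, threshold=THRESHOLD):
        # Check actual cell sizes at release time rather than
        # precomputing coarse categories for access control.
        counts = Counter(category_of(r) for r in records)
        released, pooled = {}, 0
        for cat, n in sorted(counts.items()):
            if n >= threshold:
                released[cat] = n
            else:
                pooled += n                  # aggregate the small cells...
        if pooled >= threshold:
            released["other"] = pooled       # ...and release them combined,
        return released                      # or omit them entirely

Because the check sees the real counts, a category that happens to be well populated can be released at full resolution even though a precomputed access rule would have had to forbid it.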
Checking results as they are released can also provide a barrier against credit-card theft and the like. If a person masquerading as a customer locates a trapdoor and removes 10,000 credit-card records instead of an MP3 tune, that can easily be recognized, since those data have very different signatures.
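One plausible form of such a check compares the size and shape of an outgoing result against what the transaction type normally produces. The envelopes and names below are hypothetical illustrations, not measurements.

    # Hypothetical per-transaction envelopes: roughly how large a
    # legitimate result of each kind should be.
    EXPECTED = {
        "mp3_download":  {"max_records": 1, "max_bytes": 10_000_000},
        "account_query": {"max_records": 5, "max_bytes": 10_000},
    }

    def result_looks_legitimate(kind, n_records, n_bytes):
        envelope = EXPECTED.get(kind)
        if envelope is None:
            return False                     # unknown transaction: withhold
        return (n_records <= envelope["max_records"]
                and n_bytes <= envelope["max_bytes"])

    # 10,000 card records do not fit any legitimate envelope here:
    assert not result_looks_legitimate("account_query", 10_000, 400_000)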
In summary, many of our accessors are collaborators or customers, even though we know little about them. We want to give them the best possible service while still protecting our property, or the privacy that individuals trust us to keep. Focusing only on access control, and then not checking what is released, is an inadequate, even naive, approach for systems involving collaboration.
Research leading to these concepts and supporting technologies was supported by NSF under the HPCC and DL2 programs.