PADS: Processing Arbitrary Data Sources
Kathleen Fisher, AT&T Labs

Many high-volume data sources can be mined profitably at various points in the data-management workflow, for example to identify erroneous or out-of-band data before sending the data to the next workflow phase or before loading it into a persistent storage system. Such sources arise in many application domains, including call detail records in telephony systems, web server logs, network packets, network configuration and log files, provisioning records, credit card records, stock market data, and scientific datasets.

Unfortunately, many such data sources are in semi-structured, ad hoc formats over which data consumers have no control. Understanding such a data source and writing a parser for it takes significant effort, and the process is both tedious and error-prone. Often, the hard-won understanding of the data ends up embedded in parsing code, which makes it difficult both to share that understanding and to maintain the parser. Typically, such parsers are also incomplete: they fail to specify how to handle data that does not conform to the expected format.

In this talk, I will describe the PADS project, which provides languages and tools for simplifying the analysis of ad hoc data. We have designed a declarative data-description language, PADS/C, expressive enough to describe the structured and semi-structured data sources we see in practice at AT&T, including ASCII, binary, EBCDIC (COBOL), and mixed formats. From a PADS/C description we generate a C library with functions for parsing, manipulating, summarizing, and querying the data, and for writing it out in other ad hoc formats and in standard formats such as XML.

This work is joint with Bob Gruber, Mary Fernandez, David Walker, Yitzhak Mandelbaum, and Mark Daly.
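
To make the point about incomplete hand-written parsers concrete, here is a minimal sketch in C of a parser that records a status for every field instead of stopping at the first malformed one. It is not PADS/C and not PADS-generated code. The record layout (caller|callee|duration_seconds) and the names call_record, field_status, and parse_call_record are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical ad hoc record: "<caller>|<callee>|<duration_seconds>" */
typedef enum { FIELD_OK = 0, FIELD_MISSING, FIELD_BAD_SYNTAX } field_status;

typedef struct {
    char caller[16];
    char callee[16];
    long duration;           /* seconds */
    field_status status[3];  /* one status per field, in record order */
    int nerrors;             /* number of fields that are not FIELD_OK */
} call_record;

/* Accept only non-empty, all-digit text that fits in the output buffer. */
static field_status parse_digits(const char *s, size_t len,
                                 char *out, size_t outsz) {
    out[0] = '\0';
    if (len == 0) return FIELD_MISSING;
    if (len >= outsz) return FIELD_BAD_SYNTAX;
    for (size_t i = 0; i < len; i++)
        if (s[i] < '0' || s[i] > '9') return FIELD_BAD_SYNTAX;
    memcpy(out, s, len);
    out[len] = '\0';
    return FIELD_OK;
}

/* Parse a non-negative decimal duration, rejecting values that overflow. */
static field_status parse_duration(const char *s, size_t len, long *out) {
    char buf[32];
    field_status st = parse_digits(s, len, buf, sizeof buf);
    *out = 0;
    if (st != FIELD_OK) return st;
    errno = 0;
    long v = strtol(buf, NULL, 10);
    if (errno == ERANGE) return FIELD_BAD_SYNTAX;
    *out = v;
    return FIELD_OK;
}

/* Parse one line; always fills rec and returns the number of bad fields. */
int parse_call_record(const char *line, call_record *rec) {
    assert(line != NULL && rec != NULL);
    const char *field[3] = { NULL, NULL, NULL };
    size_t len[3] = { 0, 0, 0 };
    const char *p = line;
    const char *end = line + strcspn(line, "\r\n"); /* drop line terminator */

    /* Split on '|'; the third field takes whatever remains on the line. */
    for (size_t n = 0; n < 3; n++) {
        const char *bar = memchr(p, '|', (size_t)(end - p));
        const char *stop = (bar != NULL && n < 2) ? bar : end;
        field[n] = p;
        len[n] = (size_t)(stop - p);
        if (stop == end) break;  /* fields not reached stay "missing" */
        p = stop + 1;
    }

    memset(rec, 0, sizeof *rec);
    rec->status[0] = parse_digits(field[0], len[0],
                                  rec->caller, sizeof rec->caller);
    rec->status[1] = parse_digits(field[1], len[1],
                                  rec->callee, sizeof rec->callee);
    rec->status[2] = parse_duration(field[2], len[2], &rec->duration);
    for (int i = 0; i < 3; i++)
        if (rec->status[i] != FIELD_OK) rec->nerrors++;
    return rec->nerrors;
}
```

Even for a three-field record, the error handling dominates the code, and the knowledge of the format is spread across the helper functions. A declarative description aims to state the same layout once and have the equivalent checks generated.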