From Cookies to Cooks: Insights on Dietary Patterns via Analysis of Web Usage Logs

This page refers to the following paper and provides supplementary details about data collection and preprocessing:

Robert West, Ryen W. White, and Eric Horvitz. From Cookies to Cooks: Insights on Dietary Patterns via Analysis of Web Usage Logs. In Proceedings of the 22nd International World Wide Web Conference (WWW'13), Rio de Janeiro, Brazil, 2013. [PDF]

Abstract

Nutrition is a key factor in people's overall health. Hence, understanding the nature and dynamics of population-wide dietary preferences over time and space can be valuable in public health. To date, studies have leveraged small samples of participants via food intake logs or treatment data. We propose a complementary source of population data on nutrition obtained via Web logs. Our main contribution is a spatiotemporal analysis of population-wide dietary preferences through the lens of logs gathered by a widely distributed Web-browser add-on, using the access volume of recipes that users seek via search as a proxy for actual food consumption. We discover that variation in dietary preferences as expressed via recipe access has two main periodic components, one yearly and the other weekly, and that there exist characteristic regional differences in terms of diet within the United States. In a second study, we identify users who show evidence of having made an acute decision to lose weight. We characterize the shifts in interests that they express in their search queries and focus on changes in their recipe queries in particular. Last, we correlate nutritional time series obtained from recipe queries with time-aligned data on hospital admissions, aimed at understanding how behavioral data captured in Web logs might be harnessed to identify potential relationships between diet and acute health problems. In this preliminary study, we focus on patterns of sodium identified in recipes over time and patterns of admission for congestive heart failure, a chronic illness that can be exacerbated by increases in sodium intake.

Recipe preprocessing

Data collection

First, we identified websites containing large recipe collections by manually selecting 163 sites (listed here) from among

  1. the websites listed in the Open Directory Project (dmoz.org) under Home/Cooking/Recipe Collections,
  2. the top results returned by Bing for the query "recipes", and
  3. the most frequent domains among the pages clicked in response to search queries containing the words "recipe" or "recipes" (in the browsing logs we use in the paper).
In the set of 163 sites found this way, we identified the following 14 high-traffic sites that offer nutritional facts with many of their recipes:

These 14 domains cover 69% of all distinct recipe pages (defined as pages on our 163 original sites) in our log sample.
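
As an illustration of the third selection criterion and of the coverage figure, the following minimal Python sketch ranks domains by how often they are clicked after a query containing "recipe" or "recipes", and computes the share of distinct recipe pages that fall on a chosen set of domains. It is not the code used for the paper: the (query, URL) log format, the function names, and the example domains are all hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the whole words "recipe" or "recipes", case-insensitively.
RECIPE_QUERY = re.compile(r"\brecipes?\b", re.IGNORECASE)


def domain_of(url):
    """Return the host name of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def rank_clicked_domains(click_log):
    """Rank domains by number of clicks that followed a recipe query.

    click_log is an iterable of (query, clicked_url) pairs; this format
    is an assumption made for the sketch.
    """
    counts = Counter(
        domain_of(url) for query, url in click_log if RECIPE_QUERY.search(query)
    )
    return counts.most_common()


def coverage(recipe_urls, selected_domains):
    """Fraction of distinct recipe pages hosted on the selected domains."""
    distinct = set(recipe_urls)
    if not distinct:
        return 0.0
    covered = sum(1 for url in distinct if domain_of(url) in selected_domains)
    return covered / len(distinct)


# Toy data with made-up domains.
click_log = [
    ("chicken soup recipe", "http://www.example-recipes.com/soup/1"),
    ("easy recipes", "http://example-recipes.com/easy"),
    ("Recipe for pie", "http://pies.example.org/apple"),
    ("weather today", "http://weather.example.net/"),
]
ranked = rank_clicked_domains(click_log)
recipe_pages = [url for query, url in click_log if RECIPE_QUERY.search(query)]
share = coverage(recipe_pages, {"example-recipes.com"})
```

On real logs, the ranked list would only be a starting point for the manual selection described above; the sketch does not attempt to judge whether a domain actually hosts a recipe collection.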

Extracting information from recipes

For each of the above 14 sites, we hand-built regular expressions for extracting the title, a list of ingredients, and nutritional contents (cf. Table 1 in the paper) from the HTML source. The regular expressions can be found in the following Perl scripts:

We included only recipes for which nutritional information was available per serving, to normalize for varying dish sizes. Our regular expressions extracted nutritional information for 53% of the recipe pages from the above sites.
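
To give a sense of what such site-specific extraction looks like, here is a minimal sketch in Python (not the Perl scripts themselves). The HTML markup, the regular expressions, and the function names are invented for illustration; each real site needs its own hand-built patterns. The sketch keeps a recipe only if the page states that its nutritional values are per serving, and it reports the fraction of pages for which extraction succeeds.

```python
import re

# Hypothetical patterns for one imaginary site layout.
TITLE_RE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.S | re.I)
INGREDIENT_RE = re.compile(r'<li class="ingredient">(.*?)</li>', re.S | re.I)
NUTRIENT_RE = re.compile(
    r'<span class="nutrient" data-name="([^"]+)">\s*'
    r"(\d+(?:\.\d+)?)\s*(g|mg|kcal)?\s*</span>",
    re.I,
)
PER_SERVING_RE = re.compile(r"per\s+serving", re.I)


def extract_recipe(html):
    """Return title, ingredients, and per-serving nutrition, or None."""
    # Skip pages that do not say their nutrition facts are per serving.
    if not PER_SERVING_RE.search(html):
        return None
    title = TITLE_RE.search(html)
    nutrients = {
        name.lower(): float(value)
        for name, value, _unit in NUTRIENT_RE.findall(html)
    }
    if title is None or not nutrients:
        return None
    return {
        "title": title.group(1).strip(),
        "ingredients": [item.strip() for item in INGREDIENT_RE.findall(html)],
        "nutrition_per_serving": nutrients,
    }


def extraction_rate(pages):
    """Fraction of pages for which extract_recipe returns a record."""
    if not pages:
        return 0.0
    return sum(extract_recipe(p) is not None for p in pages) / len(pages)


SAMPLE = """
<h1>Lentil Soup</h1>
<ul>
  <li class="ingredient">1 cup lentils</li>
  <li class="ingredient">2 carrots</li>
</ul>
<p>Nutrition facts per serving:</p>
<span class="nutrient" data-name="Calories">230 kcal</span>
<span class="nutrient" data-name="Sodium">480 mg</span>
"""
NO_SERVING = "<h1>Mystery Stew</h1><p>No nutrition facts listed.</p>"

record = extract_recipe(SAMPLE)
rate = extraction_rate([SAMPLE, NO_SERVING])
```

Regular expressions are brittle against layout changes, so a pattern set like this is tied to one snapshot of one site's HTML; the sketch also ignores units, which a fuller version would need to normalize.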

Questions? Comments? Ideas?

Just send me an email. I'm happy to discuss my research with you.

Last modified on June 11, 2013