Data-Intensive Text Processing with MapReduce

Tutorial at SIGIR 2009

Sunday, July 19, 2009

Jimmy Lin
University of Maryland, College Park

Slides and Useful Information

  • Complete tutorial slides: [.pptx] (5.13 MB) [.pdf] (2.61 MB)
  • Cloud9: A MapReduce Library for Hadoop (tutorials, getting started guides, tips & tricks, etc.)


The ability to scale out to collections tens of terabytes (and soon, petabytes) in size is one of the central challenges facing the academic information retrieval community today. In the near future, working with Web-scale collections will no longer be a luxury, but a necessity. As part of this transition, the community must move from designing algorithms for individual machines to designing distributed algorithms that run on large clusters. The MapReduce programming model provides a convenient framework for organizing distributed computations that exhibits good "scale out" characteristics on clusters of commodity machines. With the release of the open-source Hadoop implementation, academic researchers now have ready access to a cost-effective tool for tackling Web-scale problems.

The goal of this tutorial is to provide an accessible introduction to MapReduce, and to its potential to revolutionize academic information retrieval research, for those not already familiar with it. This full-day tutorial will be divided into two parts: the morning session will cover fundamental MapReduce concepts, while the afternoon session will focus on hands-on exercises with Hadoop.

The emphasis of this tutorial is scalability and the tradeoffs associated with distributed processing of large datasets. The tutorial will cover "core" information retrieval topics (e.g., inverted index construction) as well as related topics in the broader area of human language technologies (e.g., distributed parameter estimation, graph algorithms). Content will include general discussions of algorithm design, presentation of illustrative algorithms, relevant case studies, and practical advice on writing Hadoop programs and running Hadoop clusters.
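
For readers new to the model, inverted index construction can be pictured as a map step that emits (term, posting) pairs and a reduce step that collects the postings for each term. The following is only a minimal single-machine sketch in Python, not material from the tutorial and not the Hadoop API: the function names (map_doc, shuffle, reduce_postings, build_index), the whitespace tokenization, and the toy documents are all assumptions made for illustration.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map step: emit (term, (doc_id, term_frequency)) for each distinct term."""
    counts = defaultdict(int)
    for token in text.lower().split():  # naive whitespace tokenization (assumption)
        counts[token] += 1
    for term, tf in counts.items():
        yield term, (doc_id, tf)


def shuffle(pairs):
    """Group values by key; stands in for the framework's shuffle-and-sort phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_postings(term, postings):
    """Reduce step: sort the postings list for one term by document id."""
    return term, sorted(postings)


def build_index(docs):
    """Run map, shuffle, and reduce in-process over a dict of doc_id -> text."""
    pairs = (pair for doc_id, text in docs.items() for pair in map_doc(doc_id, text))
    return dict(reduce_postings(term, plist) for term, plist in shuffle(pairs).items())


# Hypothetical toy collection, for illustration only.
docs = {
    1: "hadoop runs mapreduce jobs",
    2: "mapreduce scales out",
    3: "hadoop hadoop clusters",
}
index = build_index(docs)
print(index["hadoop"])
```

In a real cluster the map and reduce functions would run on different machines and the grouping would be done by the framework; the sketch only shows how the computation is divided between the two steps.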

Thanks to the generous support of Amazon Web Services (AWS), participants in this tutorial will be supplied with free AWS credits that can be applied to Amazon's Elastic Compute Cloud (EC2). With this "utility computing" service, users can rapidly provision Hadoop clusters on the fly without needing to purchase any hardware. EC2 allows interested researchers to experiment with Hadoop at a reasonable cost before committing to the large capital investments necessary to acquire and maintain dedicated clusters.

Instructor Bio

Jimmy Lin is an associate professor in the iSchool at the University of Maryland. He joined the faculty in 2004 after completing his Ph.D. in Electrical Engineering and Computer Science at MIT. Jimmy's research interests lie at the intersection of information retrieval and natural language processing. He leads the University of Maryland's efforts in the Google/IBM Academic Cloud Computing Initiative (ACCI), which includes a grant from the U.S. National Science Foundation. Jimmy has taught two semester-long "cloud computing" courses on MapReduce and has given numerous talks about data-intensive text processing to a wide audience.


This work is supported by NSF under award IIS-0836560; the Intramural Research Program of the NIH, National Library of Medicine; Google and IBM under the Academic Cloud Computing Initiative (ACCI); and Amazon Web Services. Any opinions, findings, conclusions, or recommendations expressed here are the instructor's and do not necessarily reflect those of the sponsors.

This page first created: 18 Jul 2009. Creative Commons: Attribution-Noncommercial-Share Alike 3.0 United States.