Scalable Information Extraction and Integration

Eugene Agichtein and Sunita Sarawagi

Data mining applications over text require efficient methods for extracting and integrating the information "buried" in millions, or billions, of text documents. This tutorial reviews the state of the art in scaling up the extraction, mining, and integration of information in large amounts of unstructured text. We review key approaches for scaling up information extraction, including the use of general-purpose as well as specialized indexing techniques. We also survey scalable techniques for integrating and cleaning the extracted information. We highlight research opportunities in applications of scalable information extraction and integration, as well as the fundamental challenges that remain.

Final tutorial notes now available: KDD06Tutorial.pdf

Presenter Information

Eugene Agichtein is an Assistant Professor in the Mathematics and Computer Science Department at Emory University. He was previously a Postdoctoral Researcher in the Text Mining, Search, and Navigation group at Microsoft Research. Eugene has been working on scalable methods for extracting, discovering, and managing information in large unstructured text collections and the web. He has a Ph.D. in Computer Science from Columbia University and a B.S. in Engineering from The Cooper Union. He has several publications on scalable and efficient information extraction, including papers that received the best student paper award at the IEEE ICDE 2003 conference and the best paper award at the SIGMOD 2006 conference. Eugene has been a program committee member of the ICDM, SIGKDD, SIGIR, ACL/COLING, and PKDD conferences.

Sunita Sarawagi conducts research in databases, data mining, machine learning, and statistics. She is an Associate Professor at IIT Bombay. Prior to that, she was a research staff member at IBM Almaden Research Center. She received her Ph.D. in databases from the University of California at Berkeley and a bachelor's degree from IIT Kharagpur. She has several publications in databases and data mining, including a best paper award at the 1998 ACM SIGMOD conference, and holds several patents. She is on the editorial board of the ACM TODS and ACM KDD journals and was editor-in-chief of the ACM SIGKDD newsletter. She has served as a program committee member for the ACM SIGMOD, VLDB, ACM SIGKDD, IEEE ICDE, and ICML conferences.