Data mining applications over text require efficient methods for extracting and integrating the information ``buried'' in millions, or even billions, of text documents. This tutorial reviews the state of the art in scaling up the extraction, mining, and integration of information from large amounts of unstructured text. We review key approaches for scaling up information extraction, including the use of both general-purpose and specialized indexing techniques. We also survey scalable techniques for integrating and cleaning the extracted information. Finally, we highlight research opportunities in applications of scalable information extraction and integration, as well as the fundamental challenges that remain.
Sunita Sarawagi conducts research in databases, data mining, machine learning, and statistics. She is an associate professor at IIT Bombay. Prior to that, she was a research staff member at IBM Almaden Research Center. She received her PhD in databases from the University of California at Berkeley and a bachelor's degree from IIT Kharagpur. She has several publications in databases and data mining, including one that received a best paper award at the 1998 ACM SIGMOD conference, and holds several patents. She is on the editorial board of the ACM TODS and ACM KDD journals and was editor-in-chief of the ACM SIGKDD newsletter. She has served as a program committee member for the ACM SIGMOD, VLDB, ACM SIGKDD, IEEE ICDE, and ICML conferences.