Towards Web-Scale Information Extraction

Eugene Agichtein
Emory University Mathematics & Computer Science


Data mining applications over text require efficient methods for extracting and structuring the information embedded in millions, or billions, of text documents. This presentation reviews current research on enabling information extraction to operate at Web scale. The dimensions of scalability include corpus size, heterogeneity of the information sources, access to the documents, and diversity of the extraction domains; this presentation focuses on the first three. First, I will briefly review common information extraction tasks such as entity, relation, and event extraction, indicating the main scalability bottlenecks associated with each task. I will then review the key algorithmic approaches to improving the efficiency of information extraction, which include ideas adapted from text mining and information retrieval, as well as recently developed specialized indexing techniques.


  • Recording of the presentation (22 March 2007): available soon.

  • The slides for the talk: [ ppt ] [ ps ] [ pdf ].



  • Answers to questions asked live: to be posted

  • E-mail me any additional questions


  1. E. Agichtein, Scaling Information Extraction to Large Document Collections, IEEE Data Engineering Bulletin, 2005

  2. E. Agichtein and L. Gravano. Querying text databases for efficient information extraction. ICDE 2003

  3. E. Agichtein, P. Ipeirotis, and L. Gravano, Modeling Query-Based Access to Text Databases, WebDB 2003

  4. M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni, Open Information Extraction from the Web, IJCAI 2007

  5.  S. Cucerzan and E. Agichtein, Factoid question answering over unstructured and structured content on the web, TREC 2005

  6. M. J. Cafarella, D. Downey, S. Soderland, and O. Etzioni, KnowItNow: Fast, Scalable Information Extraction from the Web, EMNLP 2005

  7. S. Bird, Y. Chen, S.B. Davidson, H. Lee, and Y. Zheng, Designing and Evaluating an XPath Dialect for Linguistic Queries, ICDE 2006

  8.  J. Cho, S. Rajagopalan, "A Fast Regular Expression Indexing Engine," ICDE 2002.

  9. M. J. Cafarella and O. Etzioni. A search engine for natural language applications. WWW 2005

  10. A. Chandel, P.C. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. ICDE 2006

  11.  K. Chakrabarti, V. Ganti, Jiawei Han, D. Xin, Ranking Objects Based on Relationships, SIGMOD 2006

  12.  S. Chakrabarti, Kriti Puniyani and Sujatha Das, Optimizing Scoring Functions and Indexes for Proximity Search in Type-annotated Corpora. WWW 2006

  13.  M. J. Cafarella, C. Re, D. Suciu, O. Etzioni, and M. Banko, Structured Querying of Web Text: A Technical Challenge, CIDR 2007

  14.  S. Chaudhuri, R. Ramakrishnan, and G. Weikum. Integrating DB and IR technologies: What is the sound of one hand clapping?, CIDR 2005.

  15.  S. Dumais, M. Banko, E. Brill, J. Lin, and A. Ng, Web question answering: Is more always better? SIGIR 2002

  16. R. Grishman, S. Huttunen, and R. Yangarber. Information extraction for enhanced access to disease outbreak reports. Journal of Biomedical Informatics, 2002.

  17.  D. Gruhl, L. Chavet, D. Gibson, J. Meyer, P. Pattanayak, A. Tomkins, and J. Zien. How to build a WebFountain: An architecture for very large-scale text analytics. IBM Systems Journal, 2004.

  18.  Silja Huttunen, Roman Yangarber and Ralph Grishman,  Diversity of Scenarios in Information Extraction, LREC 2002

  19.  P. Ipeirotis, E. Agichtein, P. Jain, and L. Gravano, To Search or to Crawl: Towards a Query Optimizer for Text-Centric Tasks, SIGMOD 2006

  20. A. Klementiev and D. Roth, Weakly Supervised Named Entity Transliteration and Discovery from Multilingual Comparable Corpora. ACL 2006

  21. A. Klementiev and D. Roth, Named Entity Transliteration and Discovery from Multilingual Comparable Corpora. NAACL HLT 2006

  22. R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, H. Zhu, Avatar: A Database Approach to Semantic Search, SIGMOD 2006

  23. P. Pantel, D. Ravichandran, and E. Hovy. Towards terascale knowledge acquisition. In Conference on Computational Linguistics (COLING), 2004.

  24. P. Resnik and A. Elkiss. The linguist’s search engine: An overview (demonstration). In ACL, 2005.

  25.  M. Pasca, D. Lin, J. Bigham, A. Lifchits, and A. Jain, Organizing the World Wide Web of Facts - Step One: the One-Million Fact Extraction Challenge, AAAI 2006

  26. P.D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In European Conference on Machine Learning (ECML), 2001.

  27. P. Li and K. W. Church, Using Sketches to Estimate Associations, HLT/EMNLP 2005

  28. C. König and E. Brill, Reducing the Human Overhead in Text Categorization, KDD 2006

  29. S. Sekine, On-Demand Information Extraction, ACL 2006

  30. D. Ravichandran, P. Pantel, and E. Hovy, Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering, ACL 2005

  31. A. Yates and O. Etzioni, Unsupervised Resolution of Objects and Relations on the Web, NAACL-HLT 2007

  32.  Jeffrey Dean, Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04: Sixth Symposium on Operating Systems Design and Implementation, 2004

  33.  M. Pasca, B. Van Durme, What You Seek is What You Get: Extraction of Class Attributes from Query Logs, IJCAI-07, 2007

  34. M. Pasca, D. Lin, J. Bigham, A. Lifchits, A. Jain: Names and Similarities on the Web: Fact Extraction in the Fast Lane. ACL 2006

  35. B. Rozenfeld, R. Feldman, "High-Performance Unsupervised Relation Extraction from Large Corpora," ICDM 2006.