Data mining applications over text require efficient methods for extracting and structuring the information embedded in millions, or billions, of text documents. This presentation reviews current research on enabling information extraction to operate at Web scale. The dimensions of scalability include corpus size, heterogeneity of the information sources, access to the documents, and the diversity of the extraction domains; this presentation focuses on the first three. First, I will briefly review common information extraction tasks such as entity, relation, and event extraction, indicating the main scalability bottlenecks associated with each task. I will then review the key algorithmic approaches to improving the efficiency of information extraction, which include ideas adapted from text mining and information retrieval, as well as recently developed specialized indexing techniques.
Recording of the presentation (22 March 2007): available soon.
Answers to questions asked live: to be posted.
E-mail me any additional questions.
E. Agichtein, Scaling Information Extraction to Large Document Collections, IEEE Data Engineering Bulletin, 2005
E. Agichtein and L. Gravano. Querying text databases for efficient information extraction. ICDE 2003
E. Agichtein, P. Ipeirotis, and L. Gravano, Modeling Query-Based Access to Text Databases, WebDB 2003
M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni, Open Information Extraction from the Web, IJCAI 2007
S. Cucerzan and E. Agichtein, Factoid question answering over unstructured and structured content on the web, TREC 2005
M. J. Cafarella, D.
S. Bird, Y. Chen, S. B. Davidson, H. Lee, and Y. Zheng, Designing and Evaluating an XPath Dialect for Linguistic Queries, ICDE 2006
J. Cho, S. Rajagopalan, "A Fast Regular Expression Indexing Engine," ICDE 2002.
M. J. Cafarella and O. Etzioni. A search engine for natural language applications. WWW 2005
A. Chandel, P.C. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. ICDE 2006
K. Chakrabarti, V. Ganti, Jiawei Han, D. Xin, Ranking Objects Based on Relationships, SIGMOD 2006
S. Chakrabarti, Kriti Puniyani and Sujatha Das, Optimizing Scoring Functions and Indexes for Proximity Search in Type-annotated Corpora. WWW 2006
M. J. Cafarella, C. Re, D. Suciu, O. Etzioni, and M. Banko, Structured Querying of Web Text: A Technical Challenge, CIDR 2007
S. Chaudhuri, R. Ramakrishnan, and G. Weikum. Integrating db and ir technologies: What is the sound of one hand clapping?, CIDR 2005.
S. Dumais, M. Banko, E. Brill, J. Lin, and A. Ng. Web question answering: Is more always better? SIGIR 2002
R. Grishman, S. Huttunen, and R. Yangarber. Information extraction for enhanced access to disease outbreak reports. Journal of Biomedical Informatics, 2002.
D. Gruhl, L. Chavet, D. Gibson, J. Meyer, P. Pattanayak, A. Tomkins, and J. Zien. How to build a WebFountain: An architecture for very large-scale text analytics. IBM Systems Journal, 2004.
Silja Huttunen, Roman Yangarber and Ralph Grishman, Diversity of Scenarios in Information Extraction, LREC 2002
P. Ipeirotis, E. Agichtein, P. Jain, and L. Gravano, To Search or to Crawl: Towards a Query Optimizer for Text-Centric Tasks, SIGMOD 2006
A. Klementiev and D. Roth, Weakly Supervised Named Entity Transliteration and Discovery from Multilingual Comparable Corpora, ACL 2006
A. Klementiev and D. Roth, Named Entity Transliteration and Discovery from Multilingual Comparable Corpora. NAACL HLT 2006
R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, H. Zhu, Avatar: A Database Approach to Semantic Search, SIGMOD 2006
P. Pantel, D. Ravichandran, and E. Hovy. Towards terascale knowledge acquisition. In Conference on Computational Linguistics (COLING), 2004.
P. Resnik and A. Elkiss. The linguist’s search engine: An overview (demonstration). In ACL, 2005.
M. Pasca, D. Lin, J. Bigham, A. Lifchits, and A. Jain, Organizing the World Wide Web of Facts - Step One: the One-Million Fact Extraction Challenge, AAAI 2006
P.D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In European Conference on Machine Learning (ECML), 2001.
Ping Li, Kenneth Ward Church: Using Sketches to Estimate Associations. HLT/EMNLP 2005
C. König and E. Brill, Reducing the Human Overhead in Text Categorization, KDD 2006
S. Sekine, On-Demand Information Extraction, ACL 2006
D. Ravichandran, P. Pantel, and E. Hovy, Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering, ACL 2005
A. Yates and O. Etzioni, Unsupervised Resolution of Objects and Relations on the Web, NAACL-HLT 2007
Jeffrey Dean, Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04: Sixth Symposium on Operating System Design and Implementation, 2004
M. Pasca, B. Van Durme, What You Seek is What You Get: Extraction of Class Attributes from Query Logs, IJCAI-07, 2007
M. Pasca, D. Lin, J. Bigham, A. Lifchits, and A. Jain, Names and Similarities on the Web: Fact Extraction in the Fast Lane, ACL 2006
B. Rozenfeld, R. Feldman, "High-Performance Unsupervised Relation Extraction from Large Corpora," ICDM 2006.