CS700: Graduate Seminar in Computer Science & Informatics

High Performance Spatial Query Processing for Large Scale Scientific Data
Abulimiti Aji, Department of Mathematics and Computer Science

Support for high-performance queries on large volumes of scientific spatial data is becoming increasingly important in scientific research. This need is driven not only by geospatial problems in numerous fields, including ecology, geology, and epidemiology, but also by emerging scientific applications that are increasingly data-intensive and compute-intensive. With the rapid improvement of data acquisition technologies such as high-resolution tissue slide scanners and remote sensing instruments, it has become more efficient to capture extremely large spatial scientific datasets to support research. Systematic analysis of large-scale whole-slide tissue images, for example, generates tremendous amounts of spatially derived quantifications of microanatomic objects in biomedical imaging. Exploring analysis results involves complex queries such as spatial cross-matching, overlay of multiple sets of spatial objects, spatial proximity computations between objects, and queries for global spatial pattern discovery. These queries often involve millions to billions of spatial objects and heavy geometric computations. Our goal is to develop a novel framework that rethinks and restructures end-to-end support (from storage and indexing to query execution to query composition) for exploration of spatial datasets at big data scales on emerging platforms such as MapReduce. In this talk, I will present Hadoop-GIS, a highly scalable, high-performance query system for running large-scale spatial queries on MapReduce. Hadoop-GIS provides a spatial query engine based on on-demand indexing, data- and space-based partitioning, and query pipelines that parallelize queries implicitly on MapReduce. To support expressive queries in a declarative spatial query language, we have extended Hive's query language (QL) with spatial query constructs and integrated the spatial query framework into Hive. Our experiments demonstrate that the proposed system is a scalable, cost-effective solution that achieves high-performance spatial query execution.