Data Intensive Abstractions for High End Biometric Applications

PIs: Douglas Thain and Patrick Flynn. This work is supported by the National Science Foundation under grant CNS-06-21434.

Biometric research requires the execution of very large data intensive batch workloads. To evaluate new matching algorithms, researchers wish to compare thousands of images to each other by brute force. When workloads of this sort are submitted to conventional batch systems in the usual way, they induce massive amounts of network and I/O traffic, which results in very poor throughput. How can we execute such large workloads effectively?

To attack this problem, we are introducing data intensive abstractions that allow the user to easily provide the system with more information about the structure of a workload, so that it can partition the data and execute the workload effectively. An abstraction explicitly specifies the data to be processed, the code that will process it, and the relationship between the two. One example of an abstraction is All-Pairs:


All-Pairs( set S, function F ):
For all Si and Sj in set S, compute: F( Si, Sj )
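
As a rough illustration only, the definition above can be read as the following sequential Python sketch. The names all_pairs and hamming are hypothetical and are not part of the project's software, and the bit-string "templates" are made-up stand-ins for biometric images. A real system would distribute this work rather than run it in one loop.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def all_pairs(s: Sequence[T], f: Callable[[T, T], R]) -> List[List[R]]:
    """Return a matrix m where m[i][j] == f(s[i], s[j]) for every i and j."""
    return [[f(a, b) for b in s] for a in s]


def hamming(a: str, b: str) -> int:
    """Hypothetical comparison function: count positions where two strings differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    # Made-up data standing in for biometric templates.
    templates = ["0110", "0111", "1001"]
    for row in all_pairs(templates, hamming):
        print(row)
```

The point of the sketch is the interface: the caller names only the set and the function, which leaves the system free to decide where and in what order each F( Si, Sj ) is computed.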

A computing system with an All-Pairs interface can easily choose a more efficient implementation than one in which every job reads its input on demand through a demand-paged filesystem. The input data can be staged to the computation nodes by a spanning tree, and the partitioning of work units into jobs can be done according to the performance properties of the system. In this project, we are designing a variety of similar data intensive abstractions that allow for the easy and efficient execution of large scientific workloads.
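
One way to picture the partitioning of work units into jobs is to tile the result matrix into rectangular blocks, with each block becoming one batch job. The Python sketch below is a minimal illustration under that assumption and is not the project's actual scheduler. The names partition_pairs and run_block are hypothetical, the block size is a free parameter that a real system would pick from measured performance properties, and spanning-tree staging of the input data is not modeled.

```python
from typing import Callable, Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")

Block = Tuple[range, range]


def partition_pairs(n: int, block: int) -> Iterator[Block]:
    """Yield (rows, cols) index ranges that tile the n x n result matrix."""
    if block <= 0:
        raise ValueError("block size must be positive")
    for i in range(0, n, block):
        for j in range(0, n, block):
            yield range(i, min(i + block, n)), range(j, min(j + block, n))


def run_block(
    s: Sequence[T], f: Callable[[T, T], R], rows: range, cols: range
) -> List[Tuple[int, int, R]]:
    """Compute one job's share of the work: f(s[i], s[j]) for i in rows, j in cols."""
    return [(i, j, f(s[i], s[j])) for i in rows for j in cols]


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5]  # made-up stand-ins for images
    for rows, cols in partition_pairs(len(data), block=2):
        # In a real deployment each block would be submitted as a separate job.
        print(rows, cols, run_block(data, lambda a, b: abs(a - b), rows, cols))
```

Larger blocks mean fewer jobs and less dispatch overhead, while smaller blocks expose more parallelism; which trade-off is best depends on the system, which is why the abstraction leaves the choice to the implementation.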

Publications

  • DataLab: Transactional Data Parallel Computing on an Active Storage Cloud, Brandon Rich and Douglas Thain, IEEE/ACM High Performance Distributed Computing, June 2008.
  • Towards a Data Analysis Grid for Biometric Research, Douglas Thain and Patrick Flynn, Technical Report 2008-05, Department of Computer Science and Engineering, University of Notre Dame, April 2008.
  • All-Pairs: An Abstraction for Data-Intensive Cloud Computing, Christopher Moretti, Jared Bulosan, Douglas Thain, and Patrick Flynn, International Parallel and Distributed Processing Symposium, April 2008.
  • DataLab: Transactional Data-Parallel Computing on an Active Storage Cloud, Brandon Rich and Douglas Thain, Technical Report 2008-02, Department of Computer Science and Engineering, University of Notre Dame, March 2008.
  • Poster: All-Pairs: An Abstraction for Data Intensive Grid Computing, Christopher Moretti, Jared Bulosan, and Douglas Thain, IEEE Grid Computing 2007.
  • Talk: Data Intensive Abstractions for High End Biometric Applications, High End Computing FSIO Working Group, August 2007.
  • Paper: Challenges in Executing Data Intensive Biometric Workloads on a Desktop Grid, Christopher Moretti, Timothy Faltemier, Douglas Thain, and Patrick J. Flynn, Workshop on Large-Scale and Volatile Desktop Grids (PCGRID) at IPDPS, Long Beach, California, March 2007.
