Dr. Jeff Vitter
Texas A&M University
Abstract
The world is drowning in data. A key challenge is how to make use of this data in a meaningful way. This talk addresses compressed data structures, a new paradigm for fast search with small space usage. We address in particular how to search huge document collections and, for a given query pattern, to find the most relevant documents that contain that pattern. "Most relevant" may mean the documents that contain the greatest number of instances of the pattern or the documents with the highest Page Rank (such as used by Google). Inverted indexes do not handle general pattern search. Suffix trees and suffix arrays support general pattern search, but they are too expensive in terms of space usage, and in addition, they require the finding of every occurrence of the pattern, which can be very expensive when the number of pattern occurrences is much larger than the number of documents. We improve upon the results of Muthukrishnan with a linear-space data structure that yields optimum time performance. We also develop a more succinct search structure whose size is proportional to the size of the documents when compressed using a high-order entropy method.
Joint work with Wing-Kai Hon and Rahul Shah.
Bio
Jeff Vitter is professor of computer science and engineering at Texas A&M University. From 2008 to 2009, he served as is provost and executive vice president for academics at Texas A&M. Previously he served six years as the Frederick L. Hovde Dean of the College of Science and Professor of Computer Science at Purdue University. After earning a bachelor's degree in mathematics with highest honors from the University of Notre Dame in 1977 and a PhD in computer science from Stanford University in 1980, he began his academic career as assistant professor of computer science at Brown University in 1980, where he progressed through the faculty ranks and served in various leadership roles. From 1993 to 2002, Dr. Vitter held a distinguished professorship at Duke University, where he was the Gilbert, Louis, and Edward Lehrman Professor of Computer Science. He served as chair of the Department of Computer Science at Duke from 1993-2001 and as co-director and a founding member of Duke's Center for Geometric and Biological Computing from 1997-2002. While serving on the faculty at Duke, he earned an M.B.A. from the Fuqua School of Business at Duke in 2002.
Dr. Vitter has been named a Guggenheim Foundation Fellow, a Fellow of the Association for Computing Machinery, a Fellow of the Institute of Electrical and Electronics Engineers, a National Science Foundation Presidential Young Investigator, a Fulbright Scholar, and an IBM Faculty Development Awardee. He has over 280 book, journal, and conference publications; his Google Scholar h-index is 54. He authored the book Algorithms and Data Structures for External Memory (now Publishers, 2008), which covers the I/O field he helped found. He has also coauthored the books Efficient Algorithms for MPEG Video Compression (Wiley & Sons, 2002) and Design and Analysis of Coalesced Hashing (Oxford University Press, 1987). He is coeditor of the collections External Memory Algorithms and Algorithm Engineering and co-holder of patents in the areas of external sorting, parallel I/O, prediction, and approximate data structures. Editorial board memberships have included Algorithmica, Communications of the ACM, IEEE Transactions on Computers, Theory of Computing Systems (formerly Mathematical Systems Theory: An International Journal on Mathematical Computing Theory), and SIAM Journal on Computing; in addition, he has edited several special issues.