Acknowledgements

Our work is supported under various grants from these funding agencies:



Current Projects

Learning from unbalanced data sets and cost-sensitive learning
Evaluating Classifiers under Dynamic Testing Conditions
Community Detection in Networks
Predicting Individual Disease Risk Based on Medical History
Graph Based Model of Customer Purchase Behavior
TeamTrak: A Testbed for Cooperative Mobile Computing

Collaboration

Computer Vision Research Laboratory (CVRL)
Distributed and Real-Time Systems Laboratory (DARTS)
Cooperative Computing Lab (CCL)
Center for Complex Networks Research (CCNR)

Evaluating Classifiers under Dynamic Testing Conditions

about A primary assumption in most mining and learning applications is the Stationary Distribution Assumption, that training and testing conditions will remain similar. However, this is often not the case. To illustrate, a disease may occur naturally in 15% of a North American population. However, an epidemic condition may drastically increase the rate of infection to 45%, producing a difference in P(disease) between the training and testing datasets. Thus, the class distribution between negative and positive classes changes significantly. Scalar evaluations of a classifier learned on the original population will not offer a reasonable expectation for performance during the epidemic. A separate but related problem occurs when a model trained on a segment of the North American population is applied to a European population, where the distribution of measured features can differ significantly even if the disease base rate remains at the original 15%. The goal of this research is to understand how situations like these affect classifier performance, to determine how they may be detected, and to develop a best practice for managing such a dynamic environment.
people David Cieslak
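
The effect of the base-rate shift described above can be sketched numerically. The following is a minimal illustration, not the project's method: it assumes that only P(disease) changes between training and testing, so the classifier's true positive rate and false positive rate stay fixed. The operating point (TPR = 0.80, FPR = 0.10) and the function names are hypothetical.

```python
def precision(tpr, fpr, prevalence):
    """Positive predictive value of a classifier with fixed TPR and FPR."""
    true_pos = tpr * prevalence            # P(predicted positive, actually positive)
    false_pos = fpr * (1.0 - prevalence)   # P(predicted positive, actually negative)
    return true_pos / (true_pos + false_pos)


def accuracy(tpr, fpr, prevalence):
    """Overall accuracy of a classifier with fixed TPR and FPR."""
    return tpr * prevalence + (1.0 - fpr) * (1.0 - prevalence)


# Hypothetical operating point; these values are assumptions for illustration.
TPR, FPR = 0.80, 0.10

for label, p in [("training, 15% base rate", 0.15), ("epidemic, 45% base rate", 0.45)]:
    print(f"{label}: precision={precision(TPR, FPR, p):.3f}, "
          f"accuracy={accuracy(TPR, FPR, p):.3f}")
```

Under these assumptions the same classifier reports noticeably different scalar scores in the two settings (precision rises while accuracy falls), which is why a single number measured on the original population is a poor guide to performance during the epidemic.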

Community Detection in Networks

about Recently, rich datasets that can be represented as interaction networks have received increasing attention in a variety of domains. Application areas include social networks, online communities, protein-protein interactions, transactions of goods and services, and many others. While such networks provide an intuitive representation of complex data, mining this information can be quite difficult. Our primary focus is on community detection, an important task in network analysis. Most current approaches are computationally expensive, limiting their scalability to networks of several thousand nodes, yet datasets containing millions of nodes are becoming readily available. Aside from the network connectivity, many datasets also contain additional attributes of individual nodes, but existing methods do not incorporate this information. To address these challenges, we propose a three-stage community detection framework called UNICODE (Using Node Attributes to Improve COmmunity DEtection). The assignment of suitable edge weights is a critical component of community detection, and we show that using node attributes for weighting can significantly improve the community structure. In addition, we demonstrate scalability to networks of over one million nodes with an execution time of approximately 40 seconds. This work takes a step towards meeting the increasing demand for effective and highly scalable algorithms for network analysis.
people Karsten Steinhaeuser
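
The idea of weighting edges by node attributes can be illustrated with a small sketch. The description above does not specify how UNICODE computes its weights, so the Jaccard similarity used here is an assumption chosen for illustration only, and the function names and sample data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two attribute collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weight_edges(edges, attributes):
    """Assign each edge a weight based on how similar its endpoints' attributes are.

    This is an assumed weighting scheme, not the one used by UNICODE.
    """
    return {
        (u, v): jaccard(attributes.get(u, ()), attributes.get(v, ()))
        for u, v in edges
    }


# Hypothetical toy network: nodes with sets of categorical attributes.
edges = [("a", "b"), ("b", "c")]
attributes = {"a": {"x", "y"}, "b": {"y", "z"}, "c": {"q"}}

weights = weight_edges(edges, attributes)
for edge, w in weights.items():
    print(edge, round(w, 3))
```

Edges between nodes with similar attributes receive larger weights, so a weighted community detection algorithm run afterwards would tend to keep those nodes together.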

Predicting Individual Disease Risk Based on Medical History

about Medical care and research are literally the most vital part of science for humans, as none of us are immune to physical ailments and biological deterioration. Annual health care expenditure in the U.S. alone is an overwhelming sum, with a strong majority of this money used for chronic disease treatment. However, research has shown that thousands of conditions have recognizable indicators before onset or preventable risk factors. From these discoveries comes the idea of prospective medical practices, aimed at determining and minimizing individual risk and recognizing conditions at the earliest indication. In theory, these practices reduce the number of conditions needing treatment and improve the effectiveness of the interventions that are necessary. This research seeks to aid the development of an efficient, inexpensive, and noninvasive system that provides a "first line of defense" by computing probable risks and guiding the selection of further tests or treatments.
people Darcy Davis

Graph Based Model of Customer Purchase Behavior

about Many companies analyze their sales data in order to understand the buying patterns and motives of their customers. One traditional approach is to use association rules: looking at items that are purchased together and making statements such as "people who buy beer also buy diapers 5 percent of the time." Amazon uses collaborative filtering to predict what else a user will like based on purchases and ratings from similar users. We take a different approach to the same problem. Instead of looking at pairs of products that show a high degree of association, we construct a graph out of a store's transaction history. Vertices represent individual items, and edges represent connections, i.e., that two items have been purchased together. The weight of each edge is simply the number of transactions in which the two items appear together. This framework allows us to explore many different properties of the data, including interrelationships between products, the influence of certain products on others, and changes in the popularity of products over time.
people Troy Raeder
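
The graph construction described above can be sketched in a few lines. This is a minimal illustration assuming each transaction is given as a list of item names; the function name and the sample transactions are hypothetical, and the graph is stored simply as a mapping from item pairs to weights.

```python
from collections import Counter
from itertools import combinations


def build_copurchase_graph(transactions):
    """Return edge weights keyed by (item_a, item_b) with item_a < item_b.

    The weight of an edge is the number of transactions in which both
    items appear together.
    """
    weights = Counter()
    for basket in transactions:
        # De-duplicate so an item bought twice in one transaction counts once,
        # and sort so each undirected edge has a single canonical key.
        items = sorted(set(basket))
        for a, b in combinations(items, 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical transaction history, for illustration only.
transactions = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["chips", "salsa"],
    ["beer", "chips"],
]

graph = build_copurchase_graph(transactions)
for (a, b), w in sorted(graph.items()):
    print(f"{a} -- {b}: {w}")
```

Item pairs that never appear in the same transaction have no edge, so looking them up in the counter returns a weight of 0.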


Past Projects

Scalable Learning with Thread-Level Parallelism

Scalable Learning with Thread-Level Parallelism

about High-dimensional datasets containing millions of records and/or thousands of attributes are quite common today. Despite steady progress in processor and memory technologies, extreme dataset sizes and algorithmic complexity place unprecedented demands on computing infrastructures. A gap has developed between the available real-world datasets and our ability to mine them for patterns of interest. Datasets are quickly approaching terabytes and petabytes, and this rate of increase challenges even the subsampling paradigm, as samples run into gigabytes. Massively parallel architectures offer a possible solution to this problem. These systems not only contain multiple processors, but each processor is itself capable of multi-threading. We propose to leverage this fine-grained parallelism to improve the scalability of data mining applications. We implement an array of learning algorithms on the Cray MTA-2 -- a high-performance parallel architecture -- and perform an evaluation and comparison to conventional sequential architectures. Our experimental results show that employing thread-level parallelism provides more graceful scalability with increasing data size. We also demonstrate the ability to construct a single decision tree from a dataset of 50 million records. We believe that the MTA-2 and/or similar parallel architectures will play an important role in the advance of high-performance data mining in commercial applications and the scientific domain.
people Karsten Steinhaeuser