Acknowledgements
Our work is supported under various grants from these funding agencies:
Current Projects
Collaboration
|
Evaluating Classifiers under Dynamic Testing Conditions
| about |
A primary assumption in most mining and learning applications is the
Stationary Distribution Assumption: that training and testing
conditions will remain similar. However, this is often not the case. To
illustrate, a disease may occur naturally in 15% of a North American
population. An epidemic, however, may drastically increase the
rate of infection to 45%, so that P(disease) differs
between the training and testing datasets. Thus, the class distribution
between negative and positive classes changes significantly. Scalar
evaluations of a classifier learned on the original population will not
offer a reasonable expectation for performance during the epidemic. A
separate but related problem occurs when a model trained on a
segment of the North American population is applied to a European
population in which the distribution of measured features may
differ significantly, even if the disease base rate remains at the
original 15%. The goal of this research is to understand how situations
like these affect classifier performance, to determine how they may be
detected, and to develop best practices for managing such a dynamic
environment.
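
The effect of the base-rate change described above can be worked through numerically. The sketch below is only an illustration: it assumes a hypothetical classifier whose true positive rate (0.80) and false positive rate (0.10) stay fixed while P(disease) moves from 15% to 45%. Those two rates are invented for the example and are not results from this project.

```python
def expected_metrics(prevalence, tpr, fpr):
    """Expected accuracy and precision for a classifier with fixed TPR/FPR
    applied to a population with the given positive-class prevalence."""
    tp = prevalence * tpr                    # fraction of population: true positives
    fp = (1.0 - prevalence) * fpr            # fraction of population: false positives
    tn = (1.0 - prevalence) * (1.0 - fpr)    # fraction of population: true negatives
    accuracy = tp + tn
    precision = tp / (tp + fp)
    return accuracy, precision


# Hypothetical classifier characteristics (assumed for illustration only).
TPR, FPR = 0.80, 0.10

baseline = expected_metrics(0.15, TPR, FPR)  # original population
epidemic = expected_metrics(0.45, TPR, FPR)  # epidemic conditions

print("baseline accuracy=%.3f precision=%.3f" % baseline)
print("epidemic accuracy=%.3f precision=%.3f" % epidemic)
```

Under these assumptions, accuracy falls from 0.885 to 0.855 while precision rises from about 0.585 to about 0.867, so a single scalar measured on the original population does not carry over to the epidemic population.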
|
| people |
David Cieslak |
| |
Community Detection in Networks
| about |
Recently, rich datasets that can be represented as interaction networks have received
increasing attention in a variety of domains. Application areas include social networks,
online communities, protein-protein interactions, transactions of goods and services, and many
others. While such networks provide an intuitive representation of complex data, mining
this information can be quite difficult. Our primary focus is on community detection,
an important task in network analysis. Most current approaches are computationally expensive,
limiting their scalability to networks of several thousand nodes, yet datasets containing
millions of nodes are becoming readily available. Aside from the network connectivity,
many datasets also contain additional attributes of individual nodes, but existing methods
do not incorporate this information. To address these challenges, we propose a three-stage
community detection framework called UNICODE (Using Node Attributes to
Improve COmmunity DEtection). The assignment of suitable edge weights
is a critical component of community detection, and we show that using node attributes for
weighting can significantly improve the community structure. In addition, we demonstrate
scalability to networks of over one million nodes with an execution time of approximately
40 seconds. This work takes a step towards meeting the increasing demand for effective and
highly scalable algorithms for network analysis.
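
The idea of weighting edges by node attributes can be sketched as follows. This is not the UNICODE weighting scheme, which is not specified here. As a stand-in, the sketch assumes each node carries a set of categorical attributes and uses Jaccard similarity between the endpoint attribute sets as the edge weight. The function names and the toy data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two attribute collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weight_edges(edges, attributes):
    """Assign each edge a weight from the similarity of its endpoints' attributes.

    Assumption: attribute similarity is a reasonable proxy for edge strength.
    """
    return {
        (u, v): jaccard(attributes.get(u, ()), attributes.get(v, ()))
        for u, v in edges
    }


# Toy network: three edges, four nodes with hypothetical interest attributes.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
attributes = {
    "a": {"chess", "music"},
    "b": {"chess", "music", "film"},
    "c": {"film"},
    "d": {"sport"},
}

weights = weight_edges(edges, attributes)
for edge, w in sorted(weights.items()):
    print(edge, round(w, 3))
```

A weighted community detection method could then treat heavily weighted edges, such as the one between "a" and "b", as stronger evidence that the endpoints belong together than edges between nodes with no shared attributes.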
|
| people |
Karsten Steinhaeuser |
| |
Predicting Individual Disease Risk Based on Medical History
| about |
Medical care and research are among the most vital areas of science for humans, as none
of us is immune to physical ailments and biological deterioration. Annual health care
expenditure in the U.S. alone is an overwhelming sum, and a strong majority of this money
is used for chronic disease treatment. However, research has shown that thousands of conditions
have recognizable indicators before onset or preventable risk factors. From these
discoveries comes the idea of prospective medical practices, aimed at determining and
minimizing individual risk and at recognizing conditions at the earliest indication.
In theory, these practices reduce the number of conditions needing treatment and improve the
effectiveness of interventions that are necessary. This research seeks to aid the
development of an efficient, inexpensive, and noninvasive system that provides a "first line
of defense" by computing probable risks and guiding the selection of further tests or treatments.
|
| people |
Darcy Davis |
| |
Graph Based Model of Customer Purchase Behavior
| about |
Many companies analyze their sales data in order to understand the
buying patterns and motives of their customers. One traditional
approach is to use association rules: looking at items that are
purchased together and making statements such as "people who buy
beer also buy diapers 5 percent of the time." Amazon uses
collaborative filtering to predict what else a user will like
based on purchases and ratings from similar users. We take
a different approach to the same problem. Instead of looking at
pairs of products that show a high degree of association, we
construct a graph out of a store's transaction history. Vertices
represent individual items, and edges represent connections,
i.e., that two items have been purchased together. The weight of
each edge is simply the number of transactions in which the two
items appear together. This framework allows us to explore many
different properties of the data, including interrelationships
between products, the influence of certain products on others,
and changes in the popularity of products over time.
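
The graph construction described above can be sketched in a few lines. This is a minimal illustration, assuming each transaction is simply a list of item names. The function name and the sample baskets are hypothetical and do not come from the store data used in this project.

```python
from collections import Counter
from itertools import combinations


def build_copurchase_graph(transactions):
    """Return a mapping {(item_u, item_v): weight} for an undirected graph.

    Vertices are items; the weight of an edge is the number of transactions
    in which both items appear together.
    """
    weights = Counter()
    for basket in transactions:
        # Deduplicate and sort so each unordered pair has one canonical key.
        items = sorted(set(basket))
        for u, v in combinations(items, 2):
            weights[(u, v)] += 1
    return weights


# Hypothetical transaction history.
transactions = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["chips", "salsa"],
    ["beer", "chips"],
]

graph = build_copurchase_graph(transactions)
for (u, v), w in sorted(graph.items()):
    print(u, v, w)
```

Storing the edges as a dictionary keyed by sorted item pairs keeps the sketch dependency-free. The same weights could be loaded into a graph library to study the relationships between products or to compare graphs built from different time periods.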
|
| people |
Troy Raeder |
|

Past Projects
|
Scalable Learning with Thread-Level Parallelism
| about |
High-dimensional datasets containing millions of records and/or thousands of
attributes are quite common today. Despite steady progress in processor and memory
technologies, extreme dataset sizes and algorithmic complexity place unprecedented
demands on computing infrastructures. A gap has developed between the available
real-world datasets and our ability to mine them for patterns of interest.
Datasets are quickly approaching terabytes and petabytes, and this rate of increase
challenges even the subsampling paradigm, as samples run into gigabytes.
Massively parallel architectures offer a possible solution to this problem.
These systems not only contain multiple processors, but each processor is
itself capable of multi-threading. We propose to leverage this fine-grained
parallelism to improve the scalability of data mining applications. We
implement an array of learning algorithms on the Cray MTA-2 -- a high-performance
parallel architecture -- and perform an evaluation and comparison to conventional
sequential architectures. Our experimental results show that employing thread-level
parallelism provides more graceful scalability with increasing data size.
We also demonstrate the ability to construct a single decision tree from a
dataset of 50 million records. We believe that the MTA-2 and/or similar parallel
architectures will play an important role in advancing high-performance
data mining in commercial applications and the scientific domain.
|
| people |
Karsten Steinhaeuser |
|
| |