Nitesh Chawla, Ph.D., is the Frank Freimann Collegiate Associate Professor in the Department of Computer Science and Engineering at the University of Notre Dame, Director of the Data Inference Analytics and Learning Lab (DIAL), and Director of the Interdisciplinary Center for Network Science and Applications (iCeNSA). His research focuses on Big Data, data science, and network science.

For meetings and appointments, please contact:
Ms. Jasmine Botello
abotello [at] nd dot edu
574 631 7095

What's new
  • Invited to speak at the Joint Statistical Meeting of the American Statistical Association on Big Data for Healthcare in 2014
  • Invited to speak at BioIT World on Cloud Computing and Healthcare in 2014
  • Invited as a Guest Speaker on Twitter #CXO Chat hosted by IBM. Topic: Personalizing Healthcare Byte by Byte.
  • Invited to speak on Harnessing Audience Analytics @ WAN-IFRA SFN Forum in Berlin, October 2013
  • Big Data Republic coverage of CARE: Personalizing Healthcare With Big Data
  • Check out my op-ed on Why Newspapers Need Big Data Now
  • Chawla Receives the IBM Big Data and Analytics Faculty Award.
  • Electronic health tool aims to improve personalized care
  • "Red Black Network: Temporal and Topological Analysis of Two Intertwined Social Networks" accepted in MILCOM 2013
  • EHRs and big data analytics aid personalized patient care
  • New Tool Uses EHR Data To Deliver Personalized Health Care
  • Researchers Link Data Science, Diagnoses
  • Patent Issued for Disease Diagnoses-based Disease Prediction
  • Searching for a Data Scientist. See the link here.
  • New Grant from the United Way Foundation for Creating Large-Scale Social Change in Childhood Obesity & Academic Performance
  • New Grant from Indiana CTSI to Study the Diabetic Population -- a collaboration with MHIN and Bendix Family Physicians
  • Predicting Duration of a Phone Call paper accepted in ECML/PKDD'13
  • Classifier Evaluation With Missing Negative Class Labels accepted in IDA'13
  • An Ensemble Topic Model for Sharing Healthcare Data and Predicting Disease Risk accepted in ACM BCB'13
  • Bringing Big Data to Personalized Healthcare accepted in Journal of General Internal Medicine
  • Link Prediction in Human Mobility Networks accepted in ACM/IEEE ASONAM'13
  • Predicting Prominence in Networks accepted in IEEE NSW'13
  • Fellow of the Institute of Asia and Asian Studies
  • Michiana 40 Under 40 Honor
  • NSF Grant Funded on Network Analysis of Trader Behaviors and Institutions
  • GAIN Index moves to Notre Dame
  • Fellow of the Reilly Center for Science, Technology, and Values
  • IBM Watson Faculty Award
  • Recognized as the Frank Freimann Collegiate Chair of Engineering
  • Paper titled, “Exploring and Exploiting Disease Interactions from Multi-relational Gene and Phenotype Networks,” selected for the 2012 IMIA Yearbook of Medical Informatics as one of the best medical informatics papers published in 2011.
Upcoming invited talks and keynotes
November 12, 2013
Invited Panelist for Global Action Platform Global South Summit

October 28, 2013
Invited Speaker on Twitter #CXO Chat Hosted by IBM Big Data and Analytics. Topic: Personalizing Healthcare Byte by Byte.

October 23, 2013
Panelist for "Big Data: Real Challenges and Current Status" for IDEAL'13, China

October 8, 2013
Invited Speaker at WAN-IFRA SFN Forum on Harnessing Audience Analytics, Berlin

October 6, 2013
Co-evolving Networks @ INFORMS Data Mining, Minneapolis

June 25, 2013
Big Data-Driven Healthcare @ INFORMS Healthcare, Chicago

June 4, 2013
Social Dynamics @ NetSci on Link Prediction

June 3, 2013
Networks Over Time @ NetSci on Co-evolving Networks

May 2013
Commencement Speaker in the CSE Graduation Ceremony @ Notre Dame

February 2013
Michigan State University on Prospective and Personalized Healthcare

Biography

Curriculum Vitae (CV)

Nitesh Chawla, Ph.D., is the Frank Freimann Collegiate Associate Professor in the Department of Computer Science and Engineering, Director of the Notre Dame Interdisciplinary Center for Network Science and Applications (iCeNSA), and Director of the Data Inference Analytics and Learning Lab (DIAL). He is also the Lead of the ND-GAIN Index.

Prof. Chawla started his tenure-track position at Notre Dame in 2007, was promoted and tenured in 2011, and recognized with the Frank Freimann Collegiate Chair of Engineering position in 2012. His research is focused on Big Data, network science, and data science. He is at the frontier of interdisciplinary applications with innovative work in healthcare analytics, medicine, social networks, and climate/environmental sciences.

Prof. Chawla is the recipient of multiple awards for research and teaching innovation, including outstanding teacher awards (2007 and 2010), the National Academy of Engineering New Faculty Fellowship, and a number of best paper awards and nominations. He serves as PI/Co-PI on over $11 million in external research funding (since 2007). He was recognized with the Frank Freimann Collegiate Chair of Engineering in 2012. He received the IBM Watson Faculty Award in 2012 and the IBM Big Data Award in 2013. He was the chair of the IEEE CIS Data Mining Technical Committee (2010-2012) and is currently the Chair of the IEEE CIS Task Force on Big Data. He serves on the editorial boards of a number of journals and on the organization/program committees of top-tier conferences. Prof. Chawla is a Fellow of the Reilly Center for Science, Technology, and Values; a Fellow of the Institute of Asia and Asian Studies; and Program Lead for Transportation Networks in the Environmental Change Initiative. He is also an Advisor in the ESTEEM Program. Prof. Chawla is the Founder of Aunalytics, Inc., a start-up focused on Big Data analytics. He also serves on the Board of Managers of Michiana Health and Information Networks.

2013
Saurav Pandit, Jonathan Koch, Yang Yang, Brian Uzzi and Nitesh V. Chawla (November 2013)
"Red Black Network: Temporal and Topological Analysis of Two Intertwined Social Networks"
32nd Annual Military Communications Conference, 2013 (MILCOM'13)
PDF
Andrew Rider, Nitesh V. Chawla, and Scott Emrich (2013)
"A survey of current integrative network algorithms for systems biology"
Systems Biology: Integrative Biology and Simulation Tools, 2013
PDF
Andrew Rider, Reid Johnson, Darcy Davis, T. Ryan Hoens and Nitesh V. Chawla (July 2013)
"Classifier Evaluation With Missing Negative Class Labels
Twelth International Symposium on Intelligent Data Analysis, 2013 (IDA'13)
PDF
Andrew Rider and Nitesh V. Chawla (July 2013).
An Ensemble Topic Model for Sharing Healthcare Data and Predicting Disease Risk
ACM International Conference on Bioinformatics, Computational Biology, and Biomedical Informatics, 2013 (ACM BCB'13)
PDF
Yang Yang, Nitesh V. Chawla, Prithwish Basu, Bhaskar Prabhala, Thomas La Porta
Link Prediction in Human Mobility Networks
ACM/IEEE ASONAM'13
PDF
Yuxiao Dong, Jie Tang, Tiancheng Lou, Bin Wu, Nitesh V. Chawla (Sep 2013).
How Long will She Call Me? Distribution, Social Theory and Duration Prediction
ECML/PKDD'13, Prague, Sep. 2013.
PDF
Nitesh V. Chawla and Darcy Davis (July 2013).
Bringing Big Data to Personalized Healthcare: A Patient-Centered Framework.
Journal of General Internal Medicine (JGIM)
Nathan Regola, David Cieslak, Nitesh V. Chawla (2013).
The Need to Consider Hardware Selection when Designing Big Data Applications Supported by Metadata.
Big Data Management, Technologies, and Applications
PDF
Saurabh Nagrecha, Pawan J. Lingras, and Nitesh V. Chawla
Comparison of Gene Co-expression Networks and Bayesian Networks. Inferring genetic networks is of great importance in unlocking gene behaviour, which in turn provides solutions for drug testing, disease resistance, and many other applications. Dynamic network models provide room for handling noisy or missing prelearned data. This paper discusses how Dynamic Bayesian Networks compare against coexpression networks as discussed by Zhang and Horvath. These shall be tested out on the genes of yeast Saccharomyces cerevisiae. A method is then proposed to get the best out of the strengths of both models, namely, the causality inference from Bayesian networks and the scoring method from a modified version of Zhang and Horvath's method.
5th Asian Conference on Intelligent Information and Database Systems (ACIIDS), Kuala Lumpur, Malaysia, March 18-20, 2013.
PDF
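The Zhang and Horvath style of co-expression network referenced in the entry above is built by raising absolute pairwise correlations to a soft-threshold power. A minimal sketch of that construction follows; the toy expression matrix, the power beta=6, and the 0.2 cutoff are illustrative assumptions, not values from the paper.

    import numpy as np

    def coexpression_adjacency(expr, beta=6):
        """Weighted co-expression adjacency in the style of Zhang & Horvath.

        expr: (n_samples, n_genes) expression matrix.
        Returns an (n_genes, n_genes) matrix of |Pearson correlation| ** beta.
        """
        corr = np.corrcoef(expr, rowvar=False)   # gene-by-gene Pearson correlation
        np.fill_diagonal(corr, 0.0)              # ignore self-correlation
        return np.abs(corr) ** beta              # soft thresholding

    # Toy data: 20 co-regulated genes sharing a module signal, 180 unrelated genes.
    rng = np.random.default_rng(0)
    module = rng.normal(size=(40, 1))
    expr = np.hstack([module + 0.5 * rng.normal(size=(40, 20)),
                      rng.normal(size=(40, 180))])
    adj = coexpression_adjacency(expr, beta=6)
    print("strong pairs:", int((adj > 0.2).sum() // 2))
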
Rachael Purta, Saurabh Nagrecha, and Gregory Madey
Multi-hop Communications in a Swarm of UAVs. Unmanned Aerial Vehicles (UAVs) are of increasing interest to researchers because of their diverse applications, such as military operations and search and rescue. The problem we have chosen to focus on is using a swarm of small, inexpensive UAVs to discover static targets in a search space. Though many different swarm models have been used for similar problems, our proposed model, the Icosystem Swarm Game, to our knowledge has not been evaluated for this particular problem of target search.

Further, we propose to simulate the performance of this model in a semi-realistic communications environment. The challenge here is to find the optimal multi-hop configuration for the UAVs, so that they can find the most targets, avoid collision with each other as much as possible, and still communicate efficiently. We implement this through a weighted shortest-path problem using Dijkstra's algorithm, with the weights being the transmission cost over distance. Testing has shown that our multi-hop communications perform, in terms of target-finding and collision avoidance with other UAVs, as well as an idealized communications environment.

Agent-Directed Simulation (ADS) Symposium, San Diego, CA, April 7-10, 2013.
PDF
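As a rough illustration of the weighted shortest-path routing described in the entry above, the sketch below picks a multi-hop relay path between UAVs with Dijkstra's algorithm, using a transmission cost that grows with squared distance. The positions, cost model, and communication radius are all invented for the example and are not taken from the paper.

    import itertools
    import math
    import networkx as nx

    # Hypothetical UAV positions (id -> (x, y)).
    positions = {0: (0, 0), 1: (3, 1), 2: (6, 0), 3: (4, 4), 4: (8, 3)}
    COMM_RADIUS = 5.0

    G = nx.Graph()
    G.add_nodes_from(positions)
    for a, b in itertools.combinations(positions, 2):
        d = math.dist(positions[a], positions[b])
        if d <= COMM_RADIUS:                 # only UAVs in radio range are linked
            G.add_edge(a, b, weight=d ** 2)  # transmission cost grows with distance squared

    # Cheapest relay chain from UAV 0 to UAV 4.
    path = nx.dijkstra_path(G, 0, 4, weight="weight")
    cost = nx.dijkstra_path_length(G, 0, 4, weight="weight")
    print("relay path:", path, "total cost:", cost)
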
Nathan Regola and Nitesh V. Chawla (March 2013).
Storing and Using Health Data in a Virtual Private Cloud. Electronic health records are being adopted at a rapid rate due to increased funding from the US federal government. Health data provide the opportunity to identify possible improvements in health care delivery by applying data mining and statistical methods to the data and will also enable a wide variety of new applications that will be meaningful to patients and medical professionals. Researchers are often granted access to health care data to assist in the data mining process, but HIPAA regulations mandate comprehensive safeguards to protect the data. Often universities (and presumably other research organizations) have an enterprise information technology infrastructure and a research infrastructure. Unfortunately, both of these infrastructures are generally not appropriate for sensitive research data such as HIPAA, as they require special accommodations on the part of the enterprise information technology (or increased security on the part of the research computing environment). Cloud computing, which is a concept that allows organizations to build complex infrastructures on leased resources, is rapidly evolving to the point that it is possible to build sophisticated network architectures with advanced security capabilities. We present a prototype infrastructure in Amazon's Virtual Private Cloud to allow researchers and practitioners to utilize the data in a HIPAA-compliant environment.
Journal of Medical Internet Research, 15(3):e63. (JMIR)
HTML
Robert Thompson and Nitesh V. Chawla
Addressing Challenges in Prescription Management. So far, electronic prescription management has failed to sufficiently improve physicians' knowledge of a patient's current and previous medications. Additionally, prescription management methods have failed at encouraging patients to adhere to medications. Between medication errors resulting from incomplete medication histories and poor patient adherence to drug regimens, significant challenges remain in prescription management. In this paper, we propose a unique use of quick response (QR) codes that should improve completeness of medication history, allowing physicians to more accurately treat a patient's diagnosis. This QR code system outlines a method for patients, physicians, and pharmacists to all view the medication history and for pharmacists to accurately update the history when a patient retrieves new medication. We also outline a method for incentivizing patient adherence based on redemption codes, a method proven successful in other industries. Between this and the QR code system, physicians should also be able to access important information about a patient's adherence consistency, allowing him or her to better understand the patient and more effectively treat a diagnosis.
24th Annual Conference of the Production and Operations Management Society (POMS)
PDF


2012
Yang Yang, Nitesh V. Chawla, Yizhou Sun, and Jiawei Han (December 2012).
Link Prediction in Heterogeneous Networks: Influence and Time Matters. Link prediction using temporal and heterogeneous network information.
Proc. of the 12th IEEE International Conference on Data Mining (ICDM'12), Brussels, Belgium, Dec. 2012 (Acceptance rate: 10.7%)
PDF
Yang Yang, Nitesh V. Chawla, Xiaohui Lu, and Sibel Adali (May 2012).
Prominence in Networks: A Co-evolving Process. Prominence evolution of human artifacts, such as publications.
IEEE 2nd International Network Science Workshop (NSW'12), West Point, NY, May. 2012 (Acceptance rate: N/A)
PDF
T. Ryan Hoens, Marina Blanton, Aaron Steele, and Nitesh V. Chawla (September 2012, accepted).
Reliable Medical Recommendation Systems with Patient Privacy. One of the concerns patients have when confronted with a medical condition is which physician to trust. Any recommendation system that seeks to answer this question must ensure any sensitive medical information collected by the system is properly secured. In this paper we codify these privacy concerns in a privacy-friendly framework and present two architectures that realize it: the Secure Processing Architecture (SPA) and the Anonymous Contributions Architecture (ACA). In SPA, patients submit their ratings in a protected form without revealing any information about their data, and the computation of recommendations proceeds over the protected data using secure multi-party computation techniques. In ACA, patients submit their ratings in the clear, but no link between a submission and patient data can be made. We discuss various aspects of both architectures including techniques for ensuring reliability of computed recommendations and system performance, and provide their comparison.
ACM Transactions on Intelligent Systems and Technology (ACM TIST)
PDF
Yuxiao Dong, Jie Tang, Sen Wu, Jilei Tian, Nitesh V. Chawla, Jinghai Rao, and Huanhuan Cao (December 2012).
Link Prediction and Recommendation Across Heterogeneous Social Networks
Proc. of the 12th IEEE International Conference on Data Mining (ICDM'12), Brussels, Belgium, Dec. 2012 (Acceptance rate: 10.7%)
PDF
Nitesh V. Chawla and Yang Yang (December 2012).
Link Prediction: A Primer
in Reda Alhajj and Jon Rokne (eds.), Encyclopedia of Social Network Analysis and Mining by Springer, 2012.
PDF
Saurav Pandit, Yang Yang, and Nitesh V. Chawla (December 2012).
Maximizing Information Spread Through Influence Structures in Social Networks. Maximizing influence in social networks.
DaMNet Workshop, in Proc. of the 12th IEEE International Conference on Data Mining (ICDM'12), Brussels, Belgium, Dec. 2012
PDF
Reid A. Johnson, Nitesh V. Chawla, and Jessica J. Hellmann (October 2012).
Species Distribution Modeling and Prediction: A Class Imbalance Problem. Predicting the distributions of species is central to a variety of applications in ecology and conservation biology. With increasing interest in using electronic occurrence records, many modeling techniques have been developed to utilize this data and compute the potential distribution of species as a proxy for actual observations. As the actual observations are typically overwhelmed by non-occurrences, we approach the modeling of species' distributions with a focus on the problem of class imbalance. Our analysis includes the evaluation of several machine learning methods that have been shown to address the problems of class imbalance, but which have rarely or never been applied to the domain of species distribution modeling. Evaluation of these methods includes the use of the area under the precision-recall curve (AUPR), which can supplement other metrics to provide a more informative assessment of model utility under conditions of class imbalance. Our analysis concludes that emphasizing techniques that specifically address the problem of class imbalance can provide AUROC and AUPR results competitive with traditional species distribution models.
NASA Conference on Intelligent Data Understanding (CIDU), Boulder, CO.
PDF
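The entry above argues that the area under the precision-recall curve is more informative than AUROC when occurrences are rare. A small sketch of computing both on an imbalanced toy problem with scikit-learn; the synthetic data and the roughly 5% positive rate are placeholders, not the paper's datasets.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score, roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic "species occurrence" data: ~5% positives to mimic class imbalance.
    X, y = make_classification(n_samples=5000, n_features=10, weights=[0.95],
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print("AUROC:", roc_auc_score(y_te, scores))            # can look optimistic under imbalance
    print("AUPR :", average_precision_score(y_te, scores))  # focuses on the rare positive class
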
Ryan Lichtenwalter and Nitesh V. Chawla (2012).
Link Prediction: Fair and Effective Evaluation
IEEE/ACM International Conference on Social Networks Analysis and Mining, 2012 (Acceptance rate: 16%)
PDF
Yang Yang, Yizhou Sun, Saurav Pandit, Nitesh V. Chawla and Jiawei Han (December 2012)
Perspective on Measurement Metrics for Community Detection Algorithms
in Zeki Erdem, Tansel Ozyer, Suheil Khoury, Jon Rokne (eds.), Studies in Mining Social Networks and Security Informatics by Springer, 2012.
PDF
T. Ryan Hoens and Nitesh V. Chawla (August 2012).
Learning in Non-stationary Environments with Class Imbalance. Learning in non-stationary environments is an increasingly important problem in a wide variety of real-world applications. In non-stationary environments data arrives incrementally, however the underlying generating function may change over time. In addition to the environments being non-stationary, they also often exhibit class imbalance. That is, one class (the majority class) vastly outnumbers the other class (the minority class). The combination of a non-stationary environment with class imbalance poses significant and interesting practical problems for classification. To overcome these issues, we introduce a novel instance selection mechanism, as well as update the Heuristic Updatable Weighted Random Subspaces (HUWRS) method for the class imbalance problem. We then compare our modifications of HUWRS (called HUWRS.IP) to other state of the art algorithms, concluding that HUWRS.IP often achieves vastly superior performance.
18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
PDF
T. Ryan Hoens and Nitesh V. Chawla (August 2012).
Imbalanced Datasets: From Sampling to Classifiers. Classification is one of the most fundamental tasks in the machine learning and data mining communities. One of the most common challenges faced when trying to perform classification is the class imbalance problem. A data set is considered imbalanced if the class of interest (positive or minority class) is relatively rare as compared to the other classes (negative or majority classes). As a result, the classifier can be heavily biased towards the majority class. A number of sampling approaches, ranging from under-sampling to over-sampling, have been developed to solve the problem of class imbalance. One challenge with sampling strategies is deciding how much to sample, which is obviously conditioned on the sampling strategy that is deployed. While a wrapper approach may be used to discover the sampling strategy and amounts, it can quickly become computationally prohibitive. To that end, recent research has also focused on developing novel classification algorithms that are class imbalance (skew) insensitive. In this chapter, we provide an overview of the sampling strategies as well as classification algorithms developed for countering class imbalance. In addition, we consider the issues of correctly evaluating the performance of a classifier on imbalanced datasets and present a discussion on various metrics.
Book chapter in Imbalanced Learning: Foundations, Algorithms, and Applications
PDF
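One family of sampling strategies surveyed in the chapter above is synthetic over-sampling in the spirit of SMOTE, where new minority examples are interpolated between a minority point and one of its minority-class nearest neighbors. A stripped-down sketch of that idea, not a reference implementation; k=5, the seed, and the amount of over-sampling are arbitrary choices.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def oversample_minority(X_min, n_new, k=5, seed=0):
        """Generate n_new synthetic minority samples by interpolating between
        each chosen point and a random one of its k nearest minority neighbors."""
        rng = np.random.default_rng(seed)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
        _, idx = nn.kneighbors(X_min)           # idx[:, 0] is the point itself
        synthetic = []
        for _ in range(n_new):
            i = rng.integers(len(X_min))
            j = idx[i, rng.integers(1, k + 1)]  # a random true neighbor
            gap = rng.random()
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
        return np.asarray(synthetic)

    # Toy usage: grow a 50-point minority class by 150 synthetic points.
    rng = np.random.default_rng(1)
    X_min = rng.normal(loc=2.0, size=(50, 4))
    X_syn = oversample_minority(X_min, n_new=150)
    print(X_syn.shape)   # (150, 4)
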
Nathan Regola, David A. Cieslak and Nitesh V. Chawla (June 2012).
The Constraints of Magnetic versus Flash Disk Capabilities in Big Data Analysis. Solid state disks (or flash disks) are decreasing in cost per gigabyte and are being incorporated into many appliances, such as the Oracle Database Appliance [8]. Databases—and more specifically data warehouses—are often utilized to support large scale data analysis and decision support systems. Decision makers prefer information in real time. Traditional storage systems that are based on magnetic disks achieve high performance by utilizing many disks for parallel operations in RAID arrays. However, this performance is only possible if requests represent a reasonable fraction of the RAID stripe size, or I/O transactions will suffer from high overhead. Solid state disks have the potential to increase the speed of data retrieval for mission critical workloads that require real time applications, such as analytic dashboards. However, solid state disks behave differently than magnetic hard disks due to the limitations of rewriting NAND flash based blocks. Therefore, this work presents benchmark results for a modern relational database that stores data on solid state disks, and contrasts this performance to a ten disk RAID 10 array, a traditional storage design for high performance database data blocks. The preliminary results show that a single solid state disk is able to outperform the array for queries summarizing a data set for a variety of OLAP cube dimensions. Future work will explore the low level database performance in more detail.
Proc. of Second Workshop on Architectures and Systems for Big Data, ACM.
PDF
Darcy Davis, Ryan Lichtenwalter and Nitesh V. Chawla.
Supervised Methods for Multi-Relational Link Prediction. Many important real-world systems, modeled naturally as complex networks, have heterogeneous interactions and complicated dependency structures. Link prediction in such networks must model the influences between heterogenous relationships and distinguish the formation mechanisms of each link type, a task which is beyond the simple topological features commonly used to score potential links. In this paper, we introduce a novel probabilistically weighted extension of the Adamic/Adar measure for heterogenous information networks, which we use to demonstrate the potential benefits of diverse evidence, particularly in cases where homogeneous relationships are very sparse. However, we also expose some fundamental flaws of traditional unsupervised link prediction. We develop supervised learning approaches for relationship (link) prediction in multi-relational networks, and demonstrate that a supervised approach to link prediction can enhance performance. We present results on three diverse, real-world heterogeneous information networks and discuss the trends and tradeoffs of supervised and unsupervised link prediction in a multi-relational setting.
Social Networks Analysis and Mining
PDF
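For the link prediction entries above, the classical Adamic/Adar score sums 1/log(degree) over the common neighbors of a candidate pair; the paper's contribution is a probabilistically weighted, multi-relational extension of it. The sketch below computes only the plain homogeneous score on a stand-in graph, with the multi-relational weighting left as a comment because the exact weights are the paper's, not reproduced here.

    import math
    import networkx as nx

    def adamic_adar(G, u, v):
        """Plain Adamic/Adar: sum of 1/log(deg(z)) over common neighbors z of u and v."""
        common = set(G[u]) & set(G[v])
        return sum(1.0 / math.log(G.degree(z)) for z in common if G.degree(z) > 1)

    G = nx.karate_club_graph()                  # stand-in network
    for u, v in [(0, 33), (5, 16), (2, 8)]:
        print((u, v), round(adamic_adar(G, u, v), 3))

    # A multi-relational variant (as in the entry above) would keep one graph per
    # edge type and combine per-relation scores with learned or probability-based
    # weights, e.g. sum(w_r * adamic_adar(G_r, u, v) for each relation r).
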
Andrew K. Rider, Geoffrey Siwo, Scott J. Emrich, Michael T. Ferdig, Nitesh V. Chawla (to appear).
A Supervised Learning Approach to the Ensemble Clustering of Genes. High-throughput techniques have become a primary approach to gathering biological data. These data can be used to explore relationships between genes and guide development of drugs and other research. However, the deluge of data contains an overwhelming amount of unknown information about the organism under study. Therefore, clustering is a common first step in the exploratory analysis of high-throughput biological data. We present a supervised learning approach to clustering that utilizes known gene-gene interaction data to improve results for already commonly used clustering techniques. The approach creates an ensemble similarity measure that can be used as input to any clustering technique and provides results with increased biological significance while not altering the clustering method.
International Journal of Data Mining and Bioinformatics
PDF
T. Ryan Hoens, Robi Polikar, Nitesh V. Chawla (January 2012).
Learning from Streaming Data with Concept Drift and Imbalance: An Overview. The primary focus of machine learning has traditionally been on learning from data assumed to be sufficient and representative of the underlying fixed, yet unknown, distribution. Such restrictions on the problem domain paved the way for development of elegant algorithms with theoretically provable performance guarantees. As is often the case, however, real world problems rarely fit neatly into such restricted models. For instance class distributions are often skewed, resulting in the "class imbalance" problem. Data drawn from non-stationary distributions is also common in real world applications, resulting in the "concept drift" or "non-stationary learning" problem which is often associated with streaming data scenarios. Recently, these problems have independently experienced increased research attention, however the combined problem of addressing all of the above mentioned issues has enjoyed relatively little research. If the ultimate goal of intelligent machine learning algorithms is to be able to address a wide spectrum of real world scenarios, then the need for a general framework for learning from, and adapting to, a non-stationary environment that may introduce imbalanced data can be hardly overstated. In this paper, we first present an overview of each of these challenging areas, followed by a comprehensive review of recent research for developing such a general framework.
Progress in Artificial Intelligence, Springer, 1(1).
PDF
Ryan N. Lichtenwalter and Nitesh V. Chawla.
Vertex Collocation Profiles: Subgraph Counting for Link Analysis and Prediction. We introduce the concept of a vertex collocation profile (VCP) for the purpose of topological link analysis and prediction. VCPs provide nearly complete information about the surrounding local structure of embedded vertex pairs. The VCP approach offers a new tool for domain experts to understand the underlying growth mechanisms in their networks and to analyze link formation mechanisms in the appropriate sociological, biological, physical, or other contexts. The same resolution that gives VCP its analytical power also enables it to perform well when used in supervised models to discriminate potential new links. We first develop the theory, mathematics, and algorithms underlying VCP. Then we demonstrate VCP methods performing link prediction competitively with unsupervised and supervised methods across several different network families. We conclude with timing results that introduce the comparative performance of several existing algorithms and the practicability of VCP computations on large networks.
Proc. of the 21st International World Wide Web Conference (WWW) (Acceptance rate: 12%)
PDF
Yizhou Sun, Jiawei Han, Charu C. Aggarwal, and Nitesh V. Chawla
When Will It Happen? Relationship Prediction in Heterogeneous Information Networks. Link prediction, i.e., predicting links or interactions between objects in a network, is an important task in network analysis. Although the problem has attracted much attention recently, there are several challenges that have not been addressed so far. First, most existing studies focus only on link prediction in homogeneous networks, where all objects and links belong to the same type. However, in the real world, heterogeneous networks that consist of multi-typed objects and relationships are ubiquitous. Second, most current studies only concern the problem of whether a link will appear in the future but seldom pay attention to the problem of when it will happen. In this paper, we address both issues and study the problem of predicting when a certain relationship will happen in the scenario of heterogeneous networks. First, we extend the link prediction problem to the relationship prediction problem, by systematically defining both the target relation and the topological features, using a meta path-based approach. Then, we directly model the distribution of relationship building time with the use of the extracted topological features. The experiments on citation relationship prediction between authors on the DBLP network demonstrate the effectiveness of our methodology.
Proc. 2012 ACM Int. Conf. on Web Search and Data Mining (WSDM'12), Seattle, WA, Feb. 2012. (Acceptance rate: 75/362 = 20.7%)
PDF
T. Ryan Hoens, Qi Qian, Nitesh V. Chawla, and Zhi-Hua Zhou
Building Decision Trees for the Multi-class Imbalance Problem. Learning in imbalanced datasets is a pervasive problem prevalent in a wide variety of real-world applications. In imbalanced datasets, the class of interest is generally a small fraction of the total instances, but misclassification of such instances is often expensive. While there is a significant body of research on the class imbalance problem for binary class datasets, multi-class datasets have received considerably less attention. This is partially due to the fact that the multi-class imbalance problem is often much harder than its related binary class problem, as the relative frequency and cost of each of the classes can vary widely from dataset to dataset. In this paper we study the multi-class imbalance problem as it relates to decision trees (specifically C4.4 and HDDT), and develop a new multi-class splitting criterion. From our experiments we show that multi-class Hellinger distance decision trees, when combined with decomposition techniques, outperform C4.4.
Proc. of 16th Pacific Asia Conference on Knowledge Discovery and Data Mining (Acceptance rate: 28%)
PDF


2011
T. Ryan Hoens, Nitesh V. Chawla, and Robi Polikar (December 2011)
Heuristic Updatable Weighted Random Subspaces for Non-stationary Environments. Learning in non-stationary environments is an increasingly important problem in a wide variety of real-world applications. In non-stationary environments data arrives incrementally, however the underlying generating function may change over time. While there is a variety of research into such environments, the research mainly consists of detecting concept drift (and then relearning the model), or developing classifiers which adapt to drift incrementally.

We introduce Heuristic Updatable Weighted Random Subspaces (HUWRS), a new technique based on the Random Subspace Method that detects drift in individual features via the use of Hellinger distance, a distributional divergence metric. Through the use of subspaces, HUWRS allows for a more fine-grained approach to dealing with concept drift which is robust to feature drift even without class labels. We then compare our approach to two state of the art algorithms, concluding that for a wide range of datasets and window sizes HUWRS outperforms the other methods.

IEEE International Conference on Data Mining (ICDM) (Acceptance rate: 12%)
PDF
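HUWRS, described above, flags drift by measuring the Hellinger distance between a feature's distribution in a reference window and in the current window. A minimal per-feature check in that spirit; the binning scheme, window sizes, and the 0.1 alarm threshold are illustrative assumptions rather than values from the paper.

    import numpy as np

    def hellinger(p, q):
        """Hellinger distance between two discrete distributions."""
        return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

    def feature_drift(ref, cur, bins=10):
        """Hellinger distance between histograms of one feature in two windows."""
        edges = np.histogram_bin_edges(np.concatenate([ref, cur]), bins=bins)
        p, _ = np.histogram(ref, bins=edges)
        q, _ = np.histogram(cur, bins=edges)
        return hellinger(p / p.sum(), q / q.sum())

    rng = np.random.default_rng(0)
    reference = rng.normal(0.0, 1.0, 1000)   # old window
    current = rng.normal(0.8, 1.0, 1000)     # new window with a shifted mean
    d = feature_drift(reference, current)
    print("Hellinger distance:", round(d, 3), "drift!" if d > 0.1 else "stable")
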
Nathan Regola and Nitesh V. Chawla (December 2011).
Small Compute Clusters for Large-Scale Data Analysis. Enterprises of all sizes are dealing with an increasing amount of electronic data that is essential to their business operations. Since there are many options when moving beyond a single server, organizations must consider a range of investment options such as purchasing a small cluster of machines or leasing capacity from cloud providers. Given the complexity (and cost) of configuring larger scale systems (such as distributed databases), we believe that revisiting more generic batch computing systems provides several advantages that are worth considering and benchmarking. For example, leveraging low cost servers, avoiding the data transformation process, and the time necessary to load and move data to a database are important considerations when selecting a workflow or technology. This paper presents benchmark results that compare MySQL to basic Unix tools on Condor and Oracle Grid Engine (OGE) supported by distributed filesystems for high throughput access to data using a real world data set. The benchmark results should be largely generalizable to other business analytic tasks. These results should aid information technology (IT) managers facing difficult decisions on which software stack to utilize for large, dynamic, datasets where database setup and loading may not be feasible within the time constraints imposed by business needs.
International Conference on Information Technology, Systems and Management, Kozhikode, Kerala, India.
PDF
Jake T. Lussier and Nitesh V. Chawla (September 2011).
Network Effects on Tweeting. Online social networks (OSNs) have created new and exciting ways to connect and share information. Perhaps no site has had a more profound effect on information exchange than Twitter.com. In this paper, we study large-scale graph properties and lesser-studied local graph structures of the explicit social network and the implicit retweet network in order to better understand the relationship between socialization and tweeting behaviors. In particular, we first explore the interplay between the social network and user tweet topics and offer evidence that suggests that users who are close in the social graph tend to tweet about similar topics. We then analyze the implicit retweet network and find highly unreciprocal links and unbalanced triads. We also explain the effects of these structural patterns on information diffusion by analyzing and visualizing how URLs tend to be tweeted and retweeted. Finally, given our analyses of the social network and the retweet network, we provide some insights into the relationships between these two networks.
Discovery Science, Lecture Notes in Computer Science, Springer-Verlag, 6926(2011), 209-220, DOI: 10.1007/978-3-642-24477-3_18.
PDF
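The retweet-network findings above, unreciprocated links and unbalanced triads, can be inspected on any directed graph with standard networkx calls. The toy edges below are made up purely for illustration.

    import networkx as nx

    # Tiny made-up retweet graph: an edge u -> v means u retweeted v.
    G = nx.DiGraph([(1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (4, 5)])

    print("overall reciprocity:", nx.reciprocity(G))   # fraction of mutual links
    one_way = [(u, v) for u, v in G.edges() if not G.has_edge(v, u)]
    print("unreciprocated links:", one_way)

    # Triad census buckets every node triple by its pattern of directed/mutual
    # edges (e.g. '030T' is a feed-forward, unbalanced triad).
    print(nx.triadic_census(G))
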
Ryan N. Lichtenwalter and Nitesh V. Chawla.
LPmade: Link Prediction Made Easy. LPmade is a complete cross-platform software solution for multi-core link prediction and related tasks and analysis. Its first principal contribution is a scalable network library supporting high-performance implementations of the most commonly employed unsupervised link prediction methods. Link prediction in longitudinal data requires a sophisticated and disciplined procedure for correct results and fair evaluation, so the second principal contribution of LPmade is a sophisticated GNU make architecture that completely automates link prediction, prediction evaluation, and network analysis. Finally, LPmade streamlines and automates the procedure for creating multivariate supervised link prediction models with a version of WEKA modified to operate effectively on extremely large data sets. With mere minutes of manual work, one may start with a raw stream of records representing a network and progress through hundreds of steps to generate plots, gigabytes or terabytes of output, and actionable or publishable results.
Journal of Machine Learning Research.
PDF
Darcy Davis and Nitesh V. Chawla (July 2011).
Exploring and Exploiting Disease Interactions from Multi-Relational Gene and Phenotype Networks. Availability of electronic health care records is unlocking the potential for novel studies on understanding and modeling the disease co-morbidities based on both phenotypic and genetic data. Moreover, the insurgence of increasingly reliable phenotypic data can aid further studies on investigating the potential genetic links among diseases. The goal is to create a feedback loop where computational tools guide and facilitate research, leading to improved biological knowledge and clinical standards, which in turn should generate better data. We build and analyze disease interaction networks based on data collected from previous genetic association studies and patient medical histories, spanning over 12 years, acquired from a regional hospital. By exploring both individual and combined interactions among these two levels of disease data, we provide novel insight into the interplay between genetics and clinical realities. Our results show a marked difference between the well-defined structure of genetic relationships and the chaotic co-morbidity network, but also highlight clear interdependencies. We demonstrate the power of these dependencies by proposing a novel multi-relational link prediction method, showing that disease co-morbidity can enhance our currently limited knowledge of genetic association. Furthermore, our methods for integrated networks of diverse data are widely applicable and can provide novel advances for many problems in systems biology and personalized medicine.
PLoS ONE, 6(7), e22670
PDF
Troy Raeder, Omar Lizardo, David Hachen, Nitesh V. Chawla (in press).
Predictors of short-term decay of cell phone contacts in a large-scale communication network. Under what conditions is an edge present in a social network at time t likely to decay or persist by some future time t + Delta(t)? Previous research addressing this issue suggests that the network range of the people involved in the edge, the extent to which the edge is embedded in a surrounding structure, and the age of the edge all play a role in edge decay. This paper uses weighted data from a large-scale social network built from cell-phone calls in an 8-week period to determine the importance of edge weight for the decay/persistence process. In particular, we study the relative predictive power of directed weight, embeddedness, newness, and range (measured as outdegree) with respect to edge decay and assess the effectiveness with which a simple decision tree and logistic regression classifier can accurately predict whether an edge that was active in one time period continues to be so in a future time period. We find that directed edge weight, weighted reciprocity and time-dependent measures of edge longevity are highly predictive of whether we classify an edge as persistent or decayed, relative to the other types of factors at the dyad and neighborhood level.
Social Networks.
PDF
Jose G. Moreno-Torres, Troy Raeder, Rocio Alaiz-Rodriguez, Nitesh V. Chawla, Francisco Herrera (January 2012).
A Unifying View on Dataset Shift in Classification. The field of Dataset Shift has received a growing amount of interest in the last few years. The fact that most real-world applications have to cope with some form of shift makes its study highly relevant. The literature on the topic is mostly scattered, and different authors use different names to refer to the same concepts, or use the same name for different concepts. With this work, we attempt to present a unifying framework through the review and comparison of some of the most important works in the literature.
Pattern Recognition, 45(1), 521-530.
PDF
Karsten Steinhaeuser, Auroop R. Ganguly and Nitesh V. Chawla (June 2011).
Multivariate and Multiscale Dependence in the Global Climate System Revealed Through Complex Networks. A systematic characterization of multivariate dependence at multiple spatio-temporal scales is critical to understanding climate system dynamics and improving predictive ability from models and data. However, dependence structures in climate are complex due to nonlinear dynamical generating processes, long-range spatial and long-memory temporal relationships, as well as low-frequency variability. Here we utilize complex networks to explore dependence in climate data. Specifically, networks constructed from reanalysis-based atmospheric variables over oceans and partitioned with community detection methods demonstrate the potential to capture regional and global dependence structures within and among climate variables. Proximity-based dependence as well as long-range spatial relationships are examined along with their evolution over time, yielding new insights on ocean meteorology. The tools are implicitly validated by confirming conceptual understanding about aggregate correlations and teleconnections. Our results also suggest a close similarity of observed dependence patterns in relative humidity and horizontal wind speed over oceans. In addition, updraft velocity, which relates to convective activity over the oceans, exhibits short spatiotemporal decorrelation scales but long-range dependence over time. The multivariate and multi-scale dependence patterns broadly persist over multiple time windows. Our findings motivate further investigations of dependence structures among observations, reanalysis and model-simulated data to enhance process understanding, assess model reliability and improve regional climate predictions.
Climate Dynamics, doi:10.1007/s00382-011-1135-9.
PDF
David A. Cieslak, T. Ryan Hoens, Nitesh V. Chawla and W. Philip Kegelmeyer (June 2011).
Hellinger Distance Decision Trees are Robust and Skew-Insensitive. Learning from imbalanced data is an important and common problem. Decision trees, supplemented with sampling techniques, have proven to be an effective way to address the imbalanced data problem. Despite their effectiveness, however, sampling methods add complexity and the need for parameter selection. To bypass these difficulties we propose a new decision tree technique called Hellinger Distance Decision Trees (HDDT) which uses Hellinger distance as the splitting criterion. We analytically and empirically demonstrate the strong skew insensitivity of Hellinger distance and its advantages over popular alternatives such as entropy (gain ratio). We apply a comprehensive empirical evaluation framework testing against commonly used sampling and ensemble methods, considering performance across 58 varied datasets. We demonstrate the superiority (using robust tests of statistical significance) of HDDT on imbalanced data, as well as its competitive performance on balanced datasets. We thereby arrive at the particularly practical conclusion that for imbalanced data it is sufficient to use Hellinger trees with bagging (BG) without any sampling methods. We provide all the datasets and software for this paper online (http://www.nd.edu/~dial/hddt).
Data Mining and Knowledge Discovery (DMKD), doi:10.1007/s10618-011-0222-1.
PDF
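The HDDT work above replaces entropy/gain ratio with the Hellinger distance between the class-conditional distributions induced by a candidate split. A compact sketch of that criterion for a two-class, binary split on a numeric threshold; the synthetic feature, labels, and threshold grid are placeholders.

    import numpy as np

    def hellinger_split_value(x, y, threshold):
        """Hellinger distance between the positive- and negative-class
        distributions over the two branches of the split x <= threshold."""
        left = x <= threshold
        pos, neg = (y == 1), (y == 0)
        # Fraction of each class falling in each branch.
        p = np.array([(left & pos).sum(), (~left & pos).sum()]) / max(pos.sum(), 1)
        q = np.array([(left & neg).sum(), (~left & neg).sum()]) / max(neg.sum(), 1)
        return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0, 1, 950), rng.normal(2, 1, 50)])   # imbalanced classes
    y = np.concatenate([np.zeros(950), np.ones(50)])

    best = max(np.linspace(x.min(), x.max(), 50),
               key=lambda t: hellinger_split_value(x, y, t))
    print("best threshold:", round(best, 2))

Because both branch fractions are computed per class, the criterion does not depend on the class priors, which is the skew insensitivity the paper emphasizes.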
Saurav Pandit, Yang Yang, Vikas Kawadia, Sameet Sreenivasan, and Nitesh V. Chawla (to appear).
Detecting Communities in Time-evolving Proximity Networks. The pattern of interactions between individuals in a population implicitly contains a remarkable amount of information. This information, if extracted, could be of significant importance in several realms such as containing the spread of disease, understanding information flow in social systems and predicting likely future interactions. A popular method of discovering structure in networks is through community detection which attempts to capture the extent to which that network is different from a random network. However, communities are not very well defined for time-varying networks. In this paper, we introduce the notion of spatio-temporal communities that attempts to capture the structure in spatial connections as well as temporal changes in a network. We illustrate the notion via several examples and list the challenges in effectively discovering spatio-temporal communities. For example, such communities are lost if the temporal interactions are aggregated in a single weighted network since the concurrency information is lost. We present an approach that first extracts concurrency information via node-clustering on each snapshot. Each node is then assigned a vector of community memberships over time, which is then used to group nodes into overlapping communities via recently introduced link clustering techniques. However we measure similarity (of nodes and edges) based on concurrence, i.e. when they existed, if they existed together. We call our approach the co-community algorithm. We validate our approach using several real-world data sets spanning multiple contexts.
IEEE Network Science Workshop (NSW), West Point, NY.
PDF
Karsten Steinhaeuser, Nitesh V. Chawla and Auroop R. Ganguly (to appear).
Comparing Predictive Power in Climate Data: Clustering Matters. Various clustering methods have been applied to climate, ecological, and other environmental datasets, for example to define climate zones, automate land-use classification, and similar tasks. Measuring the 'goodness' of such clusters is generally application-dependent and highly subjective, often requiring domain expertise and/or validation with field data (which can be costly or even impossible to acquire). Here we focus on one particular task: the extraction of ocean climate indices from observed climatological data. In this case, it is possible to quantify the relative performance of different methods. Specifically, we propose to extract indices with complex networks constructed from climate data, which have been shown to effectively capture the dynamical behavior of the global climate system, and compare their predictive power to candidate indices obtained using other popular clustering methods. Our results demonstrate that network-based clusters are statistically significantly better predictors of land climate than any other clustering method, which could lead to a deeper understanding of climate processes and complement physics-based climate models.
Symposium on Spatial and Temporal Databases (SSTD), Minneapolis, MN.
PDF
Ryan N. Lichtenwalter and Nitesh V. Chawla (July 2011).
Disnet: A Framework for Distributed Graph Computation. With the rise of network science as an exciting interdisciplinary research topic, efficient graph algorithms are in high demand. Problematically, many such algorithms measuring important properties of networks have asymptotic lower bounds that are quadratic, cubic, or higher in the number of vertices. For analysis of social networks, transportation networks, communication networks, and a host of others, computation is intractable. In these networks computation in serial fashion requires years or even decades. Fortunately, these same computational problems are often naturally parallel. We present here the design and implementation of a master-worker framework for easily computing such results in these circumstances. The user needs only to supply two small fragments of code describing the fundamental kernel of the computation. The framework automatically divides and distributes the workload and manages completion using an arbitrary number of heterogeneous computational resources. In practice, we have used thousands of machines and observed commensurate speedups. Writing only 31 lines of standard C++ code, we computed betweenness centrality on a network of 4.7M nodes in 25 hours.
The International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Kaohsiung, Taiwan.
PDF
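Disnet's master-worker pattern described above can be mimicked in miniature by splitting the source vertices of a betweenness computation across workers and having the master sum the partial results; this follows the standard networkx subset-betweenness recipe rather than Disnet's own code, and the toy graph and worker count are arbitrary.

    from concurrent.futures import ProcessPoolExecutor
    from functools import partial

    import networkx as nx

    def partial_betweenness(G, sources):
        """Betweenness contribution from shortest paths starting at `sources`."""
        return nx.betweenness_centrality_subset(G, sources=sources,
                                                targets=list(G), normalized=False)

    def distributed_betweenness(G, n_workers=4):
        nodes = list(G)
        chunks = [nodes[i::n_workers] for i in range(n_workers)]   # split the kernel's work
        totals = dict.fromkeys(nodes, 0.0)
        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            for part in pool.map(partial(partial_betweenness, G), chunks):
                for node, value in part.items():
                    totals[node] += value                          # master reduces partial results
        return totals

    if __name__ == "__main__":
        G = nx.erdos_renyi_graph(200, 0.05, seed=1)
        bc = distributed_betweenness(G)
        print(max(bc, key=bc.get), "has the highest betweenness")
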
Darcy Davis, Ryan Lichtenwalter, and Nitesh V. Chawla (July 2011).
Multi-Relational Link Prediction in Heterogeneous Information Networks. Many important real-world systems, modeled naturally as complex networks, have heterogeneous interactions and complicated dependency structures. Link prediction in such networks must model the influences between heterogenous relationships and distinguish the formation mechanisms of each link type, a task which is beyond the simple topological features commonly used to score potential links. In this paper, we introduce a novel probabilistically weighted extension of the Adamic/Adar measure for heterogenous information networks, which we use to demonstrate the potential benefits of diverse evidence, particularly in cases where homogeneous relationships are very sparse. However, we also expose some fundamental flaws of traditional a priori link prediction. In accordance with previous research on homogeneous networks, we further demonstrate that a supervised approach to link prediction can enhance performance and is easily extended to the heterogeneous case. Finally, we present results on three diverse, real-world heterogeneous information networks and discuss the trends and tradeoffs of supervised and unsupervised link prediction in a multi-relational setting.
The International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Kaohsiung, Taiwan.
PDF
Yang Yang, Yizhou Sun, Saurav Pandit, Nitesh Chawla, and Jiawei Han (July 2011).
Is Objective Function the Silver Bullet? A Case Study of Community Detection Algorithms on Social Networks. Community detection or cluster detection in networks is a well-studied, albeit hard, problem. Given the scale and complexity of modern day social networks, detecting 'reasonable' communities is an even harder problem. Since the first use of the k-means algorithm in the 1960s, many community detection algorithms have been invented - most of which are developed with specific goals in mind and the idea of detecting 'meaningful' communities varies widely from one algorithm to another. With the increasing number of community detection algorithms, there has been an advent of a number of evaluation measures and objective functions such as modularity and internal density. In this paper we divide methods of measurement into two categories, according to whether they rely on ground truth or not. Our work aims to answer whether these commonly used objective functions are consistent with the real performance of community detection algorithms across a number of homogeneous and heterogeneous networks. Seven representative algorithms are compared under various performance metrics, and on various real-world social networks.
The International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Kaohsiung, Taiwan.
PDF
Alex Pelan, Karsten Steinhaeuser, Nitesh V. Chawla, Dilkushi A. de Alwis Pitts and Auroop R. Ganguly (April 2011).
Empirical Comparison of Correlation Measures and Pruning Levels in Complex Networks Representing the Global Climate System. Climate change is an issue of growing economic, social, and political concern. Continued rise in the average temperatures of the Earth could lead to drastic climate change or an increased frequency of extreme events, which would negatively affect agriculture, population, and global health. One way of studying the dynamics of the Earth's changing climate is by attempting to identify regions that exhibit similar climatic behavior in terms of long-term variability. Climate networks have emerged as a strong analytics framework for both descriptive analysis and predictive modeling of the emergent phenomena. Previously, the networks were constructed using only one measure of similarity, namely the (linear) Pearson cross correlation, and were then clustered using a community detection algorithm. However, nonlinear dependencies are known to exist in climate, which begs the question whether more complex correlation measures are able to capture any such relationships. In this paper, we present a systematic study of different univariate measures of similarity and compare how each affects both the network structure as well as the predictive power of the clusters.
IEEE Symposium Series on Computational Intelligence and Data Mining (CIDM), Paris, France.
PDF
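The comparison above hinges on how the climate network is built: one node per grid cell, an edge whenever a chosen correlation measure between the cells' time series exceeds a pruning threshold, then community detection. A small sketch contrasting Pearson and Spearman correlation; the synthetic series stand in for reanalysis data and the 0.5 threshold is an arbitrary pruning level.

    import numpy as np
    import networkx as nx
    from scipy.stats import pearsonr, spearmanr
    from networkx.algorithms.community import greedy_modularity_communities

    def climate_network(series, corr=pearsonr, threshold=0.5):
        """series: (n_cells, n_timesteps). Edge i-j if |corr| exceeds the threshold."""
        n = series.shape[0]
        G = nx.Graph()
        G.add_nodes_from(range(n))
        for i in range(n):
            for j in range(i + 1, n):
                r, _ = corr(series[i], series[j])
                if abs(r) > threshold:
                    G.add_edge(i, j, weight=abs(r))
        return G

    # Toy data: 60 grid cells driven by three regional "modes" plus noise.
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(3, 120))
    series = latent[rng.integers(0, 3, size=60)] + 0.8 * rng.normal(size=(60, 120))

    for name, measure in [("pearson", pearsonr), ("spearman", spearmanr)]:
        G = climate_network(series, corr=measure)
        comms = greedy_modularity_communities(G)
        print(name, "edges:", G.number_of_edges(), "communities:", len(comms))
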


2010
Karsten Steinhaeuser, Nitesh V. Chawla, and Auroop R. Ganguly (to appear).
Complex Networks as a Unified Framework for Descriptive Analysis and Predictive Modeling in Climate Science. The analysis of climate data has relied heavily on hypothesis-driven statistical methods, while projections of future climate are based primarily on physics-based computational models. However, in recent years a wealth of new datasets has become available. Therefore, we take a more data-centric approach and propose a unified framework for studying climate, with an aim towards characterizing observed phenomena as well as discovering new knowledge in the climate domain. Specifically, we posit that complex networks are well-suited for both descriptive analysis and predictive modeling tasks. We show that the structural properties of 'climate networks' have useful interpretation within the domain. Further, we extract clusters from these networks and demonstrate their predictive power as climate indices. Our experimental results establish that the network clusters are statistically significantly better predictors than clusters derived using a more traditional clustering approach. Using complex networks as data representation thus enables the unique opportunity for descriptive and predictive modeling to inform each other.
Statistical Analysis and Data Mining, doi:10.1002/sam.10100.
PDF
Andrew K. Rider, Geoffrey Siwo, Nitesh V. Chawla, Michael Ferdig and Scott J. Emrich (to appear).
A Statistical Approach to Finding Overlooked Genetic Associations. Background: Complexity and noise in expression quantitative trait loci (eQTL) studies make it difficult to distinguish potential regulatory relationships among the many interactions. The predominant method of identifying eQTLs finds associations that are significant at a genome-wide level. The vast number of statistical tests carried out on these data make false negatives very likely. Corrections for multiple testing error render genome-wide eQTL techniques unable to detect modest regulatory effects. We propose an alternative method to identify eQTLs that builds on traditional approaches. In contrast to genome-wide techniques, our method determines the significance of an association between an expression trait and a locus with respect to the set of all associations to the expression trait. The use of this specific information facilitates identification of expression traits that have an expression profile that is characterized by a single exceptional association to a locus. Our approach identifies expression traits that have exceptional associations regardless of the genome-wide significance of those associations. This property facilitates the identification of possible false negatives for genome-wide significance. Further, our approach has the property of prioritizing expression traits that are affected by few strong associations. Expression traits identified by this method may warrant additional study because their expression level may be affected by targeting genes near a single locus.
Results: We demonstrate our method by identifying eQTL hotspots in Plasmodium falciparum (malaria) and Saccharomyces cerevisiae (yeast). We demonstrate the prioritization of traits with few strong genetic effects through Gene Ontology (GO) analysis of Yeast. Our results are strongly consistent with results gathered using genome-wide methods and identify additional hotspots and eQTLs.
Conclusions: New eQTLs and hotspots found with this method may represent regions of the genome or biological processes that are controlled through few relatively strong genetic interactions. These points of interest warrant experimental investigation.

BMC Bioinformatics.
PDF
Troy Raeder, T. Ryan Hoens, and Nitesh V. Chawla (December 2010).
Consequences of Variability in Classifier Performance Estimates. The prevailing approach to evaluating classifiers in the machine learning community involves comparing the performance of several algorithms over a series of usually unrelated data sets. However, beyond this there are many dimensions along which methodologies vary wildly. We show that, depending on the stability and similarity of the algorithms being compared, these sometimes-arbitrary methodological choices can have a significant impact on the conclusions of any study, including the results of statistical tests. In particular, we show that performance metrics and data sets used, the type of cross-validation employed, and the number of iterations of cross-validation run have a significant, and often predictable, effect. Based on these results, we offer a series of recommendations for achieving consistent, reproducible results in classifier performance comparisons.
IEEE International Conference on Data Mining (ICDM), Sydney, Australia.
PDF
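The ICDM paper above shows that seemingly minor evaluation choices, such as which folds are drawn and how many repetitions are run, move reported performance. One quick way to see this is to repeat stratified cross-validation with different shuffles and inspect the spread of the resulting scores; the classifier, dataset, and number of repetitions below are arbitrary choices for illustration.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    clf = DecisionTreeClassifier(random_state=0)

    # Same data, same learner: only the fold assignment changes between repetitions.
    means = []
    for rep in range(20):
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=rep)
        scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
        means.append(scores.mean())

    print("AUC across repetitions: %.3f +/- %.3f (min %.3f, max %.3f)"
          % (np.mean(means), np.std(means), min(means), max(means)))
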
Andrew K. Rider, Geoffrey Siwo, Nitesh V. Chawla, Michael Ferdig and Scott J. Emrich (December 2010).
A Supervised Learning Approach to the Unsupervised Clustering of Genes. Clustering is a common step in the analysis of microarray data. Microarrays enable simultaneous high-throughput measurement of the expression level of genes. These data can be used to explore relationships between genes and can guide development of drugs and further research. A typical first step in the analysis of these data is to use an agglomerative hierarchical clustering algorithm on the correlation between all gene pairs. While this simple approach has been successful it fails to identify many genetic interactions that may be important for drug design and other important applications. We present an approach to the clustering of expression data that utilizes known gene-gene interaction data to improve results for already commonly used clustering techniques. The approach creates an ensemble similarity measure that can be used as input to common clustering techniques and provides results with increased biological significance while not altering the clustering approach at all.
IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Hong Kong.
PDF
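A minimal sketch of the general idea of folding known gene-gene interactions into the similarity matrix fed to an off-the-shelf agglomerative clusterer; the equal weighting, average linkage, and matrix shapes are assumptions for illustration, not the paper's exact ensemble measure.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def ensemble_clusters(expr, known_interactions, alpha=0.5, n_clusters=10):
    """expr: (n_genes, n_conditions) expression matrix.
    known_interactions: symmetric (n_genes, n_genes) 0/1 matrix of curated
    gene-gene interactions. alpha: weight on expression correlation vs. the
    interaction evidence (0.5 is an arbitrary default for this sketch)."""
    corr = np.abs(np.corrcoef(expr))                   # expression similarity
    sim = alpha * corr + (1 - alpha) * known_interactions
    np.fill_diagonal(sim, 1.0)
    dist = squareform(1.0 - sim, checks=False)         # condensed distance matrix
    tree = linkage(dist, method="average")             # the usual agglomerative step
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```

The clustering algorithm itself is untouched; only its input similarity changes, which mirrors the point made in the abstract.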
T. Ryan Hoens, Marina Blanton and Nitesh V. Chawla (November 2010).
Reliable Medical Recommendation Systems with Patient Privacy. One of the concerns patients have when confronted with a medical condition is which physician to trust. Any recommendation system that seeks to fulfill this need must ensure any sensitive medical information collected by the system is properly secured. In this paper we codify these privacy concerns in a privacy-friendly framework and present two architectures that realize it: the Secure Processing Architecture (SPA) and the Anonymous Contributions Architecture (ACA). In SPA, patients submit their ratings in a protected form without revealing any information about their data, and the computation of recommendations proceeds over the protected data using secure multi-party computation techniques. In ACA, patients submit their ratings in the clear, but no link between a submission and patient data can be made. We discuss various aspects of both architectures including techniques for ensuring reliability of computed recommendations and system performance, and provide their comparison.
1st ACM International Health Informatics Symposium (IHI), Washington D.C.
PDF
Karsten Steinhaeuser, Nitesh V. Chawla and Auroop R. Ganguly (October 2010).
Complex Networks in Climate Science: Progress, Opportunities and Challenges. Networks have been used to describe and model a wide range of complex systems, both natural as well as man-made. One particularly interesting application in the earth sciences is the use of complex networks to represent and study the global climate system. In this paper, we motivate this general approach, explain the basic methodology, report on the state of the art (including our contributions), and outline open questions and opportunities for future research.
IEEE Conference on Intelligent Data Understanding (CIDU), Mountain View, CA.
PDF
Qi Liao, Aaron Striegel and Nitesh V. Chawla (September 2010).
Visualizing Graph Dynamics and Similarity for Enterprise Network Security and Management. Managing complex enterprise networks requires an understanding at a finer granularity than traditional network monitoring provides. The ability to correlate and visualize the dynamics and inter-relationships among various network components such as hosts, users, and applications is non-trivial. In this paper, we propose a visualization approach based on the hierarchical structure of similarity/difference visualization in the context of heterogeneous graphs. The concept of hierarchical visualization starts with the evolution of inter-graph states, adapts to the visualization of intra-graph clustering, and concludes with the visualization of similarity between individual nodes. The visualization tool we developed quantifies and presents these important changes and dynamics, essential to network operators, in a visually appealing and highly interactive manner. Through novel graph construction and transformation, such as network connectivity graphs, MDS graphs, bipartite graphs, and similarity graphs, we demonstrate how similarity and dynamics can be effectively visualized to provide insight with regard to network understanding.
ACM 7th International Symposium on Visualization for Cyber Security (VizSec).
PDF
Troy Raeder and Nitesh V. Chawla (August 2010).
Market Basket Analysis with Networks. The field of market basket analysis, the search for meaningful associations in customer purchase data, is one of the oldest areas of data mining. The typical solution involves the mining and analysis of association rules, which take the form of statements such as "people who buy diapers are likely to buy beer." It is well-known, however, that typical transaction datasets can support hundreds or thousands of obvious association rules for each interesting rule, and filtering through the rules is a non-trivial task. One may use an interestingness measure to quantify the usefulness of various rules, but there is no single agreed-upon measure and different measures can result in very different rankings of association rules. In this work, we take a different approach to mining transaction data. By modeling the data as a product network, we discover expressive communities (clusters) in the data, which can then be targeted for further analysis. We demonstrate that our network based approach can concisely isolate influence among products, mitigating the need to search through massive lists of association rules. We develop an interestingness measure for communities of products and show that it isolates useful, actionable communities. Finally, we build upon our experience with product networks to propose a comprehensive analysis strategy by combining both traditional and network-based techniques. This framework is capable of generating insights that are difficult to achieve with traditional analysis methods.
Social Networks Analysis and Modeling Journal.
PDF
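A small sketch of the product-network framing using networkx, assuming transactions arrive as sets of product identifiers; greedy modularity maximization stands in for whichever community detection the paper actually uses, and the toy baskets are invented.

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms import community

def product_communities(baskets):
    """Build a co-purchase graph (edge weight = number of baskets containing
    both products) and return communities of products."""
    G = nx.Graph()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            w = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
            G.add_edge(a, b, weight=w)
    return list(community.greedy_modularity_communities(G, weight="weight"))

print(product_communities([
    {"diapers", "beer", "wipes"},
    {"diapers", "wipes", "lotion"},
    {"beer", "chips", "salsa"},
    {"chips", "salsa", "limes"},
]))
```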
T. Ryan Hoens, Marina Blanton and Nitesh V. Chawla (August 2010).
A Private and Reliable Recommendation System for Social Networks. With the proliferation of internet-based social networks into our lives, new mechanisms to control the release and use of personal data are required. As a step toward this goal, we develop a recommendation system which protects the privacy of user answers while allowing them to learn an aggregate weighted average of ratings. Due to the use of social network connections, the querier obtains a more relevant and trustworthy result than what generic anonymous recommendation systems can provide, while at the same time preserving user privacy. We also give experimental performance results for our solution and several recently developed secure computation techniques, which is of independent interest.
IEEE International Conference on Information Privacy, Security, Risk and Trust (PASSAT), Minneapolis, MN.
PDF
Gregory Ditzler, Nitesh V. Chawla and Robi Polikar (August 2010).
An Incremental Learning Algorithm for Nonstationary Environments and Class Imbalance. Abstract to appear
International Conference on Pattern Recognition (ICPR), Istanbul, Turkey.
no pdf
Ryan Lichtenwalter, Jake Lussier and Nitesh V. Chawla (July 2010).
New Perspectives and Methods in Link Prediction. This paper examines important factors for link prediction in networks and provides a general, high-performance framework for the prediction task. Link prediction in sparse networks presents a significant challenge due to the inherent disproportion of links that can form to links that do form. Previous research has typically approached this as an unsupervised problem. While this is not the first work to explore supervised learning, many factors significant in influencing and guiding classification remain unexplored. In this paper, we consider these factors by first motivating the use of a supervised framework through a careful investigation of issues such as network observational period, generality of existing methods, variance reduction, topological causes and degrees of imbalance, and sampling approaches. We also present an effective flow-based predicting algorithm, offer formal bounds on imbalance in sparse network link prediction, and employ an evaluation method appropriate for the observed imbalance. Our careful consideration of the above issues ultimately leads to a completely general framework that outperforms unsupervised link prediction methods by more than 30% AUC.
ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)
PDF
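A compact sketch of the supervised framing: topological scores for candidate node pairs become features, and whether the edge actually appears in a later snapshot becomes the label. The three features and the random forest are placeholders for the much richer feature set and flow-based predictor described above.

```python
import numpy as np
import networkx as nx
from sklearn.ensemble import RandomForestClassifier

def pair_features(G, pairs):
    """A few standard topological predictors for candidate node pairs."""
    cn = [len(list(nx.common_neighbors(G, u, v))) for u, v in pairs]
    jc = [p for _, _, p in nx.jaccard_coefficient(G, pairs)]
    pa = [p for _, _, p in nx.preferential_attachment(G, pairs)]
    return np.column_stack([cn, jc, pa])

def fit_link_predictor(G_old, G_new, candidate_pairs):
    """Train on an earlier snapshot; label each pair by edge formation in a
    later snapshot. class_weight='balanced' is one simple way to acknowledge
    the extreme imbalance between links that could form and links that do."""
    X = pair_features(G_old, candidate_pairs)
    y = np.array([G_new.has_edge(u, v) for u, v in candidate_pairs], dtype=int)
    return RandomForestClassifier(n_estimators=200,
                                  class_weight="balanced").fit(X, y)
```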
Karsten Steinhaeuser, Nitesh V. Chawla and Auroop R. Ganguly (July 2010).
An Exploration of Climate Data Using Complex Networks. Climate change is a pressing focus of research, social and economic concern, and political attention. According to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC), increased frequency of extreme events will only intensify the occurrence of natural hazards, affecting global population, health, and economies. It is of keen interest to identify 'regions' of similar climatological behavior to discover spatial relationships in climate variables, including long-range teleconnections. To that end, we consider a complex networks-based representation of climate data. Cross correlation is used to weight network edges, thus respecting the temporal nature of the data, and a community detection algorithm identifies multivariate clusters. Examining networks for consecutive periods allows us to study structural changes over time. We show that communities have a climatological interpretation and that disturbances in structure can be an indicator of climate events (or lack thereof). Finally, we discuss how this model can be applied for the discovery of more complex concepts such as unknown teleconnections or the development of multivariate climate indices and predictive insights.
ACM SIGKDD Explorations, 12(1), 25-32.
PDF
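A self-contained sketch of the network construction step: grid cells are nodes, and an edge is kept when the correlation of two cells' time series clears a threshold. The threshold, the synthetic two-region data, and the greedy modularity detector are all assumptions made for the demo, not the paper's exact pipeline.

```python
import numpy as np
import networkx as nx
from networkx.algorithms import community

def climate_network(series, threshold=0.7):
    """series: (n_locations, n_timesteps) array of one climate variable."""
    corr = np.corrcoef(series)
    G = nx.Graph()
    G.add_nodes_from(range(series.shape[0]))
    for i, j in np.argwhere(np.triu(np.abs(corr) > threshold, k=1)):
        G.add_edge(int(i), int(j), weight=float(abs(corr[i, j])))
    return G

# Two synthetic 'regions' of coherent behavior, recovered as communities.
rng = np.random.default_rng(0)
base = rng.standard_normal((2, 240))
series = np.vstack([base[0] + 0.3 * rng.standard_normal((50, 240)),
                    base[1] + 0.3 * rng.standard_normal((50, 240))])
G = climate_network(series)
print(community.greedy_modularity_communities(G, weight="weight"))
```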
Darcy Davis and Nitesh V. Chawla (July 2010).
Exploring Disease Interactions Using Combined Gene and Phenotype Networks. Faced by unsustainable costs and enormous amounts of under-utilized data, health care needs more efficient practices, research, and tools to harness the benefits of data. These methods should create a feedback loop where computational tools guide and facilitate research, leading to improved biological knowledge and clinical standards, which in turn should generate better data. We build and analyze disease interaction networks based on data collected from previous genetic association studies and patient medical histories, spanning over 12 years, acquired from a regional hospital. By exploring both individual and combined interactions among these two levels of disease data, we provide insight into the interplay between genetics and clinical realities. Our results show a marked difference between the well-defined structure of genetic relationships and the chaotic co-morbidity network, but also highlight clear interdependencies. Additionally, we use significant patterns in the data to locate good target sites for further association research.
International Conference on Intelligent Systems for Molecular Biology.
PDF
Karsten Steinhaeuser, Nitesh V. Chawla and Auroop R. Ganguly (June 2010).
Descriptive Analysis of the Global Climate System and Predictive Modeling for Uncertainty Reduction in Climate Projections using Complex Networks. Prior work involving climate networks has largely focused on what we call descriptive analysis, that is, studying the global network properties and interpreting them in the domain context. However, we propose that the use of complex networks can be extended to include predictive modeling in climate within what we have termed a 'unified framework', wherein we utilize networks to discover relationships between ocean climate indicators and land climate from observed data. Specifically, using community detection on climate networks we identify ocean clusters from the data, i.e., regions of homogeneous climatologic behavior. Cluster averages are then treated as potential climate indicators by using them as inputs to a predictive (regression) model for land climate. Thus, we are effectively using data mining techniques to 'learn' the unknown physical relationships in the global climate system.
International Conference on Computational Sustainability (COMPSUST), Cambridge, MA.
PDF
James Gray, Darcy Davis, DeWayne Pursley, Jane Smallcomb, Alon Geva and Nitesh V. Chawla (June 2010).
Network Analysis of Team Structure in the Neonatal Intensive Care Unit. WHAT'S KNOWN ON THIS SUBJECT: For maximal effectiveness, NICU care requires the interaction of well-functioning teams composed of large, diverse groups of individuals.
WHAT THIS STUDY ADDS: By applying network analytic techniques to information contained in EHRs, we show that family perceptions of NICU nursing care are more strongly associated with nursing team structure than with team size.

Journal of the American Academy of Pediatrics, 125(6), 1460-1467.
PDF
T. Ryan Hoens and Nitesh V. Chawla (June 2010).
Generating Diverse Ensembles to Counter the Problem of Class Imbalance. One of the more challenging problems faced by the data mining community is that of imbalanced datasets. In imbalanced datasets one class (sometimes severely) outnumbers the other class, causing correct and useful predictions to be difficult to achieve. In order to combat this, many techniques have been proposed, especially centered around sampling methods. In this paper we propose an ensemble framework that combines random subspaces with sampling to overcome the class imbalance problem. We then experimentally verify this technique on a wide variety of datasets. We conclude by analyzing the performance of the ensembles, and showing that, overall, our technique provides a significant improvement.
Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Hyderabad, India.
PDF
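A minimal sketch of combining random subspaces with minority oversampling inside an ensemble, assuming 0/1 integer labels; simple oversampling by duplication stands in for whichever sampling method the paper actually pairs with the subspaces.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

class SubspaceSamplingEnsemble:
    def __init__(self, n_members=25, subspace_frac=0.5, random_state=0):
        self.n_members, self.frac = n_members, subspace_frac
        self.rng = np.random.default_rng(random_state)
        self.members = []

    def fit(self, X, y):
        minority = np.bincount(y).argmin()
        maj_mask = y != minority
        for _ in range(self.n_members):
            # Each member sees its own random subset of features...
            feats = self.rng.choice(X.shape[1],
                                    max(1, int(self.frac * X.shape[1])),
                                    replace=False)
            # ...and its own rebalanced sample (minority duplicated up to parity).
            Xo, yo = resample(X[~maj_mask], y[~maj_mask],
                              n_samples=int(maj_mask.sum()),
                              random_state=int(self.rng.integers(1 << 31)))
            Xb = np.vstack([X[maj_mask], Xo])[:, feats]
            yb = np.concatenate([y[maj_mask], yo])
            self.members.append((feats, DecisionTreeClassifier().fit(Xb, yb)))
        return self

    def predict_proba(self, X):
        # Averaging member probabilities preserves the diversity induced by
        # the differing subspaces and samples.
        return np.mean([t.predict_proba(X[:, f]) for f, t in self.members], axis=0)
```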
Troy Raeder, Marina Blanton, Nitesh V. Chawla and Keith Frikken (June 2010).
Privacy-Preserving Network Aggregation. Consider the scenario where information about a large network is distributed across several different parties or commercial entities. Intuitively, we would expect that the aggregate network formed by combining the individual private networks would be a more faithful representation of the network phenomenon as a whole. However, privacy preservation of the individual networks becomes a mandate. Thus, it would be useful, given several portions of an underlying network p_1 ... p_n, to securely compute the aggregate of all the networks p_i in a manner such that no party learns information about any other party's network. In this work, we propose a novel privacy preservation protocol for the non-trivial case of weighted networks. The protocol is secure against malicious adversaries.
Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Hyderabad, India.
PDF
Darcy Davis, Nitesh V. Chawla, Nicholas A. Christakis and Albert László Barabási (May 2010).
Time to CARE: A Collaborative Filtering Engine for Practical Disease Prediction. The monumental cost of health care, especially for chronic disease treatment, is quickly becoming unmanageable. This crisis has motivated the drive towards preventative medicine, where the primary concern is recognizing disease risk and taking action at the earliest signs. However, universal testing is neither time nor cost efficient. We propose CARE, a Collaborative Assessment and Recommendation Engine, which relies only on a patient's medical history using ICD-9-CM codes in order to predict future disease risks. CARE uses collaborative filtering methods to predict each patient's greatest disease risks based on their own medical history and that of similar patients. We also describe an Iterative version, ICARE, which incorporates ensemble concepts for improved performance. We further apply time-sensitive modifications which make the CARE framework practical for realistic long-term use. These novel systems require no specialized information and provide predictions for medical conditions of all kinds in a single run. We present experimental results on a large Medicare dataset, demonstrating that CARE and ICARE perform well at capturing future disease risks.
Data Mining and Knowledge Discovery, 20(3), 388-415.
PDF
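A stripped-down nearest-neighbour collaborative filtering step over a binary patient-by-diagnosis matrix, to make the flavour of the approach concrete; the cosine similarity, the neighbourhood size, and the weighted vote are simplifications for this sketch, not the actual CARE/ICARE formulation.

```python
import numpy as np

def rank_future_risks(history, patient_idx, k=20):
    """history: (n_patients, n_codes) binary matrix of past diagnoses
    (e.g. ICD-9-CM codes). Returns code indices for one patient, ranked by a
    similarity-weighted vote of the k most similar patients."""
    norms = np.linalg.norm(history, axis=1) + 1e-12
    sims = history @ history[patient_idx] / (norms * norms[patient_idx])
    sims[patient_idx] = -np.inf                      # never match the patient to themself
    neighbors = np.argsort(sims)[-k:]
    scores = sims[neighbors] @ history[neighbors]    # weighted counts per code
    scores[history[patient_idx] == 1] = -np.inf      # rank only unseen conditions
    return np.argsort(scores)[::-1]
```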
Karsten Steinhaeuser and Nitesh V. Chawla (April 2010).
Identifying and Evaluating Community Structure in Complex Networks. We compare and evaluate different metrics for community structure in networks. In this context we also discuss a simple approach to community detection, and show that it performs as well as other methods, but at lower computational complexity.
Pattern Recognition Letters, 31(5), 413-421.
PDF
Wei Liu, Sanjay Chawla, David A. Cieslak and Nitesh V. Chawla (April 2010).
A Robust Decision Tree Algorithm for Imbalanced Data Sets. We propose a new decision tree algorithm, Class Confidence Proportion Decision Tree (CCPDT), which is robust and insensitive to class distribution and generates rules which are statistically significant. In order to make decision trees robust, we begin by expressing Information Gain, the metric used in C4.5, in terms of confidence of a rule. This allows us to immediately explain why Information Gain, like confidence, results in rules which are biased towards the majority class. To overcome this bias, we introduce a new measure, Class Confidence Proportion (CCP), which forms the basis of CCPDT. To generate rules which are statistically significant we design a novel and efficient top-down and bottom-up approach which uses Fisher's exact test to prune branches of the tree which are not statistically significant. Together these two changes yield a classifier that performs statistically better than not only traditional decision trees but also trees learned from data that has been balanced by well known sampling techniques. Our claims are confirmed through extensive experiments and comparisons against C4.5, CART, HDDT and SPARCCC.
SIAM Conference on Data Mining (SDM), Columbus, OH.
PDF
Jake Lussier, Troy Raeder and Nitesh V. Chawla (March 2010).
User Generated Content Consumption and Social Networking in Knowledge-Sharing OSNs. Knowledge-sharing online social networks are becoming increasingly pervasive and popular. While the user-to-user interactions in these networks have received substantial attention, the consumption of user generated content has not been studied extensively. In this work, we use data gathered from digg.com to present novel findings and draw important sociological conclusions regarding the intimate relationship between consumption and social networking. We first demonstrate that individuals' consumption habits influence their friend networks, consistent with the concept of homophily. We then show that one's social network can also influence the consumption of a submission through the activation of an extended friend network. Finally, we investigate the level of reciprocity, or balance, in the network and uncover relationships that are significantly less balanced than expected.
Social Computing, Behavioral Modeling, and Prediction, Springer, 209-216.
PDF
Lorenzo Beretta, Alessandro Santaniello, Francesca Cappiello, Nitesh V. Chawla, Madelon C. Vonk, Patricia E. Carreira, Yannick Allanore, Delia A. Popa-Diaconu, Marta Cossu, Francesca Bertolotti, Gianfranco Ferraccioli, Antonino Mazzone, and Rafaella Scorza (2010).
Development of a Five-Year Mortality Model in Systemic Sclerosis Patients by Different Analytical Approaches. Objective. Systemic sclerosis (SSc) is a multiorgan disease with high mortality rates. Several clinical features have been associated with poor survival in different populations of SSc patients, but no clear and reproducible prognostic model to assess individual survival prediction in scleroderma patients has ever been developed.
Methods. We used Cox regression and three data mining-based classifiers (Naive Bayes Classifier [NBC], Random Forests [RND-F] and logistic regression [Log-Reg]) to develop a robust and reproducible 5-year prognostic model. All the models were built and internally validated by means of 5-fold cross-validation on a population of 558 Italian SSc patients. Their predictive ability and capability of generalisation was then tested on an independent population of 356 patients recruited from 5 external centres and finally compared to the predictions made by two SSc domain experts on the same population.

Clinical and Experimental Rheumatology, 28(2) Supplement 58, 18-27.
PDF


2009
Ryan N. Lichtenwalter, Katerina Lichtenwalter and Nitesh V. Chawla (December 2009).
Applying Learning Algorithms to Music Generation. There exist several music composition systems that generate blues chord progressions, jazz improvisation, or classical pieces. Such systems often work by applying a set of rules explicitly provided to the system to determine what sequence of output values is appropriate. Others use pattern recognition and generation techniques such as Markov models. These systems often suffer from mediocre performance and limited generality. We propose a system that goes from raw musical data to feature vector representation to classification models. We employ sliding window sequential machine learning techniques to generate classifiers that correspond to a training set of musical data. Our approach has the advantage of greater generality than explicitly specifying rules to a system and the potential to apply a wide variety of powerful existing non-sequential learning algorithms. We present the design and implementation of the composition system. We demonstrate the efficacy of the method, show and analyze successful samples of its output, and discuss ways in which it might be improved.
Indian International Conference on Artificial Intelligence (IIJCAI), Tumkur, India.
PDF
Olufemi A. Omitaomu, Auroop R. Ganguly, João Gama, Ranga Raju Vatsavai, Nitesh V. Chawla and Mohamed M. Gaber (December 2009).
Knowledge Discovery from Sensor Data. Wide-area sensor infrastructures, remote sensors, RFIDs, and wireless sensor networks yield massive volumes of disparate, dynamic, and geographically distributed data. As such sensors are becoming ubiquitous, a set of broad requirements is beginning to emerge across high-priority applications including adaptability to climate change, electric grid monitoring, disaster preparedness and management, national or homeland security, and the management of critical infrastructures. The raw data from sensors need to be efficiently managed and transformed to usable information through data fusion, which in turn must be converted to predictive insights via knowledge discovery, ultimately facilitating automated or human-induced tactical decisions or strategic policy based on decision sciences and decision support systems. Keeping in view the requirements of the emerging field of knowledge discovery from sensor data, we took the initiative to develop a community of researchers with common interests and scientific goals, which culminated in the organization of the SensorKDD series of workshops in conjunction with the prestigious ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. In this report, we summarize events at the Third ACM-SIGKDD International Workshop on Knowledge Discovery from Sensor Data (SensorKDD 2009).
ACM SIGKDD Explorations, 11(2), 84-87.
PDF
Faruck Morcos, Charles Lamanna, Nitesh V. Chawla and Jesús Izaguirre (July 2009).
Determination of Specificity Residues in Two Component Systems using Graphlets. This work presents a novel method for the identification of specificity residues in two component systems based on the discovery of graphlet signatures. We use network representations of 3-D structures and sequence of proteins, experimental data and graph-based learning to detect graphlet signatures that potentially are responsible for phosphotransfer specificity between Histidine Kinase (HK) and Response Regulator (RR) domains. This approach is applied to the system of HK and RR in E. coli. Structural regions were found for Histidine Kinase RstB and Response Regulator RstA and confirmed using experimental data. In addition, some hypothetical regions of specificity were proposed to explain cross talk between the HKs that phosphorylate YhfA and the RRs that interact with UhpB. Such an approach offers the ability to identify domain specificity residues in two component systems in silico.
International Conference on Bioinformatics & Computational Biology (BIOCOMP), Las Vegas, NV.
PDF
Troy Raeder and Nitesh V. Chawla (July 2009).
Model Monitor (M^2): Evaluating, Comparing, and Monitoring Models. This paper presents Model Monitor (M2), a Java toolkit for robustly evaluating machine learning algorithms in the presence of changing data distributions. M2 provides a simple and intuitive framework in which users can evaluate classifiers under hypothesized shifts in distribution and therefore determine the best model (or models) for their data under a number of potential scenarios. Additionally, M2 is fully integrated with the WEKA machine learning environment, so that a variety of commodity classifiers can be used if desired.
Journal of Machine Learning Research (JMLR), 10, 1387-1390.
PDF
Troy Raeder and Nitesh V. Chawla (July 2009).
Modeling a Store's Product Space as a Social Network. A market basket is a set of products that form a single retail transaction. This purchase data of products can shed important light on how product(s) might influence sales of other product(s). Departing from the standard approach of frequent itemset mining, we posit that purchase data can be modeled as a social network. One can then discover communities of products that are bought together, which can lead to expressive exploration and discovery of a larger influence zone of product(s). We develop a novel utility measure for communities of products and show, both financially and intuitively, that community detection provides a useful complement to association rules for market basket analysis. All our conclusions are validated on real store data.
ACM/IEEE Conference on Advances in Social Network Analysis and Mining (ASONAM), Athens, Greece.
PDF
Karsten Steinhaeuser, Nitesh V. Chawla and Auroop R. Ganguly (June 2009).
Discovery of Climate Patterns with Complex Networks. Climate scientists have applied various clustering methods to discover patterns in historical data, for example regions that share some common climatological behavior. We employ a complex network (graph) representation of the data, wherein nodes correspond to physical locations around the globe and edges are placed between them based solely on their climatologic similarities; to incorporate the temporal nature of the data, a multivariate cross correlation function is used to weight the network edges. We then apply a community detection algorithm to each network to identify climate regions and show that these 'communities' have a climatological interpretation; for example, we are able to find teleconnections between South America, Africa, and India that are likely linked to the El Niño Southern Oscillation. Tracking communities over consecutive periods allows us to study structural changes, and disturbances in structure can be an indicator of climate events. This model can be extended for the discovery of more complex concepts such as dependence structure between climate extremes or discovery of multivariate climate indices.
International Workshop and Conference on Network Science (NetSci), Venice, Italy.
PDF
Karsten Steinhaeuser, Nitesh V. Chawla and Auroop R. Ganguly (June 2009).
An Exploration of Climate Data Using Complex Networks. To discover patterns in historical data, climate scientists have applied various clustering methods with the goal of identifying regions that share some common climatological behavior. However, past approaches are limited by the fact that they either consider only a single time period (snapshot) of multivariate data, or they consider only a single variable by using the time series data as a multi-dimensional feature vector. In both cases, potentially useful information may be lost. Moreover, clusters in high-dimensional data space can be difficult to interpret, prompting the need for a more effective data representation. We address both of these issues by employing a complex network (graph) to represent climate data, a more intuitive model that can be used for analysis while also having a direct mapping to the physical world for interpretation. A cross correlation function is used to weight network edges, thus respecting the temporal nature of the data, and a community detection algorithm identifies multivariate clusters. Examining networks for consecutive periods allows us to study structural changes over time. We show that communities have a climatological interpretation and that disturbances in structure can be an indicator of climate events (or lack thereof). Finally, we discuss how this model can be applied for the discovery of more complex concepts such as unknown teleconnections or the development of multivariate climate indices and predictive insights.
ACM SIGKDD Workshop on Knowledge Discovery from Sensor Data (SensorKDD), Paris, France.
PDF
Sean McRoskey, Jim Notwell, Nitesh V. Chawla and Christian Poellabauer (June 2009).
Mining in a Mobile Environment. Distributed PRocessing in Mobile Environments (DPRiME) is a framework for processing large data sets across an ad-hoc network. Developed to address the shortcomings of Google's MapReduce outside of a fully-connected network, DPRiME separates nodes on the network into a master and workers; the master distributes sections of the data to available one-hop workers to process in parallel. Upon returning results to its master, a worker is assigned an unfinished task. Five data mining classifiers were implemented to process the data: decision trees, k-means, k-nearest neighbor, Naïve Bayes, and artificial neural networks. Ensembles were used so the classification tasks could be performed in parallel. This framework is well-suited for many tasks because it handles communications, node movement, node failure, packet loss, data partitioning, and result collection automatically. Therefore, DPRiME allows users with little knowledge of networking or distributed systems to harness the processing power of an entire network of single- and multi-hop nodes.
ACM SIGKDD Workshop on Knowledge Discovery from Sensor Data (SensorKDD), Paris, France.
PDF
Ryan Lichtenwalter and Nitesh V. Chawla (accepted April 2009, to appear).
Learning to Classify Data Streams with Imbalanced Class Distributions. Streaming data is pervasive in a multitude of data mining applications. One fundamental problem in the task of mining streaming data is distributional drift over time. Streams may also exhibit high and varying degrees of class imbalance, which can further complicate the task. In scenarios like these, class imbalance is particularly difficult to overcome and has not been as thoroughly studied. In this paper, we consider the issue of high class imbalance in conjunction with data streams. We propose a method called Boundary Definition, which relies on building the classifiers by stressing the boundary cases as the streams arrive. We employ a sequential validation framework, which we believe is the most meaningful option in the context of streaming imbalanced data.
Proceedings of the PAKDD Conference, LNCS, Springer.
PDF
Ryan N. Lichtenwalter and Nitesh V. Chawla (April 2009).
Adaptive Methods for Classification in Arbitrarily Imbalanced and Drifting Data Streams. Streaming data is pervasive in a multitude of data mining applications. One fundamental problem in the task of mining streaming data is distributional drift over time. Streams may also exhibit high and varying degrees of class imbalance, which can further complicate the task. In scenarios like these, class imbalance is particularly difficult to overcome and has not been as thoroughly studied. In this paper, we comprehensively consider the issues of changing distributions in conjunction with high degrees of class imbalance in streaming data. We propose new approaches based on distributional divergence and meta-classification that improve several performance metrics often applied in the study of imbalanced classification. We also propose a new distance measure for detecting distributional drift and examine its utility in weighting ensemble base classifiers. We employ a sequential validation framework, which we believe is the most meaningful option in the context of streaming imbalanced data.
PAKDD Workshop for Data Mining When Classes are Imbalanced and Errors have Costs (ICEC), Bangkok, Thailand.
PDF
Laritza M. Taft, R. Scott Evans, Chi-Ren Shyu, Marlene J. Egger, Nitesh V. Chawla, Joyce A. Mitchell, Sidney N. Thornton, Bruce Bray and Michael W. Varner (April 2009).
Countering Imbalanced Datasets to Improve Adverse Drug Event Predictive Models in Labor and Delivery. Background: The IOM report, Preventing Medication Errors, emphasizes the overall lack of knowledge of the incidence of adverse drug events (ADE). Operating rooms, emergency departments and intensive care units are known to have a higher incidence of ADE. Labor and delivery (L&D) is an emergency care unit that could have an increased risk of ADE, where reported rates remain low and under-reporting is suspected. Risk factor identification with electronic pattern recognition techniques could improve ADE detection rates.
Objective: The objective of the present study is to apply Synthetic Minority Over Sampling Technique (SMOTE) as an enhanced sampling method in a sparse dataset to generate prediction models to identify ADE in women admitted for labor and delivery based on patient risk factors and comorbidities.
Results: By creating synthetic cases with the SMOTE algorithm and using a 10-fold cross-validation technique, we demonstrated improved performance of the Naïve Bayes and the decision tree algorithms. The true positive rate (TPR) of 0.32 in the raw dataset increased to 0.67 in the 800% over-sampled dataset.
Conclusion: Enhanced performance from classification algorithms can be attained with the use of synthetic minority class oversampling techniques in sparse clinical datasets. Predictive models created in this manner can be used to develop evidence based ADE monitoring systems.

Journal of Biomedical Informatics (JBI), 42(2), 356-364.
PDF
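Since the study rests on SMOTE, a minimal SMOTE-style oversampler for numeric features is sketched below: each synthetic case is interpolated between a minority example and one of its minority-class nearest neighbours. This is a bare-bones illustration (a maintained implementation such as imbalanced-learn's SMOTE would normally be used), not the exact code behind the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_synthetic, k=5, random_state=0):
    """X_min: (n_minority, n_features) minority-class examples (needs > k rows).
    Returns n_synthetic new rows placed on segments between neighbours."""
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]   # one of its k minority neighbours
        synthetic.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# An "800% over-sampled" minority class, as in the study, adds eight synthetic
# rows per original minority row:
# X_min_new = np.vstack([X_min, smote(X_min, 8 * len(X_min))])
```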
Karsten Steinhaeuser and Nitesh V. Chawla (March 2009).
A Network-Based Approach to Understanding and Predicting Diseases. Pursuit of preventive healthcare relies on fundamental knowledge of the complex relationships between diseases and individuals. We take a step towards understanding these connections by employing a network-based approach to explore a large medical database. Here we report on two distinct tasks. First, we characterize networks of diseases in terms of their physical properties and emergent behavior over time. Our analysis reveals important insights with implications for modeling and prediction. Second, we immediately apply this knowledge to construct patient networks and build a predictive model to assess disease risk for individuals based on medical history. We evaluate the ability of our model to identify conditions a person is likely to develop in the future and study the benefits of demographic data partitioning. We discuss strengths and limitations of our method as well as the data itself to provide direction for future work.
Social Computing, Behavioral Modeling, and Prediction, Springer, 209-216.
PDF
David A. Cieslak and Nitesh V. Chawla (January 2009).
A Framework for Monitoring Classifiers' Performance: When and Why Failure Occurs? Classifier error is the product of model bias and data variance. While it is important to understand the bias involved in selecting a given learning algorithm, it is similarly important to understand the variability in data over time, since even the One True Model might perform poorly when training and evaluation samples diverge. Thus, the ability to identify distributional divergence is critical towards pinpointing when fracture points in classifier performance will occur, particularly since contemporary methods such as ten-folds and hold-out are poor predictors in divergent circumstances. This article implements a comprehensive evaluation framework to proactively detect breakpoints in classifiers' predictions and shifts in data distributions through a series of statistical tests. We outline and utilize three scenarios under which data changes: sample selection bias, covariate shift, and shifting class priors. We evaluate the framework with a variety of classifiers and datasets.
Knowledge and Information Systems (KAIS), 18(1), 83-108.
PDF
Yuchuh Tang, Yan-Qing Zhang, Nitesh V. Chawla and Sven Kresser (February 2009).
SVMs Modeling for Highly Imbalanced Classification. Traditional classification algorithms can be limited in their performance on highly unbalanced data sets. A popular stream of work for countering the problem of class imbalance has been the application of a sundry of sampling strategies. In this paper, we focus on designing modifications to support vector machines (SVMs) to appropriately tackle the problem of class imbalance. We incorporate different "rebalance" heuristics in SVM modeling, including cost-sensitive learning, and over- and undersampling. These SVM-based strategies are compared with various state-of-the-art approaches on a variety of data sets by using various metrics, including G-mean, area under the receiver operating characteristic curve, F-measure, and area under the precision/recall curve. We show that we are able to surpass or match the previously known best algorithms on each data set. In particular, of the four SVM variations considered in this paper, the novel granular SVMs-repetitive undersampling algorithm (GSVM-RU) is the best in terms of both effectiveness and efficiency. GSVM-RU is effective, as it can minimize the negative effect of information loss while maximizing the positive effect of data cleaning in the undersampling process. GSVM-RU is efficient by extracting far fewer support vectors and, hence, greatly speeding up SVM prediction.
IEEE Transactions on Systems, Man and Cybernetics, Part B (SMCB), 39(1), 281-288.
PDF
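The paper's GSVM-RU algorithm is not reproduced here, but a short sketch of two of the simpler "rebalance" heuristics it builds on (cost-sensitive weighting and majority undersampling) shows how they slot into a standard SVM; the synthetic data and parameters are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

# Cost-sensitive SVM: errors on the rare class are penalized more heavily.
cost_svm = SVC(class_weight="balanced", probability=True).fit(Xtr, ytr)

# Random undersampling: fit a plain SVM on a majority sample of minority size.
rng = np.random.default_rng(0)
maj, mino = np.where(ytr == 0)[0], np.where(ytr == 1)[0]
keep = np.concatenate([mino, rng.choice(maj, size=len(mino), replace=False)])
under_svm = SVC(probability=True).fit(Xtr[keep], ytr[keep])

for name, model in [("cost-sensitive", cost_svm), ("undersampled", under_svm)]:
    print(name, roc_auc_score(yte, model.predict_proba(Xte)[:, 1]))
```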


2008
David A. Cieslak and Nitesh V. Chawla (December 2008).
Start Globally, Optimize Locally, Predict Globally: Improving Performance on Unbalanced Data. Class imbalance is a ubiquitous problem in supervised learning and has gained wide-scale attention in the literature. Perhaps the most prevalent solution is to apply sampling to training data in order to improve classifier performance. The typical approach will apply uniform levels of sampling globally. However, we believe that data is typically multi-modal, which suggests sampling should be treated locally rather than globally. It is the purpose of this paper to propose a framework which first identifies meaningful regions of data and then proceeds to find optimal sampling levels within each. This paper demonstrates that a global classifier trained on data locally sampled produces superior rank-orderings on a wide range of real-world and artificial datasets as compared to contemporary global sampling methods.
IEEE International Conference on Data Mining (ICDM), Pisa, Italy.
PDF
Christopher Moretti, Karsten Steinhaeuser, Douglas Thain and Nitesh V. Chawla (December 2008).
Scaling Up Classifiers to Cloud Computers. As the size of available datasets has grown from Megabytes to Gigabytes and now into Terabytes, machine learning algorithms and computing infrastructures have continuously evolved in an effort to keep pace. But at large scales, mining for useful patterns still presents challenges in terms of data management as well as computation. These issues can be addressed by dividing both data and computation to build ensembles of classifiers in a distributed fashion, but trade-offs in cost, performance, and accuracy must be considered when designing or selecting an appropriate architecture. In this paper, we present an abstraction for scalable data mining that allows us to explore these tradeoffs. Data and computation are distributed to a computing cloud with minimal effort from the user, and multiple models for data management are available depending on the workload and system configuration. We demonstrate the performance and scalability characteristics of our ensembles using a wide variety of datasets and algorithms on a Condor-based pool with Chirp to handle the storage.
IEEE International Conference on Data Mining (ICDM), Pisa, Italy.
PDF
Ranga Raju Vatsavai, Olufemi A. Omitaomu, João Gama, Nitesh V. Chawla, Mohamed M. Gaber and Auroop R. Ganguly (December 2008).
Knowledge Discovery from Sensor Data. Extracting knowledge and emerging patterns from sensor data is a nontrivial task. The challenges for the knowledge discovery community are expected to be immense. On one hand, dynamic data streams or events require real-time analysis methodologies and systems, while on the other hand centralized processing through high end computing is also required for generating offline predictive insights, which in turn can facilitate real-time analysis. In addition, emerging societal problems require knowledge discovery solutions that are designed to investigate anomalies, changes, extremes and nonlinear processes, and departures from the normal. Keeping in view the requirements of the emerging field of knowledge discovery from sensor data, we took the initiative to develop a community of researchers with common interests and scientific goals, which culminated in the organization of the Sensor-KDD series of workshops in conjunction with the prestigious ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. In this report, we summarize the events of the Second ACM-SIGKDD International Workshop on Knowledge Discovery from Sensor Data (Sensor-KDD 2008).
ACM SIGKDD Explorations, 10(2), 68-73.
PDF
Darcy Davis, Nitesh V. Chawla, Nicholas Blumm, Nicholas A. Christakis, Albert-László Barabási (October 2008).
Predicting Individual Disease Risk Based on Medical History. The monumental cost of health care, especially for chronic disease treatment, is quickly becoming unmanageable. This crisis has motivated the drive towards preventative medicine, where the primary concern is recognizing disease risk and taking action at the earliest signs. However, universal testing is neither time nor cost efficient. We propose CARE, a Collaborative Assessment and Recommendation Engine, which relies only on a patient's medical history using ICD-9-CM codes in order to predict future disease risks. CARE combines collaborative filtering methods with clustering to predict each patient's greatest disease risks based on their own medical history and that of similar patients. We also describe an Iterative version, ICARE, which incorporates ensemble concepts for improved performance. These novel systems require no specialized information and provide predictions for medical conditions of all kinds in a single run. We present experimental results on a large Medicare dataset, demonstrating that CARE and ICARE perform well at capturing future disease risks.
ACM Conference on Information and Knowledge Management (CIKM), Napa, CA.
PDF
David A. Cieslak, Nitesh V. Chawla and Douglas Thain (September 2008).
Troubleshooting Thousands of Jobs on Production Grids Using Data Mining Techniques. Large scale production computing grids introduce new challenges in debugging and troubleshooting. A user that submits a workload consisting of tens of thousands of jobs to a grid of thousands of processors has a good chance of receiving thousands of error messages as a result. How can one begin to reason about such problems? We propose that data mining techniques can be employed to classify failures according to the properties of the jobs and machines involved. We demonstrate this technique through several case studies on real workloads consisting of tens of thousands of jobs. We apply the same techniques to a year's worth of data on a 3000 CPU production grid and use it to gain a high level understanding of the system behavior.
IEEE/ACM International Conference on Grid Computing (GRID), Tsukuba, Japan.
PDF
David A. Cieslak and Nitesh V. Chawla (September 2008).
Learning Decision Trees for Unbalanced Data. Learning from unbalanced datasets presents a convoluted problem in which traditional learning algorithms may perform poorly. The objective functions used for learning the classifiers typically tend to favor the larger, less important classes in such problems. This paper compares the performance of several popular decision tree splitting criteria - information gain, Gini measure, and DKM - and identifies a new skew insensitive measure in Hellinger distance. We outline the strengths of Hellinger distance in class imbalance, propose its application in forming decision trees, and perform a comprehensive comparative analysis between each decision tree construction method. In addition, we consider the performance of each tree within a powerful sampling wrapper framework to capture the interaction of the splitting metric and sampling. We evaluate over a wide range of datasets and determine which methods operate best under class imbalance.
European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), Antwerp, Belgium.
PDF
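A minimal sketch of the Hellinger distance used as a splitting criterion: for a candidate binary split, it measures how differently the two classes distribute across the branches, without reference to the class priors (which is what makes it skew-insensitive). The function and example below are illustrative only, not the HDDT implementation.

```python
import numpy as np

def hellinger_split_value(y, left_mask):
    """y: 0/1 labels; left_mask: boolean array selecting the left branch of a
    candidate split. Higher values indicate better class separation."""
    pos, neg = (y == 1), (y == 0)
    d = 0.0
    for branch in (left_mask, ~left_mask):
        p = (branch & pos).sum() / max(pos.sum(), 1)   # share of positives in branch
        n = (branch & neg).sum() / max(neg.sum(), 1)   # share of negatives in branch
        d += (np.sqrt(p) - np.sqrt(n)) ** 2
    return float(np.sqrt(d))

y = np.array([1, 1, 0, 0, 0, 0, 0, 0])
split = np.array([True, True, True, False, False, False, False, False])
print(hellinger_split_value(y, split))   # candidate splits are ranked by this value
```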
Qi Liao, David A. Cieslak, Aaron D. Striegel and Nitesh V. Chawla (June 2008).
Using Selective, Short-Term Memory to Improve Resilience Against DDoS Exhaustion Attacks. Distributed denial of service (DDoS) attacks originating from botnets can quickly bring normally effective web services to a screeching halt. This paper presents SESRAA (selective short-term randomized acceptance algorithms), an adaptive scheme for maintaining web service despite the presence of multifaceted attacks in a noisy environment. In contrast to existing solutions that rely upon 'clean' training data, we presume that a live web service environment makes finding such training data difficult if not impossible. SESRAA functions much like a battlefield surgeon's triage: focusing on quickly and efficiently salvaging good connections with the realization that the chaotic nature of the live environment implicitly limits the accuracy of such detections. SESRAA employs an adaptive k-means clustering approach using short-term extraction and limited centroid evolution to defend the legitimate connections in a mixed attack environment. We present the SESRAA approach and evaluate its performance through experimental studies in a diverse attack environment. The results show significant improvements against a wide variety of DDoS configurations and input traffic patterns.
Security and Communication Networks, 1(4), 287-299.
PDF
David A. Cieslak and Nitesh V. Chawla (May 2008).
Analyzing Classifier Performance on Imbalanced Datasets when Training and Testing Distributions Differ. Many machine learning applications like finance, medicine, and risk management suffer from class imbalance: cases of interest occur rarely. Further complicating these applications is that the training and testing samples might differ significantly in their respective class distributions. Sampling has been shown to be a strong solution to imbalance and additionally offers a rich parameter space from which to select classifiers. This paper is concerned with the interaction between Probability Estimation Trees (PETs), sampling, and performance metrics as testing distributions fluctuate substantially. A set of comprehensive analyses is presented, which anticipate classifier performance through a set of widely varying testing distributions.
Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Osaka, Japan.
PDF
Karsten Steinhaeuser and Nitesh V. Chawla (June 2008).
Is Modularity the Answer to Evaluating Community Structure in Networks? We examine modularity as both an evaluation measure of community structure as well as an optimization criterion used by some algorithms to identify communities in networks. Specifically, we assess the pros and cons of modularity, and identify its shortcomings via comparison to alternate evaluation metrics on networks for which the true communities are known. We also develop a simple method for community detection using random walks and show that it can identify the actual communities at least as well as more complex algorithms. We further expand upon the idea of community detection with random walks by extending the method to weighted networks and explain how to incorporate node attributes - information that is frequently available but usually ignored by other algorithms - to compute edge weights that can aid in detecting more meaningful communities.
International Workshop and Conference on Network Science (NetSci), Norwich, UK.
PDF
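A short illustration of the quantity under scrutiny: the modularity of a partition on a network whose true communities are known. Zachary's karate club is used purely as a convenient built-in example; the comparison to alternate metrics and the random-walk method discussed above are not reproduced.

```python
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()                  # ground-truth split is stored per node
truth = [{n for n in G if G.nodes[n]["club"] == "Mr. Hi"},
         {n for n in G if G.nodes[n]["club"] != "Mr. Hi"}]
detected = community.greedy_modularity_communities(G)

# Modularity rewards partitions with more intra-community edges than expected
# at random; note that it may score an algorithmic partition above the truth,
# which is part of the paper's critique.
print("Q(truth)    =", community.modularity(G, truth))
print("Q(detected) =", community.modularity(G, detected))
```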
Karsten Steinhaeuser and Nitesh V. Chawla (April 2008).
Scalable Learning with Thread-Level Parallelism. A significant increase in the ability to collect and store diverse information over the past decade has led to an outright data explosion, providing larger and richer datasets than ever before. This proliferation in dataset size is accompanied by the dilemma of successfully analyzing this data to discover patterns of interest. Extreme dataset sizes place unprecedented demands on high-performance computing infrastructures, and a gap has developed between the available real-world datasets and our ability to process them; data volumes are quickly approaching Tera and Petabytes. This rate of increase also defies the subsampling paradigm, as even a subsample of data runs well into Gigabytes. To counter this challenge, we exploit advances in multi-threaded processor technology. We explore massive thread-level parallelism - provided by the Cray MTA-2 - as a platform for scalable data mining. We conjecture that such an architecture is well suited for the application of machine learning to large datasets. To this end, we present a thorough complexity analysis and experimental evaluation of a popular decision tree algorithm implemented using fine-grain parallelism, including a comparison to two more conventional architectures. We use diverse datasets with sizes varying in both dimensions (number of records and attributes). Our results lead us to the conclusion that a massively parallel architecture is an appropriate platform for the implementation of highly scalable learning algorithms.
Midwest Artificial Intelligence and Cognitive Science Conference (MAICS), Cincinnati, OH.
PDF
Karsten Steinhaeuser and Nitesh V. Chawla (March 2008).
Community Detection in a Large Real-World Social Network. Identifying meaningful community structure in social networks is a hard problem, and extreme network size or sparseness of the network compound the difficulty of the task. With a proliferation of real-world network datasets there has been an increasing demand for algorithms that work effectively and efficiently. Existing methods are limited by their computational requirements and rely heavily on the network topology, which fails in scale-free networks. Yet, in addition to the network connectivity, many datasets also include attributes of individual nodes, but current methods are unable to incorporate this data. Cognizant of these requirements, we propose a simple approach that steers away from complex algorithms, focusing instead on the edge weights; more specifically, we leverage the node attributes to compute better weights. Our experimental results on a real-world social network show that a simple thresholding method with edge weights based on node attributes is sufficient to identify a very strong community structure.
Social Computing, Behavioral Modeling, and Prediction, Springer, 168-175.
PDF
Nitesh V. Chawla, David A. Cieslak, Lawrence O. Hall, Ajay Joshi (January 2008).
Automatically Countering Imbalance and Its Empirical Relationship to Cost. Learning from imbalanced data sets presents a convoluted problem both from the modeling and cost standpoints. In particular, when a class is of great interest but occurs relatively rarely such as in cases of fraud, instances of disease, and regions of interest in large-scale simulations, there is a correspondingly high cost for the misclassification of rare events. Under such circumstances, the data set is often re-sampled to generate models with high minority class accuracy. However, the sampling methods face a common, but important, criticism: how to automatically discover the proper amount and type of sampling? To address this problem, we propose a wrapper paradigm that discovers the amount of re-sampling for a data set based on optimizing evaluation functions like the f-measure, Area Under the ROC Curve (AUROC), cost, cost-curves, and the cost dependent f-measure. Our analysis of the wrapper is twofold. First, we report the interaction between different evaluation and wrapper optimization functions. Second, we present a set of results in a cost-sensitive environment, including scenarios of unknown or changing cost matrices. We also compared the performance of the wrapper approach versus cost-sensitive learning methods - MetaCost and the Cost-Sensitive Classifiers - and found the wrapper to outperform the cost-sensitive classifiers in a cost-sensitive environment. Lastly, we obtained the lowest cost per test example compared to any result we are aware of for the KDD-99 Cup intrusion detection data set.
Data Mining and Knowledge Discovery, 17(2), 225-252.
PDF
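A compact sketch of the wrapper idea: search over candidate sampling amounts and keep the one that optimizes a chosen evaluation function under cross-validation. Duplication-based oversampling and the f-measure are used here only to keep the sketch self-contained; SMOTE, undersampling, AUROC, or a cost function would plug in the same way.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def wrapper_select_sampling(X, y, levels=(0, 100, 200, 400, 800)):
    """Return (best f-measure, best oversampling percent); y uses 1 for the
    minority class."""
    best = (-1.0, 0)
    for level in levels:
        scores = []
        for tr, te in StratifiedKFold(5, shuffle=True, random_state=0).split(X, y):
            Xtr, ytr = X[tr], y[tr]
            if level:
                min_idx = np.where(ytr == 1)[0]
                extra = resample(min_idx, n_samples=level * len(min_idx) // 100,
                                 random_state=0)
                Xtr = np.vstack([Xtr, Xtr[extra]])
                ytr = np.concatenate([ytr, ytr[extra]])
            pred = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr).predict(X[te])
            scores.append(f1_score(y[te], pred))
        best = max(best, (float(np.mean(scores)), level))
    return best
```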


2007
David A. Cieslak and Nitesh V. Chawla (December 2007).
Detecting Fracture Points in Classifier Performance. A fundamental tenet assumed by many classification algorithms is the presumption that both training and testing samples are drawn from the same distribution of data - this is the stationary distribution assumption. This entails that the past is strongly indicative of the future. However, in real world applications, many factors may alter the One True Model responsible for generating the data distribution both significantly and subtly. In circumstances violating the stationary distribution assumption, traditional validation schemes such as ten-folds and hold-out become poor performance predictors and classifier rankers. Thus, it becomes critical to discover the fracture points in classifier performance by discovering the divergence between populations. In this paper, we implement a comprehensive evaluation framework to identify bias, enabling selection of a 'correct' classifier given the sample bias. To thoroughly evaluate the performance of classifiers within biased distributions, we consider the following three scenarios: missing completely at random (akin to stationary); missing at random; and missing not at random. The latter reflects the canonical sample selection bias problem.
IEEE International Conference on Data Mining (ICDM), Omaha, NE.
PDF
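One simple test from the kind of battery such a framework applies: a per-feature two-sample Kolmogorov-Smirnov comparison between the data a classifier was trained on and the data it now scores. The synthetic shift below is fabricated for the demo; the paper's full set of tests and bias scenarios is not reproduced.

```python
import numpy as np
from scipy.stats import ks_2samp

def divergent_features(X_train, X_recent, alpha=0.01):
    """Return (feature index, KS statistic, p-value) for features whose
    distribution appears to have shifted between the two samples."""
    flagged = []
    for j in range(X_train.shape[1]):
        stat, p = ks_2samp(X_train[:, j], X_recent[:, j])
        if p < alpha:
            flagged.append((j, stat, p))
    return flagged

rng = np.random.default_rng(0)
X_old = rng.normal(size=(500, 3))
X_new = np.column_stack([rng.normal(size=500),               # unchanged
                         rng.normal(loc=1.0, size=500),      # mean shift
                         rng.normal(scale=3.0, size=500)])   # variance shift
print(divergent_features(X_old, X_new))   # features 1 and 2 should be flagged
```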
Alec Pawling, Nitesh V. Chawla and Gregory Madey (December 2007).
Anomaly Detection in Mobile Communication Networks. Mobile communication networks produce massive amounts of data which may be useful in identifying the location of an emergency situation and the area it affects. We propose a one-pass clustering algorithm for quickly identifying anomalous data points. We evaluate this algorithm's ability to detect outliers in a data set and describe how such an algorithm may be used as a component of an emergency response management system.
Computational and Mathematical Organizational Theory, 13(4), 407-422.
PDF
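A toy single-pass clusterer in the spirit of the abstract: each incoming record either updates its nearest centroid or, if it is too far from every existing cluster, is flagged and seeds a new one. The fixed radius and Euclidean distance are assumptions for this sketch, not the paper's algorithm.

```python
import numpy as np

class OnePassAnomalyDetector:
    def __init__(self, radius=3.0):
        self.radius = radius
        self.centroids, self.counts = [], []

    def process(self, x):
        """Return True when the point looks anomalous (it starts a new cluster)."""
        x = np.asarray(x, dtype=float)
        if self.centroids:
            dists = [np.linalg.norm(x - c) for c in self.centroids]
            i = int(np.argmin(dists))
            if dists[i] <= self.radius:
                self.counts[i] += 1
                # incremental mean update keeps the pass single and cheap
                self.centroids[i] += (x - self.centroids[i]) / self.counts[i]
                return False
        self.centroids.append(x.copy())
        self.counts.append(1)
        return True

detector = OnePassAnomalyDetector(radius=2.0)
stream = np.random.default_rng(0).normal(size=(200, 2))
flags = [detector.process(point) for point in stream]
```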
Vince Thomas, Nitesh V. Chawla, Kevin W. Bowyer and Patrick J. Flynn (September 2007).
Learning to Predict Gender from Iris Images. This paper employs machine learning techniques to develop models that predict gender based on the iris texture features. While there is a large body of research that explores biometrics as a means of verifying identity, there has been very little work done to determine if biometric measures can be used to determine specific human attributes. If it is possible to discover such attributes, they would be useful in situations where a biometric system fails to identify an individual that has not been enrolled, yet still needs to be identified. The iris was selected as the biometric to analyze for two major reasons: (1) quality methods have already been developed to segment and encode an iris image, (2) current iris encoding methods are conducive to selecting and extracting attributes from an iris texture and creating a meaningful feature vector.
IEEE Conference on Biometrics: Theory, Applications and Systems (BTAS), Washington, DC.
PDF
Nitesh V. Chawla and Kevin W. Bowyer (July 2007).
Actively Exploring Creation of Face Spaces for Improved Face Recognition. We propose a learning framework that actively explores creation of face space(s) by selecting images that are complementary to the images already represented in the face space. We also construct ensembles of classifiers learned from such actively sampled image sets, which further provides improvement in the recognition rates. We not only significantly reduce the number of images required in the training set but also improve the accuracy over learning from all the images. We also show that the single face space or ensemble of face spaces, thus constructed, has a higher generalization performance across different illumination and expression conditions.
Conference on Artificial Intelligence (AAAI), Vancouver, Canada.
PDF
Michael J. Chapple, Nitesh V. Chawla and Aaron Striegel (June 2007).
Authentication Anomaly Detection: A Case Study on a Virtual Private Network. The authentication logs on a network can provide a trove of information for discovering potential anomalies in login attempts. Using such logs collected by a production Virtual Private Network device over a period of 15 months, we generate a diurnal model of network accesses. These models are used to detect anomalous authentications, which merit further investigation by a security analyst. We intend that this work will dramatically reduce the amount of time spent by analysts identifying anomalous events and allow them to focus on in-depth analysis of these anomalies. Our work makes two contributions: a novel approach of mining authentication data, and the use of geographic distance as a metric to evaluate Virtual Private Network connections. We demonstrate the success of our model using real-world case analysis.
ACM SIGMETRICS Workshop on Mining Network Data (MineNet), San Diego, CA.
PDF
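The geographic distance mentioned in the abstract can be computed, for example, with the haversine great-circle formula between two login locations. This is a generic sketch, not the authors' system; the coordinates and the anomaly threshold below are made-up values.

import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# flag a login whose distance from the user's previous login is implausible
prev = (41.70, -86.24)   # hypothetical previous login location
curr = (52.52, 13.40)    # hypothetical current login location
if haversine_km(*prev, *curr) > 5000:
    print("distance-based anomaly: review this authentication")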
Gregory R. Madey, Albert-László Barabási, Nitesh V. Chawla, Marta Gonzales, David Hachen, Brett Lantz, Alec Pawling, Timothy W. Schoenharl, Gábor Szabó, Pu Wang, Ping Yan (May 2007).
Enhanced Situational Awareness: Application of DDDAS Concepts to Emergency and Disaster Management. We describe a prototype emergency and disaster information system designed and implemented using DDDAS concepts. The system is designed to use real-time cell phone calling data from a geographical region, including calling activity - who calls whom, call duration, services in use, and cell phone location information - to provide enhanced situational awareness for managers in emergency operations centers (EOCs) during disaster events. Powered-on cell phones maintain contact with one or more within-range cell towers so as to receive incoming calls. Thus, location data about all phones in an area are available, either directly from GPS equipped phones, or by cell tower, cell sector, distance from tower and triangulation methods. This permits the cell phones of a geographical region to serve as an ad hoc mobile sensor net, measuring the movement and calling patterns of the population. A prototype system, WIPER, serves as a test bed to research open DDDAS design issues, including dynamic validation of simulations, algorithms to interpret high volume data streams, ensembles of simulations, runtime execution, middleware services, and experimentation frameworks.
International Conference on Computational Science (ICCS), Beijing, China.
PDF
Nitesh V. Chawla and Jared Sylvester (May 2007).
Exploiting Diversity in Ensembles: Improving Performance on Unbalanced Datasets. Ensembles are often capable of greater predictive performance than any of their individual classifiers. Despite the need for classifiers to make different kinds of errors, the majority voting scheme, typically used, treats each classifier as though it contributed equally to the group's performance. This can be particularly limiting on unbalanced datasets, as one is more interested in complementing classifiers that can assist in improving the true positive rate without significantly increasing the false positive rate. Therefore, we implement a genetic algorithm based framework to weight the contribution of each classifier by an appropriate fitness function, such that the classifiers that complement each other on the unbalanced dataset are preferred, resulting in significantly improved performances. The proposed framework can be built on top of any collection of classifiers with different fitness functions.
International Workshop on Multiple Classifier Systems (MCS), Prague, Czech Republic.
PDF
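A minimal sketch of the core idea, weighting member votes and scoring a candidate weight vector with an imbalance-aware fitness function on validation data, is below. It uses a simple mutate-and-select loop rather than a full genetic algorithm, and the fitness function (true positive rate minus false positive rate) is only an illustrative stand-in.

import numpy as np

def fitness(weights, member_preds, y_true):
    """Score a weight vector: weighted-vote predictions compared to labels
    using TPR - FPR as a stand-in for an imbalance-aware fitness."""
    votes = (weights[:, None] * member_preds).sum(axis=0)
    y_hat = (votes >= weights.sum() / 2).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    pos, neg = np.sum(y_true == 1), np.sum(y_true == 0)
    return tp / max(pos, 1) - fp / max(neg, 1)

def evolve_weights(member_preds, y_true, pop=30, gens=50, rng=None):
    """Toy (mu + lambda)-style search: keep the best half of the population
    each generation and refill it with mutated copies."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = member_preds.shape[0]
    population = rng.random((pop, n))
    for _ in range(gens):
        scores = np.array([fitness(w, member_preds, y_true) for w in population])
        elite = population[np.argsort(scores)[-pop // 2:]]
        children = elite + rng.normal(0, 0.1, elite.shape)
        population = np.clip(np.vstack([elite, children]), 0, 1)
    scores = np.array([fitness(w, member_preds, y_true) for w in population])
    return population[int(np.argmax(scores))]

# toy usage: 3 classifiers' 0/1 predictions on 8 validation examples
preds = np.array([[1, 0, 1, 1, 0, 0, 1, 0],
                  [1, 1, 1, 0, 0, 0, 0, 0],
                  [0, 0, 1, 1, 0, 1, 1, 0]])
labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])
print(evolve_weights(preds, labels))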
Tanu Malik, Randal Burns and Nitesh V. Chawla (January 2007).
A Black-Box Approach to Query Cardinality Estimation. We present a 'black-box' approach to estimating query cardinality that has no knowledge of query execution plans and data distribution, yet provides accurate estimates. It does so by grouping queries into syntactic families and learning the cardinality distribution of that group directly from points in a high-dimensional input space constructed from the query's attributes, operators, function arguments, aggregates, and constants. We envision an increasing need for such an approach in applications in which query cardinality is required for resource optimization and decision-making at locations that are remote from the data sources. Our primary case study is the Open SkyQuery federation of Astronomy archives, which uses a scheduling and caching mechanism at the mediator for execution of federated queries at remote sources. Experiments using real workloads show that the black-box approach produces accurate estimates and is frugal in its use of space and in computation resources. Also, the black-box approach provides dramatic improvements in the performance of caching in Open SkyQuery.
Conference on Innovative Data Systems Research (CIDR).
PDF


2006
Tanu Malik, Randal Burns, Nitesh V. Chawla and Alexander S. Szalay (November 2006).
Estimating Query Result Sizes for Proxy Caching in Scientific Database Federations. In a proxy cache for federations of scientific databases it is important to estimate the size of a query before making a caching decision. With accurate estimates, near-optimal cache performance can be obtained. On the other extreme, inaccurate estimates can render the cache totally ineffective.
We present classification and regression over templates (CAROT), a general method for estimating query result sizes, which is suited to the resource-limited environment of proxy caches and the distributed nature of database federations. CAROT estimates query result sizes by learning the distribution of query results, not by examining or sampling data, but from observing workload. We have integrated CAROT into the proxy cache of the National Virtual Observatory (NVO) federation of astronomy databases. Experiments conducted in the NVO show that CAROT dramatically outperforms conventional estimation techniques and provides near-optimal cache performance.

ACM/IEEE Supercomputing, Tampa, FL.
PDF
Yang Liu, Nitesh V. Chawla, Mary P. Harper, Elizabeth Shriberg and Andreas Stolcke (October 2006).
A Study in Machine Learning from Imbalanced Data for Sentence Boundary Detection in Speech. Enriching speech recognition output with sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system to detect sentence boundaries that uses both prosodic and textual information. Since there are more nonsentence boundaries than sentence boundaries in the data, the prosody model, which is implemented as a decision tree classifier, must be constructed to effectively learn from the imbalanced data distribution. To address this problem, we investigate a variety of sampling approaches and a bagging scheme. A pilot study was carried out to select methods to apply to the full NIST sentence boundary evaluation task across two corpora (conversational telephone speech and broadcast news speech), using both human transcriptions and recognition output. In the pilot study, when classification error rate is the performance measure, using the original training set achieves the best performance among the sampling methods, and an ensemble of multiple classifiers from different downsampled training sets achieves slightly poorer performance, but has the potential to reduce computational effort. However, when performance is measured using receiver operating characteristics (ROC) or area under the curve (AUC), then the sampling approaches outperform the original training set. This observation is important if the sentence boundary detection output is used by downstream language processing modules. Bagging was found to significantly improve system performance for each of the sampling methods. The gain from these methods may be diminished when the prosody model is combined with the language model, which is a strong knowledge source for the sentence detection task. The patterns found in the pilot study were replicated in the full NIST evaluation task. The conclusions may be dependent on the task, the classifiers, and the knowledge combination approach.
Journal of Computer Speech and Language, 20(4), 468-494.
PDF
Karsten Steinhaeuser, Nitesh V. Chawla and Peter M. Kogge (September 2006).
Exploiting Thread-Level Parallelism to Build Decision Trees. Classification is an important data mining task, and decision trees have emerged as a popular classifier due to their simplicity and relatively low computational complexity. However, as datasets get extremely large, the time required to build a decision tree still becomes intractable. Hence, there is an increasing need for more efficient tree-building algorithms. One approach to this problem involves using a parallel mode of computation. Prior work has successfully used processor-level parallelism to partition the data and computation. We propose to use Cray's Multi-Threaded Architecture (MTA) and extend the idea by employing thread-level parallelism to reduce the execution time of the tree building process. Decision tree building is well-suited for such low-level parallelism as it requires a large number of independent computations. In this paper, we present the analysis and parallel implementation of the ID3 algorithm, along with experimental results.
ECML/PKDD Workshop on Parallel and Distributed Data Mining, Berlin, Germany.
PDF
Dinesh Rajan, Christian Poellabauer and Nitesh V. Chawla (September 2006).
Resource Access Pattern Mining for Dynamic Energy Management. Energy management mechanisms in resource constrained environments such as mobile systems are typically based on the current status of various system parameters such as utilization and quality of service requirements. Any knowledge gained about future system resource usage and other significant dynamics in the system can further lead to increased optimizations in such mechanisms. Future knowledge about system resource utilization and dynamics can be derived from two separate sources: a) the information maintained in the system about its current utilization and resource accesses and, b) the application specific behavioral patterns of currently active applications. This work aims to employ known, efficient data mining techniques to exploit the latter class of information in achieving a suitable application behavior model that can provide accurate predictions of future application specific resource accesses and usage. Such formulated application behavioral patterns containing information about their resource usage, termed Resource Access Patterns, are then employed in a novel dynamic energy management scheme for a mobile system network resource. This work also shows that a data mining approach is beneficial in energy management decisions as it can capture all the underlying intricate interactions and dependencies that exist among the active applications and between the resources available in the system. The key components of this work can thus be summarized as: a) employing a data mining approach to characterize application behavior and, b) designing a novel energy management policy for a network interface device based on information about its future usage derived from the formulated application behavioral model.
ECML/PKDD Workshop on Autonomic Computing: A New Challenge for Machine Learning, Berlin, Germany.
PDF
Alec Pawling, Nitesh V. Chawla and Amitabh Chaudhary (September 2006).
Evaluation of Summarization Schemes for Learning in Streams. Traditional discretization techniques for machine learning, from examples with continuous feature spaces, are not efficient when the data is in the form of a stream from an unknown, possibly changing, distribution. We present a time-and-memory-efficient discretization technique based on computing ε-approximate exponential frequency quantiles, and prove bounds on the worst-case error introduced in computing information entropy in data streams compared to an offline algorithm that has no efficiency constraints. We compare the empirical performance of the technique, using it for feature selection, with (streaming adaptations of) two popular methods of discretization, equal width binning and equal frequency binning, under a variety of streaming scenarios for real and artificial datasets. Our experiments show that ε-approximate exponential frequency quantiles are remarkably consistent in their performance, in contrast to the simple and efficient equal width binning, which performs quite well when the streams are from stationary distributions, and quite poorly otherwise.
European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), Berlin, Germany.
PDF
Nitesh V. Chawla and Xiangning Li (August 2006).
Pricing Scheme for Benefit Scoring. Data mining models often carry the final objective of maximizing profit or minimizing cost. This problem becomes even more profound in financial applications that can have multiple constraints, such as interest rate, score cut-off, and the loan amount to allocate. In this paper, we present a pricing framework for discovering the total profit from a probabilistic model, given a benefit function.
ACM SIGKDD Workshop on Utility Based Data Mining (UBDM), Philadelphia, PA.
PDF
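The pricing idea in the abstract amounts to weighting each predicted probability by a benefit (or cost) and summing; a toy sketch with invented benefit values, not the paper's benefit function, is shown below.

def expected_profit(probs, benefit_accept, cost_default, cutoff=0.5):
    """Accept a case when the predicted probability exceeds `cutoff`; each
    accepted case contributes p * benefit_accept + (1 - p) * cost_default."""
    total = 0.0
    for p in probs:
        if p >= cutoff:
            total += p * benefit_accept + (1 - p) * cost_default
    return total

scores = [0.95, 0.80, 0.40, 0.65]   # a model's repayment probabilities (illustrative)
print(expected_profit(scores, benefit_accept=100.0, cost_default=-500.0))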
Danny Roobaert, Grigoris Karakoulas and Nitesh V. Chawla (August 2006).
Information Gain, Correlation and Support Vector Machines. We report on our approach, CBAmethod3E, which was submitted to the NIPS 2003 Feature Selection Challenge on Dec. 8, 2003. Our approach consists of combining filtering techniques for variable selection, information gain and feature correlation, with Support Vector Machines for induction. We ranked 13th overall and ranked 6th as a group. It is worth pointing out that our feature selection method was very successful in selecting the second smallest set of features among the top-20 submissions, and in identifying almost all probes in the datasets, resulting in the challenge's best performance on the latter benchmark.
Feature Extraction: Foundations and Applications, Springer, 463-470.
PDF
Nitesh V. Chawla and David A. Cieslak (July 2006).
Evaluating Probability Estimates from Decision Trees. Decision trees, a popular choice for classification, are limited in providing good-quality probability estimates. Typically, smoothing methods such as Laplace or m-estimate are applied at the decision tree leaves to overcome the systematic bias introduced by the frequency-based estimates. An ensemble of decision trees has also been shown to help in reducing the bias and variance in the leaf estimates, resulting in better calibrated probabilistic predictions. In this work, we evaluate the calibration or quality of these estimates using various loss measures. We also examine the relationship between the quality of such estimates and resulting rank-ordering of test instances. Our results quantify the impact of smoothing in terms of the loss measures, and the coupled relationship with the AUC measure.
AAAI Workshop on Evaluation Methods for Machine Learning, Boston, MA.
PDF
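The Laplace and m-estimate corrections mentioned in the abstract are simple formulas applied to the class counts at each leaf; a short sketch follows, assuming binary classes.

def laplace_estimate(pos, neg):
    """Laplace-corrected P(positive) at a leaf with `pos`/`neg` counts."""
    return (pos + 1) / (pos + neg + 2)

def m_estimate(pos, neg, prior, m=10):
    """m-estimate: shrink the leaf frequency toward the class prior."""
    return (pos + m * prior) / (pos + neg + m)

# a pure but tiny leaf no longer claims probability 1.0
print(laplace_estimate(3, 0))        # 0.8
print(m_estimate(3, 0, prior=0.1))   # ~0.31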
Jared Sylvester and Nitesh V. Chawla (July 2006).
Evolutionary Ensemble Creation and Thinning. Ensembles are often capable of greater predictive accuracy than any of their individual members. One key attribute of ensembles' success is the notion of diversity. However, the majority voting scheme used in most ensembles treats each classifier as if it contributed equally to the group performance, without capitalizing on the relative improvement offered by each member of the ensemble. Our solution to this problem is to use genetic algorithms to weight the contribution of each classifier. This improves the performance of the ensemble by providing a weighted vote which maximizes collaboration among classifiers. Our approach provides a general-purpose framework for evolutionary ensembles, allowing them to build on top of any collection of classifiers, whether they be heterogeneous or homogeneous. In addition, the ability of our framework to thin ensembles, and its effect on ensemble diversity is presented.
IEEE International Joint Conference on Neural Networks (IJCNN), Vancouver, Canada.
PDF
Nitesh V. Chawla (July 2006).
Many Are Better Than One: Improving Probabilistic Estimates from Decision Trees. Decision trees, a popular choice for classification, are limited in providing probability estimates, requiring smoothing at the leaves. Typically, smoothing methods such as Laplace or m-estimate are applied at the decision tree leaves to overcome the systematic bias introduced by the frequency-based estimates. In this work, we show that an ensemble of decision trees significantly improves the quality of the probability estimates produced at the decision tree leaves. The ensemble overcomes the myopia of the leaf frequency based estimates. We show the effectiveness of the probabilistic decision trees as a part of the Predictive Uncertainty Challenge. We also include three additional highly imbalanced datasets in our study. We show that the ensemble methods significantly improve not only the quality of the probability estimates but also the AUC for the imbalanced datasets.
Machine Learning Challenges, Springer, 41-55.
PDF
Karsten Steinhaeuser, Nitesh V. Chawla and Christian Poellabauer (June 2006).
Towards Learning-Based Sensor Management. The management of wireless sensor networks in the presence of multiple constraints is an open problem in systems research. Existing methods perform well when optimized for a single parameter (such as energy, delay, network bandwidth). However, we might want to establish trade-offs on the fly, and optimize the information flow/exchange. This position paper shall serve as a preliminary proof-of-concept that techniques and algorithms from the machine learning and data mining domains can be applied to network data to learn relevant information about the routing behavior of individual nodes and the overall state of the network. We describe two simple examples which demonstrate the application of existing algorithms and analyze the results to illustrate their usefulness.
First Workshop on Tackling Computer Systems Problems with Machine Learning (SysML), Saint-Malo, France.
PDF
David A. Cieslak, Douglas Thain and Nitesh V. Chawla (June 2006).
Troubleshooting Distributed Systems via Data Mining. Through massive parallelism, distributed systems enable the multiplication of productivity. Unfortunately, increasing the scale of machines available to users also multiplies the debugging effort when failures occur. Data mining allows the extraction of patterns within large amounts of data and therefore forms the foundation for a useful method of debugging, particularly within such distributed systems. This paper outlines a successful application of data mining in troubleshooting distributed systems, proposes a framework for further study, and speculates on other future work.
IEEE International Symposium on High Performance Distributed Computing (HPDC), Paris, France.
PDF
Alec Pawling, Nitesh V. Chawla and Gregory Madey (June 2006).
Anomaly Detection in a Mobile Communication Network. Cell phone networks produce a massive volume of service usage data which, when combined with location data, can be used to pinpoint emergency situations that cause changes in network usage. Such a change may be the result of an increased number of people trying to call friends or family to tell them what is happening or a decrease in network usage caused by people being unable to use the network. Such events are anomalies and managing emergencies effectively requires identifying anomalies quickly. This problem is difficult due to the rate at which very large volumes of data are produced. In this paper, we discuss the use of data stream clustering algorithms for anomaly detection.
Annual Conference of the North American Association for Computational Social and Organizational Science (NAACSOS), Notre Dame, IN.
PDF
David A. Cieslak, Nitesh V. Chawla and Aaron Striegel (May 2006).
Combating Imbalance in Network Intrusion Data. An approach to combating network intrusion is the development of systems applying machine learning and data mining techniques. Many IDS (Intrusion Detection Systems) suffer from a high rate of false alarms and missed intrusions. We want to be able to improve the intrusion detection rate at a reduced false positive rate. The focus of this paper is rule-learning, using RIPPER, on highly imbalanced intrusion datasets with an objective to improve the true positive rate (intrusions) without significantly increasing the false positives. We use RIPPER as the underlying rule classifier. To counter imbalance in data, we implement a combination of oversampling (both by replication and synthetic generation) and undersampling techniques. We also propose a clustering based methodology for oversampling by generating synthetic instances. We evaluate our approaches on two intrusion datasets - destination and actual packets based - constructed from actual Notre Dame traffic, giving a flavor of real-world data with its idiosyncrasies. Using ROC analysis, we show that oversampling by synthetic generation of minority (intrusion) class outperforms oversampling by replication and RIPPER's loss ratio method. Additionally, we establish that our clustering based approach is more suitable for detecting intrusions and is able to provide additional improvement over just synthetic generation of instances.
IEEE International Conference on Granular Computing (GrC), Atlanta, GA.
PDF


2005
Alec Pawling, Nitesh V. Chawla and Amitabh Chaudhary (November 2005).
Computing Information Gain in Data Streams. Computing information gain in general data streams, in which we do not make any assumptions on the underlying distributions or domains, is a hard problem, severely constrained by the limitations on memory space. We present a simple randomized solution to this problem that is time and space efficient as well as tolerates a relative error that has a theoretical upper bound. It is based on a novel method of discretization of continuous domains using quantiles. Our empirical evaluation of the technique, using standard and simulated datasets, convincingly demonstrates its practicality and robustness. Our results include accuracy versus memory usage plots and comparisons with a popular discretization technique.
IEEE ICDM Workshop on Temporal Data Mining, Houston, TX.
PDF
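For reference, information gain over a discretized feature reduces to entropy bookkeeping over bin counts; a small batch (non-streaming) sketch is given below. The streaming, quantile-based version analyzed in the paper is not reproduced here.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, cut):
    """Gain of splitting a continuous feature at threshold `cut`."""
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

x = [0.1, 0.4, 0.35, 0.8, 0.9, 0.7]
y = [0, 0, 0, 1, 1, 1]
print(information_gain(x, y, cut=0.5))  # 1.0: this split separates the classes perfectly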
Nitesh V. Chawla and Kevin W. Bowyer (October 2005).
Ensembles in Face Recognition: Tackling the Extremes of High Dimensionality, Temporality, and Variance in Data. Random subspaces are a popular ensemble construction technique that improves the accuracy of weak classifiers. It has been shown, in different domains, that random subspaces combined with weak classifiers such as decision trees and nearest neighbor classifiers can provide an improvement in accuracy. In this paper, we apply the random subspace methodology to the 2-D face recognition task. The main goal of the paper is to see if the random subspace methodology can improve the performance of the face recognition system given the high dimensional data, temporal, and distribution variant data. We used two different datasets to evaluate the methodology. One dataset comprises completely unique subjects for testing, and the other comprises the same subjects (both in training and testing) but with the test-set images captured at different times under different conditions.
IEEE International Conference on Systems, Man and Cybernetics (SMC), Big Island, Hawaii.
PDF
Nitesh V. Chawla (September 2005).
Data Mining for Imbalanced Datasets: An Overview. A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years brought increased interest in applying machine learning techniques to difficult 'real-world' problems, many of which are characterized by imbalanced data. Additionally the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular choice for evaluating performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this Chapter, we discuss some of the sampling techniques used for balancing the datasets, and the performance measures more appropriate for mining imbalanced datasets.
Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, Springer, 853-867.
PDF
Nitesh V. Chawla, Lawrence O. Hall and Ajay Joshi (August 2005).
Wrapper-Based Computation and Evaluation of Sampling Methods for Imbalanced Datasets. Learning from imbalanced datasets presents an interesting problem both from modeling and economy standpoints. When the imbalance is large, classification accuracy on the smaller class(es) tends to be lower. In particular, when a class is of great interest but occurs relatively rarely such as cases of fraud, instances of disease, and regions of interest in large-scale simulations, it is important to accurately identify it. It then becomes more costly to misclassify the interesting class. In this paper, we implement a wrapper approach that computes the amount of under-sampling and synthetic generation of the minority class examples (SMOTE) to improve minority class accuracy. The f-value serves as the evaluation function. Experimental results show the wrapper approach is effective in optimization of the composite f-value, and reduces the average cost per test example for the datasets considered. We report both average cost per test example and the cost curves in the paper. The true positive rate of the minority class increases significantly without causing a significant change in the f-value. We also obtain the lowest cost per test example, compared to any result we are aware of for the KDD Cup-99 intrusion detection data set.
ACM SIGKDD Workshop on Utility-Based Data Mining (UBDM), Chicago, IL.
PDF
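A minimal wrapper-style sketch using scikit-learn: a grid search over oversampling and undersampling amounts, scored by F-measure on a validation split. This is a generic illustration under assumed parameter grids and synthetic data, not the paper's procedure; replication oversampling stands in for SMOTE to keep it short.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def resample(X, y, minority_mult, majority_frac, rng):
    """Replicate minority examples `minority_mult` times and keep a
    `majority_frac` fraction of majority examples."""
    min_idx = np.where(y == 1)[0]
    maj_idx = np.where(y == 0)[0]
    maj_keep = rng.choice(maj_idx, int(len(maj_idx) * majority_frac), replace=False)
    idx = np.concatenate([np.repeat(min_idx, minority_mult), maj_keep])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (rng.random(600) < 0.05).astype(int)   # ~5% minority class (synthetic data)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

best = None
for mult in (1, 2, 5, 10):                 # oversampling amounts to try
    for frac in (1.0, 0.5, 0.25):          # undersampling amounts to try
        Xr, yr = resample(X_tr, y_tr, mult, frac, rng)
        f1 = f1_score(y_val, DecisionTreeClassifier(random_state=0).fit(Xr, yr).predict(X_val))
        if best is None or f1 > best[0]:
            best = (f1, mult, frac)
print("best f-value %.3f with mult=%d, majority frac=%.2f" % best)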
Jared Sylvester and Nitesh V. Chawla (July 2005).
Evolutionary Ensembles: Combining Learning Agents Using Genetic Algorithms. Ensembles of classifiers are often used to achieve accuracy greater than any single classifier. The predictions of these classifiers are typically combined together by uniform or weighted voting. In this paper, we approach the ensembles construction under a multi-agent framework. Each individual agent is capable of learning from data, and the agents can either be homogenous (same learning algorithm) or heterogeneous (different learning algorithm). These learning agents are combined by a meta-agent that utilizes an evolutionary algorithm, using accuracy as the fitness score, for discovering the weights for each individual agent. The weights are indicative of the best searched combination (or collaboration) of the set of agents. Experimental results show that this approach is a valid model for ensemble building when compared to the best individual agent and a simple plurality vote of the agents.
AAAI Workshop on Multi-Agent Systems, Pittsburgh, PA.
PDF
Nitesh V. Chawla and Kevin W. Bowyer (June 2005).
Random Subspaces and Subsampling for 2-D Face Recognition. Random subspaces are a popular ensemble construction technique that improves the accuracy of weak classifiers. It has been shown, in different domains, that random subspaces combined with weak classifiers such as decision trees and nearest neighbor classifiers can provide an improvement in accuracy. In this paper, we apply the random subspace methodology to the 2-D face recognition task. The main goal of the paper is to see if the random subspace methodology can do as well, if not better, than the single classifier constructed on the tuned face space. We also propose the use of a validation set for tuning the face space, to avoid bias in the accuracy estimation. In addition, we also compare the random subspace methodology to an ensemble of subsamples of image data. This work shows that a random subspaces ensemble can outperform a well-tuned single classifier for a typical 2-D face recognition problem. The random subspaces approach has the added advantage of requiring less careful tweaking.
Computer Vision and Pattern Recognition, 2, 582-589.
PDF
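A compact sketch of the random subspace method with a nearest-neighbor base learner, as a general illustration of the ensemble construction rather than the face-space pipeline described in the paper; the toy data and parameters below are invented.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def random_subspace_ensemble(X_train, y_train, X_test, n_members=15,
                             subspace_frac=0.5, seed=0):
    """Train each member on a random subset of feature dimensions and
    combine the members by majority vote."""
    rng = np.random.default_rng(seed)
    d = X_train.shape[1]
    k = max(1, int(d * subspace_frac))
    votes = []
    for _ in range(n_members):
        dims = rng.choice(d, size=k, replace=False)
        member = KNeighborsClassifier(n_neighbors=1).fit(X_train[:, dims], y_train)
        votes.append(member.predict(X_test[:, dims]))
    votes = np.array(votes)
    # majority vote over members for each test example
    return np.array([np.bincount(col).argmax() for col in votes.T])

# toy usage with random 20-dimensional "feature vectors" for four "subjects"
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 20))
y = np.arange(40) % 4
print(random_subspace_ensemble(X, y, X[:5]))  # predicts labels of the first 5 rows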
Nitesh V. Chawla (June 2005).
Teaching Data Mining by Coalescing Theory and Applications. We report on our experience with the first departmental offering of the data mining course in Spring 2005. The course was cross-listed such that both upper-level undergraduates and graduate students could attend. However, the majority of the registered students were undergraduates. Data mining, being a confluence of multiple fields, offers an interesting addition to the computer science curriculum. The main objective of the course was to provide grounding in both the theoretical and practical aspects of data mining and machine learning. In addition, the course used concepts learned in various courses throughout the undergraduate degree. The course utilized a machine learning toolkit, Weka, by the University of Waikato, New Zealand. In this paper, we present the various components of the course, structure, innovative assignments and discussions, and the project life cycle.
International Conference on Frontiers in Education, Las Vegas, NV.
PDF
Nitesh V. Chawla and Kevin W. Bowyer (June 2005).
Designing Multiple Classifier Systems for Face Recognition. Face recognition systems often use different images of a subject for training and enrollment. Typically, one may use LDA using all the image samples or train a nearest neighbor classifier for each (separate) set of images. The latter can require that information about lighting or expression about each testing point be available. In this paper, we propose usage of different images in a multiple classifier systems setting. Our main goals are to see (1) what is the preferred use of different images? And (2) can the multiple classifiers generalize well enough across different kinds of images in the testing set, thus mitigating the need of the meta-information? We show that an ensemble of classifiers outperforms the single classifier versions without any tuning, and is as good as a single classifier trained on all the images and tuned on the test set.
International Workshop on Multiple Classifier Systems (MCS), Seaside, CA.
PDF
Daniel Mack, Nitesh V. Chawla and Gregory Madey (June 2005).
Activity Mining in Open Source Software. An Open Source software repository is a collection of various projects with varying levels of activities, participations, and downloads. In this paper, we attempt to categorize and mine activity, thus discovering success, by focusing on the Games group. However, we observe that there are a significant number of projects within the Games category that are inactive or have no ranking assigned by Sourceforge. So, we first segmented our dataset based on just the ranking or activity assigned. We then constructed a set of characteristics, such as number of developers, number of posters, and number of downloads, to identify associations, activity, and relationships in the smaller active set. We used regression rules and association rules to discover the associations and relationships.
Annual Conference of the North American Association for Computational Social and Organizational Science (NAACSOS), Notre Dame, IN.
PDF
Nitesh V. Chawla and Grigoris J. Karakoulas (March 2005).
Learning from Labeled and Unlabeled Data: An Empirical Study Across Techniques and Domains. There has been increased interest in devising learning techniques that combine unlabeled data with labeled data - i.e. semi-supervised learning. However, to the best of our knowledge, no study has been performed across various techniques and different types and amounts of labeled and unlabeled data. Moreover, most of the published work on semi-supervised learning techniques assumes that the labeled and unlabeled data come from the same distribution. It is possible for the labeling process to be associated with a selection bias such that the distributions of data points in the labeled and unlabeled sets are different. Not correcting for such bias can result in biased function approximation with potentially poor performance. In this paper, we present an empirical study of various semi-supervised learning techniques on a variety of datasets. We attempt to answer various questions such as the effect of independence or relevance amongst features, the effect of the size of the labeled and unlabeled sets and the effect of noise. We also investigate the impact of sample-selection bias on the semi-supervised learning techniques under study and implement a bivariate probit technique particularly designed to correct for such bias.
Journal of Artificial Intelligence Research, 23, 331-366.
PDF


2004
Nitesh V. Chawla, Lawrence O. Hall, Kevin W. Bowyer and W. Philip Kegelmeyer (December 2004).
Learning Ensembles from Bites: A Scalable and Accurate Approach. Bagging and boosting are two popular ensemble methods that typically achieve better accuracy than a single classifier. These techniques have limitations on massive data sets, because the size of the data set can be a bottleneck. Voting many classifiers built on small subsets of data ('pasting small votes') is a promising approach for learning from massive data sets, one that can utilize the power of boosting and bagging. We propose a framework for building hundreds or thousands of such classifiers on small subsets of data in a distributed environment. Experiments show this approach is fast, accurate, and scalable.
Journal of Machine Learning Research, 5, 421-451.
PDF
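The core of this line of work, building many classifiers on small disjoint subsets and combining them by voting, can be sketched as follows. This is a generic single-machine illustration with scikit-learn decision trees, not the distributed implementation in the paper.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def disjoint_partition_committee(X, y, n_parts=10, seed=0):
    """Shuffle the data, split it into `n_parts` disjoint partitions, and
    train one decision tree per partition."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    members = []
    for idx in np.array_split(order, n_parts):
        members.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return members

def committee_predict(members, X):
    """Plurality vote of the committee."""
    votes = np.array([m.predict(X) for m in members]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])

# toy usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
committee = disjoint_partition_committee(X, y)
print(committee_predict(committee, X[:10]))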
Steven Eschrich, Nitesh V. Chawla and Lawrence O. Hall (November 2004).
Learning to Predict in Complex Biological Domains. Protein secondary structure prediction and high-throughput drug screen data mining are two important applications in bioinformatics. The data is represented in sparse feature spaces and can be unrepresentative of future data. There is certainly some noise in the data and there may be significant noise. Supervised learners in this context will display their inherent bias toward certain solutions, generally solutions that fit the training set well. In this paper, we first describe an ensemble approach using subsampling that scales well with dataset size. A sufficient number of ensemble members using subsamples of the data can yield a more accurate classifier than a single classifier using the entire dataset. Experiments on several datasets demonstrate the effectiveness of the approach. We report results from the KDD Cup 2001 drug discovery dataset in which our approach yields a higher weighted accuracy than the winning entry. We then extend our ensemble approach to create an over-generalized classifier for prediction by reducing the individual subsample size. The ensemble strategy using small subsamples has the effect of averaging over a wider range of hypotheses. We show that both protein secondary structure prediction and drug discovery prediction can be improved by the use of over-generalization, specifically through the use of ensembles of small subsamples.
Journal of System Simulation, 14(11), 1464-1471.
no pdf
Predrag Radivojac, Nitesh V. Chawla, A. Keith Dunker and Zoran Obradovic (August 2004).
Classification and Knowledge Discovery in Protein Databases. We consider the problem of classification in noisy, high-dimensional, and class-imbalanced protein datasets. In order to design a complete classification system, we use a three-stage machine learning framework consisting of a feature selection stage, a method addressing noise and class-imbalance, and a method for combining biologically related tasks through a prior-knowledge based clustering. In the first stage, we employ Fisher's permutation test as a feature selection filter. Comparisons with the alternative criteria show that it may be favorable for typical protein datasets. In the second stage, noise and class imbalance are addressed by using minority class over-sampling, majority class under-sampling, and ensemble learning. The performance of logistic regression models, decision trees, and neural networks is systematically evaluated. The experimental results show that in many cases ensembles of logistic regression classifiers may outperform more expressive models due to their robustness to noise and low sample density in a high-dimensional feature space. However, ensembles of neural networks may be the best solution for large datasets. In the third stage, we use prior knowledge to partition unlabeled data such that the class distributions among non-overlapping clusters significantly differ. In our experiments, training classifiers specialized to the class distributions of each cluster resulted in a further decrease in classification error.
Journal of Biomedical Informatics, 37(4), 224-239.
PDF
Nitesh V. Chawla, Nathalie Japkowicz and Aleksander Kolcz (June 2004).
Learning From Imbalanced Datasets. The purpose of this special issue is to communicate and present some of the latest research carried out in this area while reviewing other important recent developments in the field. In this Editorial, we begin by reviewing the class imbalance as well as an array of general solutions that were previously proposed to deal with it. We then discuss the progression of ideas starting at the 2000 workshop to today. In order to give a comprehensive picture of the state of the art in the field, we give a short overview of the papers that were presented at the 2003 workshop as well as a short description of the papers contained in this volume.
ACM SIGKDD Explorations, 6(1), 1-6.
PDF


2003
Nitesh V. Chawla, Grigoris Karakoulas and Danny Roobaert (December 2003).
Lessons Learned from the NIPS Feature Selection Challenge. The purpose of this paper is to provide insight on the performance of different feature selection techniques and learning algorithms that we used on the five datasets of the competition. As part of our participation (CBAgroup) we considered filtering and wrapper feature selection techniques, combined with different learning algorithms. In terms of feature selection techniques, we used information gain, Relief-f, linear SVM together with forward selection and a genetic algorithm. In terms of inductive learning techniques we used a proprietary Bayesian learning algorithm as well as different types of hyperparameter-tuning algorithms for (standard and Bayesian) SVMs with linear and RBF kernel. We provide an evaluation of these techniques on the datasets.
NIPS Workshop on Feature Selection, Vancouver, Canada.
PDF
Nitesh V. Chawla, Aleksandar Lazarevic, Lawrence O. Hall and Kevin W. Bowyer (September 2003)
SMOTEBoost: Improving the Prediction of the Minority Class in Boosting. Many real world data mining applications involve learning from imbalanced data sets. Learning from data sets that contain very few instances of the minority (or interesting) class usually produces biased classifiers that have a higher predictive accuracy over the majority class(es), but poorer predictive accuracy over the minority class. SMOTE (Synthetic Minority Over-sampling TEchnique) is specifically designed for learning from imbalanced data sets. This paper presents a novel approach for learning from imbalanced data sets, based on a combination of the SMOTE algorithm and the boosting procedure. Unlike standard boosting where all misclassified examples are given equal weights, SMOTEBoost creates synthetic examples from the rare or minority class, thus indirectly changing the updating weights and compensating for skewed distributions. SMOTEBoost applied to several highly and moderately imbalanced data sets shows improvement in prediction performance on the minority class and overall improved F-values.
European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), Cavtat-Dubrovnik, Croatia.
PDF
Nitesh V. Chawla (August 2003).
C4.5 and Imbalanced Data Sets: Investigating the Effect of Sampling Method, Probabilistic Estimate, and Decision Tree Structure. Imbalanced data sets are becoming ubiquitous, as many applications have very few instances of the 'interesting' or 'abnormal' class. Traditional machine learning algorithms can be biased towards majority class due to over-prevalence. It is desired that the interesting (minority) class prediction be improved, even if at the cost of additional majority class errors. In this paper, we study three issues, usually considered separately, concerning decision trees and imbalanced data sets -- quality of probabilistic estimates, pruning, and the effect of preprocessing the imbalanced data set by over or undersampling methods such that a fairly balanced training set is provided to the decision trees. We consider each issue independently and in conjunction with each other, highlighting the scenarios where one method might be preferred over another for learning decision trees from imbalanced data sets.
ICML Workshop on Learning from Imbalanced Data Sets II, Washington, DC.
PDF
Nitesh V. Chawla, Thomas E. Moore, Lawrence O. Hall, Kevin W. Bowyer, W. Philip Kegelmeyer and Clayton Springer (January 2003).
Distributed Learning with Bagging-Like Performance. Bagging forms a committee of classifiers by bootstrap aggregation of training sets from a pool of training data. A simple alternative to bagging is to partition the data into disjoint subsets. Experiments with decision tree and neural network classifiers on various datasets show that, given the same size partitions and bags, disjoint partitions result in performance equivalent to, or better than, bootstrap aggregates (bags). Many applications (e.g., protein structure prediction) involve use of datasets that are too large to handle in the memory of the typical computer. Hence, bagging with samples the size of the data is impractical. Our results indicate that, in such applications, the simple approach of creating a committee of n classifiers from disjoint partitions each of size 1/n (which will be memory resident during learning) in a distributed way results in a classifier which has a bagging-like performance gain. The use of distributed disjoint partitions in learning is significantly less complex and faster than bagging.
Pattern Recognition Letters, 24(1), 455-471.
PDF


2002
Steven Eschrich, Nitesh V. Chawla and Lawrence O. Hall (July 2002).
Generalization Methods in Bioinformatics. Protein secondary structure prediction and high-throughput drug screen data mining are two important applications in bioinformatics. The data is represented in sparse feature spaces and can be unrepresentative of future data. Supervised learners in this context will display their inherent bias toward certain solutions, generally solutions that fit the training set well. In this paper, we first describe an ensemble approach using subsampling that scales well with dataset size. A sufficient number of ensemble members using subsamples of the data can yield a more accurate classifier than a single classifier using the entire dataset. Experiments on several datasets demonstrate the effectiveness of the approach. We report results from the KDD Cup 2001 drug discovery dataset in which our approach yields a higher weighted accuracy than the winning entry. We then extend our ensemble approach to create an over-generalized classifier for prediction by reducing the individual subsample size. The ensemble strategy using small subsamples has the effect of averaging over a wider range of hypotheses. We show that both protein secondary structure prediction and drug discovery prediction can be improved by the use of over-generalization, specifically through the use of ensembles of small subsamples.
ACM SIGKDD Workshop on Data Mining in Bioinformatics (BIOKDD), Edmonton, Canada.
PDF
Nitesh V. Chawla, Kevin W. Bowyer, Thomas E. Moore and Philip Kegelmeyer (June 2002).
SMOTE: Synthetic Minority Over-Sampling Technique. An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of 'normal' examples with only a small percentage of 'abnormal' or 'interesting' examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
Journal of Artificial Intelligence Research, 16, 321-357.
PDF
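The synthetic-example construction at the heart of SMOTE, interpolating between a minority example and one of its minority-class nearest neighbors, can be sketched in a few lines. This is a simplified illustration of the published algorithm, not the original implementation; libraries such as imbalanced-learn provide production versions.

import numpy as np

def smote(X_minority, n_synthetic, k=5, seed=0):
    """Generate `n_synthetic` synthetic minority examples, each lying on the
    segment between a random minority example and one of its k nearest
    minority-class neighbors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    # pairwise distances within the minority class
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]     # k nearest neighbors per example
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X))
        j = neighbors[i, rng.integers(min(k, len(X) - 1))]
        gap = rng.random()                       # interpolation fraction in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

# toy usage: 20 minority examples oversampled by 200%
minority = np.random.default_rng(1).normal(size=(20, 3))
print(smote(minority, n_synthetic=40).shape)  # (40, 3)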
Nitesh V. Chawla, Lawrence O. Hall, Kevin W. Bowyer, Thomas E. Moore, W. Philip Kegelmeyer (June 2002).
Distributed Pasting of Small Votes. Bagging and boosting are two popular ensemble methods that achieve better accuracy than a single classifier. These techniques have limitations on massive datasets, as the size of the dataset can be a bottleneck. Voting many classifiers built on small subsets of data ('pasting small votes') is a promising approach for learning from massive datasets. Pasting small votes can utilize the power of boosting and bagging, and potentially scale up to massive datasets. We propose a framework for building hundreds or thousands of such classifiers on small subsets of data in a distributed environment. Experiments show this approach is fast, accurate, and scalable to massive datasets.
International Workshop on Multiple Classifier Systems (MCS), Cagliari, Italy.
PDF


2001
Nitesh V. Chawla, Steven Eschrich and Lawrence O. Hall (November 2001).
Creating Ensembles of Classifiers. Ensembles of classifiers offer promise in increasing overall classification accuracy. The availability of extremely large datasets has opened avenues for application of distributed and/or parallel learning to efficiently learn models of them. In this paper, distributed learning is done by training classifiers on disjoint subsets of the data. We examine a random partitioning method to create disjoint subsets and propose a more intelligent way of partitioning into disjoint subsets using clustering. It was observed that the intelligent method of partitioning generally performs better than random partitioning for our datasets. In both methods a significant gain in accuracy may be obtained by applying bagging to each of the disjoint subsets, creating multiple diverse classifiers. The significance of our finding is that a partition strategy for even small/moderate sized datasets when combined with bagging can yield better performance than applying a single learner using the entire dataset.
IEEE International Conference on Data Mining (ICDM), San Jose, CA.
PDF
Nitesh V. Chawla, Thomas E. Moore, Kevin W. Bowyer, Lawrence O. Hall, Clayton Springer and W. Philip Kegelmeyer (August 2001).
Investigation of Bagging-Like Effects and Decision Trees Versus Neural Nets in Protein Secondary Structure Prediction. In the Third Critical Assessment of Techniques for Protein Structure Prediction ('CASP-3') contest, the best performance was obtained with a classifier that uses neural networks, a window size of fifteen around a given amino acid, and a training set of about 299,186 amino acids. We set out to investigate the possibility of obtaining better performance by using a bagging-like committee of binary decision trees, created using an order-of-magnitude more training data. There are two main reasons to believe that it should be possible to obtain better performance in this way. One is that Jones did not use a committee of classifiers in CASP-3 (and used only a four-classifier committee in CASP-4), whereas bagging studies indicate that improvement plateaus in the range of thirty to fifty classifiers in a committee. A second is that, by using supercomputers available at the Sandia National Labs, it is feasible to use an order of magnitude more training data than was used by Jones. This paper reports on our experiences pursuing this line of research. We show that with 'large' data sets and with bag size equal to partition size, simple disjoint partitioning performs at least as well as standard bagging. Given large datasets, either outperforms a single classifier built on all the data. We also show that there are subtle differences in the operation of binary decision trees and neural networks for this problem. One difference is that the neural network seems less prone to 'over-learning' the 'easy' subset of the training data.
ACM SIGKDD Workshop on Data Mining in Bioinformatics (BIOKDD), San Francisco, CA.
PDF
Nitesh V. Chawla, Thomas E. Moore, Kevin W. Bowyer, Lawrence O. Hall, Clayton Springer and W. Philip Kegelmeyer (August 2001).
Bagging is a Small-Data-Set Phenomenon. Bagging forms a committee of classifiers by bootstrap aggregation of training sets from a pool of training data. A simple alternative to bagging is to partition the data into disjoint subsets. Experiments on various datasets show that, given the same size partitions and bags, the use of disjoint partitions results in better performance than the use of bags. Many applications (e.g., protein structure prediction) involve the use of datasets that are too large to handle in the memory of the typical computer. Our results indicate that, in such applications, the simple approach of creating a committee of classifiers from disjoint partitions is to be preferred over the more complex approach of bagging.
Computer Vision and Pattern Recognition, 2, 684-689.
PDF


2000
Kevin W. Bowyer, Lawrence O. Hall, Thomas E. Moore, Nitesh V. Chawla and W. Phillip Kegelmeyer (October 2000).
A Parallel Decision Tree Builder for Mining Very Large Visualization Datasets. Simulation problems in the DOE ASCI program generate visualization datasets more than a terabyte in size. The practical difficulties in visualizing such datasets motivate the desire for automatic recognition of salient events. We have developed a parallel decision tree classifier for use in this context. Comparisons to ScalParC, a previous attempt to build a fast parallelization of a decision tree classifier, are provided. Our parallel classifier executes on the 'ASCI Red' supercomputer. Experiments demonstrate that datasets too large to be processed on a single processor can be efficiently handled in parallel, and suggest that there need not be any decrease in accuracy relative to a monolithic classifier constructed on a single processor.
IEEE International Conference on Systems, Man and Cybernetics (SMC), Nashville, TN.
PDF
Lawrence O. Hall, Nitesh V. Chawla, Kevin W. Bowyer and W. Philip Kegelmeyer (January 2000).
Learning Rules from Distributed Data. In this paper a concern about the accuracy (as a function of parallelism) of a certain class of distributed learning algorithms is raised, and one proposed improvement is illustrated. We focus on learning a single model from a set of disjoint data sets, which are distributed across a set of computers. The model is a set of rules. The distributed data sets may be disjoint for any of several reasons. In our approach, the first step is to construct a rule set (model) for each of the original disjoint data sets. Then rule sets are merged until an eventual final rule set is obtained which models the aggregate data. We show that this approach compares to directly creating a rule set from the aggregate data and promises faster learning. Accuracy can drop off as the degree of parallelism increases. However, an approach has been developed to extend the degree of parallelism achieved before this problem takes over.
Large-Scale Parallel Data Mining, LNAI, Springer, 211-220.
PDF


1999
Lawrence O. Hall, Nitesh V. Chawla, Kevin W. Bowyer and W. Philip Kegelmeyer (August 1999).
Learning Rules from Distributed Data. In this paper a concern about the accuracy (as a function of parallelism) of a certain class of distributed learning algorithms is raised, and one proposed improvement is illustrated. We focus on learning a single model from a set of disjoint data sets, which are distributed across a set of computers. The model is a set of rules. The distributed data sets may be disjoint for any of several reasons. In our approach, the first step is to construct a rule set (model) for each of the original disjoint data sets. Then rule sets are merged until an eventual final rule set is obtained which models the aggregate data. We show that this approach compares to directly creating a rule set from the aggregate data and promises faster learning. Accuracy can drop off as the degree of parallelism increases. However, an approach has been developed to reduce this problem.
ACM SIGKDD Workshop on Large-Scale Parallel Data Mining, San Diego, CA.
PDF


1998
Lawrence O. Hall, Nitesh V. Chawla and Kevin W. Bowyer (August 1998).
Combining Decision Trees Learned in Parallel. Very large data sets may be utilized for visualization. To focus attention on the salient regions of a data set being visualized, it is useful to have information on the interesting regions of data. It is possible to learn the salience of regions of data, but it is very slow, if possible at all, to do so serially on currently available terabyte-plus datasets. This paper describes an approach in which decision trees can be learned in parallel from disjoint subsets of a complete data set. The learned decision trees are converted to rules and the rules are combined into a single rule set. The combination process is based on an approach, suggested in Williams' 1990 dissertation, in which rules that match one or more examples but assign them to different classes are resolved. Similar rules are also combined into more general rules. An alternate approach to combining the rule sets, based on the work of Provost and Hennessy (1996), is also discussed. Results on two small data sets indicate the decision tree to rules with rule conflict resolution approach has promise.
ACM SIGKDD Workshop on Distributed Data Mining, New York, NY.
PDF
Lawrence O. Hall, Nitesh V. Chawla and Kevin W. Bowyer (July 1998).
Decision Tree Learning on Very Large Data Sets. Consider a labeled data set of 1 terabyte in size. A salient subset might depend upon the user's interests. Clearly, browsing such a large data set to find interesting areas would be very time consuming. An intelligent agent which, for a given class of user, could provide hints on areas of the data that might interest the user would be very useful. Given large data sets having categories of salience for different user classes attached to the data in them, these labeled sets of data can be used to train a decision tree to label unseen data examples with a category of salience. The training set will be much larger than usual. This paper describes an approach to generating the rules for an agent from a large training set. A set of decision trees are built in parallel on tractable size training data sets which are a subset of the original data. Each learned decision tree will be reduced to a set of rules, conflicting rules resolved and the resultant rules merged into one set. Results from cross validation experiments on a data set suggest this approach may be effectively applied to large sets of data.
IEEE International Conference on Systems, Man and Cybernetics (SMC), San Diego, CA.
PDF


Labs
The DIAL Lab

Directed by Prof. Chawla, the DIAL (Data, Inference Analytics, and Learning) Lab in the College of Engineering is focused on Big Data. Our research addresses a diverse set of problems in Big Data and Analytics. Our fundamental work in data mining includes class imbalance (rare and extreme events), cost-sensitive learning, evaluation issues, streaming data, graph mining, network science, and parallel and distributed data mining. We work on a broad range of applications including social networks, healthcare analytics and informatics, systems biology, and climate and environmental science.

Website: The DIAL Lab
iCeNSA

Directed by Professor Chawla, iCeNSA is a multi-disciplinary University Research Institute organized around network and data science problems in social, biological, biochemical, physical, environmental, financial, organizational, technical and defense systems.

As a research center, iCeNSA (1) develops a systems-level understanding of the fundamental processes and mechanisms that underlie the structural, dynamical and functional properties of complex networks, and (2) develops and integrates novel mathematical and computational tools for network science.

Website: iCeNSA
Contact Information

iCeNSA, 384 Nieuwland Science Hall

Notre Dame, IN 46556

Email: nchawla [at] nd dot edu

Phone: (574) 631-1090

For meetings and appointments, please contact:
Ms. Jasmine Botello
abotello [at] nd dot edu
(574) 631-7095