Dr. Foster Provost
New York University
Abstract
Customer accounts are linked by communications and other transactions. Organizations are linked by joint activities. Text documents are hyperlinked. Such networked data create opportunities for learning and applying classification models. For example, a common and successful strategy for detecting fraud is to use transactions to link a questionable account to previous fraudulent activity. Document classification can be improved by considering hyperlink structure. Marketing can change dramatically when customer communication is taken into account. Classification with networked data has two special characteristics: (1) knowing the classifications of some entities in the network can improve the classification of others, and (2) very-high-cardinality categorical attributes (e.g., identifiers) can be used effectively in learned models. I will present NetKit, a toolkit to facilitate research on classification and learning with networked data. NetKit is based on a modular framework that allows components to be mixed and matched to form different network classification algorithms. I will demonstrate NetKit with a case study of univariate classification using networked data from several domains.
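
To make characteristic (1) concrete, here is a minimal Python sketch of one simple way known labels can inform the classification of a linked entity: a majority vote over an entity's labeled neighbors. It is only an illustration under stated assumptions, not NetKit's implementation or API. The function name `neighbor_vote`, the account identifiers, and the labels are all hypothetical.

```python
from collections import Counter


def neighbor_vote(edges, known_labels, node):
    """Return the most common label among `node`'s labeled neighbors.

    `edges` is a list of undirected (a, b) pairs and `known_labels` maps
    some nodes to class labels. Returns None if no neighbor is labeled.
    Illustrative only; this is not NetKit code.
    """
    votes = Counter()
    for a, b in edges:
        # Treat every edge as undirected: count the endpoint opposite `node`.
        if a == node and b in known_labels:
            votes[known_labels[b]] += 1
        elif b == node and a in known_labels:
            votes[known_labels[a]] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical toy data: accounts linked by transactions.
edges = [
    ("acct1", "acct2"),
    ("acct1", "acct3"),
    ("acct4", "acct1"),
    ("acct5", "acct6"),
]
known_labels = {
    "acct2": "fraud",
    "acct3": "fraud",
    "acct4": "legit",
    "acct6": "legit",
}

# acct1 is unlabeled; two of its three labeled neighbors are "fraud".
prediction = neighbor_vote(edges, known_labels, "acct1")
print(prediction)
```

In this sketch the prediction uses no attributes of the account itself, only the links and the labels already known elsewhere in the network.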