CaltechAUTHORS
  A Caltech Library Service

Data complexity in machine learning

Li, Ling and Abu-Mostafa, Yaser S. (2006) Data complexity in machine learning. Computer Science Technical Reports, 2006.004. California Institute of Technology, Pasadena, USA. (Unpublished) https://resolver.caltech.edu/CaltechCSTR:2006.004

Abstract

We investigate the role of data complexity in the context of binary classification problems. The universal data complexity is defined for a data set as the Kolmogorov complexity of the mapping enforced by the data set. It is closely related to several existing principles used in machine learning such as Occam's razor, the minimum description length, and the Bayesian approach. The data complexity can also be defined based on a learning model, which is more realistic for applications. We demonstrate the application of the data complexity in two learning problems, data decomposition and data pruning. In data decomposition, we illustrate that a data set is best approximated by its principal subsets which are Pareto optimal with respect to the complexity and the set size. In data pruning, we show that outliers usually have high complexity contributions, and propose methods for estimating the complexity contribution. Since in practice we have to approximate the ideal data complexity measures, we also discuss the impact of such approximations.
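The Pareto-optimality idea behind data decomposition can be illustrated with a minimal sketch. It assumes, as a reading of the abstract rather than the report's own definition, that a candidate subset is preferred when it is larger and has lower complexity, and that some complexity estimate is already available for each candidate. The function name `pareto_front` and the `(size, complexity)` pairs are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Each candidate subset is summarized as (size, complexity).
# Assumption: larger size is better, lower complexity is better.
Candidate = Tuple[int, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates."""
    front = []
    for i, (size_i, cost_i) in enumerate(candidates):
        dominated = any(
            # j is at least as good on both criteria ...
            size_j >= size_i and cost_j <= cost_i
            # ... and strictly better on at least one of them
            and (size_j > size_i or cost_j < cost_i)
            for j, (size_j, cost_j) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append((size_i, cost_i))
    return front


# Made-up numbers, purely illustrative (not results from the report).
example = [(1, 1.0), (2, 1.0), (2, 3.0), (3, 2.5), (4, 6.0)]
principal = pareto_front(example)
```

In this toy input, `(1, 1.0)` is dropped because `(2, 1.0)` covers more examples at the same complexity, and `(2, 3.0)` is dropped because `(2, 1.0)` has the same size at lower complexity. In practice the complexity values would themselves be approximations, which is the concern raised in the abstract's last sentence.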


Item Type: Report or Paper (Technical Report)
Group: Computer Science Technical Reports
Series Name: Computer Science Technical Reports
Issue or Number: 2006.004
DOI: 10.7907/Z9319SW2
Record Number: CaltechCSTR:2006.004
Persistent URL: https://resolver.caltech.edu/CaltechCSTR:2006.004
Official Citation: L. Li and Y. S. Abu-Mostafa. Data complexity in machine learning. Computer Science Technical Report CaltechCSTR:2006.004, California Institute of Technology, May 2006.
Usage Policy: You are granted permission for individual, educational, research and non-commercial reproduction, distribution, display and performance of this work in any format.
ID Code: 27081
Collection: CaltechCSTR
Deposited By: Imported from CaltechCSTR
Deposited On: 31 May 2006
Last Modified: 03 Oct 2019 03:20