
K-Boost: a Scalable Algorithm for High-Quality Clustering of Microarray Gene Expression Data

Motivation: Microarray technology for profiling gene expression levels is a popular tool in modern biological research. Applications range from tissue classification to the detection of metabolic networks, and from drug discovery to time-critical personalized medicine. As the data sets produced grow in size and complexity, their analysis is becoming problematic in terms of time/quality tradeoffs. Clustering genes with similar expression profiles is a key initial step for subsequent manipulations, and the increasing volumes of data to be analyzed require methods that are both efficient (completing an analysis in minutes rather than hours) and effective (identifying significant clusters with high biological correlations).

Results: In this paper we propose K-Boost, a novel clustering algorithm that combines the Furthest-Point-First (FPF) heuristic for solving the metric k-centers problem, a stability-based method for determining the number of clusters (i.e., the value of k), and a k-means-like cluster refinement. K-Boost is able to detect the optimal number of clusters to produce, and it scales to large data sets without sacrificing output quality, as measured by several internal and external criteria.
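To illustrate the first ingredient named in the abstract, the following is a minimal sketch of the generic Furthest-Point-First greedy heuristic for the metric k-centers problem. It is not the authors' K-Boost implementation: the function name `furthest_point_first`, the use of Euclidean distance, and the choice of the first input point as the initial center are all assumptions made for illustration, and the stability-based selection of k and the k-means-like refinement are not shown.

```python
import math


def furthest_point_first(points, k):
    """Greedy FPF heuristic for metric k-centers (illustrative sketch).

    Returns (centers, assignment): `centers` lists the indices of the chosen
    center points, and assignment[i] is the position in `centers` of the
    center closest to points[i].
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # Assumption: the first point is used as the initial (arbitrary) center.
    centers = [0]
    # dist[i] = distance from points[i] to its closest center chosen so far.
    # Assumption: Euclidean distance stands in for the metric.
    dist = [math.dist(p, points[0]) for p in points]
    assignment = [0] * len(points)

    while len(centers) < k:
        # The next center is the point furthest from its closest center.
        nxt = max(range(len(points)), key=dist.__getitem__)
        centers.append(nxt)
        # Reassign points that are closer to the new center.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < dist[i]:
                dist[i] = d
                assignment[i] = len(centers) - 1

    return centers, assignment
```

Each round costs one pass over the data, so selecting k centers takes O(nk) distance computations, which is what makes this kind of heuristic attractive for large data sets. A call such as `furthest_point_first(profiles, 3)`, where `profiles` is a hypothetical list of equal-length expression vectors, returns the chosen center indices together with a cluster label per gene.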


Authors: F. Geraci, M. Leoncini, M. Montangero, M. Pellegrini, M.E. Renda
IIT authors:

Manuela Montangero


Type: Technical reports, manuals, geological and thematic maps, and multimedia products
Subject area: Information Technology and Communication Systems
IIT technical reports 2007-TR-015