MPI für molekulare Genetik / Department of Computational Molecular Biology
|Large Scale Hierarchical Clustering of Protein Sequences|
|Authors:||Krause, Antje; Stoye, Jens; Vingron, Martin|
|Date of Publication (YYYY-MM-DD):||2005-01-22|
|Title of Journal:||BMC Bioinformatics|
|Copyright:||© 1999-2006 BioMed Central Ltd unless otherwise stated|
|Review Status:||not specified|
|Abstract / Description:||Background
Searching a biological sequence database with a query sequence to find homologues has become a routine operation in computational biology. In spite of the high degree of sophistication of currently available search routines, it is still virtually impossible to identify quickly and clearly the group of sequences to which a given query sequence belongs.
Results
We report on our developments in grouping all known protein sequences hierarchically into superfamily and family clusters. Our graph-based algorithms take into account the topology of the sequence space induced by the data itself to construct a biologically meaningful partitioning. We have applied our clustering procedures to a non-redundant set of about 1,000,000 sequences, resulting in a hierarchical clustering that is being made available for querying and browsing at http://systers.molgen.mpg.de/.
Conclusion
Comparisons with other widely used clustering methods on various data sets show the abilities and strengths of our clustering methods in producing a biologically meaningful grouping of protein sequences.
|Comment of the Author/Creator:||Methodology article|
|External Publication Status:||published|
|Communicated by:||Martin Vingron|
|Affiliations:||MPI für molekulare Genetik|
|External Affiliations:||Universität Bielefeld, Technische Fakultät, AG Genominformatik, Postfach 100131, 33501 Bielefeld, Germany;
TFH Wildau, Bahnhofstrasse 1, 15745 Wildau, Germany.