Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing

Wittkop T, Baumbach J, Lobo FP, Rahmann S (2007)
BMC Bioinformatics 8(1): 396.

Journal article | Published | English
 
Authors
Wittkop, Tobias; Baumbach, Jan; Lobo, Francisco P.; Rahmann, Sven
Abstract / Note
Background: Detecting groups of functionally related proteins from their amino acid sequence alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data that must be clustered efficiently, with constant or increased accuracy and at increased speed. Results: We advocate that the model of weighted cluster editing, also known as transitive graph projection, is well suited to protein clustering. We present the FORCE heuristic, which is based on transitive graph projection and clusters arbitrary sets of objects, given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (Spectral clustering, TribeMCL, GeneRAGE, Hierarchical clustering, and Affinity Propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192 187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet. Conclusion: FORCE is an applicable alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets are available at http://gi.cebitec.uni-bielefeld.de/comet/force/.
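The weighted cluster editing objective named in the abstract can be illustrated with a minimal sketch. It is not the FORCE heuristic itself and not the authors' Java code. It assumes a common formulation in which pairs with similarity above a threshold count as edges, deleting an edge costs its similarity minus the threshold, and inserting a missing edge costs the threshold minus its similarity. The function name `cluster_editing_cost`, the `threshold` parameter, and the toy similarity values are hypothetical.

```python
from itertools import combinations


def cluster_editing_cost(similarity, clusters, threshold):
    """Cost of the edge insertions and deletions implied by a clustering.

    similarity: dict mapping frozenset({u, v}) to a pairwise similarity.
    clusters:   list of disjoint sets of nodes (the proposed clustering).
    threshold:  similarity above which a pair counts as an edge (assumed).
    """
    label = {node: i for i, members in enumerate(clusters) for node in members}
    cost = 0.0
    for u, v in combinations(sorted(label), 2):
        # Pairs without an entry are treated as similarity 0.0 (sketch assumption).
        s = similarity.get(frozenset((u, v)), 0.0)
        same_cluster = label[u] == label[v]
        if same_cluster and s < threshold:
            cost += threshold - s  # a missing edge has to be inserted
        elif not same_cluster and s > threshold:
            cost += s - threshold  # an existing edge has to be deleted
    return cost


# Toy similarities for four hypothetical proteins A to D.
similarity = {
    frozenset(("A", "B")): 9.0,
    frozenset(("B", "C")): 8.0,
    frozenset(("A", "C")): 4.0,
    frozenset(("C", "D")): 6.0,
}
threshold = 5.0

cost_merged = cluster_editing_cost(similarity, [{"A", "B", "C"}, {"D"}], threshold)
cost_split = cluster_editing_cost(similarity, [{"A", "B"}, {"C", "D"}], threshold)
print(cost_merged, cost_split)
```

Under these assumptions the clustering {A, B, C}, {D} is cheaper than {A, B}, {C, D}, because cutting the strong B-C edge costs more than inserting the weak A-C edge and deleting C-D. A heuristic for this model searches for a clustering that keeps this total cost low.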
Publication year
2007
Journal title
BMC Bioinformatics
Volume
8
Issue
1
Article no.
396
ISSN
1471-2105
Page URI
https://pub.uni-bielefeld.de/record/1784011

Cite

Wittkop T, Baumbach J, Lobo FP, Rahmann S. Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics. 2007;8(1): 396.
Wittkop, T., Baumbach, J., Lobo, F. P., & Rahmann, S. (2007). Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics, 8(1), 396. https://doi.org/10.1186/1471-2105-8-396
Wittkop, Tobias, Baumbach, Jan, Lobo, Francisco P., and Rahmann, Sven. 2007. “Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing”. BMC Bioinformatics 8 (1): 396.
Wittkop, T., Baumbach, J., Lobo, F. P., and Rahmann, S. (2007). Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics 8:396.
Wittkop, T., et al., 2007. Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics, 8(1): 396.
T. Wittkop, et al., “Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing”, BMC Bioinformatics, vol. 8, 2007, 396.
Wittkop, T., Baumbach, J., Lobo, F.P., Rahmann, S.: Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics. 8, 396 (2007).
Wittkop, Tobias, Baumbach, Jan, Lobo, Francisco P., and Rahmann, Sven. “Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing”. BMC Bioinformatics 8.1 (2007): 396.
All files are available under the following license(s):
Copyright Statement:
This object is protected by copyright and/or related rights. [...]
Full text(s)
Access Level
OA Open Access
Last uploaded
2019-09-06T08:48:53Z
MD5 checksum
c3b69cf1dee5125ff903c896dd7c7fb9


25 citations in Europe PMC

Data provided by Europe PubMed Central.

Guiding biomedical clustering with ClustEval.
Wiwie C, Baumbach J, Röttger R., Nat Protoc 13(6), 2018
PMID: 29844526
Evaluation and improvements of clustering algorithms for detecting remote homologous protein families.
Bernardes JS, Vieira FR, Costa LM, Zaverucha G., BMC Bioinformatics 16, 2015
PMID: 25651949
Comparing the performance of biomedical clustering methods.
Wiwie C, Baumbach J, Röttger R., Nat Methods 12(11), 2015
PMID: 26389570
Networks' Characteristics Matter for Systems Biology.
Rider AK, Milenković T, Siwo GH, Pinapati RS, Emrich SJ, Ferdig MT, Chawla NV., Netw Sci (Camb Univ Press) 2(2), 2014
PMID: 26500772
Massive fungal biodiversity data re-annotation with multi-level clustering.
Vu D, Szöke S, Wiwie C, Baumbach J, Cardinali G, Röttger R, Robert V., Sci Rep 4, 2014
PMID: 25355642
A laboratory information management system for DNA barcoding workflows.
Vu TD, Eberhardt U, Szöke S, Groenewald M, Robert V., Integr Biol (Camb) 4(7), 2012
PMID: 22344310
DEFOG: discrete enrichment of functionally organized genes.
Wittkop T, Berman AE, Fleisch KM, Mooney SD., Integr Biol (Camb) 4(7), 2012
PMID: 22706384
GFam: a platform for automatic annotation of gene families.
Sasidharan R, Nepusz T, Swarbreck D, Huala E, Paccanaro A., Nucleic Acids Res 40(19), 2012
PMID: 22790981
PolyQ: a database describing the sequence and domain context of polyglutamine repeats in proteins.
Robertson AL, Bate MA, Androulakis SG, Bottomley SP, Buckle AM., Nucleic Acids Res 39(database issue), 2011
PMID: 21059684
Comprehensive cluster analysis with Transitivity Clustering.
Wittkop T, Emig D, Truss A, Albrecht M, Böcker S, Baumbach J., Nat Protoc 6(3), 2011
PMID: 21372810
Discovery and annotation of small proteins using genomics, proteomics, and computational approaches.
Yang X, Tschaplinski TJ, Hurst GB, Jawdy S, Abraham PE, Lankford PK, Adams RM, Shah MB, Hettich RL, Lindquist E, Kalluri UC, Gunter LE, Pennacchio C, Tuskan GA., Genome Res 21(4), 2011
PMID: 21367939
Ultra-fast sequence clustering from similarity networks with SiLiX.
Miele V, Penel S, Duret L., BMC Bioinformatics 12, 2011
PMID: 21513511
Assessing the functional coherence of modules found in multiple-evidence networks from Arabidopsis.
Lysenko A, Defoin-Platel M, Hassani-Pak K, Taubert J, Hodgman C, Rawlings CJ, Saqi M., BMC Bioinformatics 12, 2011
PMID: 21612636
Genome sequence of a mesophilic hydrogenotrophic methanogen Methanocella paludicola, the first cultivated representative of the order Methanocellales.
Sakai S, Takaki Y, Shimamura S, Sekine M, Tajima T, Kosugi H, Ichikawa N, Tasumi E, Hiraki AT, Shimizu A, Kato Y, Nishiko R, Mori K, Fujita N, Imachi H, Takai K., PLoS One 6(7), 2011
PMID: 21829548
clusterMaker: a multi-algorithm clustering plugin for Cytoscape.
Morris JH, Apeltsin L, Newman AM, Baumbach J, Wittkop T, Su G, Bader GD, Ferrin TE., BMC Bioinformatics 12, 2011
PMID: 22070249
Partitioning biological data with transitivity clustering.
Wittkop T, Emig D, Lange S, Rahmann S, Albrecht M, Morris JH, Böcker S, Stoye J, Baumbach J., Nat Methods 7(6), 2010
PMID: 20508635
Genome-wide comparative gene family classification.
Frech C, Chen N., PLoS One 5(10), 2010
PMID: 20976221
Genetic makeup of the Corynebacterium glutamicum LexA regulon deduced from comparative transcriptomics and in vitro DNA band shift assays.
Jochmann N, Kurze AK, Czaja LF, Brinkrolf K, Brune I, Hüser AT, Hansmeier N, Pühler A, Borovok I, Tauch A., Microbiology 155(pt 5), 2009
PMID: 19372162
Family classification without domain chaining.
Joseph JM, Durand D., Bioinformatics 25(12), 2009
PMID: 19478015
Force feature spaces for visualization and classification.
Veljkovic D, Robbins KA., Int Conf Digit Signal Process Proc 2008, 2008
PMID: 20676225

27 References

Data provided by Europe PubMed Central.

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ., Nucleic Acids Res. 25(17), 1997
PMID: 9254694
Exact and heuristic algorithms for weighted cluster editing.
Rahmann S, Wittkop T, Baumbach J, Martin M, Truss A, Bocker S., Comput Syst Bioinformatics Conf 6, 2007
PMID: 17951842
On best transitive approximations of simple graphs
Delvaux S, Horsten L., 2004
Cluster graph modification problems
Shamir R, Sharan R, Tsur D., 2004
ProClust: improved clustering of protein sequences with an extended graph-based approach.
Pipenbacher P, Schliep A, Schneckener S, Schonhuth A, Schomburg D, Schrader R., Bioinformatics 18 Suppl 2, 2002
PMID: 12386002
Large scale hierarchical clustering of protein sequences.
Krause A, Stoye J, Vingron M., BMC Bioinformatics 6, 2005
PMID: 15663796
Spectral clustering of protein sequences.
Paccanaro A, Casbon JA, Saqi MA., Nucleic Acids Res. 34(5), 2006
PMID: 16547200
An efficient algorithm for large-scale detection of protein families.
Enright AJ, Van Dongen S, Ouzounis CA., Nucleic Acids Res. 30(7), 2002
PMID: 11917018

Everitt BS., 1993
GeneRAGE: a robust algorithm for sequence clustering and domain detection.
Enright AJ, Ouzounis CA., Bioinformatics 16(5), 2000
PMID: 10871267
Clustering by passing messages between data points.
Frey BJ, Dueck D., Science 315(5814), 2007
PMID: 17218491
The COG database: an updated version includes eukaryotes.
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA., BMC Bioinformatics 4, 2003
PMID: 12969510
Graph drawing by force-directed placement
Fruchterman TMJ, Reingold EM., 1991
Protein complex prediction via cost-based clustering.
King AD, Przulj N, Jurisica I., Bioinformatics 20(17), 2004
PMID: 15180928
SCOP database in 2004: refinements integrate structure and sequence family data.
Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C, Murzin AG., Nucleic Acids Res. 32(Database issue), 2004
PMID: 14681400
The ASTRAL Compendium in 2004.
Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE., Nucleic Acids Res. 32(Database issue), 2004
PMID: 14681391
SCOP website
AUTHOR UNKNOWN, 0
ASTRAL website
AUTHOR UNKNOWN, 0
COG website
AUTHOR UNKNOWN, 0
COG sequences (FTP)
AUTHOR UNKNOWN, 0
CoryneRegNet: an ontology-based data warehouse of corynebacterial transcription factors and regulatory networks.
Baumbach J, Brinkrolf K, Czaja LF, Rahmann S, Tauch A., BMC Genomics 7, 2006
PMID: 16478536
CoryneRegNet 3.0--an interactive systems biology platform for the analysis of gene regulatory networks in corynebacteria and Escherichia coli.
Baumbach J, Wittkop T, Rademacher K, Rahmann S, Brinkrolf K, Tauch A., J. Biotechnol. 129(2), 2006
PMID: 17229482
Automated generation of search tree algorithms for hard graph modification problems
Gramm J, Guo J, Hüffner F, Niedermeier R., 2004
Graph-modeled data clustering: Exact algorithm for clique generation
Gramm J, Guo J, Hüffner F, Niedermeier R., 2005
The Cluster Editing Problem: Implementations and Experiments
Dehne F, Langston MA, Luo X, Pitre S, Shaw P, Zhang Y., 2006
CoryneRegNet website
AUTHOR UNKNOWN, 0
Fast index based algorithms and software for matching position specific scoring matrices.
Beckstette M, Homann R, Giegerich R, Kurtz S., BMC Bioinformatics 7, 2006
PMID: 16930469
Sources

PMID: 17941985