Clustering Molecular Dynamics Trajectories: 1. Characterizing the Performance of Different Clustering Algorithms

Jianyin Shao, Stephen W. Tanner, Nephi Thompson, and Thomas E. Cheatham, III*
Departments of Medicinal Chemistry, Pharmaceutics and Pharmaceutical Chemistry, and Bioengineering, College of Pharmacy, University of Utah, 2000 East 30 South, Skaggs Hall 201, Salt Lake City, Utah 84112
J. Chem. Theory Comput., 2007, 3 (6), pp 2312–2334
DOI: 10.1021/ct700119m
Publication Date (Web): October 6, 2007
Copyright © 2007 American Chemical Society

Abstract

Molecular dynamics simulation methods produce trajectories of atomic positions (and optionally velocities and energies) as a function of time and provide a representation of the sampling of a given molecule's energetically accessible conformational ensemble. As simulations on the 10−100 ns time scale become routine, with sampled configurations stored on the picosecond time scale, such trajectories contain large amounts of data. Data-mining techniques, like clustering, provide one means to group and make sense of the information in the trajectory. In this work, several clustering algorithms were implemented, compared, and utilized to understand MD trajectory data. The development of the algorithms into a freely available C code library and their application to a simple test example of random (or systematically placed) points in a 2D plane (where the pairwise metric is the distance between points) provide a means to understand the relative performance. Eleven different clustering algorithms were developed, ranging from top-down splitting (hierarchical) and bottom-up aggregating (including single-linkage edge joining, centroid-linkage, average-linkage, complete-linkage, centripetal, and centripetal-complete) to various refinement (means, Bayesian, and self-organizing maps) and tree (COBWEB) algorithms. Systematic testing in the context of MD simulation of various DNA systems (including DNA single strands and the interaction of a minor groove binding drug DB226 with a DNA hairpin) allows a more direct assessment of the relative merits of the distinct clustering algorithms. Additionally, means to assess the relative performance and differences between the algorithms, to dynamically select the initial cluster count, and to achieve faster data mining by “sieved clustering” were evaluated.
Overall, it was found that there is no one perfect “one size fits all” algorithm for clustering MD trajectories and that the results strongly depend on the choice of atoms for the pairwise comparison. Some algorithms tend to produce homogeneously sized clusters, whereas others have a tendency to produce singleton clusters. Issues related to the choice of pairwise metric, the clustering metrics, the atom selection used for the comparison, and the relative performance of the algorithms are discussed. Overall, the best performance was observed with the average-linkage, means, and SOM algorithms. If the cluster count is not known in advance, the hierarchical or average-linkage clustering algorithms are recommended. Although these algorithms perform well, it is important to be aware of the limitations or weaknesses of each algorithm, specifically the high sensitivity to outliers with hierarchical, the tendency to generate homogeneously sized clusters with means, and the tendency to produce small or singleton clusters with average-linkage.
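
To make the 2D test case in the abstract concrete, the following is a minimal Python sketch of bottom-up average-linkage clustering on points in a plane, with the Euclidean distance between points as the pairwise metric. It is not the authors' C library. The function names (`average_linkage`, `sieved_clustering`, `centroid`) and the synthetic two-blob data are hypothetical, chosen for illustration only. The sieving step assumes one plausible reading of "sieved clustering": cluster every `sieve`-th point, then assign each remaining point to the cluster with the nearest centroid.

```python
import math
import random


def distance(p, q):
    """Euclidean distance between two 2D points (the pairwise metric)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def centroid(points, members):
    """Mean position of the points whose indices are in `members`."""
    n = len(members)
    return (sum(points[i][0] for i in members) / n,
            sum(points[i][1] for i in members) / n)


def average_linkage(points, n_clusters):
    """Merge the two clusters with the smallest average inter-point
    distance until `n_clusters` remain. Returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (average distance, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(distance(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


def sieved_clustering(points, n_clusters, sieve):
    """Assumed sieving scheme: cluster every `sieve`-th point, then add
    each skipped point to the cluster with the nearest centroid."""
    sampled = list(range(0, len(points), sieve))
    sub = average_linkage([points[i] for i in sampled], n_clusters)
    # Map indices in the sampled subset back to indices in `points`.
    clusters = [[sampled[k] for k in members] for members in sub]
    centroids = [centroid(points, members) for members in clusters]
    sampled_set = set(sampled)
    for i, p in enumerate(points):
        if i not in sampled_set:
            nearest = min(range(len(centroids)),
                          key=lambda c: distance(p, centroids[c]))
            clusters[nearest].append(i)
    return clusters


# Synthetic test data (an assumption): two well-separated blobs of 20 points.
rng = random.Random(0)
points = [(rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(20)]
points += [(rng.gauss(10.0, 0.5), rng.gauss(10.0, 0.5)) for _ in range(20)]

clusters = average_linkage(points, n_clusters=2)
sieved = sieved_clustering(points, n_clusters=2, sieve=4)
print([len(c) for c in clusters], [len(c) for c in sieved])
```

The naive merge loop recomputes every pairwise average on each pass, which is acceptable for a few dozen points but far too slow for trajectory-sized data. Sieving reduces that cost by clustering only a subset of the points, at the price of a cruder assignment for the points that were skipped.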

History

  • Published In Issue November 13, 2007
  • Received May 17, 2007
