Journal of Proteome Research, 3 (4), 697-711, 2004. 10.1021/pr0340680 S1535-3893(03)04068-5
Web Release Date: April 13, 2004

Copyright © 2004 American Chemical Society

Clinical and Pharmacogenomic Data Mining: 2. A Simple Method for the Combination of Information from Associations and Multivariances to Facilitate Analysis, Decision, and Design in Clinical Research and Practice

Barry Robson* and Richard Mushlin

T. J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, New York 10598

Received August 26, 2003

Abstract:

The physician and researcher must ultimately be able to combine qualitative and quantitative features from a variety of combinations of observations on data of many component items (i.e., many dimensions), and hence reach simple conclusions about interpretation, rational courses of action, and design. In the first paper of this series, it was noted that such needs are challenging the classical means of using statistics. Hence, the paper proposed the use of a Generalized Theory of Expected Information or "Zeta Theory". The conjoint event [a,b,c,...] is seen as a rule of association for a,b,c,... associated with a rule strength I(a;b;c;...) = ζ(s,o[a,b,c,...]) - ζ(s,e[a,b,c,...]), where ζ is the incomplete Zeta Function. Here, o[a,b,c,...] is the observed, and e[a,b,c,...] the expected, frequency of occurrence of conjoint event [a,b,c,...]. The present paper explores how output from this approach might be assembled in a form better suited for decision support. Related to this is the difficulty that the treatment of covariance and multivariance was previously rendered as a "fuzzy association" so that the output would fall into a similar form as the true associations, but this was a somewhat ad hoc approach in which only the final I( ) had any meaning. Users at clinical research sites had subsequently requested an alternative approach in which "effective frequencies" o[ ] and e[ ] calculated from the above variances and used to evaluate I( ) give some intuitive feeling analogous to the association treatment, and this is explored here. Though the present paper is theoretical, real examples are used to illustrate application. One clinical-genomic example illustrates experimental design by identifying data which is, or is not, statistically germane to the study.
We also report on some impressions based on applying these techniques in studies of real, extensive patient record data which are now emerging, as well as on molecular design data originally studied in part to test the ability to deduce the effects of simple natural patient sequence variations ("SNPs") on patient protein activity. On the basis of these study experiences, methods of rationalizing and condensing the rules implied by associations and variances between data, as well as discussion of the difficulty of what is meant by "condensed", are presented in the Appendix.
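
As an illustration of the rule strength defined in the abstract, the following minimal sketch evaluates I = ζ(s,o) - ζ(s,e) for a single conjoint event. It rests on assumptions that are not stated in this abstract: the incomplete Zeta Function is taken to be the partial sum 1 + 2^-s + ... + n^-s, both frequencies are treated as non-negative integers, and s defaults to 1. The function names `incomplete_zeta` and `rule_strength` are hypothetical, and any handling of non-integer expected frequencies in the paper itself is not reproduced here.

```python
def incomplete_zeta(s, n):
    """Partial sum 1 + 2**-s + ... + n**-s (assumed definition).

    n must be a non-negative integer; n = 0 gives an empty sum of 0.
    """
    if not isinstance(n, int) or n < 0:
        raise ValueError("n must be a non-negative integer in this sketch")
    return sum(k ** -s for k in range(1, n + 1))


def rule_strength(observed, expected, s=1.0):
    """I = zeta(s, observed) - zeta(s, expected) for one conjoint event.

    Positive values mean the event was seen more often than expected,
    negative values less often, and zero means no evidence either way.
    """
    return incomplete_zeta(s, observed) - incomplete_zeta(s, expected)


if __name__ == "__main__":
    # Event seen 4 times where 2 were expected: (1/3 + 1/4) = 7/12
    print(rule_strength(4, 2))
    # Same counts reversed give the same magnitude with opposite sign
    print(rule_strength(2, 4))
```

Under these assumptions and with s = 1, the difference of partial sums approaches ln(o/e) as both frequencies grow large, while for small counts it stays finite where a plain log ratio would not.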

Keywords: data mining; association; covariance; proteomics; pharmacogenomics; patient record; information theory; expected information; zeta function; rules

