Web Release Date: May 8,
Clinical and Pharmacogenomic Data Mining: 1. Generalized Theory of Expected Information and Application to the Development of Tools
T. J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, New York 10598
Received December 4, 2002
Revised February 20, 2003
Abstract:
New scientific problems, arising from the human genome project, are challenging the classical means
of using statistics. Yet quantified knowledge in the form of rules and rule strengths based on real
relationships in data, as opposed to expert opinion, is urgently required for researcher and physician
decision support. The problem is that with many parameters, the space to be analyzed is highly
dimensional. That is, the combinations of data to examine are subject to a combinatorial explosion as
the number of possible events (entries, items, sub-records) (a),(b),(c),... per record (a,b,c,..) increases,
and hence much of the space is sparsely populated. These combinatorial considerations are particularly
problematic for identifying those associations called "Unicorn Events", which occur so much less often
than expected that they are never observed and so cannot be counted. To cope with the combinatorial
explosion, a novel numerical "book keeping" approach is taken to generate information terms relating
to the combinatorial subsets of events (a,b,c,..), and, most importantly, the ζ (Zeta) function is employed.
The incomplete Zeta function ζ(s,n) with s = 1, in which frequencies of occurrence such as n = n(a,b,c,...)
determine the range of summation n, is argued to be the natural choice of information function. It
emerges from Bayesian integration, taken over the distribution of possible values of information
measures for sparse and ample data alike. Expected mutual information I(a;b;c) in nats (i.e., natural
units analogous to bits but based on the natural logarithm), such as is available to the observer, is
measured as, e.g., the difference ζ(s,o(a,b,c,..)) - ζ(s,e(a,b,c,..)), where o(a,b,c,..) and e(a,b,c,..) are, or
relate to, the observed and expected frequencies of occurrence, respectively. For real values of s > 1
the qualitative impact of strongly (positively or negatively) ranked data is preserved despite several
numerical approximations. As real s increases, the output of the information functions converges
to three values, +1, 0, and -1 nats, representing a trinary logic system. For quantitative data, a useful
ad hoc method, to report σ-normalized covariations in an analogous manner to mutual information for
significance comparison purposes, is demonstrated. Finally, the potential ability to make use of mutual
information in a complex biomedical study, and to include Bayesian prior information derived from
statistical, tabular, anecdotal, and expert opinion is briefly illustrated.
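
As a rough illustration of the measure described in the abstract, the sketch below takes the incomplete Zeta function to be the partial sum ζ(s,n) = 1 + 2^(-s) + ... + n^(-s), with ζ(s,0) = 0, and reports ζ(s,o) - ζ(s,e) for an observed count o and an expected count e. The function names incomplete_zeta and expected_information are hypothetical, and restricting both counts to non-negative integers is a simplifying assumption made here; the abstract does not say how non-integer expected frequencies are handled. This is not the author's implementation.

```python
import math


def incomplete_zeta(s: float, n: int) -> float:
    """Partial sum 1 + 2**-s + ... + n**-s (hypothetical helper).

    Assumes n is a non-negative integer count; the empty sum (n == 0) is 0.
    """
    if n < 0:
        raise ValueError("n must be a non-negative integer")
    return math.fsum(k ** -s for k in range(1, n + 1))


def expected_information(observed: int, expected: int, s: float = 1.0) -> float:
    """Difference zeta(s, observed) - zeta(s, expected), in nats for s = 1."""
    return incomplete_zeta(s, observed) - incomplete_zeta(s, expected)


if __name__ == "__main__":
    # Ample data, s = 1: the difference of harmonic sums approaches ln(o/e).
    print(expected_information(200, 100), math.log(200 / 100))

    # Sparse data, s = 1: small counts still give finite values.
    print(expected_information(3, 1))
    print(expected_information(0, 2))  # seen less often than expected

    # Large s: outputs collapse toward +1, 0, or -1.
    for o, e in [(5, 0), (5, 7), (0, 5)]:
        print(o, e, expected_information(o, e, s=50.0))
```

With s = 1 the partial sum is a harmonic number, which grows like the natural logarithm of n plus a constant, so for large counts the difference behaves like the familiar log ratio of observed to expected frequency while remaining finite when a count is zero. For large s every term beyond the first becomes negligible, so each partial sum is close to 1 for any positive count and exactly 0 for a zero count, which gives the three-valued behaviour mentioned in the abstract.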
Keywords: data mining; association; covariance; negative association; proteomics; pharmacogenomics; patient record; information theory; expected information; zeta function