My graduate work in a lab that primarily focuses on NMR-based metabolomics has brought me into close contact with two very powerful multivariate data handling algorithms: PCA and PLS. These algorithms uncover underlying trends in high-dimensional data (NMR spectra) by approximating said data in a low-dimensional space. In the low-dimensional ‘latent variable’ space, the original observations are represented as points (‘scores’) and distances between the points are related to the original distances between the high-dimensional spectra.
Practitioners of metabolomics may then draw several (~10-100) replicates from any suitably chosen experimental groups and perform PCA or PLS on the NMR (or MS, IR, Raman, CE, etc.) data to identify metabolic differences between those groups.
A very salient question in this exercise is: ‘how far apart do experimental groups have to be in scores space for the separation to be significant?’ The human mind is an extremely creative and wonderful gift, but it can lead us astray when we expect a given outcome. Thankfully, cold and calculating computer software can apply statistical measures to scores to yield a scientifically respectable answer.
But a human still had to write the software. :)
Since I wanted the code to run virtually anywhere, I implemented pca-utils in straight-up, vanilla C. My implementation also incorporates a UPGMA algorithm for generating dendrograms from scores, and it renders confidence ellipses and ellipsoids in two- and three-dimensional space, respectively.
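For anyone unfamiliar with UPGMA, here is a rough sketch of the textbook algorithm: repeatedly merge the two closest clusters, then replace their distances to every other cluster with a size-weighted average. It is not lifted from pca-utils; the `upgma_merge` struct, the `UPGMA_MAX` cap, and the convention that a merged cluster keeps the lower index are assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define UPGMA_MAX 64   /* hypothetical cap so the sketch needs no malloc */

/* One merge step: clusters 'left' and 'right' joined at distance 'dist'.
 * After the merge, the combined cluster is referred to by 'left'. */
typedef struct {
    size_t left, right;
    double dist;
} upgma_merge;

/* Sketch only (hypothetical helper, not taken from pca-utils).
 * d is an m-by-m symmetric distance matrix (row-major), modified in place.
 * merges must have room for m - 1 entries. */
void upgma(double *d, size_t m, upgma_merge *merges)
{
    size_t size[UPGMA_MAX];
    int active[UPGMA_MAX];
    size_t i, j, k, step;

    assert(m >= 1 && m <= UPGMA_MAX);
    for (i = 0; i < m; i++) { size[i] = 1; active[i] = 1; }

    for (step = 0; step + 1 < m; step++) {
        size_t a = 0, b = 0;
        double best = 0.0;
        int found = 0;

        /* Find the closest pair of active clusters. */
        for (i = 0; i < m; i++) {
            if (!active[i]) continue;
            for (j = i + 1; j < m; j++) {
                if (!active[j]) continue;
                if (!found || d[i * m + j] < best) {
                    best = d[i * m + j];
                    a = i;
                    b = j;
                    found = 1;
                }
            }
        }
        merges[step].left = a;
        merges[step].right = b;
        merges[step].dist = best;

        /* Size-weighted average distance from the merged cluster to the rest. */
        for (k = 0; k < m; k++) {
            double avg;
            if (!active[k] || k == a || k == b) continue;
            avg = ((double)size[a] * d[a * m + k] + (double)size[b] * d[b * m + k])
                  / (double)(size[a] + size[b]);
            d[a * m + k] = avg;
            d[k * m + a] = avg;
        }
        size[a] += size[b];
        active[b] = 0;
    }
}
```

The recorded merges, read in order, describe the dendrogram from the leaves up. The distance matrix could come from individual scores or from group centroids; which one is appropriate depends on what the tree is meant to show.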
Take, for example, urine samples (not actually) collected from four individuals. One can use pca-utils to visualize the scores data with both confidence ellipses and dendrograms, really answering the question: ‘are these samples different?’
Of course, this page would be pretty useless without source code, huh? :P I’ll throw in a PDF of the original paper while I’m at it.