Title: | Chemometrics with R - Multivariate Data Analysis in the Natural Sciences and Life Sciences (2nd Edition) |
---|---|
Description: | Functions and scripts used in the book "Chemometrics with R - Multivariate Data Analysis in the Natural Sciences and Life Sciences", 2nd edition, by Ron Wehrens, Springer (2019). |
Authors: | Ron Wehrens [aut, cre] |
Maintainer: | Ron Wehrens <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.2.0 |
Built: | 2024-10-27 03:49:25 UTC |
Source: | https://github.com/rwehrens/ChemometricsWithR |
Functions and scripts used in the book "Chemometrics with R - Multivariate Data Analysis in the Natural Sciences and Life Sciences", 2nd edition, by Ron Wehrens, Springer (2019).
Package accompanying the second edition of the book "Chemometrics with R". The package will not be hosted on CRAN and therefore no longer has to comply to the size constraints that CRAN has imposed. This simplifies matters considerably for readers and in particular removes the need to install a separate data package.
The scripts in the demo
directory can be used to reproduce the
results and plots of the individual chapters. For Chapter 3, for
example, simply run demo("chapter3.R")
.
Ron Wehrens [aut, cre]
Maintainer: Ron Wehrens <[email protected]>
R. Wehrens. "Chemometrics with R - Multivariate Data Analysis in the Natural Sciences and Life Sciences", 2nd edition, Springer, Heidelberg, 2019.
The Adjusted Rand Index is a measure of similarity for two groupings or clusterings. A value of 1 indicates total agreement.
AdjRkl(part1, part2)
AdjRkl(part1, part2)
part1 |
First partitioning. |
part2 |
Second partitioning. |
Number.
Ron Wehrens
R. Wehrens. "Chemometrics with R - Multivariate Data Analysis in the Natural Sciences and Life Sciences". 2nd edition, Springer, Heidelberg, 2019.
data(wines, package = "kohonen") wines.dist <- dist(scale(wines)) wines.sl <- hclust(wines.dist, method = "single") wines.cl <- hclust(wines.dist, method = "complete") AdjRkl(cutree(wines.sl, 4), cutree(wines.cl, 4))
data(wines, package = "kohonen") wines.dist <- dist(scale(wines)) wines.sl <- hclust(wines.dist, method = "single") wines.cl <- hclust(wines.dist, method = "complete") AdjRkl(cutree(wines.sl, 4), cutree(wines.cl, 4))
LC-MS data from Arabidopsis thaliana samples. The
arabidopsis
data object contains relative
intensities of 567 reconstructed metabolites (columns) for 761
samples (rows). A sizeable fraction of intensities are missing, in
most cases because the corresponding metabolites are below the
detection level. The corresponding meta-information object
(arabidopsis.Y
)contains for every sample batch and
sequence information, as well as (coded) information on the genotype
and the sample type (study sample or reference sample). Processing of
the raw data has been done with Metalign and MSClust programs.
data(arabidopsis)
data(arabidopsis)
"@ArticleWehrens2016, author = Ron Wehrens and Jos.~A.~Hageman and Fred~van~Eeuwijk and Rik~Kooke and P\'adraic~J.~Flood and Erik Wijnker and Joost~J.B.~Keurentjes and Arjen~Lommen and Henri\"ette~D.L.M.~van~Eekelen and Robert~D.~Hall and Roland~Mumm and Ric~C.H.~de~Vos, title = Improved batch correction in untargeted MS-based metabolomics, journal = Metabolomics, year = 2016, volume = 12, DOI = 10.1007/s11306-016-1015-8, pages = 1–12 "
data(arabidopsis) dim(arabidopsis) sum(is.na(arabidopsis)) dim(arabidopsis.Y)
data(arabidopsis) dim(arabidopsis) sum(is.na(arabidopsis)) dim(arabidopsis.Y)
Two chemical mixtures of three compounds have been measured using HPLC-UV. Two of the compounds are known: diazinon and parthion-ethyl, both organophosphorus pesticides. Each data matrix consists of 73 wavelengths and 40 time points. The challenge is to infer the pure spectra of the individual compounds, as well as their time profiles.
data(bdata)
data(bdata)
A list of four elements. The first two, d1
and d2
,
are the mixture matrices of the two analytes and one unknown
interferent. The last two, sp1
and sp2
, contain the pure
spectra of the two analytes.
Original matlab data files obtained from
http://www.ub.edu/mcr/web_mcr/download_dataHPLC.html
(bdataset.zip).
No longer available.
R. Wehrens. "Chemometrics with R - Multivariate Data Analysis in the Natural Sciences and Life Sciences". Springer, 2nd edition, Heidelberg, 2019.
R. Tauler, S. Lacorte and D. Barcelo. "Application of multivariate curve self-modeling curve resolution for the quantitation of trace levels of organophosphorous pesticides in natural waters from interlaboratory studies". J. of Chromatogr. A, 730, 177-183 (1996).
data(bdata) persp(bdata$d1, phi = 20, theta = 34, expand = .5, xlab = "Time", ylab = "Wavelength")
data(bdata) persp(bdata$d1, phi = 20, theta = 34, expand = .5, xlab = "Time", ylab = "Wavelength")
Error functions for classification and regression
rms(x, y) err.rate(x, y)
rms(x, y) err.rate(x, y)
x , y
|
True or predicted values, either numbers or factors. |
Function rms
returns the root-mean-square error for real-valued x
and y vectors. Function err.rate
returns the fraction of
non-matching cases in x and y (real numbers or factors).
Ron Wehrens
R. Wehrens. "Chemometrics with R - Multivariate Data Analysis in the Natural Sciences and Life Sciences". 2nd edition, Springer, Heidelberg, 2019.
Calculation of the change in the Gini impurity index for
classification and regression trees. The function returns changes in
the gini index associated with using individual values of x
as
split points. Included for demonstration purposes.
gini(x, clss)
gini(x, clss)
x |
Numeric vector of length n. |
clss |
Clss labels, length n. |
The change in Gini impurity index, given a vector of possible splits, and a vector of clss labels. Lower values indicate more pure leaves.
Ron Wehrens
R. Wehrens. "Chemometrics with R - Multivariate Data Analysis in the Natural Sciences and Life Sciences". 2nd edition, Springer, Heidelberg, 2019.
LC-MS data are characterised by peak positions in two dimensions, the retention time dimension and the mass-to-charge ratio (m/z) dimension. The plot shows this plane (corresponding to one sample, basically an intensity matrix), with projections to the top and the right.
lcmsimage(z, rt, mz, xlim = range(rt), ylim = range(mz), zmin = 0, xlab = "Time (s)", ylab = "m/z", ColourScale = c("exponential", "linear"), nBreaks = 12, colors = terrain.colors(nBreaks), PlotProjections = c("max", "sum", "none"), ncolorbarticks = 5, colorbarticks = axisTicks(zlim2, log = ColourScale == "exponential", nint = ncolorbarticks), ...)
lcmsimage(z, rt, mz, xlim = range(rt), ylim = range(mz), zmin = 0, xlab = "Time (s)", ylab = "m/z", ColourScale = c("exponential", "linear"), nBreaks = 12, colors = terrain.colors(nBreaks), PlotProjections = c("max", "sum", "none"), ncolorbarticks = 5, colorbarticks = axisTicks(zlim2, log = ColourScale == "exponential", nint = ncolorbarticks), ...)
z |
the intensity matrix |
rt |
the vector of time points corresponding to the x axis |
mz |
the vector of m/z values corresponding to the y axis |
xlim , ylim
|
defining which part of the data is shown, by default all |
zmin |
minimum of the intensity range, default zero |
xlab , ylab
|
axis labels |
ColourScale |
whether to use an exponential (default) or linear color scale |
nBreaks , colors
|
which and how many colours to use |
PlotProjections |
ways to calculate the projections to the top
and right of the figure. Choosing |
ncolorbarticks , colorbarticks
|
control over the tick marks of the color bar |
... |
other arguments for the projection plots |
Ron Wehrens, Tom Bloemberg
data(lcms,package = "ptw") mycols <- terrain.colors(40) lcmsimage(t(lcms[,,1]), rt = seq(2000, 5500, length = 2000), mz = seq(550, 599.5, length = 100), zmin = 1, colors = mycols, ncolorbarticks = 10)
data(lcms,package = "ptw") mycols <- terrain.colors(40) lcmsimage(t(lcms[,,1]), rt = seq(2000, 5500, length = 2000), mz = seq(550, 599.5, length = 100), zmin = 1, colors = mycols, ncolorbarticks = 10)
Multivariate Curve Resolution, or MCR, decomposes a bilinear matrix into its pure components. A classical example is a matrix consisting of a series of spectral measurements on a mixture of chemicals for following the reaction. At every time point, a spectrum is measured that is a linear combination of the pure spectra. The goal of MCR is to resolve the pure spectra and concentration profiles over time.
mcr(x, init, what = c("row", "col"), convergence = 1e-08, maxit = 50) efa(x, ncomp)
mcr(x, init, what = c("row", "col"), convergence = 1e-08, maxit = 50) efa(x, ncomp)
x |
Data matrix |
init |
Initial guess for pure compounds |
what |
Whether the pure compounds are rows or columns of the data matrix |
convergence |
Convergence criterion |
maxit |
Maximal number of iterations |
ncomp |
Number of pure compounds |
MCR uses repeated application of least-squares regression to find pure profiles and spectra. The method is iterative; EFA is a method to provide initial guesses.
Function mcr
returns a list containing
C |
An estimate of the pure "concentration profiles" |
S |
An estimate of the pure "spectra" |
resids |
The residuals of the final decomposition |
rms |
Root-mean-square values of the individual iterations |
Function efa
returns a list containing
pure.compounds: |
A matrix containing |
forward: |
The development of the singular values of the reduced data matrix when increasing the number of columns in the forward direction |
backward: |
The development of the singular values of the reduced data matrix when increasing the number of columns in the backwarddirection |
Ron Wehrens
R. Wehrens. "Chemometrics with R - Multivariate Data Analysis in the Natural Sciences and Life Sciences". 2nd edition, Springer, Heidelberg, 2019.
data(bdata) D1.efa <- efa(bdata$d1, 3) matplot(D1.efa$forward, type = "l") matplot(D1.efa$backward, type = "l") matplot(D1.efa$pure.comp, type = "l") D1.mcr.efa <- mcr(bdata$d1, D1.efa$pure.comp, what = "col") matplot(D1.mcr.efa$C, type = "l", main = "Concentration profiles") matplot(t(D1.mcr.efa$S), type = "l", main = "Pure spectra")
data(bdata) D1.efa <- efa(bdata$d1, 3) matplot(D1.efa$forward, type = "l") matplot(D1.efa$backward, type = "l") matplot(D1.efa$pure.comp, type = "l") D1.mcr.efa <- mcr(bdata$d1, D1.efa$pure.comp, what = "col") matplot(D1.mcr.efa$C, type = "l", main = "Concentration profiles") matplot(t(D1.mcr.efa$S), type = "l", main = "Pure spectra")
Functions for PCA: creating a PCA object, extracting variances, scores and loadings for individual PCs, projecting new data in the PC space, and reconstruction using a limited number of PCs.
PCA(X, warn = TRUE) ## S3 method for class 'PCA' summary(object, varperc = 90, pc.select = c(1:5,10), ...) variances(object, npc = maxpc) ## S3 method for class 'PCA' scores(object, npc = maxpc, ...) ## S3 method for class 'PCA' loadings(object, npc = maxpc, ...) reconstruct(object, npc = maxpc) project(object, npc = maxpc, newdata, ldngs)
PCA(X, warn = TRUE) ## S3 method for class 'PCA' summary(object, varperc = 90, pc.select = c(1:5,10), ...) variances(object, npc = maxpc) ## S3 method for class 'PCA' scores(object, npc = maxpc, ...) ## S3 method for class 'PCA' loadings(object, npc = maxpc, ...) reconstruct(object, npc = maxpc) project(object, npc = maxpc, newdata, ldngs)
X |
a matrix, with each row representing an object. |
warn |
logical, whether or not to give a warning when the data are not mean-centered. |
object |
an object of class "PCA" (see below). |
varperc |
variance threshold in the |
... |
extra arguments, e.g., for printing the variance table (digits = ...). |
pc.select |
PCs to be included in the |
npc |
the number of PCs to be returned. |
newdata |
data (with the same number of variables as the original
data) that are to be projected into the space of the first
|
ldngs |
loadings to be used; by default the PCA loadings. |
Function PCA
returns an object of class "PCA" with components
scores |
object weights per PC. |
loadings |
variable weights per PC. |
var |
variance explained per PC. |
totalvar |
The total variance in the data set. |
Function summary.PCA
gives a short summary of the PCA model,
stating how many PCs are needed to cover a certain percentage of the
total variance, and for selected PCs gives the (cumulative) variance
explained.
Function variances
returns the variances associated with each PC.
Function scores
returns the scores associated with each PC.
Function loadings
returns the loadings associated with each PC.
Function reconstruct
returns the reconstruction of the original
data matrix, based on npc
PCs.
Function project
projects the new data into the subspace
spanned by the given loadings. If argument ldngs
is given,
arguments pcamod
and npc
are not needed.
Ron Wehrens
R. Wehrens. "Chemometrics with R - Multivariate Data Analysis in the Natural Sciences and Life Sciences". 2nd edition, Springer, Heidelberg, 2019.
data(wines, package = "kohonen") wines.PC <- PCA(scale(wines))
data(wines, package = "kohonen") wines.PC <- PCA(scale(wines))
Plotting functions for PCA: for scores, loadings, scores and loadings simultaneously (a biplot), and variances (a screeplot, where the log of the explained variance is plotted for each PC).
## S3 method for class 'PCA' scoreplot(object, pc = c(1, 2), pcscores = scores(object), show.names = FALSE, xlab, ylab, xlim, ylim, ...) ## S3 method for class 'PCA' loadingplot(object, pc = c(1, 2), pcloadings = loadings(object), scalefactor = 1, add = FALSE, show.names = FALSE, xlab, ylab, xlim, ylim, col = "blue", min.length = 0.01, varnames = NULL, ...) ## S3 method for class 'PCA' biplot(x, pc = c(1,2), show.names = c("none", "scores", "loadings", "both"), score.col = 1, loading.col = "blue", min.length = .01, varnames = NULL, ...) screeplot(object, type = c("scree", "percentage"), npc, ...)
## S3 method for class 'PCA' scoreplot(object, pc = c(1, 2), pcscores = scores(object), show.names = FALSE, xlab, ylab, xlim, ylim, ...) ## S3 method for class 'PCA' loadingplot(object, pc = c(1, 2), pcloadings = loadings(object), scalefactor = 1, add = FALSE, show.names = FALSE, xlab, ylab, xlim, ylim, col = "blue", min.length = 0.01, varnames = NULL, ...) ## S3 method for class 'PCA' biplot(x, pc = c(1,2), show.names = c("none", "scores", "loadings", "both"), score.col = 1, loading.col = "blue", min.length = .01, varnames = NULL, ...) screeplot(object, type = c("scree", "percentage"), npc, ...)
x , object
|
an object of class "PCA" (see below). |
pc |
which PCs to show. |
pcscores |
matrix of scores, by default the scores of the PCA model object. |
show.names |
show names rather than plotting
symbols. For |
xlab , ylab , xlim , ylim , col
|
graphical parameters of the plot. |
pcloadings |
matrix of loadings, by default the loadings of the PCA model object. |
scalefactor |
scaling factor for the loadings; used internally, when
the |
add |
logical, whether to add to the existing plot (again, useful
when |
npc |
how many PCs to show in the scree plot (starting from 1). |
type |
show a real screeplot ( |
score.col , loading.col
|
colours of the scores and loadings in a biplot. |
min.length |
minimal length of loading vectors to be plotted by arrows. Vectors that are too short lead to warning messages, are not interesting, and only clutter the graphic. |
varnames |
alternative vector of variable names. |
... |
Graphical arguments passed on to lower-level plotting functions. |
Score plots and loading plots show the amount of explained variance at the axis labels only when PCA has been performed at mean-centered data.
Ron Wehrens
R. Wehrens. "Chemometrics with R - Multivariate Data Analysis in the Natural Sciences and Life Sciences". 2nd edition, Springer, Heidelberg, 2019.
data(wines, package = "kohonen") wines.PC <- PCA(scale(wines)) wine.classes <- as.integer(vintages) scoreplot(wines.PC, col = wine.classes, pch = wine.classes) loadingplot(wines.PC, show.names = TRUE) biplot(wines.PC, score.col = wine.classes) screeplot(wines.PC)
data(wines, package = "kohonen") wines.PC <- PCA(scale(wines)) wine.classes <- as.integer(vintages) scoreplot(wines.PC, col = wine.classes, pch = wine.classes) loadingplot(wines.PC, show.names = TRUE) biplot(wines.PC, score.col = wine.classes) screeplot(wines.PC)
Function to identify local maxima in a vector, typically a spectrum or a chromatogram.
pick.peaks(x, span)
pick.peaks(x, span)
x |
Numerical vector. |
span |
Neighbourhood, used to define local maxima. |
A vector containing positions of local maxima in the input data.
Ron Wehrens
if (require("ptw")) { data(lcms, package = "ptw") plot(lcms[1,,1], type = "l", xlim = c(1000, 1500)) abline(v = pick.peaks(lcms[1,,1], 20), col = "blue") } else { cat("Package ptw not available.\nInstall it by typing 'install.packages(\"ptw\")'") }
if (require("ptw")) { data(lcms, package = "ptw") plot(lcms[1,,1], type = "l", xlim = c(1000, 1500)) abline(v = pick.peaks(lcms[1,,1], 20), col = "blue") } else { cat("Package ptw not available.\nInstall it by typing 'install.packages(\"ptw\")'") }
A data object of class msSet
,
consisting of 654 mass spectra (327 spectra in duplicate) from 2000 to
20000 Da, which were generated from patients with prostate cancer,
benign prostatic hypertrophy, and normal controls. These spectra are
already baseline corrected and normalized. Please see the references for
more details.
Since the original package msProstate is orphaned at the end of 2012, the data are included in the ChemometricsWithR package so that the examples in the book are still executable. This manual page has been adapted to reflect this.
B.L. Adam, Y. Qu, J.W. Davis, M.D. Ward, M.A. Clements, L.H. Cazares, O.J. Semmes, P.F. Schellhammer, Y. Yasui, Z. Feng, and G.L. Wright, Jr., "Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men," Cancer Research, 62(13):3609–14, 2002.
Y. Qu, B.L. Adam, Y. Yasui, M.D. Ward, L.H. Cazares, P.F. Schellhammer, Z. Feng, O.J. Semmes, and G.L. Wright Jr., "Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients", Clinical Chemistry, 48(10):1835–43, 2002.
R. Wehrens, "Chemometrics with R - Multivariate Data Analysis in the Natural Sciences and Life Sciences". 2nd edition, Springer, Heidelberg, 2019.
## Examples have been changed from the original man page upon inclusion ## in the ChemometricsWithRData package data("Prostate2000Raw") ## plot a few spectra, partially matplot(Prostate2000Raw$mz[1:8000], Prostate2000Raw$intensity[1:8000, 1:5], type = "l", lty = 1, col = 1:5, xlab = "m/z", ylab = "response")
## Examples have been changed from the original man page upon inclusion ## in the ChemometricsWithRData package data("Prostate2000Raw") ## plot a few spectra, partially matplot(Prostate2000Raw$mz[1:8000], Prostate2000Raw$intensity[1:8000, 1:5], type = "l", lty = 1, col = 1:5, xlab = "m/z", ylab = "response")
NIR data from 654 tablets, measured at two different instruments. The data have been divided in training, test and validation sets.
data(shootout)
data(shootout)
Variable shootout
is a list containing spectral data of
tablets, measured on two instruments, as well as response variables.
For every tablet, three reponse variables are measured: the amount of active ingredient (nominally 200 mg/tablet), the weight and the hardness. Typically, one wants to estimate the amount of active ingredient from the NIR spectra, a straightforward multivariate calibration problem. The goal of the shootout competition was to find the optimal way to transfer a calibration model of the first instrument to the second.
http://www.idrc-chambersburg.org/shootout2002.html
R. Wehrens. "Chemometrics with R - Multivariate Data Analysis in the Natural Sciences and Life Sciences". 2nd edition, Springer, Heidelberg, 2019.
data(shootout) plot(seq(600, 1898, by = 2), shootout$calibrate.1[1,], type = "l", xlab = "wavelength", ylab = "log(1/T)")
data(shootout) plot(seq(600, 1898, by = 2), shootout$calibrate.1[1,], type = "l", xlab = "wavelength", ylab = "log(1/T)")