---
title: "Introduction to AlpsNMR (older API)"
author: "AlpsNMR authors"
package: AlpsNMR
abstract: >
  An introduction to the AlpsNMR package, showing the most relevant functions and
  a proposed workflow, using the older workflow.
date: "`r format(Sys.Date(), '%F')`"
output:
  BiocStyle::pdf_document:
    latex_engine: lualatex
vignette: >
  %\VignetteIndexEntry{Older Introduction to AlpsNMR (soft-deprecated API)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r setup, include = FALSE}
knitr::opts_chunk$set(
  fig.width = 7,
  fig.height = 5,
  collapse = TRUE,
  comment = "#>"
)
```

The `AlpsNMR` package was written with two purposes in mind:

- to help **data analysts and NMR scientists** to work with NMR samples.
- to help **IT pipeline builders** implement automated methods for preprocessing.

Functions from this package written for data analysts and NMR scientists are
prefixed with `nmr_`, while higher level functions written for IT pipeline
builders are prefixed with `pipe_`. The main reason why all exported functions
have a prefix is to make it easy for the user to discover the functions from the
package. By typing nmr_ RStudio will return the list of exported functions. In
the R terminal, nmr_ followed by the tab key (⇥) twice will have the same
effect. Other popular packages, follow similar approaches (e.g: `forcats`:
`fct_*`, `stringr`: `str_*`).

This vignette is written for the first group. It assumes some prior basic
knowledge of NMR and data analysis, as well as some basic R programming. In case
you are interested in building pipelines with this package, you may want to open
the file saved in this directory (run it on your computer):

```
pipeline_example <- system.file("pipeline-rmd", "pipeline_example.R", package = "AlpsNMR")
pipeline_example
```


```{r}
library(BiocParallel)
library(AlpsNMR)
library(ggplot2)
```


# Enable parallellization

This package is able to parallellize several functions through the use of the
`BiocParallel` package. Whether to parallelize or not is left to the user
that can control the parallellization registering backends. Please check
the `BiocParallel` introduction for further details

```{r}
library(BiocParallel)
#register(SerialParam(), default = TRUE)  # disable parallellization
register(SnowParam(workers = 2, exportglobals = FALSE), default = TRUE)  # enable parallellization with 2 workers
```


# Data: The `MeOH_plasma_extraction` dataset

To explore the basics of the AlpsNMR package, we have included four NMR samples
acquired in a 600 MHz Bruker instrument bundled with the package. The samples
are pooled quality control plasma samples, that were extracted with methanol,
and therefore only contain small molecules.

If you have installed this package, you can obtain the directory where the four
samples are with the command

```{r}
MeOH_plasma_extraction_dir <- system.file("dataset-demo", package = "AlpsNMR")
MeOH_plasma_extraction_dir
```

The demo directory includes four samples (zipped) and a dummy Excel metadata file.

```{r}
fs::dir_ls(MeOH_plasma_extraction_dir)
```

Given the name of the dataset, one may guess that the dataset was used to check
the Methanol extraction in serum samples. The dummy metadata consists of dummy
information, just for the sake of showing how this package can integrate
external metadata. The excel file consists of two tidy tables, in two sheets.

```{r}
MeOH_plasma_extraction_xlsx <- file.path(MeOH_plasma_extraction_dir, "dummy_metadata.xlsx")
exp_subj_id <- readxl::read_excel(MeOH_plasma_extraction_xlsx, sheet = 1)
subj_id_age <- readxl::read_excel(MeOH_plasma_extraction_xlsx, sheet = 2)
exp_subj_id
subj_id_age

```


# Loading samples

The function to read samples is called `nmr_read_samples`. It expects a
character vector with the samples to load that can be paths to directories of
Bruker format samples or paths to JDX files.

Additionally, this function can filter by pulse sequences (e.g. load only NOESY
samples) or loading only metadata.

```{r load-samples}
zip_files <- fs::dir_ls(MeOH_plasma_extraction_dir, glob = "*.zip")
zip_files
dataset <- nmr_read_samples(sample_names = zip_files)
dataset
```

As we have not added any metadata to this dataset, the only column we see is
the `NMRExperiment`:

```{r}
nmr_meta_get(dataset, groups = "external")
```


# Adding metadata

Initally our dataset only has the `NMRExperiment` column:

```{r}
nmr_meta_get(dataset, groups = "external")
```

The `exp_subj_id` table we loaded links the `NMRExperiment` to the `SubjectID`.

As we already have the `NMRExperiment` column, we can use it as the merging
column (note that both columns have the same column name to match the metadata
such as group class, age, BMI...):

```{r}
dataset <- nmr_meta_add(dataset, metadata = exp_subj_id, by = "NMRExperiment")
nmr_meta_get(dataset, groups = "external")
```

If we have info from different files we can match them.
For instance, now we have the `SubjectID` information so we can add the table that 
adds the `SubjectID` to the `Age`.

```{r}
dataset <- nmr_meta_add(dataset, metadata = subj_id_age, by = "SubjectID")
nmr_meta_get(dataset, groups = "external")
```

Now we have our metadata integrated in the dataset and we can make use of it
in further data analysis steps.


# Interpolation

1D NMR samples can be interpolated together, in order to arrange all the spectra 
into a matrix, with one row per sample. The main parameters we would need is the 
range of ppm values that we want to interpolate and the resolution.

We can see the ppm resolution by looking at the ppm axis of one sample:

```{r}
ppm_res <- nmr_ppm_resolution(dataset)[[1]]
message("The ppm resolution is: ", format(ppm_res, digits = 2), " ppm")
```

We can interpolate the dataset, obtaining an `nmr_dataset_1D` object:

```{r}
dataset <- nmr_interpolate_1D(dataset, axis = c(min = -0.5, max = 10, by = 2.3E-4))
```

This operation changes the class of the object, as now the data is on a matrix. The
dataset is now of class `nmr_dataset_1D`. The `axis` element is now a numeric vector
and the `data_1r` element is a matrix.


# Plotting samples

The `AlpsNMR` package offers the possibility to plot `nmr_dataset_1D` objects.
Plotting many spectra with so many points is quite expensive so it is possible
to include only some regions of the spectra or plot only some samples.

Use `?plot.nmr_dataset_1D` to check the parameters, among them: 

- `NMRExperiment`: A character vector with the NMR experiments to plot 
- `chemshift_range`: A ppm range to plot only a small region, or to reduce the
resolution 
- `interactive`: To make the plot interactive - `...`: Can be used to
pass additional parameters such as `color = "SubjectID"` that are passed as
aesthetics to ggplot.

```{r}
plot(dataset, NMRExperiment = c("10", "30"), chemshift_range = c(2.2, 2.8))
```


## Creating interactive plots

The option `interactive = TRUE` described above has some performance limitations.
As high performance workaround, you can make many plots interactive with the
function `plot_interactive`.

This function will use WebGL technologies to create a webpage that, once opened,
allows you to interact with the plot.

Due to technical limitations, these plots need to be opened manually and can't
be embedded in RMarkdown documents. Therefore, the function saves the plot in 
the directory for further exploration. Additionally, some old web browsers may
not be able to display these interactive plots correctly.

```
plt <- plot(dataset, NMRExperiment = c("10", "30"), chemshift_range = c(2.2, 2.8))
plot_interactive(plt, "plot_region.html")
```


# Exclude regions

Some regions can easily be excluded from the spectra with `nmr_exclude_region`.
Note that the regions are fully removed and not zeroed, as using zeros complicates
a lot the implementation^[e.g. it can inadvertedly distort the PQN normalization results]
and has little advantages.

```{r}
regions_to_exclude <- list(water = c(4.6, 5), methanol = c(3.33, 3.39))
dataset <- nmr_exclude_region(dataset, exclude = regions_to_exclude)
plot(dataset, chemshift_range = c(4.2, 5.5))
```


# Filter samples

Maybe we just want to analyze a subset of the data, e.g., only a class group or
a particular gender. We can filter some samples according to their metadata as 
follows:

```{r}
samples_10_20 <- filter(dataset, SubjectID == "Ana")
nmr_meta_get(samples_10_20, groups = "external")
```


# Robust PCA for outlier detection

The AlpsNMR package includes robust PCA analysis for outlier detection.
With such a small demo dataset, it is not practical to use, but check out the
documentation of `nmr_pca_outliers_*` functions.

```{r}
pca_outliers_rob <- nmr_pca_outliers_robust(dataset, ncomp = 3)
nmr_pca_outliers_plot(dataset, pca_outliers_rob)
```


# Baseline removal

Spectra may display an unstable baseline, specially when processing blood/fecal 
blood/fecal samples. If so, `nmr_baseline_removal` subtract the 
baseline by means of Asymmetric Least Squares method.

See before:

```{r}
plot(dataset, chemshift_range = c(3.5,3.8))
```

And after:

```{r}
dataset = nmr_baseline_removal(dataset, lambda = 6, p = 0.01)
plot(dataset, chemshift_range = c(3.5,3.8))
```


# Peak detection

The peak detection is performed on short spectra segments using a continuous
wavelet transform. See `?nmr_detect_peaks` for more information.

Our current approach relies on the use of the baseline threshold 
(`baselineThresh`) automatic calculated (see `?nmr_baseline_threshold`) 
and the Signal to Noise Threshold (`SNR.Th`) to discriminate valid peaks 
from noise. 

The combination of the `baselineThresh` and the `SNR.Th` optimizes 
the number of actual peaks from noise.

The advantage of the `SNR.Th` method is that it estimates the noise
level on each spectra region independently, so in practice it can be used as
a dynamic baseline threshold level.

```{r}
peak_table <- nmr_detect_peaks(dataset,
                               nDivRange_ppm = 0.1,
                               scales = seq(1, 16, 2),
                               baselineThresh = NULL, SNR.Th = 3)
NMRExp_ref <- nmr_align_find_ref(dataset, peak_table)
message("Your reference is NMRExperiment ", NMRExp_ref)
nmr_detect_peaks_plot(dataset, peak_table, NMRExperiment = "20", chemshift_range = c(3.5,3.8))
```


# Spectra alignment

To align the sample, we use the `nmr_align` function, which in turn uses a hierarchical
clustering method (see `?nmr_align` for further details).

The `maxShift_ppm` limits the maximum shift allowed for the spectra.

```{r}
nmr_exp_ref <- nmr_align_find_ref(dataset, peak_table)
dataset_align <- nmr_align(dataset, peak_table, nmr_exp_ref, maxShift_ppm = 0.0015, acceptLostPeak = FALSE)
```

```{r}
plot(dataset, chemshift_range = c(3.025, 3.063))
plot(dataset_align, chemshift_range = c(3.025, 3.063))
```


# Normalization

There are multiple normalization techniques available. The most strongly 
recommended is the `pqn` normalization, but it may not be fully reliable
when the number of samples is small, as it needs a  computation of the 
median spectra. Nevertheless, it is possible to compute it:

```{r}
dataset_norm <- nmr_normalize(dataset_align, method = "pqn")
```

The `AlpsNMR` package offers the possibility to extract additional 
normalization information with `nmr_normalize_extra_info(dataset)`, to explore
the normalization factors applied to each sample:

The plot shows the dispersion with respect to the median of the normalization
factors, and can highlight samples with abnormaly large or small normalization
factors.

```{r}
diagnostic <- nmr_normalize_extra_info(dataset_norm)
diagnostic$norm_factor
diagnostic$plot
```


# Peak integration

## 1. Integration based on peak center and width

If we want to integrate the whole spectra, we need ppm from the `peak_table`. 
See `Peak detection` section. The function `nmr_integrate_peak_positions`
generates a new `nmr_dataset_1D` object containing the integrals from 
the `peak_table` (ppm values corresponding to detected peaks).

```{r}
peak_table_integration = nmr_integrate_peak_positions(
  samples = dataset_norm,
  peak_pos_ppm = peak_table$ppm,
  peak_width_ppm = 0.006)

peak_table_integration = get_integration_with_metadata(peak_table_integration)
```


We can also integrate with a specific peak position and some arbitrary width:

```{r}
nmr_data(
  nmr_integrate_peak_positions(samples = dataset_norm,
                            peak_pos_ppm = c(4.1925, 4.183, 4.1775, 4.17),
                            peak_width_ppm = 0.006)
)
```


## 2. Integration based on peak boundaries

Imagine we only want to integrate the four peaks corresponding to the pyroglutamic
acid:

```{r}
pyroglutamic_acid_region <- c(4.15, 4.20)
plot(dataset_norm, chemshift_range = pyroglutamic_acid_region) +
  ggplot2::ggtitle("Pyroglutamic acid region")
```

We define the peak regions and integrate them. Note how we can correct the
baseline or not. If we correct the baseline, the limits of the integration will
be connected with a straight line and that line will be used as the baseline,
that will be subtracted.

```{r}
pyroglutamic_acid <- list(pyroglutamic_acid1 = c(4.19, 4.195),
                          pyroglutamic_acid2 = c(4.18, 4.186),
                          pyroglutamic_acid3 = c(4.175, 4.18),
                          pyroglutamic_acid4 = c(4.165, 4.172))
regions_basel_corr_ds <- nmr_integrate_regions(dataset_norm, pyroglutamic_acid, fix_baseline = TRUE)
regions_basel_corr_matrix <- nmr_data(regions_basel_corr_ds)
regions_basel_corr_matrix


regions_basel_not_corr_ds <- nmr_integrate_regions(dataset_norm, pyroglutamic_acid, fix_baseline = FALSE)
regions_basel_not_corr_matrix <- nmr_data(regions_basel_not_corr_ds)
regions_basel_not_corr_matrix
```

We may plot the integral values to explore variation based on the baseline
subtraction.

```{r}
dplyr::bind_rows(
  regions_basel_corr_matrix %>%
    as.data.frame() %>%
    tibble::rownames_to_column("NMRExperiment") %>%
    tidyr::gather("metabolite_peak", "area", -NMRExperiment) %>%
    dplyr::mutate(BaselineCorrected = TRUE),
  regions_basel_not_corr_matrix %>%
    as.data.frame() %>%
    tibble::rownames_to_column("NMRExperiment") %>%
    tidyr::gather("metabolite_peak", "area", -NMRExperiment) %>%
    dplyr::mutate(BaselineCorrected = FALSE)
) %>% ggplot() + geom_point(aes(x = NMRExperiment, y = area, color = metabolite_peak)) +
  facet_wrap(~BaselineCorrected)

```


# Identification

After applying any feature selection or machine learning, Alps allows the 
identification of features of interest through `nmr_identify_regions_blood`. 
The function gives 3 possibilities sorted by the most probable metabolite 
(see `nmr_identify_regions_blood` for details).

```{r}
ppm_to_assign <- c(4.060960203, 3.048970634,2.405935596,0.990616851,0.986520147, 1.044258467)
identification <- nmr_identify_regions_blood (ppm_to_assign)
identification[!is.na(identification$Metabolite), ]
```


# Final thoughts

This vignette shows many of the features of the package, some features have
room for improvement, others are not fully described, and the reader will need
to browse the documentation. Hopefully it is a good starting point for using the
package.

```{r}
sessionInfo()
```