---
title: "Introduction to GCIMS"
output:
  "BiocStyle::html_document":
    dev: png
  "BiocStyle::pdf_document":
    latex_engine: lualatex
    df_print: "kable"
    dev: png
package: GCIMS
author: "GCIMS authors"
date: "`r format(Sys.Date(), '%F')`"
abstract: >
  An introduction to the GCIMS package, showing the most relevant functions and
  a proposed workflow. This includes loading demo samples, adding sample
  annotations, preprocessing the spectra, alignment, detecting peaks and regions
  of interest (ROIs), clustering of ROIs across samples, peak integration and
  building a peak table.
vignette: >
  %\VignetteIndexEntry{Introduction to GCIMS}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
start_time <- Sys.time()
library(BiocParallel)
library(ggplot2)
library(GCIMS)
```

The GCIMS package allows you to import your Gas Chromatography - Ion Mobility Spectrometry samples,
preprocess them, align them one to each other and build a peak table with the relevant features.

Enable parallellization of the workflow, here we use three cores:

```{r}
# disable parallellization: (Useful for better error reporting)
#register(SerialParam(progressbar = show_progress_bar()), default = TRUE)

# enable parallellization with 3 workers (you can use more if you have them):
register(SnowParam(workers = 2, progressbar = show_progress_bar(), exportglobals = FALSE), default = TRUE)
```


This vignette will use a small dataset consisting of a mixture of three ketones.

# Downloading the dataset

Download the "threeketones" dataset:

```{r}
# The folder where we will download the samples:
samples_directory <- "threeketones"

# Download the ketones dataset:
tryCatch({
  download_three_ketones_dataset(samples_directory)
  message("Download successful")
}, error = function(e) {
  message("The download of the samples did not succeed. The vignette can't continue.")
  message(conditionMessage(e))
  knitr::knit_exit()
})
```
Check that the files are downloaded:

```{r}
list.files(samples_directory)
```

# Import data

Please start by preparing an Excel spreadsheet (or a CSV/TSV file if you prefer)
with your samples and their annotations. Please name the first column `SampleID`
and the second column `Filename`. We will use those annotations in plots.


```{r message=FALSE}
annotations <- create_annotations_table(samples_directory)
annotations$SampleID<-c("Ketones1","Ketones2","Ketones3")
annotations
```

If you need to create your `annotations.csv` file for your samples, please
follow the example from `help("create_annotations_table")` for further details.

# Create a GCIMSDataset object

```{r}
dataset <- GCIMSDataset$new(
  annotations,
  base_dir = samples_directory,
  on_ram = TRUE # You probably should set this to FALSE if you have more 
                # than a handful of samples. See ?GCIMSDataset.
)
dataset
```


Most operations on the `dataset` are not executed until you need to get the actual samples or
data. This is done to perform them in batch, more efficiently, if possible. However,
you can manually `realize` the `GCIMSDataset` object so it executes all its pending operations.
We can see how the "read_sample" pending operation becomes part of the dataset history:

```{r}
dataset$realize()
```
Explore one sample:

```{r}
ket1 <- dataset$getSample(sample = "Ketones1")
plot(ket1, rt_range = c(0, 1000), dt_range = c(7.5, 17))
```

```{r}
dt_k1 <- dtime(ket1)
tis_k1 <- getTIS(ket1)

ggplot(dplyr::filter(data.frame(x = dt_k1, y = tis_k1), x >1)) + 
  geom_line(aes(x = x, y = y)) +
  scale_x_continuous(name = "Drift time (ms)", limits = c(7, 17)) +
  scale_y_continuous(name = "Intensity (a.u)", trans = cubic_root_trans())
```


```{r}
rt_k1 <- rtime(ket1)
ric_k1 <- getRIC(ket1)

ggplot(dplyr::filter(data.frame(x = rt_k1, y = ric_k1))) + 
  geom_line(aes(x = x, y = y)) +
  scale_x_continuous(name = "Retention time (ms)", limits = c(55, 900)) +
  scale_y_continuous(name = "Intensity (a.u)")
```

Plot the RIC and the TIS to get an overview of the dataset:

```{r}
plotTIS(dataset)
```

```{r}
plotRIC(dataset)
```


# Filter the retention and drift time of your samples

```{r}
filterRt(dataset, rt = c(0, 1100)) # in s
filterDt(dataset, dt = c(5, 16)) # in ms
dataset
```


```{r}
ket1afterfilter <- dataset$getSample(sample = "Ketones1")
ket1afterfilter
```

# Smoothing

You can remove noise from your sample using a Savitzky-Golay filter, applied
both in drift time and in retention time.

The Savitzky-Golay has two main parameters: the filter length and the filter order.
It is recommended to use a filter order of 2, but the filter length must be selected
so it is large enough to remove noise but always smaller than the peak width to
prevent distorting the peaks.

You can apply the smoothing filter to a single IMS spectrum or to a single chromatogram
to see how noise is removed and how peaks are not distorted. Tweak the filter lengths
and, once you are happy, apply the smoothing filter to all the dataset.

```{r smoothing-starts-here}
one_ims_spec <- getSpectrum(ket1afterfilter, rt_range = 97.11)
```

```{r}
one_ims_smoothed <- smooth(one_ims_spec, dt_length_ms = 0.14, dt_order = 2)
to_plot <- dplyr::bind_rows(
  NoSmoothed = as.data.frame(one_ims_spec),
  Smoothed = as.data.frame(one_ims_smoothed),
  .id = "Status"
)
plot_interactive(ggplot(to_plot) +
  geom_line(aes(x = drift_time_ms, y = intensity, colour = Status)) +
  labs(x = "Drift time (ms)", y = "Intensity (a.u.)"))
```

```{r}
one_chrom <- getChromatogram(ket1afterfilter, dt_range = 10.4)
```

```{r}
one_chrom_smoothed <- smooth(one_chrom, rt_length_s = 3, rt_order = 2)
to_plot <- dplyr::bind_rows(
  NoSmoothed = as.data.frame(one_chrom),
  Smoothed = as.data.frame(one_chrom_smoothed),
  .id = "Status"
)
plot_interactive(ggplot(to_plot) +
  geom_line(aes(x = retention_time_s, y = intensity, colour = Status)) +
  labs(x = "Retention time (s)", y = "Intensity (a.u.)"))
```

You can also apply it to a single sample:

```{r}
ket1_smoothed <- smooth(
  ket1afterfilter,
  rt_length_s = 3,
  dt_length_ms = 0.14,
  rt_order = 2,
  dt_order = 2
)
```

```{r}
cowplot::plot_grid(
  plot(ket1afterfilter, rt_range = c(0, 500), dt_range = c(6, 16)),
  plot(ket1_smoothed, rt_range = c(0, 500), dt_range = c(6, 16)),
  ncol = 2
)
```


Or to the whole dataset (the default order is 2 for both axes):

```{r}
dataset <- smooth(dataset, rt_length_s = 3, dt_length_ms = 0.14)
dataset$realize()
```

# Decimation


One way to speed up calculations and reduce the memory requirements is to decimate
the matrix, by taking 1 every Nd points in drift time and 1 every Nr points in retention time.

```{r}
ket1_decimated <- decimate(ket1_smoothed, rt_factor = 1, dt_factor = 2)
```

```{r}
ket1_spec_smoothed <- getSpectrum(ket1_smoothed, rt_range = 300, dt_range = c(7, 13))
ket1_spec_decimated <- getSpectrum(ket1_decimated, rt_range = 300, dt_range = c(7, 13))
```

```{r}
cowplot::plot_grid(
  plot(ket1_spec_smoothed) + labs(title = "Before decimation"),
  plot(ket1_spec_decimated) + labs(title = "After decimation"),
  ncol = 1
)

```


Once you are satisfied with de decimation factor (if you want to apply it) just
use it on the whole dataset:

```{r}
decimate(dataset, rt_factor = 1, dt_factor = 2)
```

One alternative is to start with a higher decimation factor (lets say
2 in retention time and 4 in drift time) and, after running the pipeline successfully,
repeat it without decimation.


# Alignment

Pressure and temperature fluctuations as well as degradation of the chromatographic
column are some of the causes of misalignments in the data, both in retention and
drift time.

In order to be able to compare samples to each other, we align the samples.

The alignment will happen first in drift time and afterwards in retention time. To
correct the drift time, we will use a multiplicative correction $t_d' = k  t_d$. The
correction factor $k$ will be estimated using the RIP positions for each sample,
extracted from the Total Ion Spectra. The correction factors are typically between
0.9 and 1.1. The reference RIP position is the median of all the RIP positions to
minimize the distortions.

One of the checks to verify the alignment in drift time is not failing is by plotting
the Total Ion Spectra of several samples before and after the alignment and see
the effect of the correction.

The retention time will be corrected using Parametric Time Warping, where
$t_r' = P(t_r)$, and $P$ is a polynomial of typically not a high order (1-5). For
efficiency reasons, the polynomial will be estimated using the Reverse Ion Chromatogram of
the samples to be aligned.


```{r}
plotTIS(dataset, dt_range = c(7, 17))
```

```{r}
plotRIC(dataset)
```


```{r alignment-starts-here}
align(dataset)
```

```{r}
dataset$realize()
```

```{r}
plotTIS(dataset, dt_range = c(7, 17))
```

```{r}
plotRIC(dataset)
```


```{r fig.height=8}
align_plots <- alignPlots(dataset)
cowplot::plot_grid(plotlist = align_plots, ncol = 1)
```


# Peaks


First try one sample and optimize the `noise_level` parameter there. Change values
from 0.5 to 4 to explore. If it's higher less peaks will be found. If it is too low
peaks may appear broken into two regions or false detections may occur.

```{r peak-detection-starts-here}
ket1 <- dataset$getSample("Ketones1")
ch1 <- getChromatogram(ket1, dt_range = 10.75)
```

```{r}
# I don't like the _xunits in the argument names, I'll find an alternative soon
ch1 <- findPeaks(
  ch1,
  verbose = FALSE,
  length_in_xunits = 3,
  peakwidth_range_xunits = c(10, 25),
  peakDetectionCWTParams = list(exclude0scaleAmpThresh = TRUE),
  extension_factor = 0,
  iou_overlap_threshold = 0.2,
  debug = FALSE
)
peak_list_ch1 <- peaks(ch1)
plot(ch1) + geom_vline(xintercept = peak_list_ch1$apex, color = "red", linetype = "dashed")
```

```{r}
ket1 <- findPeaks(
  ket1,
  rt_length_s = 3,
  dt_length_ms = 0.14,
  verbose = TRUE,
  dt_peakwidth_range_ms = c(0.15, 0.4),
  rt_peakwidth_range_s = c(10, 25),
  dt_peakDetectionCWTParams = list(exclude0scaleAmpThresh = TRUE),
  rt_peakDetectionCWTParams = list(exclude0scaleAmpThresh = TRUE),
  dt_extension_factor = 0,
  rt_extension_factor = 0,
  exclude_rip = TRUE,
  iou_overlap_threshold = 0.2
)
```

```{r}
peak_list_ket1 <- peaks(ket1)
plot_interactive(plot(ket1) +
  overlay_peaklist(peak_list_ket1, color_by = "PeakID"))
```

Then do it on the whole dataset:

```{r}
findPeaks(
  dataset,
  rt_length_s = 3,
  dt_length_ms = 0.14,
  verbose = TRUE,
  dt_peakwidth_range_ms = c(0.15, 0.4),
  rt_peakwidth_range_s = c(10, 25),
  dt_peakDetectionCWTParams = list(exclude0scaleAmpThresh = TRUE),
  rt_peakDetectionCWTParams = list(exclude0scaleAmpThresh = TRUE),
  dt_extension_factor = 0,
  rt_extension_factor = 0,
  exclude_rip = TRUE,
  iou_overlap_threshold = 0.2
)
peak_list <- peaks(dataset)
```
You can get any other sample if you like, plot it and plot its peaks on top:

```{r}
ket2 <- dataset$getSample("Ketones2")
```

```{r}
plot_interactive(plot(ket2) +
  overlay_peaklist(peaks(ket2)) )
```

Or plot all the peaks from all the dataset together, overlayed on a single sample:

```{r}
plt <- plot(ket2) +
  overlay_peaklist(peaks(dataset), color_by = "SampleID")
plot_interactive(plt)
```

# Clustering

```{r clustering-section-starts-here}
peak_clustering <- clusterPeaks(
  peak_list,
  distance_method = "euclidean",
  dt_cluster_spread_ms = 0.1,
  rt_cluster_spread_s = 20,
  clustering = list(method = "hclust")
)
```

The peak list, with cluster ids can be plotted on top of a single sample:

```{r}
peak_list_clustered <- peak_clustering$peak_list_clustered
```


```{r }
plt <- plot(ket2) +
  overlay_peaklist(peak_list_clustered, color_by = "SampleID")
plot_interactive(plt)
```

```{r}
plt <- plot(ket2) +
  overlay_peaklist(dplyr::filter(peak_list_clustered, !is.na(cluster)), color_by = "cluster")
plot_interactive(plt)
```

The resulting cluster sizes (median position of individual clusters) is not a good reference
for integration. We are working on this.

```{r}
plt <- plot(ket2) +
  overlay_peaklist(peak_clustering$cluster_stats, color_by = "cluster")
plot_interactive(plt)
```

# Baseline correction

```{r baseline-section-starts-here}
ket1 <- dataset$getSample("Ketones1")
```


```{r}
one_ims_spec <- getSpectrum(ket1, rt_range = 97.11)
```

```{r}
plot(one_ims_spec) + coord_cartesian(xlim = c(7, 12))
```

```{r}
one_ims_spec <- estimateBaseline(one_ims_spec, dt_peak_fwhm_ms = 0.2, dt_region_multiplier = 12)
to_plot <- data.frame(
  drift_time_ms = dtime(one_ims_spec),
  Intensity = intensity(one_ims_spec),
  Baseline = baseline(one_ims_spec)
)
plot_interactive(ggplot(to_plot) +
  geom_line(aes(x = drift_time_ms, y = Intensity), color = "red") +
  geom_line(aes(x = drift_time_ms, y = Baseline), color = "blue") +
  labs(x = "Drift time (ms)", y = "Intensity (a.u.)"))
```

```{r}
one_chrom <- getChromatogram(ket1, dt_range = 10.4)
```

```{r}
one_chrom <- estimateBaseline(one_chrom, rt_length_s = 200)
to_plot <- data.frame(
  retention_time_s = rtime(one_chrom),
  Intensity = intensity(one_chrom),
  Baseline = baseline(one_chrom)
)

ggplot(to_plot) +
  geom_line(aes(x = retention_time_s, y = Intensity), color = "red") +
  geom_line(aes(x = retention_time_s, y = Baseline), color = "blue") +
  labs(x = "Retention time (s)", y = "Intensity (a.u.)")
```

You can also apply it to a single sample:

```{r}
ket1_no_basline <- estimateBaseline(
  ket1,
  dt_peak_fwhm_ms = 0.2, 
  dt_region_multiplier = 12,
  rt_length_s = 200,
  remove = TRUE
)
ket1_basline <- estimateBaseline(
  ket1,
  dt_peak_fwhm_ms = 0.2, 
  dt_region_multiplier = 12,
  rt_length_s = 200,
  remove = FALSE
)
```

```{r}
cowplot::plot_grid(
  plot(x = ket1_no_basline, rt_range = c(0, 500), dt_range = c(6, 16)),
  plot(x = ket1_basline, rt_range = c(0, 500), dt_range = c(6, 16)),
  ncol = 2
)
```


Or to the whole dataset:

```{r}
dataset <- estimateBaseline(
  dataset,
  dt_peak_fwhm_ms = 0.2, 
  dt_region_multiplier = 12,
  rt_length_s = 200
)
dataset$realize()
```


Some clusters:

```{r}
plot(ket2) + 
  overlay_peaklist(
    peak_clustering$peak_list_clustered,
    color_by = "SampleID",
    mapping_roi = c(
      "dt_min_ms" = "fixedsize_dt_min_ms",
      "dt_max_ms" = "fixedsize_dt_max_ms",
      "rt_min_s" = "fixedsize_rt_min_s",
      "rt_max_s" = "fixedsize_rt_max_s"
    )
  ) +
  lims(x = c(6, 12), y = c(0, 400))
```


# Peak integration

```{r}
dataset <- integratePeaks(
  dataset, 
  peak_clustering$peak_list, 
  integration_size_method = "fixed_size", 
  rip_saturation_threshold = 0.1
)
```


```{r}
peak_list <- peaks(dataset)
```


# Build peak table

```{r}
peak_table <- peakTable(peak_list, aggregate_conflicting_peaks = max)
t(tail(t(peak_table$peak_table_matrix)))
```
We showed the last 6 clusters where we have "NA" Values

# Imputation
```{r}
peak_table_imputed <- imputePeakTable(peak_table$peak_table_matrix,
                                      dataset, 
                                      peak_clustering$cluster_stats)
t(tail(t(peak_table_imputed)))
```
As we can see the "NA" values were imputed


```{r}
# Implemented until here
end_time <- Sys.time()
message("The vignette ran in  ", format(end_time - start_time))
```


# Session Info:

```{r}
sessionInfo()
```