--- title: "Introduction to GCIMS" output: "BiocStyle::html_document": dev: png "BiocStyle::pdf_document": latex_engine: lualatex df_print: "kable" dev: png package: GCIMS author: "GCIMS authors" date: "`r format(Sys.Date(), '%F')`" abstract: > An introduction to the GCIMS package, showing the most relevant functions and a proposed workflow. This includes loading demo samples, adding sample annotations, preprocessing the spectra, alignment, detecting peaks and regions of interest (ROIs), clustering of ROIs across samples, peak integration and building a peak table. vignette: > %\VignetteIndexEntry{Introduction to GCIMS} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} start_time <- Sys.time() library(BiocParallel) library(ggplot2) library(GCIMS) ``` The GCIMS package allows you to import your Gas Chromatography - Ion Mobility Spectrometry samples, preprocess them, align them one to each other and build a peak table with the relevant features. Enable parallellization of the workflow, here we use three cores: ```{r} # disable parallellization: (Useful for better error reporting) #register(SerialParam(progressbar = show_progress_bar()), default = TRUE) # enable parallellization with 3 workers (you can use more if you have them): register(SnowParam(workers = 2, progressbar = show_progress_bar(), exportglobals = FALSE), default = TRUE) ``` This vignette will use a small dataset consisting of a mixture of three ketones. # Downloading the dataset Download the "threeketones" dataset: ```{r} # The folder where we will download the samples: samples_directory <- "threeketones" # Download the ketones dataset: tryCatch({ download_three_ketones_dataset(samples_directory) message("Download successful") }, error = function(e) { message("The download of the samples did not succeed. The vignette can't continue.") message(conditionMessage(e)) knitr::knit_exit() }) ``` Check that the files are downloaded: ```{r} list.files(samples_directory) ``` # Import data Please start by preparing an Excel spreadsheet (or a CSV/TSV file if you prefer) with your samples and their annotations. Please name the first column `SampleID` and the second column `Filename`. We will use those annotations in plots. ```{r message=FALSE} annotations <- create_annotations_table(samples_directory) annotations$SampleID<-c("Ketones1","Ketones2","Ketones3") annotations ``` If you need to create your `annotations.csv` file for your samples, please follow the example from `help("create_annotations_table")` for further details. # Create a GCIMSDataset object ```{r} dataset <- GCIMSDataset$new( annotations, base_dir = samples_directory, on_ram = TRUE # You probably should set this to FALSE if you have more # than a handful of samples. See ?GCIMSDataset. ) dataset ``` Most operations on the `dataset` are not executed until you need to get the actual samples or data. This is done to perform them in batch, more efficiently, if possible. However, you can manually `realize` the `GCIMSDataset` object so it executes all its pending operations. We can see how the "read_sample" pending operation becomes part of the dataset history: ```{r} dataset$realize() ``` Explore one sample: ```{r} ket1 <- dataset$getSample(sample = "Ketones1") plot(ket1, rt_range = c(0, 1000), dt_range = c(7.5, 17)) ``` ```{r} dt_k1 <- dtime(ket1) tis_k1 <- getTIS(ket1) ggplot(dplyr::filter(data.frame(x = dt_k1, y = tis_k1), x >1)) + geom_line(aes(x = x, y = y)) + scale_x_continuous(name = "Drift time (ms)", limits = c(7, 17)) + scale_y_continuous(name = "Intensity (a.u)", trans = cubic_root_trans()) ``` ```{r} rt_k1 <- rtime(ket1) ric_k1 <- getRIC(ket1) ggplot(dplyr::filter(data.frame(x = rt_k1, y = ric_k1))) + geom_line(aes(x = x, y = y)) + scale_x_continuous(name = "Retention time (ms)", limits = c(55, 900)) + scale_y_continuous(name = "Intensity (a.u)") ``` Plot the RIC and the TIS to get an overview of the dataset: ```{r} plotTIS(dataset) ``` ```{r} plotRIC(dataset) ``` # Filter the retention and drift time of your samples ```{r} filterRt(dataset, rt = c(0, 1100)) # in s filterDt(dataset, dt = c(5, 16)) # in ms dataset ``` ```{r} ket1afterfilter <- dataset$getSample(sample = "Ketones1") ket1afterfilter ``` # Smoothing You can remove noise from your sample using a Savitzky-Golay filter, applied both in drift time and in retention time. The Savitzky-Golay has two main parameters: the filter length and the filter order. It is recommended to use a filter order of 2, but the filter length must be selected so it is large enough to remove noise but always smaller than the peak width to prevent distorting the peaks. You can apply the smoothing filter to a single IMS spectrum or to a single chromatogram to see how noise is removed and how peaks are not distorted. Tweak the filter lengths and, once you are happy, apply the smoothing filter to all the dataset. ```{r smoothing-starts-here} one_ims_spec <- getSpectrum(ket1afterfilter, rt_range = 97.11) ``` ```{r} one_ims_smoothed <- smooth(one_ims_spec, dt_length_ms = 0.14, dt_order = 2) to_plot <- dplyr::bind_rows( NoSmoothed = as.data.frame(one_ims_spec), Smoothed = as.data.frame(one_ims_smoothed), .id = "Status" ) plot_interactive(ggplot(to_plot) + geom_line(aes(x = drift_time_ms, y = intensity, colour = Status)) + labs(x = "Drift time (ms)", y = "Intensity (a.u.)")) ``` ```{r} one_chrom <- getChromatogram(ket1afterfilter, dt_range = 10.4) ``` ```{r} one_chrom_smoothed <- smooth(one_chrom, rt_length_s = 3, rt_order = 2) to_plot <- dplyr::bind_rows( NoSmoothed = as.data.frame(one_chrom), Smoothed = as.data.frame(one_chrom_smoothed), .id = "Status" ) plot_interactive(ggplot(to_plot) + geom_line(aes(x = retention_time_s, y = intensity, colour = Status)) + labs(x = "Retention time (s)", y = "Intensity (a.u.)")) ``` You can also apply it to a single sample: ```{r} ket1_smoothed <- smooth( ket1afterfilter, rt_length_s = 3, dt_length_ms = 0.14, rt_order = 2, dt_order = 2 ) ``` ```{r} cowplot::plot_grid( plot(ket1afterfilter, rt_range = c(0, 500), dt_range = c(6, 16)), plot(ket1_smoothed, rt_range = c(0, 500), dt_range = c(6, 16)), ncol = 2 ) ``` Or to the whole dataset (the default order is 2 for both axes): ```{r} dataset <- smooth(dataset, rt_length_s = 3, dt_length_ms = 0.14) dataset$realize() ``` # Decimation One way to speed up calculations and reduce the memory requirements is to decimate the matrix, by taking 1 every Nd points in drift time and 1 every Nr points in retention time. ```{r} ket1_decimated <- decimate(ket1_smoothed, rt_factor = 1, dt_factor = 2) ``` ```{r} ket1_spec_smoothed <- getSpectrum(ket1_smoothed, rt_range = 300, dt_range = c(7, 13)) ket1_spec_decimated <- getSpectrum(ket1_decimated, rt_range = 300, dt_range = c(7, 13)) ``` ```{r} cowplot::plot_grid( plot(ket1_spec_smoothed) + labs(title = "Before decimation"), plot(ket1_spec_decimated) + labs(title = "After decimation"), ncol = 1 ) ``` Once you are satisfied with de decimation factor (if you want to apply it) just use it on the whole dataset: ```{r} decimate(dataset, rt_factor = 1, dt_factor = 2) ``` One alternative is to start with a higher decimation factor (lets say 2 in retention time and 4 in drift time) and, after running the pipeline successfully, repeat it without decimation. # Alignment Pressure and temperature fluctuations as well as degradation of the chromatographic column are some of the causes of misalignments in the data, both in retention and drift time. In order to be able to compare samples to each other, we align the samples. The alignment will happen first in drift time and afterwards in retention time. To correct the drift time, we will use a multiplicative correction $t_d' = k t_d$. The correction factor $k$ will be estimated using the RIP positions for each sample, extracted from the Total Ion Spectra. The correction factors are typically between 0.9 and 1.1. The reference RIP position is the median of all the RIP positions to minimize the distortions. One of the checks to verify the alignment in drift time is not failing is by plotting the Total Ion Spectra of several samples before and after the alignment and see the effect of the correction. The retention time will be corrected using Parametric Time Warping, where $t_r' = P(t_r)$, and $P$ is a polynomial of typically not a high order (1-5). For efficiency reasons, the polynomial will be estimated using the Reverse Ion Chromatogram of the samples to be aligned. ```{r} plotTIS(dataset, dt_range = c(7, 17)) ``` ```{r} plotRIC(dataset) ``` ```{r alignment-starts-here} align(dataset) ``` ```{r} dataset$realize() ``` ```{r} plotTIS(dataset, dt_range = c(7, 17)) ``` ```{r} plotRIC(dataset) ``` ```{r fig.height=8} align_plots <- alignPlots(dataset) cowplot::plot_grid(plotlist = align_plots, ncol = 1) ``` # Peaks First try one sample and optimize the `noise_level` parameter there. Change values from 0.5 to 4 to explore. If it's higher less peaks will be found. If it is too low peaks may appear broken into two regions or false detections may occur. ```{r peak-detection-starts-here} ket1 <- dataset$getSample("Ketones1") ch1 <- getChromatogram(ket1, dt_range = 10.75) ``` ```{r} # I don't like the _xunits in the argument names, I'll find an alternative soon ch1 <- findPeaks( ch1, verbose = FALSE, length_in_xunits = 3, peakwidth_range_xunits = c(10, 25), peakDetectionCWTParams = list(exclude0scaleAmpThresh = TRUE), extension_factor = 0, iou_overlap_threshold = 0.2, debug = FALSE ) peak_list_ch1 <- peaks(ch1) plot(ch1) + geom_vline(xintercept = peak_list_ch1$apex, color = "red", linetype = "dashed") ``` ```{r} ket1 <- findPeaks( ket1, rt_length_s = 3, dt_length_ms = 0.14, verbose = TRUE, dt_peakwidth_range_ms = c(0.15, 0.4), rt_peakwidth_range_s = c(10, 25), dt_peakDetectionCWTParams = list(exclude0scaleAmpThresh = TRUE), rt_peakDetectionCWTParams = list(exclude0scaleAmpThresh = TRUE), dt_extension_factor = 0, rt_extension_factor = 0, exclude_rip = TRUE, iou_overlap_threshold = 0.2 ) ``` ```{r} peak_list_ket1 <- peaks(ket1) plot_interactive(plot(ket1) + overlay_peaklist(peak_list_ket1, color_by = "PeakID")) ``` Then do it on the whole dataset: ```{r} findPeaks( dataset, rt_length_s = 3, dt_length_ms = 0.14, verbose = TRUE, dt_peakwidth_range_ms = c(0.15, 0.4), rt_peakwidth_range_s = c(10, 25), dt_peakDetectionCWTParams = list(exclude0scaleAmpThresh = TRUE), rt_peakDetectionCWTParams = list(exclude0scaleAmpThresh = TRUE), dt_extension_factor = 0, rt_extension_factor = 0, exclude_rip = TRUE, iou_overlap_threshold = 0.2 ) peak_list <- peaks(dataset) ``` You can get any other sample if you like, plot it and plot its peaks on top: ```{r} ket2 <- dataset$getSample("Ketones2") ``` ```{r} plot_interactive(plot(ket2) + overlay_peaklist(peaks(ket2)) ) ``` Or plot all the peaks from all the dataset together, overlayed on a single sample: ```{r} plt <- plot(ket2) + overlay_peaklist(peaks(dataset), color_by = "SampleID") plot_interactive(plt) ``` # Clustering ```{r clustering-section-starts-here} peak_clustering <- clusterPeaks( peak_list, distance_method = "euclidean", dt_cluster_spread_ms = 0.1, rt_cluster_spread_s = 20, clustering = list(method = "hclust") ) ``` The peak list, with cluster ids can be plotted on top of a single sample: ```{r} peak_list_clustered <- peak_clustering$peak_list_clustered ``` ```{r } plt <- plot(ket2) + overlay_peaklist(peak_list_clustered, color_by = "SampleID") plot_interactive(plt) ``` ```{r} plt <- plot(ket2) + overlay_peaklist(dplyr::filter(peak_list_clustered, !is.na(cluster)), color_by = "cluster") plot_interactive(plt) ``` The resulting cluster sizes (median position of individual clusters) is not a good reference for integration. We are working on this. ```{r} plt <- plot(ket2) + overlay_peaklist(peak_clustering$cluster_stats, color_by = "cluster") plot_interactive(plt) ``` # Baseline correction ```{r baseline-section-starts-here} ket1 <- dataset$getSample("Ketones1") ``` ```{r} one_ims_spec <- getSpectrum(ket1, rt_range = 97.11) ``` ```{r} plot(one_ims_spec) + coord_cartesian(xlim = c(7, 12)) ``` ```{r} one_ims_spec <- estimateBaseline(one_ims_spec, dt_peak_fwhm_ms = 0.2, dt_region_multiplier = 12) to_plot <- data.frame( drift_time_ms = dtime(one_ims_spec), Intensity = intensity(one_ims_spec), Baseline = baseline(one_ims_spec) ) plot_interactive(ggplot(to_plot) + geom_line(aes(x = drift_time_ms, y = Intensity), color = "red") + geom_line(aes(x = drift_time_ms, y = Baseline), color = "blue") + labs(x = "Drift time (ms)", y = "Intensity (a.u.)")) ``` ```{r} one_chrom <- getChromatogram(ket1, dt_range = 10.4) ``` ```{r} one_chrom <- estimateBaseline(one_chrom, rt_length_s = 200) to_plot <- data.frame( retention_time_s = rtime(one_chrom), Intensity = intensity(one_chrom), Baseline = baseline(one_chrom) ) ggplot(to_plot) + geom_line(aes(x = retention_time_s, y = Intensity), color = "red") + geom_line(aes(x = retention_time_s, y = Baseline), color = "blue") + labs(x = "Retention time (s)", y = "Intensity (a.u.)") ``` You can also apply it to a single sample: ```{r} ket1_no_basline <- estimateBaseline( ket1, dt_peak_fwhm_ms = 0.2, dt_region_multiplier = 12, rt_length_s = 200, remove = TRUE ) ket1_basline <- estimateBaseline( ket1, dt_peak_fwhm_ms = 0.2, dt_region_multiplier = 12, rt_length_s = 200, remove = FALSE ) ``` ```{r} cowplot::plot_grid( plot(x = ket1_no_basline, rt_range = c(0, 500), dt_range = c(6, 16)), plot(x = ket1_basline, rt_range = c(0, 500), dt_range = c(6, 16)), ncol = 2 ) ``` Or to the whole dataset: ```{r} dataset <- estimateBaseline( dataset, dt_peak_fwhm_ms = 0.2, dt_region_multiplier = 12, rt_length_s = 200 ) dataset$realize() ``` Some clusters: ```{r} plot(ket2) + overlay_peaklist( peak_clustering$peak_list_clustered, color_by = "SampleID", mapping_roi = c( "dt_min_ms" = "fixedsize_dt_min_ms", "dt_max_ms" = "fixedsize_dt_max_ms", "rt_min_s" = "fixedsize_rt_min_s", "rt_max_s" = "fixedsize_rt_max_s" ) ) + lims(x = c(6, 12), y = c(0, 400)) ``` # Peak integration ```{r} dataset <- integratePeaks( dataset, peak_clustering$peak_list, integration_size_method = "fixed_size", rip_saturation_threshold = 0.1 ) ``` ```{r} peak_list <- peaks(dataset) ``` # Build peak table ```{r} peak_table <- peakTable(peak_list, aggregate_conflicting_peaks = max) t(tail(t(peak_table$peak_table_matrix))) ``` We showed the last 6 clusters where we have "NA" Values # Imputation ```{r} peak_table_imputed <- imputePeakTable(peak_table$peak_table_matrix, dataset, peak_clustering$cluster_stats) t(tail(t(peak_table_imputed))) ``` As we can see the "NA" values were imputed ```{r} # Implemented until here end_time <- Sys.time() message("The vignette ran in ", format(end_time - start_time)) ``` # Session Info: ```{r} sessionInfo() ```