spectrakit Package – Part 2: Merging Spectra in R with combineSpectra()
This is part 2 of a series on the spectrakit R package. Start
here for an introduction.
Overview
The
combineSpectra() function from the spectrakit R
package streamlines the process of merging spectral data from multiple
files into a single, tidy dataset ready for analysis. It automatically
reads all raw spectra files in a specified folder, aligns them by a shared
variable (such as energy, wavelength or another common feature),
optionally filters the data to a specific range, and applies normalization
so the spectra are on comparable scales.
This function shines in workflows that involve handling multiple
measurement files from a specific spectrometry technique and require
merging them before analysis. For example, when preparing data for
multivariate analysis or machine learning, spectra must typically be
aligned along a common axis and normalized to account for differences in
overall signal intensity. Instead of manually importing and merging dozens
of raw files and writing repetitive code for normalization,
combineSpectra() automates this process, reducing the risk of
errors and producing a clean, analysis-ready dataset.
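To make concrete what is being automated, here is a minimal base-R sketch of the manual read-and-merge step. It is not the package's implementation: the folder name, file names and the `Spec_` column labels are assumptions chosen for illustration, and the three tiny spectra are synthetic.

```r
# Hypothetical manual equivalent of the merge step, using base R only.
# Write three small two-column spectra to a fresh temporary folder.
demo_dir <- tempfile("spectra_demo_")
dir.create(demo_dir)
wavelength <- seq(1100, 1110, by = 2)
for (i in 1:3) {
  write.csv(
    data.frame(Wavelength = wavelength,
               Reflectance = i * 0.1 + wavelength / 1e4),
    file = file.path(demo_dir, paste0("demo_spectrum_", i, ".csv")),
    row.names = FALSE
  )
}

# Read every CSV file and give each data column a unique name.
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
spectra <- lapply(seq_along(files), function(i) {
  df <- read.csv(files[i])
  names(df)[2] <- paste0("Spec_", i)
  df
})

# Merge all spectra on the shared wavelength column.
combined <- Reduce(function(x, y) merge(x, y, by = "Wavelength"), spectra)
print(combined)
```

Even for three files this takes a loop, a renaming step and a merge; with dozens of files plus range filtering and normalization, the bookkeeping grows quickly.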
Syntax
combineSpectra(
folder = ".",
file_type = "csv",
sep = ",",
header = TRUE,
common_col_pos = 1,
data_col_pos = 2,
range = NULL,
normalization = c("none", "simple", "min-max", "z-score", "area", "vector"),
norm_scope = c("global", "range"),
orientation = c("columns", "rows")
)
Arguments
| Argument name | Type | Description | Default value (if provided, argument is optional) |
|---|---|---|---|
| folder | Character | Path to the folder containing spectra files | "." (working directory) |
| file_type | Character | File extension (without dot) to search for | "csv" |
| sep | Character | Delimiter for file columns. Use "\t" for tab-delimited files | "," (comma-separated) |
| header | Logical | Whether the files contain a header row | TRUE |
| common_col_pos | Integer | Column position for the common variable (e.g., wavelength) | 1 |
| data_col_pos | Integer | Column position for the spectral intensity values | 2 |
| range | Numeric vector of length 2 | Range filter for the common column (e.g., wavelength limits) | NULL (no filtering) |
| normalization | Character | Normalization method to apply to intensity values. One of: "none", "simple", "min-max", "z-score", "area", "vector" | "none" |
| norm_scope | Character | Determines whether normalization is applied to the full spectrum or only to a specified range. One of: "global", "range" | "global" |
| orientation | Character | Output organization. One of: "columns", "rows" | "columns" |
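The table names the normalization methods but does not spell out their formulas. The sketch below shows common textbook definitions for each name, purely as an assumption: `normalize_spectrum()` is a hypothetical helper, not part of spectrakit, and the package's own formulas may differ (for instance, "area" could be based on numerical integration rather than a plain sum).

```r
# Assumed textbook definitions; the package's own formulas may differ.
normalize_spectrum <- function(y,
                               method = c("none", "simple", "min-max",
                                          "z-score", "area", "vector")) {
  method <- match.arg(method)
  switch(method,
    "none"    = y,                                   # leave values unchanged
    "simple"  = y / max(y),                          # scale so the maximum is 1
    "min-max" = (y - min(y)) / (max(y) - min(y)),    # rescale to [0, 1]
    "z-score" = (y - mean(y)) / sd(y),               # zero mean, unit sd
    "area"    = y / sum(y),                          # intensities sum to 1
    "vector"  = y / sqrt(sum(y^2))                   # unit Euclidean length
  )
}

y <- c(2, 4, 6, 8)
normalize_spectrum(y, "simple")
normalize_spectrum(y, "min-max")
```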
Return value
The function returns a combined spectral dataset as a tibble (i.e., a modern
data-frame from the tibble package). Each spectrum is consistently
arranged as either columns or rows, depending on the user’s preference, with
a shared variable (e.g., energy or wavelength) preserved for alignment. As a
tibble, the output can be printed for quick inspection, stored as an object
for later use (including export), subsetted or extracted using standard
data-frame operations (e.g.,
$ or dplyr verbs), passed
directly into downstream analysis functions (such as multivariate or
modeling workflows), or used for visualization with plotting packages like
ggplot2.
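As a rough illustration of these operations, the sketch below uses a small plain data frame as a stand-in for the returned tibble. The values are made up, and the `Intensity` and `Spectrum` column names are assumptions introduced here. It extracts one spectrum with `$` and reshapes the wide table into the long layout that plotting packages usually expect.

```r
# Small stand-in for the combined output (values are made up for illustration).
combined <- data.frame(
  Wavelength = c(1100, 1102, 1104),
  Spec_1 = c(0.30, 0.31, 0.32),
  Spec_2 = c(0.40, 0.41, 0.42)
)

# Extract a single spectrum as a numeric vector.
spec_1 <- combined$Spec_1

# Reshape to long format: one row per wavelength/spectrum pair.
long <- data.frame(
  Wavelength = rep(combined$Wavelength, times = ncol(combined) - 1),
  stack(combined[-1])
)
names(long) <- c("Wavelength", "Intensity", "Spectrum")
print(long)
```

A long table like this can be passed to ggplot2 by mapping `Wavelength` to x, `Intensity` to y and `Spectrum` to colour.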
Practical examples
Each raw spectrum file must contain two columns: one for the x-axis (e.g., energy, wavelength or wavenumber) and one for the y-axis (e.g., counts, reflectance or absorbance). In this example, we generate six near-infrared (NIR) sample spectra using the NIRsoil dataset included in the
prospectr package. The code below randomly selects six spectra from the
dataset and saves each as a CSV file in a temporary folder on the Desktop,
with named columns for wavelength and reflectance:
# ---- Install & load package if needed ----
# install.packages("prospectr")
library(prospectr)
# ---- Load dataset ----
data(NIRsoil)
# Spectral matrix and wavelength axis
spectra_matrix <- NIRsoil$spc
wavelength <- as.numeric(colnames(spectra_matrix)) # wavelength in nm
# ---- Create TEMP folder on Desktop ----
desktop_path <- file.path(path.expand("~"), "Desktop")
temp_dir <- file.path(desktop_path, "TEMP")
if (!dir.exists(temp_dir)) {
dir.create(temp_dir, recursive = TRUE)
}
# ---- Randomly select 6 spectra ----
set.seed(260329) # for reproducibility
n_available <- nrow(spectra_matrix)
if (n_available < 6) stop("Dataset contains fewer than 6 spectra.")
selected_rows <- sample(seq_len(n_available), 6)
# ---- Save each spectrum as a CSV file ----
for (i in seq_along(selected_rows)) {
reflectance <- spectra_matrix[selected_rows[i], ]
output_df <- data.frame(
Wavelength = wavelength,
Reflectance = as.numeric(reflectance)
)
file_path <- file.path(
temp_dir,
paste0("NIRsoil_spectrum_", i, ".csv")
)
write.csv(
output_df,
file = file_path,
row.names = FALSE,
quote = FALSE
)
}
cat("6 random NIR spectra saved in:", temp_dir, "\n")
The minimal working example of the
combineSpectra() function,
using only default values, does not require any explicit arguments,
provided that the working directory is set to the temporary Desktop folder
containing the spectra files:
# ---- Install & load package if needed ----
# install.packages("spectrakit")
library(spectrakit)
setwd(temp_dir)
combineSpectra()
The function returns a tibble with 700 rows and 7 columns, where each
row corresponds to a single wavelength. The first column contains
wavelength values starting at 1100, while the remaining six columns
contain reflectance measurements at each wavelength. Essentially, this
tibble consolidates the six spectra into a single tidy dataset, making
it easy to compare measurements across samples, plot the spectra or
perform further analyses. Only the first 10 rows are displayed by
default, with the rest accessible by printing the full tibble:
# A tibble: 700 × 7
Wavelength Spec_1 Spec_2 Spec_3 Spec_4 Spec_5 Spec_6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1100 0.302 0.303 0.277 0.306 0.285 0.365
2 1102 0.301 0.303 0.276 0.305 0.285 0.364
3 1104 0.301 0.302 0.276 0.305 0.284 0.364
4 1106 0.300 0.302 0.276 0.305 0.284 0.364
5 1108 0.300 0.302 0.275 0.304 0.283 0.363
6 1110 0.299 0.301 0.275 0.304 0.283 0.363
7 1112 0.299 0.301 0.274 0.303 0.283 0.363
8 1114 0.298 0.300 0.274 0.303 0.282 0.362
9 1116 0.298 0.300 0.274 0.303 0.282 0.362
10 1118 0.297 0.300 0.273 0.302 0.282 0.362
# ℹ 690 more rows
# ℹ Use `print(n = ...)` to see more rows
Beyond single positions, the
common_col_pos and
data_col_pos arguments also accept vectors of multiple
column positions for both common and specific variables, making the
function suitable for merging more complex datasets.
If the spectra were saved not as comma-separated CSV files in the
working directory but as tab-separated TXT files in a
text_version subfolder, we should specify the following
arguments:
combineSpectra(folder = "./text_version",
file_type = "txt",
sep = "\t")
In most cases, the function can also read
.dat and
.dpt files, provided they are in a compatible text format
(e.g., tab-delimited).
We may consider focusing on a narrower spectral range and/or
normalizing the intensity values to make them directly comparable. The
range and normalization arguments allow
restricting the data to a specific region of the common column and
scaling the values in the remaining columns, respectively.
Importantly, unless norm_scope is set to
"range", selecting a spectral range does not affect the
normalization step, which by default is applied to the full measured
spectra. This is the standard approach in spectroscopy: it avoids the
misleading intensity scaling that could occur if normalization were
applied only within the subset defined by range. For example:
combineSpectra(range = c(1500, 2200), normalization = "simple")
# A tibble: 351 × 7
Wavelength Spec_1 Spec_2 Spec_3 Spec_4 Spec_5 Spec_6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1500 0.619 0.801 0.821 0.795 0.767 0.889
2 1502 0.617 0.801 0.821 0.794 0.766 0.888
3 1504 0.615 0.800 0.820 0.793 0.765 0.887
4 1506 0.614 0.799 0.819 0.792 0.764 0.887
5 1508 0.612 0.799 0.819 0.791 0.763 0.886
6 1510 0.611 0.798 0.818 0.790 0.762 0.885
7 1512 0.609 0.797 0.817 0.790 0.761 0.884
8 1514 0.607 0.796 0.816 0.789 0.760 0.884
9 1516 0.606 0.796 0.816 0.788 0.760 0.883
10 1518 0.604 0.795 0.815 0.787 0.759 0.882
# ℹ 341 more rows
# ℹ Use `print(n = ...)` to see more rows
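The difference between the two normalization scopes can be sketched in base R on a synthetic spectrum. This assumes that "simple" normalization means division by the maximum intensity, which the post does not state explicitly; the variable names are hypothetical.

```r
# Synthetic spectrum whose global maximum lies outside the range of interest.
wavelength <- seq(1100, 1110, by = 2)
intensity  <- c(10, 8, 4, 5, 2, 1)
in_range   <- wavelength >= 1104 & wavelength <= 1108

# "Global"-style scope: normalize on the full spectrum, then subset.
global_norm <- (intensity / max(intensity))[in_range]

# "Range"-style scope: subset first, then normalize within the subset.
range_norm <- intensity[in_range] / max(intensity[in_range])

print(global_norm)  # scaled relative to the full-spectrum maximum (10)
print(range_norm)   # scaled relative to the in-range maximum (5)
```

With the global scope, the values inside the range keep their proportion to the strongest feature of the whole spectrum; with the range scope, the strongest in-range point is forced to 1.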
Finally, we may want to transpose the dataset so that each spectrum is
represented as a row rather than a column. In that case, the
orientation argument should be set to
"rows":
combineSpectra(orientation = "rows")
# A tibble: 6 × 701
Spectrum `1100` `1102` `1104` `1106` `1108` `1110` `1112` `1114` `1116` `1118` `1120` `1122` `1124`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Spec_1 0.302 0.301 0.301 0.300 0.300 0.299 0.299 0.298 0.298 0.297 0.297 0.296 0.296
2 Spec_2 0.303 0.303 0.302 0.302 0.302 0.301 0.301 0.300 0.300 0.300 0.299 0.299 0.298
3 Spec_3 0.277 0.276 0.276 0.276 0.275 0.275 0.274 0.274 0.274 0.273 0.273 0.273 0.272
4 Spec_4 0.306 0.305 0.305 0.305 0.304 0.304 0.303 0.303 0.303 0.302 0.302 0.302 0.301
5 Spec_5 0.285 0.285 0.284 0.284 0.283 0.283 0.283 0.282 0.282 0.282 0.281 0.281 0.281
6 Spec_6 0.365 0.364 0.364 0.364 0.363 0.363 0.363 0.362 0.362 0.362 0.361 0.361 0.360
# ℹ 687 more variables
# ℹ Use `colnames()` to see all variable names
Depending on
orientation, a sum or mean spectrum can
optionally be computed from the output: use rowSums() or
rowMeans() when spectra are arranged as columns, and
colSums() or colMeans() when they are arranged as rows.
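A minimal sketch of both cases, using a small made-up data frame in place of the function's output (the transposed matrix only imitates the "rows" layout and is not produced by the package):

```r
# Stand-in for output with spectra as columns (values are made up).
combined <- data.frame(
  Wavelength = c(1100, 1102, 1104),
  Spec_1 = c(0.2, 0.4, 0.6),
  Spec_2 = c(0.4, 0.6, 0.8)
)

# Spectra as columns: aggregate across each row, skipping the wavelength column.
mean_spectrum <- rowMeans(combined[, -1])
sum_spectrum  <- rowSums(combined[, -1])

# Spectra as rows: aggregate down each wavelength column instead.
by_rows <- t(as.matrix(combined[, -1]))
colnames(by_rows) <- combined$Wavelength
mean_spectrum_rows <- colMeans(by_rows)

print(mean_spectrum)
print(mean_spectrum_rows)
```

In both layouts the shared variable or the spectrum label must be excluded before aggregating, otherwise it is averaged or summed along with the intensities.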
Quick recap
The
combineSpectra() function from the
spectrakit R package merges multiple spectra files into a
single, tidy dataset, aligning them by a shared variable and
optionally applying range filtering and normalization. Key arguments
include folder (file location),
file_type and sep (input format),
range (spectral region of interest),
normalization (scaling method), and
orientation (rows or columns). It’s typically used when
preparing spectra for multivariate analysis, machine learning
workflows or downstream visualization with a plotting system like
ggplot2, automating what would otherwise be repetitive file
handling. With this function, you can quickly combine, align and
normalize spectra with minimal code, producing analysis-ready
datasets in one step.
About the author
Gianluca Pastorelli is a Heritage Scientist (Senior Researcher) working at the National Gallery of Denmark (SMK).