spectrakit Package – Part 2: Merging Spectra in R with combineSpectra()

This is part 2 of a series on the spectrakit R package. Start here for an introduction.

Overview

The combineSpectra() function from the spectrakit R package streamlines the process of merging spectral data from multiple files into a single, tidy dataset ready for analysis. It automatically reads all raw spectra files in a specified folder, aligns them by a shared variable (such as energy, wavelength or another common feature), optionally filters the data to a specific range, and applies normalization so the spectra are on comparable scales.
The function is most useful in workflows where several measurement files from the same spectroscopic technique must be merged before analysis. For example, when preparing data for multivariate analysis or machine learning, spectra must typically be aligned along a common axis and normalized to account for differences in overall signal intensity. Instead of manually importing and merging dozens of raw files and writing repetitive normalization code, combineSpectra() automates the process, reducing the risk of errors and producing a clean, analysis-ready dataset.
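To make the "manual" alternative concrete, here is a minimal base R sketch of the steps described above: reading every file in a folder, joining the spectra on the shared column and applying a simple max normalization. It is only an illustration under stated assumptions, not the package's implementation. The folder toy_spectra, the toy values and the object names are hypothetical, and the Spec_ column naming simply mimics the output shown later in this post.

```r
# Toy stand-in for a folder of raw spectra: three two-column CSV files
# (hypothetical folder name and values, used for illustration only)
toy_dir <- file.path(tempdir(), "toy_spectra")
dir.create(toy_dir, showWarnings = FALSE)

x <- c(1100, 1102, 1104)
for (i in 1:3) {
    write.csv(
        data.frame(Wavelength = x, Reflectance = x / 1000 + i / 10),
        file = file.path(toy_dir, paste0("toy_", i, ".csv")),
        row.names = FALSE
    )
}

# Manual merge: read every file, rename the intensity column after the
# file's position in the list, then join on the shared column
files <- list.files(toy_dir, pattern = "\\.csv$", full.names = TRUE)
spectra <- lapply(seq_along(files), function(i) {
    df <- read.csv(files[i])
    names(df)[2] <- paste0("Spec_", i)
    df
})
merged <- Reduce(function(a, b) merge(a, b, by = "Wavelength"), spectra)

# Manual "simple" normalization: divide each spectrum by its own maximum
merged_norm <- merged
merged_norm[-1] <- lapply(merged[-1], function(y) y / max(y))
```

Even for three toy files this takes a loop, a join and a separate scaling step, which is the repetitive code a single function call is meant to replace.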

Syntax

combineSpectra(
  folder = ".",
  file_type = "csv",
  sep = ",",
  header = TRUE,
  common_col_pos = 1,
  data_col_pos = 2,
  range = NULL,
  normalization = c("none", "simple", "min-max", "z-score", "area", "vector"),
  norm_scope = c("global", "range"),
  orientation = c("columns", "rows")
)

Arguments

Argument name | Type | Description | Default value (if provided, argument is optional)
folder | Character | Path to the folder containing spectra files | "." (working directory)
file_type | Character | File extension (without dot) to search for | "csv"
sep | Character | Delimiter for file columns. Use "\t" for tab-delimited files | "," (comma-separated)
header | Logical | Whether the files contain a header row | TRUE
common_col_pos | Integer | Column position for the common variable (e.g., wavelength) | 1
data_col_pos | Integer | Column position for the spectral intensity values | 2
range | Numeric vector of length 2 | Range filter for the common column (e.g., wavelength limits) | NULL (no filtering)
normalization | Character | Normalization method to apply to intensity values; one of the options listed below | "none"
  • "none" (no normalization is applied)
  • "simple" (divide by the maximum intensity)
  • "min-max" (scale intensities to the [0,1] range)
  • "z-score" (subtract the mean and divide by the standard deviation of intensities)
  • "area" (divide by the total sum of intensities so the spectrum area = 1)
  • "vector" (normalize the spectrum as a unit vector by dividing by the square root of the sum of squared intensities; also known as L2 normalization)
norm_scope | Character | Determines whether normalization is applied to the full spectrum or only to a specified range; one of the options listed below | "global"
  • "global" (full spectrum)
  • "range" (specified range)
orientation | Character | Output organization; one of the options listed below | "columns"
  • "columns", to keep each spectrum as a column
  • "rows", to transpose so each spectrum is a row

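The normalization options listed in the table can be written out as one-line formulas. The sketch below applies each of them to a toy intensity vector in base R, purely to show what the descriptions mean; it is not the package's code. In particular, the use of sd() (the sample standard deviation) for the z-score is an assumption, since the table does not say which variant the package uses.

```r
y <- c(2, 4, 6, 8)  # toy intensities (hypothetical values)

norm_simple <- y / max(y)                        # "simple": divide by the maximum
norm_minmax <- (y - min(y)) / (max(y) - min(y))  # "min-max": rescale to [0, 1]
norm_zscore <- (y - mean(y)) / sd(y)             # "z-score": assumes sample SD
norm_area   <- y / sum(y)                        # "area": values sum to 1
norm_vector <- y / sqrt(sum(y^2))                # "vector": unit L2 norm
```

The choice matters downstream: "simple" and "min-max" fix the maximum at 1, "area" and "vector" fix an overall magnitude, and "z-score" centers the values around zero.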
Return value

The function returns the combined spectral dataset as a tibble (i.e., a modern data frame from the tibble package). The spectra are arranged consistently as either columns or rows, depending on the orientation argument, and the shared variable (e.g., energy or wavelength) is preserved for alignment. As a tibble, the output can be printed for quick inspection, stored as an object for later use (including export), subsetted with standard data-frame operations (e.g., $ or dplyr verbs), passed directly into downstream analysis functions (such as multivariate or modeling workflows), or visualized with plotting packages like ggplot2.
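As an illustration of the downstream handling mentioned above, the following base R sketch uses a small stand-in data frame with the same layout as the column-oriented output. The object combined and its values are hypothetical; a real result would come from combineSpectra(). It extracts one spectrum with $ and reshapes the table into the long format that plotting packages such as ggplot2 typically expect.

```r
# Hypothetical stand-in for a column-oriented result
combined <- data.frame(
    Wavelength = c(1100, 1102, 1104),
    Spec_1 = c(0.302, 0.301, 0.301),
    Spec_2 = c(0.303, 0.303, 0.302)
)

# Extract a single spectrum with $
first_spectrum <- combined$Spec_1

# Reshape to long format: one row per (wavelength, spectrum) pair
spec_cols <- setdiff(names(combined), "Wavelength")
long <- data.frame(
    Wavelength = rep(combined$Wavelength, times = length(spec_cols)),
    Spectrum   = rep(spec_cols, each = nrow(combined)),
    Intensity  = unlist(combined[spec_cols], use.names = FALSE)
)
```

With tidyr installed, the same reshaping is usually done with pivot_longer(); base R is used here only to keep the sketch dependency-free.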

Practical examples

In the examples below, each raw spectrum file contains two columns: one for the x-axis (e.g., energy, wavelength or wavenumber) and one for the y-axis (e.g., counts, reflectance or absorbance). We generate six near-infrared (NIR) sample spectra from the NIRsoil dataset included in the prospectr package. The code below randomly selects six spectra from the dataset and saves each as a CSV file in a temporary folder on the Desktop, with named columns for wavelength and reflectance:
# ---- Install & load package if needed ----
# install.packages("prospectr")
library(prospectr)

# ---- Load dataset ----
data(NIRsoil)

# Spectral matrix and wavelength axis
spectra_matrix <- NIRsoil$spc
wavelength <- as.numeric(colnames(spectra_matrix))  # wavelength in nm

# ---- Create TEMP folder on Desktop ----
desktop_path <- file.path(path.expand("~"), "Desktop")
temp_dir <- file.path(desktop_path, "TEMP")

if (!dir.exists(temp_dir)) {
    dir.create(temp_dir, recursive = TRUE)
}

# ---- Randomly select 6 spectra ----
set.seed(260329)  # for reproducibility
n_available <- nrow(spectra_matrix)
if (n_available < 6) stop("Dataset contains fewer than 6 spectra.")
selected_rows <- sample(seq_len(n_available), 6)

# ---- Save each spectrum as a CSV file ----
for (i in seq_along(selected_rows)) {
    reflectance <- spectra_matrix[selected_rows[i], ]
    output_df <- data.frame(
        Wavelength = wavelength,
        Reflectance = as.numeric(reflectance)
    )
    file_path <- file.path(temp_dir, paste0("NIRsoil_spectrum_", i, ".csv"))
    write.csv(output_df, file = file_path, row.names = FALSE, quote = FALSE)
}

cat("6 random NIR spectra saved in:", temp_dir, "\n")

The minimal working example of the combineSpectra() function, using only default values, does not require any explicit arguments, provided that the working directory is set to the temporary Desktop folder containing the spectra files:
# ---- Install & load package if needed ----
# install.packages("spectrakit")
library(spectrakit)

setwd(temp_dir)

combineSpectra()

The function returns a tibble with 700 rows and 7 columns, where each row corresponds to a single wavelength. The first column contains wavelength values starting at 1100, while the remaining six columns contain reflectance measurements at each wavelength. Essentially, this tibble consolidates the six spectra into a single tidy dataset, making it easy to compare measurements across samples, plot the spectra or perform further analyses. Only the first 10 rows are displayed by default, with the rest accessible by printing the full tibble:
# A tibble: 700 × 7                                                                                      
Wavelength Spec_1 Spec_2 Spec_3 Spec_4 Spec_5 Spec_6
     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1    1100  0.302  0.303  0.277  0.306  0.285  0.365
2    1102  0.301  0.303  0.276  0.305  0.285  0.364
3    1104  0.301  0.302  0.276  0.305  0.284  0.364
4    1106  0.300  0.302  0.276  0.305  0.284  0.364
5    1108  0.300  0.302  0.275  0.304  0.283  0.363
6    1110  0.299  0.301  0.275  0.304  0.283  0.363
7    1112  0.299  0.301  0.274  0.303  0.283  0.363
8    1114  0.298  0.300  0.274  0.303  0.282  0.362
9    1116  0.298  0.300  0.274  0.303  0.282  0.362
10   1118  0.297  0.300  0.273  0.302  0.282  0.362
# ℹ 690 more rows
# ℹ Use `print(n = ...)` to see more rows

The function is not limited to two-column files: the common_col_pos and data_col_pos arguments also accept vectors of multiple column positions, for both common and specific variables, making the function suitable for merging more complex datasets.
If the spectra were saved not as comma-separated CSV files in the working directory but as tab-separated TXT files in a text_version subfolder, we should specify the following arguments:
combineSpectra(folder = "./text_version",
    file_type = "txt",
    sep = "\t")

In most cases, the function can also read .dat and .dpt files, provided they are in a compatible text format (e.g., tab-delimited).
We may also want to focus on a narrower spectral range and/or normalize the intensity values to make them directly comparable. The range argument restricts the data to a specific region of the common column, while the normalization argument scales the values in the remaining columns. Importantly, unless norm_scope is set to "range", selecting a spectral range does not affect the normalization step: by default, normalization is applied to the full measured spectra. This is a common approach in spectroscopy, as it avoids the misleading intensity scaling that could occur if normalization were applied only within the subset defined by range:
combineSpectra(range = c(1500, 2200), normalization = "simple")

# A tibble: 351 × 7                                                                                      
Wavelength Spec_1 Spec_2 Spec_3 Spec_4 Spec_5 Spec_6
     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1   1500  0.619  0.801  0.821  0.795  0.767  0.889
 2   1502  0.617  0.801  0.821  0.794  0.766  0.888
 3   1504  0.615  0.800  0.820  0.793  0.765  0.887
 4   1506  0.614  0.799  0.819  0.792  0.764  0.887
 5   1508  0.612  0.799  0.819  0.791  0.763  0.886
 6   1510  0.611  0.798  0.818  0.790  0.762  0.885
 7   1512  0.609  0.797  0.817  0.790  0.761  0.884
 8   1514  0.607  0.796  0.816  0.789  0.760  0.884
 9   1516  0.606  0.796  0.816  0.788  0.760  0.883
10   1518  0.604  0.795  0.815  0.787  0.759  0.882
# ℹ 341 more rows
# ℹ Use `print(n = ...)` to see more rows
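The difference between the two norm_scope settings is essentially one of ordering: normalize then filter, or filter then normalize. The base R sketch below shows this on a toy spectrum whose maximum lies outside the selected range. The values and object names are hypothetical, and the code only mirrors the behaviour described above rather than reproducing the package's implementation.

```r
x <- seq(1000, 2000, by = 250)   # toy axis: 1000, 1250, 1500, 1750, 2000
y <- c(10, 8, 4, 2, 1)           # toy intensities; the maximum is outside the range
keep <- x >= 1500 & x <= 2000    # rows retained by the range filter

# "global"-style: normalize on the full spectrum, then filter
global_scaled <- (y / max(y))[keep]       # 0.4 0.2 0.1

# "range"-style: filter first, then normalize within the subset
range_scaled <- y[keep] / max(y[keep])    # 1.00 0.50 0.25
```

With the global scope, the values inside the range stay proportional to the strongest feature of the whole spectrum, which is why none of the normalized values in the output above reaches 1.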

Finally, we may want to transpose the dataset so that each spectrum is represented as a row rather than a column. In that case, the orientation argument should be set to "rows":
combineSpectra(orientation = "rows")

# A tibble: 6 × 701                                                                                      
  Spectrum `1100` `1102` `1104` `1106` `1108` `1110` `1112` `1114` `1116` `1118` `1120` `1122` `1124`
  <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Spec_1    0.302  0.301  0.301  0.300  0.300  0.299  0.299  0.298  0.298  0.297  0.297  0.296  0.296
2 Spec_2    0.303  0.303  0.302  0.302  0.302  0.301  0.301  0.300  0.300  0.300  0.299  0.299  0.298
3 Spec_3    0.277  0.276  0.276  0.276  0.275  0.275  0.274  0.274  0.274  0.273  0.273  0.273  0.272
4 Spec_4    0.306  0.305  0.305  0.305  0.304  0.304  0.303  0.303  0.303  0.302  0.302  0.302  0.301
5 Spec_5    0.285  0.285  0.284  0.284  0.283  0.283  0.283  0.282  0.282  0.282  0.281  0.281  0.281
6 Spec_6    0.365  0.364  0.364  0.364  0.363  0.363  0.363  0.362  0.362  0.362  0.361  0.361  0.360
# ℹ 687 more variables
# ℹ Use `colnames()` to see all variable names

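A sum or mean spectrum can then be computed from the intensity values, excluding the first (non-intensity) column. With orientation = "columns", use rowSums() or rowMeans() across the spectrum columns; with orientation = "rows", use colSums() or colMeans() across the wavelength columns.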
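As a quick sketch of this last point, the code below computes a mean spectrum for both layouts using small stand-in data frames. The objects cols and rows and their values are hypothetical and stand for outputs obtained with orientation = "columns" and orientation = "rows", respectively. In both cases the first column is dropped before averaging, because it holds the wavelength values or the spectrum labels rather than intensities.

```r
# Column-oriented stand-in: one row per wavelength, one column per spectrum
cols <- data.frame(
    Wavelength = c(1100, 1102, 1104),
    Spec_1 = c(0.2, 0.4, 0.6),
    Spec_2 = c(0.4, 0.6, 0.8)
)
mean_from_cols <- rowMeans(cols[-1])   # average across spectra at each wavelength

# Row-oriented stand-in: one row per spectrum, one column per wavelength
rows <- data.frame(
    Spectrum = c("Spec_1", "Spec_2"),
    `1100` = c(0.2, 0.4),
    `1102` = c(0.4, 0.6),
    `1104` = c(0.6, 0.8),
    check.names = FALSE
)
mean_from_rows <- colMeans(rows[-1])   # same averages, named by wavelength
```

Both calls give the same mean spectrum (0.3, 0.5, 0.7); replacing rowMeans() and colMeans() with rowSums() and colSums() gives the sum spectrum instead.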

Quick recap

The combineSpectra() function from the spectrakit R package merges multiple spectra files into a single, tidy dataset, aligning them by a shared variable and optionally applying range filtering and normalization. Key arguments include folder (file location), file_type and sep (input format), range (spectral region of interest), normalization (scaling method), and orientation (rows or columns). It’s typically used when preparing spectra for multivariate analysis, machine learning workflows or downstream visualization with a plotting system like ggplot2, automating what would otherwise be repetitive file handling. With this function, you can quickly combine, align and normalize spectra with minimal code, producing analysis-ready datasets in one step.


About the author
Gianluca Pastorelli is a Heritage Scientist (Senior Researcher) working at the National Gallery of Denmark (SMK).
