somhca Package – Part 1: Training and Visualizing Self-Organizing Maps in R

This is part 1 of a two-part series on the somhca R package. Start here for an introduction.

Overview

The four functions from the somhca R package presented in this post provide a complete workflow for preparing data, selecting an appropriate self-organizing map (SOM) configuration, training the SOM, and visualizing the resulting patterns. Together, they simplify the process of applying SOM-based exploratory analysis to high-dimensional numeric datasets such as spectra, sensor measurements, or other multivariate observations. These functions are particularly useful when working with complex datasets where pattern exploration, clustering, dimensionality reduction, and visualization are important goals. By automating tasks such as data preprocessing, SOM grid optimization, model training, and graphical interpretation, this workflow helps users build robust and reproducible SOM analyses with minimal manual tuning.

loadMatrix()

The loadMatrix() function is designed to simplify the process of preparing data for analysis. It can import data either from a CSV file or directly from an in-memory object such as a data frame or matrix. During loading, you can optionally remove unwanted columns and apply normalization methods to scale the data before it is converted into a matrix format. To work correctly, the input dataset should follow a tidy data structure: each row should represent an observation (for example, a sample), each column should represent a variable or feature, and each cell should contain a single numeric value. Non-numeric columns, column headers or row identifiers can be excluded from the analysis as needed.

Syntax

loadMatrix(
  input,
  remove_columns = NULL,
  scaling = "no"
)

Arguments

Argument name Type Description Default value (if provided, argument is optional)
input Character or in-memory dataset String specifying the path to a CSV file, or an in-memory object (data frame or matrix) Argument is required
remove_columns Integer, character, or vector of either Specifies the column positions or column names in the dataset that must be excluded from the analysis NULL
scaling Character Normalization method to apply to the values in each column of the dataset. One of:
  • "no" (no scaling is applied)
  • "simpleFeature" (divide by the maximum value)
  • "minMax" (scale values to the [0,1] range)
  • "zScore" (standardize values by subtracting the mean and dividing by the standard deviation)
"no"

Return value

The function returns a numeric matrix containing the processed dataset after any specified column removal and normalization steps have been applied. The resulting matrix can be stored in a variable and passed directly to the next functions in the SOMHCA workflow.

Practical examples

The SOMHCA method can be applied to any dataset containing p variables measured across n observations. In this example, we create a spectral dataset in which each row corresponds to a measured spectrum, each column corresponds to a wavelength, and each cell contains a numeric value representing the recorded reflectance at that wavelength. We start by generating 100 near-infrared (NIR) sample spectra using the NIRsoil dataset included in the prospectr package. The code below randomly selects 100 spectra from the dataset and saves each as a CSV file in a temporary folder on the Desktop, with named columns for wavelength and reflectance:
# ---- Install & load package if needed ----
# install.packages("prospectr")
library(prospectr)

# ---- Load dataset ----
data(NIRsoil)

# Spectral matrix and wavelength axis
spectra_matrix <- NIRsoil$spc
wavelength <- as.numeric(colnames(spectra_matrix))  # wavelength in nm

# ---- Create TEMP folder on Desktop ----
desktop_path <- file.path(path.expand("~"), "Desktop")
temp_dir <- file.path(desktop_path, "TEMP")

if (!dir.exists(temp_dir)) {
    dir.create(temp_dir, recursive = TRUE)
}

# ---- Randomly select 100 spectra ----
set.seed(260329)  # for reproducibility
n_available <- nrow(spectra_matrix)
if (n_available < 100) stop("Dataset contains fewer than 100 spectra.")

selected_rows <- sample(seq_len(n_available), 100)

# ---- Save each spectrum as a CSV file ----
for (i in seq_along(selected_rows)) {
    
    reflectance <- spectra_matrix[selected_rows[i], ]
    
    output_df <- data.frame(
        Wavelength = wavelength,
        Reflectance = as.numeric(reflectance)
    )
    
    file_path <- file.path(
        temp_dir,
        paste0("NIRsoil_spectrum_", i, ".csv")
    )
    
    write.csv(
        output_df,
        file = file_path,
        row.names = FALSE,
        quote = FALSE
    )
}

cat("100 random NIR spectra saved in:", temp_dir, "\n")

We can then use the combineSpectra() function from the spectrakit package (see spectrakit Package – Part 2) to merge the 100 spectra into a single dataset (100 rows × 701 columns), making it suitable for the loadMatrix() function. The input argument of loadMatrix() expects either a path to a CSV file or an object already loaded in memory; in this case, we will store the merged spectra in an object called allSpectra:
# ---- Install & load package if needed ----
# install.packages("spectrakit")
library(spectrakit)

setwd(temp_dir)

allSpectra <- combineSpectra(orientation = "rows")

The minimal working example of the loadMatrix() function requires specifying only the input argument, while all other arguments use their default values:
# ---- Install & load package if needed ----
# install.packages("somhca")
library(somhca)
loadMatrix(allSpectra)

However, this produces an error:
Error in loadMatrix(allSpectra) :
    All columns must be numeric after column removal.

This happens because the allSpectra dataset contains a first column (Spectrum) with character values, while loadMatrix() requires the resulting matrix to contain only numeric data. Therefore, we exclude the first column using the remove_columns argument. In addition, since self-organizing maps and hierarchical clustering are distance-based methods, they generally perform better when the data is normalized; for this reason, we use the scaling argument to normalize the dataset. In this case, we use standardization (z-score scaling), which is often useful when variables differ in scale; however, scaling choices can strongly influence results, and the most appropriate preprocessing strategy depends on the application and data type. We will store the resulting matrix in an object called myMatrix:
myMatrix <- loadMatrix(allSpectra, remove_columns = 1, scaling = "zScore")

We are now ready to analyze myMatrix (100 rows × 700 columns) using SOM. It is also advisable to set a global seed using set.seed() before the next steps to ensure that the entire analysis pipeline is reproducible. We will not do that in these examples because a seed has already been set for the random selection of the spectra, and we will continue using the random number generator (RNG) state resulting from that call (although setting a new seed before each stochastic stage is often better practice). Also, for datasets that are not particularly time-consuming to analyze, it may be advisable to assess the stability of the process by repeating the SOMHCA pipeline multiple times using different random seeds.

optimalSOM()

Before training a self-organizing map (SOM), it is necessary to choose an appropriate grid size, which can often be a challenging task. The optimalSOM() function automates this process by training multiple maps and comparing their quality to identify a suitable square SOM grid size for a given data matrix.
The process starts with a small 2 × 2 grid and progressively increases the grid dimensions up to a maximum size estimated using a heuristic approach. For each candidate grid, optimalSOM() computes four standard SOM quality metrics:
  • Quantization error (Qe) – measures map resolution
  • Topographic error (Te) – measures topology preservation
  • Kaski–Lagus error (KLe) – combines quantization and topology preservation quality
  • Explained variance (%ev) – measures how well the SOM represents the original dataset
For each grid configuration, these metrics are normalized and combined in different ways, including a single quality index (QI) that aggregates all of them. The optimal grid size is then defined as the configuration that minimizes the three error-based measures while maximizing explained variance and QI, providing a balanced trade-off between model accuracy and map complexity.

Syntax

optimalSOM(
    data,
    method = "A",
    increments = 1,
    iterations = NULL
)

Arguments

Argument name Type Description Default value (if provided, argument is optional)
data Matrix containing numeric data Preprocessed data matrix containing the input data for SOM training Argument is required
method Character or integer Method for estimating the maximum grid dimension. One of:
  • "A" (calculates the rounded value of √(5 × √n), where n is the number of rows in data) [1]
  • "B" (calculates the rounded value of n raised to the power of 1/2.5, where n is the number of rows in data)
  • Manually specified maximum dimension
"A"
increments Integer Step size for increasing grid dimensions 1
iterations Integer Number of iterations for SOM training
    NULL

    Return value

    The function returns a data frame summarizing the optimal square SOM grid dimension along with a suggested number of training iterations for each evaluated quality measure. The output is typically used as a decision-support table when configuring a SOM, helping compare different grid sizes and choose an appropriate one based on model quality. Because the result is a standard data frame, it can be printed directly for inspection or stored in an object for later use.

    Practical examples

    The minimal working example of the optimalSOM() function requires specifying only the data argument, while all other arguments use their default values:
    optimalSOM(myMatrix)

    This is what we obtain:
      |                                                                                     |   0%
    Iterations automatically set to 50.
      |=======================================================================              |  83%
    QE plateau reached at grid 7x7 (delta QE < 0.2%). Stopping grid expansion.
      |=====================================================================================| 100%
       Quality measure      Value Associated grid dimension Suggested iterations
    1      Min nQe+nTe -0.8580244                         7                  500
    2         Min nKLe -1.1572084                         6                  500
    3 Min nQe+nTe+nKLe -1.8664009                         7                  500
    4         Max n%ev  0.7957414                         3                  500
    5           Max QI  2.3119845                         7                  500

    The QE plateau at grid 7 × 7 (ΔQE < 0.2%) indicates that the quantization error has stabilized at this grid size, and further increases in size no longer produce meaningful improvements, so grid expansion is stopped. This suggests that 7 × 7 is a sufficiently detailed map for the data, since additional nodes would add complexity without improving model quality (however, using the default method for estimating the maximum grid dimension, 7 × 7 already corresponds to the largest grid size, so no further increases would have been considered in any case).
    However, there is a discrepancy between error-based measures and explained variance in terms of the suggested optimal grid dimension (6-7 vs. 3), so it can be useful to adjust training parameters such as the number of iterations (also known as rlen) to improve consistency. If the iterations argument is left as NULL, the function automatically determines a reasonable value based on dataset size (50 in this case), but this can be overridden by explicitly setting the argument. For example, setting it to around 400 iterations is safe here and helps ensure stable training while keeping computation time reasonable. On the other hand, for very large datasets (unlike this case), reducing the number of iterations can help prevent long runtimes or errors. Similarly, while the default grid expansion step defined by the increments argument is 1 (allowing a fine-grained search, which is ideal in this example), larger step sizes such as 2 or 5 can be used in large-scale applications to speed up the search for optimal grid dimensions, although this reduces resolution and increases the risk of skipping the best configuration.
    So, we can go ahead and set the number of iterations to 400:
    optimalSOM(myMatrix, iterations = 400)

    The output now looks like this:
      |                                                                                     |   0%
    Using user-specified iterations: 400.
      |=====================================================================================| 100%
       Quality measure      Value Associated grid dimension Suggested iterations
    1      Min nQe+nTe -0.7877985                         3                  500
    2         Min nKLe -1.5996196                         7                  500
    3 Min nQe+nTe+nKLe -2.2761013                         7                  500
    4         Max n%ev  0.6090651                         7                  500
    5           Max QI  2.8851664                         7                  500

    The suggested values are now more consistent, and it is reasonable to conclude that a 7 × 7 grid size is the best choice for training the SOM using the next function.

    finalSOM()

    Now, using the finalSOM() function, we can retrain the SOM with the recommended grid size and number of iterations identified earlier by the optimalSOM() function.

    Syntax

    finalSOM(
        data,
        dimension,
        iterations,
    chunk = 100 )

    Arguments

    Argument name Type Description Default value (if provided, argument is optional)
    data Matrix containing numeric data Preprocessed data matrix containing the input data for SOM training Argument is required
    dimension Integer Dimension of the square SOM grid (for example, 5 results in a 5 × 5 grid) Argument is required
    iterations Integer Number of iterations for SOM training Argument is required
    chunk Integer Number of iterations per training block
      100

      Return value

      The function returns the trained SOM model as an S3 object, typically containing the codebook vectors (also called node weight vectors, prototype weights, or codes) and training history. This object serves as the primary result of the modeling process and is used as the input for downstream analysis. It can be printed to inspect a summary of the model, plotted to visualize the map structure, or queried using the $ operator to extract components such as node assignments, weights, or cluster labels. In practice, the returned object is commonly stored for later reuse and passed to subsequent functions for visualization and evaluation of the learned SOM structure or for clustering refinement.

      Practical examples

      Let’s retrain the model using the values suggested by the optimalSOM() function. In the finalSOM() function, most arguments are required, so we will provide the myMatrix dataset to the data argument, set the grid dimension to 7, and use 500 iterations to train the SOM model. Larger grids typically require substantially more iterations to ensure proper convergence. Although using more than 500 iterations could, in theory, further improve training quality for the 7 × 7 grid used in this example, increasing the number of iterations too far may trigger a warning suggesting that the value should be reduced. On the other hand, reducing the number of iterations can speed up training, but at the risk of insufficient SOM convergence.
      We will keep the chunk argument at its default value of 100, so the SOM is trained in blocks of this many iterations, with a progress message printed after each block. This helps reassure the user that the function is actively running and is not frozen. Larger chunk sizes reduce the frequency of messages, while smaller values provide more frequent updates but may slightly slow execution.
      We will store the resulting model in an object called mySOM:
      mySOM <- finalSOM(myMatrix, 7, 500)

      Training SOM: 7x7 grid, 500 iterations Completed 100 / 500 iterations... Completed 200 / 500 iterations... Completed 300 / 500 iterations... Completed 400 / 500 iterations... Completed 500 / 500 iterations... SOM training complete.

      We now have a trained SOM model that can be inspected to explore its underlying structure.

      generatePlot()

      The generatePlot() function can be used to create various types of plots for visualizing and evaluating the trained SOM model.

      Syntax

      generatePlot(
          model,
          plot_type,
      data = NULL )

      Arguments

      Argument name Type Description Default value (if provided, argument is optional)
      model In-memory SOM model object Trained SOM model Argument is required
      plot_type Integer Specifies the type of plot to generate. One of:
      1. Training progress plot (changes during training)
      2. Node count plot (number of samples mapped to each node) for assessing map quality
      3. U-matrix plot (similarities between neighboring nodes)
      4. Weight vector plot (patterns in the distributions of variables)
      5. Kohonen heatmaps for all variables in the dataset (distribution of single variables across the map)
      Argument is required
      data Matrix containing numeric data Preprocessed data matrix containing the input data for SOM training NULL (required only for plot_type = 5)

      Return value

      The function produces a graphical output object representing one or more visualizations of the trained SOM model, depending on the selected plot type. In most cases, the function returns a plot object that is automatically displayed when printed or run interactively. While it is primarily used for visual inspection, the object can also be assigned to a variable for later re-display or for saving to a file (e.g., via standard graphics devices such as png() or pdf()). Because its main purpose is visualization rather than data manipulation, it is generally not intended for downstream computational use or extraction of components (e.g., via $), but instead for plotting, storing, or exporting the generated figure(s).

      Practical examples

      We can now generate different types of plots, but in this example we will not use plot_type = 5, which uniquely requires the original data matrix via the data argument to generate a series of heatmaps. Because the matrix contains 700 variables (i.e., individual wavelengths), this would result in 700 separate plots, one for each variable, which is not practical here. However, when a dataset contains a smaller number of variables, this type of plot can be very useful for inspecting how individual variables are distributed across the SOM map.
      generatePlot(mySOM, 1)
      generatePlot(mySOM, 2)
      generatePlot(mySOM, 3)
      generatePlot(mySOM, 4)

      These plots show, clockwise from the top left: the change in mean distance to the closest unit during training, the number of spectra mapped to each node, the distribution patterns of the variables (which resemble spectra because each variable corresponds to an individual wavelength), and the similarity relationships between neighboring nodes.
      The Node Counts plot is particularly interesting because it shows that eight of the 49 nodes in the current SOM are empty. Empty nodes are often normal in exploratory SOM analysis, especially with highly variable or complex data, and can help preserve separation between patterns. However, too many empty nodes may suggest that the map grid is too large relative to the number of observations. In practice, this can be addressed by reducing the SOM grid size until only a very small number of nodes remain empty, resulting in a more balanced and interpretable map. So, we retrain the model using a grid dimension of 6; because the dataset is relatively small, we can also increase the number of training iterations to 1000 to improve convergence, and then generate the updated plots:
      mySOM <- finalSOM(myMatrix, 6, 1000)

      Training SOM: 6x6 grid, 1000 iterations Completed 100 / 1000 iterations... Completed 200 / 1000 iterations... Completed 300 / 1000 iterations... Completed 400 / 1000 iterations... Completed 500 / 1000 iterations... Completed 600 / 1000 iterations... Completed 700 / 1000 iterations... Completed 800 / 1000 iterations... Completed 900 / 1000 iterations... Completed 1000 / 1000 iterations... SOM training complete.


      generatePlot(mySOM, 1) generatePlot(mySOM, 2) generatePlot(mySOM, 3) generatePlot(mySOM, 4)


      The updated SOM now contains only two empty nodes out of 36, with a more balanced distribution of observations across the map. The plots also show that one spectrum may differ considerably from the others, suggesting the presence of a potentially distinct sample.

      Quick recap

      In summary, loadMatrix() prepares and normalizes numeric datasets, optimalSOM() identifies a suitable SOM grid size and training configuration, finalSOM() trains the final self-organizing map, and generatePlot() provides visual tools for interpreting the trained model. A typical workflow begins by preprocessing the data into a numeric matrix, followed by selecting an appropriate SOM configuration, training the SOM, and finally evaluating the results through diagnostic plots. Overall, these functions from the somhca R package provide an integrated and practical framework for performing SOM-based exploratory data analysis and visualization in preparation for subsequent cluster analysis and interpretation of complex datasets.

      References

      [1] J. Vesanto, E. Alhoniemi, IEEE Trans. Neural Netw. 2000, 11, 586.




      About the author
      Gianluca Pastorelli is a Heritage Scientist (Senior Researcher) working at the National Gallery of Denmark (SMK).

      Comments