somhca Package – Part 1: Training and Visualizing Self-Organizing Maps in R
Overview
The four functions from the somhca R package presented in this post provide a
complete workflow for preparing data, selecting an appropriate self-organizing
map (SOM) configuration, training the SOM, and visualizing the resulting
patterns. Together, they simplify the process of applying SOM-based
exploratory analysis to high-dimensional numeric datasets such as spectra,
sensor measurements, or other multivariate observations. These functions are
particularly useful when working with complex datasets where pattern
exploration, clustering, dimensionality reduction, and visualization are
important goals. By automating tasks such as data preprocessing, SOM grid
optimization, model training, and graphical interpretation, this workflow
helps users build robust and reproducible SOM analyses with minimal manual
tuning.
loadMatrix()
TheloadMatrix() function is designed to simplify
the process of preparing data for analysis. It can import data either from a CSV
file or directly from an in-memory object such as a data frame or matrix. During
loading, you can optionally remove unwanted columns and apply normalization
methods to scale the data before it is converted into a matrix format. To
work correctly, the input dataset should follow a
tidy data structure: each row should represent an observation (for
example, a sample), each column should represent a variable or feature, and each
cell should contain a single numeric value. Non-numeric columns, column headers
or row identifiers can be excluded from the analysis as needed.Syntax
loadMatrix(
input,
remove_columns = NULL,
scaling = "no"
)
Arguments
| Argument name | Type | Description | Default value (if provided, argument is optional) |
|---|---|---|---|
| input | Character or in-memory dataset | String specifying the path to a CSV file, or an in-memory object (data frame or matrix) | Argument is required |
| remove_columns | Integer, character, or vector of either | Specifies the column positions or column names in the dataset that must be excluded from the analysis | NULL |
| scaling | Character |
Normalization method to apply to the values in each column of the
dataset. One of:
|
"no" |
Return value
The function returns a numeric matrix containing the processed dataset after
any specified column removal and normalization steps have been applied. The
resulting matrix can be stored in a variable and passed directly to the next
functions in the SOMHCA workflow.
Practical examples
The SOMHCA method can be applied to any dataset containing p variables measured across n observations. In this example, we create a spectral dataset in which each row corresponds to a measured spectrum, each column corresponds to a wavelength, and each cell contains a numeric value representing the recorded reflectance at that wavelength. We start by generating 100 near-infrared (NIR) sample spectra using
the
NIRsoil dataset included in the prospectr package.
The code below randomly selects 100 spectra from the dataset and saves each
as a CSV file in a temporary folder on the Desktop, with named columns for
wavelength and reflectance:
# ---- Install & load package if needed ----
# install.packages("prospectr")
library(prospectr)
# ---- Load dataset ----
data(NIRsoil)
# Spectral matrix and wavelength axis
spectra_matrix <- NIRsoil$spc
wavelength <- as.numeric(colnames(spectra_matrix)) # wavelength in nm
# ---- Create TEMP folder on Desktop ----
desktop_path <- file.path(path.expand("~"), "Desktop")
temp_dir <- file.path(desktop_path, "TEMP")
if (!dir.exists(temp_dir)) {
dir.create(temp_dir, recursive = TRUE)
}
# ---- Randomly select 100 spectra ----
set.seed(260329) # for reproducibility
n_available <- nrow(spectra_matrix)
if (n_available < 100) stop("Dataset contains fewer than 100 spectra.")
selected_rows <- sample(seq_len(n_available), 100)
# ---- Save each spectrum as a CSV file ----
for (i in seq_along(selected_rows)) {
reflectance <- spectra_matrix[selected_rows[i], ]
output_df <- data.frame(
Wavelength = wavelength,
Reflectance = as.numeric(reflectance)
)
file_path <- file.path(
temp_dir,
paste0("NIRsoil_spectrum_", i, ".csv")
)
write.csv(
output_df,
file = file_path,
row.names = FALSE,
quote = FALSE
)
}
cat("100 random NIR spectra saved in:", temp_dir, "\n")
We can then use the
combineSpectra() function from the
spectrakit package (see
spectrakit Package – Part 2) to merge the 100 spectra into a single dataset (100 rows × 701
columns), making it suitable for the loadMatrix() function.
The input argument of loadMatrix() expects
either a path to a CSV file or an object already loaded in memory; in this
case, we will store the merged spectra in an object called
allSpectra:
# ---- Install & load package if needed ----
# install.packages("spectrakit")
library(spectrakit)
setwd(temp_dir)
allSpectra <- combineSpectra(orientation = "rows")
The minimal working example of the
loadMatrix() function
requires specifying only the input argument, while all
other arguments use their default values:
# ---- Install & load package if needed ----
# install.packages("somhca")
library(somhca)
loadMatrix(allSpectra)
However, this produces an error:
Error in loadMatrix(allSpectra) :
All columns must be numeric after column removal.
This happens because the
allSpectra dataset contains a
first column (Spectrum) with character values, while
loadMatrix() requires the resulting matrix to contain
only numeric data. Therefore, we exclude the first column using the
remove_columns argument. In addition, since
self-organizing maps and hierarchical clustering are distance-based
methods, they generally perform better when the data is normalized;
for this reason, we use the scaling argument to
normalize the dataset. In this case, we use standardization (z-score scaling), which is often useful when variables differ in scale; however, scaling choices can strongly influence results, and the most appropriate preprocessing strategy depends on the application and data type. We will store the resulting matrix in an
object called myMatrix:myMatrix <- loadMatrix(allSpectra, remove_columns = 1, scaling = "zScore")
We are now ready to analyze
myMatrix (100 rows × 700
columns) using SOM. It is also advisable to set a global seed using
set.seed() before the next steps to ensure that the
entire analysis pipeline is reproducible. We will not do that in these
examples because a seed has already been set for the random selection
of the spectra, and we will continue using the random number generator
(RNG) state resulting from that call (although setting a new seed before each stochastic stage is often better practice). Also, for datasets that are not particularly time-consuming to analyze, it may be advisable to assess the stability of the process by repeating the SOMHCA pipeline multiple times using different random seeds.optimalSOM()
Before training a self-organizing map (SOM), it is necessary to
choose an appropriate grid size, which can often be a challenging
task. The
optimalSOM() function automates this
process by training multiple maps and comparing their quality to
identify a suitable square SOM grid size for a given data
matrix.
The process starts with a small 2 × 2 grid and progressively
increases the grid dimensions up to a maximum size estimated using
a heuristic approach. For each candidate grid,
optimalSOM()
computes four standard SOM quality metrics:
- Quantization error (Qe) – measures map resolution
- Topographic error (Te) – measures topology preservation
- Kaski–Lagus error (KLe) – combines quantization and topology preservation quality
- Explained variance (%ev) – measures how well the SOM represents the original dataset
For each grid configuration, these metrics are normalized and
combined in different ways, including a single quality index (QI)
that aggregates all of them. The optimal grid size is then defined
as the configuration that minimizes the three error-based measures
while maximizing explained variance and QI, providing a balanced
trade-off between model accuracy and map complexity.
Syntax
optimalSOM(
data,
method = "A",
increments = 1,
iterations = NULL
)
Arguments
| Argument name | Type | Description | Default value (if provided, argument is optional) |
|---|---|---|---|
| data | Matrix containing numeric data | Preprocessed data matrix containing the input data for SOM training | Argument is required |
| method | Character or integer |
Method for estimating the maximum grid dimension. One
of:
|
"A" |
| increments | Integer | Step size for increasing grid dimensions | 1 |
| iterations | Integer |
Number of iterations for SOM training
|
NULL |
Return value
The function returns a data frame summarizing the optimal square
SOM grid dimension along with a suggested number of training
iterations for each evaluated quality measure. The output is
typically used as a decision-support table when configuring a SOM,
helping compare different grid sizes and choose an appropriate one based on model quality. Because the result is a
standard data frame, it can be printed directly for inspection or
stored in an object for later use.
Practical examples
The minimal working example of the
optimalSOM() function requires specifying only the
data argument, while all other arguments use their
default values:
optimalSOM(myMatrix)
This is what we obtain:
| | 0%
Iterations automatically set to 50.
|======================================================================= | 83%
QE plateau reached at grid 7x7 (delta QE < 0.2%). Stopping grid expansion.
|=====================================================================================| 100%
Quality measure Value Associated grid dimension Suggested iterations
1 Min nQe+nTe -0.8580244 7 500
2 Min nKLe -1.1572084 6 500
3 Min nQe+nTe+nKLe -1.8664009 7 500
4 Max n%ev 0.7957414 3 500
5 Max QI 2.3119845 7 500
The QE plateau at grid 7 × 7 (ÎQE < 0.2%) indicates that the
quantization error has stabilized at this grid size, and further
increases in size no longer produce meaningful improvements, so
grid expansion is stopped. This suggests that 7 × 7 is a
sufficiently detailed map for the data, since additional nodes
would add complexity without improving model quality (however, using the default method for estimating the maximum grid dimension, 7 × 7 already corresponds to the largest grid size, so no further increases would have been considered in any case).
However, there is a discrepancy between error-based measures and
explained variance in terms of the suggested optimal grid
dimension (6-7 vs. 3), so it can be useful to adjust training
parameters such as the number of iterations (also known as
rlen) to improve consistency. If the iterations argument is left as
NULL, the function
automatically determines a reasonable value based on dataset size
(50 in this case), but this can be overridden by explicitly
setting the argument. For example, setting it to around 400
iterations is safe here and helps ensure stable training while
keeping computation time reasonable. On the other hand, for very
large datasets (unlike this case), reducing the number of
iterations can help prevent long runtimes or errors. Similarly,
while the default grid expansion step defined by the
increments argument is 1 (allowing a fine-grained
search, which is ideal in this example), larger step sizes such as
2 or 5 can be used in large-scale applications to speed up the
search for optimal grid dimensions, although this reduces
resolution and increases the risk of skipping the best
configuration.
So, we can go ahead and set the number of iterations to 400:
optimalSOM(myMatrix, iterations = 400)
The output now looks like this:
| | 0%
Using user-specified iterations: 400.
|=====================================================================================| 100%
Quality measure Value Associated grid dimension Suggested iterations
1 Min nQe+nTe -0.7877985 3 500
2 Min nKLe -1.5996196 7 500
3 Min nQe+nTe+nKLe -2.2761013 7 500
4 Max n%ev 0.6090651 7 500
5 Max QI 2.8851664 7 500
The suggested values are now more consistent, and it is reasonable
to conclude that a 7 × 7 grid size is the best choice for training
the SOM using the next function.
finalSOM()
Now, using the
finalSOM() function, we can retrain
the SOM with the recommended grid size and number of iterations
identified earlier by the
optimalSOM() function.Syntax
finalSOM(
data,
dimension,
iterations,
chunk = 100
)
Arguments
| Argument name | Type | Description | Default value (if provided, argument is optional) |
|---|---|---|---|
| data | Matrix containing numeric data | Preprocessed data matrix containing the input data for SOM training | Argument is required |
| dimension | Integer | Dimension of the square SOM grid (for example, 5 results in a 5 × 5 grid) | Argument is required |
| iterations | Integer | Number of iterations for SOM training | Argument is required |
| chunk | Integer |
Number of iterations per training block
|
100 |
Return value
The function returns the trained SOM model as an S3 object,
typically containing the codebook vectors (also called node weight vectors, prototype weights, or codes) and
training history. This object serves as the primary result of the
modeling process and is used as the input for downstream analysis.
It can be printed to inspect a summary of the model, plotted to
visualize the map structure, or queried using the
$ operator to extract components such as node
assignments, weights, or cluster labels. In practice, the returned
object is commonly stored for later reuse and passed to subsequent
functions for visualization and evaluation of the learned SOM
structure or for clustering refinement.Practical examples
Let’s retrain the model using the values suggested by the
optimalSOM() function. In the
finalSOM() function, most arguments are required,
so we will provide the myMatrix dataset to the
data argument, set the grid dimension to 7, and use
500 iterations to train the SOM model. Larger grids typically
require substantially more iterations to ensure proper
convergence. Although using more than 500 iterations could, in
theory, further improve training quality for the 7 × 7 grid used
in this example, increasing the number of iterations too far may
trigger a warning suggesting that the value should be reduced.
On the other hand, reducing the number of iterations can speed
up training, but at the risk of insufficient SOM convergence.
We will keep the
chunk argument at its default
value of 100, so the SOM is trained in blocks of this many
iterations, with a progress message printed after each block.
This helps reassure the user that the function is actively
running and is not frozen. Larger chunk sizes reduce the
frequency of messages, while smaller values provide more
frequent updates but may slightly slow execution.
We will store the resulting model in an object called
mySOM:
mySOM <- finalSOM(myMatrix, 7, 500)
Training SOM: 7x7 grid, 500 iterations
Completed 100 / 500 iterations...
Completed 200 / 500 iterations...
Completed 300 / 500 iterations...
Completed 400 / 500 iterations...
Completed 500 / 500 iterations...
SOM training complete.
We now have a trained SOM model that can be inspected to explore
its underlying structure.
generatePlot()
ThegeneratePlot() function can
be used to create various types of plots for visualizing and
evaluating the trained SOM model.
Syntax
generatePlot(
model,
plot_type,
data = NULL
)
Arguments
| Argument name | Type | Description | Default value (if provided, argument is optional) |
|---|---|---|---|
| model | In-memory SOM model object | Trained SOM model | Argument is required |
| plot_type | Integer |
Specifies the type of plot to generate. One of:
|
Argument is required |
| data | Matrix containing numeric data | Preprocessed data matrix containing the input data for SOM training | NULL (required only for plot_type = 5) |
Return value
The function produces a graphical output object representing one
or more visualizations of the trained SOM model, depending on the
selected plot type. In most cases, the function returns a plot
object that is automatically displayed when printed or run
interactively. While it is primarily used for visual inspection,
the object can also be assigned to a variable for later re-display
or for saving to a file (e.g., via standard graphics devices such
as
png() or pdf()). Because its main
purpose is visualization rather than data manipulation, it is
generally not intended for downstream computational use or
extraction of components (e.g., via $), but instead
for plotting, storing, or exporting the generated figure(s).
Practical examples
We can now generate different types of plots, but in this example
we will not use
plot_type = 5, which uniquely
requires the original data matrix via the
data argument to generate a series of heatmaps.
Because the matrix contains 700 variables (i.e., individual
wavelengths), this would result in 700 separate plots, one for
each variable, which is not practical here. However, when a
dataset contains a smaller number of variables, this type of plot
can be very useful for inspecting how individual variables are
distributed across the SOM map.
generatePlot(mySOM, 1) generatePlot(mySOM, 2) generatePlot(mySOM, 3) generatePlot(mySOM, 4)
These plots show, clockwise from the top left: the change in
mean distance to the closest unit during training, the number of
spectra mapped to each node, the distribution patterns of the
variables (which resemble spectra because each variable
corresponds to an individual wavelength), and the similarity
relationships between neighboring nodes.
The Node Counts plot is particularly interesting because it
shows that eight of the 49 nodes in the current SOM are empty.
Empty nodes are often normal in exploratory SOM analysis, especially with highly variable or complex data, and can help preserve separation between patterns. However, too many empty nodes may suggest that the map grid is too large relative to the number of observations. In practice, this can be addressed
by reducing the SOM grid size until only a very small number of
nodes remain empty, resulting in a more balanced and
interpretable map. So, we retrain the model using a grid
dimension of 6; because the dataset is relatively small, we can
also increase the number of training iterations to 1000 to
improve convergence, and then generate the updated plots:
mySOM <- finalSOM(myMatrix, 6, 1000)
Training SOM: 6x6 grid, 1000 iterations
Completed 100 / 1000 iterations...
Completed 200 / 1000 iterations...
Completed 300 / 1000 iterations...
Completed 400 / 1000 iterations...
Completed 500 / 1000 iterations...
Completed 600 / 1000 iterations...
Completed 700 / 1000 iterations...
Completed 800 / 1000 iterations...
Completed 900 / 1000 iterations...
Completed 1000 / 1000 iterations...
SOM training complete.
generatePlot(mySOM, 1)
generatePlot(mySOM, 2)
generatePlot(mySOM, 3)
generatePlot(mySOM, 4)
The updated SOM now contains only two empty nodes out of 36,
with a more balanced distribution of observations across the
map. The plots also show that one spectrum may differ considerably
from the others, suggesting the presence of a potentially distinct
sample.
Quick recap
In summary,
loadMatrix() prepares and normalizes
numeric datasets, optimalSOM() identifies a
suitable SOM grid size and training configuration,
finalSOM() trains the final self-organizing map,
and generatePlot() provides visual tools for
interpreting the trained model. A typical workflow begins by
preprocessing the data into a numeric matrix, followed by
selecting an appropriate SOM configuration, training the SOM,
and finally evaluating the results through diagnostic plots.
Overall, these functions from the somhca R package provide an
integrated and practical framework for performing SOM-based
exploratory data analysis and visualization in preparation for
subsequent cluster analysis and interpretation of complex
datasets.
References
[1] J. Vesanto, E. Alhoniemi, IEEE Trans. Neural Netw. 2000, 11, 586.
← Previous post: Introducing the somhca R package
About the author
Gianluca Pastorelli is a Heritage Scientist (Senior Researcher)
working at the National Gallery of Denmark (SMK).


Comments
Post a Comment