somhca Package – Part 2: Performing Hierarchical Cluster Analysis in R
Overview
The three functions from the somhca R package presented in this post
provide a complete workflow for hierarchical cluster analysis (HCA), from grouping observations after SOM training or other types of dimensionality reduction and pattern extraction, to retrieving and exploring the results. Together, these functions help transform complex, high-dimensional datasets into interpretable groups that can be validated, visualized and used for further analysis. Typical problems these functions help solve include:
- Simplifying interpretation of SOM results;
- Identifying natural groupings in complex datasets;
- Comparing clustering strategies (e.g., SOM-based vs PCA-based clustering);
- Detecting outliers or unusual observations;
- Assigning cluster labels for visualization or statistical analysis;
- Preparing grouped datasets for downstream machine learning or reporting.
clusterSOM()
TheclusterSOM() function groups similar nodes of a previously
trained SOM using hierarchical clustering. In general, clustering methods face a similar issue to choosing the optimal SOM size: the number of clusters must be set in advance and may need to be adjusted through trial and error. Several approaches exist to address this; for example, k-means clustering can be run over a range of cluster numbers and the resulting within-cluster sum of squares (WCSS) can be plotted to identify an “elbow” point as a heuristic for the optimal choice. On the other hand, HCA builds a hierarchy of clusters, and the Kelley–Gardner–Sutcliffe (KGS) penalty function [1] can be used to automatically estimate a reasonable number of clusters by selecting the lowest penalty.Syntax
clusterSOM(
model,
n_clusters = NULL,
validity_indices = TRUE,
plot_result = TRUE,
input = NULL
)
Arguments
| Argument name | Type | Description | Default value (if provided, argument is optional) |
|---|---|---|---|
| model | In-memory SOM model object | Trained SOM model | Argument is required |
| n_clusters | Integer | If provided, specifies the number of clusters to cut the SOM dendrogram into | NULL |
| validity_indices | Logical | Indicates whether to compute and print popular clustering validity indices | TRUE |
| plot_result | Logical | Indicates whether to plot the clustering result | TRUE |
| input | Character or in-memory dataset | String specifying the path to a CSV file, or an in-memory object (data frame or matrix) | NULL |
Return value
The function invisibly returns
NULL. If plot_result = TRUE, a
visualization of the clustered SOM units is produced on the SOM grid. If
input is provided, the function appends cluster assignments to
the supplied dataset and stores the resulting dataset in the package
environment as DataAndClusters, which can be retrieved using
getClusterData().
Practical examples
The minimal working example of the
clusterSOM() function
requires specifying only the model argument, while all
other arguments use their default values. In this example, the function
is applied to the trained SOM model mySOM created in Part 1:
clusterSOM(mySOM)
This is what we obtain:
5 clusters were determined using KGS.
Cluster validity indices:
Average Silhouette Width: 0.430 (higher is better; >0.25 reasonable, >0.5 strong)
Dunn Index: 0.069 (higher is better)
Calinski-Harabasz Index: 174.6 (higher is better)
Pearson Gamma Coefficient: 0.372 (closer to 1 is better)
clusterSOM(mySOM, 6)
clusterSOM(mySOM, 4)
6 clusters were specified by the user.
Cluster validity indices:
Average Silhouette Width: 0.400 (higher is better; >0.25 reasonable, >0.5 strong)
Dunn Index: 0.111 (higher is better)
Calinski-Harabasz Index: 192.2 (higher is better)
Pearson Gamma Coefficient: 0.373 (closer to 1 is better)
4 clusters were specified by the user.
Cluster validity indices:
Average Silhouette Width: 0.502 (higher is better; >0.25 reasonable, >0.5 strong)
Dunn Index: 0.122 (higher is better)
Calinski-Harabasz Index: 170.9 (higher is better)
Pearson Gamma Coefficient: 0.486 (closer to 1 is better)
For interpretation purposes, we may want to append these assigned clusters to the
observations in the original dataset. In that case, we need to set the
input argument to the allSpectra data frame:clusterSOM(mySOM, 4, input = allSpectra)
The resulting "clustered" dataset is stored in the package environment
as
DataAndClusters, which can be retrieved using the next
function.
getClusterData()
ThegetClusterData() function is used to access the dataset
with cluster assignments stored by clusterSOM(). The
function takes no arguments and can be used directly.
Return value
The function returns a data frame, which is typically used for
downstream analyses, visualization or export.
Practical examples
The only way to use the
getClusterData() function is to
call it directly and store the result in a new object, which in this
example we call allSpectraClust:
allSpectraClust <- getClusterData()
We can use the base R function
str() to get a quick sense
of the size, column names and values of this new dataset:
str(allSpectraClust)
'data.frame': 100 obs. of 702 variables:
$ Cluster : Factor w/ 4 levels "1","2","3","4": 2 2 2 1 1 1 1 1 4 1 ...
$ Spectrum: chr "NIRsoil_spectrum_01_Reflectance" "NIRsoil_spectrum_02_Reflectance" ...
$ 1100 : num 0.478 0.403 0.393 0.299 0.337 ...
$ 1102 : num 0.478 0.402 0.393 0.299 0.337 ...
$ 1104 : num 0.477 0.402 0.393 0.298 0.336 ...
[list output truncated]
Compared to the original
allSpectra dataset, there is now
an additional column named Cluster, which contains the
cluster assignment for each observation.
One way to use this new data frame is to examine how the 100 spectra are
grouped by cluster in principal component (PC) space. For this, we use the
plotPCA() function from the spectrakit package (see
spectrakit Package – Part 3). We don't need to exclude the Cluster column from the
analysis, since it is stored as a factor and will therefore not be used
in the principal component calculation:
res <- plotPCA(
data = allSpectraClust,
color_var = "Cluster",
shape_var = "Cluster",
ellipses = TRUE,
ellipse_var = "Cluster",
display_names = TRUE,
legend_title = "SOMHCA\ncluster",
return_pca = TRUE
)
# Show the score plot object
res$plot
Too few points to calculate an ellipse
clusterX()
The somhca package also includes aclusterX() function,
which works in the same way as clusterSOM() and uses the same
syntax and arguments, with the only difference that the
model argument is replaced by x. This means that,
instead of a trained SOM model, the function accepts a data frame or matrix containing numeric data (while non-numeric columns are automatically ignored),
but the purpose remains the same: to group similar observations using
hierarchical clustering, optionally appending cluster assignments to a
supplied dataset and storing the resulting dataset in the package
environment as DataAndClusters.
This function is especially useful when:
- Performing clustering without training a SOM first;
- Working with transformed datasets, e.g., PCA scores or latent variables;
- Comparing SOM-based clustering with direct clustering approaches;
- Benchmarking different dimensionality reduction strategies.
Practical examples
We could potentially use the
clusterX() function on the raw
allSpectra data, but it is more interesting to apply it to
the PCA scores obtained from the plotPCA() function. This
allows us to compare clustering results based on two different
dimensionality reduction approaches: SOM and PCA. We will stick with the
four manually specified clusters to maintain consistency with the previous
results:
# Extract PCA scores for cluster analysis
scores <- res$pca$x
clusterX(scores, 4, input = allSpectra)
4 clusters were specified by the user.
Cluster validity indices:
Average Silhouette Width: 0.532 (higher is better; >0.25 reasonable, >0.5 strong)
Dunn Index: 0.062 (higher is better)
Calinski-Harabasz Index: 247.8 (higher is better)
Pearson Gamma Coefficient: 0.521 (closer to 1 is better)
The clustered dataset is stored in the package environment as 'DataAndClusters'.
Use `getClusterData()` to retrieve it.
We can observe a general improvement in cluster validity indices and,
although the dendrogram is a bit cluttered, the cluster sizes are still
comparable to the HCA results from the trained SOM; spectrum 9 still
stands out as a single-member cluster.
We are now ready to generate the
PCA score plot using the clusters identified by the
clusterX() function:
allSpectraClust2 <- getClusterData()
plotPCA(
data = allSpectraClust2,
color_var = "Cluster",
shape_var = "Cluster",
ellipses = TRUE,
ellipse_var = "Cluster",
display_names = TRUE,
legend_title = "PCA\ncluster"
)
Even though the cluster numbers don't align with the shapes and
colors from the previous analysis due to arbitrary labeling between methods (label switching), we can still see that the groups appear in
almost exactly the same spots on the PCA score plot. This provides a qualitative comparison between two different dimensionality reduction approaches and suggests that a similar clustering structure is recovered across both SOM-based and PCA-based workflows, offering some evidence of robustness in the grouping.
Quick recap
In summary,clusterSOM() clusters the nodes of a trained SOM using hierarchical clustering, getClusterData() retrieves datasets containing cluster assignments generated during clustering, and clusterX() performs hierarchical clustering directly on matrices or data frames without requiring a SOM. A typical workflow begins by clustering SOM units, followed by retrieving observation-level cluster assignments, visualizing the resulting grouped dataset, and optionally comparing the results with clustering performed on PCA scores or other transformed data. Overall, these functions from the somhca R package provide a practical and flexible workflow for clustering, validating, retrieving and interpreting grouped observations from SOMs, PCA scores and other multivariate datasets.References
[1] L. A. Kelley, S. P. Gardner, M. J. Sutcliffe, Protein Eng., des. Sel. 1996, 9, 1063.
← Previous post: somhca Package – Part 1: Training and Visualizing Self-Organizing Maps in R
About the author
Gianluca Pastorelli is a Heritage Scientist (Senior Researcher) working at the National Gallery of Denmark (SMK).





Comments
Post a Comment