somhca Package – Part 2: Performing Hierarchical Cluster Analysis in R

This is part 2 of a two-part series on the somhca R package. Start here for an introduction.

Overview

The three functions from the somhca R package presented in this post provide a complete workflow for hierarchical cluster analysis (HCA), from grouping observations after SOM training or other types of dimensionality reduction and pattern extraction, to retrieving and exploring the results. Together, these functions help transform complex, high-dimensional datasets into interpretable groups that can be validated, visualized and used for further analysis. Typical problems these functions help solve include:
  • Simplifying interpretation of SOM results;
  • Identifying natural groupings in complex datasets;
  • Comparing clustering strategies (e.g., SOM-based vs PCA-based clustering);
  • Detecting outliers or unusual observations;
  • Assigning cluster labels for visualization or statistical analysis;
  • Preparing grouped datasets for downstream machine learning or reporting.

clusterSOM()

The clusterSOM() function groups similar nodes of a previously trained SOM using hierarchical clustering. In general, clustering methods face a similar issue to choosing the optimal SOM size: the number of clusters must be set in advance and may need to be adjusted through trial and error. Several approaches exist to address this; for example, k-means clustering can be run over a range of cluster numbers and the resulting within-cluster sum of squares (WCSS) can be plotted to identify an “elbow” point as a heuristic for the optimal choice. On the other hand, HCA builds a hierarchy of clusters, and the Kelley–Gardner–Sutcliffe (KGS) penalty function [1] can be used to automatically estimate a reasonable number of clusters by selecting the lowest penalty.

Syntax

clusterSOM(
  model,
  n_clusters = NULL,
  validity_indices = TRUE,
  plot_result = TRUE,
  input = NULL
)

Arguments

Argument name Type Description Default value (if provided, argument is optional)
model In-memory SOM model object Trained SOM model Argument is required
n_clusters Integer If provided, specifies the number of clusters to cut the SOM dendrogram into NULL
validity_indices Logical Indicates whether to compute and print popular clustering validity indices TRUE
plot_result Logical Indicates whether to plot the clustering result TRUE
input Character or in-memory dataset String specifying the path to a CSV file, or an in-memory object (data frame or matrix) NULL

Return value

The function invisibly returns NULL. If plot_result = TRUE, a visualization of the clustered SOM units is produced on the SOM grid. If input is provided, the function appends cluster assignments to the supplied dataset and stores the resulting dataset in the package environment as DataAndClusters, which can be retrieved using getClusterData().

Practical examples

The minimal working example of the clusterSOM() function requires specifying only the model argument, while all other arguments use their default values. In this example, the function is applied to the trained SOM model mySOM created in Part 1:
clusterSOM(mySOM)

This is what we obtain:
5 clusters were determined using KGS.

Cluster validity indices:
Average Silhouette Width: 0.430 (higher is better; >0.25 reasonable, >0.5 strong)
Dunn Index: 0.069 (higher is better)
Calinski-Harabasz Index: 174.6 (higher is better)
Pearson Gamma Coefficient: 0.372 (closer to 1 is better)


Ideally, the resulting clusters are spatially contiguous on the map surface, but this depends on the underlying data distribution. Since five clusters were automatically determined, we can try increasing or reducing the number of clusters to see whether the validity indices improve:
clusterSOM(mySOM, 6)
clusterSOM(mySOM, 4)

6 clusters were specified by the user.

Cluster validity indices: Average Silhouette Width: 0.400 (higher is better; >0.25 reasonable, >0.5 strong) Dunn Index: 0.111 (higher is better) Calinski-Harabasz Index: 192.2 (higher is better) Pearson Gamma Coefficient: 0.373 (closer to 1 is better)


4 clusters were specified by the user.

Cluster validity indices: Average Silhouette Width: 0.502 (higher is better; >0.25 reasonable, >0.5 strong) Dunn Index: 0.122 (higher is better) Calinski-Harabasz Index: 170.9 (higher is better) Pearson Gamma Coefficient: 0.486 (closer to 1 is better)


We can see that most indices improve when using four clusters, although the Calinski-Harabasz index shows a different trend. This is not uncommon, as different validity indices emphasize different aspects of cluster structure and may favor different solutions. Therefore, the final choice of cluster number is guided by a balance of these measures and interpretability. For this example, we will proceed with four clusters for the rest of the analysis.
For interpretation purposes, we may want to append these assigned clusters to the observations in the original dataset. In that case, we need to set the input argument to the allSpectra data frame:
clusterSOM(mySOM, 4, input = allSpectra)

The resulting "clustered" dataset is stored in the package environment as DataAndClusters, which can be retrieved using the next function.

getClusterData()

The getClusterData() function is used to access the dataset with cluster assignments stored by clusterSOM(). The function takes no arguments and can be used directly.

Return value

The function returns a data frame, which is typically used for downstream analyses, visualization or export.

Practical examples

The only way to use the getClusterData() function is to call it directly and store the result in a new object, which in this example we call allSpectraClust:
allSpectraClust <- getClusterData()

We can use the base R function str() to get a quick sense of the size, column names and values of this new dataset:
str(allSpectraClust)

'data.frame': 100 obs. of 702 variables: $ Cluster : Factor w/ 4 levels "1","2","3","4": 2 2 2 1 1 1 1 1 4 1 ... $ Spectrum: chr "NIRsoil_spectrum_01_Reflectance" "NIRsoil_spectrum_02_Reflectance" ... $ 1100 : num 0.478 0.403 0.393 0.299 0.337 ... $ 1102 : num 0.478 0.402 0.393 0.299 0.337 ... $ 1104 : num 0.477 0.402 0.393 0.298 0.336 ... [list output truncated]

Compared to the original allSpectra dataset, there is now an additional column named Cluster, which contains the cluster assignment for each observation.
One way to use this new data frame is to examine how the 100 spectra are grouped by cluster in principal component (PC) space. For this, we use the plotPCA() function from the spectrakit package (see spectrakit Package – Part 3). We don't need to exclude the Cluster column from the analysis, since it is stored as a factor and will therefore not be used in the principal component calculation:
res <- plotPCA(
        data = allSpectraClust,
        color_var = "Cluster",
        shape_var = "Cluster",
        ellipses = TRUE,
        ellipse_var = "Cluster",
        display_names = TRUE,
        legend_title = "SOMHCA\ncluster",
        return_pca = TRUE
)

# Show the score plot object
res$plot


Too few points to calculate an ellipse


The score plot shows three fairly well-defined clusters of spectra, with cluster 2 partially overlapping clusters 1 and 3. This suggests that the differences between these groups are present but not markedly strong. A message indicates that, in one case, there are too few points to compute an ellipse; this corresponds to observation 9, which lies far from the other spectra and matches the isolated spectrum already identified earlier in the SOMHCA workflow.

clusterX()

The somhca package also includes a clusterX() function, which works in the same way as clusterSOM() and uses the same syntax and arguments, with the only difference that the model argument is replaced by x. This means that, instead of a trained SOM model, the function accepts a data frame or matrix containing numeric data (while non-numeric columns are automatically ignored), but the purpose remains the same: to group similar observations using hierarchical clustering, optionally appending cluster assignments to a supplied dataset and storing the resulting dataset in the package environment as DataAndClusters.
This function is especially useful when:
  • Performing clustering without training a SOM first;
  • Working with transformed datasets, e.g., PCA scores or latent variables;
  • Comparing SOM-based clustering with direct clustering approaches;
  • Benchmarking different dimensionality reduction strategies.

Practical examples

We could potentially use the clusterX() function on the raw allSpectra data, but it is more interesting to apply it to the PCA scores obtained from the plotPCA() function. This allows us to compare clustering results based on two different dimensionality reduction approaches: SOM and PCA. We will stick with the four manually specified clusters to maintain consistency with the previous results:
# Extract PCA scores for cluster analysis
scores <- res$pca$x

clusterX(scores, 4, input = allSpectra)


4 clusters were specified by the user. Cluster validity indices: Average Silhouette Width: 0.532 (higher is better; >0.25 reasonable, >0.5 strong) Dunn Index: 0.062 (higher is better) Calinski-Harabasz Index: 247.8 (higher is better) Pearson Gamma Coefficient: 0.521 (closer to 1 is better)
The clustered dataset is stored in the package environment as 'DataAndClusters'.
Use `getClusterData()` to retrieve it.


We can observe a general improvement in cluster validity indices and, although the dendrogram is a bit cluttered, the cluster sizes are still comparable to the HCA results from the trained SOM; spectrum 9 still stands out as a single-member cluster.
We are now ready to generate the PCA score plot using the clusters identified by the clusterX() function:
allSpectraClust2 <- getClusterData()

plotPCA( data = allSpectraClust2,
color_var = "Cluster", shape_var = "Cluster", ellipses = TRUE, ellipse_var = "Cluster", display_names = TRUE, legend_title = "PCA\ncluster" )


Even though the cluster numbers don't align with the shapes and colors from the previous analysis due to arbitrary labeling between methods (label switching), we can still see that the groups appear in almost exactly the same spots on the PCA score plot. This provides a qualitative comparison between two different dimensionality reduction approaches and suggests that a similar clustering structure is recovered across both SOM-based and PCA-based workflows, offering some evidence of robustness in the grouping.

Quick recap

In summary, clusterSOM() clusters the nodes of a trained SOM using hierarchical clustering, getClusterData() retrieves datasets containing cluster assignments generated during clustering, and clusterX() performs hierarchical clustering directly on matrices or data frames without requiring a SOM. A typical workflow begins by clustering SOM units, followed by retrieving observation-level cluster assignments, visualizing the resulting grouped dataset, and optionally comparing the results with clustering performed on PCA scores or other transformed data. Overall, these functions from the somhca R package provide a practical and flexible workflow for clustering, validating, retrieving and interpreting grouped observations from SOMs, PCA scores and other multivariate datasets.

References

[1] L. A. Kelley, S. P. Gardner, M. J. Sutcliffe, Protein Eng., des. Sel. 1996, 9, 1063.




About the author
Gianluca Pastorelli is a Heritage Scientist (Senior Researcher) working at the National Gallery of Denmark (SMK).

Comments