Introducing the somhca R package

Complex datasets often contain patterns that are difficult to interpret using traditional statistical methods alone. One effective approach is to combine self-organizing maps (SOM) with hierarchical cluster analysis (HCA). Together, these techniques provide a powerful framework for exploring, visualizing and grouping high-dimensional data.

SOM is an unsupervised neural network method that projects high-dimensional data onto a two-dimensional grid while preserving local relationships in the original data. Similar observations are positioned close to one another on the map, so patterns, trends and structures in complex datasets become visually apparent. This makes SOM a useful tool for dimensionality reduction and exploratory data analysis.
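To make the idea of a topology-preserving projection more concrete, here is a minimal from-scratch sketch of online SOM training in base R. It is illustrative only and is not the implementation used by somhca; the built-in iris data, the 5 x 5 rectangular grid, the Gaussian neighbourhood and the linear decay schedules are all assumptions chosen for brevity.

```r
# Minimal online SOM in base R (illustrative sketch, not somhca's implementation)
set.seed(1)

X <- scale(as.matrix(iris[, 1:4]))   # z-score each feature

# 5 x 5 rectangular grid: one row of (x, y) map coordinates per node
xdim <- 5
ydim <- 5
grid <- as.matrix(expand.grid(x = seq_len(xdim), y = seq_len(ydim)))

# Initialise the codebook (one prototype vector per node) from random observations
codes <- X[sample(nrow(X), nrow(grid)), , drop = FALSE]

n_iter <- 3000
for (t in seq_len(n_iter)) {
  progress <- (t - 1) / n_iter
  alpha  <- 0.5 * (1 - progress) + 0.01   # learning rate shrinks over time
  radius <- 2.5 * (1 - progress) + 0.5    # neighbourhood width shrinks over time

  x <- X[sample(nrow(X), 1), ]            # pick one observation at random

  # Best-matching unit (BMU): the node whose prototype is closest to x
  bmu <- which.min(rowSums(sweep(codes, 2, x)^2))

  # Gaussian neighbourhood weights from squared distance on the grid to the BMU
  grid_d2 <- rowSums(sweep(grid, 2, grid[bmu, ])^2)
  h <- exp(-grid_d2 / (2 * radius^2))

  # Pull every prototype towards x, scaled by its neighbourhood weight
  codes <- codes - alpha * h * sweep(codes, 2, x)
}

# Map each observation to its best-matching node
bmu_of <- function(x) which.min(rowSums(sweep(codes, 2, x)^2))
unit_of_obs <- apply(X, 1, bmu_of)

# Number of observations per node, laid out as the 5 x 5 map
counts <- matrix(tabulate(unit_of_obs, nbins = nrow(grid)), nrow = xdim, ncol = ydim)
print(counts)
```

Because neighbouring nodes are updated together, prototypes that sit next to each other on the grid end up similar in feature space, which is what places similar observations close together on the map.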


However, SOM is not a clustering method by itself; it is primarily a topology-preserving mapping technique. When the map has a large number of units (also called nodes), similar observations may still be spread across several neighboring nodes. To address this, HCA can be applied to the SOM nodes: it groups similar nodes together, which makes the resulting map easier to interpret and helps identify cluster structure.


After clustering the SOM nodes, the cluster labels can be passed back to the original observations: each observation takes the label of the node it is mapped to. These grouped results can then be explored through visualizations such as scatter plots, heatmaps or time-series representations.
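The two stages and the label hand-off described above can be sketched with general-purpose tools. The example below does not use somhca's own functions; it assumes the kohonen package is installed for the SOM and uses base R's hclust() for the HCA. The iris data, the 6 x 6 grid, Ward linkage and k = 3 are arbitrary choices for illustration.

```r
library(kohonen)  # assumption: kohonen is installed; used as a generic SOM implementation

set.seed(1)
X <- scale(as.matrix(iris[, 1:4]))   # z-score each feature

# Stage 1: train a 6 x 6 hexagonal SOM
som_model <- som(X,
                 grid = somgrid(xdim = 6, ydim = 6, topo = "hexagonal"),
                 rlen = 200)

# Stage 2: hierarchical clustering of the node prototypes (codebook vectors)
codes <- getCodes(som_model)                   # one row per node
hc <- hclust(dist(codes), method = "ward.D2")
node_cluster <- cutree(hc, k = 3)              # cluster label per node (k = 3 is arbitrary)

# Hand-off: each observation inherits the label of its best-matching node
obs_cluster <- node_cluster[som_model$unit.classif]

# Cross-tabulate against the known species, purely as a sanity check
print(table(cluster = obs_cluster, species = iris$Species))

# Show the node clusters on the map
cluster_cols <- c("lightblue", "lightpink", "lightgreen")
plot(som_model, type = "mapping", bgcol = cluster_cols[node_cluster],
     main = "SOM nodes grouped by HCA")
add.cluster.boundaries(som_model, node_cluster)
```

Clustering the 36 node prototypes rather than the 150 raw observations is the point of the two-stage design: the SOM first summarizes the data, and HCA then groups that summary.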

A simple workflow

This two-stage process (first training a SOM and then clustering the SOM nodes using HCA) is commonly referred to as the SOMHCA approach. A typical SOMHCA workflow includes:
  • Loading and normalizing the data
  • Estimating a suitable SOM grid size
  • Training the SOM model
  • Visualizing and refining the SOM
  • Performing hierarchical clustering on the SOM units
  • Retrieving cluster assignments for the original observations
  • Visualizing and analyzing the grouped data
  • Optionally comparing results with alternative methods such as PCA or k-means clustering
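Two of the steps in this list can be tried in base R alone. The sketch below first applies a commonly cited rule of thumb for the map size (roughly 5 * sqrt(n) nodes for n observations), then builds a PCA and k-means baseline for comparison. This rule of thumb is only a starting point and is not necessarily the criterion used by the package's grid size optimization; the iris data and k = 3 are again assumptions made for illustration.

```r
set.seed(1)
X <- scale(as.matrix(iris[, 1:4]))   # z-score each feature

# Rule-of-thumb map size: about 5 * sqrt(n) nodes, arranged as a square grid
n_units_target <- 5 * sqrt(nrow(X))
grid_side <- ceiling(sqrt(n_units_target))
cat("Suggested square grid side:", grid_side, "\n")

# Baseline 1: PCA on the scaled data
pca <- prcomp(X)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Baseline 2: k-means directly on the scaled data
km <- kmeans(X, centers = 3, nstart = 25)

# Baseline 3: hierarchical clustering directly on the observations (no SOM)
hc_direct <- cutree(hclust(dist(X), method = "ward.D2"), k = 3)

# How far do the two partitions agree?
print(table(kmeans = km$cluster, hca = hc_direct))

# k-means clusters shown in the space of the first two principal components
plot(pca$x[, 1:2], col = km$cluster, pch = 19,
     main = "k-means clusters in PCA space")
```

A cross-tabulation like this is a quick way to see whether different methods recover similar groups before looking at the SOM-based result.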

The somhca R package – seven core functions for SOMHCA analysis

In this tutorial, we will implement the SOMHCA workflow using the somhca package in R. Once somhca is installed and loaded via install.packages("somhca") and library(somhca), the package provides functions for:
  1. Data preparation with loadMatrix()
  2. SOM grid size optimization using optimalSOM()
  3. Training the SOM model with finalSOM()
  4. Visualizing feature distributions using generatePlot()
  5. Clustering SOM nodes with clusterSOM()
  6. Clustering data processed with alternative methods using clusterX()
  7. Retrieving observation-level cluster assignments using getClusterData()
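The arguments of these functions are not listed in this introduction, so rather than guess at them, a reasonable first step is to inspect the installed package. The sketch below relies only on base R tools (requireNamespace(), ls(), args()); the CRAN mirror URL is an assumption and can be replaced with any mirror you prefer.

```r
# Install somhca from CRAN only if it is not already available
if (!requireNamespace("somhca", quietly = TRUE)) {
  install.packages("somhca", repos = "https://cloud.r-project.org")
}
library(somhca)

# List everything the package exports
exported <- ls("package:somhca")
print(exported)

# Print the argument list of one function before calling it
print(args(finalSOM))

# In an interactive session, the help pages give the full details:
# help(package = "somhca")
# ?clusterSOM
```

Checking args() and the help pages first avoids relying on argument names remembered from another version of the package.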


Where to start

If you’re new to the package, a good starting point is training and visualizing SOMs. From there, you can progress to cluster analysis, explore visualizations of the grouped data, and compare results with alternative techniques.

Final thoughts

By the end of this tutorial, you will understand how to use somhca to transform complex, high-dimensional data into interpretable visual patterns and cluster structures for further analysis.


About the author
Gianluca Pastorelli is a Heritage Scientist (Senior Researcher) working at the National Gallery of Denmark (SMK).
ORCID: https://orcid.org/0000-0001-6926-1952
GitHub: https://github.com/Gianluca-Pastorelli
