Introducing the somhca R package
Complex datasets often contain patterns that are difficult to interpret using traditional statistical methods alone. One effective approach is to combine self-organizing maps (SOM) with hierarchical cluster analysis (HCA). Together, these techniques provide a powerful framework for exploring, visualizing and grouping high-dimensional data.
SOM is an unsupervised neural network method that projects high-dimensional data onto a two-dimensional grid while preserving local relationships in the original data. Similar observations are positioned close to one another on the map, so patterns, trends and structures in complex datasets become visually apparent. This makes SOM an excellent tool for dimensionality reduction and exploratory data analysis.
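As a rough illustration of this idea, the sketch below trains a small SOM on the built-in iris data. It assumes the kohonen package is installed and does not use somhca itself; the 5 x 5 hexagonal grid and the 200 training iterations are arbitrary choices made for the example.

```r
# Assumption: the 'kohonen' package is installed. This is a generic SOM
# sketch, not the somhca API.
library(kohonen)

set.seed(42)                                  # SOM training is stochastic

# Normalize the four numeric variables so each has mean 0 and sd 1.
X <- scale(as.matrix(iris[, 1:4]))

# A 5 x 5 hexagonal grid (illustrative size only).
grid <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal")

# Train the map; rlen is the number of passes over the data.
som_model <- som(X, grid = grid, rlen = 200)

# Each observation is assigned to one node (its best-matching unit).
head(som_model$unit.classif)

# Number of observations mapped to each node.
plot(som_model, type = "counts")
```

Observations that land on the same or neighboring nodes are similar in the original four-dimensional space, which is what makes the grid readable as a map.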
However, SOM is not a clustering method by itself; it is primarily a topology-preserving mapping technique. When a large
number of SOM units is used, similar observations may still be distributed
across multiple neighboring nodes. To address this, HCA can be applied to
the SOM nodes. HCA groups similar nodes together, making the resulting map
easier to interpret and helping identify cluster structure.
After clustering the SOM nodes, the cluster labels can be mapped back to the
original observations: each observation takes the label of the node it was
assigned to. These grouped results can then be explored through
visualizations such as scatter plots, heatmaps or time-series
representations.
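The node-clustering and label-mapping steps can be sketched with base R alone. The codebook matrix and node assignments below are small made-up values used only for illustration; in a real analysis they would come from a trained SOM. Ward linkage and k = 2 are likewise assumptions for the example, not settings taken from somhca.

```r
# Hypothetical codebook: 6 SOM nodes, each described by 2 variables.
codes <- rbind(c(0.10, 0.20), c(0.20, 0.10), c(0.15, 0.25),
               c(2.00, 2.10), c(2.20, 1.90), c(2.10, 2.00))

# Hypothetical node index assigned to each of 8 observations.
bmu <- c(1, 1, 2, 3, 4, 5, 6, 6)

# Hierarchical clustering of the nodes (not of the raw observations).
hc <- hclust(dist(codes), method = "ward.D2")

# Cut the dendrogram into 2 clusters: one label per node.
node_cluster <- cutree(hc, k = 2)

# Each observation takes the cluster label of its node.
obs_cluster <- node_cluster[bmu]
obs_cluster
```

Because the clustering runs on the nodes rather than on every observation, the distance matrix stays small even when the dataset is large.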
A simple workflow
This two-stage process (first training a SOM and then clustering the SOM
nodes using HCA) is commonly referred to as the SOMHCA approach. A typical
SOMHCA workflow includes:
- Loading and normalizing the data
- Estimating a suitable SOM grid size
- Training the SOM model
- Visualizing and refining the SOM
- Performing hierarchical clustering on the SOM units
- Retrieving cluster assignments for the original observations
- Visualizing and analyzing the grouped data
- Optionally comparing results with alternative methods such as PCA or k-means clustering
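For the optional comparison step, a minimal base R sketch of PCA and k-means might look like the following. It uses the built-in iris data, and the choice of three clusters is an assumption for the example. The species column stands in for a second set of labels; in a real analysis that would be the SOMHCA cluster assignments.

```r
set.seed(1)                                   # k-means uses random starts

# Normalize the numeric variables.
X <- scale(as.matrix(iris[, 1:4]))

# PCA on the already centered and scaled data; keep the first two components.
pca <- prcomp(X)
scores <- pca$x[, 1:2]

# k-means with 3 clusters and several random restarts.
km <- kmeans(X, centers = 3, nstart = 25)

# Show the k-means clusters in PCA space.
plot(scores, col = km$cluster, pch = 19, xlab = "PC1", ylab = "PC2")

# Cross-tabulate the k-means clusters against a second labelling.
table(kmeans = km$cluster, reference = iris$Species)
```

A cross-tabulation like this shows at a glance where two groupings agree and where they split observations differently.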
The somhca R package – seven core functions for SOMHCA analysis
In this tutorial, we will implement the SOMHCA workflow using the somhca package in R. Once somhca is installed and loaded via install.packages("somhca") and library(somhca), the package provides functions for:
- Data preparation with loadMatrix()
- SOM grid size optimization using optimalSOM()
- Training the SOM model with finalSOM()
- Visualizing feature distributions using generatePlot()
- Clustering SOM nodes with clusterSOM()
- Clustering data processed with alternative methods using clusterX()
- Retrieving observation-level cluster assignments using getClusterData()
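A short setup sketch is shown below. It uses only the install and load calls mentioned above plus standard R help tools, and makes no assumptions about the arguments of the somhca functions; the package help pages are the place to check those.

```r
# Install once if the package is not already available, then load it.
if (!requireNamespace("somhca", quietly = TRUE)) {
  install.packages("somhca")
}
library(somhca)

# List the functions exported by the package.
ls("package:somhca")

# Open the help page for one of them to see its arguments.
help("loadMatrix", package = "somhca")
```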
Where to start
If you’re new to the package, a good starting point is training and visualizing SOMs. From there, you can progress to cluster analysis, explore
visualizations of the grouped data, and compare results with alternative
techniques.
Final thoughts
By the end of this tutorial, you will understand how to use somhca to
transform complex, high-dimensional data into interpretable visual patterns
and cluster structures for further analysis.
About the author
Gianluca Pastorelli is a Heritage Scientist (Senior Researcher) working at the National Gallery of Denmark (SMK).
ORCID: https://orcid.org/0000-0001-6926-1952
GitHub: https://github.com/Gianluca-Pastorelli