spectrakit Package – Part 3: Performing Principal Component Analysis in R with plotPCA()
This is part 3 of a series on the spectrakit R package. Start
here for an introduction.
Overview
The
plotPCA() function from the spectrakit R package
performs principal component analysis (PCA) on numeric variables in a
dataset and generates publication-ready visualizations of scores, loadings,
biplots or cumulative explained variance. It automatically handles scaling,
missing values, grouping variables, confidence ellipses, customizable color
palettes and figure export. This function is especially useful for
exploratory multivariate analysis when patterns, clustering or dominant
sources of variation are not immediately clear. Typical use cases include
determining the number of relevant principal components, assessing group
separation, identifying influential variables, and producing standardized
PCA figures for reports, presentations or scientific publications.
Syntax
plotPCA(
data,
pcs = c(1, 2),
color_var = NULL,
shape_var = NULL,
plot_type = c("score", "loading", "biplot", "cumvar"),
palette = "Dark2",
score_labels = TRUE,
loading_labels = TRUE,
ellipses = FALSE,
ellipse_var = NULL,
display_names = FALSE,
legend_title = NULL,
return_pca = FALSE,
output_format = "tiff",
output_folder = NULL
)
Arguments
| Argument name | Type | Description | Default value (if provided, argument is optional) |
|---|---|---|---|
| data | Data frame containing numeric variables | Numeric variables are centered and scaled by default; non-numeric columns are ignored | Argument is required |
| pcs | Numeric vector of length 2 | Indicates which principal components to plot | c(1, 2) |
| color_var | Character | Column name for coloring points by group; converted to factor internally | NULL (all points same color) |
| shape_var | Character | Column name for shaping points by group; converted to factor internally | NULL (all points same shape) |
| plot_type | Character |
Type of PCA plot to generate. One of:
|
"score" |
| palette | Character or vector |
Color setting for groups. One of:
|
"Dark2" |
| score_labels | Logical or character |
Controls labeling of points (scores). One of:
|
TRUE |
| loading_labels | Logical |
If TRUE, displays labels for variables (loadings)
|
TRUE |
| ellipses | Logical |
If TRUE, draws confidence ellipses around groups in score/biplot.
Grouping logic for ellipses follows this priority:
|
FALSE |
| ellipse_var | Character | Name of the variable used to group ellipses. Takes precedence over all other grouping variables; converted to factor internally | NULL |
| display_names | Logical | If TRUE, shows legend | FALSE |
| legend_title | Character |
Legend title corresponding to color_var and/or shape_var
|
NULL |
| return_pca | Logical |
If TRUE, returns a list with plot and PCA object
|
FALSE |
| output_format | Character | File format for saving plots. Examples: "tiff", "png", "pdf" | "tiff" |
| output_folder | Character | Path to folder where plots are saved. If NULL, plot is not saved and the ggplot object (or list with plot and PCA if return_pca = TRUE) is returned. If specified, plot is saved automatically (function returns PCA object only if return_pca = TRUE); if ".", plot is saved in the working directory | NULL |
Return value
The function returns either a
ggplot object or a list containing both
the PCA object and the plot, depending on the
return_pca argument. If return_pca = FALSE (the
default), the function simply returns a ggplot object representing
the chosen PCA visualization: scores, loadings, biplot or cumulative
variance. This object is fully compatible with standard
ggplot2 operations, such as printing, further customization with
+ layers, or saving via ggsave. If
return_pca = TRUE, the function returns a list with elements
plot (the ggplot object) and pca (the prcomp result),
allowing users to extract components with the $ operator,
inspect eigenvalues, loadings or scores, and integrate the PCA results into
downstream analyses or reports.
Practical examples
The function takes a data frame of numeric variables and, by default,
centers and scales the data, removes rows with missing values, and ignores
any non-numeric columns. In this example, we generate 50 near-infrared (NIR)
sample spectra using the
NIRsoil dataset included in the prospectr package. The
code below randomly selects 50 spectra from the dataset and saves each as a
CSV file in a temporary folder on the Desktop, with named columns for
wavelength and reflectance:
# ---- Install & load package if needed ----
# install.packages("prospectr")
library(prospectr)
# ---- Load dataset ----
data(NIRsoil)
# Spectral matrix and wavelength axis
spectra_matrix <- NIRsoil$spc
wavelength <- as.numeric(colnames(spectra_matrix)) # wavelength in nm
# ---- Create TEMP folder on Desktop ----
desktop_path <- file.path(path.expand("~"), "Desktop")
temp_dir <- file.path(desktop_path, "TEMP")
if (!dir.exists(temp_dir)) {
dir.create(temp_dir, recursive = TRUE)
}
# ---- Randomly select 50 spectra ----
set.seed(260329) # for reproducibility
n_available <- nrow(spectra_matrix)
if (n_available < 50) stop("Dataset contains fewer than 50 spectra.")
selected_rows <- sample(seq_len(n_available), 50)
# ---- Save each spectrum as a CSV file ----
for (i in seq_along(selected_rows)) {
reflectance <- spectra_matrix[selected_rows[i], ]
output_df <- data.frame(
Wavelength = wavelength,
Reflectance = as.numeric(reflectance)
)
file_path <- file.path(
temp_dir,
paste0("NIRsoil_spectrum_", i, ".csv")
)
write.csv(
output_df,
file = file_path,
row.names = FALSE,
quote = FALSE
)
}
cat("50 random NIR spectra saved in:", temp_dir, "\n")
We can then use the
combineSpectra() function (see
spectrakit Package – Part 2) to merge the 50 spectra into a single
dataset (50 rows × 701 columns), making it suitable for the
plotPCA() function. Because the data argument of
plotPCA() expects an object already loaded in memory, we will
store the merged spectra in an object called allSpectra:
# ---- Install & load package if needed ----
# install.packages("spectrakit")
library(spectrakit)
setwd(temp_dir)
allSpectra <- combineSpectra(orientation = "rows")
The minimal working example of the
plotPCA() function
requires specifying only the data argument, while all other
arguments use their default values:
plotPCA(allSpectra)
This produces a score plot (a scatter plot of observations in PCA
space) of PC1 vs. PC2, with all points (scores) labeled using the row
names and sharing the same color and shape because no variables (i.e.,
column names in the original data) were selected to control these
aesthetics:
A warning message indicates that some data points are unlabeled due to excessive overlap; this is expected behavior based on the function’s underlying settings.
We can also set the
score_labels argument to the
Spectrum column to display the spectra names as labels, or set
the argument to FALSE so that the scores appear without
labels:
plotPCA(allSpectra, score_labels = "Spectrum") # labels from column
plotPCA(allSpectra, score_labels = FALSE) # no labels
Instead of plotting PC1 vs. PC2, we may choose to plot PC1 vs. PC3. In
that case, we need to specify the desired principal components in the
pcs argument:
plotPCA(allSpectra, pcs = c(1,3))
In this case, PC3 accounts for very little variance (0.1%), so we
will continue focusing only on PC1 and PC2.
Suppose we add a column named Group to our data frame to
represent some form of grouping (for example, a sample type, or a
cluster assignment obtained from a clustering analysis). If this new
variable is numeric, it must be converted to a factor, otherwise it
will be included in the PCA, while we only want to use it as a
grouping variable to control the colors and shapes of the data
points:
Group <- c(3, 3, 1, 1, 1, 2, 1, 4, 1, 3, 1,
2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1,
2, 2, 1, 2, 3, 3, 3, 1, 3, 1, 3, 2,
1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 3, 1,
1, 1, 1)
allSpectra$Group <- as.factor(Group)
We can now assign Group to the
color_var and/or shape_var arguments to
use it as a grouping variable for the colors and shapes of the
data points (though any one or two other variables could also be
used for this purpose):
plotPCA(allSpectra, color_var = "Group", shape_var = "Group")
It is worth noting that the
color_var argument is
linked to the palette argument, which can be a single
color (not very informative when color-coding a variable), a
ColorBrewer palette ("Dark2" by default) or a
custom vector of colors (e.g.,
c("red", "yellow", "green", "blue")).
We can also display confidence ellipses around groups to visualize
their spread by setting the
ellipses argument to
TRUE. Any variable can be used to define the ellipse
groups via the ellipse_var argument; if
ellipse_var is not specified, the ellipses will be
defined by color_var or shape_var, in
that order:
plotPCA(allSpectra, color_var = "Group", shape_var = "Group",
ellipses = TRUE)
For clarity, we can add a legend to explain the colors and shapes,
and provide a descriptive title using the
display_names and
legend_title arguments, respectively:
plotPCA(allSpectra, color_var = "Group", shape_var = "Group",
ellipses = TRUE, display_names = TRUE, legend_title = "Cluster")
So far, we have only looked at score plots, but with the
plotPCA() function can also generate loading plots
(which show how variables contribute, with arrows whose lengths
are proportional to their importance), biplots (combining scores
and loadings), and cumulative variance explained plots (which
indicate the number of PCs needed to reach 95% and for 99% of the
variance, helping decide how many PCs to retain). The spectra
files selected for the previous examples are not ideal because
they contain a very high number of variables (700); instead, we
will use the built-in R iris dataset, since the function
can be applied to many types of datasets, not just spectral data.
Below are examples of four different plot types, showing various
styles, such as using shapes or colors to define groups, including
or omitting labels for scores and/or loadings, and displaying or
hiding ellipses:
plotPCA(
data = iris,
shape_var = "Species",
ellipses = TRUE,
display_names = TRUE,
legend_title = "Iris species"
) +
ggtitle("Score plot")
plotPCA(
data = iris,
color_var = "Species",
plot_type = "biplot",
score_labels = FALSE,
loading_labels = FALSE,
display_names = TRUE,
legend_title = "Iris Species"
) +
ggtitle("Biplot")
plotPCA(
data = iris,
plot_type = "loading"
) +
ggtitle("Loading plot")
plotPCA(
data = iris,
plot_type = "cumvar"
) +
ggtitle("Cumulative variance")
If the
return_pca argument is set to
TRUE, the function returns a list with two
components: a plot object and the full PCA result object for
further analysis. This is useful when you want direct access
to scores (pca$x ), loadings
(pca$rotation) and standard deviation of each
PC (pca$sdev):
res <- plotPCA(iris, return_pca = TRUE)
# 1. Handle PCA plot
res$plot # simply show the plot object
ggsave("pca.png", res$plot) # save it as an image
# 2. Access PCA results
summary(res$pca) # this will show:
# - importance of components (standard deviation)
# - proportion of variance explained
# - cumulative proportion of variance
# 3. Extract scores for analysis (e.g., cluster analysis)
scores <- res$pca$x
# 4. Extract loadings to see variable contributions
loadings <- res$pca$rotation
# 5. Calculate variance explained by each PC
variance <- (res$pca$sdev)^2 / sum(res$pca$sdev^2)
If
output_folder is specified, the plot is
saved to disk (TIFF by default, though a different
format can be set via the
output_format argument), and the function
returns NULL invisibly or only the PCA
object if return_pca = TRUE.
Quick recap
plotPCA() is a fairly full-featured PCA
visualization utility built around
prcomp() and ggplot2. It cleans and
scales numeric data automatically, performs principal
component analysis, and produces flexible plots
including scores, loadings, biplots, or cumulative
variance. The function supports optional grouping by
color or shape, labeling of points or variables, and
confidence ellipses, and can save plots to disk or
return the PCA object for further analysis. Key
arguments include data for the numeric
dataset, pcs to select which principal
components to plot, plot_type to choose the
type of visualization, color_var and
shape_var for grouping,
ellipses and ellipse_var for
confidence ellipses, score_labels and
loading_labels to control labeling, and
return_pca to access the underlying PCA
results. Typical usage involves preparing a numeric
dataset, calling the function with desired options,
inspecting or saving the resulting plot, and optionally
extracting PCA components for further analysis. In
short, it is a simple yet powerful wrapper that allows
quick PCA visualization, flexible grouping and labeling,
and publication-ready plots.
About the author
Gianluca Pastorelli is a Heritage Scientist (Senior Researcher) working at the National Gallery of Denmark (SMK).








Comments
Post a Comment