spectrakit Package – Part 3: Performing Principal Component Analysis in R with plotPCA()

This is part 3 of a series on the spectrakit R package. Start here for an introduction.

Overview

The plotPCA() function from the spectrakit R package performs principal component analysis (PCA) on numeric variables in a dataset and generates publication-ready visualizations of scores, loadings, biplots or cumulative explained variance. It automatically handles scaling, missing values, grouping variables, confidence ellipses, customizable color palettes and figure export. This function is especially useful for exploratory multivariate analysis when patterns, clustering or dominant sources of variation are not immediately clear. Typical use cases include determining the number of relevant principal components, assessing group separation, identifying influential variables, and producing standardized PCA figures for reports, presentations or scientific publications.

Syntax

plotPCA(
  data,
  pcs = c(1, 2),
  color_var = NULL,
  shape_var = NULL,
  plot_type = c("score", "loading", "biplot", "cumvar"),
  palette = "Dark2",
  score_labels = TRUE,
  loading_labels = TRUE,
  ellipses = FALSE,
  ellipse_var = NULL,
  display_names = FALSE,
  legend_title = NULL,
  return_pca = FALSE,
  output_format = "tiff",
  output_folder = NULL
)

Arguments

Argument name	Type	Description	Default value (if provided, argument is optional)
data	Data frame containing numeric variables	Numeric variables are centered and scaled by default; non-numeric columns are ignored	Argument is required
pcs	Numeric vector of length 2	Indicates which principal components to plot	c(1, 2)
color_var	Character	Column name for coloring points by group; converted to factor internally	NULL (all points same color)
shape_var	Character	Column name for shaping points by group; converted to factor internally	NULL (all points same shape)
plot_type	Character	Type of PCA plot to generate. One of: "score" (plot PCA scores, i.e., observations) "loading" (plot PCA loadings, i.e., variables) "biplot" (combine scores and loadings in a biplot) "cumvar" (plot cumulative variance explained across principal components)	"score"
palette	Character or vector	Color setting for groups. One of: a single color repeated for all groups (e.g., "black") a ColorBrewer palette name (e.g., "Dark2") a custom color vector, recycled to match the number of groups	"Dark2"
score_labels	Logical or character	Controls labeling of points (scores). One of: TRUE (uses row names for labels) column name (uses a column in the data frame for labels) FALSE (no labels are shown)	TRUE
loading_labels	Logical	If TRUE, displays labels for variables (loadings)	TRUE
ellipses	Logical	If TRUE, draws confidence ellipses around groups in score/biplot. Grouping logic for ellipses follows this priority: if ellipse_var is provided, ellipses are drawn by that variable else, if color_var is provided, ellipses are drawn by color groups else, if shape_var is provided, ellipses are drawn by shape groups if none are provided, no ellipses are drawn	FALSE
ellipse_var	Character	Name of the variable used to group ellipses. Takes precedence over all other grouping variables; converted to factor internally	NULL
display_names	Logical	If TRUE, shows legend	FALSE
legend_title	Character	Legend title corresponding to color_var and/or shape_var	NULL
return_pca	Logical	If TRUE, returns a list with plot and PCA object	FALSE
output_format	Character	File format for saving plots. Examples: "tiff", "png", "pdf"	"tiff"
output_folder	Character	Path to folder where plots are saved. If NULL, plot is not saved and the ggplot object (or list with plot and PCA if return_pca = TRUE) is returned. If specified, plot is saved automatically (function returns PCA object only if return_pca = TRUE); if ".", plot is saved in the working directory	NULL

Return value

The function returns either a ggplot object or a list containing both the PCA object and the plot, depending on the return_pca argument. If return_pca = FALSE (the default), the function simply returns a ggplot object representing the chosen PCA visualization: scores, loadings, biplot or cumulative variance. This object is fully compatible with standard ggplot2 operations, such as printing, further customization with + layers, or saving via ggsave. If return_pca = TRUE, the function returns a list with elements plot (the ggplot object) and pca (the prcomp result), allowing users to extract components with the $ operator, inspect eigenvalues, loadings or scores, and integrate the PCA results into downstream analyses or reports.

Practical examples

The function takes a data frame of numeric variables and, by default, centers and scales the data, removes rows with missing values, and ignores any non-numeric columns. Normally, spectra are exported from commercial spectroscopy software (e.g., Bruker OPUS and Thermo Scientific OMNIC) into formats such as ASCII text, Excel or JCAMP using built-in “Export” or “Save As” functions. In this example, however, we generate 50 near-infrared (NIR) sample spectra using the NIRsoil dataset included in the prospectr package. The code below randomly selects 50 spectra from the dataset and saves each as a CSV file in a temporary folder on the Desktop, with named columns for wavelength and reflectance:

# ---- Install & load package if needed ----
# install.packages("prospectr")
library(prospectr)

# ---- Load dataset ----
data(NIRsoil)

# Spectral matrix and wavelength axis
spectra_matrix <- NIRsoil$spc
wavelength <- as.numeric(colnames(spectra_matrix))  # wavelength in nm

# ---- Create TEMP folder on Desktop ----
desktop_path <- file.path(path.expand("~"), "Desktop")
temp_dir <- file.path(desktop_path, "TEMP")

if (!dir.exists(temp_dir)) {
    dir.create(temp_dir, recursive = TRUE)
}

# ---- Randomly select 50 spectra ----
set.seed(260329)  # for reproducibility
n_available <- nrow(spectra_matrix)
if (n_available < 50) stop("Dataset contains fewer than 50 spectra.")

selected_rows <- sample(seq_len(n_available), 50)

# ---- Save each spectrum as a CSV file ----
for (i in seq_along(selected_rows)) {
    
    reflectance <- spectra_matrix[selected_rows[i], ]
    
    output_df <- data.frame(
        Wavelength = wavelength,
        Reflectance = as.numeric(reflectance)
    )
    
    file_path <- file.path(
        temp_dir,
        paste0("NIRsoil_spectrum_", i, ".csv")
    )
    
    write.csv(
        output_df,
        file = file_path,
        row.names = FALSE,
        quote = FALSE
    )
}

cat("50 random NIR spectra saved in:", temp_dir, "\n")

We can then use the combineSpectra() function (see spectrakit Package – Part 2) to merge the 50 spectra into a single dataset (50 rows × 701 columns), making it suitable for the plotPCA() function. Because the data argument of plotPCA() expects an object already loaded in memory, we will store the merged spectra in an object called allSpectra:

# ---- Install & load package if needed ----
# install.packages("spectrakit")
library(spectrakit)

setwd(temp_dir)

allSpectra <- combineSpectra(orientation = "rows")

The minimal working example of the plotPCA() function requires specifying only the data argument, while all other arguments use their default values:

plotPCA(allSpectra)

This produces a score plot (a scatter plot of observations in PCA space) of PC1 vs. PC2, with all points (scores) labeled using the row names and sharing the same color and shape because no variables (i.e., column names in the original data) were selected to control these aesthetics:

A warning message indicates that some data points are unlabeled due to excessive overlap; this is expected behavior based on the function’s underlying settings.

We can also set the score_labels argument to the Spectrum column to display the spectra names as labels, or set the argument to FALSE so that the scores appear without labels:

plotPCA(allSpectra, score_labels = "Spectrum") # labels from column

plotPCA(allSpectra, score_labels = FALSE) # no labels

Instead of plotting PC1 vs. PC2, we may choose to plot PC1 vs. PC3. In that case, we need to specify the desired principal components in the pcs argument:

plotPCA(allSpectra, pcs = c(1,3))

In this case, PC3 accounts for very little variance (0.1%), so we will continue focusing only on PC1 and PC2.

Suppose we add a column named Group to our data frame to represent some form of grouping (for example, a sample type, or a cluster assignment obtained from a clustering analysis). If this new variable is numeric, it must be converted to a factor, otherwise it will be included in the PCA, while we only want to use it as a grouping variable to control the colors and shapes of the data points:

Group <- c(3, 3, 1, 1, 1, 2, 1, 4, 1, 3, 1,
      	     2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1,
             2, 2, 1, 2, 3, 3, 3, 1, 3, 1, 3, 2,
             1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 3, 1,
             1, 1, 1)

allSpectra$Group <- as.factor(Group)

We can now assign Group to the color_var and/or shape_var arguments to use it as a grouping variable for the colors and shapes of the data points (though any one or two other variables could also be used for this purpose):

plotPCA(allSpectra, color_var = "Group", shape_var = "Group")

It is worth noting that the color_var argument is linked to the palette argument, which can be a single color (not very informative when color-coding a variable), a ColorBrewer palette ("Dark2" by default) or a custom vector of colors (e.g., c("red", "yellow", "green", "blue")).

We can also display confidence ellipses around groups to visualize their spread by setting the ellipses argument to TRUE. Any variable can be used to define the ellipse groups via the ellipse_var argument; if ellipse_var is not specified, the ellipses will be defined by color_var or shape_var, in that order:

plotPCA(allSpectra, color_var = "Group", shape_var = "Group",
            ellipses = TRUE)

For clarity, we can add a legend to explain the colors and shapes, and provide a descriptive title using the display_names and legend_title arguments, respectively:

plotPCA(allSpectra, color_var = "Group", shape_var = "Group",
            ellipses = TRUE, display_names = TRUE, legend_title = "Cluster")

So far, we have only looked at score plots, but the plotPCA() function can also generate loading plots (which show how variables contribute, with arrows whose lengths are proportional to their importance), biplots (combining scores and loadings), and cumulative variance explained plots (which indicate the number of PCs needed to reach 95% and for 99% of the variance, helping decide how many PCs to retain). The spectra files selected for the previous examples are not ideal because they contain a very high number of variables (700); instead, we will use the built-in R iris dataset, since the function can be applied to many types of datasets, not just spectral data. Below are examples of four different plot types, showing various styles, such as using shapes or colors to define groups, including or omitting labels for scores and/or loadings, and displaying or hiding ellipses:

library(ggplot2)

plotPCA(
        data = iris,
        shape_var = "Species",
        ellipses = TRUE,
        display_names = TRUE,
        legend_title = "Iris species"
) +
        ggtitle("Score plot")

plotPCA(
        data = iris,
        color_var = "Species",
        plot_type = "biplot",
        score_labels = FALSE,
        loading_labels = FALSE,
        display_names = TRUE,
        legend_title = "Iris Species"
) +
        ggtitle("Biplot")      

plotPCA(
        data = iris,
        plot_type = "loading"
) +
        ggtitle("Loading plot")

plotPCA(
        data = iris,
        plot_type = "cumvar"
) +
        ggtitle("Cumulative variance")

If the return_pca argument is set to TRUE, the function returns a list with two components: a plot object and the full PCA result object for further analysis. This is useful when you want direct access to scores (pca$x ), loadings (pca$rotation) and standard deviation of each PC (pca$sdev):

res <- plotPCA(iris, return_pca = TRUE)

# 1. Handle PCA plot
res$plot # simply show the plot object
ggsave("pca.png", res$plot)  # save it as an image

# 2. Access PCA results
summary(res$pca)  # this will show:
                    # - importance of components (standard deviation)
                    # - proportion of variance explained
                    # - cumulative proportion of variance

# 3. Extract scores for analysis (e.g., cluster analysis)
scores <- res$pca$x

# 4. Extract loadings to see variable contributions
loadings <- res$pca$rotation

If output_folder is specified, the plot is saved to disk (TIFF by default, though a different format can be set via the output_format argument), and the function returns NULL invisibly or only the PCA object if return_pca = TRUE.

Quick recap

plotPCA() is a fairly full-featured PCA visualization utility built around prcomp() and ggplot2. It cleans and scales numeric data automatically, performs principal component analysis, and produces flexible plots including scores, loadings, biplots, or cumulative variance. The function supports optional grouping by color or shape, labeling of points or variables, and confidence ellipses, and can save plots to disk or return the PCA object for further analysis. Key arguments include data for the numeric dataset, pcs to select which principal components to plot, plot_type to choose the type of visualization, color_var and shape_var for grouping, ellipses and ellipse_var for confidence ellipses, score_labels and loading_labels to control labeling, and return_pca to access the underlying PCA results. Typical usage involves preparing a numeric dataset, calling the function with desired options, inspecting or saving the resulting plot, and optionally extracting PCA components for further analysis. In short, it is a simple yet powerful wrapper that allows quick PCA visualization, flexible grouping and labeling, and publication-ready plots.

← Previous post: spectrakit Package – Part 2: Merging Spectra in R with combineSpectra()

→ Next post: spectrakit Package – Part 4: Creating a Composite Image Grid in R with makeComposite()

Citation

If you use spectrakit in your research, please cite it as follows:

Pastorelli G (2026). spectrakit: Spectral Data Handling and Visualization. R package version 0.2.0. DOI: 10.32614/CRAN.package.spectrakit. Available at: https://CRAN.R-project.org/package=spectrakit

A BibTeX entry for LaTeX users is

@Manual{spectrakit,
  title = {spectrakit: Spectral Data Handling and Visualization},
  author = {Gianluca Pastorelli},
  year = {2026},
  note = {R package version 0.2.0},
  url = {https://CRAN.R-project.org/package=spectrakit},
  doi = {10.32614/CRAN.package.spectrakit},
}

About the author

Gianluca Pastorelli is a Heritage Scientist (Senior Researcher) working at the National Gallery of Denmark (SMK).

ORCID: https://orcid.org/0000-0001-6926-1952

GitHub: https://github.com/Gianluca-Pastorelli

Search This Blog

The R Package Lab