spectrakit Package – Part 3: Performing Principal Component Analysis in R with plotPCA()

This is part 3 of a series on the spectrakit R package. Start here for an introduction.

Overview

The plotPCA() function from the spectrakit R package performs principal component analysis (PCA) on numeric variables in a dataset and generates publication-ready visualizations of scores, loadings, biplots or cumulative explained variance. It automatically handles scaling, missing values, grouping variables, confidence ellipses, customizable color palettes and figure export. This function is especially useful for exploratory multivariate analysis when patterns, clustering or dominant sources of variation are not immediately clear. Typical use cases include determining the number of relevant principal components, assessing group separation, identifying influential variables, and producing standardized PCA figures for reports, presentations or scientific publications.

Syntax

plotPCA(
  data,
  pcs = c(1, 2),
  color_var = NULL,
  shape_var = NULL,
  plot_type = c("score", "loading", "biplot", "cumvar"),
  palette = "Dark2",
  score_labels = TRUE,
  loading_labels = TRUE,
  ellipses = FALSE,
  ellipse_var = NULL,
  display_names = FALSE,
  legend_title = NULL,
  return_pca = FALSE,
  output_format = "tiff",
  output_folder = NULL
)

Arguments

Argument name Type Description Default value (if provided, argument is optional)
data Data frame containing numeric variables Numeric variables are centered and scaled by default; non-numeric columns are ignored Argument is required
pcs Numeric vector of length 2 Indicates which principal components to plot c(1, 2)
color_var Character Column name for coloring points by group; converted to factor internally NULL (all points same color)
shape_var Character Column name for shaping points by group; converted to factor internally NULL (all points same shape)
plot_type Character Type of PCA plot to generate. One of:
  • "score" (plot PCA scores, i.e., observations)
  • "loading" (plot PCA loadings, i.e., variables)
  • "biplot" (combine scores and loadings in a biplot)
  • "cumvar" (plot cumulative variance explained across principal components)
"score"
palette Character or vector Color setting for groups. One of:
  • a single color repeated for all groups (e.g., "black")
  • a ColorBrewer palette name (e.g., "Dark2")
  • a custom color vector, recycled to match the number of groups
"Dark2"
score_labels Logical or character Controls labeling of points (scores). One of:
  • TRUE (uses row names for labels)
  • column name (uses a column in the data frame for labels)
  • FALSE (no labels are shown)
    TRUE
    loading_labels Logical If TRUE, displays labels for variables (loadings)
      TRUE
      ellipses Logical If TRUE, draws confidence ellipses around groups in score/biplot. Grouping logic for ellipses follows this priority:
      • if ellipse_var is provided, ellipses are drawn by that variable
      • else, if color_var is provided, ellipses are drawn by color groups
      • else, if shape_var is provided, ellipses are drawn by shape groups
      • if none are provided, no ellipses are drawn
      FALSE
      ellipse_var Character Name of the variable used to group ellipses. Takes precedence over all other grouping variables; converted to factor internally NULL
      display_names Logical If TRUE, shows legend FALSE
      legend_title Character Legend title corresponding to color_var and/or shape_var
        NULL
        return_pca Logical If TRUE, returns a list with plot and PCA object
          FALSE
          output_format Character File format for saving plots. Examples: "tiff", "png", "pdf" "tiff"
          output_folder Character Path to folder where plots are saved. If NULL, plot is not saved and the ggplot object (or list with plot and PCA if return_pca = TRUE) is returned. If specified, plot is saved automatically (function returns PCA object only if return_pca = TRUE); if ".", plot is saved in the working directory NULL

          Return value

          The function returns either a ggplot object or a list containing both the PCA object and the plot, depending on the return_pca argument. If return_pca = FALSE (the default), the function simply returns a ggplot object representing the chosen PCA visualization: scores, loadings, biplot or cumulative variance. This object is fully compatible with standard ggplot2 operations, such as printing, further customization with + layers, or saving via ggsave. If return_pca = TRUE, the function returns a list with elements plot (the ggplot object) and pca (the prcomp result), allowing users to extract components with the $ operator, inspect eigenvalues, loadings or scores, and integrate the PCA results into downstream analyses or reports.

          Practical examples

          The function takes a data frame of numeric variables and, by default, centers and scales the data, removes rows with missing values, and ignores any non-numeric columns. In this example, we generate 50 near-infrared (NIR) sample spectra using the NIRsoil dataset included in the prospectr package. The code below randomly selects 50 spectra from the dataset and saves each as a CSV file in a temporary folder on the Desktop, with named columns for wavelength and reflectance:
          # ---- Install & load package if needed ----
          # install.packages("prospectr")
          library(prospectr)
          
          # ---- Load dataset ----
          data(NIRsoil)
          
          # Spectral matrix and wavelength axis
          spectra_matrix <- NIRsoil$spc
          wavelength <- as.numeric(colnames(spectra_matrix))  # wavelength in nm
          
          # ---- Create TEMP folder on Desktop ----
          desktop_path <- file.path(path.expand("~"), "Desktop")
          temp_dir <- file.path(desktop_path, "TEMP")
          
          if (!dir.exists(temp_dir)) {
              dir.create(temp_dir, recursive = TRUE)
          }
          
          # ---- Randomly select 50 spectra ----
          set.seed(260329)  # for reproducibility
          n_available <- nrow(spectra_matrix)
          if (n_available < 50) stop("Dataset contains fewer than 50 spectra.")
          
          selected_rows <- sample(seq_len(n_available), 50)
          
          # ---- Save each spectrum as a CSV file ----
          for (i in seq_along(selected_rows)) {
              
              reflectance <- spectra_matrix[selected_rows[i], ]
              
              output_df <- data.frame(
                  Wavelength = wavelength,
                  Reflectance = as.numeric(reflectance)
              )
              
              file_path <- file.path(
                  temp_dir,
                  paste0("NIRsoil_spectrum_", i, ".csv")
              )
              
              write.csv(
                  output_df,
                  file = file_path,
                  row.names = FALSE,
                  quote = FALSE
              )
          }
          
          cat("50 random NIR spectra saved in:", temp_dir, "\n")
          

          We can then use the combineSpectra() function (see spectrakit Package – Part 2) to merge the 50 spectra into a single dataset (50 rows × 701 columns), making it suitable for the plotPCA() function. Because the data argument of plotPCA() expects an object already loaded in memory, we will store the merged spectra in an object called allSpectra:
          # ---- Install & load package if needed ----
          # install.packages("spectrakit")
          library(spectrakit)
          
          setwd(temp_dir)
          
          allSpectra <- combineSpectra(orientation = "rows")

          The minimal working example of the plotPCA() function requires specifying only the data argument, while all other arguments use their default values:
          plotPCA(allSpectra)

          This produces a score plot (a scatter plot of observations in PCA space) of PC1 vs. PC2, with all points (scores) labeled using the row names and sharing the same color and shape because no variables (i.e., column names in the original data) were selected to control these aesthetics:


          A warning message indicates that some data points are unlabeled due to excessive overlap; this is expected behavior based on the function’s underlying settings.
          We can also set the score_labels argument to the Spectrum column to display the spectra names as labels, or set the argument to FALSE so that the scores appear without labels:
          plotPCA(allSpectra, score_labels = "Spectrum") # labels from column
          
          plotPCA(allSpectra, score_labels = FALSE) # no labels



          Instead of plotting PC1 vs. PC2, we may choose to plot PC1 vs. PC3. In that case, we need to specify the desired principal components in the pcs argument:
          plotPCA(allSpectra, pcs = c(1,3))


          In this case, PC3 accounts for very little variance (0.1%), so we will continue focusing only on PC1 and PC2.
          Suppose we add a column named Group to our data frame to represent some form of grouping (for example, a sample type, or a cluster assignment obtained from a clustering analysis). If this new variable is numeric, it must be converted to a factor, otherwise it will be included in the PCA, while we only want to use it as a grouping variable to control the colors and shapes of the data points:
          Group <- c(3, 3, 1, 1, 1, 2, 1, 4, 1, 3, 1,
                	     2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1,
                       2, 2, 1, 2, 3, 3, 3, 1, 3, 1, 3, 2,
                       1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 3, 1,
                       1, 1, 1)
          
          allSpectra$Group <- as.factor(Group)

          We can now assign Group to the color_var and/or shape_var arguments to use it as a grouping variable for the colors and shapes of the data points (though any one or two other variables could also be used for this purpose):
          plotPCA(allSpectra, color_var = "Group", shape_var = "Group")


          It is worth noting that the color_var argument is linked to the palette argument, which can be a single color (not very informative when color-coding a variable), a ColorBrewer palette ("Dark2" by default) or a custom vector of colors (e.g., c("red", "yellow", "green", "blue")).
          We can also display confidence ellipses around groups to visualize their spread by setting the ellipses argument to TRUE. Any variable can be used to define the ellipse groups via the ellipse_var argument; if ellipse_var is not specified, the ellipses will be defined by color_var or shape_var, in that order:
          plotPCA(allSpectra, color_var = "Group", shape_var = "Group",
                      ellipses = TRUE)


          For clarity, we can add a legend to explain the colors and shapes, and provide a descriptive title using the display_names and legend_title arguments, respectively:
          plotPCA(allSpectra, color_var = "Group", shape_var = "Group",
                      ellipses = TRUE, display_names = TRUE, legend_title = "Cluster")


          So far, we have only looked at score plots, but with the plotPCA() function can also generate loading plots (which show how variables contribute, with arrows whose lengths are proportional to their importance), biplots (combining scores and loadings), and cumulative variance explained plots (which indicate the number of PCs needed to reach 95% and for 99% of the variance, helping decide how many PCs to retain). The spectra files selected for the previous examples are not ideal because they contain a very high number of variables (700); instead, we will use the built-in R iris dataset, since the function can be applied to many types of datasets, not just spectral data. Below are examples of four different plot types, showing various styles, such as using shapes or colors to define groups, including or omitting labels for scores and/or loadings, and displaying or hiding ellipses:
          plotPCA(
                  data = iris,
                  shape_var = "Species",
                  ellipses = TRUE,
                  display_names = TRUE,
                  legend_title = "Iris species"
          ) +
                  ggtitle("Score plot")
          
          plotPCA(
                  data = iris,
                  color_var = "Species",
                  plot_type = "biplot",
                  score_labels = FALSE,
                  loading_labels = FALSE,
                  display_names = TRUE,
                  legend_title = "Iris Species"
          ) +
                  ggtitle("Biplot")      
          
          plotPCA(
                  data = iris,
                  plot_type = "loading"
          ) +
                  ggtitle("Loading plot")
          
          plotPCA(
                  data = iris,
                  plot_type = "cumvar"
          ) +
                  ggtitle("Cumulative variance")
          


          If the return_pca argument is set to TRUE, the function returns a list with two components: a plot object and the full PCA result object for further analysis. This is useful when you want direct access to scores (pca$x ), loadings (pca$rotation) and standard deviation of each PC (pca$sdev):
          res <- plotPCA(iris, return_pca = TRUE)
          
          # 1. Handle PCA plot
          res$plot # simply show the plot object
          ggsave("pca.png", res$plot)  # save it as an image
          
          # 2. Access PCA results
          summary(res$pca)  # this will show:
                              # - importance of components (standard deviation)
                              # - proportion of variance explained
                              # - cumulative proportion of variance
          
          # 3. Extract scores for analysis (e.g., cluster analysis)
          scores <- res$pca$x
          
          # 4. Extract loadings to see variable contributions
          loadings <- res$pca$rotation
          
          # 5. Calculate variance explained by each PC
          variance <- (res$pca$sdev)^2 / sum(res$pca$sdev^2)
          

          If output_folder is specified, the plot is saved to disk (TIFF by default, though a different format can be set via the output_format argument), and the function returns NULL invisibly or only the PCA object if return_pca = TRUE.

          Quick recap

          plotPCA() is a fairly full-featured PCA visualization utility built around prcomp() and ggplot2. It cleans and scales numeric data automatically, performs principal component analysis, and produces flexible plots including scores, loadings, biplots, or cumulative variance. The function supports optional grouping by color or shape, labeling of points or variables, and confidence ellipses, and can save plots to disk or return the PCA object for further analysis. Key arguments include data for the numeric dataset, pcs to select which principal components to plot, plot_type to choose the type of visualization, color_var and shape_var for grouping, ellipses and ellipse_var for confidence ellipses, score_labels and loading_labels to control labeling, and return_pca to access the underlying PCA results. Typical usage involves preparing a numeric dataset, calling the function with desired options, inspecting or saving the resulting plot, and optionally extracting PCA components for further analysis. In short, it is a simple yet powerful wrapper that allows quick PCA visualization, flexible grouping and labeling, and publication-ready plots.


          About the author
          Gianluca Pastorelli is a Heritage Scientist (Senior Researcher) working at the National Gallery of Denmark (SMK).

          Comments