Principal Component Analysis

Overview

Principal Component Analysis (PCA) is a dimensionality reduction technique widely used in data science to simplify complex datasets while retaining as much of their variance as possible. It transforms the original variables into a new set of uncorrelated variables, called principal components, each of which is a linear combination of the originals. The components are ordered by the amount of variance they explain, so when the original variables are correlated, the first few components typically capture most of the variability in the dataset. Reducing dimensionality this way makes the data easier to visualize and analyze, and it helps mitigate multicollinearity in statistical models. Applying PCA lets us focus on the most influential directions in the data, enabling more efficient and insightful analysis.
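The transformation described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the code used for this analysis: the function name `pca`, the synthetic data, and the choice to standardize each column before decomposition are all assumptions made for the example.

```python
import numpy as np

def pca(X, n_components=None):
    """Return (scores, components, explained_variance_ratio).

    X has one row per sample and one column per feature.
    Assumes no column is constant (its standard deviation would be zero).
    """
    X = np.asarray(X, dtype=float)
    # Standardize so that features on different scales contribute equally.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized data: rows of Vt are the principal axes,
    # already sorted by decreasing singular value.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained_variance = S**2 / (Z.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    k = n_components or Vt.shape[0]
    # Project the data onto the first k principal axes.
    scores = Z @ Vt[:k].T
    return scores, Vt[:k], ratio[:k]

# Synthetic example (hypothetical data): two strongly correlated columns
# plus one independent column.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

scores, components, ratio = pca(X)
print("explained variance ratio:", np.round(ratio, 3))
```

The resulting score columns are mutually uncorrelated, and the ratios are sorted from largest to smallest, which is what makes it possible to keep only the leading components.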

Data Prep and Code

Several features in this dataset are highly correlated, and some are directly related to one another (for example, average, maximum, and minimum temperature). These redundant values will be dropped.
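One way to find such redundant features is to scan the correlation matrix for pairs above a cutoff. The sketch below is illustrative only: the helper `correlated_pairs`, the column names, the synthetic values, and the 0.9 cutoff are all hypothetical and are not taken from the actual dataset.

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.9):
    """List (name_i, name_j, r) for every column pair with |r| above threshold."""
    corr = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Hypothetical daily data: maximum temperature tracks average temperature closely.
rng = np.random.default_rng(1)
temp_avg = rng.normal(10, 5, size=365)
temp_max = temp_avg + 6 + rng.normal(0, 0.3, size=365)
precip = rng.gamma(2.0, 1.5, size=365)

names = ["temp_avg", "temp_max", "precip"]
X = np.column_stack([temp_avg, temp_max, precip])

pairs = correlated_pairs(X, names)

# Drop the second member of each flagged pair (an arbitrary choice for this sketch).
to_drop = {second for _, second, _ in pairs}
keep = [n for n in names if n not in to_drop]
X_reduced = X[:, [names.index(n) for n in keep]]
print(pairs, keep)
```

Which member of a correlated pair to keep is a judgment call; here the second one is dropped purely for simplicity.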

Based on the first pass of PCA, the Bison Lake and McClure Pass stations appear to be highly correlated (r > 0.98). To avoid multicollinearity, we will average the results from the two stations and re-run the analysis. We will also average the soil moisture data by depth. This takes our original 30-feature dataset down to eight key features.
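Averaging correlated columns can be sketched as follows. The column names, the synthetic values, and the `average_groups` helper are hypothetical; the real dataset's layout is not shown here, and the same helper could be given whatever grouping is used for the soil moisture depths.

```python
import numpy as np

def average_groups(data, groups):
    """Replace each group of columns with a single column holding their mean.

    data   : dict mapping column name -> 1-D array
    groups : dict mapping new column name -> list of columns to average
    """
    grouped = {col for cols in groups.values() for col in cols}
    # Keep every column that is not part of a group unchanged.
    out = {name: col for name, col in data.items() if name not in grouped}
    for new_name, cols in groups.items():
        out[new_name] = np.mean([data[c] for c in cols], axis=0)
    return out

# Hypothetical columns: the two stations share a common signal.
rng = np.random.default_rng(2)
n = 100
signal = rng.normal(size=n)
data = {
    "swe_bison_lake": signal + 0.05 * rng.normal(size=n),
    "swe_mcclure_pass": signal + 0.05 * rng.normal(size=n),
    "precip": rng.normal(size=n),
}
groups = {"swe_mean": ["swe_bison_lake", "swe_mcclure_pass"]}

out = average_groups(data, groups)
print(sorted(out))
```

A simple mean weights both stations equally, which is reasonable only when they measure the same quantity on the same scale.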

Conclusion

The PCA performed on the temperature, precipitation, groundwater, and snow water equivalent dataset revealed much about the data and ultimately allowed us to trim unneeded columns. Both the 3D and 2D plots show an even spread of data along each axis, which suggests the exercise was effective. Six components are needed to explain 95% of the variance in the eight supplied features. This has an upside and a downside: the dimensionality of the data is difficult to reduce further, which may cause compute issues in future modeling steps; however, the collected data is generally independent and can most likely be used in full toward our end goal of predicting streamflow.
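The number of components needed to reach a variance threshold comes from the cumulative sum of the explained variance ratios. The sketch below shows the arithmetic; the helper name `components_for_variance` is hypothetical, and the ratios are toy values chosen for illustration, not the results of this analysis.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, target=0.95):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index where cumulative >= target;
    # adding 1 converts the index to a count. The min() guards against
    # rounding error when target is 1.0.
    idx = int(np.searchsorted(cumulative, target))
    return min(idx + 1, len(cumulative))

# Toy ratios (not from the study), sorted from largest to smallest.
toy_ratios = [0.4, 0.3, 0.2, 0.1]
n_needed = components_for_variance(toy_ratios, target=0.95)
print(n_needed)
```

When the features are nearly independent, the ratios are close to equal and the count approaches the number of features, which is the situation described above.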