How does PCA work? What are its assumptions?

Aryan
Sep 17, 2025
Data Science Deep Dive

Understanding Principal Component Analysis

Principal Component Analysis (PCA) is the "Marie Kondo" of data—it helps you keep the information that "sparks joy" (variance) and discard the clutter (noise/redundancy).

How PCA Works (Step-by-Step)

1. Standardization

PCA is sensitive to scale. If one feature is "Annual Income" (thousands) and another is "Age" (tens), the income will dominate. We standardize each feature by subtracting its mean and dividing by its standard deviation, so every feature ends up with zero mean and unit variance.

$$z = \frac{x - \mu}{\sigma}$$
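The formula above can be sketched in a few lines of NumPy (the toy age/income matrix `X` is hypothetical):

```python
import numpy as np

# Toy feature matrix: rows are samples; columns are
# "Age" (tens) and "Annual Income" (thousands).
X = np.array([[25.0, 40000.0],
              [32.0, 52000.0],
              [47.0, 85000.0],
              [51.0, 91000.0]])

# z = (x - mu) / sigma, applied column-wise.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# After scaling, every feature has mean ~0 and unit variance,
# so neither column dominates the covariance computation.
print(Z.mean(axis=0))
print(Z.std(axis=0))
```

In practice `sklearn.preprocessing.StandardScaler` does the same thing and remembers the fitted means and scales for transforming new data.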

2. Covariance Matrix Computation

We calculate a matrix that expresses how the features vary together. This identifies redundancy: if two variables move in lockstep, we don't need both.
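A quick sketch of that redundancy check, using two synthetic features where one is nearly a copy of the other (the data here is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two redundant features: f2 moves in lockstep with f1.
f1 = rng.normal(size=200)
f2 = f1 + rng.normal(scale=0.05, size=200)
X = np.column_stack([f1, f2])

# rowvar=False: columns are the variables (features).
C = np.cov(X, rowvar=False)

# A large off-diagonal entry relative to the diagonals
# signals that the two features carry overlapping information.
print(C)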

3. Eigendecomposition

We compute the eigenvectors of the covariance matrix (the directions of the new axes) and the eigenvalues (how much variance each of those directions captures).

  • PC1: The direction of maximum variance.
  • PC2: Perpendicular (orthogonal) to PC1, capturing the next highest variance.
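Steps 2 and 3 together can be sketched with `np.linalg.eigh`, which is built for symmetric matrices like a covariance matrix (the 2-D data cloud below is hypothetical):

```python
import numpy as np

# Correlated 2-D data (hypothetical cloud with one dominant direction).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2)) @ np.array([[3.0, 0.0],
                                          [1.5, 0.5]])
Xc = X - X.mean(axis=0)          # center the data
C = np.cov(Xc, rowvar=False)     # covariance matrix

# eigh returns eigenvalues in ascending order; flip to descending
# so that PC1 (maximum variance) comes first.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)                        # variance captured by each PC
print(eigvecs[:, 0] @ eigvecs[:, 1]) # ~0: PC1 and PC2 are orthogonal
```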

4. Feature Vector & Projection

We decide how many components to keep (usually via a Scree Plot). Finally, we multiply the original data by the selected eigenvectors to project it onto the new, lower-dimensional space.
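In Scikit-learn, steps 2 through 4 collapse into a single `PCA` call; `explained_variance_ratio_` gives the numbers you would otherwise read off a scree plot. A minimal end-to-end sketch on the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data                    # 150 samples, 4 features
Z = StandardScaler().fit_transform(X)   # step 1: standardize

pca = PCA(n_components=2)               # keep the top 2 components
X_proj = pca.fit_transform(Z)           # covariance, eigen-decomposition,
                                        # and projection in one call
print(X_proj.shape)                     # (150, 2)
# Fraction of total variance per component (the scree-plot values).
print(pca.explained_variance_ratio_)
```

A common rule of thumb is to keep enough components to cover ~90-95% of the total variance, or to cut at the "elbow" of the scree plot.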

The Core Assumptions

Linearity

PCA assumes the relationships between variables are linear. If your data has complex curves or "spiral" patterns, standard PCA will fail to capture the structure (use Kernel PCA instead).
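The classic illustration is concentric circles: a structure plain PCA cannot unfold, but an RBF-kernel PCA can. A sketch (dataset and `gamma` value chosen for illustration):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric rings: the interesting structure is radial,
# not linear, so no straight axis separates them.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05,
                    random_state=0)

lin = PCA(n_components=2).fit_transform(X)
rbf = KernelPCA(n_components=2, kernel="rbf",
                gamma=10).fit_transform(X)

# Plotting `rbf` (e.g. colored by y) shows the rings pulled apart
# in the kernel space, while `lin` is just a rotation of the
# original tangled circles.
```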

Variance = Importance

PCA assumes that the directions with the highest variance contain the most information. It treats small variances as "noise" to be discarded.

Sensitivity to Outliers

Extreme values can massively skew the mean and covariance, leading to principal components that don't represent the bulk of the data accurately.
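A small synthetic demonstration of that skew: the bulk of the data lies along the x-axis, yet a single extreme point is enough to drag PC1 toward it (the data and helper `pc1` below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Bulk of the data: large spread along x, tiny spread along y.
X = np.column_stack([rng.normal(scale=3.0, size=100),
                     rng.normal(scale=0.3, size=100)])

def pc1(data):
    """Direction of maximum variance (first principal component)."""
    C = np.cov(data - data.mean(axis=0), rowvar=False)
    vals, vecs = np.linalg.eigh(C)
    return vecs[:, np.argmax(vals)]

clean_dir = pc1(X)                       # ~aligned with the x-axis

# Add ONE extreme outlier far along the y-axis.
X_out = np.vstack([X, [[0.0, 60.0]]])
skewed_dir = pc1(X_out)                  # flips toward the outlier

print(clean_dir)
print(skewed_dir)
```

Robust alternatives (e.g. robust covariance estimators, or removing outliers before fitting) are the usual remedy.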

Orthogonality

By construction, the new features (the principal components) are orthogonal to one another, so the projected data is uncorrelated. This is great for fixing multicollinearity.
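That decorrelation is easy to verify: feed PCA deliberately multicollinear features and check the correlation matrix of the projected scores (the synthetic features below are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# Strongly multicollinear inputs: the second feature is
# almost a scaled copy of the first.
a = rng.normal(size=500)
X = np.column_stack([a,
                     0.9 * a + rng.normal(scale=0.2, size=500),
                     rng.normal(size=500)])

scores = PCA(n_components=3).fit_transform(X)

# Correlation matrix of the PCA scores: identity (up to
# floating-point noise), i.e. the components are uncorrelated.
corr = np.corrcoef(scores, rowvar=False)
print(np.round(corr, 6))
```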

Ready to Code PCA?

Understanding the math is one thing; implementing it is another. Our 2026 Data Science cohort dives deep into Scikit-learn, PyTorch, and the linear algebra behind every algorithm.

© 2026 4Achievers Training & Placement. Empowering data-driven decision makers.