Minimizing Cluster Variance in K-Means: Total Within Sum of Squares


August 2, 2025

When I first heard the term K-Means, I was like, what in the realm of all existence is K-Means? What does it do? How do we use it? Is it dangerous, or nice? These questions and more clustered in my brain for hours. So I ran a quick search online to learn all I could about K-Means.

I can tell you that the plethora of information available instantly boggled my mind and caused information overload. And so, there I went, tumbling down the statistics rabbit hole. As I tumbled, I learned a few vital pieces of information that ignited the light of understanding in my overloaded brain.

What is K-Means?

Hence, I determined to learn the easiest and most effective way to understand K-Means. You see, K-Means has a single goal, and that is to minimize the variance, which is “a measure of the spread or dispersion within a set of data.” There are two types of variance: the population variance, usually denoted by σ², and the sample variance, usually denoted by s². This is to say that K-Means tries to keep the variance inside each cluster as small as possible, or MINIMIZE WITHIN-CLUSTER VARIANCE.

Okay, okay, I can hear my brain cells yelling, English professor… please tone it down. Let’s put this into context and break it down. Imagine you have… to have the neighbors in one group be very close to each other and far away from the other clustered group. The metric within K-Means is known as the TOTAL WITHIN-CLUSTER SUM OF SQUARES: the squared distance from each point to the center of its own group, added up across all the groups. It is analogous to THE VARIANCE ACROSS THE GROUPS, so minimizing it is another way of saying reduce the VARIANCE. This now leads me to ask: what the heck is VARIANCE?

Adam Hayes, in his Investopedia article “What Is Variance in Statistics? Definition, Formula, and Example” (May 30, 2025), defined variance as a statistical measurement of how large a spread there is within a data set. In other words, variance shows how much each number in the group differs from the average, and how different the numbers are from each other. Variance is often depicted by the symbol σ².
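To make the two flavors of variance concrete, here is a small sketch using Python's built-in statistics module. The numbers in data are made up purely for illustration and do not come from any real data set. pvariance divides the sum of squared differences from the mean by n (the population version, σ²), while variance divides by n - 1 (the sample version, s²).

```python
import statistics

# Made-up numbers, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population variance (sigma squared): average squared difference
# from the mean, dividing by n.
population_variance = statistics.pvariance(data)

# Sample variance (s squared): same idea, but dividing by n - 1.
sample_variance = statistics.variance(data)

print("population variance:", population_variance)
print("sample variance:", sample_variance)
```

The sample variance comes out a little larger than the population variance here, because the same sum of squares is divided by a smaller number.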

K-Means works with numeric data, and it defines similarity based on distance. It wants data points that are close to each other to end up in the same group, and data points that are far from each other to land in different groups.
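Here is a rough sketch of how that distance idea turns into the total within-cluster sum of squares mentioned earlier. The helper names (squared_distance, centroid, total_within_ss) and the four toy points are my own inventions for illustration, not part of any library. The sketch assumes plain Euclidean distance and compares a sensible grouping against a deliberately bad one.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Mean point (the cluster center) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def total_within_ss(clusters):
    """Add up the squared distance from every point to its own cluster's centroid."""
    total = 0.0
    for points in clusters:
        center = centroid(points)
        total += sum(squared_distance(p, center) for p in points)
    return total


# Toy data (hypothetical): neighbors grouped together vs. neighbors split apart.
tight = [[(1, 1), (1, 3)], [(8, 8), (10, 8)]]
mixed = [[(1, 1), (8, 8)], [(1, 3), (10, 8)]]

tight_ss = total_within_ss(tight)
mixed_ss = total_within_ss(mixed)

print("tight grouping:", tight_ss)
print("mixed grouping:", mixed_ss)
```

The grouping that keeps close neighbors together gets the much smaller number, and that smaller number is exactly what K-Means is chasing.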

How you perform K-means:

1. You initialize randomly, picking K starting centroids.

2. You then iterate, reassigning each data point to its nearest centroid and recalculating the centroids.

3. You stop iterating when you reach convergence, that is, when the assignments no longer change.
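The three steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under my own assumptions: the function name k_means, the seed and max_iter parameters, and the toy points are all made up for this example, the starting centroids are simply K randomly chosen data points, and distance is Euclidean. Real libraries tend to use smarter initialization and more safeguards.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Mean point of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) for a basic K-Means run."""
    rng = random.Random(seed)
    # Step 1: initialize randomly by using k of the data points as centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2a: reassign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 3: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2b: recalculate each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return labels, centroids


# Toy data (hypothetical): two obvious groups of neighbors.
points = [(1, 1), (1, 3), (8, 8), (10, 8)]
labels, centroids = k_means(points, k=2)

print("labels:", labels)
print("centroids:", centroids)
```

Which group gets called 0 and which gets called 1 depends on the random start, so the thing to look at is which points share a label, not the label numbers themselves.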

… to be continued

Raphael Rivers
