k-means Clustering: Concept and Sample Code

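Clustering (cluster analysis) analyzes the characteristics of the given data and divides it into groups (clusters) when no explicit classification criterion exists. Data points in the same cluster share similar characteristics. The properties of clustering algorithms, with k-means in mind, are as follows.

It groups the data into k clusters.
It is unsupervised learning: training uses only the characteristics of the input data (X), without label data (Y) for the input.
For algorithms such as k-means, the user defines the number of clusters (k).
Algorithms such as k-means, spectral clustering, Mean Shift, and VBGMM have been developed.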

In machine learning, clustering can be applied in many fields. For example, it can learn from users' purchase histories to group similar users together, or extract similar features from images and video and group them.

Overview of the k-means clustering algorithm

The most representative clustering algorithm is k-means. The algorithm consists of the four steps below; see Wikipedia for a detailed explanation.

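1) The initial k "means" (k=3 in the figure) are chosen at random from the data objects (shown as colored circles).

2) Each data object is assigned to its nearest mean. The regions partitioned by the means are shown as a Voronoi diagram.

3) The centroid of each of the k clusters becomes the new mean.

4) Steps 2) and 3) are repeated until convergence.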

How the k-means clustering algorithm works (source: https://ko.wikipedia.org/wiki/K-%ED%8F%89%EA%B7%A0_%EC%95%8C%EA%B3%A0%EB%A6%AC%EC%A6%98)
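
The four steps can be sketched directly in NumPy. This is a minimal illustration rather than the scikit-learn implementation: the function name kmeans_lloyd, the seed, and the handling of empty clusters are assumptions made for this example, and the input points are the same ones used in the sample code of this post.

```python
import numpy as np


def kmeans_lloyd(X, k, max_iter=100, seed=0):
    """Plain k-means following the four steps (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points at random as the initial means.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest mean (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each mean to the centroid of the points assigned to it.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: keep the old mean if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        # Step 4: stop once the means no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


X = np.array([[1, 4], [2, 2], [2, 5], [3, 3], [3, 4], [4, 7], [5, 6], [6, 4],
              [6, 7], [7, 6], [7, 9], [8, 7], [8, 9], [9, 4], [9, 8]])
labels, centers = kmeans_lloyd(X, k=3)
print('labels:', labels)
print('centers:\n', centers)
```

Because the initial means are drawn at random, a different seed can give a different labeling and sometimes a different final partition.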

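Limitations of the k-means algorithm

The k-means algorithm has the limitations listed below. Improved algorithms (k-means++, k-medoids, k-medians, fuzzy C-means clustering, etc.) exist to overcome these drawbacks.

The number of clusters k must be specified as an input parameter.
The error may converge to a local minimum rather than the global minimum.
It is sensitive to outliers.
It is not suitable for finding clusters that are not spherical.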
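
The local-minimum limitation can be observed by running k-means several times from a single random initialization each and comparing the final inertia (the within-cluster sum of squared distances). The sketch below uses synthetic data from make_blobs; the data set, the seeds, and the number of runs are assumptions chosen only for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, assumed for illustration only.
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=42)

# One random initialization per run: the final inertia depends on the starting means.
single_run_inertias = []
for seed in range(10):
    km = KMeans(n_clusters=5, init='random', n_init=1, random_state=seed).fit(X)
    single_run_inertias.append(km.inertia_)

# k-means++ seeding with 10 restarts; scikit-learn keeps the restart with the lowest inertia.
best = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=0).fit(X)

print('single-run inertia (min / max):', min(single_run_inertias), max(single_run_inertias))
print('k-means++ with n_init=10      :', best.inertia_)
```

If the minimum and maximum inertia of the single runs differ, some runs stopped in a worse local minimum. Restarting several times (n_init) and seeding with k-means++ reduce this risk but do not guarantee the global minimum.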

Figure: cluster analysis of the mouse data set (source: https://ko.wikipedia.org/wiki/K-%ED%8F%89%EA%B7%A0_%EC%95%8C%EA%B3%A0%EB%A6%AC%EC%A6%98#/media/%ED%8C%8C%EC%9D%BC:ClusterAnalysis_Mouse.svg)
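
The difficulty with non-spherical clusters can also be checked on synthetic data. The sketch below is an illustration under assumed settings (the make_moons data set, its noise level, and the seeds are not from the original post): it clusters two interleaving half circles with k-means and compares the result with the true grouping using the adjusted Rand index, where 1.0 means a perfect match.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half circles: clearly separated, but not spherical.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means splits the plane with a straight boundary between the two means.
y_pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

ari = adjusted_rand_score(y_true, y_pred)
print('adjusted Rand index:', ari)
```

An index well below 1.0 shows that k-means cuts across the two moons instead of following their shape.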

Implementing k-means with scikit-learn

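The k-means algorithm can be implemented with the scikit-learn module. Usage is shown below: import scikit-learn, pass the number of clusters, the initialization method, and the maximum number of iterations, and you get the cluster assignment for each input value.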

# import NumPy and scikit-learn
import numpy as np
from sklearn.cluster import KMeans

# define the input data
df = np.array([[1,4],[2,2],[2,5],[3,3],[3,4],[4,7],[5,6],[6,4],[6,7],[7,6],[7,9],[8,7],[8,9],[9,4],[9,8]])

# define the number of clusters
n_clusters = 3

# apply k-means and predict the cluster of each input point
kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=300, n_init=10)
y_pred = kmeans.fit_predict(df)
print(y_pred)

# example output; the label numbers can differ between runs because the initialization is random
→ [1 1 1 1 1 0 0 0 0 0 2 2 2 0 2]
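
Once the model is fitted, it can also be inspected and reused. The sketch below is a small usage example under stated assumptions: random_state=0 is added only to make the run repeatable, and new_points are hypothetical values that are not part of the original data.

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[1, 4], [2, 2], [2, 5], [3, 3], [3, 4], [4, 7], [5, 6], [6, 4],
                 [6, 7], [7, 6], [7, 9], [8, 7], [8, 9], [9, 4], [9, 8]])

# random_state is an assumption added for repeatability.
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
labels = kmeans.fit_predict(data)

# Final means of the three clusters and the within-cluster sum of squared distances.
print('cluster centers:\n', kmeans.cluster_centers_)
print('inertia:', kmeans.inertia_)

# Assign hypothetical new points to the nearest learned center.
new_points = np.array([[2, 3], [8, 8]])
new_labels = kmeans.predict(new_points)
print('predicted clusters:', new_labels)
```

fit_predict returns the same labels that are stored in kmeans.labels_, while predict only assigns points to the existing centers and does not refit the model.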

Determining the optimal number of clusters k - Silhouette analysis

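The k-means algorithm requires the number of clusters to be fixed in advance. Silhouette analysis is one way to find a suitable number of clusters: it evaluates the clustering with the silhouette function for each candidate number of clusters and picks the number with the best score. For a detailed explanation and the source code of silhouette analysis, see the scikit-learn example (https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html). The key part is summarized below.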

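from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# X is not defined in this excerpt; the make_blobs call below is assumed from the scikit-learn example.
X, y = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1, center_box=(-10.0, 10.0), shuffle=True, random_state=1)

# candidate numbers of clusters
range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    # Initialize the clusterer with n_clusters value and a random generator seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)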

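Results

The silhouette score measures how well the clusters are separated. It ranges from -1 to 1, and the closer it is to 1, the farther each sample is from the neighboring cluster relative to its own cluster, which indicates a better clustering. In the example below, judging by silhouette_score, n_clusters=2 and n_clusters=4 are suitable. (Source: scikit-learn)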

For n_clusters = 2 The average silhouette_score is : 0.7049787496083262
For n_clusters = 3 The average silhouette_score is : 0.5882004012129721
For n_clusters = 4 The average silhouette_score is : 0.6505186632729437
For n_clusters = 5 The average silhouette_score is : 0.56376469026194
For n_clusters = 6 The average silhouette_score is : 0.4504666294372765

Choosing the optimal number of clusters - Silhouette analysis (source: scikit-learn)
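
To make the score concrete, the silhouette coefficient of one sample is s = (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below computes the average directly and compares it with silhouette_score. The helper name silhouette_manual and random_state=0 are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_manual(X, labels):
    """Mean of s = (b - a) / max(a, b) over all samples (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all samples.
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # a singleton cluster gets a score of 0 by convention
        # a: mean distance to the other members of the same cluster (dist[i, i] is 0).
        a = dist[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


X = np.array([[1, 4], [2, 2], [2, 5], [3, 3], [3, 4], [4, 7], [5, 6], [6, 4],
              [6, 7], [7, 6], [7, 9], [8, 7], [8, 9], [9, 4], [9, 8]])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

manual = silhouette_manual(X, labels)
reference = silhouette_score(X, labels)
print('manual   :', manual)
print('sklearn  :', reference)
```

The two values should agree up to floating-point error, which confirms that silhouette_score is the average of the per-sample coefficients.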

For a more detailed analysis of k-means clustering, see https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html.

k-means clustering sample

The implementation of the k-means algorithm and silhouette analysis is shown below. See 05_kmean.py on GitHub.

#!/usr/bin/env python3
# -*- coding:utf-8 -*-

import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Input data: 15 two-dimensional points
df = np.array([[1,4],[2,2],[2,5],[3,3],[3,4],[4,7],[5,6],[6,4],[6,7],[7,6],[7,9],[8,7],[8,9],[9,4],[9,8]])
print('Input data:')
print(df)

# Fit k-means with 3 clusters and get the cluster label of each point
n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=300, n_init=10)
y_pred = kmeans.fit_predict(df)

# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed clusters
# https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
silhouette_avg = silhouette_score(df, y_pred)

# Compute the silhouette scores for each sample
sample_silhouette_values = silhouette_samples(df, y_pred)

print('clusters:')
print(y_pred)

# Fitted attributes and the algorithm parameter of the model
print('kmeans.inertia:', kmeans.inertia_)
print('kmeans.labels:', kmeans.labels_)
print('kmeans.algorithm:', kmeans.algorithm)

# Silhouette results for the chosen number of clusters
print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)
print('sample_silhouette_values:\n', sample_silhouette_values)

# Plot the original points
plt.scatter(df[:,0], df[:,1])
plt.savefig('05_kmeans_original.png')
plt.clf()

# Plot the points colored by cluster, with the cluster centers in red
plt.scatter(df[:,0], df[:,1], c=y_pred)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=100, c='red')
plt.savefig('05_kmeans_centers.png')
