.. _constraintkmeansrst:

===================
Constrained k-means
===================

.. only:: html

    **Links:** :githublink:`GitHub|_doc/notebooks/digressions/constraint_kmeans.ipynb|*`

The k-means algorithm builds clusters which are not necessarily balanced.
How can the original algorithm be modified so that they are?

.. code:: ipython3

    from jyquickhelper import add_notebook_menu
    add_notebook_menu()

.. contents::
    :local:

A random sample
---------------

.. code:: ipython3

    import numpy
    from sklearn.datasets import make_blobs

    n_samples = 100
    data = make_blobs(n_samples=n_samples, n_features=2, centers=2, cluster_std=1.0,
                      center_box=(-10.0, 0.0), shuffle=True, random_state=2)
    X1 = data[0]
    data = make_blobs(n_samples=n_samples // 2, n_features=2, centers=2, cluster_std=1.0,
                      center_box=(0.0, 10.0), shuffle=True, random_state=2)
    X2 = data[0]
    X = numpy.vstack([X1, X2])
    X.shape

.. parsed-literal::

    (150, 2)

.. code:: ipython3

    %matplotlib inline

.. code:: ipython3

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(1, 1, figsize=(4, 4))
    ax.plot(X[:, 0], X[:, 1], '.')
    ax.set_title('4 clusters');

.. image:: constraint_kmeans_5_0.png

Classical k-means
-----------------

The classical k-means model.

.. code:: ipython3

    from sklearn.cluster import KMeans

    km = KMeans(n_clusters=4)
    km.fit(X)
    cl = km.predict(X)

.. code:: ipython3

    colors = 'brgy'

    fig, ax = plt.subplots(1, 1, figsize=(4, 4))
    for i in range(0, max(cl) + 1):
        ax.plot(X[cl == i, 0], X[cl == i, 1], colors[i] + '.', label='cl%d' % i)
        x = [km.cluster_centers_[i, 0], km.cluster_centers_[i, 0]]
        y = [km.cluster_centers_[i, 1], km.cluster_centers_[i, 1]]
        ax.plot(x, y, colors[i] + '+')
    ax.set_title('KMeans 4 clusters')
    ax.legend();

.. image:: constraint_kmeans_8_0.png

Constrained k-means
-------------------

And now the version in which every cluster contains approximately the same
number of points.

**not finished yet**

.. code:: ipython3

    import warnings
    warnings.simplefilter("ignore", FutureWarning)

.. code:: ipython3

    from mlinsights.mlmodel import ConstraintKMeans

    km = ConstraintKMeans(n_clusters=4, balanced_predictions=True)
    km.fit(X)
    cl = km.predict(X)

.. code:: ipython3

    fig, ax = plt.subplots(1, 1, figsize=(4, 4))
    for i in range(0, max(cl) + 1):
        ax.plot(X[cl == i, 0], X[cl == i, 1], colors[i] + '.', label='cl%d' % i)
        x = [km.cluster_centers_[i, 0], km.cluster_centers_[i, 0]]
        y = [km.cluster_centers_[i, 1], km.cluster_centers_[i, 1]]
        ax.plot(x, y, colors[i] + '+')
    ax.set_title('ConstraintKMeans 4 clusters')
    ax.legend();

.. image:: constraint_kmeans_12_0.png

The algorithm tries to swap points between two balanced classes; a swap is
kept only if it improves the inertia of the whole point cloud, see
*Same-size k-Means Variation*.

Constrained k-means - variants
------------------------------

Another, simpler variant first assigns the points closest to their nearest
centre; the farthest points are processed last and assigned to whatever room
remains. The resulting clusters are no longer convex.
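Below is a minimal sketch of that greedy idea, written for illustration only:
it is not the implementation used by ``ConstraintKMeans``, and the helper name
``greedy_balanced_assignment`` as well as the capacity rule (at most
``ceil(n_samples / n_clusters)`` points per cluster) are assumptions made here.

.. code:: ipython3

    import numpy

    def greedy_balanced_assignment(X, centers):
        # Hypothetical helper, for illustration only: assign the points
        # closest to a centre first, each cluster accepting at most
        # ceil(n_samples / n_clusters) points.
        n_clusters = centers.shape[0]
        capacity = int(numpy.ceil(X.shape[0] / n_clusters))
        # squared distances between every point and every centre
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        order = numpy.argsort(dist.min(axis=1))  # closest points first
        counts = numpy.zeros(n_clusters, dtype=numpy.int64)
        labels = -numpy.ones(X.shape[0], dtype=numpy.int64)
        for idx in order:
            for c in numpy.argsort(dist[idx]):  # preferred centres in order
                if counts[c] < capacity:
                    labels[idx] = c
                    counts[c] += 1
                    break
        return labels

    # example: rebalance the assignment around the centres fitted above
    greedy_labels = greedy_balanced_assignment(X, km.cluster_centers_)
    numpy.bincount(greedy_labels)

By construction no cluster can receive more than 38 of the 150 points, which
mimics the balancing constraint even though the boundaries between clusters
are no longer convex. The strategies implemented in ``ConstraintKMeans`` are
compared below.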
.. code:: ipython3

    from mlinsights.mlmodel import ConstraintKMeans

    res = {}
    for strategy in ['distance', 'gain']:
        km = ConstraintKMeans(n_clusters=4, balanced_predictions=True, strategy=strategy)
        km.fit(X)
        cl = km.predict(X)
        res[strategy] = (cl, km)

    fig, axs = plt.subplots(1, 2, figsize=(10, 4))
    nb = 0
    for k, v in sorted(res.items()):
        ax = axs[nb]
        cl, km = v
        for i in range(0, max(cl) + 1):
            ax.plot(X[cl == i, 0], X[cl == i, 1], colors[i] + '.', label='cl%d' % i)
            x = [km.cluster_centers_[i, 0], km.cluster_centers_[i, 0]]
            y = [km.cluster_centers_[i, 1], km.cluster_centers_[i, 1]]
            ax.plot(x, y, colors[i] + '+')
        ax.set_title('ConstraintKMeans - ' + k)
        ax.legend()
        nb += 1

.. image:: constraint_kmeans_15_0.png

The *distance* version does not look optimal at the periphery of the clusters.
The *gain* version is not perfect yet either. Let's look at the result without
a prior initialization with a classical *k-means*.

.. code:: ipython3

    from mlinsights.mlmodel import ConstraintKMeans

    res = {}
    for strategy in ['distance', 'gain']:
        km = ConstraintKMeans(n_clusters=4, balanced_predictions=True, strategy=strategy,
                              kmeans0=False, random_state=0)
        km.fit(X)
        cl = km.predict(X)
        res[strategy] = (cl, km)

    fig, axs = plt.subplots(1, 2, figsize=(10, 4))
    nb = 0
    for k, v in sorted(res.items()):
        ax = axs[nb]
        cl, km = v
        for i in range(0, max(cl) + 1):
            ax.plot(X[cl == i, 0], X[cl == i, 1], colors[i] + '.', label='cl%d' % i)
            x = [km.cluster_centers_[i, 0], km.cluster_centers_[i, 0]]
            y = [km.cluster_centers_[i, 1], km.cluster_centers_[i, 1]]
            ax.plot(x, y, colors[i] + '+')
        ax.set_title('ConstraintKMeans - ' + k)
        ax.legend()
        nb += 1

.. image:: constraint_kmeans_17_0.png
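To make the balancing effect measurable rather than only visual, a short
check (not part of the original notebook) can count the points per cluster
for both strategies stored in ``res``; with 150 points and 4 clusters, a
balanced result should stay close to 37 or 38 points per cluster.

.. code:: ipython3

    import numpy

    # number of points assigned to each of the 4 clusters, per strategy
    for strategy, (cl, _) in sorted(res.items()):
        print(strategy, numpy.bincount(cl, minlength=4))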