{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. K-means algorithm\n", "\n", "**Question** What are the consecutive steps of the k-means algorithm?\n", "\n", "**Question** How can we choose the initial cluster centroids?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**\n", "Given the following grades of 5 students, we want to divide the students into 2 groups:\n", "\n", "| Student | A | B |\n", "|---------|-----|-----|\n", "| 1 | 1.0 | 1.0 |\n", "| 2 | 1.5 | 2.0 |\n", "| 3 | 3.0 | 3.0 |\n", "| 4 | 5.0 | 7.0 |\n", "| 5 | 3.5 | 5.0 |\n", "\n", "We have chosen the two students that are farthest apart (using Euclidean distance) as the initial cluster centroids:\n", "\n", "|Cluster|Centroid|A |B |\n", "|-------|--------|---|---|\n", "|C1 |k1 |1.0|1.0|\n", "|C2 |k2 |5.0|7.0|\n", "\n", "Perform the first iteration of k-means: assign every student to a cluster and compute the new centroids of these clusters.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question** When should the algorithm stop?\n", "\n", "**Question** What advantages and disadvantages of k-means clustering can you find?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. K-means with scikit-learn\n", "\n", "### 2.1. Download the files mouse.csv and lines.csv. Each contains multiple examples described by 2 attributes. You are given functions to read the files and plot the data. Use these functions to plot the data from both files. Can you manually identify 3 clusters in each of the files?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import csv\n", "import numpy as np\n", "from matplotlib import pyplot as plt\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "def read_file(path):\n", "    # read all rows as floats (QUOTE_NONNUMERIC converts unquoted fields)\n", "    with open(path, newline='') as csvfile:\n", "        reader = csv.reader(csvfile, quoting=csv.QUOTE_NONNUMERIC)\n", "        data = [row for row in reader]\n", "    # standardize each attribute to zero mean and unit variance\n", "    data = StandardScaler().fit_transform(data)\n", "    return np.array(data)\n", "\n", "def plot_data(data):\n", "    # scatter plot of the first attribute against the second\n", "    plt.scatter(data[:, 0], data[:, 1])\n", "    plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO call functions above and try to find clusters in obtained datasets\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2. Now, let's try k-means on the obtained datasets. Again, you are given a function to visualize the result. Your task is to use KMeans with proper parameters on the \"mouse\" and \"lines\" datasets and check whether the clusters generated by k-means match the ones you suggested in the previous exercise.\n", "\n", "See documentation and examples: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def visualize_clusters(clusters, centroids):\n", "    # clusters: list of numpy arrays (each array with examples in one cluster)\n", "    # centroids: numpy array\n", "    for c in clusters:\n", "        plt.scatter(c[:, 0], c[:, 1])\n", "    plt.scatter(centroids[:, 0], centroids[:, 1], marker='+', color='black', s=100)\n", "    plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "# TODO use KMeans to cluster mouse and lines. 
Visualize and analyze the obtained clusters.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 4 }