{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. K-NN classifier\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this part of the laboratory, we will preprocess and classify a set of biomedical voice measurements. Some of them were recorded from people with Parkinson's disease.\n", "\n", "More about the dataset: https://archive.ics.uci.edu/ml/datasets/parkinsons\n", "\n", "First, we load the required packages:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif, f_classif\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn import metrics\n", "from sklearn import preprocessing\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Data loading and analysis of the attributes\n", "Let's start with the data preparation.\n", "#### 2.1. Load the dataset from the file parkinsons.data into a data frame using the pandas library (pd.read_csv). Write the body of the read_data function so that it returns a data frame with the attributes and a list with the class labels. The classes are available in the 'status' column, which should therefore not remain among the attributes. You should also remove the 'name' column from the data (see the drop method of DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def read_data(path):\n", "    # TODO replace the following line with your code\n", "    # (the placeholder returns an empty data frame and an empty label list)\n", "    return pd.DataFrame(), []\n", "\n", "data_X, data_Y = read_data(\"parkinsons.data\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see what we have loaded."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_X.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_Y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2. Let's analyse the given data.\n", "* How many attributes are in the given data?\n", "* Are the attributes on a common scale?\n", "* Are the observations equally distributed between sick and healthy people?\n", "\n", "Plot a histogram of the assigned classes and analyse the distribution.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#TODO\n", "# plt.hist(...)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the histograms of the first 5 attributes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Train and test set selection\n", "\n", "#### We want to build our classifier and test it on another set of observations.\n", "\n", "To split the data into train and test sets, use the train_test_split function from the sklearn.model_selection module (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). Use 80% of the cases in the train set and 20% in the test set. \n", "Use random_state = 5 just to be sure we all have the same rows in the train and test sets :)\n", "\n", "split_data should return a tuple containing: a data frame with the train set attributes, a list of labels for the train data, a data frame with the test set attributes, and a list of labels for the test data."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def split_data(data_X, data_Y, test_percent = 20, random_state=5):\n", "    # TODO replace the following line with your code\n", "    return data_X, data_Y, pd.DataFrame(), []\n", "\n", "(train_X, train_Y, test_X, test_Y) = split_data(data_X, data_Y)\n", "print(\"rows in train set: \", train_X.shape[0])\n", "print(\"rows in test set:\", test_X.shape[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Data standardization/normalization\n", "#### 4.1. Use the k-NN algorithm with k=3 to classify the obtained test set. What is the accuracy of the classification?\n", "\n", "Use the KNeighborsClassifier class from the sklearn.neighbors module. Useful methods: fit and predict. The classification accuracy can be obtained with the accuracy_score function from sklearn.metrics. The function get_classification_accuracy should return the accuracy obtained on the given test set by a model built on the train set.\n", "\n", "https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_classification_accuracy(train_data_X, train_data_Y, test_data_X, test_data_Y, k = 3):\n", "    # TODO replace the following line with your code\n", "    return 0\n", "\n", "get_classification_accuracy(train_X, train_Y, test_X, test_Y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.2. Perform some normalization or standardization of the attributes. Then repeat the classification. Does the classification accuracy change?\n", "\n", "You can use sklearn.preprocessing.StandardScaler, sklearn.preprocessing.MinMaxScaler or sklearn.preprocessing.MaxAbsScaler and their fit_transform/transform methods (fit the scaler on the train set only, then apply the same transformation to the test set).\n", "\n", "Try other standardization methods to check whether the choice of procedure influences the classification accuracy.\n", "standarize_train_and_test should return two data frames: the normalized train set and the normalized test set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def standarize_train_and_test(train_X, test_X):\n", "    # TODO replace the following line with your code\n", "    return train_X, test_X\n", "\n", "norm_train_X, norm_test_X = standarize_train_and_test(train_X, test_X)\n", "get_classification_accuracy(norm_train_X, train_Y, norm_test_X, test_Y)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# 5. Choosing k value\n", "Using the normalized/standardized train set obtained in the previous exercises, run the k-NN algorithm for k from 1 to 20. Use 5-fold cross-validation within the train set to obtain the classification accuracy for each k. Plot the obtained accuracy as a function of k. Which k value seems to be the best for the given dataset?\n", "\n", "See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html for more info about cross-validation in sklearn." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6. Testing classifier\n", "Train the k-NN classifier again using the best k value obtained above and evaluate it on the test set to check the final classification accuracy. (You can just call the previously written function get_classification_accuracy.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO\n", "# get_classification_accuracy(..."
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 4 }