{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. K-NN classifier\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In further part of the laboratory, we will perform a preprocessing of the data and a classification of a set of biomedical voice measurements. Some of them has been recorded for people with Parkinson's desease.\n",
    "\n",
    "More about the dataset: https://archive.ics.uci.edu/ml/datasets/parkinsons\n",
    "\n",
    "First, we load the required packages:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif, f_classif\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn import metrics\n",
    "from sklearn import preprocessing\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Data loading and analysis of the attributes\n",
    "Let's start with the data preparation, \n",
    "#### 2.1. Load the dataset from file parkinsons.csv into data frame using library pandas (pd.read_csv). Write the body of the read_data function to return a data frame with attributes and a list with class labels. Classes are available in 'status' column. You should also remove column 'name' from the data (see function drop of dataFrame https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_data(path):\n",
    "    #TODO replace the following line with our code\n",
    "    return pd.DataFrame()\n",
    "\n",
    "data_X, data_Y = read_data(\"parkinsons.data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see what we have loaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_X.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_Y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.2. Let's analyse the given data. \n",
    "* How many attributes are in given data?\n",
    "* Are the attributes on the common scale?\n",
    "* Are observations equally distributed for sick and healthy people?\n",
    "\n",
    "Plot the histogram of the assigned class and analyse the distribution.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#TODO\n",
    "# plt.hist(...)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot the histograms of the first 5 attributes. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#TODO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Train and test set selection\n",
    "\n",
    "#### We want to build our classifier and test it on another set of observations.\n",
    "\n",
    "To split data into train and test sets use train_test_split method from sklearn.model_selection module (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). Use 80% of cases in train set and 20% in test set. \n",
    "Use random_state = 5 just to be sure we all have the same rows in train and test sets :)\n",
    "\n",
    "split_data should return a tuple containing: dataframe with train set attributes, list of labels for train data, dataframe with test set attributes and a list of labels for test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def split_data(data_X, data_Y, test_percent = 20, random_state=5):\n",
    "#   TODO replace the following line with your code\n",
    "    return data_X, data_Y, pd.DataFrame(), []\n",
    "    \n",
    "(train_X, train_Y, test_X, test_Y) = split_data(data_X, data_Y)\n",
    "print(\"rows in train set: \", train_X.shape[0])\n",
    "print(\"rows in test set:\", test_X.shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. Data standarization/normalization\n",
    "#### 4.1. Use k-nn algorithm to classify the obtained test set using k=3. What is the accuracy of the classification?\n",
    "\n",
    "Use KNeighborsClassifier class from sklearn.neighbors module. Useful methods: fit and predict. Classification accuracy can be obtained with accuracy_score method from sklearn.metrics. Function get_classification_accuracy should return the accuracy of classification of given test set on model build with train set.\n",
    "\n",
    "https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_classification_accuracy(train_data_X, train_data_Y, test_data_X, test_data_Y, k = 3):\n",
    "#     TODO\n",
    "    return 0\n",
    "\n",
    "get_classification_accuracy(train_X, train_Y, test_X, test_Y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.2. Perform some normalization or standarization of attributes. Then repeat the classification. Do the classification accuracy change?\n",
    "\n",
    "You can use sklearn.preprocessing.StandardScaler, sklearn.preprocessing.MinMaxScaler or sklearn.preprocessing.MaxAbsScaler and their fit_transform/transform methods.\n",
    "\n",
    "Try other standarization methods to verify the standarization procedure influence the classification accuracy.\n",
    "standarize_train_and_test should return 2 dataFrames - with normalized train and normalized test sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def standarize_train_and_test(train_X, test_X):\n",
    "    #   TODO replace the following line with your code\n",
    "    return train_X, test_X\n",
    "\n",
    "norm_train_X, norm_test_X = standarize_train_and_test(train_X, test_X)\n",
    "get_classification_accuracy(norm_train_X, train_Y, norm_test_X, test_Y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. Choosing k value\n",
    "Using obtained in previous exercices train set with normalization/standarization use k-nn algoritm using k from 1 to 20. Use 5-fold cross-validation within the train set to obtain the classification accuracy. Plot the obtained accuracy of the classification. Which k value seems to be the best for the given dataset?\n",
    "\n",
    "See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html for more info about cross validation in sklearn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 6. Testing classifier\n",
    "Train the k-NN classifier again and test it using the obtained best k value on a test set to check the final classification accuracy. You can just call the previous written function get_classification_accuracy)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO\n",
    "# get_classification_accuracy(..."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}