$$ \text{T1 - Principal component analysis} $$

Data: For this labwork, we chose two datasets from the UCI Machine Learning Repository, namely the Iris dataset and the Breast Cancer Wisconsin dataset.

$$ \text{List Group} \newline \text{BA10-002 - Nguyễn Quang Anh} \newline \text{BI11-164 - Bùi Đắc Minh}\newline $$

Iris dataset

Statistical exploration

First, we read the Iris data into a DataFrame using pandas (a Python package for data exploration).

!wget https://archive.ics.uci.edu/static/public/53/iris.zip
!unzip iris.zip

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

column = ["sepal length", "sepal width", "petal length", "petal width", "class"]
df = pd.read_csv("iris.data", sep=',', header=None)
df.columns = column
myclass = df["class"]
df

There are 4 features recorded for each Iris sample, and the samples are divided into 3 different classes (Iris Setosa, Iris Versicolour, Iris Virginica). All 4 attributes are continuous (they can take any real value, e.g. 5.1 or 5.11111) and quantitative (their values are numerical and can be ordered, e.g. 5.1 > 4.9).

We continue the exploration by calculating the mean, variance, covariance, and correlation:

type(df["sepal length"][0])
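The four statistics named above can be computed directly with pandas. The following is a minimal sketch, not the notebook's original cell: it assumes a small stand-in frame called `sample` (six rows copied from `iris.data`) in place of the full `df`, and it selects the numeric columns first so that the string-valued `class` column does not interfere.

```python
import pandas as pd

# Six rows copied from iris.data, used only as a stand-in for the full df.
sample = pd.DataFrame(
    [
        [5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
        [4.9, 3.0, 1.4, 0.2, "Iris-setosa"],
        [7.0, 3.2, 4.7, 1.4, "Iris-versicolor"],
        [6.4, 3.2, 4.5, 1.5, "Iris-versicolor"],
        [6.3, 3.3, 6.0, 2.5, "Iris-virginica"],
        [5.8, 2.7, 5.1, 1.9, "Iris-virginica"],
    ],
    columns=["sepal length", "sepal width", "petal length", "petal width", "class"],
)

# Keep only the numeric features; "class" holds strings.
numeric = sample.select_dtypes(include="number")

means = numeric.mean()          # per-feature mean
variances = numeric.var()       # per-feature sample variance (ddof=1)
covariances = numeric.cov()     # 4x4 covariance matrix
correlations = numeric.corr()   # 4x4 Pearson correlation matrix

print(means, variances, covariances, correlations, sep="\n\n")
```

Applied to the full `df`, the same four calls should give the tables discussed in the text; the numbers printed for this six-row sample will differ from those.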

Looking at the df.corr() table above, the most correlated pair of features is clearly petal length and petal width (with a correlation of 0.962757, very close to 1). They are positively correlated: if one attribute is high, we can predict that the other is also high.
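Instead of reading the table by eye, the strongest pair can also be picked out programmatically. The sketch below is an illustration under assumptions: `most_correlated_pair` is a hypothetical helper, not part of pandas, and the `toy` frame is made-up data chosen so that the answer is known in advance.

```python
from itertools import combinations

import pandas as pd


def most_correlated_pair(frame):
    """Return (col_a, col_b, r) for the pair of numeric columns with the largest |r|."""
    corr = frame.select_dtypes(include="number").corr()
    # Compare every unordered pair of distinct columns by absolute Pearson correlation.
    col_a, col_b = max(
        combinations(corr.columns, 2),
        key=lambda pair: abs(corr.loc[pair[0], pair[1]]),
    )
    return col_a, col_b, float(corr.loc[col_a, col_b])


# Made-up data: y is exactly 2 * x, while z is only weakly related to either.
toy = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5],
        "y": [2, 4, 6, 8, 10],
        "z": [5, 1, 4, 2, 3],
    }
)

pair = most_correlated_pair(toy)
print(pair)
```

Calling `most_correlated_pair(df)` on the Iris frame should select petal length and petal width, given the correlation table discussed above.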

To demonstrate this relationship, we plot this pair of features using matplotlib (a visualization library in Python).

plt.figure(figsize=(5, 5))
colors = {'Iris-setosa':'red', 'Iris-versicolor':'blue', 'Iris-virginica':'green'}
plt.scatter(df[df.columns[2]], df[df.columns[3]], c=df["class"].apply(lambda x: colors[x]))
plt.xlabel(df.columns[2])
plt.ylabel(df.columns[3])

# Change 0-1-2

Output:

df.mean(numeric_only=True)  # skip the string-valued "class" column

Output: