Introduction to statistics

Biomedical Engineering - academic year 2023-2024

Maud Delattre

Topics covered in this course

  • Descriptive statistics

  • Basics of probability

  • Estimation

  • Confidence intervals

  • Tests

  • Advanced exploratory data analysis tools

What will you learn in this course?

  • Data exploration: describe and understand the structure of a dataset.

  • Modelling: move from a biological question to a statistical model accounting for the nature of the data and sources of variability.

  • Statistical inference: obtain information about the (unknown) parameters of the model:

    • estimate the parameter,

    • give a (restricted) range of plausible values for the parameter,

    • decide whether the parameter belongs to a given interval or not.

  • R programming: data analysis.

First session: descriptive statistics

Objective

Learn how to use figures and graphs to

  1. Describe a sample

    • “How are individuals distributed w.r.t. height? To age?”

    • “How are individuals distributed between male and female?”

  2. Evaluate the relationship between descriptors

    • “How do height and age seem to be related?”

    • “Does the height distribution look the same for male and female?”

Definitions

Population vs sample

  • Population: entire group that you want to draw conclusions about (not always people)

  • Sample: a subset of the population on which data are actually collected (a smaller, manageable version of the population of interest)

  • Unit/individual: observed member of the population of interest

Why work on samples rather than the entire population of interest?

  • the population size is too large

  • data collection is costly

  • \(\ldots\) time consuming

  • \(\ldots\) and sometimes destructive

Variables

Variable: a descriptor, i.e. one characteristic measured on the sample

  • Quantitative/ numerical variables: characteristics that can be counted or measured

    • Continuous: variables that can take an infinite number of real values within a given interval

    • Discrete: variables that can take only a finite or countably infinite number of values

  • Qualitative/ categorical variables: values corresponding to categories (levels)

    • Nominal: levels without natural order

    • Ordinal: order relation between levels

Important

The distinction between categorical and quantitative variables is important because they are analyzed differently.

Raw data tables

The values taken by the variables on \(n\) individuals are presented in a table:

\[\left[\begin{array}{cccc} x_{1} & y_{1} & \ldots & z_{1} \\ x_{2} & y_{2} & \ldots & z_{2} \\ \vdots & \vdots & \vdots & \vdots \\ x_{n} & y_{n} & \ldots & z_{n} \end{array}\right]\]

  • \(n\) is the sample size.

  • Each row (resp. column) represents an individual (resp. a variable).

  • The sequence of values taken by a variable on the observed individuals is called a statistical series. The statistical series for variable \(X\) is denoted by \(x_1,x_2,\ldots,x_n\).

Example

Data

A data subset of licorice_gargle, an R dataset from a study enrolling adult patients undergoing elective thoracic surgery.

str(data)
'data.frame':   235 obs. of  4 variables:
 $ gender     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ BMI        : num  33 23.7 26.8 28.4 30.4 ...
 $ age        : num  67 76 58 59 73 61 66 61 83 69 ...
 $ surgerySize: Factor w/ 3 levels "1","2","3": 2 1 2 3 2 3 3 1 1 2 ...
head(data)
  gender   BMI age surgerySize
1      0 32.98  67           2
2      0 23.66  76           1
3      0 26.83  58           2
4      0 28.39  59           3
5      0 30.45  73           2
6      0 35.49  61           3

Questions

  • Sample size?

  • Nature of the variables?

Univariate descriptive statistics

Examining qualitative variables (1/3)

Example: surgery size

1- The statistical series can be summarized in a frequency table showing counts within each category

table(data$surgerySize)

  1   2   3 
 56 158  21 

or showing proportions within each category

prop.table(table(data$surgerySize))

        1         2         3 
0.2382979 0.6723404 0.0893617 

2- The mode of the series is the most frequently observed value.

Remark: An observed distribution may have several modes.
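In R, the mode of a qualitative series can be read off the frequency table; a minimal sketch on the example data:

counts <- table(data$surgerySize)
names(counts)[which.max(counts)]   # the (first) most frequent category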

Examining qualitative variables (2/3)

Bar plot: height of the bars proportional to the numbers or proportions in each category of the variable

library(ggplot2)  # plotting
library(dplyr)    # count(), mutate() and the %>% pipe

ggplot(data, aes(x=surgerySize)) + geom_bar() + xlab("Surgery size")

data.surgery <- data %>% count(surgerySize) %>% 
  mutate(perc = n / nrow(data)) 
ggplot(data.surgery, aes(x = surgerySize, y = perc)) + geom_bar(stat = "identity")

Examining qualitative variables (3/3)

Pie chart: piece areas are proportional to the proportions in each category of the variable

ggplot(data.surgery, aes(x = "", y = perc,fill=surgerySize)) +
geom_bar(stat = "identity",width=1) + coord_polar("y", start=0)

Examining quantitative variables (1/9)

  • Centrality measures, which tell us what a typical observation looks like.

  • Measures of dispersion, which tell us how observations are spread out around this typical individual.

Examining quantitative variables (2/9)

The mean (arithmetic) of a series \(\{x_i,i=1,\ldots,n\}\) is defined as: \[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Caution

What you need to remember about the mean:

  • each series has only one mean,

  • it is rarely a value from the series,

  • it is susceptible to outliers.

Property :

Define the new series \(y\) as: \(y=ax+b\).

What is the formula of the arithmetic mean of the series \(y\) as a function of \(a\), \(b\) and \(\bar{x}\)?

Examining quantitative variables (3/9)

  • Mode

  • Median \(x_{1/2}\): {Nb of values \(\geq x_{1/2}\)} \(=\) {Nb of values \(\leq x_{1/2}\) }

  • Quantile of order \(p\), \(x_p\): value such that a proportion \(p\) of the observations is less than \(x_p\).

Example: BMI

summary(data$BMI)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.60   22.57   25.91   25.59   28.38   36.33 

Example: age

summary(data$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   47.00   62.00   57.37   68.00   86.00 

Examining quantitative variables (4/9)

1- Variance: measure of the dispersion of the series around the mean

  • empirical variance: \[ s_{emp,x}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \bar{x^2} - (\bar{x})^2 \]

  • corrected variance: \[ s_{corr,x}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \sum_{i=1}^{n} x_i^2 - \frac{n}{n-1} (\bar{x})^2 \]

Property:

Define the new series \(y\) as: \(y=ax+b\).

What is the formula of the variance of the series \(y\) as a function of \(a\), \(b\) and \(s^2_{x}\)?

2- Standard deviation: square root of the variance; either \(s_{emp,x} = \sqrt{s_{emp,x}^2}\) or \(s_{corr,x} = \sqrt{ s_{corr,x}^2}\)
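In R, var() and sd() compute the corrected versions; a small sketch on the age series, rescaling to recover the empirical variance:

x <- data$age
n <- length(x)
var(x)                 # corrected variance (denominator n - 1)
var(x) * (n - 1) / n   # empirical variance (denominator n)
sd(x)                  # corrected standard deviation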

Examining quantitative variables (5/9)

  • Coefficient of variation: \[ CV = \frac{s_x}{\bar{x}} \] Unit?

  • Range: difference between the largest and the smallest value of the series: \[ E = x_{(n)} - x_{(1)} \] Remark: susceptible to outliers

  • Inter-quartile range: difference between third and first quartiles of the series: \[ E_{IQ} = x_{3/4} - x_{1/4} \]
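These quantities are directly available in R; a sketch on the BMI series from the example:

x <- data$BMI
sd(x) / mean(x)   # coefficient of variation (dimensionless)
diff(range(x))    # range E = max - min
IQR(x)            # inter-quartile range E_IQ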

Examining quantitative variables (6/9)

Example: age

library(pastecs)  # provides stat.desc()
stat.desc(data$age)
     nbr.val     nbr.null       nbr.na          min          max        range 
2.350000e+02 0.000000e+00 0.000000e+00 1.800000e+01 8.600000e+01 6.800000e+01 
         sum       median         mean      SE.mean CI.mean.0.95          var 
1.348200e+04 6.200000e+01 5.737021e+01 1.008554e+00 1.987006e+00 2.390376e+02 
     std.dev     coef.var 
1.546084e+01 2.694925e-01 

Example: BMI

stat.desc(data$BMI)
     nbr.val     nbr.null       nbr.na          min          max        range 
 235.0000000    0.0000000    0.0000000   15.6000000   36.3300000   20.7300000 
         sum       median         mean      SE.mean CI.mean.0.95          var 
6013.9900000   25.9100000   25.5914468    0.2786695    0.5490218   18.2493261 
     std.dev     coef.var 
   4.2719230    0.1669278 

Examining quantitative variables (7/9)

Representation of the series distribution via histograms
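For instance, a histogram of age could be drawn along these lines (the number of bins is an arbitrary choice):

ggplot(data, aes(x = age)) + geom_histogram(bins = 15) + xlab("Age")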

Examining quantitative variables (8/9)

or via boxplots

Boxplots

Examining quantitative variables (9/9)

Example: age
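A boxplot of age could be produced, for instance, with:

ggplot(data, aes(y = age)) + geom_boxplot() + ylab("Age")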

Bivariate descriptive statistics

Between quantitative variables (1/3)

The scatter plot is used to represent two quantitative variables simultaneously and to get an idea of the nature of the relationship between them.

Example
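For instance, for age and BMI from the example data:

ggplot(data, aes(x = age, y = BMI)) + geom_point() + xlab("Age") + ylab("BMI")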

Between quantitative variables (2/3)

The correlation coefficient is a measure of linear relationship between two quantitative variables.

Definition \[ r = \frac{s_{xy}}{s_x s_y} \] where \(s_x\) and \(s_y\) are the standard deviations of \(x\) and \(y\), and \(s_{xy}\) is the covariance between \(x\) and \(y\): \[ s_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}) (y_i - \bar{y}) \]

Example

cor(data$age,data$BMI)
[1] 0.3138115

Properties and interpretation of the correlation coefficient

  • \(-1 \leq r \leq 1\),

  • if \(r<0\): negative/decreasing linear relationship,

  • if \(r>0\): positive/increasing linear relationship,

  • if \(|r|\approx1\): very strong linear relationship,

  • if \(r=0\): no linear relationship.

Between qualitative variables

A contingency table shows the joint frequencies of two qualitative variables, with the levels of one variable in rows and those of the other in columns.

tab <- table(data$gender,data$surgerySize)
rownames(tab) <- c("Male","Female")           # gender is coded 0/1 in the data
colnames(tab) <- c("Small","Medium","Large")  # surgerySize is coded 1/2/3
tab
        
         Small Medium Large
  Male      32     93    17
  Female    24     65     4
proportions(tab)
        
              Small     Medium      Large
  Male   0.13617021 0.39574468 0.07234043
  Female 0.10212766 0.27659574 0.01702128
proportions(tab, margin=1)
        
              Small     Medium      Large
  Male   0.22535211 0.65492958 0.11971831
  Female 0.25806452 0.69892473 0.04301075
proportions(tab, margin=2)
        
             Small    Medium     Large
  Male   0.5714286 0.5886076 0.8095238
  Female 0.4285714 0.4113924 0.1904762
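The joint distribution can also be visualized, for instance with side-by-side bars (a minimal ggplot2 sketch; gender appears with its raw 0/1 coding unless the factor levels are relabelled):

ggplot(data, aes(x = surgerySize, fill = gender)) + geom_bar(position = "dodge")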

Take home message

  • Describing the data at hand is an essential step prior to any statistical study.

  • It provides some data summaries and can be used to highlight patterns in the data.

  • We have seen univariate and bivariate descriptive statistics tools.

  • Different tools are used depending on the nature of the variables.

Second session: some notions of probability

Why this probability class?

  • Kind of toolbox for the rest of the statistics course

    • Statistical models

    • Estimators, confidence intervals and tests

  • Introduce essential concepts and definitions (e.g. random variable, realization of a random variable)

  • Introduce the key probability distributions (Gaussian, Fisher, Student, Chi-square)

Preliminary definitions

Random experiment

  • A random experiment is one

    • whose result cannot be (completely) predicted in advance,

    • which, if repeated several times under identical conditions, can give rise to different results.

  • Some examples?

  • Throwing a die

  • Flipping a coin

  • Assessing whether a patient treated with a new drug will recover or not (or will develop side effects or not)

  • Evaluating, at the time of diagnosis of a disease, what will be the evolution of the patient’s symptoms

Random variable

  • An abstract way to talk about the outcomes of random experiments

  • Random variable (\(X\)):

    • process that relates the random experiment to a value

    • characteristic that is being measured in the random experiment

    • something whose value cannot be known in advance

  • Examples …

Realization of a random variable

  • Realization (equivalently, observation):

    • value obtained for the characteristic (random variable) of interest \(X\) at the end of a random experiment

    • what the experimenter actually observes

  • Note that you cannot observe a random variable \(X\) itself …

  • Examples …

Domain

  • The domain of a random variable \(X\) is the set of values that \(X\) can take after the experiment has been carried out (set of possible values for \(X\)).

  • The domain of a random variable \(X\) is often denoted \(D_X\).

  • Examples …

  • Remark: \(D_X\) is different depending on the nature of \(X\).

    • Qualitative r.v.: discrete set corresponding to the set of \(X\)’s possible categories (levels)

    • Discrete quantitative r.v.: (finite or infinite) discrete set of numerical values

    • Continuous quantitative r.v.: open or closed interval (e.g. the set of real numbers, the set of positive real numbers, the set of real numbers between 0 and 1).

Characterizing random variables

Distribution of a random variable

  • The (probability) distribution of r.v. \(X\) indicates how the values of \(X\) are distributed over the individuals of the population under study (do some realizations occur more or less often than others?).

  • It is defined differently depending on whether \(X\) is discrete or continuous.

P.d. of a discrete r.v.

  • For discrete r.v., we can enumerate all of its possible realizations and associate a specific probability with each possible realization.

  • Definition: Let \(J\) be the size of the domain of definition of \(X\), \(a_1\), \(a_2\), \(\ldots\), \(a_J\) be the possible values of \(X\) and \(p_j = P(X = a_j)\), \(j = 1,\ldots,J\) be the set of the probabilities associated with the different values in \(D_X\). Then, the probability distribution of \(X\) is the set \(\{(a_j ; p_j)\}_{j = 1,\ldots,J}\)

  • ✏️ Examples

  • Properties:

    1. \(P(X=x) \in [0,1]\) for any \(x \in D_X\)

    2. \(\sum_{j=1}^{J} p_j = 1\)

P.d. of a continuous r.v.

  • Impossible to list all the possible realizations of a continuous r.v.

  • Impossible to associate a specific probability with each possible realization

    \(\Rightarrow\) We need a different concept to characterize the distribution of the r.v.

The density function of a continuous r.v. \(X\) is the function \(f(\cdot)\) whose integral over any range of values gives the probability that \(X\) falls within that range.

  • Properties:

    1. \(f(x) \geq 0\) for any \(x \in D_X\)

    2. \(P(a<X<b) = \int_{a}^b f(x)dx \geq 0\) for any \(a, b \in D_X\) s.t. \(a<b\)

    3. \(\int_{D_X} f(x)dx = 1\)
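These properties can be checked numerically in R; for instance, with the standard Gaussian density dnorm:

integrate(dnorm, lower = -Inf, upper = Inf)  # total area under the curve: 1
integrate(dnorm, lower = -1, upper = 1)      # P(-1 < X < 1), about 0.68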

P.d. of a continuous r.v.

✏️ Graphical examples and different interpretations as area under the curve

Cumulative distribution function

  • ⚠️ For quantitative r.v.

  • Definitions

    • The cumulative distribution function (c.d.f.) of r.v. \(X\) is the function \(F_X(\cdot)\) defined as \[ F_X(x) = P(X \leq x), \; x \in \mathbf{R} \]

    • The quantile of order \(p\) (\(0 \leq p \leq 1\)) of a r.v. \(X\) is the value \(x_p\) s.t. \[ P(X \leq x_p) = p, \] i.e., s.t. \[ F_X(x_p) = p. \]
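In R, the c.d.f. and the quantile function of the usual distributions come as p*/q* pairs; e.g. for \(X \sim \mathcal{N}(0,1)\):

pnorm(1.96)    # F_X(1.96), about 0.975
qnorm(0.975)   # quantile of order 0.975, about 1.96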

Cumulative distribution function: properties

(✏️ Graphical illustration of the following properties)

  1. A c.d.f. is defined on \(\mathbf{R}\) and has values in \([0,1]\): it defines a probability!

  2. A distribution function is a non-decreasing function (increasing in the broad sense).

  3. \(\lim_{x \rightarrow -\infty} F_X(x) = 0\) and \(\lim_{x \rightarrow +\infty} F_X(x) = 1\)

  4. \(P(a < X \leq b) = F_X(b) - F_X(a)\)

  5. If the distribution of \(X\) is symmetric about \(0\): \(F_X(-x) = 1-F_X(x)\) for any \(x \in \mathbf{R}\)

  6. If \(X\) is a continuous r.v.: \[ f_X(x) = F_X'(x), \; x \in \mathbf{R} \]

    \[ F_X(x) = \int_{-\infty}^{x} f_X(t) dt \]

Expected value and variance of a quantitative r.v.

Expected value: definition

  • Central tendency characteristics of a distribution.

  • The expected value (or “theoretical” mean) of a r.v. \(X\) is the value taken by \(X\) on average. It is given by:

    • if \(X\) is a discrete r.v.:

      \[ E(X) = \sum_{j=1}^{J} a_j p_j = \sum_{j=1}^{J} a_j P(X=a_j) \]

    • if \(X\) is a continuous r.v.: \[ E(X) = \int_{D_X} x f_X(x)dx \]

Expected value: rules

Let \(a \in \mathbf{R}\) be a constant and \(X\) (\(X_1\), \(X_2\)) any quantitative r.v.

  1. \(E(a) = a\)

  2. \(E(aX) = aE(X)\)

  3. \(E(X_1 + X_2) = E(X_1) + E(X_2)\)

  4. \(E(a+X) = a+E(X)\)

📝 Exercise: \(E(X-E(X)) = ?\)

Variance: definition

  • Quantifies the dispersion of the values taken by a r.v. \(X\) around its “theoretical mean”.

    • A large variance indicates a large dispersion.

    • A zero variance reveals that \(X\) is not random (it is constant).

Definition

\[ V(X) =E\left[ (X-E(X))^2 \right] = E(X^2) - E(X)^2 \]

Calculation

  • if \(X\) is a discrete r.v.: \(V(X) = \sum_{j=1}^{J} (a_j-\mu)^2 p_j\)

  • if \(X\) is a continuous r.v.: \(V(X) = \int_{D_X} (x-\mu)^2 f_X(x) dx\)

where \(\mu=E(X)\).

Variance: rules

Let \(a \in \mathbf{R}\) be a constant and \(X\) (\(X_1\), \(X_2\)) any quantitative r.v.

  1. \(V(a) = 0\)

  2. \(V(aX) = a^2 V(X)\)

  3. \(V(a+X) = V(X)\)

  4. \(V(X_1+X_2)= \ldots\) (It depends!)

Remarks

  • The standard deviation of a (quantitative) r.v. \(X\) is the square-root of its variance (\(\sqrt{V(X)}\)).

  • 📝 Exercise (centering and standardizing) : Let \(X\) be a r.v. with mean \(\mu\) and variance \(\sigma^2\) and define the transformed variable \[Y = \frac{X-\mu}{\sigma}\] What are the expected value and the variance of \(Y\)?

Some common distributions

Bernoulli distribution

To describe an experiment whose result can take only two values, called, by convention, success and failure.

Ex.: a candidate passes or fails an examination, a sick patient is cured or not, \(\ldots\)

\[X \sim \mathcal{B}(p), \, 0 \leq p \leq 1\]

  • Domain: \(D_X = \{0,1\}\)

  • Probability distribution: \(P(X=0)=1-p\) , \(P(X=1)=p\)

  • Expectation: \(E(X)=p\)

  • Variance: \(V(X)=p(1-p)\)

Binomial distribution

To describe an experiment which consists in counting the number of successes after performing \(n\) independent Bernoulli trials with the same probability \(p\).

\[X \sim \mathcal{B}(n,p), \, n \in \mathbf{N}, \, 0 \leq p \leq 1\]

  • Domain: \(D_X = \{0,1,\ldots,n\}\)

  • Probability distribution: \(P(X=k)=\frac{n!}{k! (n-k)!}p^k(1-p)^{n-k}\)

  • Expectation: \(E(X)=n p\)

  • Variance: \(V(X)=n p(1-p)\)
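A quick numerical check of the formula in R:

dbinom(3, size = 10, prob = 0.2)   # P(X = 3) for X ~ B(10, 0.2)
choose(10, 3) * 0.2^3 * 0.8^7      # same value from the formula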

Figure 1: Binomial distributions: (a) B(10,0.2), (b) B(10,0.5), (c) B(20,0.2)

Poisson distribution

To describe counts.

\[X \sim \mathcal{P}(\lambda), \lambda >0\]

  • Domain: \(D_X = \mathbf{N} = \{0,1,\ldots\}\)

  • Probability distribution: \(P(X=k)=e^{-\lambda} \frac{\lambda^k}{k!}\)

  • Expected value: \(E(X)=\lambda\)

  • Variance: \(V(X)=\lambda\)
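In R:

dpois(0:4, lambda = 1.5)   # P(X = k) for k = 0,...,4 when X ~ P(1.5)
sum(dpois(0:100, 1.5))     # the probabilities sum to 1 (up to truncation)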

Figure 2: Poisson distributions: (a) P(1.5), (b) P(10), (c) P(50)

Gaussian distribution

\[X\sim \mathcal{N}(\mu,\sigma^2)\]

  • Domain: \(D_X = \mathbf{R}\)

  • Density function: \(f_X(x) = (2\pi \sigma^2)^{-1/2} \exp\left(-\frac{1}{2\sigma^2 }(x-\mu)^2\right)\)

  • Expected value: \(E(X)=\mu\)

  • Variance: \(V(X)=\sigma^2\)
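In R, the Gaussian functions dnorm/pnorm/qnorm are parameterized by the standard deviation \(\sigma\), not the variance; e.g. for \(X \sim \mathcal{N}(100, 10^2)\):

pnorm(110, mean = 100, sd = 10) - pnorm(90, mean = 100, sd = 10)  # P(90 < X < 110), about 0.68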

Figure 3: Gaussian densities: (a) N(0,1), (b) N(0,10), (c) N(-100,1); dotted curve: \(\mathcal{N}(0,1)\)

Joint distributions

Independent random variables

Idea: (⚠️ not a rigorous definition)

Two random variables \(X_1\) and \(X_2\) are independent when knowing the value taken by \(X_1\) gives no information about the value that will be taken by \(X_2\) and vice versa.

Examples \(\ldots\)

Covariance

  • Measures the strength of the (linear) relationship between two (quantitative) variables.

  • Consider two r.v. \(X_1\) and \(X_2\). The covariance between \(X_1\) and \(X_2\) is defined by: \[ cov(X_1,X_2) = E[(X_1-E(X_1))(X_2-E(X_2))] = E(X_1 X_2)- E(X_1)E(X_2) \]

Rules

  1. If two r.v. are independent, then their covariance is zero. But the converse is false!

  2. \(cov(X,X) = V(X)\)

  3. \(cov(aX+bY,cZ+dT) = ac cov(X,Z) + ad cov(X,T) + bc cov(Y,Z) + bd cov(Y,T)\)

  4. \(V(X_1 + X_2) = V(X_1) + V(X_2) + 2cov(X_1,X_2)\)

Consequence: if \(X_1\) and \(X_2\) are independent, then \(V(X_1 + X_2) = V(X_1) + V(X_2)\)

Linear correlation coefficient

Consider two (quantitative) r.v. \(X_1\) and \(X_2\).

The linear correlation coefficient between \(X_1\) and \(X_2\) is defined by: \[ \rho(X_1, X_2) = \frac{cov(X_1, X_2)}{\sqrt{V(X_1)}\sqrt{V(X_2)}} \]

Remarks :

  • \(-1 \leq \rho(X_1, X_2) \leq 1\).

  • \(\rho(X_1, X_2)\) only characterizes the linear relationship between \(X_1\) and \(X_2\).

✏️ Graphical considerations \(\ldots\)

Joint distributions

Discrete r.v.

Joint distribution: set of values \(p_{ij}\) defined as: \[p_{ij} = P(X=x_i, Y=y_j)\] Remark: \(\sum_{i,j} p_{ij} = 1\)

If \(X\) and \(Y\) are independent, then \[P(X=x_i \cap Y=y_j) = P(X=x_i) P(Y=y_j)\]

Continuous r.v.

If \(X\) and \(Y\) are independent, then the joint density factorizes: \[f(x,y) = h(x) g(y)\] where \(h\) and \(g\) denote the marginal densities of \(X\) and \(Y\).

Central limit theorem

Suppose \(X_{1},\ldots ,X_{n}\) is a sequence of i.i.d. random variables with \(\mathbb{E}[X_{i}]=\mu\) and \(V[X_{i}]=\sigma ^{2}<\infty\). Then as \(n\) approaches infinity, the random variables \(\sqrt {n}({\bar {X}}_{n}-\mu )\) converge in distribution to a normal \(\mathcal {N}(0,\sigma ^{2})\): \[{{\sqrt {n}}\left({\bar {X}}_{n}-\mu \right)\ \xrightarrow {d} \ {\mathcal {N}}\left(0,\sigma ^{2}\right).}\]
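A small simulation illustrating the theorem; the choices below (exponential draws with \(\mu = \sigma^2 = 1\), \(n = 100\), 5000 replicates) are arbitrary:

set.seed(1)
n <- 100; B <- 5000
# B standardized sample means of n exponential(1) draws
z <- replicate(B, sqrt(n) * (mean(rexp(n, rate = 1)) - 1))
hist(z, freq = FALSE, breaks = 40)
curve(dnorm(x), add = TRUE)   # the N(0,1) density fits the histogram closely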

Statistical inference

Outline

The goal is to introduce tools to draw conclusions about a population from observations collected on a subset of individuals in that population (sample).

  1. Estimation.

  2. Confidence intervals.

  3. Statistical tests.

This requires the definition of a model, i.e. the specification of a random distribution for the data, and possibly an equation linking the parameters of the distribution of the variable to be explained to the explanatory variables.

Examples

▶️ We are interested in estimating the proportion of patients experiencing side effects after taking a new medication.

Denote by \(X\) the number of patients among those of a \(n\)-sample that experience a side effect. We can assume that \[X \sim \mathcal{B}(n,p)\]

\(p\) is unknown and represents the probability to experience a side effect. The \(n\)-sample is used to draw conclusions about \(p\).

▶️ We are interested in estimating the average cholesterol level in the population of people suffering from a given disease.

Denote by \((X_i)_{1 \leq i \leq n}\) the cholesterol levels for each individual in a \(n\)-sample.

It can be assumed that \((X_i)_{1 \leq i \leq n}\) are mutually independent such that:

\[X_i \sim \mathcal{N}(\mu,\sigma^2)\]

\(\mu\) and \(\sigma^2\) are unknown. \(\mu\) represents the average cholesterol level in the population of interest. The \(n\)-sample is used to draw conclusions about \(\mu\).

Prerequisites

Gaussian variables manipulation

  • Scaling: Consider \(X \sim \mathcal{N}(\mu,\sigma^2)\) and define \(Z=\frac{X-\mu}{\sigma}\). Then, \[Z \sim \mathcal{N}(0,1)\]

  • Sum of independent Gaussian r.v.’s: Consider \(X_i \sim \mathcal{N}(\mu_i,\sigma_i^2)\), \(i=1,\ldots,n\), independent, and define \(Z=\sum_{i} X_i\). Then, \[Z \sim \mathcal{N}(\sum_{i}\mu_i,\sum_{i}\sigma_i^2)\]

Estimator, estimation

  • An estimator \(T_n\) of a parameter is a random variable defined as a function of the random variables from the statistical model, i.e. \[T_n = f(X_1,\ldots,X_n)\] An estimator is a random variable and is therefore characterized by a probability distribution.

  • The estimation is a realization of the random variable \(T_n\) calculated from the observations in the sample: \[t_n = f(x_1,\ldots,x_n)\]

A simple example

Let’s consider a Gaussian model, i.e. \(X_i \underset{i.i.d.}{\sim} \mathcal{N}(\mu,\sigma^2)\), and \[\bar{X_n}=\dfrac{1}{n}\sum\limits_{i=1}^{n} X_i\]

  1. Is \(\bar{X_n}\) a random variable or a constant?

  2. What are the distribution, the expected value and the variance of \(\bar{X_n}\)?

  3. Why would \(\bar{X_n}\) be a good estimator for \(\mu\)?

Desired properties of an estimator: unbiased, small variance,…

How are estimators defined in practice?

  • The maximum likelihood estimation (MLE) method can be used to estimate the parameters of a model.

  • The idea is to find the value of the unknown parameter(s) that maximize the probability (i.e. the likelihood) of the data being observed (given the model).

📝 Example: Derivation of \(\bar{X_n}\) as the maximum likelihood estimator of \(\mu\) in the Gaussian model.
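A numerical illustration (not the formal derivation): maximizing the Gaussian log-likelihood in \(\mu\), with \(\sigma\) fixed to 1 for simplicity, on simulated data recovers the sample mean:

set.seed(1)
x <- rnorm(50, mean = 3, sd = 1)   # simulated sample
loglik <- function(mu) sum(dnorm(x, mean = mu, sd = 1, log = TRUE))
optimize(loglik, interval = c(-10, 10), maximum = TRUE)$maximum
mean(x)                            # same value, up to numerical error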

Confidence intervals

  • An estimation is not the true parameter value! We can only hope that the estimated value is close to the true parameter value.

  • Due to sampling variability, the estimation of a given parameter varies from sample to sample.

  • A confidence interval is a range of values that contains the true parameter value with a given probability \(1-\alpha\). \(1-\alpha\) is called the confidence level.

Building confidence intervals

These items describe the generic steps in the construction of a confidence interval. We explain them here in the Gaussian model.

  1. Transform the estimator into a pivotal statistic \(T_n\) whose distribution does not depend on the unknown parameters (i.e. is entirely known). Think about centering and standardizing \(\bar{X}_n\).

  2. Represent the distribution of \(T_n\) and place the quantiles of orders \(\alpha/2\) and \(1-\alpha/2\) (\(u_{\alpha/2}\) and \(u_{1-\alpha/2}\)) on this representation.

  3. What is the value of \(P(u_{\alpha/2} \leq T_n \leq u_{1-\alpha/2})\)?

  4. Deduce \(A_n\) and \(B_n\) such that \(P(A_n \leq \mu \leq B_n) = 1-\alpha\).

\([A_n,B_n]\) is called a probability interval. The confidence interval is the realization of this probability interval calculated from the sample.
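A sketch of these steps in R, for the case where \(\sigma^2\) is known (hypothetical data x, known \(\sigma = 2\), \(\alpha = 0.05\)):

x <- c(5.1, 4.8, 5.6, 4.9, 5.3)   # hypothetical observations
sigma <- 2; alpha <- 0.05
m <- mean(x); n <- length(x)
c(m - qnorm(1 - alpha/2) * sigma / sqrt(n),
  m + qnorm(1 - alpha/2) * sigma / sqrt(n))   # realization of [A_n, B_n]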

What if \(\sigma^2\) is not known?

Important probability distributions and results

Chi-square distribution: Let \(X_1\), \(X_2\), …, \(X_n\) be \(n\) independent r.v. identically distributed according to \(\mathcal{N}(0,1)\), and let \(U = X_1^2 +X_2^2 +...+X_n^2\).

Then \[U \sim \chi^2(n)\]

Figure 4: \(\chi^2(10)\)

Important property: Let \(X_1\), \(X_2\), \(\ldots\), \(X_n\) be \(n\) independent r.v. identically distributed according to \(\mathcal{N}(\mu,\sigma^2)\). Then \[\frac{1}{\sigma^2}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \sim \chi^2(n-1)\]

Important probability distributions and results

Student distribution: Let \(X \sim \mathcal{N}(0,1)\) and \(U \sim \chi^2(\nu)\) be independent, and let \(T = X/\sqrt{U/\nu}\). Then \[T \sim \mathcal{T}(\nu)\]

Fisher distribution: Let \(U_1\) and \(U_2\) be independent r.v. s.t. \(U_1 \sim \chi^2(\nu_1)\) and \(U_2 \sim \chi^2(\nu_2)\), and let \(F = (U_1/\nu_1)/(U_2/\nu_2)\). Then \[F \sim \mathcal{F}(\nu_1,\nu_2)\]

Figure 5: (a) \(\mathcal{T}(1)\), (b) \(\mathcal{T}(10)\), (c) \(\mathcal{F}(3,10)\); dotted curve: \(\mathcal{N}(0,1)\)

What if \(\sigma^2\) is not known?

Consider \(X_i \underset{i.i.d.}{\sim} \mathcal{N}(\mu,\sigma^2)\). The usual estimator for \(\sigma^2\) is \[S_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X_n})^2\] It is unbiased and, once suitably scaled, follows a chi-square distribution: \[\frac{n-1}{\sigma^2} S_{n-1}^2 \sim \chi^2_{n-1}\]

What if \(\sigma^2\) is not known?

Instead of using \(T_n = \frac{\bar{X_n}-\mu}{\sigma/\sqrt{n}}\), we use \[T_n^{(2)} = \frac{\bar{X_n}-\mu}{S_{n-1}/\sqrt{n}} \sim \mathcal{T}_{n-1}\] as a pivotal statistic.

  1. Justify that \(T_n^{(2)}\) is a pivotal statistic.

  2. Repeat the steps detailed above to derive the confidence interval of \(\mu\) at confidence level \(1-\alpha\).

Synthetic example

data
   id gender cholesterol weight_kg height_cm
1   1      0         254        48       158
2   2      1         402        98       145
3   3      0         288        72       152
4   4      1         354        57       148
5   5      0         220        79       160
6   6      1         451        63       165
7   7      0         405        84       138
8   8      1         280        30       142
9   9      0         358        76       173
10 10      1         456        65       169
m <- mean(data$cholesterol)
s <- sd(data$cholesterol)
c(m-s/sqrt(10)*qt(0.975,9),m+s/sqrt(10)*qt(0.975,9))
[1] 287.4241 406.1759
t.test(data$cholesterol,conf.level=0.95)

    One Sample t-test

data:  data$cholesterol
t = 13.213, df = 9, p-value = 3.378e-07
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 287.4241 406.1759
sample estimates:
mean of x 
    346.8 

Statistical tests

  • Statistical hypothesis: any statement concerning a characteristic of the population (i.e. a parameter of the model).

  • Null hypothesis (\(H_0\)): the hypothesis according to which the value of the parameter of interest is equal to a reference value.

  • Alternative hypothesis (\(H_1\)): must reflect an incompatibility with \(H_0\), for example, its opposite.

    • two-sided test: \(H_0\): \(\mu = \mu_0\) vs \(H_1\): \(\mu \neq \mu_0\)

    • one-sided test: \(H_0\): \(\mu = \mu_0\) vs \(H_1\): \(\mu > \mu_0\)

    • one-sided test : \(H_0\): \(\mu = \mu_0\) vs \(H_1\): \(\mu < \mu_0\)

The approach of statistical testing consists in choosing between \(H_0\) and \(H_1\) the most probable hypothesis in view of the observations contained in the sample.

✏️ Example. We wonder if the average cholesterol level of people with this disease is equal to the reference value of 400. How would you formulate \(H_0\) and \(H_1\)?

Test statistic

✏️ Still consider \(X_i \underset{i.i.d.}{\sim} \mathcal{N}(\mu,\sigma^2)\) and the following hypothesis: \[ H_0 : \mu = \mu_0 \; vs \; H_1 : \mu \neq \mu_0 \]

  1. Recall the expression for the mean estimator.

  2. What would be its distribution if \(H_0\) were true?

  3. Consider \[T_n^{(2)} = \sqrt{n} \frac{\bar{X_n}-\mu_0}{S_{n-1}}.\] What would be the distribution of \(T_n^{(2)}\) if \(H_0\) were true? What are the specific features of this distribution?

➡️ \(T_n^{(2)}\) is the test statistic. The construction of the decision rule that will allow to choose between \(H_0\) and \(H_1\) given the data is based on this random variable (rejection area, critical value, p-value).

Decision rule

  • When performing a statistical test, it is always assumed that \(H_0\) is true.

  • Then, a significance level \(\alpha\) is chosen. It represents the proportion of the test statistic’s values that would be the least compatible with \(H_0\) and that would lead to the decision to reject \(H_0\).

✏️ Exercise:

  1. Plot the distribution of \(T_n^{(2)}\) under \(H_0\) and represent \(\alpha\) and the rejection zone.

  2. What critical values delimit the rejection zone?

  3. State the decision rule of the test.

Back to the example

t.test(data$cholesterol, mu=400, alternative="two.sided")

    One Sample t-test

data:  data$cholesterol
t = -2.0269, df = 9, p-value = 0.0733
alternative hypothesis: true mean is not equal to 400
95 percent confidence interval:
 287.4241 406.1759
sample estimates:
mean of x 
    346.8 

Discussion

  • p-value concept

  • What would have changed if another alternative hypothesis had been considered?

t.test(data$cholesterol, mu=400, alternative="greater")

    One Sample t-test

data:  data$cholesterol
t = -2.0269, df = 9, p-value = 0.9633
alternative hypothesis: true mean is greater than 400
95 percent confidence interval:
 298.6855      Inf
sample estimates:
mean of x 
    346.8 

Errors and power of a test

  • type 1 error: rejecting \(H_0\) when \(H_0\) is actually true; its probability is the significance level \(\alpha\).

  • type 2 error: failing to reject \(H_0\) when \(H_1\) is actually true; its probability is denoted \(\beta\).

  • power: \(1-\beta\), the probability of rejecting \(H_0\) when \(H_1\) is actually true.
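Power can be estimated by simulation; a sketch under a hypothetical setup (true mean 420, \(\sigma = 80\), \(n = 10\), testing \(H_0: \mu = 400\) at \(\alpha = 0.05\)):

set.seed(1)
pvals <- replicate(2000, t.test(rnorm(10, mean = 420, sd = 80), mu = 400)$p.value)
mean(pvals < 0.05)   # proportion of rejections = estimated power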

Principal Component Analysis and data clustering

Principal Component Analysis (PCA)

Principal Component Analysis: a multidimensional exploratory analysis method

Objectives

  • Represent observations of \(p\) variables (\(p>3\)) in a space of reduced dimension (typically a plane or a 3-dimensional space) while limiting the loss of information,

  • Summarize the information in a synthetic way: how to keep the essential information of the data?

Tools

  • No prior statistical assumption

  • Matrix algebra: eigenvectors and eigenvalues

Data

PCA can only be applied to quantitative variables observed on the same individuals.

Example

Cheese is made from cow’s, goat’s or sheep’s milk. It is a source of protein, calcium, but also vitamin A (retinol) and vitamin B9 (folic acid). The objective of this study is to specify the nutritional value of the most common cheeses.

  1. Which cheeses have the same nutritional profile, and which ones have different profiles? Can we highlight the main dimensions of nutritional variability of this set of cheeses?

  2. Associate similar cheeses with a particular nutritional profile.

'data.frame':   29 obs. of  10 variables:
 $ Fromages   : chr  "CarredelEst" "Babybel" "Beaufort" "Bleu" ...
 $ calories   : int  314 314 401 342 264 367 344 292 406 399 ...
 $ sodium     : num  354 238 112 336 314 ...
 $ calcium    : num  72.6 209.8 259.4 211.1 215.9 ...
 $ lipides    : num  26.3 25.1 33.3 28.9 19.5 28.8 27.9 25.4 32.5 32.4 ...
 $ retinol    : num  51.6 63.7 54.9 37.1 103 ...
 $ folates    : num  30.3 6.4 1.2 27.5 36.4 5.7 36.3 32.5 4.9 1.3 ...
 $ proteines  : num  21 22.6 26.6 20.2 23.4 23 19.5 17.8 26 29.2 ...
 $ cholesterol: int  70 70 120 90 60 90 80 70 110 120 ...
 $ magnesium  : int  20 27 41 27 20 30 36 25 28 51 ...

Notations

Data

The values taken by the \(p\) variables on \(n\) individuals are presented in a \(n\times p\) table \(X\):

\[X=\left[\begin{array}{cccc} x_{11} & x_{12} & \ldots & x_{1p} \\ x_{21} & x_{22} & \ldots & x_{2p} \\ \vdots & \vdots & \vdots & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{np} \end{array}\right]\]

  • \(x_{ij}\) is the value taken by individual \(i\) for the variable \(j\).

  • The row \(x_{i\cdot}=(x_{i1} , x_{i2} , \ldots , x_{ip})\) corresponds to the values of the \(p\) variables for individual \(i\).

  • The column \(x_{\cdot j}=(x_{1j},x_{2j},\ldots,x_{nj})\) corresponds to the values of the variable \(j\) on the \(n\) individuals.

Mean and variance

  • Mean of variable \(j\): \(\bar{x}_{\cdot j} = \frac{1}{n} \sum\limits_{i=1}^{n} x_{ij}\)

  • Variance of variable \(j\): \(\sigma^2_{j} = \frac{1}{n} \sum\limits_{i=1}^{n} (x_{ij}-\bar{x}_{\cdot j})^2\)

Preliminary data transformations

Data transformations:

  • Centering data: \(\tilde{X} = (x_{ij}-\bar{x}_{\cdot j})_{1\leq i \leq n , 1\leq j \leq p}\)

  • Centering and standardizing data: \(\tilde{X} = \left(\frac{x_{ij}-\bar{x}_{\cdot j}}{\sigma_j}\right)_{1\leq i \leq n , 1\leq j \leq p}\)
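In R, both transformations are performed by scale(); note that scale() standardizes with the \(n-1\) denominator, slightly different from the \(\sigma_j\) defined above:

tilde.X <- scale(fromages[,-1], center = TRUE, scale = TRUE)
round(colMeans(tilde.X), 10)   # all (numerically) 0
apply(tilde.X, 2, sd)          # all 1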

Why standardize?

  • The variables are of different natures; for example: age, salary, number of children.

  • Their respective influences in the analysis should not depend on their variability; for example, in an analysis of grades: should a subject with a high variance have a greater influence on the analysis?

Example: Should the data be standardized?

apply(fromages[,-1],2,mean)
   calories      sodium     calcium     lipides     retinol     folates 
  300.03448   210.08621   185.73448    24.15862    67.56207    13.01034 
  proteines cholesterol   magnesium 
   20.16897    74.58621    26.96552 
apply(fromages[,-1],2,sd)
   calories      sodium     calcium     lipides     retinol     folates 
  91.914356  108.678923   72.528882    8.129642   24.163098   11.723339 
  proteines cholesterol   magnesium 
   6.959788   28.245755   11.318388 

PCA’s goals

  • Extracting the essential information from a table \(X\) by providing the user with interpretation-friendly graphical representations.

  • Data compression/summary of large datasets

  • Two main aspects :

    1. an analysis of the similarities between individuals \(\rightarrow\) characterization of the individuals with the available variables

    2. an analysis of the relationships between variables

In what follows, PCA is presented on centered and standardized data tables.

The space (the cloud) of individuals

Geometrically, the individuals in \(X\) correspond to \(n\) points of \(\mathbf{R}^p\).

  • Similarity between individuals: Euclidean distance \[ d_{ind}(i_1,i_2) = \sqrt{\sum_{j=1}^{p} (x_{i_1j}-x_{i_2j})^2} = \mid\mid x_{i_1 \cdot} - x_{i_2 \cdot} \mid\mid_p \]

  • Inertia \(\rightarrow\) variability between individuals \[ \begin{eqnarray*} I_{ind} & = & \frac{1}{n} \sum_{j=1}^{p}\sum_{i=1}^{n} (x_{ij}-\bar{x}_{\cdot j})^2\\ & = & \frac{1}{n} \sum_{i=1}^{n} \mid \mid x_{i\cdot} - g\mid \mid ^2\\ & = & \sum_{j=1}^{p} \sigma^2_{j} \end{eqnarray*} \]

  • Center of gravity of the cloud: \[ g = (\bar{x}_{\cdot 1},\ldots,\bar{x}_{\cdot p}) \]

  • The variance of the \(j\)-th variable is given by \[\sigma^2_{j} = \frac{1}{n} \sum\limits_{i=1}^{n} (x_{ij}-\bar{x}_{\cdot j})^2\]

  • Standardized case: \(I_{ind} = p\)

  • Remark: in the non-standardized case, each variable has a different influence (it contributes its variance \(\sigma^2_j\) to the total inertia)

The space (the cloud) of variables

Geometrically, the quantitative variables in the data table \(X\) correspond to \(p\) points of \(\mathbf{R}^n\).

  • Scalar product: \[ < x_{\cdot j_1}, x_{\cdot j_2} > = \frac{1}{n} \sum_{i=1}^{n} x_{i j_1} x_{i j_2} \] \(\hookrightarrow\) Expresses the correlation between the two variables (standardized case). It is also a cosine.

  • The distance between two variables is given by the norm associated with this scalar product \[ d_{var}(j_1,j_2) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij_1}-x_{ij_2})^2} = \mid\mid x_{\cdot j_1} - x_{\cdot j_2 } \mid\mid_n \]

  • Inertia (variability) between variables \[ I_{var} = \sum_{j=1}^{p} \mid\mid x_{\cdot j}\mid\mid_n^2 \]

  • Remark : \[ I_{var} = I_{ind} (= p) \]

Principal components and dimension reduction

  • If \(p > 3\), the individuals’ cloud in \(\mathbf{R}^p\) is not representable.

  • The PCA method produces an approximate representation in a low-dimensional space (usually 2 or 3) by constructing, from the \(p\) (correlated) initial variables, \(q<p\) new variables called principal components

\(\Rightarrow\) in this new reduced space, the cloud is more easily represented and the analysis is easier.

  • The principal components (\(p\) in total) are constructed by matrix algebra operations such that

    • Each principal component is a linear combination of the \(p\) initial variables

    • The principal components are orthogonal to each other (i.e. uncorrelated)

    • Each principal component carries a certain proportion of the total inertia \(\rightarrow\) minimize the loss of information

    • If \(I_k\) is the inertia carried by the \(k\)th principal component, then

\[ I_1 \geq I_2 \geq \ldots \geq I_p \]

and

\[ I_1 + I_2 + \ldots + I_p = I_T \]
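In a standardized PCA, the inertias \(I_k\) are the eigenvalues of the correlation matrix of the data; a quick check on the fromages data (compare with acp.res$eig shown below):

ev <- eigen(cor(fromages[,-1]))$values
ev        # the I_k, in decreasing order
sum(ev)   # total inertia I_T = p = 9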

Choosing the number of components

Goal:

Choose \(q < p\) in order to keep the maximum inertia with the minimum number of variables.

Several strategies:

  • Elbow criterion: keep the axes whose eigenvalues are located before the drop (elbow) in the eigenvalue scree plot.

  • Kaiser criterion: only axes with inertia greater than the average inertia \(I_T /p\) are kept.

Remark: these criteria can lead to different choices

Choosing the number of components

library(FactoMineR)   # provides PCA()
acp.res <- PCA(fromages[,-1], scale.unit = TRUE,graph=F)
acp.res$eig
        eigenvalue percentage of variance cumulative percentage of variance
comp 1 5.049113814            56.10126460                          56.10126
comp 2 1.844418362            20.49353736                          76.59480
comp 3 0.867800921             9.64223245                          86.23703
comp 4 0.577384673             6.41538526                          92.65242
comp 5 0.355210155             3.94677950                          96.59920
comp 6 0.175311592             1.94790658                          98.54711
comp 7 0.097434214             1.08260237                          99.62971
comp 8 0.028431123             0.31590137                          99.94561
comp 9 0.004895146             0.05439051                         100.00000
library(factoextra)   # provides fviz_eig()
fviz_eig(acp.res)

Variables’ representation on correlation circles

Purpose:

Visualize the relations

  • between variables \(\rightarrow\) identify groups of correlated variables

  • between principal components and initial variables \(\rightarrow\) interpreting the new variables resulting from the PCA

How?

  • The better a variable is represented, the closer it is to the circle.

  • For two variables perfectly represented on the plane, the cosine of the angle formed by these 2 variables is equal to the correlation coefficient between these two variables.

  • Similarly for the angle formed by a principal component and a variable perfectly represented on the plane.

Representation of variables
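For instance, the correlation circle on the first factorial plane can be drawn with factoextra:

fviz_pca_var(acp.res, axes = c(1, 2), repel = TRUE)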

  1. What can we say about the relationships between variables?

  2. What is the concrete meaning of the first two principal components?

Representation of individuals

Interpreting the position of individuals

  • contributes to the understanding of what the principal components represent

  • allows to identify groups of individuals who are similar or on the contrary are distant

How?

  • An individual is well represented on the factorial plane if the angle formed by the individual with the axes forming this plane is small (squared cosine criterion close to 1).

  • The distance between two individuals on the plane is only interpretable if they are well represented:

    • if two well-represented individuals are close, they are similar individuals in the initial space

    • if two well-represented individuals are distant, they are individuals who were distant in the initial space

    • if one of the two individuals is misrepresented, nothing can be said

Representation of individuals

fviz_pca_ind(acp.res,axes=c(1,2),repel=T)

Representation of individuals

cos2<-cbind(acp.res$ind$cos2[,1:2],rowSums(acp.res$ind$cos2[,1:2]))
colnames(cos2) <- c("Dim.1","Dim.2","Plan.1.2")
cos2
                         Dim.1       Dim.2   Plan.1.2
CarredelEst        0.054513670 0.429989634 0.48450330
Babybel            0.305932561 0.152149334 0.45808189
Beaufort           0.782514162 0.100463687 0.88297785
Bleu               0.103648240 0.085351520 0.18899976
Camembert          0.107057342 0.418477137 0.52553448
Cantal             0.702584083 0.100133221 0.80271730
Chabichou          0.005532395 0.538443706 0.54397610
Chaource           0.101451314 0.751712152 0.85316347
Cheddar            0.647661421 0.019219083 0.66688050
Comte              0.717641173 0.074546157 0.79218733
Coulomniers        0.159759825 0.290881508 0.45064133
Edam               0.446589711 0.262189620 0.70877933
Emmental           0.624347552 0.211478990 0.83582654
Fr.chevrepatemolle 0.446655376 0.291348951 0.73800433
Fr.fondu.45        0.015681389 0.134713164 0.15039455
Fr.frais20nat.     0.858859993 0.124070892 0.98293089
Fr.frais40nat.     0.930465441 0.037203369 0.96766881
Maroilles          0.587564687 0.062168583 0.64973327
Morbier            0.716412761 0.006633021 0.72304578
Parmesan           0.587302799 0.007603017 0.59490582
Petitsuisse40      0.915231627 0.035236918 0.95046854
PontlEveque        0.007303862 0.166423043 0.17372690
Pyrenees           0.464108938 0.014284961 0.47839390
Reblochon          0.277707072 0.011700992 0.28940806
Rocquefort         0.160694290 0.504743837 0.66543813
SaintPaulin        0.131463626 0.430298747 0.56176237
Tome               0.012749876 0.055253361 0.06800324
Vacherin           0.171260304 0.482822070 0.65408237
Yaourtlaitent.nat. 0.634489274 0.312245808 0.94673508

Representation of individuals

fviz_pca_ind(acp.res,axes=c(1,2),repel=T,col.ind = "cos2")

Unsupervised classification (HAC and K-means)

Overview

Objective: Group \(n\) individuals into \(K\) groups of individuals with common characteristics:

  • each group should be as homogeneous as possible

  • members of a class should be more similar to each other than to members of other classes

Unsupervised classification:

  • Descriptive purpose

  • No a priori knowledge about the groups

  • Is it justified to assume the existence of several groups of individuals?

  • Question of the number of groups? \(K\) unknown

Tools

To classify, we need:

  • to choose a similarity/dissimilarity measure

  • to choose a criterion to evaluate the homogeneity of the classes

  • to choose an algorithm

  • to sometimes also choose a number of classes composing the classification

Dissimilarity

Definition

  • \(d(x_{1.},x_{2.}) = d(x_{2.},x_{1.}) \geq 0\)

  • \(d(x_{1.},x_{2.}) = 0 \Rightarrow x_{1.} = x_{2.}\)

  • Example: Euclidean distance (the most used in practice)

\[ d^2(x_{1.},x_{2.}) = \sum_{j=1}^{p}(x_{1j} - x_{2j})^2 \]

Reminder:

  • normalization of variables

  • context-dependent scaling issues: should all variables have the same importance?…

The cloud of points

  • Mean of a variable: \[ \bar{x}_{\cdot j} = \frac{1}{n}\sum_{i=1}^{n} x_{ij} \; , \; j=1,\ldots,p \]

  • Center of gravity of \(X\) (overall average): \[ g = (\bar{x}_{\cdot 1}, \ldots, \bar{x}_{\cdot p}) \]

  • Inertia of \(X\): \[ I_T = \sum_{i=1}^{n} d^2 (x_{i.},g) \]

Decomposition of the inertia

For a fixed number of groups \(K\):

\[ \begin{eqnarray*} I_T & = & \sum_{i=1}^{n} d^2 (x_{i.},g)\\ & = & \underbrace{\sum_{k=1}^K \sum_{i \in C_k} d^2(x_{i.},g_k)}_{I_{intra}} + \underbrace{\sum_{k=1}^K n_k d^2(g_k,g)}_{I_{inter}} \end{eqnarray*} \]

  • \(g_k\): barycenter of the group \(C_k\)

  • \(n_k\): number of individuals in the group \(C_k\)

  • \(I_{intra}\): intra-group inertia (dispersion of the points around their center in the different groups)

  • \(I_{inter}\): between-group inertia (separability of groups)

Remark: this decomposition is only valid if the chosen dissimilarity measure is the Euclidean distance.

Objective of the classification

Objective of the classification:

\[ I_T = I_{intra} + I_{inter} \]

For a given number of groups \(K\), maximize the inter-group inertia (minimize the intra-group inertia).

Remark: \(I_T\) is constant for a given dataset.

Difficulty:

  • we can’t evaluate the complete set of possible classifications (of the order of \(K^n\))

  • we need to use clever algorithms for exploring the solution space

HAC

  • Hierarchical Ascending Classification

  • Ascending method: agglomeration of groups

  • Hierarchical method: the successive partitions are nested within each other

  • Algorithm:

Initialization: each individual is a group (\(I_{intra} = 0\))

One iteration: merging two groups

End: the \(n\) individuals form a single group (\(I_{intra} = I_T\))

Choosing the groups to merge

  • Merging two groups is associated with an increase in intra-group variability; it is desired that this increase be as small as possible.

  • Increase of the intra-group variability induced by the merging of the \(C_k\) and \(C_l\) groups \[ \begin{eqnarray*} \Delta & = & I_{intra} (C_k \cup C_l) - I_{intra} (C_k, C_l)\\ & = & \frac{n_k n_l}{n_k + n_l} d^2(g_k,g_l) \end{eqnarray*} \]

  • Dissimilarity between groups (Ward’s distance) \[ d^2(C_k,C_l)=\frac{n_k n_l}{n_k + n_l} d^2(g_k,g_l) \]

Remark

Other possible dissimilarity measures

  1. \(d(C_k,C_l) = \underset{x_{i.} \in C_k, x_{i'.}\in C_l}{\text{min}} \; d(x_{i.},x_{i'.})\)

  2. \(d(C_k,C_l) = \underset{x_{i.} \in C_k, x_{i'.}\in C_l}{\text{max}} \; d(x_{i.},x_{i'.})\)

  3. \(d(C_k,C_l) = \frac{1}{n_l n_k}\sum\limits_{x_{i.} \in C_k, x_{i'.}\in C_l}^{} \; d(x_{i.},x_{i'.})\)
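In R’s hclust(), these alternatives correspond to the method argument (a sketch; Ward’s distance, used below, is method = "ward.D2"):

d <- dist(scale(fromages[,-1]))  # Euclidean distances on standardized data
hclust(d, method = "single")     # dissimilarity 1 (minimum / single linkage)
hclust(d, method = "complete")   # dissimilarity 2 (maximum / complete linkage)
hclust(d, method = "average")    # dissimilarity 3 (average linkage)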

Dendrogram

  • The partitions of the data obtained by the HAC are represented as a tree: dendrogram.

  • Fusion of two elements: branch of height proportional to the percentage of inertia loss.

  • Determination of the number of groups: cut the tree where there is a large jump in inertia.

Example

tab<-scale(fromages[,-1], center=TRUE,scale=TRUE)
data.dist <- dist(tab,method = "euclidean")
data.ward <- hclust(data.dist,method="ward.D2")
plot(data.ward,hang=-1)
rect.hclust(data.ward,k=2)

cluster <- cutree(data.ward, k = 2)

Interpretation of the groups

fviz_pca_ind(acp.res,axes=c(1,2),habillage=as.factor(cluster),repel=T)

Example

plot(data.ward,hang=-1)
rect.hclust(data.ward,k=3)

cluster <- cutree(data.ward, k = 3)

Interpretation of groups

fviz_pca_ind(acp.res,axes=c(1,2),habillage=as.factor(cluster),repel=T)

K-means method

  • The number of groups \(K\) is fixed a priori.

  • Algorithm:

Initialization: we choose \(K\) centers of groups \(\mu_1,\ldots,\mu_K\) among the individuals.

One iteration:

Each individual is classified in the group whose center is closest.

Definition of the new group centers as the barycenters of each group.

End: “convergence”

Remark

  • “Convergence”: decrease of the intra-group inertia at each iteration.

  • Faster algorithm than HAC.

  • The result depends on the initialization.

  • There is a wide variety of initialization modes, often random.

Example

res.k.means <- kmeans(fromages[,-1],centers=3)
res.k.means$centers
  calories   sodium  calcium  lipides  retinol   folates proteines cholesterol
1 320.6667 288.5667 163.8867 26.38667 68.70667 16.253333  20.63333    78.66667
2 364.2222 158.3333 257.8000 29.02222 61.95556  4.066667  26.16667    95.55556
3 122.6000  67.8000 121.5600  8.72000 74.22000 19.380000   7.98000    24.60000
  magnesium
1  25.33333
2  37.88889
3  12.20000
res.k.means$size
[1] 15  9  5
res.k.means$totss
[1] 763883.3
res.k.means$withinss
[1] 127801.31  70549.80  49380.94
res.k.means$tot.withinss
[1] 247732.1
res.k.means$betweenss
[1] 516151.2

Example

Choosing the number of groups

  • There is no classification that is strictly better than the others.

  • This means measuring the quality of each competing classification and often making compromises.

  • The interpretability of classes often plays a role.

Example

inertie.expl <- rep(0,times=10)
for (k in 2:10){
  clus <- kmeans(tab,centers=k,nstart=5)
  inertie.expl[k] <- clus$betweenss/clus$totss }
plot(1:10,inertie.expl,type="b",xlab="Nb. of groups",ylab="% explained inertia")