Each row (resp. column) represents an individual (resp. a variable).
The sequence of values taken by a variable on the observed individuals is called a statistical series. The statistical series for variable \(X\) is denoted by \(x_1,x_2,\ldots,x_n\).
Example
Data
A subset of licorice_gargle, an R dataset from a study enrolling adult patients undergoing elective thoracic surgery.
What is the formula for the variance of the series \(y\) as a function of \(a\), \(b\) and \(s^2_{x}\)?
Standard deviation: square root of the variance; either \(s_{emp,x} = \sqrt{s_{emp,x}^2}\) or \(s_{corr,x} = \sqrt{s_{corr,x}^2}\)
Examining quantitative variables (5/9)
Coefficient of variation: \[
CV = \frac{s_x}{\bar{x}}
\] Unit?
Range: difference between the largest and the smallest value of the series: \[
E = x_{(n)} - x_{(1)}
\] Remark: sensitive to outliers
Inter-quartile range: difference between third and first quartiles of the series: \[
E_{IQ} = x_{3/4} - x_{1/4}
\]
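These dispersion measures are easy to compute in R; a minimal sketch, assuming the age variable used in the example below:
max(data$age) - min(data$age)                      # range E
quantile(data$age, 3/4) - quantile(data$age, 1/4)  # inter-quartile range E_IQ
IQR(data$age)                                      # built-in equivalent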
Examining quantitative variables (6/9)
Example: age
stat.desc(data$age)
nbr.val nbr.null nbr.na min max range
2.350000e+02 0.000000e+00 0.000000e+00 1.800000e+01 8.600000e+01 6.800000e+01
sum median mean SE.mean CI.mean.0.95 var
1.348200e+04 6.200000e+01 5.737021e+01 1.008554e+00 1.987006e+00 2.390376e+02
std.dev coef.var
1.546084e+01 2.694925e-01
Example: BMI
stat.desc(data$BMI)
nbr.val nbr.null nbr.na min max range
235.0000000 0.0000000 0.0000000 15.6000000 36.3300000 20.7300000
sum median mean SE.mean CI.mean.0.95 var
6013.9900000 25.9100000 25.5914468 0.2786695 0.5490218 18.2493261
std.dev coef.var
4.2719230 0.1669278
Examining quantitative variables (7/9)
Representation of the series distribution via histograms
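For instance, a minimal sketch with the age variable from the running example:
hist(data$age, main = "Distribution of age", xlab = "age")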
Examining quantitative variables (8/9)
or via boxplots
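A corresponding sketch:
boxplot(data$age, ylab = "age")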
Quantitative variables (9/9)
Example: age
Bivariate descriptive statistics
Between quantitative variables (1/3)
The scatter plot is used to represent two quantitative variables simultaneously and to get an idea of the nature of the relationship between them.
Example
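A minimal sketch, assuming the age and BMI variables used earlier:
plot(data$age, data$BMI, xlab = "age", ylab = "BMI")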
Between quantitative variables (2/3)
The correlation coefficient is a measure of linear relationship between two quantitative variables.
Definition: \[
r = \frac{s_{xy}}{s_x s_y}
\] where \(s_x\) and \(s_y\) are the standard deviations of \(x\) and \(y\), and \(s_{xy}\) is the covariance between \(x\) and \(y\): \[
s_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}) (y_i - \bar{y})
\]
Example
cor(data$age,data$BMI)
[1] 0.3138115
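Note that R’s cov() and sd() both use the corrected (\(1/(n-1)\)) versions, so the factors cancel and their ratio reproduces cor() exactly:
cov(data$age, data$BMI) / (sd(data$age) * sd(data$BMI))  # equals cor(data$age, data$BMI)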
Properties and interpretation of the correlation coefficient
\(-1 \leq r \leq 1\),
if \(r<0\): negative/decreasing linear relationship,
if \(r>0\): positive/increasing linear relationship,
if \(|r|\approx1\): very strong linear relationship,
if \(r=0\): no linear relationship.
Between qualitative variables
The contingency table is a table showing the joint frequencies of two variables: one in rows, the other in columns.
proportions(tab)
Small Medium Large
Male 0.13617021 0.39574468 0.07234043
Female 0.10212766 0.27659574 0.01702128
proportions(tab, margin=1)
Small Medium Large
Male 0.22535211 0.65492958 0.11971831
Female 0.25806452 0.69892473 0.04301075
proportions(tab, margin=2)
Small Medium Large
Male 0.5714286 0.5886076 0.8095238
Female 0.4285714 0.4113924 0.1904762
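The table tab used above can be built with table(); the variable names gender and size below are assumptions for illustration:
tab <- table(data$gender, data$size)  # rows: gender, columns: size
proportions(tab)                      # joint proportions (grand total = 1)
proportions(tab, margin = 1)          # row profiles
proportions(tab, margin = 2)          # column profiles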
Take home message
Describing the data at hand is an essential step prior to any statistical study.
It provides some data summaries and can be used to highlight patterns in the data.
We have seen univariate and bivariate descriptive statistics tools.
Different tools are used depending on the nature of the variables.
Second session: some notions of probability
Why this probability class?
A kind of toolbox for the rest of the statistics course
Statistical models
Estimators, confidence intervals and tests
Introduce essential concepts and definitions (e.g. random variable, realization of a random variable)
Introduce the key probability distributions (Gaussian, Fisher, Student, Chi-square)
Preliminary definitions
Random experiment
A random experiment is one
whose result cannot be (completely) predicted in advance,
which, if repeated several times under identical conditions, can give rise to different results.
Some examples?
Throwing a die
Flipping a coin
Assessing whether a patient treated with a new drug will recover or not (or will develop side effects or not)
Evaluating, at the time of diagnosis of a disease, what will be the evolution of the patient’s symptoms
…
Random variable
An abstract way to talk about the outcomes of random experiments
Random variable (\(X\)):
process that relates the random experiment to a value
characteristic that is being measured in the random experiment
something whose value cannot be known in advance
Examples …
Realization of a random variable
Realization (equivalently: observation):
value obtained for the characteristic (random variable) of interest \(X\) at the end of a random experiment
what the experimenter actually observes
Note that you cannot observe a random variable \(X\) itself …
Examples …
Domain
The domain of a random variable \(X\) is the set of values that \(X\) can take after the experiment has been carried out (set of possible values for \(X\)).
The domain of a random variable \(X\) is often denoted \(D_X\).
Examples …
Remark: \(D_X\) is different depending on the nature of \(X\).
Qualitative r.v.: discrete set corresponding to the set of \(X\)’s possible modalities
Discrete quantitative r.v.: (finite or infinite) discrete set of numerical values
Continuous quantitative r.v.: open or closed interval (e.g. the set of real numbers, the set of positive real numbers, the set of real numbers between 0 and 1).
Characterizing random variables
Distribution of a random variable
The (probability) distribution of r.v. \(X\) indicates how the values taken by the different individuals in the population under study are distributed (do some realizations occur more or less often than others?).
It is defined differently depending on whether \(X\) is discrete or continuous.
P.d. of a discrete r.v.
For discrete r.v., we can enumerate all of its possible realizations and associate a specific probability with each possible realization.
Definition: Let \(J\) be the size of the domain of definition of \(X\), \(a_1\), \(a_2\), \(\ldots\), \(a_J\) be the possible values of \(X\) and \(p_j = P(X = a_j)\), \(j = 1,\ldots,J\) be the set of the probabilities associated with the different values in \(D_X\). Then, the probability distribution of \(X\) is the set \(\{(a_j ; p_j)\}_{j = 1,\ldots,J}\)
✏️ Examples
Properties:
\(P(X=x) \in [0,1]\) for any \(x \in D_X\)
\(\sum_{j=1}^{J} p_j = 1\)
P.d. of a continuous r.v.
Impossible to list all the possible realizations of a continuous r.v.
Impossible to associate a specific probability with each possible realization
\(\Rightarrow\) We need a different concept to characterize the distribution of the r.v.
The density function of a continuous r.v. \(X\) is the function \(f(\cdot)\) that associates a probability with each range of realizations of \(X\).
Properties:
\(f(x) \geq 0\) for any \(x \in D_X\)
\(P(a<X<b) = \int_{a}^b f(x)dx \geq 0\) for any \(a, b \in D_X\) s.t. \(a<b\)
\(\int_{D_X} f(x)dx = 1\)
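These properties can be checked numerically, e.g. for the standard Gaussian density (a sketch using R's integrate()):
integrate(dnorm, lower = -Inf, upper = Inf)  # total probability mass: 1
integrate(dnorm, lower = 0, upper = 1.96)    # P(0 < X < 1.96)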
P.d. of a continuous r.v.
✏️ Graphical examples and different interpretations as area under the curve
Cumulative distribution function
⚠️ For quantitative r.v.
Definitions
The cumulative distribution function (c.d.f.) of r.v. \(X\) is the function \(F_X(\cdot)\) defined as \[
F_X(x) = P(X \leq x), \; x \in \mathbf{R}
\]
The quantile of order \(p\) (\(0 \leq p \leq 1\)) of a r.v. \(X\) is the value \(x_p\) s.t. \[
P(X \leq x_p) = p,
\]i.e., s.t. \[
F_X(x_p) = p.
\]
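In R, each classical distribution comes with its c.d.f. (prefix p) and its quantile function (prefix q); the two are inverses of each other. A sketch for the standard Gaussian:
pnorm(1.96)         # F_X(1.96), approximately 0.975
qnorm(0.975)        # quantile of order 0.975, approximately 1.96
pnorm(qnorm(0.4))   # returns 0.4: F_X(x_p) = p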
Cumulative distribution function: properties
(✏️ Graphical illustration of the following properties)
A c.d.f. is defined on \(\mathbf{R}\) and has values in \([0,1]\): it defines a probability!
A distribution function is a non-decreasing function (increasing in the broad sense).
Variance: definition
if \(X\) is a discrete r.v.: \(V(X) = \sum_{j=1}^{J} (a_j-\mu)^2 p_j\)
if \(X\) is a continuous r.v.: \(V(X) = \int_{D_X} (x-\mu)^2 f_X(x) dx\)
where \(\mu=E(X)\).
Variance: rules
Let \(a \in \mathbf{R}\) be a constant and \(X\) (\(X_1\), \(X_2\)) any quantitative r.v.
\(V(a) = 0\)
\(V(aX) = a^2 V(X)\)
\(V(a+X) = V(X)\)
\(V(X_1+X_2)= \ldots\) (It depends!)
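These rules also hold exactly for the sample variance, which gives a quick sanity check in R (a = 3 is arbitrary):
set.seed(1)
x <- rnorm(100)        # any numeric vector works here
var(3 * x) / var(x)    # = 9 = a^2, illustrating V(aX) = a^2 V(X)
var(3 + x) - var(x)    # = 0, illustrating V(a + X) = V(X)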
Remarks
The standard deviation of a (quantitative) r.v. \(X\) is the square-root of its variance (\(\sqrt{V(X)}\)).
📝 Exercise (centering and standardizing) : Let \(X\) be a r.v. with mean \(\mu\) and variance \(\sigma^2\) and define the transformed variable \[Y = \frac{X-\mu}{\sigma}\] What are the expected value and the variance of \(Y\)?
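A possible solution sketch, using the rules above: \[
E(Y) = \frac{E(X)-\mu}{\sigma} = 0, \qquad V(Y) = \frac{1}{\sigma^2} V(X-\mu) = \frac{\sigma^2}{\sigma^2} = 1.
\]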
Some common distributions
Bernoulli distribution
To describe an experiment whose result can take only two values, called, by convention, success and failure.
Ex.: a candidate passes or fails an examination, a sick patient is cured or not, \(\ldots\)
\[X \sim \mathcal{B}(p), \, 0 \leq p \leq 1\]
Domain: \(D_X = \{0,1\}\)
Probability distribution: \(P(X=0)=1-p\) , \(P(X=1)=p\)
Expectation: \(E(X)=p\)
Variance: \(V(X)=p(1-p)\)
Binomial distribution
To describe an experiment which consists in counting the number of successes after performing \(n\) independent Bernoulli trials with the same probability \(p\).
\[X \sim \mathcal{B}(n,p), \, n \in \mathbf{N}, \, 0 \leq p \leq 1\]
Domain: \(D_X = \{0,1,\ldots,n\}\)
Probability distribution: \(P(X=k)=\binom{n}{k} p^k(1-p)^{n-k}\), where \(\binom{n}{k}=\frac{n!}{k!\,(n-k)!}\)
Expectation: \(E(X)=n p\)
Variance: \(V(X)=n p(1-p)\)
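A sketch checking the pmf formula against R's dbinom() (n = 10, p = 0.3, k = 3 are arbitrary):
n <- 10; p <- 0.3; k <- 3
choose(n, k) * p^k * (1 - p)^(n - k)  # pmf from the formula
dbinom(k, size = n, prob = p)         # same value
sum(dbinom(0:n, n, p) * (0:n))        # expectation: n*p = 3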
Poisson distribution
To describe counts.
\[X \sim \mathcal{P}(\lambda), \lambda >0\]
Domain: \(D_X = \mathbf{N} = \{0,1,\ldots\}\)
Probability distribution: \(P(X=k)=e^{-\lambda} \frac{\lambda^k}{k!}\)
Expected value: \(E(X)=\lambda\)
Variance: \(V(X)=\lambda\)
Figure 2: Poisson distributions for (a) \(\mathcal{P}(1.5)\), (b) \(\mathcal{P}(10)\), (c) \(\mathcal{P}(50)\)
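Similarly with dpois(); λ = 10 below matches panel (b) of Figure 2:
lambda <- 10
dpois(3, lambda)                     # P(X = 3) from the formula above
sum(dpois(0:100, lambda) * (0:100))  # expectation: approximately lambda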
Gaussian distribution
\[X\sim \mathcal{N}(\mu,\sigma^2)\]
Domain: \(D_X = \mathbf{R}\)
Density function: \(f_X(x) = (2\pi \sigma^2)^{-1/2} \exp\left(-\frac{1}{2\sigma^2 }(x-\mu)^2\right)\)
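A sketch comparing this formula with R's dnorm() and drawing the density (μ = 0 and σ² = 1 are arbitrary):
mu <- 0; sigma <- 1; x0 <- 1.5
(2 * pi * sigma^2)^(-1/2) * exp(-(x0 - mu)^2 / (2 * sigma^2))  # density from the formula
dnorm(x0, mean = mu, sd = sigma)                               # same value
curve(dnorm(x), from = -4, to = 4, ylab = "density")           # bell-shaped curve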
Two random variables \(X_1\) and \(X_2\) are independent when knowing the value taken by \(X_1\) gives no information about the value that will be taken by \(X_2\) and vice versa.
Examples\(\ldots\)
Covariance
Measures the linear association between two (quantitative) variables.
Consider two r.v. \(X_1\) and \(X_2\). The covariance between \(X_1\) and \(X_2\) is defined by: \[
cov(X_1,X_2) = E[(X_1-E(X_1))(X_2-E(X_2))] = E(X_1 X_2)- E(X_1)E(X_2)
\]
Rules
If two r.v. are independent, then their covariance is zero. But the converse is false!
\(cov(X,X) = V(X)\)
\(cov(aX+bY,cZ+dT) = ac cov(X,Z) + ad cov(X,T) + bc cov(Y,Z) + bd cov(Y,T)\)
Consequence: if \(X_1\) and \(X_2\) are independent, then \(V(X_1 + X_2) = V(X_1) + V(X_2)\)
Linear correlation coefficient
Consider two (quantitative) r.v. \(X_1\) and \(X_2\).
The linear correlation coefficient between \(X_1\) and \(X_2\) is defined by: \[
\rho(X_1, X_2) = \frac{cov(X_1, X_2)}{\sqrt{V(X_1)}\sqrt{V(X_2)}}
\]
Remarks :
\(-1 \leq \rho(X_1, X_2) \leq 1\).
\(\rho(X_1, X_2)\) only characterizes the linear relationship between \(X_1\) and \(X_2\).
✏️ Graphical considerations \(\ldots\)
Joint distributions
Discrete r.v.
Joint distribution: set of values \(p_{ij}\) defined as: \[p_{ij} = P(X=x_i, Y=y_j)\]Remark: \(\sum_{i,j} p_{ij} = 1\)
If \(X\) and \(Y\) are independent, then \[P(X=x_i \cap Y=y_j) = P(X=x_i) P(Y=y_j)\]
Continuous r.v.
If \(X\) and \(Y\) are independent, then the joint density factorizes as \[f(x,y) = h(x) g(y)\] where \(h\) and \(g\) denote the marginal densities of \(X\) and \(Y\).
Central limit theorem
Suppose \(X_{1},\ldots ,X_{n}\) is a sequence of i.i.d. random variables with \(\mathbb{E}[X_{i}]=\mu\) and \(V[X_{i}]=\sigma ^{2}<\infty\). Then as \(n\) approaches infinity, the random variables \(\sqrt {n}({\bar {X}}_{n}-\mu )\) converge in distribution to a normal \(\mathcal {N}(0,\sigma ^{2})\): \[{{\sqrt {n}}\left({\bar {X}}_{n}-\mu \right)\ \xrightarrow {d} \ {\mathcal {N}}\left(0,\sigma ^{2}\right).}\]
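A simulation sketch of the theorem: exponential variables (an arbitrary, markedly non-Gaussian choice, with \(\mu = \sigma^2 = 1\)) have a standardized mean that is approximately Gaussian for large \(n\):
set.seed(123)
n <- 200
z <- replicate(5000, sqrt(n) * (mean(rexp(n, rate = 1)) - 1))  # sqrt(n)(X_bar - mu)
hist(z, breaks = 50, freq = FALSE)
curve(dnorm(x, mean = 0, sd = 1), add = TRUE)                  # limiting N(0, sigma^2)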
Statistical inference
Outline
The goal is to introduce tools to draw conclusions about a population from observations collected on a subset of individuals in that population (sample).
Estimation.
Confidence intervals.
Statistical tests.
This requires the definition of a model, i.e. the specification of a random distribution for the data, and possibly an equation linking the parameters of the distribution of the variable to be explained to the explanatory variables.
Examples
▶️ We are interested in estimating the proportion of patients experiencing side effects after taking a new medication.
Denote by \(X\) the number of patients among those of an \(n\)-sample that experience a side effect. We can assume that \[X \sim \mathcal{B}(n,p)\]
\(p\) is unknown and represents the probability to experience a side effect. The \(n\)-sample is used to draw conclusions about \(p\).
▶️ We are interested in estimating the average cholesterol level in the population of people suffering from a given disease.
Denote by \((X_i)_{1 \leq i \leq n}\) the cholesterol levels for each individual in an \(n\)-sample.
It can be assumed that \((X_i)_{1 \leq i \leq n}\) are mutually independent such that:
\[X_i \sim \mathcal{N}(\mu,\sigma^2)\]
\(\mu\) and \(\sigma^2\) are unknown. \(\mu\) represents the average cholesterol level in the population of interest. The \(n\)-sample is used to draw conclusions about \(\mu\).
Sum of independent Gaussian r.v.’s: Consider independent \(X_i \sim \mathcal{N}(\mu_i,\sigma_i^2)\), \(i=1,\ldots,n\), and define \(Z=\sum_{i} X_i\). Then, \[Z \sim \mathcal{N}\left(\sum_{i}\mu_i,\sum_{i}\sigma_i^2\right)\]
Estimator, estimation
An estimator \(T_n\) of a parameter is a random variable defined as a function of the random variables from the statistical model, i.e. \[T_n = f(X_1,\ldots,X_n)\] An estimator is a random variable and is therefore characterized by a probability distribution.
The estimation is a realization of the random variable \(T_n\) calculated from the observations in the sample: \[t_n = f(x_1,\ldots,x_n)\]
A simple example
Let’s consider a Gaussian model, i.e. \(X_i \underset{i.i.d.}{\sim} \mathcal{N}(\mu,\sigma^2)\), and \[\bar{X_n}=\dfrac{1}{n}\sum\limits_{i=1}^{n} X_i\]
Is \(\bar{X_n}\) a random variable or a constant?
What are the distribution, the expected value and the variance of \(\bar{X_n}\)?
Why would \(\bar{X_n}\) be a good estimator for \(\mu\)?
Desired properties of an estimator: unbiased, small variance,…
How are estimators defined in practice?
The maximum likelihood estimation (MLE) method can be used to estimate the parameters of a model.
The idea is to find the value of the unknown parameter(s) that maximize the probability (i.e. the likelihood) of the data being observed (given the model).
📝 Example: Derivation of \(\bar{X_n}\) as the maximum likelihood estimator of \(\mu\) in the Gaussian model.
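A sketch of this derivation with \(\sigma^2\) treated as known: the log-likelihood and its first-order condition are \[
\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2,
\qquad
\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0
\;\Rightarrow\; \hat{\mu} = \bar{x}_n.
\]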
Confidence intervals
An estimation is not the true parameter value! We can only hope that the estimated value is close to the true parameter value.
Due to sampling variability, the estimation of a given parameter varies from sample to sample.
A confidence interval is a range of values that contains the true parameter value with a given probability \(1-\alpha\). \(1-\alpha\) is called the confidence level.
Building confidence intervals
These items describe the generic steps in the construction of a confidence interval. We explain them here in the Gaussian model.
Transform the estimator into a pivotal statistic \(T_n\) whose distribution does not depend on the unknown parameters (i.e. is entirely known). Think about centering and standardizing \(\bar{X}_n\).
Represent the distribution of \(T_n\) and place the quantiles of orders \(\alpha/2\) and \(1-\alpha/2\) (\(u_{\alpha/2}\) and \(u_{1-\alpha/2}\)) on this representation.
What is the value of \(P(u_{\alpha/2} \leq T_n \leq u_{1-\alpha/2})\)?
Deduce \(A_n\) and \(B_n\) such that \(P(A_n \leq \mu \leq B_n) = 1-\alpha\).
\([A_n,B_n]\) is called a probability interval. The confidence interval is the realization of this probability interval calculated from the sample.
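In the known-\(\sigma\) case, these steps yield \(\left[\bar{X}_n - u_{1-\alpha/2}\,\sigma/\sqrt{n},\; \bar{X}_n + u_{1-\alpha/2}\,\sigma/\sqrt{n}\right]\); a minimal R sketch, in which the value of sigma is an assumption for illustration:
xbar <- mean(data$cholesterol)  # point estimate
n <- length(data$cholesterol)
sigma <- 80                     # known standard deviation (assumed value)
alpha <- 0.05
xbar + c(-1, 1) * qnorm(1 - alpha/2) * sigma / sqrt(n)  # realized interval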
What if \(\sigma^2\) is not known?
Important probability distributions and results
Chi-square distribution: Let \(X_1\), \(X_2\), …, \(X_n\) be \(n\) independent r.v. identically distributed according to \(\mathcal{N}(0,1)\), and let \(U = X_1^2 + X_2^2 + \ldots + X_n^2\).
Then \[U \sim \chi^2(n)\]
Figure 4: \(\chi^2(10)\)
Important property: Let \(X_1\), \(X_2\), \(\ldots\), \(X_n\) be \(n\) independent r.v. identically distributed according to \(\mathcal{N}(\mu,\sigma^2)\). Then \[\frac{1}{\sigma^2}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \sim \chi^2(n-1)\]
Important probability distributions and results
Student distribution: Let \(X \sim \mathcal{N}(0,1)\) and \(U \sim \chi^2(\nu)\) be independent, and let \(T = X/\sqrt{U/\nu}\). Then \[T \sim \mathcal{T}(\nu)\]
Fisher distribution: Let \(U_1\) and \(U_2\) be independent r.v. s.t. \(U_1 \sim \chi^2(\nu_1)\) and \(U_2 \sim \chi^2(\nu_2)\), and let \(F = (U_1/\nu_1)/(U_2/\nu_2)\). Then \[F \sim \mathcal{F}(\nu_1,\nu_2)\]
Figure 5: densities of (a) \(\mathcal{T}(1)\), (b) \(\mathcal{T}(10)\), (c) \(\mathcal{F}(3,10)\); dotted curve: \(\mathcal{N}(0,1)\)
What if \(\sigma^2\) is not known?
Consider \(X_i \underset{i.i.d.}{\sim} \mathcal{N}(\mu,\sigma^2)\). The usual estimator for \(\sigma^2\) is \[S_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X_n})^2\] It is unbiased and satisfies \[\frac{n-1}{\sigma^2} S_{n-1}^2 \sim \chi^2(n-1)\]
What if \(\sigma^2\) is not known?
Instead of using \(T_n = \frac{\bar{X_n}-\mu}{\sigma/\sqrt{n}}\), we use \[T_n^{(2)} = \frac{\bar{X_n}-\mu}{S_{n-1}/\sqrt{n}} \sim \mathcal{T}(n-1)\] as a pivotal statistic.
Justify that \(T_n^{(2)}\) is a pivotal statistic.
Repeat the steps detailed above to derive the confidence interval of \(\mu\) at confidence level \(1-\alpha\).
m <- mean(data$cholesterol)
s <- sd(data$cholesterol)
c(m - s/sqrt(10)*qt(0.975, 9), m + s/sqrt(10)*qt(0.975, 9))
[1] 287.4241 406.1759
t.test(data$cholesterol,conf.level=0.95)
One Sample t-test
data: data$cholesterol
t = 13.213, df = 9, p-value = 3.378e-07
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
287.4241 406.1759
sample estimates:
mean of x
346.8
Statistical tests
Statistical hypothesis: any statement concerning a characteristic of the population (i.e. a parameter of the model),
Null hypothesis (\(H_0\)): the hypothesis according to which the value of the parameter of interest is equal to a reference value.
Alternative hypothesis (\(H_1\)): must reflect an incompatibility with \(H_0\), for example, its opposite.
one-sided test: \(H_0\): \(\mu = \mu_0\) vs \(H_1\): \(\mu < \mu_0\)
The approach of statistical testing consists in choosing between \(H_0\) and \(H_1\) the most probable hypothesis in view of the observations contained in the sample.
✏️ Example. We wonder if the average cholesterol level of people with this disease is equal to the reference value of 400. How would you formulate \(H_0\) and \(H_1\)?
Test statistic
✏️ Still consider \(X_i \underset{i.i.d.}{\sim} \mathcal{N}(\mu,\sigma^2)\) and the following hypotheses: \[
H_0 : \mu = \mu_0 \; vs \; H_1 : \mu \neq \mu_0
\]
Recall the expression for the mean estimator.
What would be its distribution if \(H_0\) were true?
Consider \[T_n^{(2)} = \sqrt{n} \frac{\bar{X_n}-\mu_0}{S_{n-1}}.\] What would be the distribution of \(T_n^{(2)}\) if \(H_0\) were true? What are the specificities of this distribution?
➡️ \(T_n^{(2)}\) is the test statistic. The construction of the decision rule that allows one to choose between \(H_0\) and \(H_1\) given the data is based on this random variable (rejection area, critical value, p-value).
Decision rule
When performing a statistical test, it is always assumed that \(H_0\) is true.
Then, a significance level \(\alpha\) is chosen. It represents the proportion of the test statistic's values that would be least compatible with \(H_0\) and that would lead to the decision to reject \(H_0\).
✏️ Exercise:
Plot the distribution of \(T_n^{(2)}\) under \(H_0\) and represent \(\alpha\) and the rejection zone.
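One possible sketch of this plot, taking n = 10 (hence a \(\mathcal{T}(9)\) distribution, as in the cholesterol example) and α = 0.05:
alpha <- 0.05; df <- 9
curve(dt(x, df), from = -4, to = 4, ylab = "density")  # distribution of T under H0
crit <- qt(1 - alpha/2, df)                            # critical value
abline(v = c(-crit, crit), lty = 2)                    # two-sided rejection zone: |T| > crit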
t.test(data$cholesterol, mu=400)
One Sample t-test
data: data$cholesterol
t = -2.0269, df = 9, p-value = 0.0733
alternative hypothesis: true mean is not equal to 400
95 percent confidence interval:
287.4241 406.1759
sample estimates:
mean of x
346.8
Discussion
p-value concept
What would have changed if another alternative hypothesis had been considered?
t.test(data$cholesterol, mu=400, alternative="greater")
One Sample t-test
data: data$cholesterol
t = -2.0269, df = 9, p-value = 0.9633
alternative hypothesis: true mean is greater than 400
95 percent confidence interval:
298.6855 Inf
sample estimates:
mean of x
346.8
Errors and power of a test
type 1 error: rejecting \(H_0\) when \(H_0\) is actually true; its probability is the significance level \(\alpha\)
type 2 error: not rejecting \(H_0\) when \(H_1\) is actually true; its probability is denoted \(\beta\)
power: the probability of rejecting \(H_0\) when \(H_1\) is actually true, i.e. \(1-\beta\)
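These quantities can be explored with the built-in stats::power.t.test(); in this sketch the effect size delta and standard deviation sd are assumed values:
power.t.test(n = 10, delta = 50, sd = 83, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")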
Principal Component Analysis and data clustering
Principal Component Analysis (PCA)
Principal Component Analysis: a multidimensional exploratory analysis method
Objectives
Represent the observations of \(p\) variables (\(p>3\)) in a space of reduced dimension (typically in a plane or a 3-dimensional space) while limiting the loss of information,
Summarize the information in a synthetic way: how to keep the essential information of the data?
Tools
No prior statistical assumption
Matrix calculus, eigenvectors and eigenvalues
Data
PCA can only be applied to quantitative variables observed on the same individuals.
Example
Cheese is made from cow’s, goat’s or sheep’s milk. It is a source of protein and calcium, but also of vitamin A (retinol) and vitamin B9 (folic acid). The objective of this study is to specify the nutritional value of the most common cheeses.
Which cheeses have the same nutritional profile, and which ones have different profiles? Can we highlight the main dimensions of nutritional variability of this set of cheeses?
Associate similar cheeses with a particular nutritional profile.
Centering data: \(\tilde{X} = (x_{ij}-\bar{x}_{\cdot j})_{1\leq i \leq n , 1\leq j \leq p}\)
Centering and standardizing data: \(\tilde{X} = \left(\frac{x_{ij}-\bar{x}_{\cdot j}}{\sigma_j}\right)_{1\leq i \leq n , 1\leq j \leq p}\)
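In R, both transformations correspond to scale(); note that scale() divides by the corrected standard deviation rather than \(\sigma_j\), a minor difference. A sketch on a generic numeric data table X:
X_centered <- scale(X, center = TRUE, scale = FALSE)  # centering only
X_std      <- scale(X, center = TRUE, scale = TRUE)   # centering and standardizing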
Why standardize?
The variables are of different natures; for example: age, salary, number of children
it is felt that their respective influences in the analysis should not depend on their variability; for example in the analysis of grades: should a subject with a high variance have a greater influence in the analysis?
Center of gravity of the cloud: \[
g = (\bar{x}_{\cdot 1},\ldots,\bar{x}_{\cdot p})
\]
The variance of the \(j\)-th variable is given by \[\sigma^2_{j} = \frac{1}{n} \sum\limits_{i=1}^{n} (x_{ij}-\bar{x}_{\cdot j})^2\]
Standardized case: \(I_{ind} = p\)
Remark: the influence of each variable differs in the non-standardized case
The space (the cloud) of variables
Geometrically, the quantitative variables in the data table \(X\) correspond to \(p\) points of \(\mathbf{R}^n\).
Scalar product: \[
< x_{\cdot j_1}, x_{\cdot j_2} > = \frac{1}{n} \sum_{i=1}^{n} x_{i j_1} x_{i j_2}
\]\(\hookrightarrow\) Expresses the correlation between the two variables (standardized case). It is also a cosine.
The distance between two variables is given by the norm associated with this scalar product \[
d_{var}(j_1,j_2) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij_1}-x_{ij_2})^2} = \mid\mid x_{\cdot j_1} - x_{\cdot j_2 } \mid\mid_n
\]
If \(p > 3\), the individuals’ cloud in \(\mathbf{R}^p\) is not representable.
The PCA method produces an approximate representation in a low-dimensional space (usually 2 or 3) by constructing, from the \(p\) (correlated) initial variables, \(q<p\) new variables called principal components.
\(\Rightarrow\) in this new reduced space, the cloud is more easily represented and the analysis is easier.
The principal components (\(p\) in total) are constructed by matrix algebra operations such that
Each principal component is a linear combination of the \(p\) initial variables
The principal components are orthogonal to each other (i.e. uncorrelated)
Each principal component carries a certain proportion of the total inertia \(\rightarrow\) minimize the loss of information
If \(I_k\) is the inertia carried by the \(k\)th principal component, then
\[
I_1 > I_2 > \ldots > I_p
\]
and
\[
I_1 + I_2 + \ldots + I_p = I_T
\]
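A minimal sketch with R's prcomp(); the data frame name cheese is an assumption:
res <- prcomp(cheese, center = TRUE, scale. = TRUE)  # standardized PCA
summary(res)                     # proportion of inertia carried by each component
screeplot(res, type = "lines")   # eigenvalue scree, used by the elbow criterion below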
Choosing the number of components
Goal:
Choose \(q < p\) in order to keep the maximum inertia with the minimum number of variables.
Several strategies:
Elbow criterion: keep the axes whose eigenvalues are located before the drop on the eigenvalue scree.
Kaiser criterion: only axes with inertia greater than the average inertia \(I_T /p\) are kept.
Remark: these criteria can lead to different choices
Studying the correlations:
between variables \(\rightarrow\) identify groups of correlated variables
between principal components and initial variables \(\rightarrow\) interpret the new variables resulting from the PCA
How?
The better a variable is represented, the closer it is to the circle.
For two variables perfectly represented on the plane, the cosine of the angle formed by these 2 variables is equal to the correlation coefficient between these two variables.
Similarly for the angle formed by a principal component and a variable perfectly represented on the plane.
Representation of variables
What can we say about the relationships between variables?
What is the concrete meaning of the first two principal components?
Representation of individuals
Interpreting the position of individuals
contributes to the understanding of what the principal components represent
allows one to identify groups of individuals who are similar or, on the contrary, distant
How?
An individual is well represented on the factorial plane if the angle formed by the individual with the axes forming this plane is small (cosine squared criterion close to 1).
The distance between two individuals on the plane is only interpretable if they are well represented:
if two well-represented individuals are close, they are similar individuals in the initial space
if two well-represented individuals are distant, they are individuals who were distant in the initial space
if one of the two individuals is misrepresented, nothing can be said
\(n_k\): number of individuals in the group \(C_k\)
\(I_{intra}\): intra-group inertia (dispersion of the points around their center in the different groups)
\(I_{inter}\): between-group inertia (separability of the groups)
Remark: this decomposition is only valid if the chosen dissimilarity measure is the Euclidean distance.
Objective of the classification:
\[
I_T = I_{intra} + I_{inter}
\]
For a given number of groups \(K\), maximize the inter-group inertia (equivalently, minimize the intra-group inertia).
Remark: \(I_T\) is constant for a given dataset.
Difficulty:
we can’t evaluate the complete set of possible classifications (of the order of \(K^n\))
we need to use clever algorithms for exploring the solution space
HAC
Hierarchical Ascending Classification (also known as hierarchical agglomerative clustering)
Ascending method: agglomeration of groups
Hierarchical method: the successive partitions are nested within each other
Algorithm:
Initialization: each individual is a group (\(I_{intra} = 0\))
One iteration: merging two groups
End: the \(n\) individuals form a single group (\(I_{intra} = I_T\))
Choosing the groups to merge
Merging two groups is associated with an increase in intra-group variability; it is desired that this increase be as small as possible.
Increase of the intra-group variability induced by the merging of the \(C_k\) and \(C_l\) groups \[
\begin{eqnarray*}
\Delta & = & I_{intra} (C_k \cup C_l) - I_{intra} (C_k, C_l)\\
& = & \frac{n_k n_l}{n_k + n_l} d^2(g_k,g_l)
\end{eqnarray*}
\]
Dissimilarity between groups (Ward’s distance) \[
d^2(C_k,C_l)=\frac{n_k n_l}{n_k + n_l} d^2(g_k,g_l)
\]
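In R, this corresponds to hclust() with Ward's criterion; a sketch, assuming X is a table of (standardized) quantitative variables:
d <- dist(X)                           # Euclidean distances between individuals
tree <- hclust(d, method = "ward.D2")  # Ward's criterion on squared distances
plot(tree)                             # dendrogram: the nested partitions
cutree(tree, k = 3)                    # cut into K = 3 groups (K is arbitrary)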