1 Rappels d’Algèbre Linéaire

Soient $m,n\geq 1$. Une matrice $\boldsymbol{A}$ de taille $(m,n)$ ou $m\times n$ (à coefficients réels) est une application de $\{1,\dots,m\} \times \{1,\dots,n\}$ dans $\mathbb{R}$.

Plus simplement, il s’agit d’un tableau de nombres rééls ayant $m$ lignes et $n$ colonnes. On note $A_{ij}$ l’élément sur la ligne $i$ et sur la colonne $j$ de $\boldsymbol{A}$.

A = matrix(sample(1:8),2,4)
A

##      [,1] [,2] [,3] [,4]
## [1,]    7    6    5    4
## [2,]    8    3    2    1

A est une matrice $2\times4$ et $A_{1,3}$ (par exemple) vaut 5 - correspondant à A[1,3].

dim(A)

## [1] 2 4

A[1,3]

## [1] 5

A[1,]

## [1] 7 6 5 4

A[,3]

## [1] 5 2

Soit $\boldsymbol{A}$ une matrice réelle de taille $(m,n)$. La matrice transposée notée $\boldsymbol{A}^\top$ de taille $(n,m)$ est définie par $(\boldsymbol{A}^\top)_{ij}=A_{ji}$ pour $i=1,\dots,n$ et $j=1,\dots,m$.

(bien sûr $(\boldsymbol{A}^\top)^\top=\boldsymbol{A}$)

t(A)

##      [,1] [,2]
## [1,]    7    8
## [2,]    6    3
## [3,]    5    2
## [4,]    4    1

Une matrice carrée $\boldsymbol{A}$ de taille $(n,n)$ est dite symétrique si $\boldsymbol{A}=\boldsymbol{A}^\top$.

Si $\boldsymbol{x}$ et $\boldsymbol{y}$ sont deux vecteurs de dimension $n$ (deux matrices de taille $(n,1)$), on note $\boldsymbol{x}^\top y$ le produit scalaire entre $\boldsymbol{x}$ et $\boldsymbol{y}$ défini par $\boldsymbol{x}^\top \boldsymbol{y}=\sum_{i=1}^n x_i y_i$.

x = A[1,]
y = A[2,]
t(x)%*%y

##      [,1]
## [1,]   88

On note $\| \boldsymbol{x}\|$ la norme euclidienne du vecteur $\boldsymbol{x}$, donnée par $\|\boldsymbol{x}\|^2 = \boldsymbol{x}^\top \boldsymbol{x} = \sum x_i^2$.

## [1] 7 6 5 4

sum(x^2)

## [1] 126

Pour une matrice carrée de taille $n$, on appelle diagonale de $\boldsymbol{A}$, notée $\operatorname{diag}(\boldsymbol{A})$ la matrice de taille $(n,n)$ définie par $(\operatorname{diag}(\boldsymbol{A}))_{ii}=A_{ii}$ pour $i=1,\dots,n$ et $0$ sinon.

Soient $\boldsymbol{A}$ et $\boldsymbol{B}$ deux matrices réelles de taille $(m,n)$ et $\alpha,\beta\in R$ alors la matric $\boldsymbol{C}= \alpha \boldsymbol{A}+\beta\boldsymbol{B}$ est une matrice réelle de taille $(m,n)$ et donnée par $C_{ij} = \alpha A_{ij} +\beta B_{ij}$, $i=1,\dots,m$, $j=1,\dots,n$.

Soient $\boldsymbol{A}$ et $\boldsymbol{B}$ deux matrices réelles de taille $(m,n)$ et $(n,p)$ respectivement alors la matrice $\boldsymbol{C}$ de taille $(m,p)$ définie par le produit matriciel entre $\boldsymbol{A}$ et $\boldsymbol{B}$ est bien définie et on a $\boldsymbol{C}=\boldsymbol{A} \boldsymbol{B}$, $C_{ij} = \sum_{k=1}^n A_{ik} B_{kj}$ pour $i=1,\dots,m$, $j=1,\dots,p$.

Le produit matriciel n’est pas commutatif pour deux matrices quelconque de même taille: $\boldsymbol{A} \boldsymbol{B} \neq \boldsymbol{B} \boldsymbol{A}$.

Soit $\mathbb{I}_n$ la matrice de taille $(n,n)$ composée de 1 sur la diagonale et de 0 ailleurs. Alors, pour $\boldsymbol{A}$ de taille $(n,n)$, $\mathbb{I}_n$ est l’élément neutre tel que $\boldsymbol{A} \mathbb{I}_n = \mathbb{I}_n \boldsymbol{A} = \boldsymbol{A}$.

Soient $\boldsymbol{A}$, $\boldsymbol{B}$ et $\boldsymbol{C}$ trois matrices réelles de dimension concordante, alors

$(\boldsymbol{A} \boldsymbol{B}) \boldsymbol{C} = \boldsymbol{A} ( \boldsymbol{B} \boldsymbol{C})$ (associativité du produit)
$\boldsymbol{A}(\boldsymbol{B} +\boldsymbol{C}) = \boldsymbol{A} \boldsymbol{B} + \boldsymbol{A} \boldsymbol{C}$ (distributivité du produit)
$(\boldsymbol{A} \boldsymbol{B})^\top = \boldsymbol{B}^\top \boldsymbol{A}^\top$.

(A = matrix(1:4,nrow=2))

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

(B = rbind(c(1,3),c(-2,1)))

##      [,1] [,2]
## [1,]    1    3
## [2,]   -2    1

(C = cbind(c(3,-1),c(-1,2)))

##      [,1] [,2]
## [1,]    3   -1
## [2,]   -1    2

A%*%B

##      [,1] [,2]
## [1,]   -5    6
## [2,]   -6   10

A%*%(B+C)

##      [,1] [,2]
## [1,]   -5   11
## [2,]   -4   16

(A%*%B + A%*%C)

##      [,1] [,2]
## [1,]   -5   11
## [2,]   -4   16

Une matrice symétrique réelle $\boldsymbol{A}$ de taille $(n,n)$ est dite définie positive (resp. semi-définie positive) si $\forall \boldsymbol{x} \in \mathbb{R}^n$, $\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x} >0$ (resp. $\geq 0$)

Toute matrice $\boldsymbol{A}$ qui peut s’écrire sous la forme $\boldsymbol{A} =\boldsymbol{B}^\top \boldsymbol{B}$ est semi-définie positive positive

Soit $\boldsymbol{A}$ une matrice carrée de taille $(n,n)$. La trace d’une matrice est définie par \[ \operatorname{trace} (\boldsymbol{A}) = \sum_{i=1}^n A_{ii} \] On a

pour trois matrices $\boldsymbol{A},\boldsymbol{B},\boldsymbol{C}$, $\operatorname{trace}(\boldsymbol{A}\boldsymbol{B}\boldsymbol{C}) = \operatorname{trace}(\boldsymbol{B}\boldsymbol{C}\boldsymbol{A}) \neq \operatorname{trace}(\boldsymbol{B}\boldsymbol{A}\boldsymbol{C})$.
$\operatorname{trace}(\boldsymbol{A}^\top \boldsymbol{A}) \geq 0$
$\operatorname{trace}(\boldsymbol{A}^\top \boldsymbol{A}) = 0 \Leftrightarrow \boldsymbol{A}^\top \boldsymbol{A} = \mathbf 0 \Leftrightarrow \boldsymbol{A} =\mathbf 0.$

Le déterminant d’une matrice carrée $\boldsymbol{A}$ de taille $(n,n)$ est noté $\mathrm{det}(\boldsymbol{A})$ ou $|\boldsymbol{A}|$ est le réel dont la valeur absolue mesure le volume du parallélépipède engendré par les colonnes de $\boldsymbol{A}$. La formule générale est \[ |\boldsymbol{A}| = \sum_{\sigma \in S_n} \mathrm{sgn}(\sigma) \prod_{i=1}^n A_{i\sigma_i} \] où $S_n$ est l’ensemble des permutations de $\{1,\dots,n\}$ et où $\mathrm{sgn}(\sigma)$ désigne la signature de $\sigma \in \{-1,1\}$ (la signature d’une permutation vaut 1 si le nombre de transpositions est pair et vaut -1 sinon).

A = cbind(c(1,0,2),c(4,-1,3),c(0,5,8))
det(A)

## [1] 17

(1)^1*det(A[-1,-1]) + 0*(-1)^1 *det(A[-2,-1]) + (-1)^2 *2 * det(A[-3,-1])

## [1] 17

(1)^1*det(A[-1,-1]) + 4*(-1)^1 *det(A[-1,-2]) + (-1)^2 *0 * det(A[-1,-3])

## [1] 17

Soit $\boldsymbol{A}^{ij}$ la matrice de taille $(n-1,n-1)$ correspondant à la matrice $\boldsymbol{A}$ à laquelle ont été supprimées les $i$ème ligne et $j$ème colonne, alors $|\boldsymbol{A}| = \sum_{i=1}^n (-1)^{i+j} A_{ij} |\boldsymbol{A}^{ij}|$

$\det(\mathbb{I}_n)=1$
$\det(\boldsymbol{A}^\top) = \det (\boldsymbol{A})$
$\det(\alpha \boldsymbol{A}) = \alpha^n \det(\boldsymbol{A})$
Pour deux matrices carrées de taille identique, $\det(\boldsymbol{A}\boldsymbol{B}) = \det(\boldsymbol{A})\det(\boldsymbol{B})$
Soit $\boldsymbol{C}$ la matrice obtenue en permutant deux lignes ou deux colonnes de $\boldsymbol{A}$, alors $\det(\boldsymbol{C})=-\det(\boldsymbol{A})$.

B = matrix(1:9,nrow=3)
det(A%*%B)

## [1] 0

det(A)

## [1] 17

det(B)

## [1] 0

(C = A[c(1,3,2),])

##      [,1] [,2] [,3]
## [1,]    1    4    0
## [2,]    2    3    8
## [3,]    0   -1    5

det(C)

## [1] -17

Théorème de Sylvester: soient $\boldsymbol{A}$ et $\boldsymbol{B}$ deux matrices de taille $(m,n)$ et $(n,m)$ respectivement alors \[ \det( \mathbb{I}_m + \boldsymbol{A} \boldsymbol{B}) = \det( \mathbb{I}_n + \boldsymbol{B} \boldsymbol{A}). \]

Soit $\boldsymbol{A}$ une matrice carrée de taille $(n,n)$ dont le déterminant est non nul, alors $\boldsymbol{A}$ est dite non singulière et il existe une matrice inverse (de même taille) notée $\boldsymbol{A}^{-1}$ vérifiant $\boldsymbol{A} \boldsymbol{A}^{-1} = \boldsymbol{A}^{-1}\boldsymbol{A} = \mathbb{I}_n$. Ses coefficients sont donnés par \[ (\boldsymbol{A}^{-1})_{ij} = (-1)^{i+j} \frac{|\boldsymbol{A}^{ij}|}{|\boldsymbol{A}|} \]

En dimension 2, on a la formule simple \[ \left( \begin{array}{cc} A_{11} & A_{12} \\ A_{21} &A_{22}\end{array} \right)^{-1} = (A_{11}A_{22}-A_{12}A_{21})^{-1} \left(\begin{array}{cc} A_{22} & -A_{12} \\ -A_{21} & A_{11} \end{array} \right) \]

A = rbind(c(3,1,0),c(-1,4,1),c(0,1,3))
det(A)

## [1] 36

Ainv=solve(A)
Ainv%*%A

##      [,1]          [,2]         [,3]
## [1,]    1 -1.040834e-17 1.387779e-17
## [2,]    0  1.000000e+00 0.000000e+00
## [3,]    0 -5.551115e-17 1.000000e+00

round(A%*%Ainv,12)

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

round(Ainv%*%A,12)

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

Soient $\boldsymbol{A}$ et $\boldsymbol{B}$ deux matrices inversibles de taille $(n,n)$ alors

$(\boldsymbol{A}^{-1})^\top = (\boldsymbol{A}^\top)^{-1}$ (et donc $\boldsymbol{A}^{-1}$ est symétrique ssi $\boldsymbol{A}$ est symétrique).
$(\boldsymbol{A}^{-1})^{-1} = \boldsymbol{A}$.
$(\boldsymbol{A}\boldsymbol{B})^{-1} = \boldsymbol{B}^{-1} \boldsymbol{A}^{-1}$
$\det(\boldsymbol{A}^{-1}) = \displaystyle{\frac{1}{\det(\boldsymbol{A})}}$

1/det(A)

## [1] 0.02777778

det(Ainv)

## [1] 0.02777778

Soient $n_1$ et $n_2$ tels que $n=n_1+n_2$. Une matrice $(n,n)$ peut être écrite de la façon suivante \[ \boldsymbol{A} =\left(\begin{array}{cc} \boldsymbol{A}_{11} & \boldsymbol{A}_{12} \\ \boldsymbol{A}_{21} &\boldsymbol{A}_{22}\end{array}\right) \] où $\boldsymbol{A}_{11}$, $\boldsymbol{A}_{12}$, $\boldsymbol{A}_{21}$ et $\boldsymbol{A}_{22}$ sont de taille $(n_1,n_1)$, $(n_1,n_2)$, $(n_2,n_1)$ et $(n_2,n_2)$.

Soient $\boldsymbol{A}$ et $\boldsymbol{B}$ deux matrices par blocs de taille identique, alors la matrice $\boldsymbol{C}= \boldsymbol{A}\boldsymbol{B}$ est aussi une matrice par blocs dont les termes sont définis (dans le cas de 4 blocs) par $\boldsymbol{C}_{ij} = \sum_{k=1}^2 \boldsymbol{A}_{ik}\boldsymbol{B}_{kj}$ pour$i,j=1,2$.
$|\boldsymbol{A}| = |\boldsymbol{A}_{11}| |\boldsymbol{A}_{22} -\boldsymbol{A}_{21} \boldsymbol{A}^{-1}_{11} \boldsymbol{A}_{12}|$ si $\boldsymbol{A}_{11}$ est inversible.
Soit $\boldsymbol{A}$ une matrice par blocs inversible alors \[{\displaystyle \boldsymbol{A}^{-1} =\left(\begin{array}{cc} \boldsymbol{A}_{11}^{-1} + \boldsymbol{A}_{11}^{-1} \boldsymbol{A}_{12}\boldsymbol{A}_{22,1}^{-1} \boldsymbol{A}_{21} \boldsymbol{A}_{11}^{-1} & -\boldsymbol{A}_{11}^{-1} \boldsymbol{A}_{12} \boldsymbol{A}_{22,1}^{-1} \\ -\boldsymbol{A}_{22,1}^{-1} \boldsymbol{A}_{21} \boldsymbol{A}_{11}^{-1} & \boldsymbol{A}_{22,1}^{-1} \end{array} \right) }\] où $\boldsymbol{A}_{22,1}= \boldsymbol{A}_{22}-\boldsymbol{A}_{21}\boldsymbol{A}_{11}^{-1} \boldsymbol{A}_{12}$.

Les vecteurs $\boldsymbol{x}_1,\dots,\boldsymbol{x}_p$ de $\mathbb{R}^n$ sont dit linéairement indépendants si $\forall \mathbf a \in \mathbb{R}^p$, $\sum_{i=1}^p a_i \boldsymbol{x}_i=0 \Leftrightarrow a_1=\dots=a_p=0$. Si $\boldsymbol{x}_1,\dots,\boldsymbol{x}_p$ sont linéairement indépendants alors $p\leq n$.

On appelle rang d’une matrice $\boldsymbol{A}$ et on note $\operatorname{rang}(\boldsymbol{A})$ le nombre maximal de colonnes qui sont linéairement indépendantes.

Une matrice $\boldsymbol{A}$ de taille $(n,p)$ est dite de plein rang si $\operatorname{rang}(\boldsymbol{A})=\min(n,p)$.

Une matrice $\boldsymbol{A}$ de taille $(p,p)$ est dite inversible ssi $\operatorname{rang}(\boldsymbol{A})=p$.

Si $\boldsymbol{x}_1,\dots,\boldsymbol{x}_p$ sont des vecteurs linéairement indépendants de $\mathbb{R}^n$, ils sont la base de l’espace vectoriel \[ \mathcal V(\boldsymbol{x}_1,\dots,\boldsymbol{x}_p) = \{ \boldsymbol{y} \in \mathbb{R}^n: \boldsymbol{y} = \sum_{i=1}^p a_i \boldsymbol{x}_i, \mathbf a\in \mathbb{R}^p\}. \]

La dimension de cet espace vectoriel est $p$. Sous forme matricielle, $\boldsymbol{y} = \boldsymbol{x} \mathbf a$ où $\boldsymbol{x}=(\boldsymbol{x}_1, \dots,\boldsymbol{x}_p )$ est donc une matrice de taille $(n,p)$.
$\mathbf a$ est la coordonnée de $\boldsymbol{y}$ dans la base $\boldsymbol{x}_1,\dots,\boldsymbol{x}_p$.
Si $\boldsymbol{u}_1,\dots,\boldsymbol{u}_p$ est une autre base de $\mathcal V(\boldsymbol{x})$, alors il existe une matrice inversible de taille $(p,p)$ dite de changement de base telle que $(\boldsymbol{u}_1,\dots,\boldsymbol{u}_p) = (\boldsymbol{x}_1,\dots,\boldsymbol{x}_p) \boldsymbol{A}$ où $\boldsymbol{A}$ est de taille $(p,p)$. En fait les colonnes de $\boldsymbol{A}$ donnent les coordonnées de $\boldsymbol{u}_i$ dans l’ancienne base $\boldsymbol{u}_i = \sum_{j=1}^p A_{ji} \boldsymbol{x}_j$.
Si $\mathbf a$ est la coordonnée de $\boldsymbol{y}$ dans la base de $\boldsymbol{x}$, alors $\boldsymbol{A}^{-1}\mathbf a$ est la coordonnée de $\boldsymbol{y}$ dans la base des $\boldsymbol{u}$.

Les valeurs propres d’une matrice carrée de taille $(p,p)$, $\boldsymbol{A}$, sont les solutions de l’équation $\chi_\boldsymbol{A}(\lambda)=|\boldsymbol{A}-\lambda \mathbb{I}_p|=0$. En fait, $\chi_\boldsymbol{A}(\lambda)$ s’appelle le polynôme caractéristique de degré $p$ en $\lambda$; les racines de ce polynôme peuvent donc être complexes. De plus, certaines valeurs propres peuvent avoir une multiplicité supérieure à 1.

A chaque valeur propre, on peut associer un vecteur propre $\boldsymbol{u}_i$ vérifiant $\boldsymbol{A} \boldsymbol{u}_i = \lambda_i \boldsymbol{u}_i$. Le vecteur propre n’est pas défini de manière unique, puisque $c \boldsymbol{u}_i$ est aussi un vecteur propre de $\lambda_i$ pour tout réel $c$ non nul.

Une matrice réelle carrée $\boldsymbol{A}$ est dite diagonalisable s’il existe une matrice inversible $\boldsymbol{P}$ et une matrice diagonale $\boldsymbol{D}$ à coefficients réels satisfaisant la relation~: $\boldsymbol{A} = \boldsymbol{P} \boldsymbol{D} \boldsymbol{P}^{-1}$. Et dans ce cas chaque vecteur colonne de $\boldsymbol{P}$ est un vecteur propre de $\boldsymbol{A}$.

Si une matrice carrée $\boldsymbol{A}$ diagonalisable admet une valeur propre nulle, alors nécessairement $\det(\boldsymbol{A})=0$.

Soit $\boldsymbol{A}$ une matrice carrée de taille $(p,p)$ diagonalisable alors, $\operatorname{rang}(\boldsymbol{A})= p - m_0$ où $m_0\leq p$ est la multiplicité de la valeur propre 0 (éventuellement nulle).

Une matrice carrée de taille $(p,p)$, $\boldsymbol{P}$ est orthogonale si et seulement si : $\boldsymbol{P}$ est à coefficients réels, est inversible et son inverse est égale à sa transposée : $\boldsymbol{P}^{-1}= \boldsymbol{P}^\top$, et donc $\boldsymbol{P} \boldsymbol{P}^\top =\mathbb{I}_n$.

Les vecteurs colonnes d’une matrice orthogonale sont orthogonaux.

Une matrice orthogonale est une transformation rigide; elle préserve les longueurs et les angles: $\|\boldsymbol{P} \boldsymbol{x}\|= \|\boldsymbol{x}\|$ et $(\boldsymbol{P}\boldsymbol{x})^\top (\boldsymbol{P} \boldsymbol{y})=\boldsymbol{x}^\top \boldsymbol{y}$.

Soit $\boldsymbol{A}$ une matrice symétrique réelle de taille $(p,p)$, alors il existe une matrice orthogonale $\boldsymbol{P}$ (c’est-à-dire $\boldsymbol{P}^{-1}=\boldsymbol{P}^\top$) et une matrice $\boldsymbol{D}$ diagonale $\boldsymbol{D}=\operatorname{diag}(\lambda_i, i=1,\dots,p)$, telle que \[ \boldsymbol{A} = \boldsymbol{P} \boldsymbol{D} \boldsymbol{P}^\top \] Et dans ce cas, les éléments de $\boldsymbol{D}$ sont les valeurs propres de $\boldsymbol{A}$ et les vecteurs colonnes de $\boldsymbol{P}$ les vecteurs propres associés.

Théorème d’encadrement des valeurs propres d’une matrice symétrique: Soit $\boldsymbol{A}$ une matrice réelle symétrique de taille $(p,p)$ et soient $\lambda_1\leq \lambda_2\leq \dots \leq \lambda_p$ ses valeurs propres ordonnées. Alors, pour tout $\boldsymbol{x}\in \mathbb{R}^p$ \[ \lambda_1 \leq \frac{\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}}{\boldsymbol{x}^\top \boldsymbol{x}} \leq \lambda_p. \]

Une matrice $\boldsymbol{A}$ réelle symétrique est définie positive (resp. semi-définie positive) si et seulement si toutes ses valeurs propres sont toutes positives (resp. positives ou nulles).

(X = trunc(matrix(rnorm(6,0,5),nr=2)))

##      [,1] [,2] [,3]
## [1,]   -2   -3   -4
## [2,]   -1   -8   -3

(A = t(X)%*%X)

##      [,1] [,2] [,3]
## [1,]    5   14   11
## [2,]   14   73   36
## [3,]   11   36   25

eigen(A)

## eigen() decomposition
## $values
## [1]  9.566163e+01  7.338365e+00 -1.444161e-15
## 
## $vectors
##            [,1]       [,2]        [,3]
## [1,] -0.1900915  0.4585891  0.86807906
## [2,] -0.8624949 -0.5004045  0.07548514
## [3,] -0.4690073  0.7343646 -0.49065338

P = eigen(A)$vectors
vp = eigen(A)$values
P%*%diag(vp)%*%t(P)

##      [,1] [,2] [,3]
## [1,]    5   14   11
## [2,]   14   73   36
## [3,]   11   36   25

solve(P)

##            [,1]        [,2]       [,3]
## [1,] -0.1900915 -0.86249486 -0.4690073
## [2,]  0.4585891 -0.50040445  0.7343646
## [3,]  0.8680791  0.07548514 -0.4906534

t(P)

##            [,1]        [,2]       [,3]
## [1,] -0.1900915 -0.86249486 -0.4690073
## [2,]  0.4585891 -0.50040445  0.7343646
## [3,]  0.8680791  0.07548514 -0.4906534

Soit $\boldsymbol{y}$ un vecteur de $\mathbb{R}^n$. Sa projection sur $\mathcal V(\boldsymbol{x}_1,\dots,\boldsymbol{x}_p)$ est le vecteur $\hat{\boldsymbol{y}} = \boldsymbol{x}\hat{\mathbf a}$ minimisant $\|\boldsymbol{y} -\boldsymbol{x} \mathbf a\|$. La solution de ce problème donne $\hat{\mathbf a} = (\boldsymbol{x}^\top \boldsymbol{x})^{-1} \boldsymbol{x}^\top \boldsymbol{y}$ et le projeté est $\hat{\boldsymbol{y}} = \boldsymbol{x} \hat{\mathbf a}$.

La matrice $\mathcal{P} = \boldsymbol{x} ( \boldsymbol{x}^\top \boldsymbol{x})^{-1} \boldsymbol{x}$ est appelée projecteur ou matrice de projection sur $\{\boldsymbol{x}_1,\dots,\boldsymbol{x}_p\}$ et on a $\hat{ \boldsymbol{y}} = \mathcal{P} \boldsymbol{y}$.

Déterminons le projeté orthogonal de $\boldsymbol{y}=(1,-1,0)^\top$ sur les vecteurs $\boldsymbol{x}_1=(1,1,0)^\top$ et $\boldsymbol{x}_2=(3,0,1)^\top$ :

f = function(a,y=c(1,-1,0)){ (y[1]-(a[1]+3*a[2]))^2+(y[2]-a[1])^2+(y[3]-a[2])^2 }
optim(par=c(0,0),f)$par

## [1] -0.8181057  0.5454202

X=cbind(c(1,1,0),c(3,0,1)); solve(t(X)%*%X)%*%t(X)%*% c(1,-1,0)

##            [,1]
## [1,] -0.8181818
## [2,]  0.5454545

Soient $\boldsymbol{x}_1,\dots,\boldsymbol{x}_p$, $p$ vecteurs de $\mathbb{R}^n$ et soit $Q\subset\{1,\dots,p\}$ un sous-ensemble d’indices. On note $\hat y_Q$ le projeté orthogonal de $y$ sur l’espace $\mathcal V_Q$ engendré par les vecteurs $\boldsymbol{x}_j$ pour $j\in Q$ et on note $\boldsymbol{x}_Q=(\boldsymbol{x}_j, j\in Q)$ la matrice de taille $(n,\#Q)$. Alors on a les propriétés suivantes:

$\hat y_Q = \boldsymbol{x}_Q( \boldsymbol{x}_Q^\top \boldsymbol{x}_Q)^{-1} \boldsymbol{x}_Q^\top \boldsymbol{y}$.
La matrice de projection est égale à $\mathcal{P}_Q = \boldsymbol{x}_Q ( \boldsymbol{x}_Q^\top \boldsymbol{x}_Q)^{-1} \boldsymbol{x}_Q^\top$.
La matrice $\mathcal{P}_Q$ est idempotente et symétrique. En particulier $\mathcal{P}_Q \boldsymbol{x}_Q=\boldsymbol{x}_Q$.
Si $Q\subset Q^\prime \subset\{1,\dots,p\}$, alors $\mathcal{P}_Q \mathcal{P}_{Q^\prime}= \mathcal{P}_{Q^\prime} \mathcal{P}_Q = \mathcal{P}_Q$.
La matrice $\mathcal{P}^\perp_Q=\mathbb{I}_n-\mathcal{P}_Q$ est aussi une matrice de projection. Il s’agit du projecteur orthogonal sur $\mathcal V_Q^\perp$, l’orthogonal de $\mathcal V_Q$.
$\mathcal{P}_Q\mathcal{P}_Q^\perp = \mathcal{P}_Q^\perp \mathcal{P}_Q=0$ et en particulier, $\mathcal{P}_Q^\perp \boldsymbol{x}_Q=0$.

X = cbind(c(1,0,1),c(3,4,2))
X

##      [,1] [,2]
## [1,]    1    3
## [2,]    0    4
## [3,]    1    2

P = X%*% solve(t(X)%*%X)%*%t(X)
P

##           [,1]       [,2]       [,3]
## [1,] 0.5151515  0.1212121  0.4848485
## [2,] 0.1212121  0.9696970 -0.1212121
## [3,] 0.4848485 -0.1212121  0.5151515

P%*%P

##           [,1]       [,2]       [,3]
## [1,] 0.5151515  0.1212121  0.4848485
## [2,] 0.1212121  0.9696970 -0.1212121
## [3,] 0.4848485 -0.1212121  0.5151515

Porth = diag(rep(1,3))-P
Porth%*%Porth

##            [,1]        [,2]       [,3]
## [1,]  0.4848485 -0.12121212 -0.4848485
## [2,] -0.1212121  0.03030303  0.1212121
## [3,] -0.4848485  0.12121212  0.4848485

P%*%Porth

##               [,1]          [,2]          [,3]
## [1,] -8.326673e-17 -1.387779e-16  5.551115e-17
## [2,]  9.714451e-17  2.081668e-16  2.775558e-17
## [3,]  0.000000e+00 -2.012279e-16 -5.551115e-17

Si $\boldsymbol{A}$ est une matrice idempotente alors

$\operatorname{rang}(\boldsymbol{A})=\operatorname{trace}(\boldsymbol{A})$.
Les valeurs propres de $\boldsymbol{A}$ valent soit 0 soit 1.

Théorème de Cochran: soit $\boldsymbol{A}=\boldsymbol{A}_1+\dots+\boldsymbol{A}_k$. Alors les deux énoncés suivants sont équivalents

$\boldsymbol{A}$ est idempotente et $\operatorname{rang}(\boldsymbol{A})= \sum_{i=1}^k \operatorname{rang}(\boldsymbol{A}_i)$.
$\boldsymbol{A}_i$ est idempotente pour tout $i$ et $\boldsymbol{A}_i\boldsymbol{A}_j=0$ pour tout $i\neq j$.

2 Rappels de Statistiques

Davis = read.table(
"http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Davis.txt")
Davis[12,c(2,3)] = Davis[12,c(3,2)]

2.1 Probabilities

Un vecteur aléatoire $\boldsymbol{X}=(X_1,\dots,X_d)$ de dimension $d$ admet pour fonction de répartition \[ F(\boldsymbol{x})=F(x_1,\dots,x_d)=\mathbb{P}[X_1\leq x_1,\dots,X_d\leq x_d] \] Si les composantes sont absolument continues, la densité associée est \[ f(\boldsymbol{x})=f(x_1,\dots,x_d)=\frac{\partial^n F(x_1,\dots,x_d)}{\partial x_1\dots\partial x_d} \] Soit $\boldsymbol{X}$ un vecteur aléatoire admettant une densité ou fonction de masse, de dimension $d$, alors la densité marginale de $X_i$ ($i=1,\dots,d$) est définie par \[ f_{X_i}(x_i ) = \int_{\mathbb{R}^{n-1}} f_{\boldsymbol{X}}(x_1,\dots,x_d) d x_1 \dots d x_{i-1} d x_{i+1} \dots d x_d \] dans le cas absolument continu.

Soit $I,J \subset \{1,\dots,d\}$ tels que $I\cap J =\emptyset$ et $I\cup J=\{1,\dots,d\}$. Pour $\boldsymbol{x}\in \mathbb{R}^d$ on note $\boldsymbol{X}_I$ et $\boldsymbol{x}_J$ les vecteurs associées aux composantes d’indices $I$ et $J$ respectivement.

La distribution conditionnelle de $\boldsymbol{X}_I$ sachant $\boldsymbol{X}_J=\boldsymbol{x}_J$ est donnée par \[ f_{\boldsymbol{X}_I}(\boldsymbol{x}_I) = \frac{f_\boldsymbol{X}(\boldsymbol{x})}{f_{\boldsymbol{X}_J}(\boldsymbol{x}_J)} \] pour $\boldsymbol{x}_J$ tel que $f_{\boldsymbol{X}_J}(\boldsymbol{x}_J)>0$, dans le cas absolument continu.

Soient $\boldsymbol{X}$ et $\boldsymbol{Y}$ deux vecteurs aléatoires de dimension $d$ et $d^\prime$.

L’espérance de $\boldsymbol{X}$, notée $\mathbb{E}(\boldsymbol{X})$ est définie (si elle existe) par $\mathbb{E}(\boldsymbol{X})= \left( \mathbb{E}(\boldsymbol{X}_1),\dots,\mathbb{E}(\boldsymbol{X}_d)\right)^\top$.
La matrice de covariance (appelée aussi matrice de variance-covariance de $\boldsymbol{X}$) est déifinie (si elle existe) par la matrice de taille $(d,d)$ \[ \operatorname{Var}(\boldsymbol{X}) = \mathbb{E} \left( (\boldsymbol{X}-\mathbb{E}(\boldsymbol{X})) (\boldsymbol{X}-\mathbb{E}(\boldsymbol{X}))^\top\right). \] Ainsi le terme $ij$ de cette matrice représente la covariance entre $X_i$ et $X_j$.
De la même façon, on définit la covariance entre $\boldsymbol{X}$ et $\boldsymbol{Y}$ par la matrice de taille $(d,d^\prime)$ \[ \operatorname{Cov}(\boldsymbol{X},\boldsymbol{Y}) = \mathbb{E} \left( (\boldsymbol{X}-\mathbb{E}(\boldsymbol{X})) (\boldsymbol{Y}-\mathbb{E}(\boldsymbol{Y}))^\top\right). \]

La matrice de variance-covariance est nécessairement une matrice symétrique semi-définie positive.

Soit $\boldsymbol{Y}$ un vecteur aléatoire de dimension $d$, de moyenne $\boldsymbol{\mu}$ et de matrice de covariance $\boldsymbol{Sigma}$. Soient $\boldsymbol{A}$ et $\boldsymbol{B}$ deux matrices réeeles de taille $(d,p)$ et $(d,q)$ et enfin soit $\mathbf a \in \mathbb{R}^p$ alors

$\operatorname{Var}(\boldsymbol{Y}) = \mathbb{E}(\boldsymbol{Y}\boldsymbol{Y}^\top) - \boldsymbol{\mu} \boldsymbol{\mu}^\top$.
$\mathbb{E} \left( \boldsymbol{A}^\top \boldsymbol{Y} + \mathbf a \right) =\boldsymbol{A}^\top \boldsymbol{\mu} + \mathbf a$.
$\operatorname{Var} \left( \boldsymbol{A}^\top \boldsymbol{Y} + \mathbf a \right) = \boldsymbol{A}^\top \boldsymbol{\Sigma} \boldsymbol{A}$.
$\operatorname{Cov}\left( \boldsymbol{A}^\top \boldsymbol{Y}, \boldsymbol{B}^\top \boldsymbol{Y} \right) = \boldsymbol{A}^\top \boldsymbol{\Sigma} \boldsymbol{B}$.

Soit $\boldsymbol{A}$ une matrice réelle symétrique de taille $(d,d)$ et $\boldsymbol{Y}$ un vecteur aléatoire de moyenne $\boldsymbol{\mu}$ et de matrice de covariance $\boldsymbol{\Sigma}$, alors \[ \mathbb{E} \left( \boldsymbol{Y}^\top \boldsymbol{A} \boldsymbol{Y}\right) = \boldsymbol{\mu}^\top \boldsymbol{A} \boldsymbol{\mu} + \operatorname{trace}( \boldsymbol{A} \boldsymbol{\Sigma}). \]

2.2 Lois Usuelles

2.2.1 Loi binomiale

\[{\displaystyle f(k)=\mathbb {P} (X=k)={n \choose k}\,p^{k}(1-p)^{n-k}} \] de moyenne $\mathbb{E}[X]=np$ et de variance $\text{Var}[X]=np(1-p)$

par(mfrow=c(1,2))
p=dbinom(0:10,size = 10,prob = .3)
names(p)=0:10
barplot(p,col=rgb(0,0,1,.4))
u=seq(-1,11,length=251)
plot(u,pbinom(u,size = 10,prob = .3),type="l",xlab="",ylab="")
lines(u,pnorm(u,10*.3,sqrt(10*.3*(1-.3))),col="red",lty=2)

2.2.2 Loi de Poisson

\[{\displaystyle f(k)= {\frac {\lambda ^{k}}{k!}}\mathrm {e} ^{-\lambda }} \] de moyenne $\mathbb{E}[X]=\lambda$ et de variance $\text{Var}[X]=\lambda$

par(mfrow=c(1,2))
p=dpois(0:10,lambda = 3)
names(p)=0:10
barplot(p,col=rgb(0,0,1,.4))
u=seq(-1,11,length=251)
plot(u,ppois(u,lambda = 3),type="l",xlab="",ylab="")
lines(u,pnorm(u,3,sqrt(3)),col="red",lty=2)

2.2.3 Loi Exponentielle

La fonction de répartition est donnée par : \[ {\displaystyle F(x)=\left\{{\begin{matrix}1-e^{-\lambda x}&{\text{si}}\;x\geqslant 0\\0&{\text{si}}\;x<0\end{matrix}}\right.} \] et la densité ${\displaystyle \lambda e^{-\lambda x}\boldsymbol{1}_{\mathbb {R} _{+}}(x)}$. La moyenne est ${\displaystyle \mathbb {E} (X)={\frac {1}{\lambda }}}$ et la variance ${\displaystyle \text{Var}(X)={\dfrac {1}{\lambda ^{2}}}}$

par(mfrow=c(1,2))
u=seq(-1,11,length=251)
plot(u,dexp(u,1/3),col="white",xlab="",ylab="")
polygon(c(u,rev(u)),c(dexp(u,1/3),rep(0,length(u))),col=rgb(0,0,1,.4), border=NA)
lines(u,dexp(u,1/3))
plot(u,pexp(u,1/3),type="l",xlab="",ylab="")

2.2.4 Loi Gaussienne

La densité de la loi normale (Gaussienne) $\mathcal{N}(\mu,\sigma^2)$ est donnée par : \[ {\displaystyle f(x)={\frac {1}{\sigma {\sqrt {2\pi }}}}\exp\big[-{\frac {1}{2}}\left({\frac {x-\mu }{\sigma }}\right)^{2}}\big] \] La moyenne est ${\displaystyle \mathbb {E} (X)=\mu}$ et la variance ${\displaystyle \text{Var}(X)=\sigma^2}$

par(mfrow=c(1,2))
u=seq(-1,11,length=251)
plot(u,dnorm(u,3,1.5),col="white",xlab="",ylab="")
polygon(c(u,rev(u)),c(dnorm(u,3,1.5),rep(0,length(u))),col=rgb(0,0,1,.4), border=NA)
lines(u,dnorm(u,3,1.5))
plot(u,pnorm(u,3,1.5),type="l",xlab="",ylab="")

Si ${\displaystyle X\sim {\mathcal {N}}(\mu ,\sigma ^{2})}$ alors ${\displaystyle aX+b\sim {\mathcal {N}}(a\mu +b,a^{2}\sigma ^{2})}$.

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:sm':
## 
##     muscle

fitdistr(Davis$height,"normal")

##       mean           sd     
##   170.5650000     8.9098695 
##  (  0.6300229) (  0.4454935)

hist(Davis$height,probability = TRUE, col=rgb(0,0,1,.4), main="")
x=seq(min(Davis$height)-2,max(Davis$height)+2,length=251)
y=dnorm(x,fitdistr(Davis$height,"normal")$estimate[1],fitdistr(Davis$height,"normal")$estimate[2])
lines(x,y,col="red",lwd=2)

2.2.5 Loi du Chi-Deux

Soient $\{X_1, \cdots , X_k\}$, $k$ variables aléatoires indépendantes suivant des lois normales centrées (de moyennes 0) et réduites (de variance 1), alors \[ {\displaystyle X=\sum _{i=1}^{k}X_{i}^{2}}\sim\chi^2(k) \] suit une loi du chi-deux à $k$ degrés de liberté. La densité de $X$ est \[ {\displaystyle f(x)={\frac {1}{2^{\frac {k}{2}}\Gamma ({\frac {k}{2}})}}x^{{\frac {k}{2}}-1}e^{-{\frac {x}{2}}}}\boldsymbol{1}_{\mathbb{R}_+}(x) \] On peut noter que $\mathbb{E}[X]=k$ et $\operatorname{var}[X]=2k$.

par(mfrow=c(1,2))
u=seq(-1,11,length=251)
plot(u,dchisq(u,3),col="white",xlab="",ylab="")
polygon(c(u,rev(u)),c(dchisq(u,3),rep(0,length(u))),col=rgb(0,0,1,.4), border=NA)
lines(u,dchisq(u,3))
plot(u,pchisq(u,3),type="l",xlab="",ylab="")

2.2.6 Loi de Student

Si $X$ est une variable aléatoire de loi normale centrée et réduite et $Y$ suit une loi $\chi^2(n)$, telles que $X$ et $Y$ soient indépendantes, alors ${\displaystyle {\frac {X}{\sqrt {Y/n}}}}$ suit une loi de Student à $n$ degrés de liberté, $\mathcal{S}td(n)$

par(mfrow=c(1,2))
u=seq(-4,4,length=251)
plot(u,dt(u,3),col="white",xlab="",ylab="",ylim=c(0,.4))
polygon(c(u,rev(u)),c(dt(u,3),rep(0,length(u))),col=rgb(0,0,1,.4), border=NA)
lines(u,dt(u,3))
lines(u,dnorm(u),col="red",lty=2)
plot(u,pt(u,3),type="l",xlab="",ylab="")
lines(u,pnorm(u),col="red",lty=2)

La densité de $T$ est donnée par : \[ {\displaystyle f(t)={\frac {1}{\sqrt {k\pi }}}{\frac {\Gamma ({\frac {k+1}{2}})}{\Gamma ({\frac {k}{2}})}}\left(1+{\frac {t^{2}}{k}}\right)^{-{\frac {k+1}{2}}}} \] qui est définie pour $n\in\mathbb{R}_{+\star}$ et pas juste $n\in\mathbb{N}_{\star}$. Si $n>1$, $\mathbb{E}[T]=1$, et si $n>2$, $\operatorname{var}[T]=n/(n-2)$.

2.2.7 Loi de Fisher

Le quotient de deux variables aléatoires indépendantes, $Q_1$ et $Q_2$, distribuées chacune selon des lois $\chi^2(d_1)$ et $\chi^2(d_2)$ respectivement suit une loi de Fisher \[ {\displaystyle \mathrm {F} = {\frac {U_{1}/d_{1}}{U_{2}/d_{2}}}}\sim\mathcal{F}(d_1,d_2) \] La densité d’une loi de Fisher est donnée par \[ {\displaystyle f(x)={\frac {1}{x\;\mathrm {B} (d_{1}/2,d_{2}/2)}}\left({\frac {d_{1}\,x}{d_{1}\,x+d_{2}}}\right)^{d_{1}/2}\;\left(1-{\frac {d_{1}\,x}{d_{1}\,x+d_{2}}}\right)^{d_{2}/2}} \] pour $x>0$. L’espérance de cette variable aléatoire est \[ \mathbb{E}[\mathrm{F}] =\frac{d_2}{d_{2}-2} \] dès lors que $d_2>2$.

2.2.8 Vecteur Gaussien

La densité d’un vecteur Gaussien $\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$ est \[ {\displaystyle f(x_{1},\ldots ,x_{k})={\frac {\exp \left(-{\frac {1}{2}}({\mathbf {x} }-{\boldsymbol {\mu }})^{\top }{\boldsymbol {\Sigma }}^{-1}({\mathbf {x} }-{\boldsymbol {\mu }})\right)}{\sqrt {(2\pi )^{k}|{\boldsymbol {\Sigma }}|}}}} \]

plot(Davis$height,Davis$weight,xlab="Height (cm)",ylab="Weight (kg)")
library(mnormt)
f=function(x,y) dmnorm(cbind(x,y),c(mean(Davis$height),mean(Davis$weight)),var(Davis[,c("height","weight")]))
x=seq(min(Davis$height),max(Davis$height),length=101)
y=seq(min(Davis$weight),max(Davis$weight),length=101)
z=outer(x,y,f)
contour(x,y,z,add=TRUE,col="red")

Let \[ {\displaystyle \mathbf {X} ={\begin{bmatrix}\mathbf {x} _{1}\\\mathbf {x} _{2}\end{bmatrix}}{\text{ with sizes }}{\begin{bmatrix}q\times 1\\(N-q)\times 1\end{bmatrix}}}\]

and accordingly μ and Σ are partitioned as follows \[ {\boldsymbol {\mu }}={\begin{bmatrix}{\boldsymbol {\mu }}_{1}\\{\boldsymbol {\mu }}_{2}\end{bmatrix}}{\text{ with sizes }}{\begin{bmatrix}q\times 1\\(N-q)\times 1\end{bmatrix}}\] and \[{\displaystyle {\boldsymbol {\Sigma }}={\begin{bmatrix}{\boldsymbol {\Sigma }}_{11}&{\boldsymbol {\Sigma }}_{12}\\{\boldsymbol {\Sigma }}_{21}&{\boldsymbol {\Sigma }}_{22}\end{bmatrix}}{\text{ with sizes }}{\begin{bmatrix}q\times q&q\times (N-q)\\(N-q)\times q&(N-q)\times (N-q)\end{bmatrix}}} \]

\[{\displaystyle {\bar {\boldsymbol {\mu }}_{\mathbf {a}}}={\boldsymbol {\mu }}_{1}+{\boldsymbol {\Sigma }}_{12}{\boldsymbol {\Sigma }}_{22}^{-1}\left(\mathbf {a} -{\boldsymbol {\mu }}_{2}\right)}\]

and covariance matrix \[{\displaystyle {\overline {\boldsymbol {\Sigma }}_{\mathbf {a}}={\boldsymbol {\Sigma }}_{11}-{\boldsymbol {\Sigma }}_{12}{\boldsymbol {\Sigma }}_{22}^{-1}{\boldsymbol {\Sigma }}_{21}.}}\] In the bivariate case\[ {\displaystyle {\begin{pmatrix}X_{1}\\X_{2}\end{pmatrix}}\sim {\mathcal {N}}\left({\begin{pmatrix}\mu _{1}\\\mu _{2}\end{pmatrix}},{\begin{pmatrix}\sigma _{1}^{2}&\rho \sigma _{1}\sigma _{2}\\\rho \sigma _{1}\sigma _{2}&\sigma _{2}^{2}\end{pmatrix}}\right)}\] xx \[{\displaystyle X_{1}\mid X_{2}=x_{2}\ \sim \ {\mathcal {N}}\left(\mu _{1}+{\frac {\sigma _{1}}{\sigma _{2}}}\rho (x_{2}-\mu _{2}),\,(1-\rho ^{2})\sigma _{1}^{2}\right).} \] so that \[{\displaystyle \mathbb {E} (X_{1}\mid X_{2}=x_{2})=\mu _{1}+\rho {\frac {\sigma _{1}}{\sigma _{2}}}(x_{2}-\mu _{2})} \] while \[ {\displaystyle \operatorname {var} (X_{1}\mid X_{2}=x_{2})=(1-\rho ^{2})\sigma_1^2<\sigma_1^2= \operatorname {var} (X_{1})} \]

Supposons qu’on veuille simuler $\boldsymbol{X} \sim \mathcal{N}(\boldsymbol{\mu} ,\boldsymbol{\Sigma})$, un vecteur gaussien de taille $d$.

$\boldsymbol{X} \stackrel{\mathcal{L}}{=} \boldsymbol{X}^\prime +\boldsymbol{\mu}$, avec $\boldsymbol{X}^\prime \sim \mathcal{N}(\boldsymbol{0} ,\boldsymbol{\Sigma})$: il suffit donc de simuler $\boldsymbol{X}^\prime$.

Puisque $\boldsymbol{\Sigma}$ est réelle symétrique, $\exists \boldsymbol{P}$ orthogonale et $\boldsymbol{D}$ diagonale (dont les éléments sont $\geq 0$), telle que $\boldsymbol{\Sigma} = \boldsymbol{P} \boldsymbol{D} \boldsymbol{P}^\top$, alors \[ \boldsymbol{P} \boldsymbol{D}^{1/2} \boldsymbol{Y} \stackrel{\mathcal{L}}{=} \boldsymbol{X}^\prime \] où $\boldsymbol{Y}\sim\mathcal{N}(\boldsymbol{0},\mathbb{I}_d)$, puisque $\operatorname{Var}(\boldsymbol{P} \boldsymbol{D}^{1/2} \boldsymbol{Y}) = \boldsymbol{P} \boldsymbol{D}^{1/2} \mathbb{E}(\boldsymbol{Y} \boldsymbol{Y}^\top) \boldsymbol{D}^{1/2} \boldsymbol{P}^\top = \boldsymbol{\Sigma}$

Aussi, pour simuler une réalisation de $\boldsymbol{X} \sim \mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$:

on simule $d$ réalisations indépendantes d’une $\mathcal{N}(0,1)$ que l’on stocke dans $\boldsymbol{y}$, i.e. y=rnorm(d)
on diagonalise la matrice des vecteurs propres $\boldsymbol{P}$ et la matrice diagonale des racines carrées des valeurs propres $\boldsymbol{D}^{1/2}$, puis on calcule $\boldsymbol{P} \boldsymbol{D}^{1/2} \boldsymbol{y}$.
la réalisation désirée est: $\boldsymbol{x} = \boldsymbol{\mu} + \boldsymbol{P} \boldsymbol{D}^{1/2} \boldsymbol{y}$.

A = function(rho){
 Sigma=cbind(c(1,rho),c(rho,1))
 eig=eigen(Sigma)
 eig$vector %*% diag(sqrt(eig$val))
 }
A0.5 = A(0.5)
Y = rnorm(2,0,1)
A0.5%*%Y

##          [,1]
## [1,] 1.417492
## [2,] 1.290876

A0.5%*% matrix(rnorm(2*3),nr=2)

##            [,1]      [,2]     [,3]
## [1,] -0.8129428 0.6786791 2.119525
## [2,] -1.4852915 1.5903142 1.664039

m = 10^4
y = matrix(rnorm(2*m),nr=2)
X0=y
apply(X0,1,var)

## [1] 1.008306 1.032374

cov(X0[1,],X0[2,])

## [1] 0.01081344

par(mfrow = c(1,2))
library(plot3D)
xc=cut(X0[1,],20);yc=cut(X0[2,],20);z=table(xc,yc)
hist3D(z=z, border="black")
xc=cut(X0[1,],20);yc=cut(X0[2,],20);z=table(xc,yc)
image2D(z=z, border="black")

X0.5=A(0.5)%*%y
par(mfrow = c(1,2))
library(plot3D)
xc=cut(X0.5[1,],20);yc=cut(X0.5[2,],20);z=table(xc,yc)
hist3D(z=z, border="black")
xc=cut(X0.5[1,],20);yc=cut(X0.5[2,],20);z=table(xc,yc)
image2D(z=z, border="black")

X0.9=A(0.9)%*%y
par(mfrow = c(1,2))
library(plot3D)
xc=cut(X0.9[1,],20);yc=cut(X0.9[2,],20);z=table(xc,yc)
hist3D(z=z, border="black")
xc=cut(X0.9[1,],20);yc=cut(X0.9[2,],20);z=table(xc,yc)
image2D(z=z, border="black")

par(mfrow = c(1,1))

Soit $\boldsymbol{Y}\sim \mathcal{N}(\boldsymbol{0}, \mathbb{I}_d)$ et soit $\boldsymbol{A}$ une matrice réelle symétrique de rang $r$. Alors,

si $\boldsymbol{A}$ est une matrice idempotente, $\boldsymbol{Y}^\top \boldsymbol{A} \boldsymbol{Y} \sim \chi^2(r)$ où $r=\operatorname{rang}(\boldsymbol{A})$.
si $\boldsymbol{Y}^\top \boldsymbol{A} \boldsymbol{Y} \sim \chi^2(s)$ alors $\boldsymbol{A}$ est une matrice idempotente et $s=\operatorname{rang}(\boldsymbol{A})$.

Soit $\boldsymbol{Y} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbb{I}_d)$, et définissons $\boldsymbol{X} = \boldsymbol{Y}^\top \boldsymbol{Y} = \|\boldsymbol{Y}\|^2$. On dit que $\boldsymbol{X}$ suit une loi du $\chi^2(d)$ et de paramètre de non centralité $\lambda = \boldsymbol{\mu}^\top \boldsymbol{\mu}$, et on note $\boldsymbol{X} \sim \chi^2(d,\boldsymbol{\mu}^\top \boldsymbol{\mu})$.

Soit $\boldsymbol{Y}\sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ et soit $\boldsymbol{A}$ une matrice réelle symétrique de rang $r$. Alors,

si $\boldsymbol{A} \boldsymbol{\Sigma} \boldsymbol{A}=\boldsymbol{A}$, $\boldsymbol{Y}^\top \boldsymbol{A} \boldsymbol{Y} \sim \chi^2(r,\lambda)$ où $r=\operatorname{rang}(\boldsymbol{A})$ et $\lambda=\boldsymbol{\mu}^\top \boldsymbol{A} \boldsymbol{\mu}$.
si $\boldsymbol{Y}^\top \boldsymbol{\Sigma} \boldsymbol{Y} \sim \chi^2(s,\lambda)$ alors $\boldsymbol{A} \boldsymbol{\Sigma} \boldsymbol{A}=\boldsymbol{A}$, $s=\operatorname(\boldsymbol{A})$ et $\lambda=\boldsymbol{\mu}^\top \boldsymbol{A} \boldsymbol{\mu}$.

Théorème de Cochrane: Soit $\boldsymbol{X}\sim \mathcal{N}(\boldsymbol{\mu},\sigma^2 \mathbb{I}_d)$ où $\boldsymbol{\mu}\in \mathbb{R}^d$ et $\sigma>0$. Soient $F_1,\dots,F_m$ des sous-espaces vectoriels de $\mathbb{R}^d$ deux à deux orthogonaux, et $\boldsymbol{P}_{F_i}$ la matrice de projection orthogonale sur $F_i$ et $d_i$ la dimension de $F_i$ pour $1\leq i \leq m$, alors

Les vecteurs $\boldsymbol{P}_{F_1} \boldsymbol{X}, \dots,\boldsymbol{P}_{F_m} \boldsymbol{X}$ sont deux à deux indépendants et de loi $\mathcal{N}(\boldsymbol{P}_{F_i}\boldsymbol{\mu}, \sigma^2 \boldsymbol{P}_{F_i})$ pour $1\leq i \leq m$.
Les variables aléatoires $\|\boldsymbol{P}_{F_i} (\boldsymbol{X}-\boldsymbol{\mu})\|^2/\sigma^2$ pour $1\leq i\leq m$ sont 2 à 2 indépendantes et de loi $\chi^2(d_i)$.

Autrement, si $\boldsymbol{X}\sim \mathcal{N}(\boldsymbol{ 0},\mathbb{I}_d)$ et $F$ un sous-espace vectoriel de $\mathbb{R}^d$ de dimension $p< d$, si $F^\bot$ désigne son orthogonal, et si $\boldsymbol{P}_F$ et $\boldsymbol{P}_{F^\bot}$ les matrices de projection sur $F$ et $F^\bot$, alors

$\boldsymbol{P}_F \boldsymbol{X} \sim \mathcal{N}(\boldsymbol{ 0}, \boldsymbol{P}_F)$ et $\boldsymbol{P}_{F^\bot} \boldsymbol{X} \sim \mathcal{N}(\boldsymbol{ 0}, \boldsymbol{P}_{F^\bot})$ et $(\boldsymbol{P}_F \boldsymbol{X}) \bot (\boldsymbol{P}_{F^\bot} \boldsymbol{X})$
$\|\boldsymbol{P}_F \boldsymbol{X} \|^2 \sim \chi^2(p)$, $\|\boldsymbol{P}_{F^\bot} \boldsymbol{X} \|^2 \sim \chi^2(d-p)$ et ces deux variables sont indépendantes.

Une conséquence est que si $Y_1,\dots,Y_n$ $n$ sont des variables i.i.d. de loi $N(0,1)$ alors $\overline Y=n^{-1} \sum Y_i$ et $S^2=(n-1)^{-1}\sum_i (Y_i-\overline Y)$ sont deux variables indépendantes et de plus, $(n-1)S^2\sim\chi^2_{n-1}$.

2.2.9 La famille exponentielle

On dira que la loi de $Y$ est dans la famille exponentielle si sa masse de probabilité / densité s’écrit \[ f(y|\theta,\varphi) = \exp\left(\frac{y\theta-b(\theta)}{ \varphi }+c(y,\varphi)\right) \] où $a(\cdot)$, $b(\cdot)$ et $c(\cdot)$ sont des fonctions, et où $\theta$ est appelé . On peut aussi écrire \[ f(y|\theta,\varphi) = A(y)B(\theta)\cdot\exp\left(\frac{y\theta}{ \varphi }\right) \] Plus généralement, on peut considérer \[ f(y|\theta,\varphi) = A(y)B(\theta)\cdot\exp\left(\frac{1}{ \varphi }\sum_{j=1}^r T_j(y)\eta_j(\theta)\right) \] (par la suite, la version simple précédante suffira). $\boldsymbol{\eta}=(\eta_1(\theta),\dots,\eta_r(\theta))$ est le paramètre canonique.

2.3 Convergence

A sequence $\{X1, \dots, Xn, \cdots\}$ of random variables is said to converge in distribution to a random variable $X$ if \[ {\displaystyle \lim _{n\to \infty }F_{n}(x)=F(x)} \] for every $ {x }$ at which $F$ is continuous.

We will write $X_n{\xrightarrow {\mathcal{L}}}X$.

2.3.1 central limit theorem

Let $\{X1, \cdots, Xn, \cdots\}$ denote a sample of size $n$, i.e. independent and identically distributed (i.i.d.) random variables drawn from a distribution of expected value $\mu$ and finite variance $\sigma^2$. Let \[ \displaystyle{S_{n}={\frac {X_{1}+\cdots +X_{n}}{n}}} \] as $n$ goes to infinity, the random variables $\sqrt{n}(Sn − \mu)$ converge in distribution to a normal $\mathcal{N}(0,\sigma^2)$ \[ {\displaystyle {\sqrt {n}}\left(S_{n}-\mu \right)\ {\xrightarrow {\mathcal{L}}}\mathcal{N}\left(0,\sigma ^{2}\right).} \]

 m=100
 mean_samples=function(n=10){
   X=matrix(rnorm(n*m),nrow=m,ncol=n)
   return(apply(X,1,mean))
 }
 B=matrix(NA,100,20)
 for(i in 1:20){
   B[,i] = mean_samples(i*10)
 }
 colnames(B) = as.character(seq(10,200,by=10))
 boxplot(B)
 u = seq(0,21,by=.2)
 v = sqrt(u*10)
 lines(u,1.96/v,col="red")
 lines(u,-1.96/v,col="red")

2.3.2 Delta Method

For a univariate parameter $\theta$ such that \[{\displaystyle {{\sqrt {n}}[X_{n}-\theta ]\,{\xrightarrow {\mathcal{L}}}\,{\mathcal {N}}(0,\sigma ^{2})}}\] then \[ {\displaystyle {{\sqrt {n}}[g(X_{n})-g(\theta )]\,{\xrightarrow {\mathcal L}}\,{\mathcal {N}}(0,\sigma ^{2}\cdot [g'(\theta )]^{2})}} \] for any function $g$ such that $g'(\theta)$ exists and is non-zero.

For a multivariate parameter $\boldsymbol\theta$ such that \[{\displaystyle {{\sqrt {n}}[\boldsymbol{X}_{n}-\boldsymbol{\theta} ]\,{\xrightarrow {\mathcal{L}}}\,{\mathcal {N}}(\boldsymbol0,\boldsymbol\Sigma )}}\] then \[ {\displaystyle {{\sqrt {n}}[g(\boldsymbol{X}_{n})-g(\boldsymbol\theta )]\,{\xrightarrow {\mathcal{L}}}\,{\mathcal {N}}(0,\nabla g(\boldsymbol\theta )^\top\cdot\boldsymbol\Sigma\cdot \nabla g(\boldsymbol\theta ))}} \] for any function $g$ such that $ g())$ exists and is invertible.

2.4 Statistique Inférentielle

2.4.1 Statistique

On appelle échantillon le vecteur $\boldsymbol{x}= (x_1,\dots,x_n)$ vu comme la réalisation de $n$ variables aléatoires identiquement distribuées, $\boldsymbol{X}= (X_1,\dots,X_n)$. On supposera que les $X_i$ suivent une loi $F_{\theta}\in\mathcal{F}$ où le paramètre $\theta$ est inconnu, appartenant à un ensemble $\Theta$.

On parle de modèle identifiable si $F_{\theta}=F_{\theta'} \Longrightarrow\theta=\theta'$.

Une statistique est une variable aléatoire $T$ dépendant de l’échantillon $\boldsymbol{X}$ ne faisant pas intervenir le paramètre $\theta$. On notera $T=T(\boldsymbol{X})$.

On note $\mathcal{L}(\theta; \boldsymbol{x})$ la distribution jointe de $(\boldsymbol{X}_1,\dots,\boldsymbol{X}_n)$, appelée vraisemblance.

On dit que $T$ est exhaustive pour $\theta$, si la loi conditionnelle de $\boldsymbol{X}$ sachant $T=t$ ne dépend pas de $\theta$ pour tout $t$. Par le théorème de factorisation, $T$ est exhaustive s’il existe deux applications mesurables $g$ et $h$ telles que \[ \mathcal{L}(\theta;\boldsymbol{x}) = g(\boldsymbol{x}) \cdot h(T(\boldsymbol{x});\theta) \]

Soit $\boldsymbol{X}$ un échantillon de $n$ variables aléatoires i.i.d. de loi de Bernoulli $\mathcal B(p)$, alors la statistique $T(\boldsymbol{X})=\sum_j X_j$ est exhaustive pour $p$

Soit $\boldsymbol{X}$ un échantillon de $n$ variables aléatoires i.i.d. de loi normale $\mathcal{N}(\mu,\sigma^2)$, alors $T(\boldsymbol{X})=(\sum_j X_j,\sum_j X_j^2)$ est exhaustive pour le couple $(\mu,\sigma^2)$.

Soit une statistique $T$. Si pour $\boldsymbol{x}=(x_1,\dots,x_n)$ et $\boldsymbol{y}=(y_1,\dots,y_n)$ on a l’équivalence \[ T(\boldsymbol{x}) = T(\boldsymbol{y}) \Leftrightarrow \theta \mapsto \frac{\mathcal{L}(\theta,\boldsymbol{x})}{\mathcal{L}(\theta,\boldsymbol{y})} \text{ ne dépend pas de } \theta. \] alors $T$ est une statistique exhaustive minimale.

Une statistique exhaustive $T$t (à valeurs dans $\mathbb{R}^d$) d’un modèle paramétrique est dite complète si pour toute fonction mesurable $g:\mathbb{R}^d\to\mathbb{R}$ telle que $g(T)$ soit intégrable on ait \[ \forall \theta \in \Theta, \mathbb{E}_\theta (f(T))=0 \Rightarrow g(T)=0 \;\; P_\theta-p.s. \]

Soit un modèle paramétrique. Si $T$ est une statistique exhaustive complète

alors pour toute fonction mesurable $\varphi$ bijective, $\varphi(T)$ est une statistique exhaustive complète
alors $T$ est une statistique exhaustive minimale
alors $T$ est indépendante de toute statistique libre du modèle (théorème de Basu)

Dans un modèle statistique paramétrique, il existe toujours une statistique exhaustive minimale mais pas toujours de statistique exhaustive complète.

2.4.2 Vraisemblance, score et information de Fisher

Example d’un pile ou face

set.seed(123)
n = 20
(X = sample(0:1,size=n,replace=TRUE))

##  [1] 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 0 1

p=seq(0,1,by=.01)
logL=function(p){sum(log(dbinom(X,size=1,prob=p)))}
LL=sapply(p,logL)
plot(p,LL,type="l",col="red",lwd=2)
p0=.5
points(p0,logL(p0),pch=3,cex=1.5,lwd=2)
abline(v=p0,lty=2)
p_hat = mean(X)
points(p_hat,logL(p_hat),pch=3,cex=1.5,lwd=2,col="blue")
abline(v=p_hat,lty=2,col="blue")

Recall that if $X\sim F_\theta$ \[ {\displaystyle \mathbb{E}\left(\frac{\partial \log \mathcal{L}(X)}{\partial \theta}\right)}=0 \] while \[ {\displaystyle \operatorname{var}\left(\frac{\partial \log \mathcal{L}(X)}{\partial \theta}\right)}=\mathbb{E}\left(\frac{\partial \log \mathcal{L}(X)}{\partial \theta}\frac{\partial \log \mathcal{L}(X)}{\partial \theta^\top}\right)=I_n(\theta) \] Observe that \[ I_n(\theta)=n\cdot \operatorname{var}\left(\frac{\partial \log f_\theta(X)}{\partial \theta}\right)=\mathbb{E}\left(\frac{\partial \log f_\theta(X)}{\partial \theta}\frac{\partial \log f_\theta(X)}{\partial \theta^\top}\right)=n\cdot I(\theta) \] Given a vector $\boldsymbol{x}=(x_1,\dots,x_n)$, the score is the vector \[ s_n(\boldsymbol{x},\theta)=\nabla\log\mathcal{L}(\theta;\boldsymbol{x}) \]

To compute derivatives, use

slope_h = function(f,x,h=1e-6) (f(x+h)-f(x-h))/(2*h)

n = 77
S = rep(NA,1e5)
for(s in 1:length(S)){
  Xs = sample(0:1,size=n,replace=TRUE)
  logL = function(p){sum(log(dbinom(Xs,size=1,prob=p)))}
  S[s] = slope_h(logL,p0)
}
mean(S)

## [1] -0.03784

hist(S,probability = TRUE)
u=seq(min(S),max(S),length=251)
v=dnorm(u,0,sd(S))
lines(u,v,lty=2,col="red")

n = 77
S = rep(NA,1e5)
for(s in 1:length(S)){
  Xs = sample(0:1,size=n,replace=TRUE)
  logL = function(p){sum(log(dbinom(Xs,size=1,prob=p)))}
  dlogL = function(p){slope_h(logL,p)}
  S[s] = -slope_h(dlogL,p0)
}
mean(S)

## [1] 308.0002

n/(p0*(1-p0))

## [1] 308

2.4.3 Maximum de Vraisemblance

set.seed(123)
n = 20
(X = sample(0:1,size=n,replace=TRUE))

##  [1] 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 0 1

neglogL = function(p){-sum(log(dbinom(X,size=1,prob=p)))}
(pml = optim(fn=neglogL,par=p0,method="BFGS")$par)

## Warning in dbinom(X, size = 1, prob = p): NaNs produced

## Warning in dbinom(X, size = 1, prob = p): NaNs produced

## [1] 0.5499996

mean(X)

## [1] 0.55

2.4.4 Estimateur (ponctuel)

On appelle estimateur (ponctuel) de $\theta$ toute statistique $T$ prenant ses valeurs dans $\Theta$.

Un estimateur, $T$, de $\theta$ est dit sans biais, ou non biaisé, si $\mathbb{E}_\theta(T)=\theta$, où \[ \mathbb{E}_{\color{red}{\theta}}(T)=\int_{\mathbb{R}^n}T(x_1,\dots,x_n)dF_{\color{red}{\theta}}(x_1)\dots dF_{\color{red}{\theta}}(x_n) \] On définit le biais par $b(\theta)=\mathbb{E}_\theta(T)-\theta$

Le risque quadratique d’un estimateur $T$ de $\theta$ est $R(T,\theta) = \mathbb{E}_\theta[ (T-\theta)^2]$. On a \[ R(T,\theta) = b(\theta)^2 \, + \, \operatorname{Var}_\theta(T) \] (pour un estimateur sans biais, $R(T,\theta)=\operatorname{Var}_\theta(T)$)

Soient $T_1$ et $T_2$ deux estimateurs de $\theta$. On dira que $T_1$ est plus efficace que $T_2$ si $R(T_1,\theta)\leq R(T_2,\theta)$.

On dit que la suite d’estimateurs $(T_n)_{n\geq 1}$ d’estimateurs de $\theta$ est

convergente si $T_n \stackrel{\mathbb{P}}{\to} \theta$ pour tout $\theta\in \Theta$.
fortement convergente si $T_n \stackrel{p.s.}{\to} \theta$ pour tout $\theta\in \Theta$.
asymptotiquement normale si pour tout $\theta\in \Theta$, il existe une matrice de covariance $\boldsymbol{\Sigma}(\theta)$ telle que $\sqrt{n}(T_n-\theta) \to \mathcal{N}(0,\boldsymbol{\Sigma}(\theta))$

2.4.5 Maximum de Vraisemblance : propriétés asymptotiques

Sous certaines conditions de régularité (que nous n’énoncerons pas ici), alors l’estimateur du maximum de vraisemblance est fortement convergent et asymptotiquement efficace \[ \sqrt{n} \left( \hat\theta_n - \theta\right) \stackrel{\mathcal{L}}{\to } \mathcal{N}(0,I^{-1}(\theta)). \]

options(warn=-1)
set.seed(123)
n = 200
X = sample(0:1,size=n,replace=TRUE)
neglogL = function(p){-sum(log(dbinom(X,size=1,prob=p)))}
(pml = optim(fn=neglogL,par=p0,method="BFGS")$par)

## [1] 0.485

S = rep(NA,1e3)
for(s in 1:length(S)){
  Xs = sample(X,size=n,replace=TRUE)
  neglogL = function(p){-sum(log(dbinom(Xs,size=1,prob=p)))}
  S[s] = optim(fn=neglogL,par=p0,method="BFGS")$par
}
mean(S)

## [1] 0.4842846

var(S)

## [1] 0.001293088

(p0*(1-p0))/n

## [1] 0.00125

hist(S,probability = TRUE)
u=seq(min(S),max(S),length=251)
v=dnorm(u,mean(S),sd(S))
lines(u,v,lty=2,col="red")

2.4.6 Methode des Moments

2.4.7 Statistique Nonparametrique

X = Davis$height
par(mfrow=c(1,2))
plot(ecdf(X),main="")
u=seq(min(X)-2,max(X)+2,length=251)
v=ecdf(X)(u)
plot(u,v,type="s")

par(mfrow=c(1,2))
hist(X, main="")
plot(density(X))

 m=100
 inf_sample=function(n=10){
 X=matrix(rnorm(n*m),nrow=m,ncol=n)
 Xs=t(apply(X,1,sort))
 Pe_inf=matrix(rep((0:(n-1))/n,
 each=m),nrow=m,ncol=n)
 Pe_sup=matrix(rep((0:n)/n,each=m),
 nrow=m,ncol=n)
 Pt=pnorm(Xs)
 D1=abs(Pe_inf-Pt)
 D2=abs(Pe_sup-Pt)
 Df=(D1+D2)/2+abs(D2-D1)/2
 return(apply(Df,1,max))
 }
 B=matrix(NA,100,20)
 for(i in 1:20){
   B[,i]=inf_sample(i*10)
 }
 colnames(B)=as.character(seq(10,200,by=10))
 boxplot(B)

 u=seq(-3,3,by=.1)
 plot(u,u,ylim=c(0,1),col="white")
 M=matrix(NA,length(u),1000)
 for(m in 1:1000){
 n=100
 x=rnorm(n)
 Femp=Vectorize(function(t) mean(x<=t))
 v=Femp(u)
 M[,m]=v
 lines(u,v,col='light blue',type="s")
 }
 lines(u,apply(M,1,mean),col="red",type="l")
 lines(u,apply(M,1,function(x) quantile(x,.05)),
 col="red",type="s")
 lines(u,apply(M,1,function(x) quantile(x,.95)),
 col="red",type="s")

 x0=-1
 y=M[which(u==x0),]
 hist(y,probability=TRUE,
 breaks=seq(.015,0.55,by=.01))
 vu=seq(0,1,by=.001)
 lines(vu,dnorm(vu,pnorm(x0),
 sqrt((pnorm(x0)*(1-pnorm(x0)))/100)),
 col="red")

2.4.8 Melange

library(MASS)
hist(X,col=rgb(0,0,1,.4), border="white",proba=TRUE,xlab="",main="")
(param <- fitdistr(X,"normal")$estimate)

##      mean        sd 
## 170.56500   8.90987

f1 <- function(x) dnorm(x,param[1],param[2]) 
x=seq(100,210,by=.2) 
lines(x,f1(x),lty=2,col="red") 
lines(density(X))

library(mixtools)

## mixtools package, version 1.1.0, Released 2017-03-10
## This package is based upon work supported by the National Science Foundation under Grant No. SES-0518772.

mix <- normalmixEM(X)

## One of the variances is going to zero;  trying new starting values.
## number of iterations= 237

(param12 <- c(mix$lambda[1],mix$mu,mix$sigma))

## [1]   0.1502809 163.7585980 171.7687831   2.5671638   9.0894373

f2 <- function(x){ param12[1]*dnorm(x,param12[2],param12[4])+
 (1-param12[1])*dnorm(x,param12[3],param12[5]) }
hist(X,col=rgb(0,0,1,.4), border="white",proba=TRUE,xlab="",main="",ylim=c(0,.05))
(param <- fitdistr(X,"normal")$estimate)

##      mean        sd 
## 170.56500   8.90987

lines(x,f2(x),lwd=2, col="red") 
lines(density(X))

logdf <- function(x,parameter){
 p <- parameter[1]
 m1 <- parameter[2]
 s1 <- parameter[4]
 m2 <- parameter[3]
 s2 <- parameter[5]
 return(log(p*dnorm(x,m1,s1)+(1-p)*dnorm(x,m2,s2)))
 }
 logL <- function(parameter) -sum(logdf(X,parameter))
 Amat <- matrix(c(1,-1,0,0,0,0,
 0,0,0,0,1,0,0,0,0,0,0,0,0,1), 4, 5)
 bvec <- c(0,-1,0,0)
 constrOptim(c(.5,160,180,10,10), logL, NULL, ui = Amat, ci = bvec)$par

## [1]   0.5996263 165.2690084 178.4991624   5.9447675   6.3564746

(pM <- mean(Davis$sex=="M"))

## [1] 0.44

(paramF <- fitdistr(X[Davis$sex=="F"],"normal")$estimate)

##       mean         sd 
## 164.714286   5.633808

(paramM <- fitdistr(X[Davis$sex=="M"],"normal")$estimate)

##       mean         sd 
## 178.011364   6.404001

hist(X,col=rgb(0,0,1,.4), border="white",proba=TRUE,xlab="",main="")
(param <- fitdistr(X,"normal")$estimate)

##      mean        sd 
## 170.56500   8.90987

f4 <- function(x) pM*dnorm(x,paramM[1],paramM[2])+(1-pM)*dnorm(x,paramF[1],paramF[2])
lines(x,f4(x),lwd=3,col="blue")

2.5 Tests

2.5.1 Decision

set.seed(123)
n = 20
(X = sample(0:1,size=n,replace=TRUE))

##  [1] 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 0 1

neglogL = function(p){-sum(log(dbinom(X,size=1,prob=p)))}
(pml = optim(fn=neglogL,par=p0,method="BFGS")$par)

## [1] 0.5499996

nx=sum(X==1)
f = expression(nx*log(p)+(n-nx)*log(1-p))
Df = D(f, "p")
Df2 = D(Df, "p")
p=p0
score=eval(Df)
(IF=-eval(Df2))

## [1] 80

1/(p0*(1-p0)/n)

## [1] 80

On veut tester ici si la probabilité d’avoir pile peut-être égale à 50%, $H_0:p=1/2$.

2.5.2 Test de Wald

Considérons un test $H_0:\theta=\theta_0$, contre $H_1:\theta\neq\theta_0$, alors la statistique de Wald est \[ {\displaystyle W={\frac {({\widehat {\theta }}-\theta _{0})^{2}}{\operatorname {var} ({\hat {\theta }})}}}\] qui suit asymptotiquement, sous $H_0$ une loi $\chi^2(1)$.

alpha=0.05
(T=(pml-p0)^2*IF)

## [1] 0.199997

T<qchisq(1-alpha,df=1)

## [1] TRUE

u=seq(0,qchisq(.99,df=1),length=601)
v=dchisq(u,df=1)
plot(u,v,col="white",xlab="",ylab="")
text(2,2,"ACCEPT",col="green")
text(5.5,2,"REJECT",col="red")
idx=which(u<=qchisq(1-alpha,df=1))
polygon(c(u[idx],rev(u[idx])),c(v[idx],rep(0,length(idx))),col="green",border=NA)
idx=which(u>qchisq(1-alpha,df=1))
polygon(c(u[idx],rev(u[idx])),c(v[idx],rep(0,length(idx))),col="red",border=NA)
abline(v=qchisq(1-alpha,df=1),lty=2)
abline(v=T,col="blue")

or we can compute the p-value

1-pchisq(T,df=1)

## [1] 0.6547233

u=seq(0,qchisq(.99,df=1),length=601)
v=dchisq(u,df=1)
plot(u,v,col="white",xlab="",ylab="")
text(3,2,"p-VALUE",col="blue")
idx=which(u<=T)
polygon(c(u[idx],rev(u[idx])),c(v[idx],rep(0,length(idx))),col="yellow",border=NA)
idx=which(u>T)
polygon(c(u[idx],rev(u[idx])),c(v[idx],rep(0,length(idx))),col="blue",border=NA)
abline(v=qchisq(1-alpha,df=1),lty=2)
abline(v=T,col="blue")

2.5.3 Test du Rapport de Vraisemblance

Si on cherche à tester ${\displaystyle H_{0}\,:\,\theta \in \Theta _{0}}$ contre ${\displaystyle H_{1}\,:\,\theta \notin \Theta _{0}}$, la statistique de test est \[ {\displaystyle LR=-2\log \left[{\frac {\sup _{\theta \in \Theta _{0}}{\mathcal {L}}(\theta )}{\sup _{\theta \in \Theta }{\mathcal {L}}(\theta )}}\right]} \] qui suit asymptotiquement, sous $H_0$, une loi $\chi^2(1)$ si $\theta$ est univarié.

logL = function(p){sum(log(dbinom(X,size=1,prob=p)))}
(T=2*(logL(pml)-logL(p0)))

## [1] 0.2003347

T<qchisq(1-alpha,df=1)

## [1] TRUE

2.5.4 Test du Score

Considérons un test $H_0:\theta=\theta_0$, contre $H_1:\theta\neq\theta_0$, et notons ${\displaystyle U(\theta )}$ le score, \[ {\displaystyle U(\theta )={\frac {\partial \log L(\theta \mid x)}{\partial \theta }}.} \] et $I(\theta)$ l’information de Fisher, \[ {\displaystyle I(\theta )=-\mathbb{E} \left[{\frac {\partial ^{2}}{\partial \theta ^{2}}}\log \mathcal{L}(\theta )\right]} \] alors la statistique du score (ou le multiplicateur de Lagrande) est \[ {\displaystyle S(\theta _{0})={\frac {U(\theta _{0})^{2}}{I(\theta _{0})}}} \] qui suit asymptotiquement, sous $H_0$, une loi $\chi^2(1)$.

(T=score^2/IF)

## [1] 0.2

T<qchisq(1-alpha,df=1)

## [1] TRUE

2.6 Vraisemblance Profilee

set.seed(123)
x = exp(rnorm(100))
library(MASS)
(F = fitdistr(x,"gamma"))

##      shape       rate   
##   1.3562877   0.8207035 
##  (0.1732663) (0.1263275)

F$estimate[1]+c(-1,1)*1.96*F$sd[1]

## [1] 1.016686 1.695890

log_lik = function(theta){
   a = theta[1]
   b = theta[2]
   logL = sum(log(dgamma(x,a,b)))
   return(-logL)
}
optim(c(1,1),log_lik)

## $par
## [1] 1.3558113 0.8206505
## 
## $value
## [1] 147.6097
## 
## $counts
## function gradient 
##       45       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

options(warn=-1)
prof_log_lik = function(a){
   b = (optim(1,function(z) -sum(log(dgamma(x,a,z)))))$par
   return(-sum(log(dgamma(x,a,b))))
}
optim(1,prof_log_lik)

## $par
## [1] 1.356445
## 
## $value
## [1] 147.6097
## 
## $counts
## function gradient 
##       26       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

vx = seq(.5,3,length=101)
vl = -Vectorize(prof_log_lik)(vx)
plot(vx,vl,type="l")
abline(v=optim(1,prof_log_lik)$par,lty=2)
abline(h=-optim(1,prof_log_lik)$value)
abline(h=-optim(1,prof_log_lik)$value-qchisq(.95,1)/2)
segments(F$estimate[1]-1.96*F$sd[1],
           -170,F$estimate[1]+1.96*F$sd[1],
         -170,lwd=3,col="blue")

borne = -optim(1,prof_log_lik)$value-qchisq(.95,1)/2
(b1 = uniroot(function(z) Vectorize(prof_log_lik)(z)+borne,c(.5,1.5))$root)

## [1] 1.046505

(b2 = uniroot(function(z) Vectorize(prof_log_lik)(z)+borne,c(1.25,2.5))$root)

## [1] 1.727274

2.7 Analyse de la Variance

Avec les notations usuelles, on suppose disposer d’observations individuelles $y_{i,j}$ appartenant à un groupe $j\in\{1,2,\cdots,k\}$. On suppose que \[ y_{i,j} = \beta_0+\beta_j+\varepsilon_{i,j} \] où (classiquement) les $\varepsilon_{i,j}$ sont i.i.d. de loi $\mathcal{N}(0,\sigma^2)$.

Sur la forme, on a un soucis car ce modèle n’est pas identiable. On a trois options possible.

Supprimer la constante \[ y_{i,j} = \alpha_j+\varepsilon_{i,j} \]
Supprimer une modalité, par exemple la première, i.e. $\beta_1=0$ \[ y_{i,j} = \beta_0+\beta_j+\varepsilon_{i,j},~~~~\beta_1=0 \]
On rajoute une contrainte (linéaire) \[ y_{i,j} = \gamma_0+\gamma_j+\varepsilon_{i,j},~~~~\sum_{j=1}^k\gamma_j=0 \] (on parlera) de contrastes.

Ces trois modèles sont équivalents.

L’estimateur du maximum de vraisemblance du modèle est ici \[ \widehat{\alpha}_j=\frac{1}{n_j}\sum_{i\in\mathcal{I}_j}y_{i,j}=\overline{y}_{\cdot,j} \] \[ \widehat{\beta}_0=\frac{1}{n_1}\sum_{i\in\mathcal{I}_1}y_{i,j}=\overline{y}_{\cdot,1}~~~~~~~~ \widehat{\beta}_j=\overline{y}_{\cdot,j}-\overline{y}_{\cdot,1} \] ou \[ \widehat{\gamma}_0=\frac{1}{n}\sum_{j=1}^k\sum_{i\in\mathcal{I}_j}y_{i,j}~~~~~~~~ \widehat{\gamma}_j=\overline{y}_{\cdot,j}-\overline{y} \] On va alors chercher à tester $H_0:\alpha_1=\cdots=\alpha_k$, ou encore $H_0:\beta_2=\cdots=\beta_k=0$ ou enfin $H_0:\gamma_1=\cdots=\gamma_k=0$. Techniquement, on a ici un test multiple.

Notons pour commencer que \[ {\displaystyle SCR_{\text{total}}=SCR_{\text{facteur}}+SCR_{\text{résidus}}~} \] où \[ \underbrace{{\displaystyle \sum _{j=1}^{k}\sum _{i\in\mathcal{I}_j}(y_{i,j}-{\overline {y}})^{2}}}_{SCR_{\text{total}}} = \underbrace{{\displaystyle \sum _{j=1}^{k}(\overline{y}_{\cdot,j}-{\overline {y}})^{2}}}_{SCR_{\text{facteur}}} + \underbrace{{\displaystyle \sum _{j=1}^{k}\sum _{i\in\mathcal{I}_j}(y_{i,j}-{\overline {y}_{\cdot,j}})^{2}}}_{SCR_{\text{résidus}}} \] Sous une hypothèse de normalité des résidus $\varepsilon_{i,j}$, on peut montrer que si $H_0$ est vraie \[ {\displaystyle SCR_{\text{facteur}}=\sum _{j=1}^{k}n_{j}({\overline {y_{\cdot,j}}}-{\overline {y}})^{2}\sim \chi ^{2}(DL_{\text{facteur}})\qquad {\text{avec}}~DL_{\text{facteur}}=\sum _{j=1}^{k-1}1=k-1} \] \[ {\displaystyle SCR_{\text{résidus}}=\sum _{j=1}^{k}\sum _{i\in\mathcal{I}_j}(y_{i,j}-{\overline {y_{\cdot,j}}})^{2}\sim \chi ^{2}(DL_{\text{résidus}})\quad {\text{avec}}~DL_{\text{résidus}}=\sum_{j=1}^{k}(n_{j}-1)=(n_{1}-1)+(n_{2}-1)+\cdots +(n_{k}-1)=n-k} \] Alors \[ {\displaystyle F={\frac {\dfrac {SCR_{\text{facteur}}}{k-1}}{\dfrac {SCR_{\text{résidus}}}{n-k}}}\sim \mathcal F(k-1,n-k)} \] Dans les données Davis, on a deux groupes (correspondant au genre).

aov(weight ~ sex, data = Davis)

## Call:
##    aov(formula = weight ~ sex, data = Davis)
## 
## Terms:
##                      sex Residuals
## Sum of Squares  17799.20  17522.79
## Deg. of Freedom        1       198
## 
## Residual standard error: 9.407389
## Estimated effects may be unbalanced

AOV = aov(weight ~ sex, data = Davis)
summary(AOV)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## sex           1  17799   17799   201.1 <2e-16 ***
## Residuals   198  17523      88                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3 Bases de Données

3.1 Variable Continue

3.2 Variable Factorielle

Le genre sex dans la base Davis est ici une variable catégorielle, ou factorielle, prenant deux modalités F et M (dans cet ordre)

str(Davis$sex)

##  Factor w/ 2 levels "F","M": 2 1 1 2 1 2 2 2 2 2 ...

On peut changer l’ordre (et la modalité dite de référence) en utilisant

Davis$sex = relevel(Davis$sex, "M")
str(Davis$sex)

##  Factor w/ 2 levels "M","F": 1 2 2 1 2 1 1 1 1 1 ...

mais pour ne pas embrouiller, laissons F comme modalité de référence par la suite

Davis$sex = relevel(Davis$sex, "F")

Pour créer les variables indicatrices associées, on peut utiliser

M = model.matrix(~ 0+sex, Davis)
head(cbind(Davis$sex,M),10)

##      sexF sexM
## 1  2    0    1
## 2  1    1    0
## 3  1    1    0
## 4  2    0    1
## 5  1    1    0
## 6  2    0    1
## 7  2    0    1
## 8  2    0    1
## 9  2    0    1
## 10 2    0    1

4 Modele lineaire simple

\[ y_i = \beta_0+\beta_1 x_i+\varepsilon_i \] where

$\beta_0$ and $\beta_1$ are unknown regression parameters
$\varepsilon_i$ is the unobservable random error term (or residual)

Consider $n$ pairs of observations $(x_i,y_i)$.

par(mfrow=c(1,2))
plot(Davis$height,Davis$weight,xlab="Height (cm)",ylab="Weight (kg)")
plot(cars)

4.1 Modele Probabiliste

Hypothese 1 les observations $(y_i,x_i)$ sont des realisations de variables aleatoires $(Y_i,X_i)$.

Hypothese 2 les erreurs $\varepsilon_i$ sont des realisations de variables aleatoires $\varepsilon_i$, independantes, centrees $\mathbb{E}[\varepsilon_i]=0$ et de variance constante $\text{Var}[\varepsilon_i]=\sigma^2$.

4.2 Modele Gaussien

Hypothese 2’ les erreurs $\varepsilon_i$ sont des réalisations de variables aleatoires $\varepsilon_i$, independantes, de loi $\mathcal{N}(0,\sigma^2)$.

On suppose ici que $(Y|X=x)\sim\mathcal{N}( \beta_0+\beta_1 x,\sigma^2)$

reg = lm(weight~height, data=Davis)
summary(reg)

## 
## Call:
## lm(formula = weight ~ height, data = Davis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.658  -5.381  -0.555   4.807  42.894 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -130.91040   11.52792  -11.36   <2e-16 ***
## height         1.15009    0.06749   17.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.505 on 198 degrees of freedom
## Multiple R-squared:  0.5946, Adjusted R-squared:  0.5925 
## F-statistic: 290.4 on 1 and 198 DF,  p-value: < 2.2e-16

4.3 Moindres Carres

La somme des carres de residus est \[ SCR(\beta_0,\beta_1)=\sum_{i=1}^n \big(y_i - [ \beta_0+\beta_1 x_i]\big)^2 \] On note \[ (\widehat{\beta}_0,\widehat{\beta}_1)=\text{argmin}\left\lbrace SCR(\beta_0,\beta_1)\right\rbrace \] First order conditions are here \[ \left.\frac{\partial SCR(\beta_0,\beta_1)}{\partial\beta_0}\right|_{(\widehat{\beta}_0,\widehat{\beta}_1)}=0 \] while \[ \left.\frac{\partial SCR(\beta_0,\beta_1)}{\partial\beta_1}\right|_{(\widehat{\beta}_0,\widehat{\beta}_1)}=0 \] Then \[ {\hat {\beta}_0}={\bar {y}}-{\hat {\beta }_1}{\bar {x}} \] while \[ {\hat {\beta }_1}={\frac {\sum_{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{\sum_{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}={\frac {s_{xy}}{s_{x}}}= \text{corr}(x,y)\cdot{\frac {s_{y}}{s_{x}}} \] or \[ {\hat {\beta }_1}=\frac{\sum_{i=1}^{n}x_{i}y_{i}-\bar {x}\bar {y}}{\sum_{i=1}^{n}x_{i}^2-n{\bar {x}}} \]

X = Davis$height
Y = Davis$weight
(Bhat = cor(X,Y) * sd(Y)/sd(X))

## [1] 1.150092

(Ahat = mean(Y) - Bhat*mean(X))

## [1] -130.9104

par(mfrow=c(1,2))
plot(X,Y)
abline(v=mean(X),col="red",lty=2)
abline(h=mean(Y),col="red",lty=2)
abline(a = Ahat, b=Bhat, lwd=3, col="red")
Y_hat = Ahat + Bhat*X
E_hat = Y-Y_hat
plot(X,Y)
abline(a = Ahat, b=Bhat, lwd=3, col="red")
segments(X,Y,X,Y_hat,col="light blue")

SCR = function(B){sum((Y - (Ahat + B*X))^2)}
x = seq(-10,10)
y = Vectorize(SCR)(x)
plot(x,y,type="l",ylab="Somme des carres des erreurs")
abline(v=Bhat,col="red",lty=2)

The fitted values (or predictions) are \[ \widehat{y}_i = \hat {\beta }_0+\hat {\beta }_1x_i \] and residuals are \[ \widehat{\varepsilon}_i={y}_i-\widehat{y}_i \] Observe (first first order condition) that \[ \sum_{i=1}^n \widehat{\varepsilon}_i=0 \] (provided that there is an intercept term - $\beta_0$ - in the model) and (second first order condition) \[ \sum_{i=1}^n x_i\widehat{\varepsilon}_i=0 \] Notons que \[ \underbrace{\sum_{i=1}^n (y_i-\bar y)^2}_{\text{total sum of squares}} = \underbrace{\sum_{i=1}^n (y_i-\hat y_i)^2}_{\text{residual sum of squares}} + \underbrace{\sum_{i=1}^n (\hat y_i-\bar y)^2}_{\text{explained sum of squares}} \]

Le coefficient de determination $R^2$ est \[R^2=1-\frac{x}{x} \]

On peut montrer que \[ R^2=\widehat{\beta}_1^2\frac{s_x}{s_y}=\frac{s_{xy}^2}{s_xs_y}=\text{cor}[x,y]^2 \]

Considerons l’estimateur suivant de $\sigma^2$, \[ s^2=\frac{1}{n-2}\sum_{i=1}^n\hat\varepsilon_i^2 \]

Parmi les autres interprétations des estimateurs, notons que les paramètres sont des fonctions linéaires des $y_i$: \[ \widehat{\beta}_1 = \sum_{i=1}^n\color{blue}{\omega_{1,i}}\cdot y_i~~~\color{blue}{\omega_{1,i}}=\frac{x_i-\overline{x}}{s_x^2} \] et \[ \widehat{\beta}_0 = \sum_{i=1}^n\color{red}{\omega_{0,i}}\cdot y_i~~~\color{red}{\omega_{0,i}}=\frac{1}{n}-\overline{x}\color{blue}{\omega_{1,i}} \] Attention: la notation $\omega$ ne signifie pas vraiment que l’on ait ici des poids : les $\omega_i$ peuvent être négatifs. Par exemple, on notera que \[ \sum_{i=1}^n \color{blue}{\omega_{1,i}}=0,~~ \sum_{i=1}^n \color{blue}{\omega_{1,i}}\cdot x_i=1,~~ \sum_{i=1}^n \color{blue}{\omega_{1,i}}^2=\frac{1}{s_x^2} \]

4.4 Propriétés des estimateurs

$\widehat{\beta}_0$ et $\widehat{\beta}_1$ sont des estimateurs sans biais de $\beta_0$ et $\beta_1$ respectivement (cf section simulations), \[ \mathbb{E}[\widehat{\beta}_0]={\beta}_0~~,~~ \mathbb{E}[\widehat{\beta}_1]={\beta}_1 \] Les variances sont respectivement \[ \text{Var}[\widehat{\beta}_0]=\sigma^2\left(\frac{1}{n}+\frac{\overline{x}^2}{s_x^2}\right)~~,~~\text{Var}[\widehat{\beta}_1]=\frac{\sigma^2}{s_x^2} \] mais comme $\sigma$ est ici inconnue, on estime ces variances par \[ \widehat{\text{Var}}[\widehat{\beta}_0]=\widehat{\sigma}^2\left(\frac{1}{n}+\frac{\overline{x}^2}{s_x^2}\right)~~,~~\widehat{\text{Var}}[\widehat{\beta}_1]=\frac{\widehat{\sigma}^2}{s_x^2} \] soit pour le second un ecart-type \[ {\displaystyle s_{\hat {\beta }_1}={\sqrt {\frac {{\frac {1}{n-2}}\sum_{i=1}^{n}{\hat{\varepsilon }}_{i}^{,2}}{\sum_{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}}} \] et pour le premier, une reecriture sous la forme \[ {\displaystyle s_{\hat {\beta }_0}=s_{\hat {\beta }_1}{\sqrt {{\frac {1}{n}}\sum_{i=1}^{n}x_{i}^{2}}}={\sqrt {{\frac{1}{n(n-2)}}\left(\sum_{i=1}^{n}{\hat{\varepsilon}}_{j}^{,2}\right){\frac {\sum_{i=1}^{n}x_{i}^{2}}{\sum_{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}}}} \]

4.5 Significativité

Rappelons que \[ \underbrace{\sum_{i=1}^n (y_i-\bar y)^2}_{\text{TSS}} = \underbrace{\sum_{i=1}^n (y_i-\hat y_i)^2}_{\text{RSS}} + \underbrace{\sum_{i=1}^n (\hat y_i-\bar y)^2}_{\text{ESS}} \] La variance de $y$ est estimée par \[ s^2_y=\frac{1}{n-1}\sum_{i=1}^n (y_i-\bar y)^2=\frac{\text{TSS}}{n-1} \] La variance de $\varepsilon$ est estimée par \[ \widehat{\sigma}^2=\frac{1}{n-2}\sum_{i=1}^n (y_i-\widehat{y}_i)^2=\frac{1}{n-2}\sum_{i=1}^n \widehat{\varepsilon}_i^2=\frac{\text{RSS}}{n-2} \]

4.5.1 Test de Student

Si $Y|X$ suit une loi normale, avec $\varepsilon$ suit une loi normale, et en supposant l’indépendance des résidus, on obtient la normalité de $\widehat{\beta}_0$ et $\widehat{\beta}_1$. Aussi, une statistique naturelle de test de $H_0:\beta_1=\color{blue}{0}$ contre $H_1:\beta_0\neq\color{blue}0$ est basé sur \[ T=\frac{\widehat{\beta}_1-\color{blue}{0}}{\sqrt{\widehat{\text{Var}}[\widehat{\beta}_1]}} \] qui suit, sous $H_0$une loi de Student $\mathcal{S}td(n-2)$. Comme c’est un test bi-latéral, on utilise alors une des deux méthodes - région critique : si $t^{-1}_{n-2}(1-\alpha)$ désigne le quantile de niveau $1-\alpha$ de la loi $\mathcal{S}td(n-2)$, on rejette $H_0$ si $|T|>t^{-1}_{1,n-2}(1-\alpha/2)$, - p-value : on rejette $H_0$ si $p=\mathbb{P}[|Y|>|T||Y\sim\mathcal{S}td(n-2)]<\alpha$.

4.5.2 Test de Fisher

Considerons le test $H_0:\beta_1=0$ contre $H_1:\beta_0\neq0$. La statistique de Fisher (analyse de la variance) vise à comparer les résidus de deux modèles : (0) $y_i=\beta_0+\varepsilon_i$ et (0) $y_i=\beta_0+\beta_1x_i+\varepsilon_i$, i.e. \[ F=\frac{(TSS - RSS)/1}{RSS/(n-2)}=\frac{ESS}{RSS/(n-2)} \] qui suit, sous $H_0$ une loi de Fisher $\mathcal{F_{1,n-2}}$. On utilise alors une des deux méthodes - région critique : si $F^{-1}_{1,n-2}(1-\alpha)$ désigne le quantile de niveau $1-\alpha$ de la loi $\mathcal{F_{1,n-2}}$, on rejette $H_0$ si $F>F^{-1}_{1,n-2}(1-\alpha)$, - p-value : on rejette $H_0$ si $p=\mathbb{P}[Y>F|Y\sim\mathcal{F_{1,n-2}}]<\alpha$.

On peut noter que \[ F=(n-2)\frac{R^2}{1-R^2} \]

reg = lm(weight~height, data=Davis)
summary(reg)$r.squared

## [1] 0.5945555

(nrow(Davis)-2)*summary(reg)$r.squared/(1-summary(reg)$r.squared)

## [1] 290.353

summary(reg)$fstatistic

##   value   numdf   dendf 
## 290.353   1.000 198.000

4.5.3 Test de Wald

4.5.4 Lien entre les trois tests

Dans le cas de la significativité de la pente, i.e. $H_0:\beta_1=0$, notons que \[ T^2=\frac{\widehat{\beta}_1^2}{\widehat{\sigma}^2/s_x^2}=W= \frac{\widehat{\beta}_1^2s_x^2}{\widehat{\sigma}^2}= \frac{ESS}{RSS/(n-2)}=F. \]

4.6 Prévision et Incertitude

4.6.1 Prévision

Comme $\varepsilon\perp x$, \[ \widehat{y}_i=\widehat{\beta}_0+\widehat{\beta}_1x_i \] et si on considere une nouvelle observation $\color{red}{x}$, on aurait \[ \widehat{y}_{\color{red}{x}}=\widehat{\beta}_0+\widehat{\beta}_1\color{red}{x} \] et on aura comme observation \[y_{\color{red}{x}}=\beta_0+\beta_1{\color{red}{x}}+\varepsilon\] ou $\varepsilon$ est un bruit imprevisible.

Notons que \[ y_{\color{red}{x}}-\widehat{y}_{\color{red}{x}}=\big(\beta_0+\beta_1{\color{red}{x}}+\varepsilon\big)-\big(\widehat{\beta}_0+\widehat{\beta}_1\color{red}{x}\big) \] soit \[ y_{\color{red}{x}}-\widehat{y}_{\color{red}{x}}=\varepsilon+\big(\beta_0-\widehat{\beta}_0\big)+\big(\beta_1-\widehat{\beta}_1\big){\color{red}{x}} \] de telle sorte que \[ \mathbb{E}[y_{\color{red}{x}}-\widehat{y}_{\color{red}{x}}]=0 \] donc $\widehat{y}_{\color{red}{x}}$ est un estimateur dans biais de $y_{\color{red}{x}}$.

4.6.2 Incertitude

Si on continue, \[ \text{Var}[y_{\color{red}{x}}-\widehat{y}_{\color{red}{x}}] =\text{Var}[\varepsilon]+\text{Var}\big(\widehat{\beta}_0+\widehat{\beta}_1\color{red}{x}\big) \] qui se réécrit \[ \text{Var}[y_{\color{red}{x}}-\widehat{y}_{\color{red}{x}}] = \sigma^2 + \left(\frac{\sigma^2}{n}+\frac{\sigma^2(\color{red}{x}-\overline{x})^2}{s_x^2}\right) \] Frees (Regression Modeling with Actuarial and Financial Applications, 2014) appele “standard error of prediction” la racine carrée de l’estimateur de cette grandeur \[ \sqrt{\widehat{\text{Var}}[y_{\color{red}{x}}-\widehat{y}_{\color{red}{x}}] }=\widehat{\sigma}\cdot\sqrt{1+\frac{1}{n}+\frac{(\color{red}{x}-\overline{x})^2}{s_x^2}} \] En notant que si les residus sont supposes Gaussien, on peut alors construire un intervalle de confiance de prédiction pour $y_{\color{red}{x}}$ : \[ \underbrace{\widehat{\beta}_0+\widehat{\beta}_1\color{red}{x}}_{\widehat{y}_{\color{red}{x}}} \pm t_{n-2,1-\alpha/2}\cdot\widehat{\sigma}\cdot\sqrt{1+\frac{1}{n}+\frac{(\color{red}{x}-\overline{x})^2}{s_x^2}} \]

reg = lm(weight~height, data=Davis)
par(mfrow=c(1,1))
plot(Davis$height,Davis$weight)
u=seq(min(Davis$height),max(Davis$height),lenght=251)
v=predict(reg,newdata = data.frame(height=u),interval = "prediction")
lines(u,v[,1],col="blue",lwd=2)
lines(u,v[,2],col="red",lty=2)
lines(u,v[,3],col="red",lty=2)

On retrouve dans l’expression précédante deux sources d’erreur - l’erreur de modèle, venant du fait que la réalisation $y$ est $\beta_0+\beta_1x$ auquel s’ajoute un bruit $\varepsilon$ - l’erreur d’estimation, venant du fait que $\beta_0+\beta_1x$ est incertain

Pour l’erreur d’estimation, notons que \[ \text{Var}[\widehat{y}_{\color{red}{x}}] = \left(\frac{\sigma^2}{n}+\frac{\sigma^2(\color{red}{x}-\overline{x})^2}{s_x^2}\right) \] d’où un intervalle de confiance pour $\widehat{y}_{\color{red}{x}}$ de la forme \[ \underbrace{\widehat{\beta}_0+\widehat{\beta}_1\color{red}{x}}_{\widehat{y}_{\color{red}{x}}} \pm t_{n-2,1-\alpha/2}\cdot\widehat{\sigma}\cdot\sqrt{\frac{1}{n}+\frac{(\color{red}{x}-\overline{x})^2}{s_x^2}} \]

reg = lm(weight~height, data=Davis)
par(mfrow=c(1,1))
plot(Davis$height,Davis$weight)
u=seq(min(Davis$height),max(Davis$height),lenght=251)
v=predict(reg,newdata = data.frame(height=u),interval = "confidence")
lines(u,v[,1],col="blue",lwd=2)
lines(u,v[,2],col="red",lty=2)
lines(u,v[,3],col="red",lty=2)

4.6.3 Simulations

Deux méthodes de boostrap sont envisageables.

La première consiste a adapter directement la méthode dans le cas univarié en tirant avec remise de paires $(x_i,y_i)$

library(scales)
BETA = matrix(NA,1000,2)
for(s in 1:1000){
  set.seed(s)
  idx = sample(1:nrow(Davis),nrow(Davis),replace=TRUE)
  reg_sim = lm(weight~height, data=Davis[idx,])
  BETA[s,] = reg_sim$coefficients
}
pic_ani = function(k){
  plot(Davis$height,Davis$weight,ylab="Weight (kg)",xlab="Height (cm)",col="white")
  for(s in 1:100) abline(BETA[s,1],BETA[s,2],col=alpha("light blue",.3))
  set.seed(k)
  idx = sample(1:nrow(Davis),nrow(Davis),replace=TRUE)
  points(Davis$height,Davis$weight,col="grey")
  points(Davis$height[idx],Davis$weight[idx],pch=19)
  abline(BETA[k,1],BETA[k,2],col="blue",lwd=2)
}
for (k in 1:50) {pic_ani(k)}

La seconde consiste a utiliser le modele estime, et a tirer avec remise les residus dans l’echantillon $(x_i,\widehat{y}_i+\widehat{\varepsilon}_i)$

par(mfrow=c(1,2))
hist(BETA[,1],probability=TRUE,col=rgb(1,0,0,.4),border="white")
m_std = summary(lm(weight~height, data=Davis))$coefficients[1,1:2]
u=seq(m_std[1]-3*m_std[2],m_std[1]+3*m_std[2],length=251)
v=dnorm(u,m_std[1],m_std[2])
lines(u,v,col="red")
hist(BETA[,2],probability=TRUE,col=rgb(1,0,0,.4),border="white")
m_std = summary(lm(weight~height, data=Davis))$coefficients[2,1:2]
u=seq(m_std[1]-3*m_std[2],m_std[1]+3*m_std[2],length=251)
v=dnorm(u,m_std[1],m_std[2])
lines(u,v,col="red")

De ces simulations, on peut contruire des intervalles de confiances pour $\widehat{y}$,

u=seq(min(Davis$height),max(Davis$height),lenght=251)
PRED = matrix(NA,nrow(BETA),length(u))
for(s in 1:nrow(BETA)) PRED[s,]=BETA[s,1]+BETA[s,2]*u
plot(Davis$height,Davis$weight,xlab="Height (cm)",ylab="Weight (kg)")
lines(u,apply(PRED,2,mean),col="blue",lwd=2)
lines(u,apply(PRED,2,function(x) quantile(x,.975)),col="red",lty=2)
lines(u,apply(PRED,2,function(x) quantile(x,.025)),col="red",lty=2)

4.7 Régression sur une variable catégorielle

La variable de genre sex dans la base Davis est une variable catégorielle, prenant 2 modalités "F" et "M"

str(Davis$sex)

##  Factor w/ 2 levels "F","M": 2 1 1 2 1 2 2 2 2 2 ...

table(Davis$sex)

## 
##   F   M 
## 112  88

par(mfrow=c(1,3))
boxplot(Davis$weight~Davis$sex)
library(vioplot)
vioplot(Davis$weight~Davis$sex)
library(gplots)
plotmeans(weight ~ sex, data = Davis,
          main="", ylim=range(Davis$weight))

4.7.1 Testing correlation

On peut utiliser un test de comparaison de moyenne, pour voir si le poids moyen des hommes et des femmes sont differents,

t.test(x = Davis$weight[Davis$sex=="M"],
       y = Davis$weight[Davis$sex=="F"])

## 
##  Welch Two Sample t-test
## 
## data:  Davis$weight[Davis$sex == "M"] and Davis$weight[Davis$sex == "F"]
## t = 13.35, df = 131.41, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  16.18869 21.82105
## sample estimates:
## mean of x mean of y 
##  75.89773  56.89286

On peut aussi faire une Analyse de la Variance

res_aov = aov(weight ~ sex, data = Davis)
summary(res_aov)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## sex           1  17799   17799   201.1 <2e-16 ***
## Residuals   198  17523      88                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.8 Regression sur un facteur

On ne peut pas régresser sur une variable qui prendre des modalités $\{$"F", "M"$\}$ : on va construire des variables binaires (indicatrices) pour les différentes modalités, comme (sex=="M").

On peut montrer facilement que si on construit un modèle $y_i=\beta_0+\beta_1\boldsymbol{1}(x_i=`\text{M}')$, alors \[ \widehat{\beta}_0=\frac{1}{n}\sum_{i=1}^n \boldsymbol{1}(x_i=`\text{F}') \] while que \[ \widehat{\beta}_1=\frac{1}{n}\sum_{i=1}^n \boldsymbol{1}(x_i=`\text{M}')-\widehat{\beta}_0 \]

reg = lm(weight ~ I((sex=="M")*1), data = Davis)
summary(reg)

## 
## Call:
## lm(formula = weight ~ I((sex == "M") * 1), data = Davis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.898  -6.147  -0.893   5.107  43.102 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          56.8929     0.8889   64.00   <2e-16 ***
## I((sex == "M") * 1)  19.0049     1.3401   14.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.407 on 198 degrees of freedom
## Multiple R-squared:  0.5039, Adjusted R-squared:  0.5014 
## F-statistic: 201.1 on 1 and 198 DF,  p-value: < 2.2e-16

mean(Davis$weight[Davis$sex=="F"])

## [1] 56.89286

mean(Davis$weight[Davis$sex=="M"])-mean(Davis$weight[Davis$sex=="F"])

## [1] 19.00487

que l’on peut obtenir directement en tapant

reg = lm(weight ~ sex, data = Davis)
summary(reg)

## 
## Call:
## lm(formula = weight ~ sex, data = Davis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.898  -6.147  -0.893   5.107  43.102 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.8929     0.8889   64.00   <2e-16 ***
## sexM         19.0049     1.3401   14.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.407 on 198 degrees of freedom
## Multiple R-squared:  0.5039, Adjusted R-squared:  0.5014 
## F-statistic: 201.1 on 1 and 198 DF,  p-value: < 2.2e-16

De maniere parfaitement equivalent, notons que

reg = lm(weight ~ I((sex=="F")*1), data = Davis)
summary(reg)

## 
## Call:
## lm(formula = weight ~ I((sex == "F") * 1), data = Davis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.898  -6.147  -0.893   5.107  43.102 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           75.898      1.003   75.68   <2e-16 ***
## I((sex == "F") * 1)  -19.005      1.340  -14.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.407 on 198 degrees of freedom
## Multiple R-squared:  0.5039, Adjusted R-squared:  0.5014 
## F-statistic: 201.1 on 1 and 198 DF,  p-value: < 2.2e-16

mean(Davis$weight[Davis$sex=="M"])

## [1] 75.89773

mean(Davis$weight[Davis$sex=="F"])-mean(Davis$weight[Davis$sex=="M"])

## [1] -19.00487

On peut obtenir cette derniere regression en changeant la modalité de référence de la variable sex.

Davis$sex = relevel(Davis$sex, "M")
reg = lm(weight ~ sex, data = Davis)
summary(reg)

## 
## Call:
## lm(formula = weight ~ sex, data = Davis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.898  -6.147  -0.893   5.107  43.102 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   75.898      1.003   75.68   <2e-16 ***
## sexF         -19.005      1.340  -14.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.407 on 198 degrees of freedom
## Multiple R-squared:  0.5039, Adjusted R-squared:  0.5014 
## F-statistic: 201.1 on 1 and 198 DF,  p-value: < 2.2e-16

On peut assi tenter la régression suivante (même si la section sur la régresion multiple suivra), en régressant à la fois sur $\boldsymbol{1}(x_i=`\text{F}')$ et $\boldsymbol{1}(x_i=`\text{M}')$

reg = lm(weight ~ I(sex=="F")+I(sex=="M"), data = Davis)
summary(reg)

## 
## Call:
## lm(formula = weight ~ I(sex == "F") + I(sex == "M"), data = Davis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.898  -6.147  -0.893   5.107  43.102 
## 
## Coefficients: (1 not defined because of singularities)
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         75.898      1.003   75.68   <2e-16 ***
## I(sex == "F")TRUE  -19.005      1.340  -14.18   <2e-16 ***
## I(sex == "M")TRUE       NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.407 on 198 degrees of freedom
## Multiple R-squared:  0.5039, Adjusted R-squared:  0.5014 
## F-statistic: 201.1 on 1 and 198 DF,  p-value: < 2.2e-16

On voit qu’il y a un soucis : en fait, ce modèle n’est pas identifiable ! En effet, si le vrai modèle était \[y_i=\beta_0+\beta_1\boldsymbol{1}(x_i=`\text{F}')+\beta_2\boldsymbol{1}(x_i=`\text{M}')\] on peut noter que \[y_i=(\beta_0+1)+(\beta_1-1)\boldsymbol{1}(x_i=`\text{F}')+(\beta_2-1)\boldsymbol{1}(x_i=`\text{M}')\] pourrait convenir : ces deux écritures sont parfaitement équivalentes. Or si deux jeux de paramètres différents dont le même modèle c’est qu’il n’est pas identifiable ! Une solution peut être de ne pas avoir de constante,

reg = lm(weight ~ 0+I((sex=="F")*1)+I((sex=="M")*1), data = Davis)
summary(reg)

## 
## Call:
## lm(formula = weight ~ 0 + I((sex == "F") * 1) + I((sex == "M") * 
##     1), data = Davis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.898  -6.147  -0.893   5.107  43.102 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## I((sex == "F") * 1)  56.8929     0.8889   64.00   <2e-16 ***
## I((sex == "M") * 1)  75.8977     1.0028   75.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.407 on 198 degrees of freedom
## Multiple R-squared:  0.9802, Adjusted R-squared:   0.98 
## F-statistic:  4912 on 2 and 198 DF,  p-value: < 2.2e-16

mean(Davis$weight[Davis$sex=="F"])

## [1] 56.89286

mean(Davis$weight[Davis$sex=="M"])

## [1] 75.89773

mais nous reviendrons plus tard sur ce soucis.

Davis$sex = relevel(Davis$sex, "F")

4.9 Extensions

4.9.1 Régression pondérée

Il est possible d’utiliser des poids dans la regression : chaque observation $i$ se voit attribuer un poids $\omega_i$.

4.9.2 Hétéroscédasticité

5 Modele linéaire multiple

La variable Fire correspond au nombre d’incendies (par 1000 ménages) dans un quartier de Chicago (il y en a 47), en 1975. X1 est la proportion d’habitations construites avant 1940, X2 le nombre de vols commis, et X3 le revenu médian du quartier.

chicago=read.table("http://freakonometrics.free.fr/chicago.txt",header=TRUE,sep=";")
model_c = lm(Fire~X_2+X_3,data=chicago)
y=function(x2,x3) predict(model_c,newdata=data.frame(X_2=x2,X_3=x3))
VX2=seq(0,80,length=26); VX3=seq(5,25,length=26)
VY23=outer(VX2,VX3,y)

model_c = lm(Fire~X_1+X_3,data=chicago)
y=function(x1,x3) predict(model_c,newdata=data.frame(X_1=x1,X_3=x3))
VX1=seq(0,35,length=26); VX3=seq(5,25,length=26)
VY13=outer(VX1,VX3,y)

par(mfrow=c(1,2))
persp(VX1,VX3,VY13,xlab="Old Houses",ylab="Median Income",zlab="Fire",theta=20)
persp(VX2,VX3,VY23,xlab="Crimes",ylab="Median Income",zlab="Fire",theta=20)

library(DALEX)

## Welcome to DALEX (version: 0.4.4).
## Find examples and detailed introduction at: https://pbiecek.github.io/PM_VEE/
## Additional features will be available after installation of: ingredients, iBreakDown.
## Use 'install_dependencies()' to get all suggested dependencies

data(apartments, package = "DALEX")
str(apartments)

## 'data.frame':    1000 obs. of  6 variables:
##  $ m2.price         : num  5897 1818 3643 3517 3013 ...
##  $ construction.year: num  1953 1992 1937 1995 1992 ...
##  $ surface          : num  25 143 56 93 144 61 127 105 145 112 ...
##  $ floor            : int  3 9 1 7 6 6 8 8 6 9 ...
##  $ no.rooms         : num  1 5 2 3 5 2 5 4 6 4 ...
##  $ district         : Factor w/ 10 levels "Bemowo","Bielany",..: 6 2 5 4 3 6 3 7 6 6 ...

5.1 Inférence

Faisons une régression du prix (par mètre carré) m2.price sur l’année de construction construction.year, la surface surface et le nombre de chambres no.rooms. Sous une forme matricielle, \[ y_i = \boldsymbol{x}_i^\top\boldsymbol{\beta}+\varepsilon_i \]

n = nrow(apartments)
X = matrix(c(rep(1,n),apartments$construction.year,apartments$surface,apartments$no.rooms),n,4)
Y = matrix(apartments$m2.price,n,1)

L’estimateur par moindres carrés est $\widehat{\boldsymbol{\beta}}=(\boldsymbol{X}^\top\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{y}$ soit ici

solve(t(X)%*%X)%*%(t(X)%*%Y)

##              [,1]
## [1,] 6295.7094722
## [2,]   -0.8829108
## [3,]   -9.3826890
## [4,]  -80.6138913

reg = lm(m2.price~construction.year+surface+no.rooms, data=apartments)
reg$coefficients

##       (Intercept) construction.year           surface          no.rooms 
##      6295.7094722        -0.8829108        -9.3826890       -80.6138913

La prévision est $\widehat{y}_i={\boldsymbol{x}}_i^\top \widehat{\boldsymbol{\beta}}$ et le résidu est estimé est $\widehat{\varepsilon}_i=y_i- \widehat{y}_i$. A partir de ces résidus, on peut estimer la variance $\sigma^2$, \[ \widehat{\sigma}^2=\frac{1}{n-4}\sum_{i=1}^n\widehat{\varepsilon}_i^2 \]

ypred = X %*% solve(t(X)%*%X)%*%(t(X)%*%Y)
residus = Y-ypred 
sum(residus^2)/(n-4)

## [1] 611175.7

de telle sorte que $\widehat{\sigma}$ vaut

sqrt(sum(residus^2)/(n-4))

## [1] 781.7772

summary(reg)$sigma

## [1] 781.7772

On peut alors en déduire un estimateur de $\text{Var}[\widehat{\boldsymbol{\beta}}]=\sigma^2(\boldsymbol{X}^\top \boldsymbol{X})^{-1}$

(sum(residus^2)/(n-4))*solve(t(X)%*%X)

##             [,1]          [,2]         [,3]         [,4]
## [1,] 3550207.637 -1.807629e+03 152.91300074 -3277.950843
## [2,]   -1807.629  9.214743e-01  -0.07991224     1.170977
## [3,]     152.913 -7.991224e-02   2.56217569   -64.046489
## [4,]   -3277.951  1.170977e+00 -64.04648859  1922.299635

vcov(reg)

##                   (Intercept) construction.year      surface     no.rooms
## (Intercept)       3550207.637     -1.807629e+03 152.91300074 -3277.950843
## construction.year   -1807.629      9.214743e-01  -0.07991224     1.170977
## surface               152.913     -7.991224e-02   2.56217569   -64.046489
## no.rooms            -3277.951      1.170977e+00 -64.04648859  1922.299635

5.2 Résidus

5.3 Moindes Carrés Pondérés

5.4 Variables factorielles et simplification

boxplot(m2.price~district,data=apartments)

P=matrix(NA,nlevels(apartments$district),nlevels(apartments$district))
colnames(P)=rownames(P)=substr(levels(apartments$district),1,4)
plot(1:nlevels(apartments$district),1:nlevels(apartments$district),
     col="white",xlab="",ylab="",axes=FALSE,xlim=c(-.5,10.5),
     ylim=c(0,10.5))
text(1:10,0,substr(levels(apartments$district),1,4))
text(0,1:10,substr(levels(apartments$district),1,4))
for(i in 1:nlevels(apartments$district)){
apartments$district=
  relevel(apartments$district,levels(apartments$district)[i])
p=summary(lm(m2.price~district,data=apartments))$coefficients[-1,4]
names(p)=substr(names(p),9,12)
P[substr(levels(apartments$district),1,4)[i],names(p)]=p
p=P[substr(levels(apartments$district),1,4)[i],]
idx=which(p>.05)
points(((1:10))[idx],rep(i,length(idx)),pch=1,cex=2)
idx=which(p>.1)
points(((1:10))[idx],rep(i,length(idx)),pch=19,cex=2)}

levels(apartments$district) = c("B","A","A","A","C","A","B","B","A","A")
summary(lm(m2.price~district,data=apartments))

## 
## Call:
## lm(formula = m2.price ~ district, data = apartments)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1413.75  -420.38   -10.55   419.96  1564.66 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3869.13      34.47  112.26   <2e-16 ***
## districtA    -855.79      42.21  -20.27   <2e-16 ***
## districtC    1313.62      68.93   19.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 597 on 997 degrees of freedom
## Multiple R-squared:  0.5674, Adjusted R-squared:  0.5665 
## F-statistic: 653.8 on 2 and 997 DF,  p-value: < 2.2e-16

Another way to look at it is to perform Fisher (multiple) test - which is simple an analysis of variance (ANOVA).

apartments = DALEX::apartments

Let us sort the modality according to the average price (since it is the variable of interest)

A = with(data = apartments, aggregate(m2.price,by=list(district),FUN=mean))
A = A[order(A$x),]
L = as.character(A$Group.1)
apartments$district = factor(apartments$district, level=L)
with(data = apartments, boxplot(m2.price ~ district))

reg = lm(m2.price ~ district, data=apartments)
summary(reg)

## 
## Call:
## lm(formula = m2.price ~ district, data = apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1413.8  -426.8   -11.1   422.1  1546.0 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2968.36      58.02  51.160   <2e-16 ***
## districtBielany        17.38      84.16   0.207    0.836    
## districtPraga          26.45      85.12   0.311    0.756    
## districtUrsynow        42.01      82.65   0.508    0.611    
## districtBemowo         80.10      83.71   0.957    0.339    
## districtUrsus         102.01      82.25   1.240    0.215    
## districtZoliborz      829.59      83.94   9.884   <2e-16 ***
## districtMokotow       887.10      81.86  10.837   <2e-16 ***
## districtOchota        987.93      84.16  11.738   <2e-16 ***
## districtSrodmiescie  2214.39      83.28  26.591   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 597.4 on 990 degrees of freedom
## Multiple R-squared:  0.5698, Adjusted R-squared:  0.5659 
## F-statistic: 145.7 on 9 and 990 DF,  p-value: < 2.2e-16

Let us see if the first five $\beta_j$’s can be considered as null,

library(car)

## Loading required package: carData

## 
## Attaching package: 'carData'

## The following object is masked _by_ '.GlobalEnv':
## 
##     Davis

## 
## Attaching package: 'car'

## The following object is masked from 'package:mixtools':
## 
##     ellipse

car::linearHypothesis(reg, c("districtBielany = 0", 
                        "districtPraga = 0",
                        "districtUrsynow = 0",
                        "districtBemowo = 0",
                        "districtUrsus = 0"))

## Linear hypothesis test
## 
## Hypothesis:
## districtBielany = 0
## districtPraga = 0
## districtUrsynow = 0
## districtBemowo = 0
## districtUrsus = 0
## 
## Model 1: restricted model
## Model 2: m2.price ~ district
## 
##   Res.Df       RSS Df Sum of Sq      F Pr(>F)
## 1    995 354051715                           
## 2    990 353269202  5    782513 0.4386 0.8217

The $p$-value is more than $82%$ so we can regroup those six districts all together. What now try with one more ?

library(car)
linearHypothesis(reg, c("districtBielany = 0", 
                        "districtPraga = 0",
                        "districtUrsynow = 0",
                        "districtBemowo = 0",
                        "districtUrsus = 0",
                        "districtZoliborz = 0"))

## Linear hypothesis test
## 
## Hypothesis:
## districtBielany = 0
## districtPraga = 0
## districtUrsynow = 0
## districtBemowo = 0
## districtUrsus = 0
## districtZoliborz = 0
## 
## Model 1: restricted model
## Model 2: m2.price ~ district
## 
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
## 1    996 405455409                                  
## 2    990 353269202  6  52186207 24.374 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

No… that was too ambitious.

levels(apartments$district) = c(rep("A",6),levels(apartments$district)[7:11])
with(data = apartments, boxplot(m2.price ~ district))

apartments$district = relevel(apartments$district,"Zoliborz")
reg = lm(m2.price ~ district, data=apartments)
linearHypothesis(reg, c("districtMokotow = 0",
                        "districtOchota = 0"))

## Linear hypothesis test
## 
## Hypothesis:
## districtMokotow = 0
## districtOchota = 0
## 
## Model 1: restricted model
## Model 2: m2.price ~ district
## 
##   Res.Df       RSS Df Sum of Sq      F Pr(>F)
## 1    997 355292524                           
## 2    995 354051715  2   1240809 1.7435 0.1754

With a $17%$ $p$-value, we can regroup those three categories together.

levels(apartments$district) = c("B","A",rep("B",2),"C")
apartments$district = relevel(apartments$district,"A")
with(data = apartments, boxplot(m2.price ~ district))

6 Selection de modele

6.1 Penalisation

AIC $AIC = 2k - 2\log(\mathcal{L}) = 2k + n\left[\log\left(2\pi \frac{1}{n}\sum_{i=1}^n \widehat{\varepsilon}_i^2 \right) + 1\right]$

BIC $BIC = { k \log(n) -2 \log(\mathcal{L}) } = k \log(n) + n\left[\log\left(2\pi \frac{1}{n}\sum_{i=1}^n \widehat{\varepsilon}_i^2 \right) + 1\right]$

model = lm(m2.price~.,data=apartments)
stats::AIC(model)

## [1] 14123.45

stats::AIC(model, k=log(nrow(apartments)))

## [1] 14162.71

6.2 Leverage et Points Influents

Rappelons que $\mathbf{H}=\mathbf{X}(\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}$ is the projection matrix such that $\widehat{\mathbf{y}}=\mathbf{X}\widehat{\beta}=\mathbf{H}\mathbf{y}$. Aussi on peut noter que \[ H_{i,i}=\frac{\partial \widehat{\mathbf{y}}_i}{\partial {\mathbf{y}}_i}\in[0,1] \] Le fait que $H_{i,i}\in[0,1]$ vient de la symmetrie de $\mathbf{H}$ et du fait que $\mathbf{H}^2=\mathbf{H}$ : \[ H_{i,i}^2+\underbrace{\sum_{j\neq i}H_{i,j}^2}_{\geq 0} = H_{i,i} \] Si $\widehat{\varepsilon}_i=y_i-\widehat{y}_i$, de telle sorte que $\widehat{\mathbf{\varepsilon}}={\mathbf{y}}-\widehat{\mathbf{y}}=[\mathbb{I}-\mathbf{H}]{\mathbf{y}}$, alors $\text{Var}[\widehat{\varepsilon}_i]=(1-H_{i,i})\sigma^2$.

Soit $\widehat{\beta}_{-i}$ construit sur les $n-1$ observations, lorsque la $i$-eme a ete retiree. Si on pose $\widehat{y}_{(i)}$ la prediction pour cette observation, i.e. $\widehat{y}_{(i)}=\mathbf{x}_i^{\mathrm{T}}\widehat{\beta}_{-i}$, et $\widehat{\varepsilon}_{(i)}=y_i-\widehat{y}_{(i)}$, alors on peut montrer que \[ \widehat{\varepsilon}_{(i)}=\frac{\widehat{\varepsilon}_i}{1-H_{i,i}} \]

reg   = lm(weight~height, data=Davis)
barplot(residuals(reg)[1:40])

pred = Vectorize(function(i){
reg_i = lm(weight~height, data=Davis[-i,])
y_i   = predict(reg_i,newdata=Davis[i,])})
barplot((Davis$weight[1:40]-pred(1:40)))

X = cbind(1,Davis$height)
H = X %*% solve(t(X) %*% X) %*% t(X)
barplot((1-diag(H))[1:40],ylim=c(0,1))

Cette relation permet de definir la distance de Cook, \[\displaystyle{C_i=\frac{\widehat{\varepsilon}_i^2}{p\cdot\text{MSE}}\cdot\left(\frac{H_{i,i}}{(1-H_{i,i})^2}\right)}\] avec $\boldsymbol{H}=\boldsymbol{X}(\boldsymbol{X}^{\text{T}}\boldsymbol{X})^{-1}\boldsymbol{X}^{\text{T}}=[H_{i,i}]$

Considerons les donnees d’Anscombe (via Anscombe (1981))

library(datasets)
data(anscombe)
par(mfrow=c(1,2))
plot(anscombe[,c("x3","y3")])
abline(lm(y3~x3,data=anscombe),col="red")
plot(anscombe[,c("x4","y4")])
abline(lm(y4~x4,data=anscombe),col="red")

par(mfrow=c(1,2))
plot(anscombe[,c("x3","y3")],col="white")
text(anscombe[,"x3"],anscombe[,"y3"],1:11)
abline(lm(y3~x3,data=anscombe),col="red")
plot(anscombe[,c("x4","y4")],col="white")
text(anscombe[,"x4"],anscombe[,"y4"],1:11)
abline(lm(y4~x4,data=anscombe),col="red")

Soit $\mathbf{H}$ la matrice chapeau, de diagonale $H_{i,i}$, \[ H_{i,i}=\frac{\text{cov}[\widehat{y}_i,y_i]}{\text{Var}[y_i]}=\frac{1}{n}+\frac{1}{n-1}\left(\frac{x_i-\overline{x}}{s_x}\right)^2 \]

par(mfrow=c(1,2))
model_3 = lm(y3~x3,data=anscombe)
model_4 = lm(y4~x4,data=anscombe)
H_3 = lm.influence(model_3)$hat
H_4 = lm.influence(model_4)$hat
plot(1:11,H_3,type="h",ylim=0:1,xlab="",ylab="")
plot(1:11,H_4,type="h",ylim=0:1,xlab="",ylab="")

For the second one, $H_{i,i}=1/10$ for all $i$’s but not the $8$th, and $H_{8,8}=1$. Observe that the trace of $\mathbf{X}$ is 2

sum(H_3)

## [1] 2

sum(H_4)

## [1] 2

Les residus studentizes sont $\displaystyle{u_i=\frac{\widehat{\varepsilon}_i}{\widehat{\sigma}\sqrt{1-H_{i,i}}}}$

6.3 Normaliteé des residus

6.4 Ridge & Lasso

6.4.1 Shrinkage

Gauss Markov theorem : in a linear regression model in which the errors are uncorrelated, have equal variances and expectation value of zero, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares (OLS) estimator - provided it exists

What about linear biased estimators ?

Recall that the mean squared error (MSE) of an estimator ${\displaystyle {\hat {\theta }}}$ (with respect to an unknown parameter ${\displaystyle \theta }$ is defined as \[ {\displaystyle \operatorname {MSE} ({\hat {\theta }})=\operatorname {E}\left[({\hat {\theta }}-\theta )^{2}\right]} \] or \[ {\displaystyle \operatorname {MSE} ({\hat {\theta }})=\operatorname {Var}({\hat {\theta }})+\operatorname {Bias} ({\hat {\theta }},\theta )^{2}} \]

Consider some $\mathcal{N}(\theta,1)$ sample, $\{y_1,\cdots,y_n\}$

set.seed(123)
n = 400
x = rnorm(n,1,1)
mse = function(a){
  m = rep(NA,1e4)
  for(s in 1:1e4){
  xs=sample(x,size=n,replace=TRUE)
  m[s] = a*mean(xs)}
  bias1 = mean(m-mean(x))
  bias2 = (1-a)*mean(x)
  v=a^2*var(x)
  return(c(bias1,bias2,v))
}
a=seq(.01,1,by=.01)
m=Vectorize(mse)(a)
plot(a,m[2,]^2+m[3,],ylim=c(0,1.2),type="l",ylab="")
lines(a,m[1,]^2,col="red")
lines(a,m[2,]^2,col="red",lty=2,lwd=2)
lines(a,m[3,],col="blue")

Using $\widehat{\theta}=\alpha\overline{y}$ can lead to an estimator with a lower mean squared error.

a[which.min(m[2,]^2+m[3,])]

## [1] 0.52

From a theoretical perspective, $\widehat{\theta}=\alpha\overline{y}$ satisfies $\mathbb{E}[\widehat{\theta}]=\alpha\cdot \mathbb{E}[\overline{y}]=\alpha\cdot \theta$, which is a bias estimator if $\alpha\neq 1$. And its variance is \[ \operatorname{var}[\widehat{\theta}]=\alpha^2\cdot \operatorname{var}[\overline{y}]=\frac{\alpha^2\cdot\sigma^2}{n} \] Thus, \[ \operatorname{mse}[\widehat{\theta}]=\mathbb{E}[(\widehat{\theta}-\theta)^2]=\big((\alpha-1)\cdot\theta\big)^2+\frac{\alpha^2\cdot\sigma^2}{n} \] and since \[ \frac{\partial\operatorname{mse}[\widehat{\theta}]}{\partial \alpha}=2\big((\alpha-1)\cdot\theta^2\big)+\frac{\alpha\cdot\sigma^2}{n} \] $\operatorname{mse}[\widehat{\theta}]$ is minimal when its derivative is null (first order condition), i.e. \[ \alpha^2=\frac{n\theta^2}{n\theta^2+\sigma^2}<1. \]

6.4.2 Ridge

\[ \widehat{\boldsymbol{\beta}} = \text{argmin}\left\lbrace \sum_{i=1}^n \left[y_i- \left(\beta_0+\sum_{j=1}^k \beta_j x_{j,i} \right)\right] + \color{red}{\lambda} \sum_{j=1}^k \beta_j ^2\right\rbrace \] with an explicit solution $\widehat{\boldsymbol{\beta}}=(\boldsymbol{X}^{\text{T}}\boldsymbol{X}-\color{red}{\lambda} \mathbb{I})^{-1} \boldsymbol{X}^{\text{T}}\boldsymbol{Y}$.

library(MASS,verbose=FALSE,quietly=TRUE,warn.conflicts=FALSE)
model_ridge <- lm.ridge(Fire ~ ., data=chicago, lambda=1)

see more generally Tikhonov regularization

6.4.3 LASSO

LASSO (least absolute shrinkage and selection operator) \[ \widehat{\boldsymbol{\beta}} = \text{argmin}\left\lbrace \sum_{i=1}^n \left[y_i- \left(\beta_0+\sum_{j=1}^k \beta_j x_{j,i} \right)\right] + \color{blue}{\lambda} \sum_{j=1}^k \vert\beta_j\vert\right\rbrace \]

library(glmnet,verbose=FALSE,quietly=TRUE,warn.conflicts=FALSE)
fit = glmnet(x = as.matrix(chicago[,2:4]), y = chicago[,1], family = "gaussian", alpha = 1)

plot(fit, xvar="lambda", label = TRUE )

Les regressions Ridge et Lasso donnent des resultats simples dans le cas ou $\mathbf{X}^\top\mathbf{X}=\mathbb{I}$ (variables orthonormees), \[ \widehat{\beta}_{\lambda,j}^{ridge} = \frac{\widehat{\beta}_{j}^{ols}}{1+\lambda} ~~\text{ and }~~ \widehat{\beta}_{\lambda,j}^{lasso} = \text{sign}(\widehat{\beta}_{j}^{ols})\cdot (|\widehat{\beta}_{j}^{ols}|-\lambda)_+ \]

library(factoextra)

## Loading required package: ggplot2

## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ

library("RColorBrewer")
colrs=brewer.pal(7,"Set1")
reg = lm(m2.price~.,data=apartments)
X = model.matrix(reg)
X = X[,-1]
for(j in 1:ncol(X)) X[,j] = (X[,j]-mean(X[,j]))/sd(X[,j])
pca = princomp(X)
pca_X = get_pca_ind(pca)$coord
y = apartments$m2.price
y = (y-mean(y))/sd(y)
library(glmnet)
glm_lasso = glmnet(pca_X, y, alpha=1)
par(mfrow=c(1,2))
plot(glm_lasso,xvar="lambda",col=colrs)
plot(glm_lasso,col=colrs)

Aussi, on a le biais et la variance de $\widehat{\beta}_{\lambda}^{ridge}$, \[ \text{bias}[ \widehat{\beta}_{\lambda}^{ridge} ]=\frac{\lambda}{1+\lambda}\widehat{\beta}^{ols} ~~\text{ and }~~ \text{Var}[ \widehat{\beta}_{\lambda}^{ridge} ]=\frac{1}{(1+\lambda)^2}\sigma^2\mathbb{I}= \frac{1}{(1+\lambda)^2}\text{Var}[\widehat{\beta}^{ols}] \] Notons que $\text{Var}[ \widehat{\beta}_{\lambda}^{ridge}]<\text{Var}[ \widehat{\beta}^{ols}]$, et \[ \text{mse}[\widehat{\beta}_{\lambda}^{ridge}]= \frac{p\sigma^2}{(1+\lambda)^2}+ \frac{\lambda^2}{(1+\lambda)^2}\beta^\top\beta \] qui sera alors optimal pour $\lambda^\star=p\sigma^2/\beta^\top\beta$.

library(glmnet)
glm_lasso = glmnet(pca_X, y, alpha=0)
par(mfrow=c(1,2))
plot(glm_lasso,xvar="lambda",col=colrs)
plot(glm_lasso,col=colrs)

6.4.4 Elastic Net

library(glmnet)
glm_13 = glmnet(X, y, alpha=.333333)
par(mfrow=c(1,2))
plot(glm_13,xvar="lambda",col=colrs)
plot(glm_13,col=colrs)

glm_23 = glmnet(X, y, alpha=.66667)
par(mfrow=c(1,2))
plot(glm_23,xvar="lambda",col=colrs)
plot(glm_23,col=colrs)

7 Non linéarités

par(mfrow=1:2)
plot(apartments$construction.year,apartments$m2.price,
     xlab="Construction year",ylab="Price (EUR/m2)")
plot(Davis$height,Davis$weight,xlab="Height (cm)",ylab="Weight (kg)")

7.1 Constant par morceaux

mod_ap_cst = lm(m2.price~cut(construction.year,c(1910,1939,1990,2020)),data=apartments)
mod_std_cst = lm(weight~cut(height,c(140,165,175,210)),data=Davis)
x_ap = seq(min(apartments$construction.year)-2,max(apartments$construction.year)+2,length=251)
x_std = seq(min(Davis$height)-2,max(Davis$height)+2,length=251)

par(mfrow=1:2)
plot(apartments$construction.year,apartments$m2.price,
     xlab="Construction year",ylab="Price (EUR/m2)")
lines(x_ap,predict(mod_ap_cst,newdata=data.frame(construction.year = x_ap)),lwd=3,col="red",type="s")
plot(Davis$height,Davis$weight,xlab="Height (cm)",ylab="Weight (kg)")
lines(x_std,predict(mod_std_cst,newdata=data.frame(height = x_std)),lwd=3,col="red",type="s")

7.2 Lineaire par morceaux

library(strucchange)

## Loading required package: sandwich

par(mfrow=c(1,2))
pic_ani = function(k){
  plot(Davis$height,Davis$weight,ylab="Weight (kg)",xlab="Height (cm)")
abline(v=c(min(Davis$height),k,max(Davis$height)),lty=2,col="light blue")
reg_0 = lm(weight~height,data=Davis)
reg_1 = lm(weight~height,data=Davis[Davis$height<k,])
reg_2 = lm(weight~height,data=Davis[Davis$height>k,])
x_0=seq(min(Davis$height),max(Davis$height),length=251)
x_1=seq(min(Davis$height),k,length=251)
x_2=seq(k,max(Davis$height),length=251)
y_0=predict(reg_0, newdata=data.frame(height=x_0))
y_1=predict(reg_1, newdata=data.frame(height=x_1))
y_2=predict(reg_2, newdata=data.frame(height=x_2))
lines(x_0,y_0,lwd=2,col="blue")
lines(x_1,y_1,lwd=2,col="red")
lines(x_2,y_2,lwd=2,col="red")
plot(Fstats(weight ~ height,data=Davis,from=7/50))
}
for (k in 162:178) {pic_ani(k)}

On considere ici deux modeles : le premier est le modele linear global $y_i=\mathbf{x}_i^\top\beta+\varepsilon_i$, et le second est base sur deux regressions, sur deux groupes ($j\in\{1,2\}$) \[ y_i=\mathbf{x}_i^\top\beta^{(j)}+\varepsilon_i^{(j)},~i\in I_j \] On pose alors \[ SCR_0 = \sum_{i=1}^n \widehat{\varepsilon}_i^2 ~\text{ et }~ SCR_j = \sum_{i:i\in I_j} \widehat{\varepsilon}_i^{(j)2} \] La statistique de test est \[ {\displaystyle {F=\frac {(SCR_{0}-(SCR_{1}+SCR_{2}))/k}{(SCR_{1}+SCR_{2})/(n_{1}+n_{2}-2k)}}.} \] qui doit suivre une loi de Fisher $\mathcal{F}(k,n_{1}+n_{2}-2k)$, cf Chow Test

\[ W_t=\frac{1}{\widehat{\sigma}\sqrt{n}}\sum_{i=1}^{ \lfloor nt \rfloor} \widehat{\varepsilon}_i \]

cusum <- efp(weight ~ height, type = "OLS-CUSUM",data=Davis[order(Davis$height),])
par(mfrow=c(1,2))
plot(cusum,ylim=c(-2,2))
plot(cusum, alpha = 0.05, alt.boundary = TRUE,ylim=c(-2,2))

see CUSUM test

7.3 Regression polynomiale

mod_ap_2 = lm(m2.price~poly(construction.year,2),data=apartments)
mod_ap_4 = lm(m2.price~poly(construction.year,4),data=apartments)
mod_std_2 = lm(weight~poly(height,2),data=Davis)

par(mfrow=1:2)
plot(apartments$construction.year,apartments$m2.price,
     xlab="Construction year",ylab="Price (EUR/m2)")
lines(x_ap,predict(mod_ap_2,newdata=data.frame(construction.year = x_ap)),lwd=3,col="red")
lines(x_ap,predict(mod_ap_4,newdata=data.frame(construction.year = x_ap)),lwd=3,col="blue")
plot(Davis$height,Davis$weight,xlab="Height (cm)",ylab="Weight (kg)")
lines(x_std,predict(mod_std_2,newdata=data.frame(height = x_std)),lwd=3,col="red")

8 Autres régressions

Pour illustrer les algorithmes de calculs de la médiane et des quantiles, consider the following data

n = 101 
set.seed(1)
y = rlnorm(n)

8.1 Régression médiane

Tout d’abord, calculons la médiane du vecteur y

median(y)

## [1] 1.077415

Rappelons que la médiane est solution de \[ \min_\mu \left\lbrace\sum_{i=1}^n|y_i-\mu|\right\rbrace \] et de manière assez naturelle, on va chercher à minimiser la somme des valeurs absolues des erreurs pour obtenir la régression médiane, \[ \min_{\boldsymbol{\beta}} \left\lbrace\sum_{i=1}^n|y_i-\boldsymbol{x}_i^\top\boldsymbol{\beta}|\right\rbrace \]

8.2 Aspects computationnels

To compute the median, one can use

optim(1,function(mu) sum(abs(y-mu)))$par

## [1] 1.077414

which is working, but it is not very clever since $x\mapsto|x|$ is not differentiable. One can use linear programming, since the previous problem is equivalent to \[\min_{\mu,\mathbf{a},\mathbf{b}}\left\lbrace\sum_{i=1}^na_i+b_i\right\rbrace\] with $a_i,b_i\geq 0$ and $y_i-\mu=a_i-b_i$, $\forall i=1,\cdots,n$. Thus, consider

library(lpSolve)
A1 = cbind(diag(2*n),0) 
A2 = cbind(diag(n), -diag(n), 1)
r = lp("min", c(rep(1,2*n),0),
rbind(A1, A2),c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y))
tail(r$solution,1)

## [1] 1.077415

8.3 Régression quantile

8.4 Aspects computationnels

8.4.1 Calcul des quantiles

The quantile of order $\tau$ (e.g. 30%) is

tau = .3
quantile(y,tau)

##       30% 
## 0.6741586

One can use, as before, the fact the the quantile of order $\tau$ is solution of \[\min_{\mu,\mathbf{a},\mathbf{b}}\left\lbrace\sum_{i=1}^n\tau a_i+(1-\tau)b_i\right\rbrace\] with $a_i,b_i\geq 0$ and $y_i-\mu=a_i-b_i$, $\forall i=1,\cdots,n$. Thus, consider

A1 = cbind(diag(2*n),0) 
A2 = cbind(diag(n), -diag(n), 1)
r = lp("min", c(rep(tau,n),rep(1-tau,n),0),
rbind(A1, A2),c(rep(">=", 2*n), rep("=", n)), c(rep(0,2*n), y))
tail(r$solution,1)

## [1] 0.6741586

8.4.2 Régression quantile simple

base = cars
names(base) = c("x","y")
plot(base)
library(quantreg)

## Loading required package: SparseM

## 
## Attaching package: 'SparseM'

## The following object is masked from 'package:base':
## 
##     backsolve

reg = rq(y~x, tau=tau, data=base)
summary(reg)

## 
## Call: rq(formula = y ~ x, tau = tau, data = base)
## 
## tau: [1] 0.3
## 
## Coefficients:
##             coefficients lower bd  upper bd 
## (Intercept) -20.50000    -29.08971  -9.01685
## x             3.50000      2.74138   4.09524

abline(reg)

require(lpSolve) 
tau = .3
n=nrow(base)
X = cbind( 1, base$x)
y = base$y
r = lp("min",
c(rep(tau, n), rep(1 - tau, n), rep(0, 2 * ncol(X))),
cbind(diag(n), -diag(n), X, -X),
rep("=", n),
y)
beta = tail(r$solution, 2 * ncol(X))
beta = beta[1:ncol(X)] - beta[ncol(X) + 1:ncol(X)]
beta

## [1] -20.5   3.5

that we can actually visualize

plot(base)
abline(a=beta[1],b=beta[2])

One can also consider some Iteratively Reweighted Least Squares (IRLS) procedure

plot(base)
eps = residuals(lm(y~x, data=base))
for(s in 1:25){
  reg = lm(y~x, data=base, weights=(tau*(eps>=0)+(1-tau)*(eps<0))/abs(eps))
  eps = residuals(reg)
  abline(reg$coefficients,col=rgb(1,0,0,.2))
}

reg$coefficients

## (Intercept)           x 
##  -20.104679    3.479194

8.5 Régression PLS (et ACP)

8.6 Least trimmed squares (LTS)

9 Regression logistique

9.1 Régression sur des variables continues

myocarde = read.table("http://freakonometrics.free.fr/myocarde.csv",head=TRUE, sep=";")
str(myocarde)

## 'data.frame':    71 obs. of  8 variables:
##  $ FRCAR: int  90 90 120 82 80 80 94 80 78 100 ...
##  $ INCAR: num  1.71 1.68 1.4 1.79 1.58 1.13 2.04 1.19 2.16 2.28 ...
##  $ INSYS: num  19 18.7 11.7 21.8 19.7 14.1 21.7 14.9 27.7 22.8 ...
##  $ PRDIA: int  16 24 23 14 21 18 23 16 15 16 ...
##  $ PAPUL: num  19.5 31 29 17.5 28 23.5 27 21 20.5 23 ...
##  $ PVENT: num  16 14 8 10 18.5 9 10 16.5 11.5 4 ...
##  $ REPUL: int  912 1476 1657 782 1418 1664 1059 1412 759 807 ...
##  $ PRONO: Factor w/ 2 levels "DECES","SURVIE": 2 1 1 2 1 1 2 2 2 2 ...

Représenter $(x_{j,i},y_i)$ n’aurait pas grand intérêt, car $y\in\{0,1\}$. Par contre, on peut tracer des box plots

for(i in 1:7) boxplot(myocarde[,i]~myocarde$PRONO,horizontal=TRUE,xlab=names(myocarde)[i])

On suppose ici que \[ \log\frac{\mathbb{P}[Y=1|\mathbf{X}=\mathbf{x}]}{\mathbb{P}[Y=0|\mathbf{X}=\mathbf{x}]}=\mathbf{x}^\top\mathbf{\beta} \] soit \[ \mathbb{P}[Y=1|\mathbf{X}=\mathbf{x}]=\mathbb{E}[Y|\mathbf{X}=\mathbf{x}]=\frac{\exp[\mathbf{x}^\top\mathbf{\beta}]}{1+\exp[\mathbf{x}^\top\mathbf{\beta}]}=\frac{1}{1+\exp[-\mathbf{x}^\top\mathbf{\beta}]} \]

summary(glm(PRONO~.,data=myocarde,family=binomial(link = "logit")))

## 
## Call:
## glm(formula = PRONO ~ ., family = binomial(link = "logit"), data = myocarde)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.24911  -0.41613   0.05261   0.39611   2.25781  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.187642  11.895227  -0.856    0.392
## FRCAR         0.138178   0.114112   1.211    0.226
## INCAR        -5.862429   6.748785  -0.869    0.385
## INSYS         0.717084   0.561445   1.277    0.202
## PRDIA        -0.073668   0.291636  -0.253    0.801
## PAPUL         0.016757   0.341942   0.049    0.961
## PVENT        -0.106776   0.110550  -0.966    0.334
## REPUL        -0.003154   0.004891  -0.645    0.519
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 96.033  on 70  degrees of freedom
## Residual deviance: 41.043  on 63  degrees of freedom
## AIC: 57.043
## 
## Number of Fisher Scoring iterations: 7

L’estimation des paramètres se fait par maximum de vraimsemblance : ne pas utiliser une fonction d’optimisation (directe)

y = (myocarde$PRONO=="DECES")
X = cbind(1,as.matrix(myocarde[,1:7]))
negLogLik = function(beta){
 -sum(-y*log(1 + exp(-(X%*%beta))) - (1-y)*log(1 + exp(X%*%beta)))
}
beta_init = lm(PRONO=="DECES"~.,data=myocarde)$coefficients
logistic_opt = optim(par = beta_init, negLogLik, hessian=TRUE, method = "BFGS", control=list(abstol=1e-9))

On peut toutefois utiliser l’algorithme de Fisher

Y=(myocarde$PRONO=="DECES")
X=cbind(1,as.matrix(myocarde[,1:7]))
colnames(X)=c("Inter",names(myocarde[,1:7]))
 beta=as.matrix(lm(Y~0+X)$coefficients,ncol=1)
 for(s in 1:9){
   pi=exp(X%*%beta[,s])/(1+exp(X%*%beta[,s]))
   gradient=t(X)%*%(Y-pi)
   omega=matrix(0,nrow(X),nrow(X));diag(omega)=(pi*(1-pi))
   Hessian=-t(X)%*%omega%*%X
   beta=cbind(beta,beta[,s]-solve(Hessian)%*%gradient)}
beta[,8:10]

##                [,1]         [,2]         [,3]
## XInter 10.187641083 10.187641696 10.187641696
## XFRCAR -0.138178115 -0.138178119 -0.138178119
## XINCAR  5.862428937  5.862429037  5.862429037
## XINSYS -0.717083993 -0.717084018 -0.717084018
## XPRDIA  0.073668174  0.073668171  0.073668171
## XPAPUL -0.016756516 -0.016756506 -0.016756506
## XPVENT  0.106776011  0.106776012  0.106776012
## XREPUL  0.003154187  0.003154187  0.003154187

On obtient ainsi une estimation de la variance de nos estimateurs (comme on cherche l’estimateur du maximum de vraisemblance, on sait que la variance asymptotique est l’inverse de l’information de Fisher, qui est la Hessienne de la log-vraisemblance, au signe près)

Hessian

##              Inter        FRCAR        INCAR        INSYS        PRDIA
## Inter    -6.399291    -586.5942    -11.23996    -124.1888    -123.3891
## FRCAR  -586.594187  -55245.7868  -1043.69898  -11235.5831  -11510.6073
## INCAR   -11.239961   -1043.6990    -21.00480    -228.1544    -220.4816
## INSYS  -124.188750  -11235.5831   -228.15438   -2541.5217   -2395.8946
## PRDIA  -123.389104  -11510.6073   -220.48165   -2395.8946   -2505.3755
## PAPUL  -163.361825  -15255.4372   -293.75443   -3187.2732   -3309.2315
## PVENT   -68.681260   -6186.2374   -118.50078   -1334.2920   -1315.9193
## REPUL -7733.626413 -716116.2773 -13069.24214 -143720.4273 -153850.5927
##              PAPUL        PVENT         REPUL
## Inter    -163.3618    -68.68126     -7733.626
## FRCAR  -15255.4372  -6186.23737   -716116.277
## INCAR    -293.7544   -118.50078    -13069.242
## INSYS   -3187.2732  -1334.29200   -143720.427
## PRDIA   -3309.2315  -1315.91933   -153850.593
## PAPUL   -4402.7086  -1712.67324   -203692.896
## PVENT   -1712.6732   -854.38907    -82307.441
## REPUL -203692.8963 -82307.44065 -10027346.615

-solve(Hessian)

##              Inter         FRCAR       INCAR         INSYS         PRDIA
## Inter 141.50029499 -1.1120707443 42.29308470 -5.792019e+00 -2.585973e-01
## FRCAR  -1.11207074  0.0130218132 -0.68328148  6.113030e-02 -2.646821e-03
## INCAR  42.29308470 -0.6832814845 45.54684575 -3.245723e+00  3.088080e-01
## INSYS  -5.79201872  0.0611303045 -3.24572286  3.152285e-01  5.050851e-05
## PRDIA  -0.25859731 -0.0026468213  0.30880800  5.050851e-05  8.505231e-02
## PAPUL   1.15456409  0.0044720856 -0.97012941  2.087264e-03 -7.653170e-02
## PVENT  -0.03304634 -0.0002828814  0.02340121 -6.141984e-03 -1.516899e-02
## REPUL  -0.02103416 -0.0001058035  0.01811208 -1.791229e-04  3.594489e-04
##              PAPUL         PVENT         REPUL
## Inter  1.154564088 -3.304634e-02 -2.103416e-02
## FRCAR  0.004472086 -2.828814e-04 -1.058035e-04
## INCAR -0.970129411  2.340121e-02  1.811208e-02
## INSYS  0.002087264 -6.141984e-03 -1.791229e-04
## PRDIA -0.076531702 -1.516899e-02  3.594489e-04
## PAPUL  0.116926256  1.390261e-02 -1.290423e-03
## PVENT  0.013902608  1.222148e-02 -4.677069e-05
## REPUL -0.001290423 -4.677069e-05  2.392144e-05

Les termes diagonaux sont positifs, et leur racine-carrée vaut

diag(-solve(Hessian))

##        Inter        FRCAR        INCAR        INSYS        PRDIA        PAPUL 
## 1.415003e+02 1.302181e-02 4.554685e+01 3.152285e-01 8.505231e-02 1.169263e-01 
##        PVENT        REPUL 
## 1.222148e-02 2.392144e-05

On peut aussi utiliser des moindres carrés pondérés itérés

df = myocarde
beta_init = lm(PRONO=="DECES"~.,data=df)$coefficients
X = cbind(1,as.matrix(myocarde[,1:7]))
beta = beta_init
for(s in 1:1000){
p = exp(X %*% beta) / (1+exp(X %*% beta))
omega = diag(nrow(df))
diag(omega) = (p*(1-p))
df$Z = X %*% beta + solve(omega) %*% ((df$PRONO=="DECES") - p)
beta = lm(Z~.,data=df[,-8], weights=diag(omega))$coefficients
}
beta

##  (Intercept)        FRCAR        INCAR        INSYS        PRDIA        PAPUL 
## 10.187641696 -0.138178119  5.862429037 -0.717084018  0.073668171 -0.016756506 
##        PVENT        REPUL 
##  0.106776012  0.003154187

On peut le faire sur un petit exemple, pour illustrer

x1 = c(.4,.55,.65,.9,.1,.35,.5,.15,.2,.85)
x2 = c(.85,.95,.8,.87,.5,.55,.5,.2,.1,.3)
y = c(1,1,1,1,1,0,0,1,0,0)
df = data.frame(x1=x1,x2=x2,y=as.factor(y))
plot(x1,x2,pch=c(1,19)[1+y])

On obtient comme classification

clr10 = rev(heat.colors(10))
reg = glm(y~x1+x2,data=df,family=binomial(link = "logit"))
u = seq(0,1,length=101)
p = function(x,y) predict.glm(reg,newdata=data.frame(x1=x,x2=y),type="response")
v = outer(u,u,p)
image(u,u,v,xlab="Variable 1",ylab="Variable 2",col=clr10,breaks=(0:10)/10)
points(x1,x2,pch=19,cex=1.5,col="white")
points(x1,x2,pch=c(1,19)[1+y],cex=1.5)
contour(u,u,v,levels = .5,add=TRUE)

On note que les courbes d’iso-probabilités sont des droites (ce qui permet de dire que la régression logistique est un modèle linéaire)

Petite remarque, compte tenu du lien entre $\mathbf{1}(y=1)$ et $\mathbf{1}(y=0)$ (la somme des deux valant $1$), on en déduit une relation simple sur les sorties des régressions

summary(glm((PRONO=="DECES")~.,data=myocarde,family=binomial(link = "logit")))

## 
## Call:
## glm(formula = (PRONO == "DECES") ~ ., family = binomial(link = "logit"), 
##     data = myocarde)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.25781  -0.39611  -0.05261   0.41613   2.24911  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) 10.187642  11.895227   0.856    0.392
## FRCAR       -0.138178   0.114112  -1.211    0.226
## INCAR        5.862429   6.748785   0.869    0.385
## INSYS       -0.717084   0.561445  -1.277    0.202
## PRDIA        0.073668   0.291636   0.253    0.801
## PAPUL       -0.016757   0.341942  -0.049    0.961
## PVENT        0.106776   0.110550   0.966    0.334
## REPUL        0.003154   0.004891   0.645    0.519
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 96.033  on 70  degrees of freedom
## Residual deviance: 41.043  on 63  degrees of freedom
## AIC: 57.043
## 
## Number of Fisher Scoring iterations: 7

summary(glm((PRONO=="SURVIE")~.,data=myocarde,family=binomial(link = "logit")))

## 
## Call:
## glm(formula = (PRONO == "SURVIE") ~ ., family = binomial(link = "logit"), 
##     data = myocarde)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.24911  -0.41613   0.05261   0.39611   2.25781  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.187642  11.895227  -0.856    0.392
## FRCAR         0.138178   0.114112   1.211    0.226
## INCAR        -5.862429   6.748785  -0.869    0.385
## INSYS         0.717084   0.561445   1.277    0.202
## PRDIA        -0.073668   0.291636  -0.253    0.801
## PAPUL         0.016757   0.341942   0.049    0.961
## PVENT        -0.106776   0.110550  -0.966    0.334
## REPUL        -0.003154   0.004891  -0.645    0.519
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 96.033  on 70  degrees of freedom
## Residual deviance: 41.043  on 63  degrees of freedom
## AIC: 57.043
## 
## Number of Fisher Scoring iterations: 7

9.2 Régression sur des covariables catégorielles

On obtient comme classification

admin = read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
str(admin)

## 'data.frame':    400 obs. of  4 variables:
##  $ admit: int  0 1 1 1 0 1 1 0 1 0 ...
##  $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
##  $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
##  $ rank : int  3 3 1 4 4 2 1 2 3 2 ...

for(i in 2:4) boxplot(admin[,i]~admin$admit,horizontal=TRUE)

On peut faire un tableau de contingence

xtabs(~admit + rank, data = admin)

##      rank
## admit  1  2  3  4
##     0 28 97 93 55
##     1 33 54 28 12

9.2.1 Commençons par le cas à deux modalités

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following object is masked from 'package:DALEX':
## 
##     explain

## The following object is masked from 'package:MASS':
## 
##     select

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

admin$rank = as.factor(admin$rank)
admin$rank = recode_factor(admin$rank,`1`="1+2",`2`="1+2",`3`="3+4",`4`="3+4")
str(admin)

## 'data.frame':    400 obs. of  4 variables:
##  $ admit: int  0 1 1 1 0 1 1 0 1 0 ...
##  $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
##  $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
##  $ rank : Factor w/ 2 levels "1+2","3+4": 2 2 1 2 2 1 1 1 2 1 ...

xtabs(~admit + rank, data = admin)

##      rank
## admit 1+2 3+4
##     0 125 148
##     1  87  40

chisq.test(xtabs(~admit + rank, data = admin))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  xtabs(~admit + rank, data = admin)
## X-squared = 17.056, df = 1, p-value = 3.63e-05

t.test(admin$admit[admin$rank=="1+2"],admin$admit[admin$rank=="3+4"])

## 
##  Welch Two Sample t-test
## 
## data:  admin$admit[admin$rank == "1+2"] and admin$admit[admin$rank == "3+4"]
## t = 4.3725, df = 396.44, p-value = 1.571e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1087621 0.2864607
## sample estimates:
## mean of x mean of y 
## 0.4103774 0.2127660

u=seq(0,.6,length=601)
p12=mean(admin$admit[admin$rank=="1+2"])
n12=sum(admin$rank=="1+2")
p34=mean(admin$admit[admin$rank=="3+4"])
n34=sum(admin$rank=="3+4")
plot(u,dnorm(u,p12,sqrt(p12*(1-p12)/n12)),col="blue",lwd=2,type="l",ylim=c(0,15))
lines(u,dnorm(u,p34,sqrt(p34*(1-p34)/n34)),col="red",lwd=2,type="l")

On rejette ici l’hypothèse nulle d’indépendance entre les deux variables.

On peut aussi tenter une régression logistique

reg = glm(admit ~ rank, family = "binomial", data = admin)
summary(reg)

## 
## Call:
## glm(formula = admit ~ rank, family = "binomial", data = admin)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0279  -1.0279  -0.6917   1.3347   1.7593  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.3624     0.1396  -2.596  0.00944 ** 
## rank3+4      -0.9459     0.2264  -4.178 2.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 481.66  on 398  degrees of freedom
## AIC: 485.66
## 
## Number of Fisher Scoring iterations: 4

## CIs using profiled log-likelihood
confint(reg)

## Waiting for profiling to be done...

##                  2.5 %      97.5 %
## (Intercept) -0.6387379 -0.09056309
## rank3+4     -1.3968586 -0.50785276

## CIs using standard errors
confint.default(reg)

##                  2.5 %      97.5 %
## (Intercept) -0.6360594 -0.08875188
## rank3+4     -1.3896372 -0.50221724

library(aod)
wald.test(b = coef(reg), Sigma = vcov(reg), Terms = 2)

## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 17.5, df = 1, P(> X2) = 2.9e-05

## odds ratios only
exp(coef(reg))

## (Intercept)     rank3+4 
##   0.6960000   0.3883194

exp(cbind(OR = coef(reg), confint(reg)))

## Waiting for profiling to be done...

##                    OR     2.5 %    97.5 %
## (Intercept) 0.6960000 0.5279583 0.9134167
## rank3+4     0.3883194 0.2473729 0.6017864

9.2.2 avec plusieurs modalités

admin = read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
xtabs(~admit + rank, data = admin)

##      rank
## admit  1  2  3  4
##     0 28 97 93 55
##     1 33 54 28 12

chisq.test(xtabs(~admit + rank, data = admin))

## 
##  Pearson's Chi-squared test
## 
## data:  xtabs(~admit + rank, data = admin)
## X-squared = 25.242, df = 3, p-value = 1.374e-05

reg = glm(admit ~ as.factor(rank), family = "binomial", data = admin)
summary(reg)

## 
## Call:
## glm(formula = admit ~ as.factor(rank), family = "binomial", data = admin)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2479  -0.9408  -0.7255   1.1085   1.8546  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.1643     0.2569   0.639   0.5225    
## as.factor(rank)2  -0.7500     0.3080  -2.435   0.0149 *  
## as.factor(rank)3  -1.3647     0.3354  -4.069 4.72e-05 ***
## as.factor(rank)4  -1.6867     0.4093  -4.121 3.77e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 474.97  on 396  degrees of freedom
## AIC: 482.97
## 
## Number of Fisher Scoring iterations: 4

## CIs using profiled log-likelihood
confint(reg)

## Waiting for profiling to be done...

##                       2.5 %     97.5 %
## (Intercept)      -0.3384783  0.6740676
## as.factor(rank)2 -1.3593568 -0.1487441
## as.factor(rank)3 -2.0324691 -0.7142109
## as.factor(rank)4 -2.5202351 -0.9071958

## CIs using standard errors
confint.default(reg)

##                       2.5 %     97.5 %
## (Intercept)      -0.3392869  0.6678930
## as.factor(rank)2 -1.3536388 -0.1464212
## as.factor(rank)3 -2.0220438 -0.7073522
## as.factor(rank)4 -2.4889571 -0.8845020

library(aod)
wald.test(b = coef(reg), Sigma = vcov(reg), Terms = 2)

## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 5.9, df = 1, P(> X2) = 0.015

## odds ratios only
exp(coef(reg))

##      (Intercept) as.factor(rank)2 as.factor(rank)3 as.factor(rank)4 
##        1.1785714        0.4723524        0.2554578        0.1851240

exp(cbind(OR = coef(reg), confint(reg)))

## Waiting for profiling to be done...

##                         OR      2.5 %    97.5 %
## (Intercept)      1.1785714 0.71285426 1.9622026
## as.factor(rank)2 0.4723524 0.25682591 0.8617897
## as.factor(rank)3 0.2554578 0.13101163 0.4895783
## as.factor(rank)4 0.1851240 0.08044069 0.4036546

9.3 Qualité de l’ajustement

9.3.1 Courbe ROC

reg=glm((PRONO=="DECES")~.,data=myocarde,family=binomial(link = "logit"))
seuil = .5
score = data.frame(yobs=(myocarde$PRONO=="DECES"),ypred=predict(reg,type="response"))
head(score)

##    yobs     ypred
## 1 FALSE 0.3986106
## 2  TRUE 0.8306231
## 3  TRUE 0.6710440
## 4 FALSE 0.1182406
## 5  TRUE 0.8575781
## 6  TRUE 0.9420705

xtabs(~yobs + (ypred>seuil), data = score)

##        ypred > seuil
## yobs    FALSE TRUE
##   FALSE    39    3
##   TRUE      4   25

(FP=sum((score$ypred>seuil)*(score$yobs==0))/sum(score$yobs==0))

## [1] 0.07142857

(TP=sum((score$ypred>seuil)*(score$yobs==1))/sum(score$yobs==1))

## [1] 0.862069

roc_curve = function(seuil){
FP=sum((score$ypred>seuil)*(score$yobs==0))/sum(score$yobs==0)
TP=sum((score$ypred>seuil)*(score$yobs==1))/sum(score$yobs==1)
return(c(FPR=FP,TPR=TP))}
u=seq(0,1,by=.01)
ROC.curve=t(Vectorize(roc_curve)(u))
plot(ROC.curve,type="s")
abline(a=0,b=1,lty=2,col="red")
points(FP,TP,pch=19,col="blue")

library(verification)

## Loading required package: fields

## Loading required package: spam

## Loading required package: dotCall64

## Loading required package: grid

## 
## Attaching package: 'grid'

## The following object is masked from 'package:mixtools':
## 
##     depth

## Spam version 2.2-2 (2019-03-07) is loaded.
## Type 'help( Spam)' or 'demo( spam)' for a short introduction 
## and overview of this package.
## Help for individual functions is also obtained by adding the
## suffix '.spam' to the function name, e.g. 'help( chol.spam)'.

## 
## Attaching package: 'spam'

## The following objects are masked from 'package:SparseM':
## 
##     backsolve, forwardsolve

## The following object is masked from 'package:Matrix':
## 
##     det

## The following objects are masked from 'package:base':
## 
##     backsolve, forwardsolve

## Loading required package: maps

## See https://github.com/NCAR/Fields for
##  an extensive vignette, other supplements and source code

## Loading required package: boot

## 
## Attaching package: 'boot'

## The following object is masked _by_ '.GlobalEnv':
## 
##     tau

## The following object is masked from 'package:car':
## 
##     logit

## The following object is masked from 'package:sm':
## 
##     dogs

## Loading required package: CircStats

## Loading required package: dtw

## Loading required package: proxy

## 
## Attaching package: 'proxy'

## The following object is masked from 'package:spam':
## 
##     as.matrix

## The following object is masked from 'package:SparseM':
## 
##     as.matrix

## The following object is masked from 'package:Matrix':
## 
##     as.matrix

## The following objects are masked from 'package:stats':
## 
##     as.dist, dist

## The following object is masked from 'package:base':
## 
##     as.matrix

## Loaded dtw v1.20-1. See ?dtw for help, citation("dtw") for use in publication.

roc.plot(score$yobs,score$ypred)

library(ROCR)
pred = prediction(score$ypred,score$yobs)
rocs = performance(pred, "tpr", "fpr")
plot(rocs)
perf = performance(pred, measure = "auc")
perf@y.values

## [[1]]
## [1] 0.9417077

perf = performance(pred, "rch")
plot(perf,col="red",add=TRUE)

9.3.2 Kolmogorov Smirnov

f0=c(0,sort(score$ypred[score$yobs==0]),1)
f1=c(0,sort(score$ypred[score$yobs==1]),1)
plot(f0,(0:(length(f0)-1))/(length(f0)-1),col="red",type="s",xlim=0:1)
lines(f1,(0:(length(f1)-1))/(length(f1)-1),col="blue",type="s")

9.4 Extentions

9.4.1 Cas binomial (et plus Bernoulli)

conting = xtabs(~admit + rank, data = admin)
cont_admin = data.frame(y0=conting[1,],y1=conting[2,],x=1:4,n=apply(conting,2,sum))
cont_admin

##   y0 y1 x   n
## 1 28 33 1  61
## 2 97 54 2 151
## 3 93 28 3 121
## 4 55 12 4  67

attach(cont_admin)

## The following objects are masked _by_ .GlobalEnv:
## 
##     n, x

reg=glm(cbind(y1, y0) ~ as.factor(x), data = cont_admin,family="binomial")
summary(reg)

## 
## Call:
## glm(formula = cbind(y1, y0) ~ as.factor(x), family = "binomial", 
##     data = cont_admin)
## 
## Deviance Residuals: 
## [1]  0  0  0  0
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.1643     0.2569   0.639   0.5225    
## as.factor(x)2  -0.7500     0.3080  -2.435   0.0149 *  
## as.factor(x)3  -1.3647     0.3354  -4.069 4.72e-05 ***
## as.factor(x)4  -1.6867     0.4093  -4.121 3.77e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25.01  on 3  degrees of freedom
## Residual deviance:  0.00  on 0  degrees of freedom
## AIC: 27.005
## 
## Number of Fisher Scoring iterations: 3

9.4.2 Cas multinomial

library(nnet)
model=multinom(as.factor(rank) ~ gre + gpa, data = admin)

## # weights:  16 (9 variable)
## initial  value 554.517744 
## iter  10 value 519.832717
## final  value 519.311942 
## converged

summary(model)

## Call:
## multinom(formula = as.factor(rank) ~ gre + gpa, data = admin)
## 
## Coefficients:
##   (Intercept)           gre        gpa
## 2    3.211128 -0.0004897883 -0.5894966
## 3    1.768934 -0.0031436638  0.2270097
## 4    3.751437 -0.0023341470 -0.6725334
## 
## Std. Errors:
##   (Intercept)         gre       gpa
## 2   0.7018681 0.001437975 0.3012374
## 3   0.7443790 0.001500964 0.3156397
## 4   0.7774992 0.001674163 0.3475131
## 
## Residual Deviance: 1038.624 
## AIC: 1056.624

(z = summary(model)$coefficients/summary(model)$standard.errors)

##   (Intercept)        gre        gpa
## 2    4.575117 -0.3406098 -1.9569170
## 3    2.376389 -2.0944293  0.7192052
## 4    4.825004 -1.3942171 -1.9352753

(pval = (1 - pnorm(abs(z), 0, 1)) * 2)

##    (Intercept)        gre        gpa
## 2 4.759549e-06 0.73339738 0.05035723
## 3 1.748301e-02 0.03622175 0.47201451
## 4 1.400002e-06 0.16325206 0.05295652

exp(coef(model))

##   (Intercept)       gre       gpa
## 2    24.80706 0.9995103 0.5546064
## 3     5.86460 0.9968613 1.2548421
## 4    42.58222 0.9976686 0.5104138

head(predict(model, data = admin, type="probs"))

##           1         2         3         4
## 1 0.1107439 0.2715501 0.4463341 0.1713719
## 2 0.1808019 0.3730891 0.3063242 0.1397848
## 3 0.2376033 0.3768749 0.2794000 0.1061218
## 4 0.1537982 0.4253067 0.2488354 0.1720598
## 5 0.1189638 0.4066822 0.2645914 0.2097626
## 6 0.1683534 0.4910100 0.1789031 0.1617335

10 Regression de Poisson

base = read.table("http://freakonometrics.free.fr/baseaffairs.txt",header=TRUE)
str(base)

## 'data.frame':    563 obs. of  9 variables:
##  $ SEX         : int  1 0 0 1 1 0 0 1 0 1 ...
##  $ AGE         : num  37 27 32 57 22 32 22 57 32 22 ...
##  $ YEARMARRIAGE: num  10 4 15 15 0.75 1.5 0.75 15 15 1.5 ...
##  $ CHILDREN    : int  0 0 1 1 0 0 0 1 1 0 ...
##  $ RELIGIOUS   : int  3 4 1 5 2 2 2 2 4 4 ...
##  $ EDUCATION   : int  18 14 12 18 17 17 12 14 16 14 ...
##  $ OCCUPATION  : int  7 6 1 6 6 5 1 4 1 4 ...
##  $ SATISFACTION: int  4 4 4 5 3 5 3 4 2 5 ...
##  $ Y           : int  0 0 0 0 0 0 0 0 0 0 ...

table(base$Y)

## 
##   0   1   2   3   4   5   6   7   8  10 
## 451  34  17  19  12  11  11   5   2   1

library(vcd)
gof = goodfit(base$Y, type = "poisson", method = "ML", par = NULL)
gof

## 
## Observed and fitted values for poisson distribution
## with parameters estimated by `ML' 
## 
##  count observed       fitted pearson residual
##      0      451 2.996841e+02     8.740829e+00
##      1       34 1.889660e+02    -1.127313e+01
##      2       17 5.957632e+01    -5.516089e+00
##      3       19 1.252196e+01     1.830659e+00
##      4       12 1.973933e+00     7.136158e+00
##      5       11 2.489329e-01     2.154817e+01
##      6       11 2.616080e-02     6.784738e+01
##      7        5 2.356530e-03     1.029506e+02
##      8        2 1.857389e-04     1.467365e+02
##      9        0 1.301309e-05    -3.607366e-03
##     10        1 8.205410e-07     1.072005e+03

plot(gof)

Manifestement, $Y$ ne suit pas une loi de Poisson…

10.1 Méthode de biais minimal

df=data.frame(y=base$Y,
              religion=as.factor(base$RELIGIOUS),
              occupation=as.factor(base$OCCUPATION),
              expo = 1)
(E=xtabs((y>=0)~religion+occupation,data=df))

##         occupation
## religion  1  2  3  4  5  6  7
##        1  4  1  8  4 16  9  0
##        2 23  3 11 17 56 36  6
##        3 29  1 10 12 39 25  2
##        4 38  7 12 21 59 44  2
##        5 13  1  3 10 19 19  3

(N=xtabs(y~religion+occupation,data=df))

##         occupation
## religion  1  2  3  4  5  6  7
##        1  4  1 13  3 13  7  0
##        2  1  1 13 10 25 43 10
##        3 15  0 12 11 34 35  1
##        4 24  1  3 15 11  9 10
##        5  6  0  0  6 11  7  0

sum(N)/sum(E)

## [1] 0.6305506

A=rep(1,length(levels(df$religion)))
B=rep(1,length(levels(df$occupation)))*sum(N)/sum(E)
A

## [1] 1 1 1 1 1

## [1] 0.6305506 0.6305506 0.6305506 0.6305506 0.6305506 0.6305506 0.6305506

E * A%*%t(B)

##         occupation
## religion          1          2          3          4          5          6
##        1  2.5222025  0.6305506  5.0444050  2.5222025 10.0888099  5.6749556
##        2 14.5026643  1.8916519  6.9360568 10.7193606 35.3108348 22.6998224
##        3 18.2859680  0.6305506  6.3055062  7.5666075 24.5914742 15.7637655
##        4 23.9609236  4.4138544  7.5666075 13.2415631 37.2024867 27.7442274
##        5  8.1971581  0.6305506  1.8916519  6.3055062 11.9804618 11.9804618
##         occupation
## religion          7
##        1  0.0000000
##        2  3.7833037
##        3  1.2611012
##        4  1.2611012
##        5  1.8916519

sum(B*E[1,])

## [1] 26.48313

sum(B*E[2,])

## [1] 95.84369

apply(t(B*t(E)),1,sum)

##         1         2         3         4         5 
##  26.48313  95.84369  74.40497 115.39076  42.87744

sum(A*E[,1])

## [1] 107

sum(A*E[,2])

## [1] 13

apply(A*E,2,sum)

##   1   2   3   4   5   6   7 
## 107  13  44  64 189 133  13

A=apply(N,1,sum)/apply(t(B*t(E)),1,sum)
B=apply(N,2,sum)/apply(A*E,2,sum)
A

##         1         2         3         4         5 
## 1.5481556 1.0746664 1.4515159 0.6326330 0.6996686

##         1         2         3         4         5         6         7 
## 0.4710775 0.2642760 0.8468916 0.7239135 0.4891248 0.7766910 1.6515534

for(i in 1:1000){
  A=apply(N,1,sum)/apply(t(B*t(E)),1,sum)
  B=apply(N,2,sum)/apply(A*E,2,sum)
}
A

##         1         2         3         4         5 
## 1.5404346 1.0447195 1.4825650 0.6553159 0.6634763

##         1         2         3         4         5         6         7 
## 0.4685515 0.2629769 0.8454435 0.7245310 0.4889697 0.7770553 1.6753750

E * A%*%t(B)

##         occupation
## religion          1          2          3          4          5          6
##        1  2.8870914  0.4050987 10.4188024  4.4643702 12.0516123 10.7730250
##        2 11.2586111  0.8242113  9.7157637 12.8678376 28.6068235 29.2249717
##        3 20.1450811  0.3898804 12.5342484 12.8899708 28.2722423 28.8008726
##        4 11.6678702  1.2063307  6.6483904  9.9707299 18.9053460 22.4055332
##        5  4.0413463  0.1744790  1.6827951  4.8070914  6.1639760  9.7955975
##         occupation
## religion          7
##        1  0.0000000
##        2 10.5017811
##        3  4.9677044
##        4  2.1957997
##        5  3.3347148

apply(N,1,sum)

##   1   2   3   4   5 
##  41 103 108  73  30

apply(E * A%*%t(B),1,sum)

##   1   2   3   4   5 
##  41 103 108  73  30

apply(N,2,sum)

##   1   2   3   4   5   6   7 
##  50   3  41  45  94 101  21

apply(E * A%*%t(B),2,sum)

##   1   2   3   4   5   6   7 
##  50   3  41  45  94 101  21

reg=glm(y~religion+occupation,data=df,family=poisson)
summary(reg)

## 
## Call:
## glm(formula = y ~ religion + occupation, family = poisson, data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2288  -1.2028  -1.0092  -0.7836   6.0644  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.32604    0.21325  -1.529 0.126285    
## religion2   -0.38832    0.18791  -2.066 0.038783 *  
## religion3   -0.03829    0.18585  -0.206 0.836771    
## religion4   -0.85470    0.19757  -4.326 1.52e-05 ***
## religion5   -0.84233    0.24416  -3.450 0.000561 ***
## occupation2 -0.57758    0.59549  -0.970 0.332083    
## occupation3  0.59022    0.21349   2.765 0.005699 ** 
## occupation4  0.43588    0.20603   2.116 0.034381 *  
## occupation5  0.04265    0.17590   0.242 0.808399    
## occupation6  0.50587    0.17360   2.914 0.003569 ** 
## occupation7  1.27415    0.26298   4.845 1.27e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1295.2  on 562  degrees of freedom
## Residual deviance: 1215.0  on 552  degrees of freedom
## AIC: 1555.1
## 
## Number of Fisher Scoring iterations: 6

yp = predict(reg,type="response")
sum(yp)

## [1] 355

sum(df$y)

## [1] 355

xtabs(yp~df$religion+df$occupation)

##            df$occupation
## df$religion          1          2          3          4          5          6
##           1  2.8870914  0.4050987 10.4188024  4.4643702 12.0516123 10.7730250
##           2 11.2586112  0.8242113  9.7157637 12.8678376 28.6068235 29.2249717
##           3 20.1450813  0.3898804 12.5342484 12.8899708 28.2722424 28.8008726
##           4 11.6678703  1.2063307  6.6483904  9.9707300 18.9053460 22.4055332
##           5  4.0413464  0.1744790  1.6827951  4.8070914  6.1639761  9.7955975
##            df$occupation
## df$religion          7
##           1  0.0000000
##           2 10.5017811
##           3  4.9677044
##           4  2.1957997
##           5  3.3347148

E * A%*%t(B)

##         occupation
## religion          1          2          3          4          5          6
##        1  2.8870914  0.4050987 10.4188024  4.4643702 12.0516123 10.7730250
##        2 11.2586111  0.8242113  9.7157637 12.8678376 28.6068235 29.2249717
##        3 20.1450811  0.3898804 12.5342484 12.8899708 28.2722423 28.8008726
##        4 11.6678702  1.2063307  6.6483904  9.9707299 18.9053460 22.4055332
##        5  4.0413463  0.1744790  1.6827951  4.8070914  6.1639760  9.7955975
##         occupation
## religion          7
##        1  0.0000000
##        2 10.5017811
##        3  4.9677044
##        4  2.1957997
##        5  3.3347148

a=exp(coefficients(reg)[1]+c(0,coefficients(reg)[2:5]))
a/a[1]

##           religion2 religion3 religion4 religion5 
## 1.0000000 0.6781979 0.9624329 0.4254098 0.4307072

A/A[1]

##         1         2         3         4         5 
## 1.0000000 0.6781979 0.9624329 0.4254098 0.4307072

b=exp(coefficients(reg)[1]+c(0,coefficients(reg)[6:11]))
b/b[1]

##             occupation2 occupation3 occupation4 occupation5 occupation6 
##   1.0000000   0.5612551   1.8043769   1.5463210   1.0435773   1.6584203 
## occupation7 
##   3.5756477

B/B[1]

##         1         2         3         4         5         6         7 
## 1.0000000 0.5612551 1.8043770 1.5463210 1.0435773 1.6584203 3.5756478

yp = predict(reg,type="response")
xtabs(yp~df$religion+df$occupation)

##            df$occupation
## df$religion          1          2          3          4          5          6
##           1  2.8870914  0.4050987 10.4188024  4.4643702 12.0516123 10.7730250
##           2 11.2586112  0.8242113  9.7157637 12.8678376 28.6068235 29.2249717
##           3 20.1450813  0.3898804 12.5342484 12.8899708 28.2722424 28.8008726
##           4 11.6678703  1.2063307  6.6483904  9.9707300 18.9053460 22.4055332
##           5  4.0413464  0.1744790  1.6827951  4.8070914  6.1639761  9.7955975
##            df$occupation
## df$religion          7
##           1  0.0000000
##           2 10.5017811
##           3  4.9677044
##           4  2.1957997
##           5  3.3347148

E * A%*%t(B)

##         occupation
## religion          1          2          3          4          5          6
##        1  2.8870914  0.4050987 10.4188024  4.4643702 12.0516123 10.7730250
##        2 11.2586111  0.8242113  9.7157637 12.8678376 28.6068235 29.2249717
##        3 20.1450811  0.3898804 12.5342484 12.8899708 28.2722423 28.8008726
##        4 11.6678702  1.2063307  6.6483904  9.9707299 18.9053460 22.4055332
##        5  4.0413463  0.1744790  1.6827951  4.8070914  6.1639760  9.7955975
##         occupation
## religion          7
##        1  0.0000000
##        2 10.5017811
##        3  4.9677044
##        4  2.1957997
##        5  3.3347148

10.2 De la notion d’exposition (offset)

10.3 Application : données AIDS

library(boot)
data(aids)
reg=glm(y~as.factor(delay)+as.factor(time),data=aids)
D=unique(aids$delay)
T=unique(aids$time)
T2=unique(aids$year+(aids$quarter-1/2)/4)
M=matrix(NA,length(T2),length(D))
for(i in 1:length(T2)){
  for(j in 1:length(D)){
    M[i,j]=aids[(aids$time==i)&(aids$delay==D[j]),"y"]}}
for(i in 1:(length(D)-2)){
  M[nrow(M)+1-i,(2+i):length(D)]=NA
}
M

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
##  [1,]    2    6    0    1    1    0    0    1    0     0     0     0     0
##  [2,]    2    7    1    1    1    0    0    0    0     0     0     0     0
##  [3,]    4    4    0    1    0    2    0    0    0     0     2     1     0
##  [4,]    0   10    0    1    1    0    0    0    1     1     1     0     0
##  [5,]    6   17    3    1    1    0    0    0    0     0     0     1     0
##  [6,]    5   22    1    5    2    1    0    2    1     0     0     0     0
##  [7,]    4   23    4    5    2    1    3    0    1     2     0     0     0
##  [8,]   11   11    6    1    1    5    0    1    1     1     1     0     0
##  [9,]    9   22    6    2    4    3    3    4    7     1     2     0     0
## [10,]    2   28    8    8    5    2    2    4    3     0     1     1     0
## [11,]    5   26   14    6    9    2    5    5    5     1     2     0     0
## [12,]    7   49   17   11    4    7    5    7    3     1     2     2     0
## [13,]   13   37   21    9    3    5    7    3    1     3     1     0     0
## [14,]   12   53   16   21    2    7    0    7    0     0     0     0     0
## [15,]   21   44   29   11    6    4    2    2    1     0     2     0     2
## [16,]   17   74   13   13    3    5    3    1    2     2     0     0     0
## [17,]   36   58   23   14    7    4    1    2    1     3     0     0     0
## [18,]   28   74   23   11    8    3    3    6    2     5     4     1     1
## [19,]   31   80   16    9    3    2    8    3    1     4     6     2     1
## [20,]   26   99   27    9    8   11    3    4    6     3     5     5     1
## [21,]   31   95   35   13   18    4    6    4    4     3     3     2     0
## [22,]   36   77   20   26   11    3    8    4    8     7     1     0     0
## [23,]   32   92   32   10   12   19   12    4    3     2     0     2     2
## [24,]   15   92   14   27   22   21   12    5    3     0     3     3     0
## [25,]   34  104   29   31   18    8    6    7    3     8     0     2     1
## [26,]   38  101   34   18    9   15    6    1    2     2     2     3     2
## [27,]   31  124   47   24   11   15    8    6    5     3     3     4     0
## [28,]   32  132   36   10    9    7    6    4    4     5     0     0    NA
## [29,]   49  107   51   17   15    8    9    2    1     1     0    NA    NA
## [30,]   44  153   41   16   11    6    5    7    2     0    NA    NA    NA
## [31,]   41  137   29   33    7   11    6    4    3    NA    NA    NA    NA
## [32,]   56  124   39   14   12    7   10    1   NA    NA    NA    NA    NA
## [33,]   53  175   35   17   13   11    2   NA   NA    NA    NA    NA    NA
## [34,]   63  135   24   23   12    1   NA   NA   NA    NA    NA    NA    NA
## [35,]   71  161   48   25    5   NA   NA   NA   NA    NA    NA    NA    NA
## [36,]   95  178   39    6   NA   NA   NA   NA   NA    NA    NA    NA    NA
## [37,]   76  181   16   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA
## [38,]   67   66   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA
##       [,14] [,15]
##  [1,]     0     1
##  [2,]     0     0
##  [3,]     0     0
##  [4,]     0     0
##  [5,]     0     1
##  [6,]     0     0
##  [7,]     0     2
##  [8,]     0     1
##  [9,]     0     0
## [10,]     0     1
## [11,]     0     2
## [12,]     1     4
## [13,]     0     6
## [14,]     1     1
## [15,]     2     8
## [16,]     3     5
## [17,]     3     1
## [18,]     1     3
## [19,]     2     6
## [20,]     1     3
## [21,]     3     3
## [22,]     2     2
## [23,]     0     2
## [24,]     1     1
## [25,]     2     0
## [26,]     0    NA
## [27,]    NA    NA
## [28,]    NA    NA
## [29,]    NA    NA
## [30,]    NA    NA
## [31,]    NA    NA
## [32,]    NA    NA
## [33,]    NA    NA
## [34,]    NA    NA
## [35,]    NA    NA
## [36,]    NA    NA
## [37,]    NA    NA
## [38,]    NA    NA

aids2=aids
aids2$y[aids2$dud==1]=NA
reg=glm(y~as.factor(delay)+as.factor(time),data=aids2,family=poisson)
summary(reg)

## 
## Call:
## glm(formula = y ~ as.factor(delay) + as.factor(time), family = poisson, 
##     data = aids2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6079  -0.9572  -0.3329   0.6242   3.8243  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         6.107e-01  2.901e-01   2.105 0.035310 *  
## as.factor(delay)2   1.032e+00  3.615e-02  28.535  < 2e-16 ***
## as.factor(delay)5  -2.172e-01  4.755e-02  -4.567 4.94e-06 ***
## as.factor(delay)8  -7.097e-01  5.701e-02 -12.448  < 2e-16 ***
## as.factor(delay)11 -1.212e+00  7.069e-02 -17.138  < 2e-16 ***
## as.factor(delay)14 -1.386e+00  7.779e-02 -17.811  < 2e-16 ***
## as.factor(delay)17 -1.674e+00  9.076e-02 -18.449  < 2e-16 ***
## as.factor(delay)20 -1.941e+00  1.052e-01 -18.461  < 2e-16 ***
## as.factor(delay)23 -2.217e+00  1.231e-01 -18.009  < 2e-16 ***
## as.factor(delay)26 -2.345e+00  1.354e-01 -17.318  < 2e-16 ***
## as.factor(delay)29 -2.620e+00  1.597e-01 -16.405  < 2e-16 ***
## as.factor(delay)32 -2.894e+00  1.887e-01 -15.335  < 2e-16 ***
## as.factor(delay)35 -3.870e+00  3.181e-01 -12.168  < 2e-16 ***
## as.factor(delay)38 -3.002e+00  2.160e-01 -13.899  < 2e-16 ***
## as.factor(delay)41 -2.029e+00  1.417e-01 -14.312  < 2e-16 ***
## as.factor(time)2   -3.691e-14  4.082e-01   0.000 1.000000    
## as.factor(time)3    1.542e-01  3.934e-01   0.392 0.695171    
## as.factor(time)4    2.231e-01  3.873e-01   0.576 0.564510    
## as.factor(time)5    9.163e-01  3.416e-01   2.683 0.007305 ** 
## as.factor(time)6    1.179e+00  3.301e-01   3.570 0.000356 ***
## as.factor(time)7    1.365e+00  3.234e-01   4.221 2.43e-05 ***
## as.factor(time)8    1.204e+00  3.291e-01   3.658 0.000254 ***
## as.factor(time)9    1.658e+00  3.150e-01   5.265 1.40e-07 ***
## as.factor(time)10   1.689e+00  3.142e-01   5.377 7.57e-08 ***
## as.factor(time)11   1.922e+00  3.091e-01   6.218 5.04e-10 ***
## as.factor(time)12   2.303e+00  3.028e-01   7.605 2.84e-14 ***
## as.factor(time)13   2.206e+00  3.042e-01   7.254 4.03e-13 ***
## as.factor(time)14   2.303e+00  3.028e-01   7.605 2.84e-14 ***
## as.factor(time)15   2.413e+00  3.013e-01   8.008 1.17e-15 ***
## as.factor(time)16   2.464e+00  3.007e-01   8.193 2.54e-16 ***
## as.factor(time)17   2.546e+00  2.998e-01   8.491  < 2e-16 ***
## as.factor(time)18   2.668e+00  2.985e-01   8.939  < 2e-16 ***
## as.factor(time)19   2.674e+00  2.985e-01   8.960  < 2e-16 ***
## as.factor(time)20   2.867e+00  2.968e-01   9.661  < 2e-16 ***
## as.factor(time)21   2.927e+00  2.963e-01   9.877  < 2e-16 ***
## as.factor(time)22   2.838e+00  2.970e-01   9.556  < 2e-16 ***
## as.factor(time)23   2.927e+00  2.963e-01   9.877  < 2e-16 ***
## as.factor(time)24   2.904e+00  2.965e-01   9.796  < 2e-16 ***
## as.factor(time)25   3.069e+00  2.955e-01  10.387  < 2e-16 ***
## as.factor(time)26   2.994e+00  2.960e-01  10.115  < 2e-16 ***
## as.factor(time)27   3.185e+00  2.948e-01  10.804  < 2e-16 ***
## as.factor(time)28   3.057e+00  2.957e-01  10.338  < 2e-16 ***
## as.factor(time)29   3.128e+00  2.953e-01  10.592  < 2e-16 ***
## as.factor(time)30   3.235e+00  2.947e-01  10.977  < 2e-16 ***
## as.factor(time)31   3.192e+00  2.951e-01  10.815  < 2e-16 ***
## as.factor(time)32   3.193e+00  2.953e-01  10.815  < 2e-16 ***
## as.factor(time)33   3.375e+00  2.944e-01  11.464  < 2e-16 ***
## as.factor(time)34   3.252e+00  2.954e-01  11.008  < 2e-16 ***
## as.factor(time)35   3.480e+00  2.944e-01  11.820  < 2e-16 ***
## as.factor(time)36   3.604e+00  2.943e-01  12.245  < 2e-16 ***
## as.factor(time)37   3.602e+00  2.956e-01  12.187  < 2e-16 ***
## as.factor(time)38   3.594e+00  3.148e-01  11.417  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 14184.30  on 464  degrees of freedom
## Residual deviance:   716.48  on 413  degrees of freedom
##   (105 observations deleted due to missingness)
## AIC: 2190.5
## 
## Number of Fisher Scoring iterations: 5

11 Generalized Linear Model

Pour la régression linéaire Gaussienne, $Y\vert\boldsymbol{X}=\boldsymbol{x}\sim\color{red}{\mathcal{N}}(\color{green}{\boldsymbol{x}^\text{ T}\boldsymbol{\beta}},\sigma^2)$

Pour la régression logistique, $Y\vert\boldsymbol{X}=\boldsymbol{x}\sim\color{red}{\mathcal{B}}(\color{blue}{H}(\color{green}{\boldsymbol{x}^\text{ T}\boldsymbol{\beta}}))$ avec $\displaystyle{H(s)=\frac{e^s}{1+e^s}}$

Pour la régression de Poisson, $Y\vert\boldsymbol{X}=\boldsymbol{x}\sim\color{red}{\mathcal{P}}(\color{blue}{\exp}(\color{green}{\boldsymbol{x}^\text{ T}\boldsymbol{\beta}}))$

Plus généralement $Y\vert\boldsymbol{X}=\boldsymbol{x}\sim\color{red}{\mathcal{L}}(\theta_x,\varphi)$ avec $\theta_x=h(\mathbb{E}[Y\vert\boldsymbol{X}=\boldsymbol{x}]=\color{blue}{g}(\color{green}{\boldsymbol{x}^\text{ T}\boldsymbol{\beta}}))$

11.1 La famille exponentielle

On dira que la loi de $Y$ est dans la famille exponentielle si sa masse de probabilité / densité s’écrit \[ f(y|\theta,\varphi) = \exp\left(\frac{y\theta-b(\theta)}{ a(\varphi) }+c(y,\varphi)\right) \] où $a(\cdot)$, $b(\cdot)$ et $c(\cdot)$ sont des fonctions, et où $\theta$ est appelé . Le paramètre $\theta$ est le paramètre d’intérêet tandi que $\varphi$ est considéré comme un paramètres de nuisance (et supposé connu, dans un premier temps).

La loi de moyenne $\mu$ et de variance $\sigma^2$, $\mathcal{N}(\mu,\sigma^2)$ appartient à cette famille, avec $\theta=\mu$, $\varphi=\sigma^2$, $a(\varphi)=\varphi$, $b(\theta)=\theta^2/2$ et \[ c(y,\varphi)=-\frac{1}{2}\left(\frac{y^2}{\sigma^2}+\log(2\pi\sigma^2)\right), \hspace{2mm}y\in{\mathbb{R}}, \]

La loi de de moyenne $\pi$, $\mathcal{B}(\pi)$ correspond au cas $\theta=\log\{p/(1-p)\}$, $a(\varphi)=1$, $b(\theta)=\log(1+\exp(\theta))$, $\varphi=1$ et $\displaystyle{c(y,\varphi)=0}$.

La loi de moyenne $n\pi$, $\mathcal{B}(n,\pi)$ correspond au cas $\theta=\log\{p/(1-p)\}$, $a(\varphi)=1$, $b(\theta)=n\log(1+\exp(\theta))$, $\varphi=1$ et $\displaystyle{c(y,\varphi)=\log\binom{n}{y}}$.

La loi de de moyenne $\lambda$, $\mathcal{P}(\lambda)$ appartient à cette famille, \[ f(y|\lambda)=\exp(-\lambda)\frac{\lambda^y}{y!}=\exp\Big(y\log\lambda-\lambda-\log y!\Big), \hspace{2mm}y\in{\mathbb{N}}, \] avec $\theta=\log\lambda$, $\varphi=1$, $a(\varphi)=1$, $b(\theta)=\exp\theta=\lambda$ et $c(y,\varphi)=-\log y!$.

La loi , de paramètres $r$ et $p$, \[ f(k| r, p) = {y+r-1 \choose y} (1-p)^r p^y, \hspace{2mm}y\in \mathbb{N}. \] que l’on peut écrire \[ f(k| r, p) = \exp\left(y\log p +r\log (1-p) +\log{y+r-1 \choose y}\right) \] soit $\theta=\log p$, $b(\theta)=-r\log p$ et $a(\varphi)=1$

La loi (incluant la loi ) de moyenne $\mu$ et de variance $\nu^{-1}$, \[ f(y|\mu,\nu)=\frac{1}{\Gamma(\nu)}\left(\frac{\nu}{\mu}\right)^\nu y^{\nu-1}\exp\left(-\frac{\nu}{\mu}y\right), \hspace{2mm}y\in{\mathbb{R}_+}, \] est également dans la famille exponentielle. Il faut choisir $\displaystyle{\theta=-\frac{1}{\mu}}$, $a(\varphi)=\varphi$, $b(\theta)=-\log(-\theta)$, $\varphi=\nu^{-1}$ et \[ c(y,\varphi)= \left(\frac{1}{\varphi} - 1\right) \log(y) - \log\left( \Gamma \left( \frac{1}{\varphi} \right) \right) \]

Pour une variable aléatoire $Y$ dont la densité est de la forme exponentielle, alors \[ \color{red}{\mathbb{E}(Y)=b'(\theta)}\text{ et }\color{red}{\operatorname{var}(Y) = b''(\theta) \varphi}, \] i.e. la variance de $Y$ apparaêit comme le produit de deux fonctions:

la première, $b''(\theta)$ , qui dépend uniquement du paramètre $\theta$ est appelée fonction variance,
la seconde est indépendante de $\theta$ et dépend uniquement de $\varphi$.

En notant $\mu=\mathbb{E}(Y)$, on voit que le paramètre $\theta$ est lié à la moyenne $\mu$. La fonction variance peut donc êetre définie en fonction de $\mu$ , nous la noterons dorénavant \[\color{red}{\operatorname{V}(\mu)=b''([b']^{-1}(\mu)) \varphi}.\]

Dans le cas de la loi normale, $\operatorname{V}(\mu)=1$, dans le cas de la loi de Poisson, $\operatorname{V}(\mu)=\mu$ alors que dans le cas de la loi Gamma, $\operatorname{V}(\mu)=\mu^2$.

Chacune des lois de la famille exponentielle possède une fonction de lien spécifique, dite fonction de lien canonique, permettant de relier l’espérance $\mu$ au paramètre naturel (ou canonique) $\theta$. Le lien canonique est tel que $g_\star(\mu)=\theta$. Or, $\mu=b'(\theta)$ donc $\color{red}{g_\star(\cdot)=b'(\cdot)^{-1}}$.

Pour la loi normale, $\theta=\mu$ (link='identity'),

Pour la loi de Poisson, $\theta=\log(\mu)$ (link='log')

Pour la loi de Bernoulli, $\theta=\textit{logit}(\mu)=\displaystyle{\log\frac{\mu}{1-\mu}}$, (link='logit')

Pour la loi Gamma, $\theta=1/\mu$ (link='inverse')

a suggéré la famille suivante \[ f(y|\mu,\varphi) = A(y,\varphi) \cdot \exp \left\{\frac{1}{\varphi} \Big[ y \theta(\mu) - \kappa\big(\theta(\mu)\big) \Big]\right\}, \] où \[ \theta(\mu) = \begin{cases} \dfrac{\mu^{1-\gamma}}{1-\gamma} & \quad \gamma \ne 1 \\ \log \mu & \quad \gamma=1 \end{cases} \qquad \text{et} \qquad \kappa\big(\theta(\mu)\big) = \begin{cases} \dfrac{\mu^{2-\gamma}}{2-\gamma} & \quad \gamma \ne 2 \\ \log \mu & \quad \gamma = 2 \end{cases} \] La loi de $Y$ est alors une loi Poisson composée, avec des sauts suivant une loi Gamma, \[ Y\sim\mathcal{CP}oi\left({\mu^{2-\gamma}}{\varphi(2-\gamma)}, \mathcal{G}\left(-\frac{2-\gamma}{\varphi(1-\gamma)},\varphi(2-\gamma)\mu^{\gamma-1}\right)\right), \] où $\gamma\in[1,2]$.

On obtient alors une fonction variance de la forme $\color{blue}{\text{V}(\mu)=\varphi \mu^\gamma}$. On retrouve le modèle de quand $\gamma \rightarrow 1$ (ou $$) et une loi quand $\gamma \rightarrow 2$ (ou $\alpha \rightarrow 0$). Il est en fait possible d’obtenir une classe beaucoup plus large, y compris dans le cas où $\gamma>2$ en considérant des lois stables.

Le lien canonique est tel que $g(\mu_i)=\theta_i$. Or, $\mu_i=b'(\theta_i)$ d’où $g^{-1}=b'$.

La loi de Poisson correspondait au cas \[ f(y|\lambda)=\exp(-\lambda)\frac{\lambda^y}{y!}=\exp\Big(y\log\lambda-\lambda-\log y!\Big), \hspace{2mm}y\in{\mathbb{N}}, \] avec $\theta=\log\lambda$, $\varphi=1$. On a alors .

On souhaitera introduire un paramètre $\varphi\neq 1$, autorisant de la surdispersion ($\varphi>1$). On parle alors de loi (mais ce n’est pas une vraie loi). Avec un tel modèle, on aurait .

11.2 Régression

Considérons des variables aléatoires indépendantes $Y_1,Y_2,\ldots,Y_n$. La densité de chacune de celles-ci est \[ f(y_i|\theta_i,\varphi)=\exp\left\{\frac{y_i\theta_i-b(\theta_i)}{a(\varphi)}+c(y_i,\varphi)\right\} \] par rapport à la mesure dominante appropriée (mesure de comptage sur $\mathbb{N}$ ou mesure de Lebesgue sur $\mathbb{R}$). Dès lors, la vraisemblance est \[ \mathcal{L}(\boldsymbol{\theta},\varphi|\boldsymbol{y})=\prod_{i=1}^nf(y_i|\theta_i,\varphi) =\exp\left\{\frac{\sum_{i=1}^ny_i\theta_i-\sum_{i=1}^nb(\theta_i)}{a(\varphi)}+ \sum_{i=1}^n c(y_i,\varphi) \right\}. \] Plus précisément, notant $\mu_i$ la moyenne de $Y_i$, on suppose que \[ g(\mu_i)=\boldsymbol{x}_i'\boldsymbol{\beta}=\eta_i \] où

la fonction monotone et dérivable $g$ est appelée fonction de lien;
le vecteur $\boldsymbol{x}_i$ de dimension $p\times 1$ contient des variables explicatives relatives à l’individu $i$;
le vecteur $\boldsymbol{\beta}$ de dimension $p\times 1$ contient les paramètres.

Ainsi, un modèle linéaire généralisé est composé de trois éléments, à savoir

de variables à expliquer $Y_1,Y_2,\ldots,Y_n$ dont la est dans la famille exponentielle
d’un ensemble de paramètres $\boldsymbol{\beta}=(\beta_1,\beta_2,\ldots,\beta_p)'$ appartenant à un ouvert non vide de $\mathbb{R}^p$ et des : la matrice $n\times p$ $\boldsymbol{X}$, appelée matrice {}, ou matrice du plan d’expérience, est supposée êetre de rang $p$, i.e. la matrice carrée $p\times p$ $\boldsymbol{X}'\boldsymbol{X}$ est inversible;
d’une $g$ telle que \[ g(\mu_i)=\boldsymbol{x}_i'\boldsymbol{\beta}\mbox{ où } \mu_i=\mathbb{E}[Y_i] \] qui lie le prédicteur linéaire $\eta_i=\boldsymbol{x}_i'\boldsymbol{\beta}$ à la moyenne $\mu_i$ de $Y_i$.

11.3 Loi et fonction lien

x <- c(1,2,3,4,5)
y <- c(1,2,4,2,6)
base <- data.frame(x,y)
regN <- lm(y~x,data=base)
regNId <- glm(y~x,family=gaussian(link="identity"),data=base)
regNlog <- glm(y~x,family=gaussian(link="log"),data=base)
regPId <- glm(y~x,family=poisson(link="identity"),data=base)
regPlog <- glm(y~x,family=poisson(link="log"),data=base)
regGId <- glm(y~x,family=Gamma(link="identity"),data=base)
regGlog <- glm(y~x,family=Gamma(link="log"),data=base)
regGinv <- glm(y~x,family=Gamma,data=base)
regIGId <- glm(y~x,family=inverse.gaussian(link="identity"),data=base)
regIGlog <- glm(y~x,family=inverse.gaussian(link="log"),data=base)
library(statmod)
regTwId <- glm(y~x,family=tweedie(var.power=1.5,link.power=1),data=base)
regTwlog <- glm(y~x,family=tweedie(var.power=1.5,link.power=0),data=base)
DF=data.frame(obs=base$y,
              NId=predict(regNId,type="response"),
              Nlog=predict(regNlog,type="response"),
              PInd=predict(regPId,type="response"),
              Plog=predict(regPlog,type="response"),
              GId=predict(regGId,type="response"),
              Glog=predict(regGlog,type="response"),
              Ginv=predict(regGinv,type="response"),
              TwId=predict(regTwId,type="response"),
              Twlog=predict(regTwlog,type="response"))
DF

##   obs NId     Nlog     PInd     Plog      GId     Glog     Ginv     TwId
## 1   1   1 1.277371 1.036722 1.319414 1.021378 1.260933 1.531251 1.030693
## 2   2   2 1.833193 2.018361 1.873761 2.015682 1.826972 1.875599 2.016206
## 3   4   3 2.630870 3.000000 2.661015 3.009987 2.647110 2.419753 3.001719
## 4   2   4 3.775640 3.981639 3.779032 4.004291 3.835412 3.408695 3.987233
## 5   6   5 5.418534 4.963278 5.366779 4.998595 5.557148 5.764702 4.972746
##      Twlog
## 1 1.299162
## 2 1.856531
## 3 2.653025
## 4 3.791231
## 5 5.417753

apply(DF,2,sum)

##      obs      NId     Nlog     PInd     Plog      GId     Glog     Ginv 
## 15.00000 15.00000 14.93561 15.00000 15.00000 15.04993 15.12757 15.00000 
##     TwId    Twlog 
## 15.00860 15.01770

On notera qu’en utilisant le lien canonique, la somme des observations ($y_i$) est égale à la somme des prédictions ($\widehat{\mu}_i$)

summary(regN)

## 
## Call:
## lm(formula = y ~ x, data = base)
## 
## Residuals:
##          1          2          3          4          5 
## -5.551e-17  2.049e-16  1.000e+00 -2.000e+00  1.000e+00 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.986e-15  1.483e+00   0.000    1.000
## x           1.000e+00  4.472e-01   2.236    0.111
## 
## Residual standard error: 1.414 on 3 degrees of freedom
## Multiple R-squared:  0.625,  Adjusted R-squared:    0.5 
## F-statistic:     5 on 1 and 3 DF,  p-value: 0.1114

summary(regNId)

## 
## Call:
## glm(formula = y ~ x, family = gaussian(link = "identity"), data = base)
## 
## Deviance Residuals: 
##  1   2   3   4   5  
##  0   0   1  -2   1  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.986e-15  1.483e+00   0.000    1.000
## x           1.000e+00  4.472e-01   2.236    0.111
## 
## (Dispersion parameter for gaussian family taken to be 2)
## 
##     Null deviance: 16  on 4  degrees of freedom
## Residual deviance:  6  on 3  degrees of freedom
## AIC: 21.101
## 
## Number of Fisher Scoring iterations: 2

cbind(coefficients(regNId),coefficients(regPId),coefficients(regGId),coefficients(regIGId),coefficients(regTwId))

##                     [,1]       [,2]       [,3]        [,4]       [,5]
## (Intercept) 1.986027e-15 0.05508318 0.02707408 -0.01245022 0.04517992
## x           1.000000e+00 0.98163894 0.99430419  1.01941679 0.98551315

library(RColorBrewer)
darkcols = brewer.pal(8, "Dark2")
plot(x,y,pch=19)
abline(regNId,col=darkcols[1])
abline(regPId,col=darkcols[2])
abline(regGId,col=darkcols[3])
abline(regIGId,col=darkcols[4])
abline(regTwId,lty=2)

pente=function(gamma) summary(glm(y~x,family=tweedie(var.power=gamma,link.power=1),data=base))$coefficients[2,1:2]
Vgamma = seq(-.5,3.5,by=.05)
Vpente = Vectorize(pente)(Vgamma)
plot(Vgamma,Vpente[1,],type="l",lwd=3,ylim=c(.965,1.03),xlab="power",ylab="slope")
text(0,.967,"Gaussian",srt=90,pos=4)
text(1,.967,"Poisson",srt=90,pos=4)
text(2,.967,"Gamma",srt=90,pos=4)
text(3,.967,"Inverse Gaussian",srt=90,pos=4)
for(i in 0:3) points(i,pente(i)[1],pch=19,col=darkcols[i+1])

plot(Vgamma,Vpente[1,],ylim=c(0,2),type="l",lwd=3,xlab="power",ylab="slope")
lines(Vgamma,Vpente[1,]+1.96*Vpente[2,],lty=2)
lines(Vgamma,Vpente[1,]-1.96*Vpente[2,],lty=2)
for(i in 0:3) points(i,pente(i)[1],pch=19,col=darkcols[i+1])

plot(x,y,pch=19)
u=seq(.8,5.2,by=.01)
lines(u,predict(regNlog,newdata=data.frame(x=u),type="response"),col=darkcols[1])
lines(u,predict(regPlog,newdata=data.frame(x=u),type="response"),col=darkcols[2])
lines(u,predict(regGlog,newdata=data.frame(x=u),type="response"),col=darkcols[3])
lines(u,predict(regIGlog,newdata=data.frame(x=u),type="response"),col=darkcols[4])
lines(u,predict(regTwlog,newdata=data.frame(x=u),type="response"),lty=2)

pente=function(gamma) summary(glm(y~x,family=tweedie(var.power=gamma,link.power=1),data=base))$coefficients[2,1:2]
pente=function(gamma) summary(glm(y~x,family=tweedie(var.power=gamma,link.power=1),data=base))$coefficients[2,1:2]
pente=function(gamma) summary(glm(y~x,family=tweedie(var.power=gamma,link.power=0),data=base))$coefficients[2,1:2]
Vpente = Vectorize(pente)(Vgamma)
plot(Vgamma,Vpente[1,],type="l",lwd=3,ylim=c(.3,.48),xlab="power",ylab="slope")
text(0,.3,"Gaussian",srt=90,pos=4)
text(1,.3,"Poisson",srt=90,pos=4)
text(2,.3,"Gamma",srt=90,pos=4)
text(3,.3,"Inverse Gaussian",srt=90,pos=4)
for(i in 0:3) points(i,pente(i)[1],pch=19,col=darkcols[i+1])

plot(Vgamma,Vpente[1,],ylim=c(0,.8),type="l",lwd=3,xlab="power",ylab="slope")
lines(Vgamma,Vpente[1,]+1.96*Vpente[2,],lty=2)
lines(Vgamma,Vpente[1,]-1.96*Vpente[2,],lty=2)
for(i in 0:3) points(i,pente(i)[1],pch=19,col=darkcols[i+1])

L’erreur sur la première observation est

erreur=function(gamma) predict(glm(y~x,family=tweedie(var.power=gamma,link.power=1),data=base),newdata=data.frame(x=1),type="response")-y[x==1] 
Verreur = Vectorize(erreur)(Vgamma)
plot(Vgamma,Verreur,type="l",lwd=3,ylim=c(-.1,.04),xlab="power",ylab="error")
abline(h=0,lty=2)
text(0,-.1,"Gaussian",srt=90,pos=4)
text(1,-.1,"Poisson",srt=90,pos=4)
text(2,-.1,"Gamma",srt=90,pos=4)
text(3,-.1,"Inverse Gaussian",srt=90,pos=4)
for(i in 0:3) points(i,erreur(i)[1],pch=19,col=darkcols[i+1])

erreur=function(gamma) predict(glm(y~x,family=tweedie(var.power=gamma,link.power=0),data=base),newdata=data.frame(x=1),type="response")-y[x==1] 
Verreur = Vectorize(erreur)(Vgamma)
plot(Vgamma,Verreur,type="l",lwd=3,ylim=c(.001,.32),xlab="power",ylab="error")
text(0,.001,"Gaussian",srt=90,pos=4)
text(1,.001,"Poisson",srt=90,pos=4)
text(2,.001,"Gamma",srt=90,pos=4)
text(3,.001,"Inverse Gaussian",srt=90,pos=4)
for(i in 0:3) points(i,erreur(i)[1],pch=19,col=darkcols[i+1])

11.3.1 Conditions du premier ordre

La (log)-vraisemblance s’écrit dans le cas des modèles exponentiels, \[ \log \mathcal L(\theta_1,\dots,\theta_n,\varphi,y_1,\dots,y_n) = \sum_{i=1}^n \left[ \frac{y_i\theta_i-b(\theta_i)}{a(\varphi)} + c(y_i,\varphi) \right]. \] On cherche les paramètres $\boldsymbol{\beta}$, il nous suffit de dériver la log-vraisemblance par rapport au paramètre $\boldsymbol{\beta}$ et d’écrire les condition du premier ordre.

Notons $\mu_i=E(Y_i)$ et $\eta_i=g(\mu_i)= X_i\beta$, le prédicteur linéaire.

Pour $i$ et $j$ donnés, on a \[ \frac{\partial \ln(\mathcal L_i)}{\partial \beta_j} = \frac{\partial \ln(\mathcal L_i)}{\partial \mu_i} \times \frac{\partial \mu_i}{\partial \beta_j}= \frac{\partial \mu_i}{\partial \eta_i} \times \frac{y_i-\mu_i}{ \text{V} (Y_i)} X_{ij} \] Ainsi on obtient les équations du score: \[ \color{red}{\sum_i \frac{\partial \ln(\mathcal L_i)}{\partial \beta_j} = \sum_i \frac{\partial \mu_i}{\partial \eta_i} \times \frac{y_i-\mu_i}{ \text{V} (Y_i)} X_{ij}=0} \] pour tout $j$.

Analytiquement, on ne peut pas résoudre ces équations, mais il est toujours possible de faire une descente de gradient. Ou de reconnaêitre la condition du premier ordre d’une régression linéaire pondérée \[ \color{red}{\sum_i \frac{\partial \ln(\mathcal L_i)}{\partial \beta_j} = \sum_i W_i \times \frac{\partial \eta_i}{\partial \mu_i} \frac{y_i-\mu_i}{ \text{V} (Y_i)} X_{ij}=0} ,\text{ avec }\color{red}{W_i =\frac{1}{ \text{V} (Y_i)}\left(\frac{\partial \mu_i}{\partial \eta_i}\right)^2}\] pour tout $j$.

L’algorithme est le mêeme que celui utilisé dans la régression logistique, et Poisson (mais plus général): à partir de $\boldsymbol{\beta}_{k}$

calculer le prédicteur linéaire, $\widehat{\eta}_i$, puis $\widehat{\mu}_i$ -calculer la matrice (diagonale) de poids telle que $W^{-1}=V(\widehat{\mu})g'(\widehat{\mu})$
utiliser le développement de Taylor à l’ordre 1, $Z=\widehat{\eta}+(Y-\widehat{\mu})g'(\widehat{\mu})$
régresser $Z$ sur les $\boldsymbol{X}$, avec des poids, pour obtenir $\boldsymbol{\beta}_{k+1}$

et on itère.

comme pour le modèle linéaire, l’estimation de $\boldsymbol{\beta}$ se fait indépendament de $\varphi$.

On peut montrer que $\widehat{\boldsymbol{\beta}}\overset{\mathbb{P}}{\rightarrow}\boldsymbol{\beta}$ et \[ \sqrt{n}(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta})\overset{\mathcal{L}}{\rightarrow} \mathcal{N}(\boldsymbol{0},I(\boldsymbol{\beta})^{-1}). \]

11.4 Deviance et résidus

la déviance est l’écart entre la log-vraisemblance obtenue en $\boldsymbol{\beta}$, et celle obtenue avec un modèle parfait (dit saturé), \[ D=2 \varphi \times \left[ \log \mathcal L(\boldsymbol{y}) - \log \mathcal L(\widehat{\boldsymbol{\mu}}) \right] \] où $\widehat{\boldsymbol{\mu}}=g^{-1}(\boldsymbol{X}\widehat{\boldsymbol{\beta}})$. On peut aussi définir la , \[ D^{\star}=\frac{D}{\varphi}=2 \times \left[ \log \mathcal L(\boldsymbol{y}) - \log \mathcal L(\widehat{\boldsymbol{\mu}}) \right] \] Par exemple pour une loi normale \[ D=\sum_{i=1}^n\omega_i(y_i-\mu_i)^2 \] pour une loi de Poisson, \[2\sum_{i=1}^n\omega_i\left\{y_i\ln\frac{y_i}{\mu_i}-(y_i-\mu_i)\right\}\] pour une loi Gamma \[2\sum_{i=1}^n\omega_i\left\{-\ln\frac{y_i}{\mu_i}+\frac{y_i-\mu_i}{\mu_i}\right\}\] pour une loi inverse gaussienne
\[\sum_{i=1}^n\omega_i\frac{(y_i-\mu_i)^2}{y_i\mu_i^2}\] et pour une loi binomiale \[2\sum_{i=1}^n\omega_im_i\left\{y_i\ln\frac{y_i}{\mu_i}+(1-y_i)\ln\frac{1-y_i}{1-\mu_i}\right\} \] Pour ce faire, posons $\sigma^2=a(\varphi)$. L’estimation de $\sigma^2$ est basée sur la déviance donnée par \[ D=\frac{1}{\sigma^2}\left\{\sum_{i=1}^ny_i\widehat{\theta}_i-\sum_{i=1}^n b(\widehat{\theta}_i)\right\}. \] Comme $\mathbb{E}(D)\approx n-p$, on pourrait estimer $\sigma^2$ par \[ \widetilde{\sigma}^2=\frac{1}{n-p}D. \] Cet estimateur est toutefois peu utilisé en pratique car il est très instable. Afin d’éviter ces désagréments, on a recours à un développement de Taylor à l’ordre 2 de $L(\boldsymbol{y}|\boldsymbol{y},\sigma^2)$ qui nous donne \[ \widehat{\sigma}^2=\frac{1}{n-p}(\boldsymbol{y}-\widehat{\boldsymbol{\mu}})' I_n(\widehat{\boldsymbol{\mu}})(\boldsymbol{y}-\widehat{\boldsymbol{\mu}}); \] cette dernière estimation est souvent appelée estimation du $\chi^2$ de Pearson.

Les résidus bruts sont $\widehat{\varepsilon}_i = Y_i -\widehat{\mu}_i$. Mais comme les modèles sont hétéroscédasituqes, ils n’ont qu’un intérêet limité.

Les résidus de Pearson sont \[ \widehat{\varepsilon}_{P,i} = \frac{y_i -\widehat{\mu}_i}{\sqrt{\operatorname{V}(\widehat{\mu}_i)}} \]

Les résidus de déviance sont \[ \widehat{\varepsilon}_{D,i} = \pm \sqrt{d_i}, \text{ où }D=\sum_{i=1}^n d_i. \]

Pour un modèle Gaussien, $\widehat{\varepsilon}_{P,i}=\widehat{\varepsilon}_{D,i}=Y_i -\widehat{\mu}_i$.

Pour un modèle de Poisson, $\widehat{\varepsilon}_{P,i}=\displaystyle{\frac{y_i -\widehat{\mu}_i}{\sqrt{\widehat{\mu}_i}}}$ et \[\widehat{\varepsilon}_{D,i}=\pm\sqrt{|Y_i\log[y_i /\widehat{\mu}_i]-[y_i -\widehat{\mu}_i]|}.\] Pour un modèle Gamma, $\widehat{\varepsilon}_{P,i}=\displaystyle{\frac{y_i -\widehat{\mu}_i}{\widehat{\mu}_i}}$ et \[\widehat{\varepsilon}_{D,i}=\pm\sqrt{|\log[y_i /\widehat{\mu}_i]-[y_i -\widehat{\mu}_i]/\widehat{\mu}_i|}.\]

Modèles Linéaires Appliqués - STT5100

Arthur Charpentier

03 February 2020