# Regression

## Correlation

When studying random variables, it may be useful to check their independence or the nature of their relationships. The analysis of the relationships between variables is generally carried out using graphical tools (point clouds), combined with numerical indicators (correlation coefficients).

The best known correlation coefficient is *Pearson's*. It quantifies the
intensity of the linear link between two variables. It is constructed by
dividing the covariance of the variables X and Y by the product of the standard
deviations of X and Y. The result therefore always lies between -1 and 1.

- A correlation coefficient close to *1* indicates a strong positive correlation: the variables move in the same direction and are closely linked.
- A correlation coefficient close to *-1* indicates a strong negative correlation: the variables move in opposite directions and are closely linked.
- A correlation coefficient close to *0* shows that the variables are not linearly correlated. This is not sufficient to conclude that the variables are independent.
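As a quick illustration of this construction (the data below are purely hypothetical), the Pearson coefficient can be computed by dividing the covariance by the product of the standard deviations:

```python
import numpy as np

# Hypothetical data with a strong positive linear relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 4.1, 4.8, 6.3])

# Pearson r = cov(X, Y) / (sigma_X * sigma_Y)
# bias=True gives the population covariance, matching np.std's default.
r = np.cov(x, y, bias=True)[0, 1] / (np.std(x) * np.std(y))
```

Here `r` is close to 1, reflecting the strong positive correlation in the sample.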

Since the Pearson correlation coefficient is very sensitive to the presence of outliers, it is strongly recommended to use it in conjunction with a graph to avoid misinterpretation.

Correlation coefficients are not restricted to continuous variables; the correlation between ranked variables can also be measured. The Kendall coefficient, for example, quantifies the relationship between the ranks (or rankings) of observations.
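A minimal sketch of the idea behind Kendall's coefficient, which compares the number of concordant and discordant pairs between two rankings (the rankings below are illustrative):

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau from counts of concordant and discordant pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1   # the pair is ordered the same way in both rankings
        elif s < 0:
            discordant += 1   # the pair is ordered oppositely
    n_pairs = len(a) * (len(a) - 1) / 2
    return (concordant - discordant) / n_pairs

# Two judges rank the same five items almost identically.
tau = kendall_tau([1, 2, 3, 4, 5], [2, 1, 3, 5, 4])
```

With 8 concordant and 2 discordant pairs out of 10, tau comes out at 0.6.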

When we want to study the correlation of several variables, we usually use a matrix of scatterplots.

**SOSstat** allows you to analyze the correlation of several variables. The
correlation (or covariance) results are returned in a matrix, which can also be
visualized as an image with a color scale corresponding to the correlation level.
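Outside **SOSstat**, the same kind of correlation matrix can be sketched with NumPy (the three variables below are simulated, with `var2` built to correlate with `var1`):

```python
import numpy as np

# Simulated data: var2 is constructed to be strongly correlated with var1.
rng = np.random.default_rng(0)
var1 = rng.normal(size=200)
var2 = 0.8 * var1 + 0.2 * rng.normal(size=200)
var3 = rng.normal(size=200)

# Correlation matrix of the three variables (one row per variable).
corr = np.corrcoef(np.vstack([var1, var2, var3]))
# corr is symmetric with ones on the diagonal; e.g. matplotlib's
# plt.imshow(corr) would display it as a color-scaled image.
```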

## Simple regression

After having demonstrated the existence of a correlation between two random variables, we can legitimately try to model this relationship. The purpose of regression is to determine this model. Although the notions of correlation and regression are often linked, it is worth recalling their differences.

The correlation aims to quantify the intensity of the relationship between two
variables (their degree of dependence) in order to determine whether these
values are statistically moving in the same direction or in the opposite
direction. The regression approach is a little different. From the value pairs
*(x,y)*, we try to create a model such that we can predict the values of *Y*
knowing the values of *X*. We speak of "explanatory variable" for *X* and
"explained variable" for *Y*.

### Model Estimation

Let *x* and *y* be two dependent random variables. The prediction of *y*,
$\hat{y}$, is a function of the variable *x*: $\hat{y}=f(x)$. To establish the
model coefficients, we use the least-squares optimization criterion, which
minimizes the squared error $(y-\hat{y})^{2}$ between the experimental value
of $y$ and its prediction $\hat{y}$.

If we put the linear prediction model in the form:

$$\hat{y} = \alpha x + \beta$$

then applying the least-squares criterion yields (see the derivation below) the
values of $\alpha$ and $\beta$.
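As a sketch, the least-squares estimates of the slope and intercept can be computed in closed form (the data below are purely illustrative):

```python
import numpy as np

# Hypothetical (x, y) sample, roughly linear.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

# Closed-form least-squares estimates for y_hat = alpha * x + beta:
#   alpha = cov(x, y) / var(x),  beta = mean(y) - alpha * mean(x)
# bias=True gives the population covariance, matching np.var's default.
alpha = np.cov(x, y, bias=True)[0, 1] / np.var(x)
beta = y.mean() - alpha * x.mean()
y_hat = alpha * x + beta
```

The same estimates are returned by `np.polyfit(x, y, 1)`.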

### Quality of the model

Unfortunately, a good application of the least squares method does not guarantee that the model is of good quality. Indeed, the quality of the predictions will depend on:

- The choice of the model. A linear model may have been fitted even though the true relationship between the variables is not linear.
- The intensity of the relationship between the two variables. If the dependence of *Y* on *X* is weak, it will be very difficult to predict the realizations of *Y* knowing *X*.

To verify that the model has a satisfactory "explanatory" character, we study the decomposition of variance. In regression, the total observed variability (SCT) is assumed to be the sum of the variability explained by the model (SCM) and the residual variability (SCE), i.e. the deviations that the model cannot predict. This relationship is expressed as a sum of squared deviations:

$$\underbrace{\sum_{i}(y_{i}-\bar{y})^{2}}_{SCT} = \underbrace{\sum_{i}(\hat{y}_{i}-\bar{y})^{2}}_{SCM} + \underbrace{\sum_{i}(y_{i}-\hat{y}_{i})^{2}}_{SCE}$$

where

- SCT is the sum of the total squares, i.e. the variability of the measurements around the mean.
- SCE is the sum of the squares of the measurements around the estimates, i.e. the residual variability.
- SCM is the sum of the squares of the estimates around the mean, i.e. the variability explained by the model.

This decomposition of the sources of variation makes it possible to establish the coefficient of determination, $R^{2}$, which represents the ratio of the variability explained by the model to the total variability:

$$R^{2} = \frac{SCM}{SCT} = 1 - \frac{SCE}{SCT}$$
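The decomposition can be checked numerically on a least-squares fit (illustrative data; for a linear fit obtained by least squares, SCT = SCM + SCE holds exactly):

```python
import numpy as np

# Hypothetical (x, y) sample; the linear fit is only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

# Fit a simple linear model by least squares.
alpha, beta = np.polyfit(x, y, 1)
y_hat = alpha * x + beta

sct = np.sum((y - y.mean()) ** 2)      # total variability
scm = np.sum((y_hat - y.mean()) ** 2)  # variability explained by the model
sce = np.sum((y - y_hat) ** 2)         # residual variability

r2 = scm / sct  # coefficient of determination
```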

As can be seen in the animation below, implementing a simple regression is
extremely easy in **SOSstat**. The regression graph can display the confidence
interval of the model as well as the prediction interval of the observations. In
addition, residuals are analyzed to verify their independence and normality.

**SOSstat** also provides numerical results, including the model coefficients
(several models are available) and the coefficient of determination $R^{2}$. A
significance test is applied to the correlation coefficient.

## Multiple regression

A natural extension of simple regression, presented above, is multiple or multilinear regression, which allows a model involving several input variables to be identified.

### Presentation of the problem

In this section, we present the mathematical formalism used to generalize the least squares method to systems with several variables. Matrix methods are particularly well suited to solving systems of n equations with n unknowns, thanks to their compact notation and straightforward computer implementation.

When we carry out a regression, our aim is to identify the parameters of a
mathematical model. Consider the case of a system comprising two factors *A* and
*B*, which we wish to represent with a first-order model, i.e. with two effects
$E_{A}$, $E_{B}$ and an interaction $I_{AB}$. Writing $M$ for the overall mean,
the mathematical model can then be written as:

$$y = M + E_{A}x_{1} + E_{B}x_{2} + I_{AB}x_{1}x_{2}$$

This can be written generically as:

$$y = a_{0} + a_{1}x_{1} + a_{2}x_{2} + a_{12}x_{1}x_{2}$$

This mathematical model applies to each combination of factor levels (4 experiments if both factors have two levels each). We can thus express the responses as a function of $x_{1}$ and $x_{2}$, which take the values -1 or +1 (centered reduced coordinates) depending on whether the factor is at its minimum or maximum level.

This system of four equations can be represented in matrix form, $\mathbf{Y} = \mathbf{X}\,\mathbf{a}$, by adopting the following notations:

*Y* is the vector of the system's responses

*a* is the vector of the model's coefficients, i.e. the effects and interactions
(which are organized in a very precise way).

*X* is the experiment matrix that describes the succession of experiments
performed or observations of explanatory variables

In this representation, **Y** and **X** are known, and we try to identify the
coefficients of **a**.
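For the two-factor case above, the experiment matrix and the responses can be sketched as follows (the coefficient values are hypothetical, chosen only to illustrate the notation):

```python
import numpy as np

# Experiment matrix X for a full 2^2 factorial design in coded units (+/-1).
# Columns: [constant, x1, x2, x1*x2].
x1 = np.array([-1, 1, -1, 1])
x2 = np.array([-1, -1, 1, 1])
X = np.column_stack([np.ones(4), x1, x2, x1 * x2])

# Assumed coefficient vector a = [M, E_A, E_B, I_AB] (illustrative values).
a = np.array([10.0, 2.0, -1.5, 0.5])

# The deterministic model gives the vector of responses Y = X a.
Y = X @ a
```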

### Solution

The mathematical model we have sought to identify is purely theoretical, since
it represents a deterministic relationship between the explained variable **Y**
and the explanatory variables **X**. To take random phenomena into account, a
term representing the residuals (the deviations not covered by the model) is
added:

$$\mathbf{Y} = \mathbf{X}\,\mathbf{a} + \boldsymbol{\varepsilon}$$

To solve this system of equations, the multiple or multilinear regression technique is used. The latter seeks a solution that minimizes the sum of the squares of the differences between the model and the experimental results (least squares).

The solution of a linear system under the least-squares criterion is given by the relationship:

$$\hat{\mathbf{a}} = \left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\mathbf{X}^{T}\mathbf{Y}$$

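The least-squares solution can be sketched numerically via the normal equations (the design matrix and responses below are illustrative, a 2² factorial plan in coded units):

```python
import numpy as np

# Illustrative design matrix (columns: constant, x1, x2, x1*x2) and responses.
X = np.array([[1, -1, -1,  1],
              [1,  1, -1, -1],
              [1, -1,  1, -1],
              [1,  1,  1,  1]], dtype=float)
Y = np.array([10.0, 13.0, 6.0, 11.0])

# Normal equations: a_hat = (X^T X)^{-1} X^T Y
a_hat = np.linalg.inv(X.T @ X) @ X.T @ Y

# In practice np.linalg.lstsq is preferred for numerical stability.
a_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Both approaches recover the same coefficient vector; `lstsq` avoids forming and inverting $X^{T}X$ explicitly.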
**SOSstat** offers a multiple regression module to build complex models mixing
continuous and discrete variables. **SOSstat** calculates the model coefficients
and performs tests to determine whether they are significant. Numerous residual
analysis graphs complete the calculations, helping to identify isolated values
or to highlight a lack of fit (i.e. an inappropriate model).

**SOSstat** also offers a prediction module using the regression model. The
regression model allows the user to easily find the configuration of the
variables to target the desired response.