Logistic regression is used to detect a relationship between a binary dependent variable (Y = 0 or 1) and one or several independent variables, which may be quantitative (numerical) or qualitative (categorical).

Qualitative independent variables (also called covariates) can be binary (or dichotomous, 2 modalities only) or polytomous (3 or more modalities). They can also be ordinal (the order of the modalities has a meaning, e.g. little/medium/big) or not.

Example (from Hosmer & Lemeshow):
Low birth weight is harmful to children's growth, so a study was undertaken to evaluate the risk factors of low birth weight (the Y variable). Among 8 potential risk factors (independent variables), the authors of this study decided to keep the following 4 variables:

• 2 quantitative variables: age of the mother (AGE) and her weight at the last menstrual period (LWT),
• 2 qualitative variables: number of first trimester physician visits (FTV) and race of the mother (RACE, coded in 3 categories).

The parameters of the model are estimated by the maximum likelihood method (see section 1.2), each coefficient is tested with the Wald test, and the significance of the whole model is evaluated with the likelihood ratio test.

# 1 - Principles of logistic regression :

## 1.1. Choice of the logistic regression model

When the dependent variable (Y) is quantitative, assuming a normal distribution for it is realistic. This is not the case when Y is a qualitative variable, since its values are limited to a few modalities (0 and 1 for a binary variable).

[Figure: Y = f(X)]

In this case, analysing a regression model for a qualitative variable consists in calculating the probability that Y takes each modality. For example, when the probability of a binary dependent variable (Y) is plotted against a quantitative independent variable (X), we observe an S-shaped curve (sigmoidal relationship), bounded between 0 and 1.

[Figure: Probability (Y=1) according to values of X]

It is easy to see that a linear regression model fitted to these points would exceed the limits 0 and 1, which is not conceivable for a probability. The model best suited to this curve is the logistic distribution function; we speak of the "logit" or logistic model.
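The contrast between the two models can be sketched in a few lines of Python. This is only an illustration: the coefficients `b0` and `b1` and the straight line are hypothetical values chosen for the example, not estimates from any data set.

```python
import math

def logistic(z: float) -> float:
    """Logistic distribution function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -4.0, 0.8

for x in [0, 2, 5, 8, 10]:
    p_logit = logistic(b0 + b1 * x)   # stays strictly between 0 and 1
    p_linear = -0.1 + 0.12 * x        # hypothetical straight line: can leave [0, 1]
    print(f"x={x:>2}  logistic={p_logit:.3f}  linear={p_linear:.2f}")
```

With these assumed values, the straight line gives a "probability" below 0 at x = 0 and above 1 at x = 10, while the logistic curve remains inside the interval.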

Note: the "probit" model can also model a probability. However, the "logit" model makes it possible to use Odds Ratio values (cf. below), which quantify the chance of a positive outcome (Y=1) for each modality of a qualitative independent variable, or for a 1-unit increase of a quantitative independent variable.

## 1.2. Estimation of the logistic regression model

To find the model that best describes the relationship between one binary dependent variable and k covariates, the k coefficients of the logistic regression model are estimated with the Maximum Likelihood method.
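As an illustration of maximum likelihood estimation, here is a minimal sketch that fits a model with an intercept and a single covariate by Newton-Raphson iterations. It is not a description of how StatEL computes its estimates; the function name `fit_logistic` and the small data set are hypothetical.

```python
import math

def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Fit P(Y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson.

    Returns (b0, b1, log_likelihood). Limited to one covariate so that
    the 2x2 linear system can be solved explicitly.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood (score vector).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix, built from the weights p * (1 - p).
        w = [pi * (1.0 - pi) for pi in p]
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        # Newton step: solve (information matrix) * delta = gradient.
        d0 = (i11 * g0 - i01 * g1) / det
        d1 = (i00 * g1 - i01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
    loglik = sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
                 for yi, pi in zip(y, p))
    return b0, b1, loglik

# Hypothetical data: the two outcomes overlap, so the estimates are finite.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1, loglik = fit_logistic(x, y)
print(f"b0={b0:.3f}  b1={b1:.3f}  log-likelihood={loglik:.3f}")
```

At the maximum the gradient of the log-likelihood is zero, which is the condition the loop iterates towards. If one covariate separates the two outcomes perfectly, the estimates diverge and this sketch would not converge.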

The significance of each of the k estimated coefficients (b1, b2, ..., bk) is tested with the Wald test (W):

• H0: the coefficient bk is equal to 0
• H1: the coefficient bk is different from 0

$$W = \frac{b_k}{SE(b_k)}$$

with:

• bk: estimate of the coefficient for the kth covariate
• SE(bk): standard error of the coefficient for the kth covariate

Under H0, this value follows the standard normal distribution.
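The Wald test can be sketched as follows, assuming the statistic is the estimate divided by its standard error (the form consistent with a standard normal reference distribution) and a two-sided p-value. The function name `wald_test` and the numerical inputs are hypothetical.

```python
import math

def wald_test(b: float, se: float):
    """Return (W, two-sided p-value) for H0: coefficient = 0."""
    w = b / se
    # Two-sided tail of the standard normal: P(|Z| > |w|) = erfc(|w| / sqrt(2)).
    p_value = math.erfc(abs(w) / math.sqrt(2.0))
    return w, p_value

# Hypothetical estimate and standard error, for illustration only.
w, p_value = wald_test(b=0.50, se=0.25)
print(f"W = {w:.2f}, p = {p_value:.4f}")  # W = 2.00, p = 0.0455
```

Some software reports the square of this statistic, which is then compared with a Chi-square distribution with 1 degree of freedom; both forms give the same two-sided p-value.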

The significance of the complete model is evaluated with the Likelihood Ratio test:

• H0: all coefficients of the logistic regression model are equal to 0
• H1: at least one coefficient of the logistic regression model is different from 0

$$LR = -2 \ln\left(\frac{L_0}{L_k}\right)$$

with:

• L0: likelihood of the logistic regression model without covariates (i.e. intercept only)
• Lk: likelihood of the logistic regression model with all covariates

Under H0, this value follows the Chi-square distribution, with as many degrees of freedom as there are coefficients tested.
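The likelihood ratio test can be sketched as follows, assuming the usual statistic -2 ln(L0 / Lk) computed from the two log-likelihoods. The helper `chi2_sf` uses the closed-form sums that exist for integer degrees of freedom; the function names and the log-likelihood values are hypothetical.

```python
import math

def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(Chi2 > x) for an integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2.0) / i
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: normal two-sided tail plus a finite correction sum.
    chi = math.sqrt(x)
    tail = math.erfc(chi / math.sqrt(2.0))
    phi = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)   # chi^(2r-1) / (1*3*...*(2r-1))
        total += term
    return tail + 2.0 * phi * total

def likelihood_ratio_test(loglik_null: float, loglik_full: float, df: int):
    """Return (LR statistic, p-value) from the two log-likelihoods."""
    lr = -2.0 * (loglik_null - loglik_full)
    return lr, chi2_sf(lr, df)

# Hypothetical log-likelihoods for a model with 3 coefficients besides the intercept.
lr, p_value = likelihood_ratio_test(loglik_null=-100.0, loglik_full=-94.0, df=3)
print(f"LR = {lr:.2f}, p = {p_value:.4f}")
```

Working with log-likelihoods rather than likelihoods avoids numerical underflow, since likelihoods of even moderately sized samples are extremely small numbers.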

## 1.3. Advantage of the "Logit" model : Odds Ratio

The Odds Ratio (OR) is a measure of association. It shows how much more likely or unlikely it is for the outcome (Y=1) to be true when X=1 than when X=0.

Example: the Y variable shows the presence (Y=1) or absence (Y=0) of a disease for subjects who are exposed (X=1) or not exposed (X=0) to a risk factor (e.g. subjects are smokers or not). P1 is the probability of having the disease if the subject is exposed to the risk, and P0 is the probability of having the disease if the subject is not exposed. The OR is calculated as follows:

$$OR = \frac{P_1 / (1 - P_1)}{P_0 / (1 - P_0)}$$

If the OR is higher than 1, the disease is more likely for subjects who are exposed to the risk factor. Conversely, if the OR is lower than 1, the disease is less likely for subjects who are exposed to the risk factor.

Furthermore, the value of the OR measures the strength of this association. Thus, OR = 2 means that the odds of Y=1 are twice as high for subjects with X=1 as for subjects with X=0 (when the outcome is rare, the odds are close to the risk itself).

With a logistic regression model, the odds ratios are calculated directly from the coefficients of the covariates: OR = exp(coefficient).
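Both ways of obtaining an odds ratio can be sketched in Python. The probabilities 0.40 and 0.25 are hypothetical; the coefficient -0.01426 is the one reported for LWT in the results section of this page. The function names are illustrative.

```python
import math

def odds(p: float) -> float:
    """Odds associated with a probability p (0 < p < 1)."""
    return p / (1.0 - p)

def odds_ratio(p1: float, p0: float) -> float:
    """Odds ratio of the exposed group (p1) versus the unexposed group (p0)."""
    return odds(p1) / odds(p0)

# Hypothetical probabilities of disease for exposed and unexposed subjects.
print(f"{odds_ratio(0.40, 0.25):.2f}")   # 2.00

# Odds ratio from a logistic regression coefficient: OR = exp(coefficient).
b_lwt = -0.01426                   # LWT coefficient quoted in the results section
or_1_unit = math.exp(b_lwt)        # effect of a 1-unit increase
or_10_units = math.exp(10 * b_lwt) # effect of a 10-unit increase
print(f"{or_1_unit:.4f}  {or_10_units:.3f}")   # 0.9858  0.867
```

For a quantitative covariate the effect of an increase of c units is exp(c × coefficient), which is why the 10-unit odds ratio is not simply 10 times the 1-unit value.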

## 1.5. Case of polytomous covariates in a logistic regression model

Coding a polytomous variable into design variables requires choosing which modality will be the "reference" for the others. Consider a qualitative variable with 4 modalities "A", "B", "C", "D". The last modality is the reference for the 3 others. Transforming this polytomous variable produces 3 new design variables, each one related to one of the 3 non-reference modalities and encoded "0/1" as follows:

| Polytomous variable | design var. 1 | design var. 2 | design var. 3 |
|---|---|---|---|
| A | 1 | 0 | 0 |
| B | 0 | 1 | 0 |
| C | 0 | 0 | 1 |
| D | 0 | 0 | 0 |
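The reference coding shown above can be sketched as a small Python function. The name `reference_coding` is hypothetical and the function is only an illustration of the encoding, not StatEL's own procedure.

```python
def reference_coding(values, categories, reference):
    """Encode each value as 0/1 design variables, one per non-reference category.

    The reference category is encoded as a row of zeros.
    """
    kept = [c for c in categories if c != reference]
    return [[1 if v == c else 0 for c in kept] for v in values]

categories = ["A", "B", "C", "D"]
rows = reference_coding(["A", "B", "C", "D"], categories, reference="D")
for label, row in zip(categories, rows):
    print(label, row)
```

A variable with m modalities always yields m - 1 design variables; each coefficient is then interpreted relative to the reference modality.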

Coding an ordinal variable into design variables also requires choosing which modality will be the "reference" for the others. Consider a qualitative variable that records the size of tumors for a group of patients; its modalities are "no tumor", "little tumor", "medium tumor" and "big tumor". The first modality is the reference for the 3 others. Transforming this variable produces 3 new design variables, encoded "0/1" cumulatively (a design variable equals 1 as soon as the corresponding level of the scale is reached), as follows:

| Ordinal variable | design var. 1 | design var. 2 | design var. 3 |
|---|---|---|---|
| none (reference) | 0 | 0 | 0 |
| little tumor | 1 | 0 | 0 |
| medium tumor | 1 | 1 | 0 |
| big tumor | 1 | 1 | 1 |
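The cumulative coding of the ordinal example can be sketched in the same spirit. The function name `ordinal_coding` is hypothetical; it assumes the categories are given in increasing order with the reference first.

```python
def ordinal_coding(values, ordered_categories):
    """Encode ordinal values cumulatively against the first (reference) category.

    Design variable j is 1 when the value's rank is at least j, else 0.
    """
    rank = {c: i for i, c in enumerate(ordered_categories)}
    n_design = len(ordered_categories) - 1
    return [[1 if rank[v] >= j else 0 for j in range(1, n_design + 1)]
            for v in values]

sizes = ["no tumor", "little tumor", "medium tumor", "big tumor"]
coded = ordinal_coding(sizes, sizes)
for label, row in zip(sizes, coded):
    print(f"{label:<13}", row)
```

With this coding, each coefficient measures the change between one level and the level immediately below it, rather than the change relative to the reference.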

# 2 - Launch of logistic regression:

The main dialog box allows you to select :

• the dependent Y variable,
• any numerical independent variable,
• any categorical independent variable.

## 2.1. Selection of the dependent variable in the procedure of logistic regression with StatEL

After you select the cell range containing the data for the Y variable, StatEL lists the content of this variable.

In this dialog box, you can remove one of the modalities. StatEL also proposes a default encoding in the list on the right-hand side of the dialog box. To move a code, select it in the right-hand list and click the "Up/Down" arrows until it sits opposite the expected modality in the left-hand list. Then validate your selection.

## 2.2. Selection of a numerical independent variable in the procedure of logistic regression with StatEL

After you select the cell range containing the data for the independent variable, StatEL checks that your selection contains exactly the same number of cells as the selection of the dependent variable. If not, an error message is displayed.

## 2.3. Selection of a polytomous categorical independent variable in the procedure of logistic regression with StatEL

After you select the cell range containing the data for the polytomous independent variable (more than 2 modalities), StatEL lists the content of this variable and lets you specify how you want to encode its modalities.

a) Direct encoding of a polytomous variable :

Choose the upper option; the label of the "OK" button changes to "Next step >>". Click it to go to the content validation step.

In the example, the encoding order has been changed to place the code "0" opposite modality "1", the code "1" opposite modality "2", and the code "2" opposite modality "3". Then validate your selection.

b) Encoding of a polytomous variable in design variables :

Choose the second option. The lower list becomes active so that you can select the reference modality. The dialog box then widens to show how the design variables will be encoded. Click on the "OK" button. A message asks you to confirm the choice of reference modality.

c) Encoding of an ordinal variable in design variables :

Choose the second option and tick the box that specifies that the variable is ordinal. A new dialog box then opens, in which you can change the rank of each modality. Click on the "OK" button to validate the order of the modalities of the ordinal variable and return to the previous dialog box, where you specify the reference modality. Click on the "OK" button. A message asks you to confirm the choice of reference modality.

When all variables are selected, you just have to click on the "Validate" button to launch the calculation procedure. If necessary, first tick the option of the Stepwise method for the selection of variables.

# 3 - Results of logistic regression:

Results are displayed on a new sheet of your Excel file. Please note that some cells also contain comments.

The left part of the sheet displays the selected data, the expected values of the Y variable, the residuals, the standardized residuals, the leverages and other data required for the diagnostic analysis of the logistic regression model.
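As an illustration of such diagnostics, the sketch below computes two common residuals for a single observation, given its observed value y (0 or 1) and its expected probability p. These are textbook definitions for individual observations; whether StatEL uses exactly these formulas, or versions grouped by covariate pattern, is not stated here, so the function names and formulas are assumptions.

```python
import math

def pearson_residual(y: int, p: float) -> float:
    """(observed - expected) divided by the binomial standard deviation."""
    return (y - p) / math.sqrt(p * (1.0 - p))

def deviance_residual(y: int, p: float) -> float:
    """Signed square root of the observation's contribution to the deviance."""
    loglik = y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return math.copysign(math.sqrt(-2.0 * loglik), y - p)

# Hypothetical expected probability of 0.8 for two observations.
for y in (1, 0):
    print(y, round(pearson_residual(y, 0.8), 3), round(deviance_residual(y, 0.8), 3))
```

An observation with Y=0 but a high expected probability gets a large negative residual, which is the kind of point a diagnostic plot is meant to reveal.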

On the right part of the results sheet :

• descriptive statistics for each independent variable,
• the logistic regression model,
• significance analysis of each coefficient of the model with the Wald test, as well as the associated odds ratios and their 95% confidence intervals,
• significance analysis of the model with the likelihood ratio test,
• Pearson's Chi-square, deviance and Hosmer & Lemeshow tests,
• plots of the residuals and leverages of the different "covariate patterns".

In this example (from Hosmer & Lemeshow), the likelihood ratio test shows that the calculated model contains at least one variable whose coefficient is significantly different from 0 (with a p-value < 0.0335). The Wald test (table) on each coefficient shows that only the variables LWT (p < 0.0145) and RACE 2 (p < 0.0219) have an influence on the value of the Y variable.

Their odds ratios are also significant (with a p-value < 0.05) since their 95% confidence intervals exclude the value 1. We conclude that the odds of giving birth to a low-weight child are multiplied by 0.9858 when the weight of the mother increases by 1 unit (by extension, we can calculate that these odds are multiplied by exp(10 × -0.01426) = 0.867 when the weight of the mother increases by 10 units). Furthermore, the odds of giving birth to a low-weight child are multiplied by 2.729 for mothers who belong to category RACE 2 compared with those belonging to the reference category (RACE 1).

In this same example, the Pearson's Chi-square test and the deviance test do not reject the goodness-of-fit hypothesis. The Hosmer & Lemeshow test suggests rejecting the goodness-of-fit hypothesis. Nevertheless, the associated comment recalls that the power of this statistic is reduced when these conditions are not met: N > 400 and estimated frequencies > 5 (cf. table of H&L). Since neither of these conditions is met, we should not trust the result of this statistic.

ad Science Company - 55, Boulevard Pereire, 75017 PARIS - France