Generalized Linear Models: Introduction and Implementation in Ruby
Overview
Most of us are well acquainted with linear regression and its use in analysing the relationship between one dataset and another. Linear regression basically shows the (possibly) linear relationship between one or more independent variables and a single dependent variable. But what if this relationship is not linear, and the dependent and independent variables are associated with one another through some special function? This is where Generalized Linear Models (or GLMs) come in. This article will explain some core GLM concepts and their implementation in Ruby using the statsample-glm gem.
Generalized Linear Models Basics
The basic linear regression equation relating the dependent variable y with the independent variable x looks something like \(\begin{align} y = \beta_{0} + x_{1}*\beta_{1} + x_{2}*\beta_{2}... \end{align}\) This is the equation of a straight line, with \(\beta_{0}\) denoting the intercept of the line with the Y axis and \(\beta_{1}\) denoting the slope of the line. GLMs take this a step further: they establish a relationship between x and y through another function g, called the link function. The appropriate link function depends on the probability distribution of the dependent variable conditional on the independent variables. In its simplest form, the relationship can be written as \(y = g(x*\beta_{1})\).
GLMs can be used to model numerous relationships, depending on the distribution of the dependent variable conditional on the independent variables. We will first explore the various kinds of GLMs and their defining parameters, and then look at the different methods employed in finding the coefficients. The most common GLMs are:
- Logistic (or logit) regression.
- Normal regression.
- Poisson regression.
- Probit regression.
Let’s see all of the above one by one.
Logistic Regression
Logistic (or logit) regression is one of the most fundamental of the GLMs. It is mainly used when the dependent variable shows a binomial distribution conditional on the independent variables. In the binomial distribution, the number of successes is modelled over a fixed number of tries. The Bernoulli distribution is a special case of the binomial where the outcome is either 0 or 1 (which is the case in the example at the bottom of this post). Using the logit link function, one can estimate the probability of success for each observed value of the independent variables. The values so obtained trace out a sigmoid graph of x vs y, which can then be used to predict the probability of success for values of x not already in the dataset. The defining parameter of the distribution is the probability of success y.
The logit link function looks something like \(\begin{align} y = \frac{e^{(\beta_{0} + x*\beta_{1})}}{1 + e^{(\beta_{0} + x*\beta_{1})}} \end{align}\) , where y is the probability for the given value of x.
Of special interest is the meaning of the values of the coefficients. In the case of linear regression, \(\beta_{0}\) merely denotes the intercept while \(\beta_{1}\) is the slope of the line. Here, however, because of the nature of the link function, the coefficient \(\beta_{1}\) of the independent variable is interpreted as "for every 1-unit increase in x, the odds of y are multiplied by \(e^{\beta_{1}}\)".
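To make the link function and the odds-ratio interpretation concrete, here is a tiny Ruby sketch; the coefficient values \(\beta_{0}\) and \(\beta_{1}\) are made up purely for illustration:

```ruby
# Hypothetical logistic regression coefficients, purely for illustration.
beta0 = -1.5   # intercept
beta1 =  0.8   # coefficient of x

# Logistic (inverse logit) link: maps the linear predictor to a probability.
logistic = ->(x) { Math.exp(beta0 + beta1 * x) / (1 + Math.exp(beta0 + beta1 * x)) }

puts logistic.call(0.0)   # probability of y = 1 when x = 0
puts logistic.call(1.0)   # probability of y = 1 when x = 1

# Odds-ratio interpretation: a 1-unit increase in x multiplies the odds by e^beta1.
puts Math.exp(beta1)
```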
One thing that puzzled me when I started off with regression was the purpose of having several variables \((x_{1}, x_{2}...)\) in the same regression model. Including multiple independent variables against a single dependent lets us estimate the effect of each variable on the dependent while the others are held constant. To study the effect of one variable without accounting for the others, one could instead run a separate regression for each one.
The logistic graph generally looks like this:
![Generic Graph of Logistic Regression](/assets//images/glm/logistic.gif)
Normal Regression
Normal regression is used when the dependent variable exhibits a normal probability distribution conditional on the independent variables. This is exactly the assumption behind simple linear and multiple regression, and the coefficients of a normal model are more easily calculated using ordinary linear regression methods. But since this is another very important and commonly found kind of data set, we will look into it.
Normally distributed data is symmetric about the center and its mean is equal to its median. Commonly found normal distributions are heights of people and errors in measurement. The defining parameters of a normal distribution are the mean \(\mu\) and variance \(\sigma^2\). The link function is simply \(y = x*\beta_{1}\) if no constant is present. The coefficient of the independent variable is interpreted in exactly the same manner as it is for linear regression.
A normal regression graph generally looks like this:
![Generic Graph of Normal Regression](/assets//images/glm/normal.png)
Poisson Regression
A dataset often possesses a Poisson distribution when the data is generated by a very large number of trials, each with a small probability of success; for example, the number of earthquakes taking place in a region per year. It is mainly used for count data and contingency tables. The binomial distribution converges to the Poisson when the number of cases (n) is large and the probability of success (p) is small.
The Poisson distribution is completely defined by the rate parameter \(\lambda\). The link function is \(ln(y) = x*\beta_{1}\), which can be rewritten as \(y = e^{x*\beta_{1}}\). Because the link function is logarithmic, Poisson regression is also referred to as log-linear regression.
The coefficient in the Poisson case is interpreted as "for every 1-unit increase in x, y is multiplied by \(e^{\beta_{1}}\)".
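As a quick illustration of the log link and this multiplicative interpretation, here is a small Ruby sketch with made-up coefficient values:

```ruby
# Hypothetical Poisson regression coefficients, for illustration only.
beta0 = 0.2
beta1 = 0.5

# Log link: the expected count is e^(beta0 + beta1 * x).
expected_count = ->(x) { Math.exp(beta0 + beta1 * x) }

puts expected_count.call(2.0)
# Increasing x by 1 multiplies the expected count by e^beta1:
puts expected_count.call(3.0) / expected_count.call(2.0)   # == Math.exp(beta1)
```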
A Poisson regression graph looks something like this:
![Graph of Poisson Regression](/assets//images/glm/poisson.png)
Probit Regression
Probit regression is used for modeling binary outcome variables. Probit is similar to logit, and the choice between the two is largely a matter of personal preference.
In the probit model, the inverse standard normal distribution of the probability is modeled as a linear combination of the predictors (in simple terms, something like \(y = \Phi(\beta_{0} + x_{1}*\beta_{1}...)\), where \(\Phi\) is the CDF of the standard normal). Therefore, the link function can be written as \(z = \Phi^{-1}(p)\), where \(\Phi(z)\) is the standard normal cumulative distribution function (here p is the probability of success for a given value of x, and z is the corresponding z-score of y).
The fitted mean values of the probit are calculated by evaluating the normal CDF integral with upper limit \(x*\beta_{1}\) and lower limit \(-\infty\). This works because evaluating the CDF of a normally distributed random variable at a point yields a probability, which is exactly what we expect from the fitted values of a probit.
The coefficient of x is interpreted as "a one-unit change in x leads to a change of \(\beta_{1}\) in the z-score of y".
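Here is a small Ruby sketch of the probit relationship, with the standard normal CDF written in terms of Math.erf; the coefficient values are made up for illustration:

```ruby
# Standard normal CDF, written in terms of the error function.
def std_normal_cdf(z)
  0.5 * (1 + Math.erf(z / Math.sqrt(2)))
end

# Hypothetical probit coefficients, for illustration only.
beta0 = -0.4
beta1 =  0.7

# Fitted probability: Phi(beta0 + beta1 * x).
probit_mean = ->(x) { std_normal_cdf(beta0 + beta1 * x) }

puts probit_mean.call(1.0)
# A 1-unit increase in x shifts the z-score of y by beta1:
puts beta0 + beta1 * 2.0   # z-score at x = 2
```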
Looking at the graph of probit, one can see the similarities between logit and probit:
![Generic Graph of Probit Regression](/assets//images/glm/probit.png)
Finding the coefficients of a GLM
There are two major methods of finding the coefficients of a GLM:
- Maximum Likelihood Estimation (MLE).
- Iteratively Reweighted Least Squares (IRLS).
Maximum Likelihood Estimation
The most obvious way of finding the coefficients of a given regression is to maximize the likelihood function of the assumed distribution of the data. This becomes much easier when we take the natural logarithm of the likelihood function; hence the name 'Maximum Likelihood Estimation'. The Newton-Raphson method is used to find the beta values (coefficients) that maximize the log-likelihood.
The first derivative of the log-likelihood with respect to \(\beta\) is calculated over all the \(x_{i}\) terms (this is the Jacobian, or score, vector), and so is the second derivative (this is the Hessian matrix). The coefficients are estimated by choosing an initial estimate \(\beta_{old}\) and then iteratively correcting it with the update
\[\begin{align} \beta_{new} = \beta_{old} - inverse(hessian)*jacobian \quad (1) \end{align}\]
until successive estimates agree to within a pre-set tolerance. A good implementation of MLE can be found here.
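To make the Newton-Raphson update concrete, here is a minimal one-parameter Ruby sketch: it estimates the Poisson rate \(\lambda\) from a small made-up sample, where the Jacobian and Hessian reduce to scalars and the MLE is simply the sample mean, so the result is easy to verify:

```ruby
# Made-up count data; the Poisson MLE of lambda is the sample mean,
# which makes it easy to check that the iteration converges correctly.
data = [2, 3, 4, 1, 5, 3]

# First and second derivatives of the Poisson log-likelihood wrt lambda.
score   = ->(l) { data.sum / l - data.size }   # "Jacobian" (a scalar here)
hessian = ->(l) { -data.sum / (l**2) }         # Hessian (a scalar here)

lambda_est = 1.0                               # initial estimate
10.times do
  step = score.call(lambda_est) / hessian.call(lambda_est)
  break if step.abs < 1e-8                     # pre-set tolerance
  lambda_est -= step                           # equation (1)
end

puts lambda_est   # converges to data.sum.to_f / data.size
```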
Iteratively Reweighted Least Squares
Another useful but somewhat slower method of estimating the regression coefficients of a dataset is Iteratively Reweighted Least Squares. It is slower mainly because of the number of coefficients involved and the extra memory taken up by the various matrices used by this method. The upside of IRLS is that it is very easy to implement and is easily extensible to any kind of GLM.
The IRLS method also ultimately boils down to the Newton-Raphson equation (1), but the key difference between the two is that in MLE we try to maximize the likelihood, while in IRLS we try to minimize the errors. Therefore, the manner in which the Hessian and Jacobian matrices are calculated is quite different. The IRLS equation is written as:
\[\begin{align} b_{new} = b_{old} + inverse(X'*W*X)*(X'*(y - \mu)) \end{align}\]
Here the Hessian matrix is \(-(X'*W*X)\) and the Jacobian is \(X'*(y - \mu)\), so this is just equation (1) with the sign of the Hessian absorbed. Let's see the significance of each term in these matrices (a small Ruby sketch of one IRLS step follows the list below):
- X - The matrix of independent variables \(x_{1}, x_{2},...\) along with the constant vector.
- X’ - Transpose of X.
- W - The weight matrix. This is the most important entity in the equation, and understanding it is paramount to understanding IRLS as a whole.
- The weight matrix is there to reduce the bias of the best-fit curve towards larger values of x. Hence, the weight matrix acts as a mediator of sorts between the very small and very large values of x (if any). It is a diagonal matrix, with each non-zero entry representing the weight for the corresponding observation \(x_{i}\) in the sample data.
- Calculation of the weight matrix depends on the assumed probability distribution of the dependent variable. The weight expression can be read off the equation of the Hessian matrix. So in the case of logistic regression, the weight matrix is a diagonal matrix whose ith entry is \(p(x_{i}, \beta_{old})*(1 - p(x_{i}, \beta_{old}))\).
- With a canonical link, W is simply the (diagonal) variance-covariance matrix of the observations. In logistic and Poisson regression the variance of each observation depends on its mean, which is where the expression \(p(x_{i}, \beta_{old})*(1 - p(x_{i}, \beta_{old}))\) comes from.
- \((y - \mu)\) - This is a vector whose ith value is the difference between the actual corresponding value on the y-axis and the fitted mean \(\mu_{i}\) obtained from \(x_{i}*b_{old}\) through the link (for logistic regression, \(\mu_{i} = p(x_{i}, \beta_{old})\)). The value of this term is crucial in determining the error with which the coefficients have been calculated; a tolerance of \(10^{-4}\) is frequently acceptable.
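Putting these pieces together, here is a minimal Ruby sketch of a single IRLS step for logistic regression using Ruby's Matrix class; the data and starting coefficients are made up, and a full implementation would repeat the update until \((y - \mu)\) is sufficiently small:

```ruby
require 'matrix'

# Made-up data: a constant column plus one predictor, and binary responses.
x = Matrix[[1, 0.5], [1, 1.5], [1, 2.5], [1, 3.5]]
y = Matrix.column_vector([0, 0, 1, 1])

b_old = Matrix.column_vector([0.0, 0.0])   # initial coefficient estimates

# Fitted probabilities p(x_i, b_old) under the logistic link.
eta = x * b_old
mu  = eta.map { |e| 1.0 / (1 + Math.exp(-e)) }

# Diagonal weight matrix with entries p * (1 - p).
w = Matrix.diagonal(*mu.column(0).map { |p| p * (1 - p) })

# One IRLS / Newton-Raphson update: b_new = b_old + inverse(X'WX) * X'(y - mu).
b_new = b_old + (x.transpose * w * x).inverse * (x.transpose * (y - mu))
puts b_new
```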
Generalized Linear Models in Ruby
Calculating the coefficients and a host of other properties of a GLM is extremely simple and intuitive in Ruby. Let us see some examples of GLM by using the `daru` and `statsample-glm` gems.
First install statsample-glm by running `gem install statsample-glm`; statsample will be downloaded along with it if it is not already installed. Then download the CSV files from here.
Statsample-glm supports a variety of GLMs, giving the user the choice of both the IRLS and the MLE algorithm for almost every distribution, all through a simple and intuitive API. The primary calling function for all distributions and algorithms is `Statsample::GLM.compute(data_set, dependent, method, options)`. We specify the data set, the dependent variable, the type of regression, and finally an options hash through which a variety of customization options can be passed to the computation.
To compute the coefficients of a logistic regression, try this code:
```ruby
require 'daru'
require 'statsample-glm'

# Compute the coefficients and related attributes of a logistic regression.
data_set = Daru::DataFrame.from_csv "logistic_mle.csv"

# The options hash adds an extra constant vector, all of whose values are 1,
# and specifies that the MLE algorithm is to be used.
glm = Statsample::GLM.compute data_set, :y, :logistic, {constant: 1, algorithm: :mle}

puts glm.coefficients
#=> [0.3270, 0.8147, -0.4031, -5.3658]
puts glm.standard_error
#=> [0.4390, 0.4270, 0.3819, 1.9045]
puts glm.log_likelihood
#=> -38.8669
```
Just go through the tests in the source code on GitHub or read the documentation for further details, and feel free to drop me a mail in case you have any doubts or suggestions for improvements. Similar to the above code, you can try fitting Poisson, normal or probit regression models, using the data files from the link above as sample data.
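For instance, a Poisson fit would look something like the sketch below; the file name `poisson.csv` and the column `:y` are placeholders rather than names taken from the gem's examples, so substitute whichever of the downloaded CSV files and columns you actually use:

```ruby
require 'daru'
require 'statsample-glm'

# A Poisson fit follows the same pattern as the logistic example above.
# "poisson.csv" and :y are placeholder names -- adjust them to your data.
data_set = Daru::DataFrame.from_csv "poisson.csv"
glm = Statsample::GLM.compute data_set, :y, :poisson, {constant: 1, algorithm: :irls}

puts glm.coefficients
```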
Cheers!