| en:iot-reloaded:regression_models [2024/12/02 19:08] – [Errors and their meaning] ktokarz | en:iot-reloaded:regression_models [2024/12/10 23:33] (current) – pczekalski | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | ====== Regression Models ====== | ||
| + | |||
| + | |||
| + | ===== Introduction ===== | ||
| + | |||
| + | While AI and especially Deep Learning techniques have advanced tremendously, | ||
| + | The term regression towards a mean value of a population was widely promoted by Francis Galton, who introduced the term " | ||
| + | |||
| + | ===== Linear regression model ===== | ||
| + | |||
| + | Linear regression is an algorithm that computes the linear relationship between the dependent variable and one or more independent features by fitting a linear equation to observed data. In essence, linear regression builds a linear function – a model that approximates a set of numerical data in a way that minimises the sum of squared errors between the model predictions and the actual data. The data consist of at least one independent variable (usually denoted by x) and the dependent variable (usually denoted by y). | ||
| + | If there is just one independent variable, then it is known as Simple Linear Regression, while in the case of more than one independent variable, it is called Multiple Linear Regression. In the same way, in the case of a single dependent variable, it is called Univariate Linear Regression. In contrast, in the case of many dependent variables, it is known as Multivariate Linear Regression. | ||
| + | For illustration purposes in the figure {{ref> | ||
| + | |||
| + | |||
| + | <figure Galton' | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | If the fathers' | ||
| + | |||
| + | <figure Linear model 1> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | |||
| + | where: | ||
| + | * yi – ith child height | ||
| + | * xi – ith father height | ||
| + | * β0 and β1 – y-axis intercept and slope coefficients of the linear function, respectively | ||
| + | |||
| + | Unfortunately, | ||
| + | It means that the following equation might describe the model: | ||
| + | |||
| + | <figure Linear model 2> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | where | ||
| + | * y'i – ith child height estimated by the model | ||
| + | * xi – ith father height | ||
| + | * β'0 and β'1 – y-axis intercept and slope coefficients' | ||
| + | |||
| + | <figure Model error> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | The estimated beta values might be calculated as follows: | ||
| + | |||
| + | <figure Coefficient velues> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | where: | ||
| + | * Cor(X, Y) – Correlation between X and Y (capital letters mean vectors of individual x and y corresponding values) | ||
| + | * σx and σy – standard deviations of vectors X and Y | ||
| + | * µx and µy – mean values of the vectors X and Y | ||
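The quantities listed above are enough to estimate both coefficients. The sketch below assumes the standard closed-form relations, slope = Cor(X, Y) · σy / σx and intercept = µy − slope · µx; the equation image itself is not reproduced here, so treat this as an illustration rather than the author's exact formula. The height values and the function name `fit_simple_linear` are hypothetical.

```python
import numpy as np

# Hypothetical father/child heights (illustrative values, not Galton's data).
x = np.array([65.0, 67.0, 68.0, 70.0, 72.0, 74.0])
y = np.array([66.0, 67.5, 68.0, 69.5, 70.0, 72.0])


def fit_simple_linear(x, y):
    """Return (intercept, slope) from correlation, standard deviations and means."""
    r = np.corrcoef(x, y)[0, 1]            # Cor(X, Y)
    slope = r * np.std(y) / np.std(x)      # assumed relation: Cor(X, Y) * sigma_y / sigma_x
    intercept = np.mean(y) - slope * np.mean(x)
    return intercept, slope


beta0, beta1 = fit_simple_linear(x, y)
print(f"intercept={beta0:.3f}, slope={beta1:.3f}")
```

Under these assumptions the result should agree with a least-squares fit from any standard package.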
| + | |||
| + | Most modern data processing packages provide dedicated functions for building linear regression models with a few lines of code. The result is illustrated in the figure {{ref> | ||
| + | |||
| + | <figure Galton' | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
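As one possible illustration of such a dedicated function, NumPy's `np.polyfit` with degree 1 fits a straight line by least squares. The data below are hypothetical placeholder values, not the data set shown in the figure.

```python
import numpy as np

# Hypothetical father/child heights (illustrative values only).
x = np.array([65.0, 67.0, 68.0, 70.0, 72.0, 74.0])
y = np.array([66.0, 67.5, 68.0, 69.5, 70.0, 72.0])

# For deg=1, polyfit returns the coefficients as (slope, intercept).
slope, intercept = np.polyfit(x, y, deg=1)

# Model estimates for the observed x values.
predicted = intercept + slope * x
print(f"y' = {intercept:.3f} + {slope:.3f} * x")
```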
| + | |||
| + | ===== Errors and their meaning ===== | ||
| + | |||
| + | As discussed previously, an error in the context of the linear regression model represents the distance between the actual value of the dependent variable and the estimate provided by the model, which the following equation might represent: | ||
| + | |||
| + | <figure Coefficient velues> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | where: | ||
| + | * y'i – ith child height estimated by the model | ||
| + | * yi – ith child's true height | ||
| + | * ei – error of the model' | ||
| + | |||
| + | Since the error for a given yi might be positive or negative and the model itself minimises the overall error, one might expect the errors to be distributed around the model, with a mean value of 0 and a sum close to or equal to 0. Examples of the error for a few randomly selected data points are depicted in the following figure {{ref> | ||
| + | |||
| + | <figure Galton' | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | Unfortunately, | ||
| + | Another important aspect is the order of magnitude of the errors compared to the measurements, | ||
| + | |||
| + | <figure Error_distribution_example> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | In figure {{ref> | ||
| + | |||
| + | <figure Error_distribution_example2> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | From this discussion, a few essential notes have to be taken: | ||
| + | * Error distributions (around 0) should be treated as carefully as the models themselves; | ||
| + | * In most cases, the error distribution is difficult to notice even when the errors are illustrated; | ||
| + | * It is essential to look into the distribution to ensure that there are no regularities. | ||
| + | If any regularities are noticed, whether a simple increase in variance or a cyclic pattern, they point to something the model does not consider. They might indicate a lack of data, i.e., other factors that influence the modelled process but are not part of the model and are therefore exposed only through the nature of the error distribution. They might also indicate an oversimplified view of the problem, in which case more complex models should be considered. In any of these cases, a deeper analysis is needed. | ||
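A simple way to look for such regularities is to inspect the residuals directly. The sketch below uses synthetic data whose noise grows with x (an assumption made purely for illustration) and compares the residual variance in the lower and upper halves of the x range. The 50/50 split is an arbitrary choice, not a formal statistical test.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
# Synthetic data: the noise standard deviation increases with x.
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.1 + 0.2 * x)

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# For a least-squares fit with an intercept the residuals average to about 0,
# so the mean alone says nothing about hidden regularities.
mean_residual = residuals.mean()

# Compare the residual spread in the lower and upper halves of the x range.
half = len(x) // 2
var_low = residuals[:half].var()
var_high = residuals[half:].var()
variance_ratio = var_high / var_low

print(f"mean residual: {mean_residual:.2e}")
print(f"variance ratio (upper half / lower half): {variance_ratio:.2f}")
```

A ratio far from 1 suggests that the error variance is not constant along x, which is one of the regularities discussed above.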
| + | In a more general way, the linear model might be described with the following equation: | ||
| + | |||
| + | <figure Linear model> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | Here, the error is considered to be normally distributed around 0, with its standard deviation sigma and variance sigma squared. Variance provides at least a numerical insight into the error distribution; | ||
| + | |||
| + | <figure Sigma> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | Here, the variance estimated value' | ||
| + | |||
| + | <figure Variance> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
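As a numerical illustration, the sketch below estimates the error variance from the residuals of a simple linear model. It assumes the commonly used unbiased estimator, the sum of squared errors divided by n − 2 (two coefficients are estimated); the equation image is not reproduced here, so the exact form used in the figure may differ. The function name `residual_variance` is hypothetical.

```python
import numpy as np


def residual_variance(x, y):
    """Estimate the error variance of a simple linear fit as SSE / (n - 2)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (intercept + slope * x)
    # Two parameters (intercept and slope) are estimated, hence n - 2.
    return float(np.sum(residuals ** 2) / (len(x) - 2))


# Small hypothetical sample.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 0.0, 1.0])
sigma2 = residual_variance(x, y)
print(f"estimated variance: {sigma2:.3f}")
```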
| + | |||
| + | |||
| + | ===== Multiple linear regression ===== | ||
| + | |||
| + | In many practical problems, the target variable Y might depend on more than one independent variable X, for instance, wine quality, which depends on its level of serenity, amount of sugars, acidity and other factors. In the case of applying a linear regression model that doesn' | ||
| + | |||
| + | <figure Multiple linear model> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | During the application of the linear regression model, the error term to be minimised is described by the following equation: | ||
| + | |||
| + | <figure Multiple linear model error estimate> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
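A minimal sketch of fitting a model with several independent variables, assuming the error to be minimised is the sum of squared differences between the observed and estimated values. It uses `np.linalg.lstsq` on a design matrix with an added column of ones for the intercept; the data are hypothetical and generated from known coefficients so the result is easy to check.

```python
import numpy as np

# Hypothetical feature matrix: one row per sample, one column per independent variable.
X = np.array([
    [1.0, 2.0],
    [2.0, 0.0],
    [3.0, 1.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
# Targets generated exactly from y = 1 + 2*x1 - 1*x2 (no noise).
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so that the first coefficient acts as the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution minimising the sum of squared errors.
coefficients, _, _, _ = np.linalg.lstsq(X_design, y, rcond=None)
sse = float(np.sum((y - X_design @ coefficients) ** 2))

print("intercept and slopes:", coefficients)
print("sum of squared errors:", sse)
```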
| + | |||
| + | Unfortunately, | ||
| + | |||
| + | ===== Piecewise linear models ===== | ||
| + | |||
| + | Piecewise linear models, as the name suggests, allow splitting the overall data sample into pieces and building a separate model for every piece, thus achieving better prediction for the data sample. The formal representation of the model is as follows: | ||
| + | |||
| + | <figure Piecewise linear model> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | As can be seen, the individual models are still linear and individually simple. However, the main difficulty is setting the threshold values b that split the sample into pieces. | ||
| + | To illustrate the problem better, one might consider the following artificial data sample (figure {{ref> | ||
| + | |||
| + | <figure Complex_data_example> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | Intuition suggests splitting the sample into two pieces and, with the boundary b around 0, fitting a linear model for each of the pieces separately (figure {{ref> | ||
| + | |||
| + | <figure Piecewise_linear_model_two> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
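The two-piece idea can be sketched as follows, assuming a single boundary b that is known in advance. The V-shaped data are synthetic and the helper names `fit_piecewise` and `predict_piecewise` are hypothetical.

```python
import numpy as np


def fit_piecewise(x, y, b):
    """Fit one straight line to the points with x < b and another to the rest."""
    left = x < b
    right = ~left
    left_fit = np.polyfit(x[left], y[left], deg=1)      # (slope, intercept)
    right_fit = np.polyfit(x[right], y[right], deg=1)   # (slope, intercept)
    return left_fit, right_fit


def predict_piecewise(x, b, left_fit, right_fit):
    """Evaluate the left model for x < b and the right model otherwise."""
    return np.where(x < b, np.polyval(left_fit, x), np.polyval(right_fit, x))


# Synthetic V-shaped sample with the natural boundary at 0.
x = np.linspace(-5.0, 5.0, 101)
y = np.abs(x)

left_fit, right_fit = fit_piecewise(x, y, b=0.0)
y_hat = predict_piecewise(x, 0.0, left_fit, right_fit)
print("left (slope, intercept):", left_fit)
print("right (slope, intercept):", right_fit)
```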
| + | |||
| + | Since we do not know the exact best split, it might seem logical to experiment with different numbers of splits at different positions. For instance, a random number of splits might generate the following result (figure {{ref> | ||
| + | |||
| + | <figure Piecewise_linear_model_many> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | It is evident from the figure above that some of the individual linear models do not reflect the overall trends, i.e. the slope steepness and direction (positive or negative) seem to be incorrect. However, it is also apparent that those individual models might fit better within their own limited split of the sample. This simple example shows how difficult it is to select the number of splits and their boundaries. | ||
| + | Unfortunately, | ||
| + | * Using contextual information, | ||
| + | * Some additional methods might be used to find the best split automatically. In this case, software packages usually have tools for this. For Python developers, a very handy package mlinsights ((https:// | ||
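To give an idea of what an automatic search might look like, the sketch below tries every candidate position for a single boundary and keeps the one with the lowest total squared error. This is a plain brute-force illustration written with NumPy only; it is not the algorithm or the API of the mlinsights package, and the function names and the `min_points` parameter are hypothetical.

```python
import numpy as np


def sse_of_line(x, y):
    """Sum of squared errors of a least-squares straight line through (x, y)."""
    coeffs = np.polyfit(x, y, deg=1)
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))


def best_single_split(x, y, min_points=3):
    """Return (boundary, total SSE) of the best two-piece split found by exhaustive search."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_b, best_sse = None, np.inf
    # Each piece must keep at least min_points samples.
    for i in range(min_points, len(x) - min_points + 1):
        total = sse_of_line(x[:i], y[:i]) + sse_of_line(x[i:], y[i:])
        if total < best_sse:
            # Place the boundary halfway between the two neighbouring samples.
            best_b, best_sse = (x[i - 1] + x[i]) / 2.0, total
    return best_b, best_sse


# Synthetic V-shaped sample with a little noise; the true boundary is at 0.
rng = np.random.default_rng(1)
x = np.linspace(-5.0, 5.0, 101)
y = np.abs(x) + rng.normal(0.0, 0.05, size=x.size)

b, sse = best_single_split(x, y)
print(f"estimated boundary: {b:.2f}, total SSE: {sse:.3f}")
```

The search handles only one boundary; with several boundaries the number of combinations grows quickly, which is one reason dedicated tools are usually preferred.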