Module description - Statistical Learning (Statistisches Lernen)
Number | stl |
ECTS | 4.0 |
Specification | Find optimal f for y = f(x) by means of statistics |
Level | Advanced |
Content | Many statisticians argue that Data Science and Machine Learning are just new names for statistics. The discussion of this statement is left to the students, but machine learning is indeed not much more than fitting a function to a training data set, in the hope that the fitted function also generalizes to test data. Statistical Learning deals with estimating a function f that optimally solves the regression or classification problem y = f(x). This module explores different possible families of functions for f, in particular how they differ from one another in terms of an error or performance measure and, ultimately, which one is best suited to the problem under consideration. Importantly, all of this has to be done while taking the limited size of the chosen sample into account. |
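To make this concrete, the following Python/scikit-learn sketch fits a linear model to a training sample and compares its error on the training data with its error on held-out test data; the synthetic data set, noise level, and 70/30 split are illustrative assumptions, not prescriptions of the module.

```python
# Minimal sketch: estimate f from a training sample, then check whether
# the fitted function generalizes to held-out test data.
# The true function, noise level, and split ratio are illustrative choices.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))              # single predictor x
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 1, 200)    # y = f(x) + noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # estimate f from the training sample

print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test  MSE:", mean_squared_error(y_test, model.predict(X_test)))
```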
Learning outcomes |
LE1: Theoretical basics of statistical learning. Students are able to statistically formulate the regression and classification problems and their optimal solutions. They understand the difference between parametric and non-parametric function families, know suitable measures for assessing goodness of fit and are familiar, in particular, with the bias-variance tradeoff.
LE2: Linear regression. Students understand regression parameters as statistical quantities and can include categorical variables, interactions between variables, and non-linear relationships in regression problems. They are aware of the limitations of the linear regression method.
LE3: Classification problems. Students are familiar with the best-known approaches to solving classification problems (logistic regression, linear discriminant analysis (LDA), Naive Bayes) and are able to apply them to appropriate data sets.
LE4: Generalized Linear Models (GLMs). Students understand GLMs as a generalization of the classical regression model. They know the application areas of frequently used link functions and can model suitable data sets with them.
LE5: Resampling. Students can account statistically for the impact of a limited sample on performance measures using cross-validation (CV) and the bootstrap.
LE6: Model selection. Students can select the best model from a group of models using various selection criteria (subset selection, AIC, BIC, adjusted R²) while taking the limited size of the sample into account.
LE7: Non-linear regression. Students recognize the applications of non-linear regression and, in particular, are able to fit polynomial regression, splines, local regression, and generalized additive models (GAMs) to data. |
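As an illustration of the resampling, model-selection, and non-linear-regression outcomes above, the sketch below uses 5-fold cross-validation to compare polynomial regression models of increasing flexibility on a small synthetic sample; Python with scikit-learn, the candidate degrees, and the data-generating function are assumptions made purely for illustration.

```python
# Sketch: compare polynomial regression models of increasing flexibility
# on a limited sample using 5-fold cross-validation.
# Data generation and the candidate degrees are illustrative assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 100)      # non-linear truth plus noise

for degree in (1, 3, 5, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # 5-fold CV estimate of the test MSE; the smallest estimate wins
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    print(f"degree {degree:2d}: CV MSE = {-scores.mean():.3f}")
```

On data of this kind the CV error typically falls from degree 1 to a moderate degree and rises again for very flexible fits, which is the bias-variance tradeoff made visible on a limited sample.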
Evaluation | Mark |
Built on the following competences | Probability Modelling (WER), Exploratory Data Analysis (EDA), Foundation in Linear Algebra (GLA), Foundation in Calculus (GAN), Linear and Logistic Regression (LLR) |
Module type | Portfolio Module |