Description
This module introduces several fundamental ways in which scalability plays a role in statistical data science, namely large data (in both the number of observations and the number of covariates) and large models (with inferential, engineering and computational implications). It is primarily intended for third- and fourth-year undergraduates and taught postgraduates registered on the degree programmes offered by the Department of Statistical Science (including the CSML programme). The academic prerequisites for these students (in addition to their compulsory modules) are STAT0041 and STAT0042 (UG), or one of COMP0078 / COMP0088 (PGT).
Intended Learning Outcomes
- understand the statistical assumptions, pitfalls and possibilities in the analysis of high-dimensional data;
- be able to scale up statistical inference and machine learning for large datasets;
- be able to efficiently deploy methods for fitting and comparing complex models;
- be able to lead and coordinate projects for heterogeneous and structured data;
- have a deeper understanding of trade-offs between modelling flexibility and computational costs (Level 7 only).
Applications - Technological advances have brought new ways of generating data, as well as allowing for more complex models to be developed using improved computational resources. Students seeking to reach the forefront of data science must understand how such massive datasets and models can be manipulated effectively using advanced statistical methodology and large-scale algorithms. This module will allow students with previous exposure to statistical inference and machine learning to acquire further skills to handle data and models too complex to be approached by standard methods.
Indicative Content - Formalising the challenges of inference at scale through empirical risk minimisation, theoretical convergence rates and the computational complexity of algorithms. Dealing with large-scale samples (scale 1): gradient-based optimisation, stochastic optimisation and parallelism. High-dimensional statistics (scale 2): the curse of dimensionality, dealing with a large or infinite number of variables for problems of regression, and dimensionality reduction. Dealing with expensive problems (scale 3): Gaussian process regression, active learning and Bayesian optimisation.
Key Texts - Available from .
Module deliveries for 2024/25 academic year
Last updated
This module description was last updated on 19th August 2024.