# Topological and geometrical methods in data analysis

**Time:**
Fri 2021-06-11 14.00

**Location:**
Via Zoom: https://kth-se.zoom.us/j/69867498539, Meeting ID: 698 6749 8539 (English)

**Subject area:**
Mathematics

**Doctoral student:**
Oliver Gäfvert, Matematik (Inst.)

**Opponent:**
Professor Henry Schenck, Auburn University, USA

**Supervisor:**
Professor Sandra di Rocco, Matematik (Avd.), Matematik

## Abstract

This thesis concerns two related data analysis pipelines, which use topological and geometrical methods, respectively, to extract relevant information. The first pipeline, referred to as the *topological data analysis (TDA) pipeline*, constructs a filtered simplicial complex on a given data set in order to describe its shape. The shape is described using a *persistence module*, which characterizes the topological features of the filtration, and the final step of the pipeline extracts algebraic invariants from this object. The second pipeline, referred to as the *geometric data analysis (GDA) pipeline*, associates an algebraic variety to a given data set and aims to describe the structure of this variety. Its structure is described using *homology*, an invariant which for most algebraic varieties can only be computed numerically using sampling methods.
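As a rough illustration of the TDA pipeline in its simplest setting, the sketch below computes the degree-0 barcode of a one-parameter Vietoris-Rips filtration on a point cloud. It is not taken from the thesis: the function name `h0_barcode` is hypothetical, and the sketch assumes that an edge enters the filtration at the Euclidean distance between its endpoints. Every point is born at scale 0, and a bar ends each time two connected components merge.

```python
import math
from itertools import combinations


def h0_barcode(points):
    """Degree-0 persistence bars (birth, death) of a Vietoris-Rips filtration.

    Assumption: an edge {i, j} appears at scale dist(points[i], points[j]).
    """
    n = len(points)
    if n == 0:
        return []

    # Union-find structure tracking connected components.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Process edges in order of appearance in the filtration.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    bars = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge: one of them dies at this scale.
            parent[root_i] = root_j
            bars.append((0.0, length))

    # One component survives forever.
    bars.append((0.0, math.inf))
    return bars


print(h0_barcode([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]))
# [(0.0, 1.0), (0.0, 4.0), (0.0, inf)]
```

The two finite bars record the scales at which the three points merge into one component. Higher-degree homology and the multi-parameter case treated in the thesis need more machinery than this sketch shows.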

In Paper A we consider invariants on multi-parameter persistence modules. We explain how to convert discrete invariants into stable ones via what we call hierarchical stabilization. We illustrate this process by constructing stable invariants for multi-parameter persistence modules with respect to the interleaving distance and so-called simple noise systems. For one parameter, we recover the standard barcode information. For more than one parameter, we prove that the constructed invariants are in general NP-hard to calculate. A consequence is that computing the feature counting function, proposed by Scolamiero et al. (2016), is NP-hard.

In Paper B we introduce an efficient algorithm to compute a minimal presentation of a multi-parameter persistent homology module, given a chain complex of free modules as input. Our approach extends previous work on this problem in the 2-parameter case, and draws on ideas underlying the F4 and F5 algorithms for Gröbner basis computation. In the $r$-parameter case, our algorithm computes a presentation for the homology of $C \xrightarrow{F} A \xrightarrow{G} B$, where the modules $C$, $A$, $B$ have ranks $l$, $n$, $m$, respectively, in $O(r^{2}n^{r+1} + n^{r}m + n^{r-1}m^{2} + rn^{2}l)$ arithmetic operations. We implement this approach in our new software Muphasa, written in C++. In preliminary computational experiments on synthetic TDA examples, we compare our approach to a version of a classical approach based on Schreyer's algorithm, and find that ours is substantially faster and more memory efficient. In the course of developing our algorithm for computing presentations, we also introduce algorithms for the closely related problems of computing Gröbner bases for the image and kernel of the morphism $G$. This algorithm runs in time $O(n^{r}m + n^{r-1}m^{2})$ and memory $O(n^{2} + mn + nr + K)$, where $K$ is the size of the output.
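To make the operation count concrete, it can be specialized to the 2-parameter case by substituting $r = 2$ and dropping constant factors. The last line is only an illustration and rests on an extra assumption that is not part of the abstract, namely that $l$ and $m$ are both $O(n)$.

```latex
\begin{align*}
O\bigl(r^{2}n^{r+1} + n^{r}m + n^{r-1}m^{2} + rn^{2}l\bigr)\Big|_{r=2}
  &= O\bigl(4n^{3} + n^{2}m + nm^{2} + 2n^{2}l\bigr) \\
  &= O\bigl(n^{3} + n^{2}m + nm^{2} + n^{2}l\bigr) \\
  &= O\bigl(n^{3}\bigr) \qquad \text{assuming } l, m = O(n).
\end{align*}
```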

Paper C analyzes the complexity of fitting a variety, coming from a class of varieties, to a configuration of points in $\mathbb{R}^{N}$. The complexity measure, called the *algebraic complexity*, computes the Euclidean Distance Degree (EDD) of a certain variety called the *hypothesis variety* as the number of points in the configuration increases. For the problem of fitting an $(N-1)$-sphere to a configuration of $m$ points in $\mathbb{R}^{N}$, we give a closed formula for the algebraic complexity of the hypothesis variety as $m$ grows in the case $N=1$. For the case $N>1$ we conjecture a generalization of this formula, supported by numerical experiments. Finally, we establish a connection to the complexity of architectures of polynomial neural networks.
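The fitting problem in the case N=1 can be made concrete with a small sketch. A 0-sphere on the real line with centre c and radius r is the pair of points {c - r, c + r}, so fitting it to m points by least squares amounts to splitting the sorted points into a left group and a right group and placing the two sphere points at the group means. The code below is a brute-force real-valued illustration under that reading, not the method of Paper C: the name `fit_zero_sphere` is hypothetical, and the sketch finds only the best real fit. It does not compute the Euclidean Distance Degree or the algebraic complexity.

```python
def fit_zero_sphere(points):
    """Least-squares fit of a 0-sphere {c - r, c + r} to points on the real line.

    Returns (c, r, cost), where cost is the sum of squared distances from
    each point to the nearer of the two sphere points.
    """
    xs = sorted(points)
    m = len(xs)
    if m < 2:
        raise ValueError("need at least two points")

    best = None
    # An optimal assignment is contiguous in sorted order, so try every split.
    for k in range(1, m):
        left, right = xs[:k], xs[k:]
        a = sum(left) / k          # best position for c - r given this split
        b = sum(right) / (m - k)   # best position for c + r given this split
        cost = sum((x - a) ** 2 for x in left) + sum((x - b) ** 2 for x in right)
        if best is None or cost < best[0]:
            best = (cost, (a + b) / 2, (b - a) / 2)

    cost, c, r = best
    return c, r, cost


c, r, cost = fit_zero_sphere([-1.1, -0.9, 0.9, 1.1])
print(c, r, cost)
```

For the four sample points, the best split separates the two negative points from the two positive ones, giving a centre near 0 and a radius near 1.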

In Paper D we present an efficient algorithm to produce a provably dense sample of a smooth compact variety. The procedure is partly based on computing *bottlenecks* of the variety. Using geometric information such as the bottlenecks and the *local reach*, we also provide bounds on the density of the sample needed in order to guarantee that the homology of the variety can be recovered from the sample. An implementation of the algorithm is provided together with numerical experiments and a computational comparison to the algorithm by Dufresne et al. (2019).