Clustering Methods (5 cp) 3621552


Course description

Clustering is a basic tool in data analysis, pattern recognition and machine learning for finding groups in data. K-means is still the most popular clustering algorithm. But is it good enough? How do we decide the number of clusters? What if the data is non-numerical, such as categorical, graph or text data, or consists of more complex objects like GPS trajectories? Outliers, noise and missing values also degrade clustering performance, so how should they be handled? Beyond these questions, clustering is an optimization problem like any other. It involves three main design problems: (1) define a distance function suitable for the data, (2) select a cost function to measure the goodness of the clusters, (3) design an algorithm to optimize the cost function.
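
As an illustration of the three design problems, here is a minimal sketch in Python. It assumes numerical data, squared Euclidean distance as the distance function, the sum of squared errors (SSE) as the cost function and plain k-means as the optimizer. The function names and the toy data are illustrative assumptions only and are not taken from the course material.

```python
import random


def squared_euclidean(a, b):
    """(1) Distance function: squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(points, centroids, labels):
    """(2) Cost function: sum of squared errors of a given partition."""
    return sum(squared_euclidean(p, centroids[l]) for p, l in zip(points, labels))


def kmeans(points, k, iterations=100, seed=0):
    """(3) Algorithm: plain k-means iterations that reduce the SSE."""
    rng = random.Random(seed)
    # Assumed initialization: k randomly chosen data points.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Partition step: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_euclidean(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # partition unchanged, so the centroids are final
        labels = new_labels
        # Centroid step: move each centroid to the mean of its points.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Toy data (hypothetical): two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels = kmeans(points, k=2)
print(sorted(centroids), sse(points, centroids, labels))
```

Swapping any one of the three functions, for example a different distance for categorical data or a different optimizer than k-means, leaves the other two in place, which is why the sketch keeps them separate.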

Teaching methods

The course is arranged as (1) a series of YouTube video lectures with related discussion sessions (Teams) every Thursday; (2) exercises every Tuesday; (3) a series of mini-exams (to be implemented later) or a classical 4-hour offline exam. Students are also required to implement a clustering program that will be gradually extended during the exercises. Suitable programming languages are Python, C, C++, C#, Java, JavaScript, R, Matlab, PHP, Go and Ruby.

Teachers

Lecturer: Pasi Fränti
Course assistants: Sami Sieranoja and Gulraiz I Choudhary

Intro

Schedule

Video lectures (~28h): Thursdays 14-16 (Teams)
Exercises (7): Tuesdays 14-16 (Teams)
Starting from 12.1.2021 (Intro lecture)

12.1. Discussion of practicalities
13.1. Introduction to clustering
20.1. K-means, Fast k-means, Random swap
27.1. ---
3.2. Graph clustering, Mumford-Shah k-means
10.2. Cost functions, text clustering, clustering of web pages
17.2. Clustering evaluation, outlier detection
24.2. Number of clusters, location-based data
3.3. Divisive clustering, Genetic algorithm
8.3. Density peaks, case study
10.3. Agglomerative clustering (on-line + discussion)

Video lectures

All lectures on YouTube

Exercises

Exercise 1: 18.1.
Exercise 2: 1.2.
Exercise 3: 8.2.
Exercise 4: 15.2.
Exercise 5: 22.2.
Exercise 6: 1.3.
Exercise 7: 8.3.

Submit your exercises in Moodle

Preliminary knowledge

Design & Analysis of Algorithms

Exams

18.3. 12-16, Room M100 (Joensuu), Room CA101 (Kuopio)
22.4. 12-16, Room M100 (Joensuu), Room CA101 (Kuopio)

Links

Clusterator
Animator
Lecture notes and material from 2014