Clustering datasets

Speech and Image Processing Unit
School of Computing
University of Eastern Finland
P.O.Box 111
FIN-80101 Joensuu
Finland

Image data
[bridge.pgm]
Bridge
(256x256)

N=4096, D=16
4x4 pixel blocks  ts  txt
4x4 binarized pixel blocks  ts  txt
4x4 pixel blocks: 25% randomly sampled (for training)  ts  txt
4x4 pixel blocks: 75% randomly sampled (for testing)  ts  txt
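
The figures N=4096 and D=16 follow from tiling the 256x256 image with non-overlapping 4x4 blocks: (256/4)^2 = 4096 blocks of 16 pixels each. A minimal sketch of that tiling, assuming a grayscale image already loaded as a NumPy array; the helper name `image_to_blocks`, the random stand-in image, and the way the 25%/75% split is drawn are illustrative assumptions, not the procedure used to build the files above.

```python
import numpy as np

def image_to_blocks(image: np.ndarray, block: int = 4) -> np.ndarray:
    """Split a 2-d image into non-overlapping block x block tiles,
    each flattened row by row into one vector."""
    h, w = image.shape
    if h % block or w % block:
        raise ValueError("image size must be a multiple of the block size")
    tiles = image.reshape(h // block, block, w // block, block)
    # Reorder axes to (tile_row, tile_col, row_in_tile, col_in_tile).
    return tiles.transpose(0, 2, 1, 3).reshape(-1, block * block)

# Stand-in for a 256x256 grayscale image (not the real bridge.pgm pixels).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)

vectors = image_to_blocks(image)
print(vectors.shape)  # (4096, 16)

# Assumed way to draw a random 25% / 75% split of the block vectors.
perm = rng.permutation(len(vectors))
n_train = len(vectors) // 4
train, test = vectors[perm[:n_train]], vectors[perm[n_train:]]
```

The same arithmetic applies to any image whose sides are multiples of 4; a 360x288 frame, for example, yields 90 x 72 = 6480 blocks.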
[house.ppm]
House
(256x256)

N=34112, D=3
RGB-values, quantized to 5 bits per color  ts  txt
RGB-values, 8 bits per color  ts  txt
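
Quantizing an 8-bit color channel to 5 bits can be illustrated by keeping the five most significant bits. This is only a sketch under that assumption: the page does not say which quantization rule was applied, and the function name `quantize_rgb` and the sample pixels are hypothetical.

```python
import numpy as np

def quantize_rgb(pixels: np.ndarray, bits: int = 5) -> np.ndarray:
    """Keep the `bits` most significant bits of each 8-bit channel."""
    return pixels >> (8 - bits)

# Three made-up RGB pixels, 8 bits per color.
rgb = np.array([[255, 128, 7],
                [250, 131, 0],
                [0, 64, 200]], dtype=np.uint8)

q = quantize_rgb(rgb)        # every channel now lies in 0..31
unique = np.unique(q, axis=0)  # distinct colors after quantization
print(q)
print(len(unique))
```

Nearby colors can collapse to the same quantized value, as the first two sample pixels do here.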
[missa001.pgm]
Miss America
(360x288)

N=6480, D=16
4x4 pixel blocks from the difference image of frame 1 and 2  ts  txt
4x4 pixel blocks from the difference image of frame 2 and 3  ts  txt
[europe]
Europe
(vector)

N=169308, D=2
Differential coordinates of Europe map ts  txt  original 
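
One common reading of "differential coordinates" is the difference between consecutive points along a map outline. The sketch below assumes that reading, which the page itself does not confirm; the outline values are made up.

```python
import numpy as np

# Hypothetical absolute (x, y) coordinates along one map outline.
outline = np.array([[100, 200],
                    [103, 198],
                    [107, 199],
                    [107, 205]])

# Differential coordinates: each point minus its predecessor.
differentials = np.diff(outline, axis=0)
print(differentials.tolist())  # [[3, -2], [4, 1], [0, 6]]

# The original points can be restored from the first point and the differences.
restored = outline[0] + np.cumsum(differentials, axis=0)
```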
Birch-sets

Synthetic 2-d data with N=100,000 vectors and M=100 clusters.

Zhang et al., "BIRCH: A new data clustering algorithm and its applications", Data Mining and Knowledge Discovery, 1 (2), 141-182, 1997.

Birch1: Clusters in regular grid structure  ts  txt
Birch2: Clusters at a sine curve  ts  txt
Birch3: Random sized clusters in random locations  ts  txt
 
S-sets
Synthetic 2-d data with N=5000 vectors and M=15 Gaussian clusters with different degree of cluster overlapping.

P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.

S1:  ts  txt
S2:  ts  txt
S3:  ts  txt
S4:  ts  txt

Ground truth centroids and partitions:  zip
s3 and s4 updated 4.2.2015
 
A-sets
A1: N=3000, M=20
A2: N=5250, M=35
A3: N=7500, M=50

Synthetic 2-d data with varying number of vectors (N) and clusters (M). There are 150 vectors per cluster.

I. Kärkkäinen and P. Fränti, "Dynamic local search algorithm for the clustering problem", Research Report A-2002-6 (pdf)

A1:  ts  txt
A2:  ts  txt
A3:  ts  txt

Dim-sets
Dim2
Synthetic data with Gaussian clusters in multi-dimensional space.
1351-10126 vectors in 2-15 dimensional space

ts  txt
DIM-sets (high)
High-dimensional data sets with N=1024 vectors and M=16 Gaussian clusters.

P. Fränti, O. Virmajoki and V. Hautamäki, "Fast agglomerative clustering using a k-nearest neighbor graph", IEEE Trans. on Pattern Analysis and Machine Intelligence, 28 (11), 1875-1881, November 2006.

Ground truth centroids in cb and txt format.

Data sets in TS and TXT, ground truth partitions in PA format:
dim032 (32 dimensions):  ts  txt  pa
dim064 (64 dimensions):  ts  txt  pa
dim128 (128 dimensions):  ts  txt  pa
dim256 (256 dimensions):  ts  txt  pa
dim512 (512 dimensions):  ts  txt  pa
dim1024 (1024 dimensions):  ts  txt  pa

KDDCUP04Bio set
KDDCUP04Bio
N=145751, M=2000, D=74
KDDCUP04Bio biology dataset.

KDDCUP04Bio:  ts  txt
UCI datasets
Thyroid: N=215, M=2, D=5  ts  txt
Wine: N=178, M=3, D=13  ts  txt
Yeast: N=1484, M=10, D=8  txt  ts  integer
Breast: N=699, M=2, D=9  ts  txt
Iris: N=150, M=3, D=4  ts  txt  labels
Glass: N=214, M=7, D=9  ts  txt  labels
Wdbc: N=569, M=2, D=32  ts  numeric (31-d)  full (32-d)

The original source of the UCI datasets is http://archive.ics.uci.edu/ml/

Breast-Cancer-Wisconsin: We have removed features 1 (sample id) and 11 (class label). All missing values have been replaced with the value 1.
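
The Breast-Cancer-Wisconsin preprocessing described in this section (drop feature 1, the sample id, and feature 11, the class label, then set every missing value to 1) can be sketched as follows. The rows are made up, and both the comma-separated layout and the "?" marker for missing values are assumptions about the original UCI file rather than something stated on this page.

```python
import io
import numpy as np

# Made-up rows in an 11-column layout:
# sample id, nine features, class label. "?" is assumed to mark a missing value.
raw = io.StringIO(
    "1,5,1,1,1,2,1,3,1,1,2\n"
    "2,8,4,5,1,2,?,7,3,1,4\n"
    "3,6,6,6,9,6,?,7,8,1,2\n"
)

rows = [line.strip().split(",") for line in raw]

# Drop the first column (sample id) and the last column (class label).
features = [row[1:-1] for row in rows]

# Replace every missing value with 1 and convert the rest to integers.
cleaned = np.array([[1 if v == "?" else int(v) for v in row] for row in features])
print(cleaned.shape)  # (3, 9)
```

The nine remaining columns match D=9 given for the Breast set.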
Categorical
Census
N=1000-512000, D=68
zip
Categorical attributes from Public Use Microdata Samples (PUMS) person records. Includes subsets of size 1000, 2000, 4000, ..., 512000. Source
g2 sets
g2-2-30
Gaussian clusters dataset: 2 clusters, 1024 vectors per cluster, 1-1024 dimensions, variance 10-100.

g2:  ts files in zip file (53MB)
Shape sets
Third column is the label.
Aggregation
N=788, M=7, D=2
  Aggregation:  txt
Gionis, A., H. Mannila, and P. Tsaparas, Clustering aggregation. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007. 1(1): p. 1-30.
Compound
N=399, M=6, D=2
  Compound:  txt
Zahn, C.T., Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, 1971. 100(1): p. 68-86.
Pathbased
N=300, M=3, D=2
  Pathbased:  txt
Chang, H. and D.Y. Yeung, Robust path-based spectral clustering. Pattern Recognition, 2008. 41(1): p. 191-203.
Spiral
N=312, M=3, D=2
  Spiral:  txt
Chang, H. and D.Y. Yeung, Robust path-based spectral clustering. Pattern Recognition, 2008. 41(1): p. 191-203.
D31
N=3100, M=31, D=2
  D31:  txt
Veenman, C.J., M.J.T. Reinders, and E. Backer, A maximum variance cluster algorithm. IEEE Trans. Pattern Analysis and Machine Intelligence, 2002. 24(9): p. 1273-1280.
R15
N=600, M=15, D=2
  R15:  txt
Veenman, C.J., M.J.T. Reinders, and E. Backer, A maximum variance cluster algorithm. IEEE Trans. Pattern Analysis and Machine Intelligence, 2002. 24(9): p. 1273-1280.
Jain
N=373, M=2, D=2
  Jain:  txt
Jain, A. and M. Law, Data clustering: A user's dilemma. Lecture Notes in Computer Science, 2005. 3776: p. 1-10.
Flame
N=240, M=2, D=2
  Flame:  txt
Fu, L. and E. Medico, FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC bioinformatics, 2007. 8(1): p. 3.
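
Because the third column of each shape-set file is the label, a file can be split into points and ground-truth labels after loading. A minimal sketch, assuming whitespace-separated numeric columns; the four sample rows are made up and stand in for a downloaded txt file.

```python
import io
import numpy as np

# Made-up rows in the "x y label" layout; replace with the path of a downloaded file.
sample = io.StringIO(
    "15.5 28.6 2\n"
    "14.9 27.5 2\n"
    "3.2 4.1 1\n"
    "3.9 5.0 1\n"
)

data = np.loadtxt(sample)
points = data[:, :2]              # first two columns: 2-d coordinates
labels = data[:, 2].astype(int)   # third column: cluster label
n_clusters = len(np.unique(labels))
print(points.shape, n_clusters)
```

For a real file, `points.shape[0]` and `n_clusters` should agree with the N and M listed for that set.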
Mopsi locations
Users' locations, Finland
N=13467, D=2
Mopsi user locations in Finland until 2012.

Users' locations:  cb  txt 
Users' locations, Joensuu
N=6014, D=2
Mopsi user locations in Joensuu, 2012.

Users' locations Joensuu:  ts  txt 
Miscellaneous
t4.8k
N=8000, M=6, D=2
t4.8k.txt
G. Karypis, E.H. Han, V. Kumar, CHAMELEON: A hierarchical clustering algorithm using dynamic modeling, IEEE Trans. on Computers, 32 (8), 68-75, 1999.

ConfLongDemo
N=164,860, M=11, D=3
txt
ConfLongDemo has eight attributes, of which only the three numerical attributes are included here.

MNIST
N=10000, M=10, D=784
txt
MNIST covers the 10 handwritten digits and contains 60,000 training patterns and 10,000 test patterns of 784 dimensions.

MiniBooNE
N=130,065, D=50
txt

Related links