Clustering datasets

Speech and Image Processing Unit
School of Computing
University of Eastern Finland
P.O.Box 111
FIN-80101 Joensuu
Finland
Image data
[bridge.pgm]
Bridge
(256x256)

4096 vectors, 16-d
4x4 pixel blocks  ts  txt
4x4 binarized pixel blocks  ts  txt
4x4 pixel blocks: 25% random subsample  ts  txt
4x4 pixel blocks: 75% random subsample  ts  txt
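
The vector count follows from the image size: a 256x256 image holds 64x64 = 4096 non-overlapping 4x4 blocks, each flattened into one 16-d vector. A minimal sketch of that blocking step, assuming row-major ordering inside each block (the ordering used in the actual files is not stated here) and using a synthetic stand-in image rather than the Bridge data; `image_to_blocks` is a hypothetical helper name:

```python
def image_to_blocks(pixels, block=4):
    """Split a 2-D list of pixel rows into non-overlapping block x block vectors."""
    height, width = len(pixels), len(pixels[0])
    vectors = []
    for top in range(0, height - block + 1, block):
        for left in range(0, width - block + 1, block):
            # Flatten the block row by row (assumed ordering).
            vec = [pixels[top + dy][left + dx]
                   for dy in range(block) for dx in range(block)]
            vectors.append(vec)
    return vectors

# Synthetic 256x256 stand-in image, not the real Bridge data.
image = [[(x + y) % 256 for x in range(256)] for y in range(256)]
blocks = image_to_blocks(image)
print(len(blocks), len(blocks[0]))  # 4096 16
```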
[house.ppm]
House
(256x256)

34112 vectors, 3-d
RGB-values, quantized to 5 bits per color  ts  txt
RGB-values, 8 bits per color  ts  txt
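
One common way to reduce an 8-bit color channel to 5 bits is to drop its three least significant bits. Whether the House files were produced exactly this way is an assumption; the sketch below only illustrates the idea, and `quantize_rgb` is a hypothetical helper name:

```python
def quantize_rgb(rgb, bits=5):
    """Keep the top `bits` bits of each 8-bit channel (assumed quantization rule)."""
    shift = 8 - bits
    return tuple(channel >> shift for channel in rgb)

print(quantize_rgb((255, 128, 7)))  # (31, 16, 0)
```

With 5 bits per color each channel takes values 0 to 31, so many distinct 8-bit colors collapse to the same quantized vector.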
[missa001.pgm]
Miss America
(360x288)

6480 vectors, 16-d
4x4 residual blocks between frames 1 and 2  ts  txt
4x4 residual blocks between frames 2 and 3  ts  txt
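
A residual frame is the pixel-wise difference between two consecutive frames; cutting a 360x288 residual into 4x4 blocks gives 90x72 = 6480 vectors. A minimal sketch, assuming the residual is taken as next frame minus previous frame (the sign convention of the actual files is not stated here) and using synthetic frames instead of the Miss America sequence:

```python
def frame_residual(prev_frame, next_frame):
    """Pixel-wise difference next - prev (assumed sign convention)."""
    return [[b - a for a, b in zip(row_prev, row_next)]
            for row_prev, row_next in zip(prev_frame, next_frame)]

width, height, block = 360, 288, 4
# Synthetic stand-in frames, not the real video data.
frame1 = [[(x + y) % 256 for x in range(width)] for y in range(height)]
frame2 = [[(x + y + 1) % 256 for x in range(width)] for y in range(height)]

residual = frame_residual(frame1, frame2)
n_blocks = (width // block) * (height // block)
print(n_blocks)  # 6480
```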
 
Birch-sets

100,000 synthetic 2-d vectors in 100 clusters.

Zhang et al., "BIRCH: A new data clustering algorithm and its applications", Data Mining and Knowledge Discovery, 1 (2), 141-182, 1997.

 
Birch1: Clusters in regular grid structure  ts  txt
Birch2: Clusters at a sine curve  ts  txt
Birch3: Random sized clusters in random locations  ts  txt
 
S-sets
Synthetic 2-d data with 5000 vectors and 15 Gaussian clusters with varying degrees of cluster overlap.

P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.

S1:  ts  txt
S2:  ts  txt
S3:  ts  txt
S4:  ts  txt

Source and labels:  zip
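
Data of a similar shape can be generated for quick experiments. The sketch below is not the generator used for S1-S4; it only shows one plausible way to draw 5000 2-d vectors from 15 Gaussian clusters, where the hypothetical `spread` parameter controls how much the clusters overlap:

```python
import random

def make_gaussian_clusters(n_vectors=5000, n_clusters=15, spread=0.03, seed=0):
    """Draw 2-d points around random centers in the unit square."""
    rng = random.Random(seed)
    centers = [(rng.random(), rng.random()) for _ in range(n_clusters)]
    points, labels = [], []
    for i in range(n_vectors):
        k = i % n_clusters  # cycle through clusters so sizes are nearly equal
        cx, cy = centers[k]
        points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
        labels.append(k)
    return points, labels

points, labels = make_gaussian_clusters()
print(len(points), len(set(labels)))  # 5000 15
```

Raising `spread` makes neighboring clusters blend into each other, which mimics the increasing overlap across a series of sets.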
 
A-sets
Synthetic 2-d data with varying number of clusters and vectors.

A1: 3000 vectors, 20 clusters  ts  txt
A2: 5250 vectors, 35 clusters  ts  txt
A3: 7500 vectors, 50 clusters  ts  txt
 
Dim-sets
Dim2
Synthetic data with Gaussian clusters in multi-dimensional space.
1351-10126 vectors, 2-d to 15-d

ts  txt
DIM-sets (other)
1024 vectors and 16 clusters in each set.

DIM032: 32 dimensions  ts  txt
DIM064: 64 dimensions  ts  txt
DIM128: 128 dimensions  ts  txt
DIM256: 256 dimensions  ts  txt
DIM512: 512 dimensions  ts  txt
DIM1024: 1024 dimensions  ts  txt

Ground truths in cb and txt format.
 
 
KDDCUP04Bio set
KDDCUP04Bio
145751 vectors,
2000 clusters
74 dimensions
  KDDCUP04Bio biology dataset.

KDDCUP04Bio:  ts  txt
UCI datasets
Thyroid
215 vectors,
2 clusters
5 dimensions
  Thyroid dataset.

Thyroid:  ts  txt
Wine
178 vectors,
3 clusters
13 dimensions
  Wine dataset.

Wine:  ts  txt
Yeast
1484 vectors,
10 clusters
8 dimensions
  Yeast dataset.

Yeast:  txt
Yeast_times100:  ts  txt
Breast
699 vectors,
2 clusters
9 dimensions
  Breast-cancer-Wisconsin dataset.

Breast:  ts  txt

info
Iris
150 vectors,
4 dimensions
3 clusters
  Iris dataset.

Iris:  ts
txt without labels
txt with labels
Glass
214 vectors,
9 dimensions
7 clusters
  Glass dataset.

Glass:  ts
txt without labels
txt with labels
Wdbc
569 vectors,
32 dimensions
2 clusters
  Wdbc dataset.

Wdbc:  ts
txt numeric, 31 dim.
txt
g2 sets
g2-2-30
1024 vectors per cluster,
2 clusters
1-1024 dimensions
variance 10-100
  Gaussian clusters dataset.

g2:  ts's in zip file (53MB) 
Shape sets
Third column is the label.
Aggregation
788 vectors,
2 dimensions
7 clusters
  Aggregation:  txt
Gionis, A., H. Mannila, and P. Tsaparas, Clustering aggregation. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007. 1(1): p. 1-30.
Compound
399 vectors,
2 dimensions
6 clusters
  Compound:  txt
Zahn, C.T., Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, 1971. C-20(1): p. 68-86.
Pathbased
300 vectors,
2 dimensions
3 clusters
  Pathbased:  txt
Chang, H. and D.Y. Yeung, Robust path-based spectral clustering. Pattern Recognition, 2008. 41(1): p. 191-203.
Spiral
312 vectors,
2 dimensions
3 clusters
  Spiral:  txt
Chang, H. and D.Y. Yeung, Robust path-based spectral clustering. Pattern Recognition, 2008. 41(1): p. 191-203.
D31
3100 vectors,
2 dimensions
31 clusters
  D31:  txt
Veenman, C.J., M.J.T. Reinders, and E. Backer, A maximum variance cluster algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002. 24(9): p. 1273-1280.
R15
600 vectors,
2 dimensions
15 clusters
  R15:  txt
Veenman, C.J., M.J.T. Reinders, and E. Backer, A maximum variance cluster algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002. 24(9): p. 1273-1280.
Jain
373 vectors,
2 dimensions
2 clusters
  Jain:  txt
Jain, A. and M. Law, Data clustering: A user's dilemma. Lecture Notes in Computer Science, 2005. 3776: p. 1-10.
Flame
240 vectors,
2 dimensions
2 clusters
  Flame:  txt
Fu, L. and E. Medico, FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC bioinformatics, 2007. 8(1): p. 3.
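
Since the shape sets store the label in the third column, a file can be split into coordinates and labels with a few lines of parsing. A minimal sketch, assuming whitespace-separated columns (the exact delimiter is an assumption) and using made-up sample rows rather than values from any of the files; `parse_labeled_points` is a hypothetical helper name:

```python
def parse_labeled_points(text):
    """Return (points, labels) from lines of the form 'x y label'."""
    points, labels = [], []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        points.append((float(fields[0]), float(fields[1])))
        labels.append(int(float(fields[2])))
    return points, labels

# Made-up rows for illustration only.
sample = "1.0 2.0 1\n1.5 2.5 1\n\n9.0 8.0 2\n"
points, labels = parse_labeled_points(sample)
print(points)  # [(1.0, 2.0), (1.5, 2.5), (9.0, 8.0)]
print(labels)  # [1, 1, 2]
```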
Mopsi locations
Users' locations
13467 vectors,
2 dimensions
  Mopsi locations Finland until 2012 dataset.

Users' locations:  cb  txt 
Users' locations, Joensuu
6014 vectors,
2 dimensions
  Users' locations in Joensuu 2012 dataset.

Users' locations Joensuu:  ts  txt 
Europe
Europe
169308 vectors,
2 dimensions
  Europe dataset.

Europe:  ts  txt 
Miscellaneous
ConfLongDemo_JSI_164860
t4.8k
MINST
  ConfLongDemo_JSI_164860.txt 
t4.8k.txt 
MINST.txt 
