GETTING STARTED USING CB-MODULE
Ismo Kärkkäinen, Pasi Fränti
Last modified: 8.12.2000 TKa
               1.12.2000 PF

A Quick Survival Guide

1. Introduction
2. Reading and initialization
3. Basic information about data
4. Distances and errors
5. Changing the solution
6. Saving the data and cleaning up
7. Data structures and file formats


1. Introduction
---------------

CB.C and it's header file CB.H define structures and functions that
perform basic tasks in clustering. These include reading and writing
information from and to a file and handling the solution in general
terms. Data is known as a training set and it is used along with two
complementary representations of clustering: a codebook (centroids or
averages of all points in cluster) and a partititoning (mapping of points
to clusters). Most common operations on these are bundled in
CB-module. This document is intended to give a brief introduction to using
the module and thus prevent the urge to write your own routines.


2. Reading and initialization
-----------------------------

The data is represented by a TRAININGSET. It is a set of vectors and holds
all essential information such as the dimensionality of the data and
frequencies of individual vectors, since there may be several identical
vectors.

The file format of TRAININGSET is a binary format. Data in ASCII-format
can be converted to this format with separate tools. Inverse conversion
is also possible. When dealing with other formats it is simplest to first
convert the data to an ASCII file by the user, and then use the existing
tools (such as ASC2CB and RAW2CB) to convert to ASCII-file to CB- or TS-
format. The inverse format can be done using CBSHOW, which outputs the
content of CB/TS files in ASCII format on screen.

To read a data file, you must first have a variable of type TRAININGSET
and then just read the data with ReadTrainingSet-routine. That's it.

After you have the data, you can create both CODEBOOKs and PARTITIONINGs
by passing functions CreateNewCodebook and CreateNewPartitioning the
TRAININGSET and the number of clusters you wish the structures to be able
to handle. This number must be same for codebooks and partitionings that
will be used together.

Example:
  TRAININGSET TS;
  ReadTrainingSet("bridge.ts", &TS);


3. Basic information about data
-------------------------------

There's a multitude of macros that give information about the data. Due
to the similarity of a training set and codebook, they work for both
types. The CB in functions and macros below refers to a CODEBOOK or
a TRAININGSET.

The following macros are the most important.

-BookSize(CB) returns the number of vectors. This is the number of
separately represented vectors.
-TotalFreq(CB) is the total number of vectors. This also takes into
account frequencies of different vectors.
-VectorSize(CB) is the dimensionality of the data.

-Vector(CB,index) returns vector from CB with the given index. Valid
indices are from 0 to BookSize(CB) - 1.
-VectorScalar(CB,index,scalar) returns the element scalar of vector
index. scalar may be from 0 to VectorSize(C) - 1.
-VectorFreq(CB,index) tells the frequency of the vector.


PARTITIONING is a mapping of points to clusters. All points in a given
cluster can be accessed sequentially, since the points are stored in a
list. For it, the most important macros are:

-Map(P,item) tells to which cluster vector item has been mapped. For
TRAININGSET TS, item can be from 0 to BookSize(TS) - 1.
-UniqueVectors(P,Pindex) is the number of vectors in cluster Pindex.
Vector frequencies are not taken into account.
-PartitionCount(P) is the number of partitions and should always be the
same as the BookSize(CB) of the CODEBOOK with which the partitioning is
used.
-FirstVector(P,Pindex) returns the index of the first vector in a cluster.
-NextVector(P,item) returns the next vector in the same cluster to whict
item belongs to (i.e. use the index of first or previous next vector).
-EndOfPartition(item) tells if the partition has ended, parameter is the
index returned by NextVector.

Example for looping through all training vectors in all clusters:

  PARTITIONING* P; /* This has to be legally initialized. */
  int i, j;

  /* Through all clusters. */
  for( i = 0; i < PartitionCount(P); i++ )
    {
    /* Through all training vectors in cluster 'i'. */   
    for( j = FirstVector(P, i); ! EndOfPartition(j); j = NextVector(P, j) )
      {
      /* Do something with trainig vector 'j' in cluster 'i'. *
      }
    }


4. Distances and errors
-----------------------

CB-module also contains functions to calculate distances between vectors
and errors (or goodness) of a given clustering.

VectorDistance calculates the distance between two vectors. Two
first parameters are vectors, and they can be obtained from CB by
Vector(CB,index). Vsize is the dimensionality of data, and it can
be obtained by VectorSize(CB). The fourth parameter maxdist is the
distance which terminates the distance calculation if it is exceeded.
This can be used when looking for the closest point (i.e. if we observe
that the point is farther that the closest so far, there's no need to
calculate the exact distance). Last parameter is the type of distance.
Note that EUCLIDEQANSQ returns the square of the Euclidean distance
between vectors.

VectorDist is a simplified version of the VectorDistance, and it is
implemented as a macro. Use this one if you don't need the more sophis-
ticated features!

To calculate the error use AverageErrorForSolution.

Note that AverageErrorForSolution calculates distances between each
training vector and its current codevector. Especially, it does not
search the nearest code vector for a training vector before distance
calculation. If you want to know the average error when each training
vector is mapped to its nearest code vector, use AverageErrorCB or
AverageErrorCBFast instead.


5. Changing the solution
------------------------

There are two improtant functions that are commonly used to improve
given solution:
- GenerateOptimalPartitioning generates optimal partitioning
with respect to a given codebook and error function.
- GenerateOptimalCodebook generates optimal codebook with
respect to a given partitioning. (Together these two routines constitute
a single GLA-iteration!)

Other commonly used routines are:
- LocalReapartitioningGeneral maps the points from a given cluster to
their nearest cluster.

- RepartitionDueToNewVectors moves vectors from their old partitions
to one of the clusters given as argument. The idea is that if the
centroids for only the given clusters have moved, then points may only
move to the moved clusters. Assumption is that all points have already
been mapped to the nearest cluster. Points either stay in the old cluster
or they move to one of the clusters given as a parameter.

- RepartitionDueToNewVectorGeneral does that same as above, but for single
cluster.

- ChangePartition is used to move a point from one partition to another.


6. Saving the data and cleaning up
----------------------------------

To write a codebook (or a training set) to a file, use the function
WriteCodebook. It takes the filename, CODEBOOK to be saved and whether
overwriting an existing file is allowed.

Saving a PARTITIONING is done with WritePartitioning, which takes
filename, PARTITIONING, TRAININGSET and info whether overwriting an
existing file is allowed.

TRAININGSET and CODEBOOK can be destroyed with FreeCodebook.
PARTITIONING can be destroyed with FreePartitioning.


7. Data structures and file formats
-----------------------------------

The data structures and file formats for TRAININGSET and CODEBOOK
are identical. They also share the same set of functions. The only
difference is in the version string.

   typedef     struct { char	    Versionstr[MaxVersionLength];
			int	    BlockSizeX;
			int	    BlockSizeY;
			int	    CodebookSize;
			int	    TotalFreq;
			int	    BytesPerElement;
			int	    MinValue;
			int	    MaxValue;
			int	    Preprocessing;
			char	    GenerationMethod[MaxGenMethodLength];
			BOOKTYPE    Book;
			int	    AllocatedSize;
		      } CODEBOOK;

The file format contains the same information as the data structure.
The header of the file format is in ASCII format, and the vectors are
stored in binary format separated from the header by string with
undefined number of minus signs: "---------------------------\n".

char Versionstr[]: Contains identification string and version number.
Current versions are "VQ CODEBOOK 2.0\n" and "VQ TRAINING SET 2.0\n".

int BlockSizeX, BlockSizeY: These defines the vector dimension. The data
structure support 2-dimensional vectors (e.g. image blocks). Otherwise if
there data does not contain any 2-D structure, it is recommended to define
the vectors as Xsize=42, Ysize=1 in the case of vectors with 42 elements.

int CodebookSize: Number of vectors in the codebook.

int TotalFreq: The number of all vectors in the codebook. This is different
from the CodebookSize only if the vector frequencies are higher than 1.
In this case, TotalFreq should equal to the sum of the vector frequencies.

int BytesPerElement: Defines by how many bytes are used per element in
the representation of the vector elements. Supported numbers are 1,2,3
and 4. Use 1 byte per element by default, which is enough to represent
values in the range [0,255].

int MinValue, MaxValue: These defines the effective range of the vector
elements. With 1 byte per element, use MinValue=0 and MaxValue=255 by
default.

int Preprocessing: This field is preserved for user defined additional
information about the parameters how the particular TRAININGSET or CODEBOOK
was created. Currently, nobody uses this field. Set is to 0 if no use.

char GenerationMethod[MaxGenMethodLength]: In this string you may document
how the particular TRAININGSET or CODEBOOK was created. In case of CODEBOOK,
this field usually contains the name of the algorithm with version number.

BOOKTYPE Book: This is a pointer to the structure containing the vectors.
In the file structure, this part is stored in binary format and separeted
from the ASCII file headear by the "--------------\n" string.
int AllocatedSize:

Vectors:
--------
Vecotrs are stored in BOOKTYPE, which is a dynamically allocated table
of type BOOKNODE:

   typedef     struct { VECTORTYPE    vector;
                        VECTORELEMENT vmean;
                        int           freq;
                        char*         name;
                      } BOOKNODE;

Vector: This is similarly a dynamically allocated list of integers (int).
It is assumed that a 32-bit compiler is used so that each element can use
1-4 bytes per elements defined in the codebook header.

vmean: This field contains the average value of the vector elements.
For most program this is just additional information which is never
used. However, all routines in CB maintains this value so if you
do not modify the vectors directly by yourself, you can assume that
the vmean value is always correct, in case you will need it.

freq: Frequency of the vector. This field tells how many times the
same vector is included in the TRAININGSET or CODEBOOK. It is therefore
important always include the frequency in the calculations when you
operate with the TRAININGSET or CODEBOOK.

name: Documentary field. If the vectors are set of attributes of some
real clustering problem, you may document the name of the vectors using
this string.