SEMINAR SCHEDULE FOR MATHEMATICAL STATISTICS

Friday 8 March, 15.15

Timo Koski, Department of Mathematics, KTH

CLUSTERING OF BINARY VECTORS BY MINIMIZATION OF STOCHASTIC COMPLEXITY

Abstract

The operation of partitioning a data set into separate subsets is called clustering or classification. The clusters considered here can be represented by certain probability distributions over the binary hypercube. The clustering thus forms, in a sense, a statistical theory for any underlying database of binary vectors.

One basic problem is to determine the number of clusters to be formed. More clusters can always explain the data better, so we limit the number of classes by minimizing stochastic complexity, which also determines the clusters themselves. Intuitively, this corresponds to the most concise explanation, or the briefest possible binary recording, of the data. We shall also mention and comment on the related minimum message length and AutoClass (i.e. Bayesian) techniques of clustering or classification.

Stochastic complexity is evaluated here by means of the statistical model for the clusters, which is in fact a finite mixture of multivariate Bernoulli distributions. Given a clustering, there are both explicit and asymptotic expressions (related to the maximum likelihood classification estimate) for stochastic complexity. Our aim is to find the clustering with minimal stochastic complexity using these formulae.

Applications of the method to a database of several thousand strains of Enterobacteria, i.e. to the numerical taxonomy of bacteria, will be discussed.

This is joint work with Prof. Mats Gyllenberg of the University of Turku.

The talk will be in Swedish.

Location: Room 227 in Mattehuset.