Let’s create some dummy data…
### Random Dataset set.seed(739) n <- 7500 # numer of points nr_other_vars <- 4 mat <- matrix(rnorm(nr_other_vars*n),n,nr_other_vars) me<-4 # mean x <- c(rnorm(n/3,me/2,1),rnorm(2*n/3,-me/2,1)) y <- c(rnorm(n/3,0,1),rnorm(n/3,me,1),rnorm(n/3,-me,1)) true_clust <- c(rep(1,n/3),rep(2,n/3),rep(3,n/3)) # true clusters dat <- cbind(mat,x,y) dat<- as.data.frame(scale(dat)) # scaling summary(dat) ## V1 V2 V3 V4 ## Min. :-3.40352 Min. :-4.273673 Min. :-3.82710 Min. :-3.652267 ## 1st Qu.:-0.67607 1st Qu.:-0.670061 1st Qu.:-0.66962 1st Qu.:-0.684359 ## Median : 0.

The goal is to compare a few algorithms for missing imputation when used before k-means clustering is performed. For the latter we use the same algorithm as in ClustImpute to ensure that only the computation time of the imputation is compared. In a nutshell, we’ll se that ClustImpute scales like a random imputation and hence is much faster than a pre-processing with MICE / MissRanger. This is not surprising since ClustImpute basically runs a fixed number of random imputations conditional on the current cluster assignment.

We are happily introducing a new k-means clustering algorithm that includes a powerful multiple missing data imputation at the computational cost of a few extra random imputations (benchmarks following in a separate article). More precisely, the algorithm draws the missing values iteratively based on the current cluster assignment so that correlations are considered on this level (we assume a more granular dependence structure is not relevant if we are “only” interest in k partitions).