New plot functionality for ClustImpute 0.2.0 and other improvements

Let’s create some dummy data…

### Random Dataset
set.seed(739)
n <- 7500 # number of points
nr_other_vars <- 4
mat <- matrix(rnorm(nr_other_vars*n),n,nr_other_vars)
me <- 4 # mean offset used to separate the clusters
x <- c(rnorm(n/3,me/2,1),rnorm(2*n/3,-me/2,1)) 
y <- c(rnorm(n/3,0,1),rnorm(n/3,me,1),rnorm(n/3,-me,1))
true_clust <- c(rep(1,n/3),rep(2,n/3),rep(3,n/3)) # true clusters
dat <- cbind(mat,x,y)
dat<- as.data.frame(scale(dat)) # scaling
summary(dat)
##        V1                 V2                  V3                 V4           
##  Min.   :-3.40352   Min.   :-4.273673   Min.   :-3.82710   Min.   :-3.652267  
##  1st Qu.:-0.67607   1st Qu.:-0.670061   1st Qu.:-0.66962   1st Qu.:-0.684359  
##  Median : 0.01295   Median :-0.006559   Median :-0.01179   Median : 0.001737  
##  Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.000000  
##  3rd Qu.: 0.67798   3rd Qu.: 0.684672   3rd Qu.: 0.67221   3rd Qu.: 0.687404  
##  Max.   : 3.35535   Max.   : 3.423416   Max.   : 3.80557   Max.   : 3.621530  
##        x                 y            
##  Min.   :-2.1994   Min.   :-2.151001  
##  1st Qu.:-0.7738   1st Qu.:-0.975136  
##  Median :-0.2901   Median : 0.009932  
##  Mean   : 0.0000   Mean   : 0.000000  
##  3rd Qu.: 0.9420   3rd Qu.: 0.975788  
##  Max.   : 2.8954   Max.   : 2.265420

…with missing values…

library(ClustImpute)
dat_with_miss <- miss_sim(dat,p=.2,seed_nr=120)
mis_ind <- is.na(dat_with_miss) # missing indicator

…that is clearly hard to impute using a simple random imputation:
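For reference, a simple random imputation replaces every missing entry with a value drawn at random from the observed values of the same column. The following is a minimal base-R sketch of that idea, not necessarily the exact procedure used for this post; `random_impute` and the toy data are hypothetical names introduced for illustration.

```r
# Hypothetical helper: fill each NA with a random draw from the observed
# values of the same column (columns are treated independently).
# Assumes every column has at least one observed value.
random_impute <- function(df, seed = 1) {
  set.seed(seed)
  for (j in seq_along(df)) {
    miss <- is.na(df[[j]])
    if (any(miss)) {
      observed <- df[[j]][!miss]
      # index-based sampling avoids sample()'s special case for length-1 input
      idx <- sample.int(length(observed), sum(miss), replace = TRUE)
      df[[j]][miss] <- observed[idx]
    }
  }
  df
}

# Toy example: two columns with a few missing entries
toy <- data.frame(a = c(1, NA, 3, 4, NA), b = c(NA, 10, 20, NA, 40))
toy_imputed <- random_impute(toy)
anyNA(toy_imputed)  # FALSE
```

Because each column is filled in independently, the joint structure between features (here, the relationship between x and y that defines the clusters) is ignored.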

Any clustering based on data “completed” this way will not provide good results. With ClustImpute we come a bit closer to a clustering based on the full data, as we can see here:
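The ClustImpute fit itself is not shown in this section. A minimal sketch of such a call could look as follows; the toy data and all parameter values are assumptions chosen for illustration, not the settings used for the plots in this post. For the data above one would pass `dat_with_miss` and `nr_cluster = 3` instead.

```r
library(ClustImpute)

# Toy data: two well-separated groups in two dimensions (illustrative only)
set.seed(1)
toy <- data.frame(
  x = c(rnorm(100, 2), rnorm(100, -2)),
  y = c(rnorm(100, 2), rnorm(100, -2))
)

# Remove 20% of each column at random (plain base R instead of miss_sim)
toy_miss <- toy
is.na(toy_miss$x) <- sample(nrow(toy), 40)
is.na(toy_miss$y) <- sample(nrow(toy), 40)

# Assumed settings: 10 iterations, 10 k-means steps per iteration,
# weight function reaching 1 after n_end = 8 iterations
res_toy <- ClustImpute(toy_miss, nr_cluster = 2, nr_iter = 10,
                       c_steps = 10, n_end = 8)

table(res_toy$clusters)        # cluster sizes
anyNA(res_toy$complete_data)   # check the returned, completed data for NAs
```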

How would we look at the cluster results if we did not know that the clusters exist in a 2-dimensional subspace? One way is to look at the marginal distribution of each feature within a cluster, which is basically what is shown above by making use of the ggExtra package. ClustImpute now offers this as a built-in default plot:

plot(res)+xlim(-2.5,2.5)
## Warning: Removed 385 rows containing non-finite values (stat_bin).

We truncate the x-axis here to focus on the body of the distribution, but of course this is optional (the truncation is also what triggers the warning about removed rows). Clearly, the clusters only really differ in features x and y. The orange bars show the cluster centroids. Alternatively, one can show the mean of all data points grouped by cluster and feature, which may differ slightly since the last step in ClustImpute is the cluster assignment based on the final centroids.
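To see why centroids and grouped means can differ, consider the following base-R sketch. It is only an illustration under assumed values: the data and the "final" centroids are made up, and the code does not reproduce ClustImpute's internals. Points are assigned to their nearest centroid, and the means per cluster and feature are then computed from that assignment.

```r
set.seed(42)
# Toy data: two features, two groups of 50 points each
X <- rbind(matrix(rnorm(100, mean = -1), ncol = 2),
           matrix(rnorm(100, mean =  1), ncol = 2))
colnames(X) <- c("x", "y")

# Assumed final centroids from some clustering run (made-up values)
centroids <- rbind(c(-0.8, -1.1), c(1.2, 0.9))
colnames(centroids) <- colnames(X)

# Last step: assign each point to its nearest centroid
# (squared Euclidean distance)
d1 <- rowSums(sweep(X, 2, centroids[1, ])^2)
d2 <- rowSums(sweep(X, 2, centroids[2, ])^2)
cluster <- ifelse(d1 <= d2, 1L, 2L)

# Mean of all points grouped by cluster and feature
grouped_means <- aggregate(as.data.frame(X),
                           by = list(cluster = cluster), FUN = mean)

centroids       # the centroids used for the assignment
grouped_means   # close to, but not identical with, the centroids
```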

Alternatively one can also visualize the marginal distributions with a box-plot:

plot(res, type="box")

Other new functionality

There are some other new features; separate posts may follow up on those.

  • It used to be the (strong) recommendation to center the data if a weight function is used (n_end > 1). Now, by default, the weight function scales imputed values towards the global mean of each feature. Thus, for centered data there is almost no change (due to the random imputation mechanism, data with a true but unknown mean of zero might have an empirical mean different from zero). This is relevant if you have to work with uncentered data for whatever (good) reason.
  • There is now a check whether the data is centered, and potentially a warning (if you scale the imputed values towards zero instead of the actual mean).
  • Added a custom print function showing cluster centroids and the number of observations per cluster in nicely formatted tables. Nothing dramatic, but nice to look at.
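The difference between shrinking towards zero and shrinking towards the feature mean can be illustrated in a few lines of base R. This is a sketch of the idea only, not ClustImpute's internal code; the weight `w`, the example values, and all function names are hypothetical.

```r
# Hypothetical illustration: scale a value with a weight w in [0, 1]
shrink_to_zero <- function(x, w) w * x
shrink_to_mean <- function(x, w, m) m + w * (x - m)

feature_mean <- 10            # an uncentered feature with mean 10
imputed      <- c(8, 10, 13)  # candidate imputed values
w            <- 0.5

shrink_to_zero(imputed, w)                # 4.0 5.0 6.5
shrink_to_mean(imputed, w, feature_mean)  # 9.0 10.0 11.5

# Simple centering check: are all column means numerically zero?
is_centered <- function(df, tol = 1e-8) {
  all(abs(colMeans(df, na.rm = TRUE)) < tol)
}
is_centered(data.frame(a = c(-1, 0, 1)))   # TRUE
is_centered(data.frame(a = c(9, 10, 11)))  # FALSE
```

For centered data (mean zero) both shrinkage functions coincide, so the choice only matters for uncentered features, where shrinking towards zero would pull imputed values away from the bulk of the data.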