This post is not about a new technique or package, but rather about combining existing functionality in interpretable machine learning and data visualization to facilitate the analysis of model results. We’ll use two packages, DALEX and plotly, to create interactive Individual Conditional Expectation (ICE) plots and show how to use them to find interesting model behavior. As an example, let’s take a random forest (RF) trained on an imputed version of the Titanic data, on which we create a DALEX explainer.
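To make the idea concrete before reaching for DALEX, an ICE curve can be computed by hand: sweep one feature over a grid while holding all other features fixed, and record one prediction curve per observation. The sketch below uses a logistic regression on simulated data as a stand-in for the RF; the variable names (`age`, `fare`, `survived`) are illustrative.

```r
# Minimal ICE sketch with base R: a logistic regression stands in for the RF.
set.seed(1)
dat <- data.frame(
  age  = runif(200, 1, 80),
  fare = runif(200, 5, 100)
)
dat$survived <- rbinom(200, 1, plogis(-0.03 * dat$age + 0.02 * dat$fare))

model <- glm(survived ~ age + fare, data = dat, family = binomial)

# One ICE curve per observation: fix 'age' at each grid value for everyone.
age_grid <- seq(1, 80, length.out = 25)
ice <- sapply(age_grid, function(a) {
  tmp <- dat
  tmp$age <- a
  predict(model, newdata = tmp, type = "response")
})

# 'ice' is a 200 x 25 matrix (rows = ICE curves); its column means give
# the partial dependence curve.
pdp <- colMeans(ice)
```

Plotting each row of `ice` as a line (e.g. with plotly for hover and zoom) gives the interactive ICE plot; heterogeneous curves hint at interactions that the averaged PDP would hide.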
Let’s create some dummy data…

### Random Dataset

```r
set.seed(739)
n <- 7500 # number of points
nr_other_vars <- 4
mat <- matrix(rnorm(nr_other_vars * n), n, nr_other_vars)
me <- 4 # mean
x <- c(rnorm(n/3, me/2, 1), rnorm(2*n/3, -me/2, 1))
y <- c(rnorm(n/3, 0, 1), rnorm(n/3, me, 1), rnorm(n/3, -me, 1))
true_clust <- c(rep(1, n/3), rep(2, n/3), rep(3, n/3)) # true clusters
dat <- cbind(mat, x, y)
dat <- as.data.frame(scale(dat)) # scaling
summary(dat)
```

```
##        V1                 V2                  V3                 V4           
##  Min.   :-3.40352   Min.   :-4.273673   Min.   :-3.82710   Min.   :-3.652267  
##  1st Qu.:-0.67607   1st Qu.:-0.670061   1st Qu.:-0.66962   1st Qu.:-0.684359  
##  Median : 0.
```
This short tutorial provides a quick guide on how to develop an R package from scratch and how to use Travis CI for automatic builds on various R versions and automatic test-coverage calculation. The resulting package can be found here: CIexamplePkg. A very nice general introduction can be found here: rOpenSci Packages: Development, Maintenance, and Peer Review. Some material is taken from the awesome UseR 2019 tutorial by Colin Gillespie: https://www.
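A `.travis.yml` along the following lines covers both goals — building on several R versions and reporting test coverage via covr. This is a sketch of a typical setup, not the file from CIexamplePkg itself:

```yaml
# Illustrative Travis CI config for an R package (not copied from CIexamplePkg).
language: r
cache: packages

r:
  - release
  - devel
  - oldrel

r_packages:
  - covr

after_success:
  - Rscript -e 'covr::codecov()'
```

The `r:` matrix spawns one build per R version, and the `after_success` step uploads coverage to Codecov only when the checks pass.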
We present a novel approach for measuring feature importance in k-means clustering, or variants thereof, to increase the interpretability of clustering results. In supervised machine learning, feature importance is a widely used tool to ensure interpretability of complex models. We adapt this idea to unsupervised learning via partitional clustering. Our approach is model agnostic in that it only requires a function that computes the cluster assignment for new data points.
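The idea can be sketched in a few lines of base R: permute one feature at a time and measure how often the cluster assignment of the permuted data disagrees with the original assignment. The nearest-centroid helper and the disagreement-rate metric below are an illustration of the concept, not the package's actual interface.

```r
set.seed(42)
# Toy data: two informative features (x, y) plus two noise features.
n <- 300
dat <- data.frame(
  x  = c(rnorm(n/2, -2), rnorm(n/2, 2)),
  y  = c(rnorm(n/2,  2), rnorm(n/2, -2)),
  n1 = rnorm(n),
  n2 = rnorm(n)
)

km <- kmeans(dat, centers = 2, nstart = 10)

# Cluster assignment for new (possibly permuted) data: nearest centroid.
assign_cluster <- function(newdata, centers) {
  d2 <- sapply(seq_len(nrow(centers)), function(k)
    rowSums(sweep(as.matrix(newdata), 2, centers[k, ])^2))
  max.col(-d2)  # index of the closest centroid per row
}

# Permutation importance: how often does permuting a feature flip the
# cluster assignment relative to the original clustering?
perm_importance <- sapply(names(dat), function(feat) {
  permuted <- dat
  permuted[[feat]] <- sample(permuted[[feat]])  # destroy the feature's signal
  mean(assign_cluster(permuted, km$centers) != km$cluster)
})
sort(perm_importance, decreasing = TRUE)
```

The informative features `x` and `y` should show a clearly higher disagreement rate than the noise features, which is exactly the model-agnostic notion of importance described above: only a prediction function for cluster membership is required, not the internals of the clustering algorithm.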
The goal is to compare a few algorithms for missing-value imputation when used before k-means clustering. For the clustering step we use the same algorithm as in ClustImpute to ensure that only the computation time of the imputation is compared. In a nutshell, we’ll see that ClustImpute scales like a random imputation and hence is much faster than pre-processing with MICE / missRanger. This is not surprising, since ClustImpute basically runs a fixed number of random imputations conditional on the current cluster assignment.
We are happy to introduce a new k-means clustering algorithm that includes a powerful multiple imputation of missing data at the computational cost of a few extra random imputations (benchmarks to follow in a separate article). More precisely, the algorithm draws the missing values iteratively based on the current cluster assignment, so that correlations are accounted for on that level (we assume a more granular dependence structure is not relevant if we are “only” interested in k partitions).
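The core loop can be sketched in base R as follows — a simplified illustration of the draw-conditional-on-cluster idea, not ClustImpute's actual implementation:

```r
set.seed(123)
# Toy data: y is correlated with x, and some y values are missing.
n <- 400
x <- c(rnorm(n/2, -2), rnorm(n/2, 2))
y <- x + rnorm(n, sd = 0.5)
dat <- cbind(x, y)
dat[sample(n, 60), "y"] <- NA
miss <- is.na(dat[, "y"])

# Start from a fully random imputation, then alternate: cluster the
# completed data, redraw each missing value from observed donors in
# the SAME cluster, and repeat.
dat[miss, "y"] <- sample(dat[!miss, "y"], sum(miss), replace = TRUE)
for (iter in 1:10) {
  km <- kmeans(dat, centers = 2, nstart = 5)
  for (k in 1:2) {
    idx    <- miss & km$cluster == k
    donors <- dat[!miss & km$cluster == k, "y"]
    if (any(idx) && length(donors) > 0)
      dat[idx, "y"] <- sample(donors, sum(idx), replace = TRUE)
  }
}
```

After a few iterations the imputed `y` values respect the cluster structure: points in the left cluster draw from left-cluster donors, so the x–y correlation is preserved without ever fitting a full joint model of the dependence structure.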