Developing an R package from scratch with Travis continuous integration

This short tutorial provdes a quick guide on how to develop an R package from scratch and how use Travis CI for automatic builds on various R versions and automatic test coverage calculation. The resulting package can be found here: CIexamplePkg A very nice general introduction can be found here: rOpenSci Packages: Development, Maintenance, and Peer Review Some material is taken from the awesome UseR 2019 tutorial from Colin Gillespie: https://www.

Measuring feature importance in k-means clustering and variants thereof

We present a novel approach for measuring feature importance in k-means clustering, or variants thereof, to increase the interpretability of clustering results. In supervised machine learning, feature importance is a widely used tool to ensure interpretability of complex models. We adapt this idea to unsupervised learning via partitional clustering. Our approach is model agnostic in that it only requires a function that computes the cluster assignment for new data points.

Benchmarking missing data strategies for k-means clustering

The goal is to compare a few algorithms for missing imputation when used before k-means clustering is performed. For the latter we use the same algorithm as in ClustImpute to ensure that only the computation time of the imputation is compared. In a nutshell, we’ll se that ClustImpute scales like a random imputation and hence is much faster than a pre-processing with MICE / MissRanger. This is not surprising since ClustImpute basically runs a fixed number of random imputations conditional on the current cluster assignment.

Intoducing ClustImpute: A new approach for k-means clustering with build-in missing data imputation

We are happily introducing a new k-means clustering algorithm that includes a powerful multiple missing data imputation at the computational cost of a few extra random imputations (benchmarks following in a separate article). More precisely, the algorithm draws the missing values iteratively based on the current cluster assignment so that correlations are considered on this level (we assume a more granular dependence structure is not relevant if we are “only” interest in k partitions).