A partial clustering algorithm with automatic estimation of the number of clusters and identification of outliers

This function performs the CrossClustering algorithm. This method combines the Ward's minimum variance and Complete-linkage (default, useful for finding spherical clusters) or Single-linkage (useful for finding elongated clusters) algorithms, providing automatic estimation of a suitable number of clusters and identification of outlier elements.

cc_crossclustering(
  dist,
  k_w_min = 2,
  k_w_max = attr(dist, "Size") - 2,
  k2_max = k_w_max + 1,
  out = TRUE,
  method = c("complete", "single")
)

# S3 method for crossclustering
print(x, ...)

Arguments

dist: A dissimilarity structure as produced by the function dist
k_w_min: (int) Minimum number of clusters for the Ward's minimum variance method. By default is set equal 2
k_w_max: (int) Maximum number of clusters for the Ward's minimum variance method (see details)
k2_max: (int) Maximum number of clusters for the Complete/Single-linkage method. It can not be equal or greater than the number of elements to cluster (see details)
out: (lgl) If TRUE (default) outliers must be searched (see details)
method: (chr) "complete" (default) or "single". CrossClustering combines Ward's algorithm with Complete-linkage if method is set to "complete", otherwise (if method is set to 'single') Single-linkage will be used.
x: an object used to select a method.
...: further arguments passed to or from other methods.

Value

A list of objects describing characteristics of the partitioning as follows:

Optimal_cluster: number of clusters
cluster_list_elements: a list of clusters; each element of this lists contains the indices of the elements belonging to the cluster
Silhouette: the average silhouette width over all the clusters
n_total: total number of input elements
n_clustered: number of input elements that have actually been clustered

Details

See cited document for more details.

Functions

print(crossclustering):

References

Tellaroli P, Bazzi M., Donato M., Brazzale A. R., Draghici S. (2016). Cross-Clustering: A Partial Clustering Algorithm with Automatic Estimation of the Number of Clusters. PLoS ONE 11(3): e0152333. doi:10.1371/journal.pone.0152333

#' Tellaroli P, Bazzi M., Donato M., Brazzale A. R., Draghici S. (2017). E1829: Cross-Clustering: A Partial Clustering Algorithm with Automatic Estimation of the Number of Clusters. CMStatistics 2017, London 16-18 December, Book of Abstracts (ISBN 978-9963-2227-4-2)

Author

Paola Tellaroli, <paola dot tellaroli at unipd dot it>;; Marco Bazzi, <bazzi at stat dot unipd dot it>; Michele Donato, <mdonato at stanford dot edu>

Examples

library(CrossClustering)

#### Example of Cross-Clustering as in reference paper
#### method = "complete"

data(toy)

### toy is transposed as we want to cluster samples (columns of the
### original matrix)
toy_dist <- t(toy) |>
  dist(method = "euclidean")

### Run CrossClustering
cc_crossclustering(
  toy_dist,
  k_w_min = 2,
  k_w_max = 5,
  k2_max = 6,
  out = TRUE
)
#> 
#>     CrossClustering with method complete.
#> 
#> Parameter used:
#>   - Interval for the number of cluster of Ward's algorithm: [2, 5].
#>   - Interval for the number of cluster of the complete algorithm: [2, 6].
#>   - Outliers are considered.
#> 
#> Number of clusters found: 3.
#> Leading to an avarage silhouette width of: 0.8405.
#> 
#> A total of 6 elements clustered out of 7 elements considered.

#### Simulated data as in reference paper
#### method = "complete"
set.seed(10)
sg <- c(500, 250, 700, 300, 100)

# 5 clusters

t <- matrix(0, nrow = 5, ncol = 5)
t[1, ] <- rep(6, 5)
t[2, ] <- c( 0,  5, 12, 13, 15)
t[3, ] <- c(15, 11,  9,  5,  0)
t[4, ] <- c( 6, 12, 15, 10,  5)
t[5, ] <- c(12, 17,  3,  7, 10)

t_mat <- NULL
for (i in seq_len(nrow(t))) {
  t_mat <- rbind(
    t_mat,
    matrix(rep(t[i, ], sg[i]), nrow = sg[i], byrow = TRUE)
  )
}

data_15 <- matrix(NA, nrow = 2000, ncol = 5)
data_15[1:1850, ] <- matrix(
  abs(rnorm(sum(sg) * 5, sd = 1.5)),
  nrow = sum(sg),
  ncol = 5
) + t_mat

set.seed(100) # simulate outliers
data_15[1851:2000, ] <- matrix(
  runif(n = 150 * 5, min = 0, max = max(data_15, na.rm = TRUE)),
  nrow = 150,
  ncol = 5
)

### Run CrossClustering
cc_crossclustering(
  dist(data_15),
  k_w_min = 2,
  k_w_max = 19,
  k2_max = 20,
  out = TRUE
)
#> 
#>     CrossClustering with method complete.
#> 
#> Parameter used:
#>   - Interval for the number of cluster of Ward's algorithm: [2, 19].
#>   - Interval for the number of cluster of the complete algorithm: [2, 20].
#>   - Outliers are considered.
#> 
#> Number of clusters found: 10.
#> Leading to an avarage silhouette width of: 0.7064.
#> 
#> A total of 1925 elements clustered out of 2000 elements considered.


#### Correlation-based distance is often used in gene expression time-series
### data analysis. Here there is an example, using the "complete" method.

data(nb_data)
nb_dist <- as.dist(1 - abs(cor(t(nb_data))))
cc_crossclustering(dist = nb_dist, k_w_max = 20, k2_max = 19)
#> 
#>     CrossClustering with method complete.
#> 
#> Parameter used:
#>   - Interval for the number of cluster of Ward's algorithm: [2, 20].
#>   - Interval for the number of cluster of the complete algorithm: [2, 19].
#>   - Outliers are considered.
#> 
#> Number of clusters found: 17.
#> Leading to an avarage silhouette width of: 0.1408.
#> 
#> A total of 73 elements clustered out of 100 elements considered.




#### method = "single"
### Example on a famous shape data set
### Two moons data

data(twomoons)

moons_dist <- twomoons[, 1:2] |>
  dist(method = "euclidean")

cc_moons <- cc_crossclustering(
  moons_dist,
  k_w_max = 9,
  k2_max = 10,
  method = 'single'
)

moons_col <- cc_get_cluster(cc_moons)
plot(
  twomoons[, 1:2],
  col = moons_col,
  pch      = 19,
  xlab     = "",
  ylab     = "",
  main     = "CrossClustering-Single"
)


### Worms data
data(worms)

worms_dist <- worms[, 1:2] |>
  dist(method = "euclidean")

cc_worms <- cc_crossclustering(
  worms_dist,
  k_w_max = 9,
  k2_max  = 10,
  method  = "single"
)

worms_col <-  cc_get_cluster(cc_worms)

plot(
  worms[, 1:2],
  col = worms_col,
  pch = 19,
  xlab = "",
  ylab = "",
  main = "CrossClustering-Single"
)



### CrossClustering-Single is not affected to chain-effect problem

data(chain_effect)

chain_dist <- chain_effect |>
  dist(method = "euclidean")
cc_chain <- cc_crossclustering(
  chain_dist,
  k_w_max = 9,
  k2_max = 10,
  method = "single"
)

chain_col <- cc_get_cluster(cc_chain)

plot(
  chain_effect,
  col = chain_col,
  pch = 19,
  xlab = "",
  ylab = "",
  main = "CrossClustering-Single"
)