% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/BootKmeans_func_20250108.R
\name{BootKmeans}
\alias{BootKmeans}
\title{BootKmeans() function}
\usage{
BootKmeans(
  z1_matrix,
  z2_matrix,
  z3_matrix,
  z4_matrix,
  z5_matrix,
  threshold = 0.01,
  no_scans = 1000,
  max_k = 40,
  iter.max = 1e+06,
  nstart = 200,
  algorithm = "Hartigan-Wong",
  path_out = path_out
)
}
\arguments{
\item{z1_matrix}{a matrix with numerical values of the first z-descriptor for
each amino acid position in all sequences in the data set.}

\item{z2_matrix}{a matrix with numerical values of the second z-descriptor
for each amino acid position in all sequences in the data set.}

\item{z3_matrix}{a matrix with numerical values of the third z-descriptor for
each amino acid position in all sequences in the data set.}

\item{z4_matrix}{a matrix with numerical values of the fourth z-descriptor
for each amino acid position in all sequences in the data set.}

\item{z5_matrix}{a matrix with numerical values of the fifth z-descriptor for
each amino acid position in all sequences in the data set.}

\item{threshold}{a numerical value between 0 and 1 specifying the threshold
of reduction in BIC for selecting a k estimate for each kmeans clustering
model. The value specifies a proportion of the max observed reduction in BIC
when increasing k by 1 (default 0.01).}

\item{no_scans}{an integer specifying the number of k estimation scans
to run (default 1,000).}

\item{max_k}{an integer specifying the hypothetical maximum number of clusters
to detect (default 40). In each k estimation scan, the algorithm runs a
kmeans() clustering model for each value of k between 1 and max_k.}

\item{iter.max}{an integer specifying the maximum number of iterations allowed
in each kmeans() clustering model (default 1,000,000).}

\item{nstart}{an integer specifying the number of random sets of rows in the
input matrices that will be tried as initial centers in the kmeans()
clustering models (default 200).}

\item{algorithm}{a character string specifying the algorithm used by the
kmeans() clustering function, one of "Hartigan-Wong", "Lloyd", "Forgy", or
"MacQueen" (default "Hartigan-Wong").}

\item{path_out}{a user defined path to the folder where the output files
will be saved.}
}
\value{
The function produces three folders in path_out, which contain for
  each scan the estimated k-clusters saved as .RData files, an elbow plot saved
  as .pdf, and a stats summary table saved as a .csv file. A summary of all
  scans performed in the bootstrap run is also saved as a .csv file in path_out
  and printed to the console.
  Should alternative elbow plots be desired, they may be produced manually with
  the stats presented in the summary tables for each scan.
}
\description{
\code{\link{BootKmeans}} is a wrapper for the kmeans() function of the 'stats'
package that allows for bootstrapping. Bootstrapping k-estimates may be
desirable for data sets in which plots of BIC vs. k do not show a clear
inflection point ("elbow").
}
\details{
BootKmeans() performs multiple runs of kmeans() scanning k-values from 1 to a
maximum value defined by the user. In each scan, an optimal k-value is
estimated using a user-defined threshold of BIC reduction. The method is an
automated version of visually inspecting elbow plots of BIC- vs. k-values.
The number of scans to be performed is defined by the user.
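
One way to picture the threshold rule is sketched below. This is an assumption
about how such a rule can be implemented and not the code used inside
BootKmeans(): the reductions in BIC between consecutive k-values are computed,
and the estimate is taken as the smallest k for which increasing k by 1 reduces
BIC by less than threshold times the maximum observed reduction. The helper
name estimate_k and the BIC values are hypothetical.
\preformatted{
# Hypothetical helper (not part of MHCtools), assuming bic[i] is the BIC of
# the model with k = i clusters
estimate_k <- function(bic, threshold = 0.01) {
  reduction <- -diff(bic)                # reduction in BIC from k to k + 1
  cutoff <- threshold * max(reduction)   # proportion of the max reduction
  below <- which(reduction < cutoff)     # steps that no longer pay off
  if (length(below) == 0) length(bic) else below[1]
}

bic <- c(1000, 600, 400, 399, 398.5, 398.4)   # made-up BIC values, k = 1..6
estimate_k(bic, threshold = 0.01)
}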

For each k-estimate scan, the algorithm produces a summary of the stats incl.
total within SS, AIC, and BIC, an elbow plot (BIC vs. k), and a set of cluster
files corresponding to the estimated optimal k-value. It also produces a table
summarizing the stats of the final selected kmeans() models corresponding to
the estimated optimal k-values of each scan.

After running BootKmeans() on a data set, it is recommended to subsequently
evaluate the repeatability of the bootstrapped k-estimation scans with the
ClusterMatch() function also included in MHCtools.

Input data format:
A set of five z-matrices containing numerical values of the z-descriptors
(z1-z5) for each amino acid position in a sequence alignment. Each column
should represent an amino acid position and each row one sequence in the
alignment.

If you publish data or results produced with MHCtools, please cite both of
the following references:
Roved, J. (2022). MHCtools: Analysis of MHC data in non-model species. CRAN.
Roved, J. (2024). MHCtools 1.5: Analysis of MHC sequencing data in R. In S.
Boegel (Ed.), HLA Typing: Methods and Protocols (2nd ed., pp. 275–295).
Humana Press. https://doi.org/10.1007/978-1-0716-3874-3_18
}
\note{
If z-matrices were generated with the DistCalc() function, please make sure
  to load the z-matrices from the .csv files exported by DistCalc(). Calling e.g.
  'z1_matrix' without first loading the exported tables will instead use the
  default test data set in MHCtools.

Setting max_k too high can cause kmeans() to fail with the error "more cluster
  centers than distinct data points". This problem can be solved by reducing
  max_k.
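
The error can be reproduced on a small toy data set by calling kmeans()
  directly, as in the sketch below. The object names are hypothetical. The
  number of distinct rows gives a limit that the number of clusters must not
  exceed; this is a necessary condition only, and individual kmeans()
  algorithms may impose stricter limits.
\preformatted{
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)   # toy data: 10 rows, 2 columns

# Requesting more clusters than distinct rows makes kmeans() stop with an
# error; tryCatch() captures the message instead of aborting
res <- tryCatch(kmeans(x, centers = 15, nstart = 2),
                error = function(e) conditionMessage(e))
res

# Number of distinct rows: the number of clusters must not exceed this
n_distinct <- nrow(unique(x))
n_distinct
}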

AIC and BIC are calculated from the kmeans model objects by the following
  formulae:
  \itemize{
    \item AIC = D + 2*m*k
    \item BIC = D + log(n)*m*k
  }
  in which:
  \itemize{
    \item m = ncol(fit$centers)
    \item n = length(fit$cluster)
    \item k = nrow(fit$centers)
    \item D = fit$tot.withinss
  }
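
As an illustration of the formulae above, the following sketch computes AIC
  and BIC from a single kmeans() fit. The helper name kmeans_ic and the toy
  data are hypothetical and are not part of MHCtools.
\preformatted{
# Hypothetical helper (not part of MHCtools): AIC and BIC for a kmeans fit
kmeans_ic <- function(fit) {
  m <- ncol(fit$centers)     # number of variables (columns)
  n <- length(fit$cluster)   # number of observations (rows)
  k <- nrow(fit$centers)     # number of clusters
  D <- fit$tot.withinss      # total within-cluster sum of squares
  c(AIC = D + 2 * m * k, BIC = D + log(n) * m * k)
}

set.seed(1)
x <- matrix(rnorm(200), ncol = 4)   # toy data: 50 observations, 4 variables
fit <- kmeans(x, centers = 3, nstart = 10)
kmeans_ic(fit)
}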
}
\examples{
z1_matrix <- z1_matrix
z2_matrix <- z2_matrix
z3_matrix <- z3_matrix
z4_matrix <- z4_matrix
z5_matrix <- z5_matrix
path_out <- tempdir()
BootKmeans(z1_matrix, z2_matrix, z3_matrix, z4_matrix, z5_matrix, threshold=0.01,
no_scans=10, max_k=20, iter.max=10, nstart=10, algorithm="Hartigan-Wong",
path_out=path_out)
}
\seealso{
\code{\link{ClusterMatch}}; \code{\link{DistCalc}}
}
