% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/gl.report.heterozygosity.r
\name{gl.report.heterozygosity}
\alias{gl.report.heterozygosity}
\title{Reports observed, expected and unbiased heterozygosities and FIS
(inbreeding coefficient) by population or by individual from SNP data}
\usage{
gl.report.heterozygosity(
  x,
  method = "pop",
  n.invariant = 0,
  nboots = 0,
  conf = 0.95,
  CI.type = "bca",
  ncpus = 1,
  plot.display = TRUE,
  plot.theme = theme_dartR(),
  plot.colors.pop = gl.colors("dis"),
  plot.colors.ind = gl.colors(2),
  error.bar = "SD",
  save2tmp = FALSE,
  verbose = NULL
)
}
\arguments{
\item{x}{Name of the genlight object containing the SNP data [required].}

\item{method}{Calculate heterozygosity by population (method='pop') or by
individual (method='ind') [default 'pop'].}

\item{n.invariant}{An estimate of the number of invariant sequence tags used
to adjust the heterozygosity rate [default 0].}

\item{nboots}{Number of bootstrap replicates to obtain confidence intervals
[default 0].}

\item{conf}{The confidence level of the required interval [default 0.95].}

\item{CI.type}{Method to estimate confidence intervals. One of
"norm", "basic", "perc" or "bca" [default "bca"].}

\item{ncpus}{Number of processes to be used in parallel operation. If ncpus
> 1, parallel operation is activated; see the "Details" section [default 1].}

\item{plot.display}{Specify if plot is to be produced [default TRUE].}

\item{plot.theme}{Theme for the plot. See Details for options
[default theme_dartR()].}

\item{plot.colors.pop}{A color palette for population plots or a list with
as many colors as there are populations in the dataset
[default gl.colors("dis")].}

\item{plot.colors.ind}{List of two color names for the borders and fill of
the plot by individual [default gl.colors(2)].}

\item{error.bar}{Statistic to be plotted as error bars: "SD" (standard
deviation), "SE" (standard error) or "CI" (confidence intervals)
[default "SD"].}

\item{save2tmp}{If TRUE, saves any ggplots and listings to the session
temporary directory (tempdir) [default FALSE].}

\item{verbose}{Verbosity: 0, silent or fatal errors; 1, begin and end; 2,
progress log; 3, progress and results summary; 5, full report
[default NULL, unless specified using gl.set.verbosity].}
}
\value{
A dataframe containing population labels, heterozygosities, FIS,
their standard deviations and sample sizes
}
\description{
Calculates the observed, expected and unbiased expected (i.e.
corrected for sample size) heterozygosities and FIS (inbreeding coefficient)
for each population or the observed heterozygosity for each individual in a
genlight object.
}
\details{
Observed heterozygosity for a population takes the proportion of
heterozygous loci for each individual and averages it over all individuals in
that population. The calculations take into account missing values.

Expected heterozygosity for a population takes the expected proportion of
heterozygotes, that is, expected under Hardy-Weinberg equilibrium, for each
locus, then averages this across the loci for an average estimate for the
population.

The unbiased expected heterozygosity is calculated using the correction for 
sample size following equation 2 from Nei 1978.

Accuracy of all heterozygosity estimates is affected by small sample sizes,
and so is their comparison between populations or repeated analysis. Expected
heterozygosities are less affected because their calculations are based on 
allele frequencies while observed heterozygosities are strongly susceptible 
to sampling effects when the sample size is small.  

Observed heterozygosity for individuals is calculated as the proportion of
loci that are heterozygous for that individual.

Finally, the number of loci that are invariant across all individuals in the
dataset (that is, across populations) is typically unknown. This can render
estimates of heterozygosity analysis-specific, and so it is not valid to
compare such estimates across species or even across different analyses
(see Schmidt et al. 2021). Microsatellites face a similar problem.
If you have an estimate of the
number of invariant sequence tags (loci) in your data, such as provided by
\code{\link{gl.report.secondaries}}, you can specify it with the n.invariant
parameter to standardize your estimates of heterozygosity. Schmidt et al.
(2021) call the resulting estimates autosomal heterozygosities.

\strong{NOTE}: It is important to realise that estimation of adjusted (autosomal)
heterozygosity requires that secondaries not be removed.

Heterozygosities and FIS (inbreeding coefficient) are calculated by locus
within each population using the following equations, and then averaged across 
all loci:
\itemize{
\item Observed heterozygosity (Ho) = number of heterozygotes / n_Ind,
where n_Ind is the number of individuals without missing data for that locus.
\item Observed heterozygosity adjusted (Ho.adj) = Ho * n_Loc /
 (n_Loc + n.invariant),
where n_Loc is the number of loci that do not have all missing data and
n.invariant is an estimate of the number of invariant loci to adjust
heterozygosity.
\item Expected heterozygosity (He) = 1 - (p^2 + q^2),
where p is the frequency of the reference allele and q is the frequency of
the alternative allele.
\item Expected heterozygosity adjusted (He.adj) = He * n_Loc /
(n_Loc + n.invariant)
\item Unbiased expected heterozygosity (uHe) = He * (2 * n_Ind /
(2 * n_Ind - 1))
\item Inbreeding coefficient (FIS) = 1 - Ho / uHe
}
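
As a worked illustration of the equations above, the following sketch applies
them to a single biallelic locus. The genotype counts and the values of
\code{n_Loc} and \code{n_invariant} are hypothetical, chosen only to make the
arithmetic easy to follow. The function averages per-locus values across loci;
here the adjustment is applied to a single-locus value purely for illustration.

\preformatted{
# Hypothetical genotype counts at one biallelic locus in one population
n_hom_ref <- 12    # homozygous for the reference allele
n_het     <- 6     # heterozygous
n_hom_alt <- 2     # homozygous for the alternative allele
n_Ind <- n_hom_ref + n_het + n_hom_alt      # 20 individuals without missing data

Ho  <- n_het / n_Ind                        # 6 / 20 = 0.3
p   <- (2 * n_hom_ref + n_het) / (2 * n_Ind)  # 30 / 40 = 0.75
q   <- 1 - p                                # 0.25
He  <- 1 - (p^2 + q^2)                      # 1 - 0.625 = 0.375
uHe <- He * (2 * n_Ind / (2 * n_Ind - 1))   # 0.375 * 40 / 39
FIS <- 1 - Ho / uHe                         # 0.22

# Adjustment for invariant loci (hypothetical counts)
n_Loc       <- 1000
n_invariant <- 4000
Ho_adj <- Ho * n_Loc / (n_Loc + n_invariant)   # 0.3 * 0.2 = 0.06
He_adj <- He * n_Loc / (n_Loc + n_invariant)   # 0.375 * 0.2 = 0.075
}

Because uHe is slightly larger than He, FIS computed from uHe is slightly
higher than it would be if He were used in the denominator.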
\strong{Function's output}

Output for method='pop' is an ordered barchart of observed heterozygosity,
unbiased expected heterozygosity and FIS (inbreeding coefficient) across
populations, together with a table of mean observed and expected
heterozygosities and FIS by population and their respective standard
deviations (SD).

The output also reports, by population, the number of loci used to
estimate heterozygosity (n.Loc), the number of polymorphic loci (polyLoc),
the number of monomorphic loci (monoLoc) and the number of loci with all
missing data (all_NALoc).

Output for method='ind' is a histogram and a boxplot of heterozygosity across
individuals.

If save2tmp = TRUE, plots and table are saved to the session temporary
directory (tempdir).

Examples of other themes that can be used are available at:
\itemize{
\item \url{https://ggplot2.tidyverse.org/reference/ggtheme.html}
\item \url{https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/}
}
 
  \strong{Error bars}
 
 The best method for presenting or assessing genetic statistics depends on 
 the type of data you have and the specific questions you're trying to 
 answer. Here's a brief overview of when you might use each method:
 
  \strong{1. Confidence Intervals ("CI"):}
  
 - Usage: Often used to convey the precision of an estimate.
 
 - Advantage: Confidence intervals give a range in which the true parameter 
 (like a population mean) is likely to fall, given the data and a specified 
 confidence level (like 95\%).
 
 - In Context: For genetic statistics, if you're estimating a parameter,
  a 95\% CI gives you a range in which you're 95\% confident the true parameter
   lies.
 
  \strong{2. Standard Deviation ("SD"):}
  
 - Usage: Describes the amount of variation from the average in a set of data.
 
 - Advantage: Allows for an understanding of the spread of individual data
  points around the mean.
  
 - In Context: If you're looking at the distribution of a quantitative trait 
 (like height) in a population with a particular genotype, the SD can 
 describe how much individual heights vary around the average height.
 
  \strong{3. Standard Error ("SE"):}
  
 - Usage: Describes the precision of the sample mean as an estimate of the 
 population mean.
 
 - Advantage: Smaller than the SD in large samples; it takes into account 
 both the SD and the sample size. 
 
 - In Context: If you want to know how accurately your sample mean represents
  the population mean, you'd look at the SE.
  
   \strong{Recommendation:}
   
  - If you're trying to convey the precision of an estimate, confidence 
  intervals are very useful.
  
  - For understanding variability within a sample, standard deviation is key.
  
  - To see how well a sample mean might estimate a population mean, consider 
  the standard error.
  
  In practice, geneticists often use a combination of these methods to 
  analyze and present their data, depending on their research questions and 
  the nature of the data.
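
  The relationship between the standard deviation and the standard error can
  be sketched in base R. The per-locus values below are hypothetical, and SE
  is computed with the textbook definition (SD divided by the square root of
  the number of values), which is assumed here rather than taken from the
  function's source.

\preformatted{
# Hypothetical per-locus heterozygosity values for one population
het_by_locus <- c(0.10, 0.20, 0.30, 0.40)

mean_het <- mean(het_by_locus)                    # 0.25
sd_het   <- sd(het_by_locus)                      # spread of loci around the mean
se_het   <- sd_het / sqrt(length(het_by_locus))   # precision of the mean

# With four values, the SE is half the SD
c(mean = mean_het, SD = sd_het, SE = se_het)
}

  The SD stays roughly constant as more loci are added, whereas the SE shrinks
  with the square root of the number of loci.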
  
 \strong{Confidence Intervals}

The uncertainty of a parameter, in this case the mean of the statistic, can
be summarised by a confidence interval (CI) which includes the true parameter
value with a specified probability (i.e. confidence level; the parameter
"conf" in this function).

In this function, CIs are obtained by bootstrapping, an inference method that
resamples the data (i.e. loci) with replacement and recalculates the
statistic for each resample.

 This function uses the function \link[boot]{boot} (package boot) to perform
 the bootstrap replicates and the function \link[boot]{boot.ci}
 (package boot) to perform the calculations for the CI.

 Four different types of nonparametric CI can be calculated
  (parameter "CI.type" in this function):
  \itemize{
   \item First order normal approximation interval ("norm").
   \item Basic bootstrap interval ("basic").
   \item Bootstrap percentile interval ("perc").
   \item Adjusted bootstrap percentile interval ("bca").
   }
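
 A minimal sketch of bootstrapping over loci with the boot package is shown
 below. The vector \code{ho_by_locus} and the helper \code{mean_ho} are
 hypothetical names introduced for illustration, and the simulated values
 stand in for real per-locus heterozygosities. How the function prepares its
 data internally is not shown here.

\preformatted{
library(boot)

set.seed(42)
# Hypothetical per-locus observed heterozygosities for one population
ho_by_locus <- runif(500, min = 0, max = 0.5)

# Statistic: mean over the loci drawn in each bootstrap replicate.
# boot() passes the data and a vector of resampled indices.
mean_ho <- function(d, i) mean(d[i])

res <- boot(data = ho_by_locus, statistic = mean_ho, R = 1000)

# Percentile interval at the 0.95 confidence level; "norm", "basic" and
# "bca" are requested in the same way through the type argument
ci <- boot.ci(res, conf = 0.95, type = "perc")
ci$percent[4:5]    # lower and upper limits of the interval
}

 The percentile interval is used in this sketch because it is the cheapest to
 compute. The "bca" interval additionally estimates an acceleration constant
 from the data.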

The studentized bootstrap interval ("stud") was not included in the CI types
 because it is computationally intensive, it may produce estimates outside
 the range of plausible values and it has been found to be erratic in
 practice, see for example the "Studentized (t) Intervals" section in:

   \url{https://www.r-bloggers.com/2019/09/understanding-bootstrap-confidence-interval-output-from-the-r-boot-package/}

    Nice tutorials about the different types of CI can be found in:

    \url{https://www.datacamp.com/tutorial/bootstrap-r}

    and

   \url{https://www.r-bloggers.com/2019/09/understanding-bootstrap-confidence-interval-output-from-the-r-boot-package/}

     Efron and Tibshirani (1993, p. 162) and Davison and Hinkley
     (1997, p. 194) suggest that the number of bootstrap replicates should
     be between 1000 and 2000.

 \strong{It is important} to note that unreliable confidence intervals will be
  obtained if too few bootstrap replicates are used.
  The function \link[boot]{boot.ci} will throw warnings and errors
   if bootstrap replicates are too few. Consider increasing the number of
   bootstrap replicates to at least 200.

   The "bca" interval is often cited as the best for theoretical reasons.
   However, it may produce unstable results if the bootstrap distribution
    is skewed or has extreme values. For example, you might get the warning
    "extreme order statistics used as endpoints" or the error "estimated
    adjustment 'a' is NA". In this case, you may want to use more bootstrap
    replicates, use a different method, or check your data for outliers.

   The error "estimated adjustment 'w' is infinite" means that the estimated
   adjustment 'w' for the "bca" interval is infinite, which can happen when
   the empirical influence values are zero or very close to zero. Possible
   causes include:
   \itemize{
   \item The number of bootstrap replicates is too small.
   \item The statistic of interest is constant or nearly constant across the
   bootstrap samples.
   \item The data contain outliers or extreme values.
   }

   Possible solutions include:
   \itemize{
   \item Increasing the number of bootstrap replicates.
   \item Using a different type of bootstrap confidence interval.
   \item Removing or transforming the outliers or extreme values.
   }
 
 \strong{Parallelisation}

 If the parameter ncpus > 1, parallelisation is enabled. On Windows, parallel
  computing employs a "socket" approach that starts new copies of R on each
   core. POSIX systems (Mac, Linux, Unix, and BSD), on the other hand,
   utilise a "forking" approach that replicates the whole current R session
    and transfers it to a new core.

    Opening and terminating R sessions in each core involves a significant
    amount of processing time. Parallelisation on Windows machines is
    therefore only quicker than serial execution when nboots > 1000-2000.
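
    The two approaches correspond to the \code{parallel} argument of
    \link[boot]{boot}. The sketch below picks a backend from the operating
    system. This selection rule, the simulated data and the names
    \code{backend}, \code{ho_by_locus} and \code{mean_ho} are assumptions
    made for illustration and may differ from what the function does
    internally.

\preformatted{
library(boot)

# Assumption: sockets ("snow") on Windows, forking ("multicore") elsewhere
backend <- if (.Platform$OS.type == "windows") "snow" else "multicore"

set.seed(1)
ho_by_locus <- runif(500, min = 0, max = 0.5)   # hypothetical per-locus values
mean_ho <- function(d, i) mean(d[i])            # statistic for each replicate

# Spread 2000 replicates over two worker processes
res <- boot(data = ho_by_locus, statistic = mean_ho, R = 2000,
            parallel = backend, ncpus = 2)
length(res$t)    # one statistic per bootstrap replicate
}

    With a statistic this cheap, starting the workers can take longer than
    the replicates themselves, which is the overhead described above.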
}
\examples{
 \donttest{
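require("dartR.data")
# Heterozygosity by population (default)
df <- gl.report.heterozygosity(platypus.gl)
# Observed heterozygosity by individual
df <- gl.report.heterozygosity(platypus.gl, method = 'ind')
# Adjust for invariant sequence tags using an estimate taken from
# gl.report.secondaries
n.inv <- gl.report.secondaries(platypus.gl)
gl.report.heterozygosity(platypus.gl, n.invariant = n.inv[7, 2])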
}
df <- gl.report.heterozygosity(platypus.gl)
}
\references{
Nei, M. (1978). Estimation of average heterozygosity and genetic distance
from a small number of individuals. Genetics, 89(3), 583-590.
}
\seealso{
\code{\link{gl.filter.heterozygosity}}

Other unmatched report: 
\code{\link{gl.allele.freq}()},
\code{\link{gl.report.basics}()},
\code{\link{gl.report.diversity}()}
}
\author{
Custodian: Luis Mijangos (Post to
\url{https://groups.google.com/d/forum/dartr})
}
\concept{unmatched report}
