% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/OSA.R
\name{OSA}
\alias{OSA}
\title{Optimal String Alignment (OSA) String/Sequence Comparator}
\usage{
OSA(
  deletion = 1,
  insertion = 1,
  substitution = 1,
  transposition = 1,
  normalize = FALSE,
  similarity = FALSE,
  ignore_case = FALSE,
  use_bytes = FALSE
)
}
\arguments{
\item{deletion}{positive cost associated with deletion of a character
or sequence element. Defaults to unit cost.}

\item{insertion}{positive cost associated with insertion of a character
or sequence element. Defaults to unit cost.}

\item{substitution}{positive cost associated with substitution of a
character or sequence element. Defaults to unit cost.}

\item{transposition}{positive cost associated with transposing (swapping)
a pair of characters or sequence elements. Defaults to unit cost.}

\item{normalize}{a logical. If TRUE, distances are normalized to the
unit interval. Defaults to FALSE.}

\item{similarity}{a logical. If TRUE, similarity scores are returned
instead of distances. Defaults to FALSE.}

\item{ignore_case}{a logical. If TRUE, case is ignored when comparing
strings. Defaults to FALSE.}

\item{use_bytes}{a logical. If TRUE, strings are compared byte-by-byte
rather than character-by-character. Defaults to FALSE.}
}
\value{
An \code{OSA} instance is returned. \code{OSA} is an S4 class inheriting from
\code{\linkS4class{StringComparator}}.
}
\description{
The Optimal String Alignment (OSA) distance between two strings/sequences
\eqn{x} and \eqn{y} is the minimum cost of operations (insertions,
deletions, substitutions or transpositions) required to transform \eqn{x}
into \eqn{y}, subject to the constraint that \emph{no substring/subsequence is
edited more than once}.
}
\details{
For simplicity, we assume \code{x} and \code{y} are strings in this section;
however, the comparator is also implemented for more general sequences.

An OSA similarity is returned if \code{similarity = TRUE}, which
is defined as
\deqn{\mathrm{sim}(x, y) = \frac{w_d |x| + w_i |y| - \mathrm{dist}(x, y)}{2},}{sim(x, y) = (w_d |x| + w_i |y| - dist(x, y))/2}
where \eqn{|x|}, \eqn{|y|} are the number of characters in \eqn{x} and
\eqn{y} respectively, \eqn{\mathrm{dist}}{dist} is the OSA distance, \eqn{w_d}
is the cost of a deletion and \eqn{w_i} is the cost of an insertion.

Normalization of the OSA distance/similarity to the unit interval
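is also supported by setting \code{normalize = TRUE}. The normalization approach
follows Yujian and Bo (2007), who show that it preserves the metric property of
the Levenshtein distance when the costs of insertion \eqn{w_i} and deletion
\eqn{w_d} are equal. Since the unnormalized OSA distance is not itself a metric,
the same guarantee should not be assumed here.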
The normalized distance \eqn{\mathrm{dist}_n}{dist_n} is defined as
\deqn{\mathrm{dist}_n(x, y) = \frac{2 \mathrm{dist}(x, y)}{w_d |x| + w_i |y| + \mathrm{dist}(x, y)},}{dist_n(x, y) = 2 * dist(x, y) / (w_d |x| + w_i |y| + dist(x, y)),}
and the normalized similarity \eqn{\mathrm{sim}_n}{sim_n} is defined as
\deqn{\mathrm{sim}_n(x, y) = 1 - \mathrm{dist}_n(x, y) = \frac{\mathrm{sim}(x, y)}{w_d |x| + w_i |y| - \mathrm{sim}(x, y)}.}{sim_n(x, y) = 1 - dist_n(x, y) = sim(x, y) / (w_d |x| + w_i |y| - sim(x, y)).}
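
As a worked illustration of the formulas above (a sketch assuming unit costs,
\eqn{w_d = w_i = 1}), take \code{x = "plauge"} and \code{y = "plague"}.
Swapping one pair of adjacent characters transforms \eqn{x} into \eqn{y}, so
\eqn{\mathrm{dist}(x, y) = 1}{dist(x, y) = 1}, and \eqn{|x| = |y| = 6}. Then
\deqn{\mathrm{sim}(x, y) = \frac{6 + 6 - 1}{2} = 5.5,}{sim(x, y) = (6 + 6 - 1)/2 = 5.5,}
\deqn{\mathrm{dist}_n(x, y) = \frac{2 \cdot 1}{6 + 6 + 1} = \frac{2}{13} \approx 0.154,}{dist_n(x, y) = 2 * 1 / (6 + 6 + 1) = 2/13 ~ 0.154,}
\deqn{\mathrm{sim}_n(x, y) = \frac{5.5}{12 - 5.5} = \frac{11}{13} \approx 0.846,}{sim_n(x, y) = 5.5 / (12 - 5.5) = 11/13 ~ 0.846,}
which agrees with \eqn{1 - \mathrm{dist}_n(x, y)}{1 - dist_n(x, y)}.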
}
\note{
If the costs of deletion and insertion are equal, this comparator is
symmetric in \eqn{x} and \eqn{y}. The OSA distance is not a proper metric
as it does not satisfy the triangle inequality. The Damerau-Levenshtein
distance is closely related: it allows the same edit operations as OSA,
but removes the requirement that no substring can be edited more than once.
}
\examples{
## Compare strings with a transposition error
x <- "plauge"; y <- "plague"
OSA()(x, y) != Levenshtein()(x, y)

## Unlike Damerau-Levenshtein, OSA does not allow a substring to be
## edited more than once
x <- "ABC"; y <- "CA"
OSA()(x, y) != DamerauLevenshtein()(x, y)
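
## A sketch illustrating the Note: with default unit costs, the OSA distance
## can violate the triangle inequality. Going from "CA" to "ABC" via "AC"
## takes one transposition plus one insertion, but OSA cannot edit the
## transposed substring again, so the direct distance is expected to be larger.
d <- OSA()
d("CA", "ABC") > d("CA", "AC") + d("AC", "ABC")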

## Compare car names using normalized OSA similarity
data(mtcars)
cars <- rownames(mtcars)
pairwise(OSA(similarity = TRUE, normalize = TRUE), cars)
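
## A sketch checking the normalized similarity against the formula given in
## Details. This assumes default unit costs, so that w_d |x| + w_i |y|
## reduces to nchar(x) + nchar(y).
x <- "plauge"; y <- "plague"
sim <- OSA(similarity = TRUE)(x, y)
sim_n <- OSA(similarity = TRUE, normalize = TRUE)(x, y)
all.equal(sim_n, sim / (nchar(x) + nchar(y) - sim))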

}
\references{
Boytsov, L. (2011), "Indexing methods for approximate dictionary searching:
Comparative analysis", \emph{ACM J. Exp. Algorithmics}, \strong{16},
Article 1.1.

Navarro, G. (2001), "A guided tour to approximate string matching",
\emph{ACM Computing Surveys (CSUR)}, \strong{33}(1), 31-88.

Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric",
\emph{IEEE Transactions on Pattern Analysis and Machine Intelligence}
\strong{29}, 1091-1095.
}
\seealso{
Other edit-based comparators include \code{\link{Hamming}}, \code{\link{LCS}},
\code{\link{Levenshtein}} and \code{\link{DamerauLevenshtein}}.
}
