% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/embed-all-the-things.R
\name{predict.textspace}
\alias{predict.textspace}
\title{Predict using a Starspace model}
\usage{
\method{predict}{textspace}(
  object,
  newdata,
  type = c("generic", "labels", "knn", "embedding"),
  k = 5L,
  sep = " ",
  basedoc,
  ...
)
}
\arguments{
\item{object}{an object of class \code{textspace} as returned by \code{\link{starspace}} or \code{\link{starspace_load_model}}}

\item{newdata}{either a data frame with columns \code{doc_id} and \code{text}, or a character vector of texts whose names are the identifiers of those texts}

\item{type}{character string, one of \code{'generic'}, \code{'labels'}, \code{'knn'} or \code{'embedding'}. Defaults to \code{'generic'}}

\item{k}{integer with the number of predictions to make. Defaults to 5. Only used in case \code{type} is set to \code{'generic'} or \code{'knn'}}

\item{sep}{character string used to split \code{newdata} using boost::split. Only used in case \code{type} is set to \code{'generic'}}

\item{basedoc}{optional, either a character vector of possible elements to predict or 
the path to a file in labelDoc format containing the base documents, i.e. the set of possible things to predict, 
if these differ from the ones in the training data. Only used in case \code{type} is set to \code{'generic'}}

\item{...}{not used}
}
\value{
The following is returned, depending on the argument \code{type}:
\itemize{
\item In case type is set to \code{'generic'}: a list with one element for each row or element in \code{newdata}. 
Each list element is itself a list with the elements 
\itemize{
\item doc_id: the identifier of the text
\item text: the character string with the text
\item prediction: data.frame with columns label, label_starspace and similarity 
indicating the predicted label and the similarity of the text to the label
\item terms: a list with elements basedoc_index and basedoc_terms indicating the position in basedoc and the terms 
from the dictionary that are used to compute the similarity
}
\item In case type is set to \code{'labels'}: \cr
The data.frame \code{newdata} with one column added for each label in the Starspace model. 
These columns contain the similarities of the text to the label. 
Similarities are computed with \code{\link{embedding_similarity}}, using either cosine or dot product, 
whichever was used during model training.
\item In case type is set to \code{'embedding'}: \cr
A matrix of document embeddings, one embedding for each text in \code{newdata} as returned by \code{\link{starspace_embedding}}. 
The rownames of this matrix are set to the document identifiers of \code{newdata}.
\item In case type is set to \code{'knn'}: a list of data.frames, one for each row or element in \code{newdata}. \cr
Each of these data frames contains the columns doc_id, label, similarity and rank, indicating the
k-nearest neighbouring (most similar) elements of the model dictionary compared to your input text, as returned by \code{\link{starspace_knn}}.
}
}
\description{
The prediction functionality allows you to retrieve the following types of elements from a Starspace model:
\itemize{
\item \code{generic}: get general Starspace predictions in detail
\item \code{labels}: get similarity of your text to all the labels of the Starspace model
\item \code{embedding}: document embeddings of your text (shorthand for \code{\link{starspace_embedding}})
\item \code{knn}: k-nearest neighbouring (most similar) elements of the model dictionary compared to your input text (shorthand for \code{\link{starspace_knn}})
}
}
\examples{
data(dekamer, package = "ruimtehol")
dekamer$text <- strsplit(dekamer$question, "\\\\W")
dekamer$text <- lapply(dekamer$text, FUN = function(x) setdiff(x, ""))
dekamer$text <- sapply(dekamer$text, 
                       FUN = function(x) paste(x, collapse = " "))

set.seed(123456789)
idx <- sample(nrow(dekamer), size = round(nrow(dekamer) * 0.9))
traindata <- dekamer[idx, ]
testdata <- dekamer[-idx, ]
model <- embed_tagspace(x = traindata$text, 
                        y = traindata$question_theme_main, 
                        early_stopping = 0.8,
                        dim = 10, minCount = 5)
scores <- predict(model, testdata)
scores <- predict(model, testdata, type = "labels")
str(scores)
emb <- predict(model, testdata[, c("doc_id", "text")], type = "embedding")
knn <- predict(model, testdata[1:5, c("doc_id", "text")], type = "knn", k=3)
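## A minimal, hedged sketch of inspecting the objects created above. It relies only
## on the structure described in the Value section; the actual numbers depend on the
## fitted model. The name knn_all is introduced here purely for illustration.
dim(emb)                        # one row per text in newdata
head(rownames(emb))             # rownames hold the document identifiers
knn_all <- do.call(rbind, knn)  # stack the per-document data frames into one
head(knn_all)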


\dontrun{
library(udpipe)
data(dekamer, package = "ruimtehol")
dekamer <- subset(dekamer, question_theme_main == "DEFENSIEBELEID")
x <- udpipe(dekamer$question, "dutch", tagger = "none", parser = "none", trace = 100)
x <- x[, c("doc_id", "sentence_id", "sentence", "token")]
set.seed(123456789)
model <- embed_sentencespace(x, dim = 15, epoch = 5, minCount = 5)
scores <- predict(model, "Wat zijn de cijfers qua doorstroming van 2016?", 
                  basedoc = unique(x$sentence), k = 3) 
str(scores)
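## Hedged sketch: per the Value section, type = "generic" returns one list element
## per input text, each holding a data.frame called prediction. Assuming that
## structure, the predictions for the first (and here only) text can be viewed with:
scores[[1]]$prediction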

## clean up for cran
file.remove(list.files(pattern = ".udpipe$"))
}
}
