% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/prepare_data.R
\name{prepare_data}
\alias{prepare_data}
\title{Prepare raw data by binning them in 1d or 2d}
\usage{
prepare_data(
  data = NULL,
  t_in = NULL,
  t_out = NULL,
  u = NULL,
  s_in = NULL,
  s_out,
  events,
  min_t = NULL,
  max_t = NULL,
  min_u = NULL,
  max_u = NULL,
  min_s = NULL,
  max_s = NULL,
  dt = NULL,
  du = NULL,
  ds,
  individual = FALSE,
  covs = NULL
)
}
\arguments{
\item{data}{A data frame.}

\item{t_in}{(optional) A vector of entry times on the time scale \code{t}.}

\item{t_out}{(optional) A vector of exit times on the time scale \code{t}.}

\item{u}{(optional) A vector of fixed times at entry into the process.}

\item{s_in}{(optional) A vector of entry times on the time scale \code{s}.}

\item{s_out}{A vector of exit times on the time scale \code{s}.}

\item{events}{A vector of event indicators (possible values 0/1).}

\item{min_t}{(optional) A minimum value for the bins over \code{t}.
If \code{NULL}, the minimum of \code{t_in} will be used.}

\item{max_t}{(optional) A maximum value for the bins over \code{t}.
If \code{NULL}, the maximum of \code{t_out} will be used.}

\item{min_u}{(optional) A minimum value for the bins over \code{u}.
If \code{NULL}, the minimum of \code{u} will be used.}

\item{max_u}{(optional) A maximum value for the bins over \code{u}.
If \code{NULL}, the maximum of \code{u} will be used.}

\item{min_s}{(optional) A minimum value for the bins over \code{s}.
If \code{NULL}, the minimum of \code{s_in} will be used.}

\item{max_s}{(optional) A maximum value for the bins over \code{s}.
If \code{NULL}, the maximum of \code{s_out} will be used.}

\item{dt}{(optional) A scalar giving the length of the intervals on the \code{t} time scale.}

\item{du}{(optional) A scalar giving the length of the intervals on the \code{u} axis.}

\item{ds}{A scalar giving the length of the intervals on the \code{s} time scale.}

\item{individual}{A Boolean. Default is \code{FALSE}. If \code{FALSE}, the matrices \code{R} and \code{Y}
are computed collectively for all observations; if \code{TRUE}, they are computed separately for each individual record.}

\item{covs}{A data.frame with the variables to be used as covariates.
The function will create dummy variables for any factor variable passed as argument in \code{covs}.
If a variable of class character is passed as argument, it will be converted to factor.}
}
\value{
A list with the following elements:
\itemize{
\item \code{bins} a list:
\itemize{
\item \code{bins_t} if \code{t_out} is provided, a vector of bin boundaries for the time scale \code{t}.
\item \code{mid_t} if \code{t_out} is provided, a vector with the midpoints of the bins over \code{t}.
\item \code{nt} if \code{t_out} is provided, the number of bins over \code{t}.
\item \code{bins_u} if \code{u} is provided, a vector of bin boundaries for the \code{u} axis.
\item \code{midu} if \code{u} is provided, a vector with the midpoints of the bins over \code{u}.
\item \code{nu} if \code{u} is provided, the number of bins over \code{u}.
\item \code{bins_s} a vector of bin boundaries for the time scale \code{s}.
\item \code{mids} a vector with the midpoints of the bins over \code{s}.
\item \code{ns} the number of bins over \code{s}.
}
\item \code{bindata}:
\itemize{
\item \code{r} or \code{R} an array of exposure times. If the data are binned over
one time scale only, this is a vector.
If the data are binned over two time scales and \code{individual == TRUE},
then \code{R} is an array of dimension nu by ns by n; otherwise it is an
array of dimension nu by ns.
\item \code{y} or \code{Y} an array of event counts. If the data are binned over one time
scale only, this is a vector.
If the data are binned over two time scales and \code{individual == TRUE},
then \code{Y} is an array of dimension nu by ns by n; otherwise it is an
array of dimension nu by ns.
\item \code{Z} a matrix of covariate values to be used in the model,
of dimension n by p.
}
}
}
\description{
\code{prepare_data()} prepares the raw individual time-to-event data
for hazard estimation in 1d or 2d.

Given the raw data, this function first constructs the bins over one or two
time axes and then computes the aggregated (or individual)
vectors or matrices of exposure times and event counts. The user can also
provide a data.frame of covariate values.
}
\details{
A few words about constructing the grid of bins.
Bins are containers for the individual data. There is no 'golden rule' or
optimal strategy for setting the number of bins over each time axis, or for
deciding on the width of the bins. The choice depends very much on the structure
of the data, but we try to give some directions here. First, in most cases, more
bins are better than fewer. A good number is about 30 bins.
However, if data are scarce, the user might want to find a compromise between
having a larger number of bins and leaving many bins empty.
Second, the chosen width of the bins (that is, \code{du} and \code{ds}) depends on
the time unit in which the time scales are measured. For example, if time
is recorded in days, as in the example below, and several years of follow-up
are available, the user can split the data into bins of width 30 (corresponding
to about one month), 60 (about two months), 90 (about three months), etc.
If the time scale is measured in years, then an appropriate width could be 0.25
(corresponding to a quarter of a year) or 0.5 (that is, half a year). However,
in some cases, time might be measured in completed years, as is often the case
for age. In this scenario, an appropriate bin width is 1.
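
As a rough illustration of the guidance above, the following sketch checks how
many bins a candidate width implies before calling \code{prepare_data()}. The
exit times, the candidate widths, the target of 30 bins and the assumption that
the grid starts at 0 are all hypothetical choices made for this illustration;
this is not a rule applied by the function itself.
\preformatted{
# Hypothetical exit times in days (illustrative values, not from any dataset)
s_out <- c(12, 95, 240, 366, 512, 730, 901, 1100, 1460, 1825)

# Candidate bin widths: about one, two and three months
candidate_ds <- c(30, 60, 90)

# Number of bins each width implies, assuming the grid starts at 0
n_bins <- ceiling(max(s_out) / candidate_ds)
names(n_bins) <- candidate_ds
n_bins

# Candidate width whose bin count is closest to the (assumed) target of 30
target <- 30
candidate_ds[which.min(abs(n_bins - target))]
}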

Finally, it is always a good idea to plot the data first and explore the range
of values over which the time scale(s) are recorded. This will give insight
into reasonable values for the arguments \code{min_s}, \code{min_u}, \code{max_s} and \code{max_u}
(which are optional in any case).
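
A minimal sketch of such an exploration, using base R only and made-up values
for \code{u} and \code{s_out}, could look as follows. Rounding the upper limits
to a multiple of the bin width is just one possible convention, assumed here for
illustration.
\preformatted{
# Hypothetical times at entry (u) and exit times (s_out), in days
u <- c(30, 45, 120, 200, 365, 400, 610)
s_out <- c(15, 300, 90, 720, 60, 1000, 250)

# Observed ranges suggest candidate values for min_u/max_u and min_s/max_s
range(u)
range(s_out)

# Histograms show where observations are sparse
hist(u, main = "Time at entry (u)", xlab = "days")
hist(s_out, main = "Exit time (s)", xlab = "days")

# One possible choice: extend the upper limits to a multiple of the bin width
ds <- 30
max_u <- ceiling(max(u) / ds) * ds
max_s <- ceiling(max(s_out) / ds) * ds
c(max_u = max_u, max_s = max_s)
}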

Regarding names of covariates or levels of categorical covariates/factors:
When using "LMMsolver" to fit a model, covariates whose names
(or factor labels) include a symbol such as "+", "-", "<" or ">" will result
in an error. To avoid this, the offending names (labels) are rewritten
without mathematical symbols. For example, "Lev+5FU" (in the colon cancer data)
is replaced by "Lev&5FU".
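
To see in advance which labels would be affected, one could run a check along
the following lines. This is only a sketch with a hypothetical factor; it flags
labels containing the symbols listed above and does not reproduce the rewriting
rule used by the package.
\preformatted{
# Hypothetical factor with one label containing "+"
rx <- factor(c("Obs", "Lev", "Lev+5FU", "Lev"))

# Flag labels containing any of the symbols "+", "-", "<" or ">"
has_symbol <- grepl("[+<>-]", levels(rx))
levels(rx)[has_symbol]
}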
}
\examples{

# Bin data over s = time since recurrence only, with intervals of length 30 days
# aggregated data (no covariates)
# The following example provides the vectors of data directly from the dataset
binned_data <- prepare_data(s_out = reccolon2ts$timesr, events = reccolon2ts$status, ds = 30)
# Visualize vector of event counts
print(binned_data$bindata$y)
# Visualize midpoints of the bins
print(binned_data$bins$mids)
# Visualize number of bins
print(binned_data$bins$ns)
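
# A quick sanity check (a sketch based on the documented return value, not a
# feature of the package): the number of midpoints should equal ns, and the
# total of the binned event counts should match the number of events in the
# raw data, provided that every observation falls inside the grid of bins.
length(binned_data$bins$mids) == binned_data$bins$ns
c(binned = sum(binned_data$bindata$y), raw = sum(reccolon2ts$status))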

# Now, the same thing is done by providing a dataset and the name of all relevant variables
binned_data <- prepare_data(data = reccolon2ts, s_out = "timesr", events = "status", ds = 30)
# Visualize vector of event counts
print(binned_data$bindata$y)

# Now using ds = .3 and the same variable measured in years
binned_data <- prepare_data(s_out = reccolon2ts$timesr_y, events = reccolon2ts$status, ds = .3)
# Visualize vector of exposure times
print(binned_data$bindata$r)


# Bin data over u = time at recurrence and s = time since recurrence, measured in days
# aggregated data (no covariates)
# Note that if we do not provide du this is taken to be equal to ds
binned_data <- prepare_data(
  u = reccolon2ts$timer, s_out = reccolon2ts$timesr,
  events = reccolon2ts$status, ds = 30
)

# Visualize matrix of event counts
print(binned_data$bindata$Y)

# Visualize midpoints of bins over u
print(binned_data$bins$midu)
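
# According to the documented return value, Y has dimension nu by ns in the
# aggregated case. The two lines below allow a visual comparison
# (illustrative check only).
dim(binned_data$bindata$Y)
c(nu = binned_data$bins$nu, ns = binned_data$bins$ns)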


# Bin data over u = time at recurrence and s = time since recurrence, measured in days
# individual-level data required
# we provide two covariates: nodes (numerical) and rx (factor)
covs <- subset(reccolon2ts, select = c("nodes", "rx"))
binned_data <- prepare_data(
  u = reccolon2ts$timer, s_out = reccolon2ts$timesr,
  events = reccolon2ts$status, ds = 30, individual = TRUE, covs = covs
)

# Visualize structure of binned data
# str() prints the structure itself, so no print() wrapper is needed
str(binned_data$bindata)

# Alternatively:
binned_data <- prepare_data(
  data = reccolon2ts,
  u = "timer", s_out = "timesr",
  events = "status", ds = 30, individual = TRUE, covs = c("nodes", "rx")
)

}
\author{
Angela Carollo \email{carollo@demogr.mpg.de}
}
