% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/SimRPrInDT.R
\name{SimRPrInDT}
\alias{SimRPrInDT}
\title{Interdependent estimation for regression}
\usage{
SimRPrInDT(data,ctestv=NA,Struc=NA,inddep,N=99,pobs=0.9,ppre=c(0.9,0.7),
                M=1,nsub=1,conf.level=0.95,minsplit=NA,minbucket=NA)
}
\arguments{
\item{data}{Input data frame with the continuous target variables in the columns with indices 'inddep' and the\cr
influential variables, which need to be factors or numericals (transform logicals and character variables to factors)}

\item{ctestv}{Vector of character strings of forbidden split results;\cr
Example: ctestv <- rbind('variable1 == \{value1, value2\}','variable2 <= value3'), where
character strings specified in 'value1', 'value2' are not allowed as results of a splitting operation in variable 1 in a tree.\cr
For restrictions of the type 'variable <= xxx', all split results of the form 'variable <= yyy' with yyy <= xxx are excluded from a tree.\cr
Trees with split results specified in 'ctestv' are not accepted during optimization.\cr
A concrete example is: 'ctestv <- rbind('ETH == \{C2a, C1a\}','AGE <= 20')' for variables 'ETH' and 'AGE' and values 'C2a','C1a', and '20';\cr
If no restrictions exist, the default = NA is used.}

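\item{Struc}{List of the form list(name,check,labs) defining the substructure used for subsampling, cf. description for explanations;\cr
default = NA (random subsampling of observations)}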

\item{inddep}{Column indices of target variables in 'data'}

\item{N}{Number of repetitions of subsampling (integer) of predictors; default = 99}

\item{pobs}{Percentage(s) of observations for subsampling; default=0.9}

\item{ppre}{Percentage(s) of predictors for subsampling; default=c(0.9,0.7)}

\item{M}{Number of repetitions of subsampling of elements of substructure; default = 1}

\item{nsub}{(List of) numbers of different elements of substructure per subsample; default = 1}

\item{conf.level}{(1 - significance level) in function \code{ctree} (numerical, > 0 and <= 1);\cr
default = 0.95}

\item{minsplit}{Minimum number of elements in a node to be split;\cr
default = 20}

\item{minbucket}{Minimum number of elements in a node;\cr
default = 7}
}
\value{
\describe{
  \item{modelsF}{Best trees at stage 1} 
  \item{modelsI}{Best trees for the different values of 'nsub' at stage 2}
  \item{modelsJ}{Best trees for the different values of 'nsub' after mean optimization}
  \item{Struc}{Used structure}
  \item{msub}{Best numbers of elements in substructure: 2nd stage, 3rd stage}
  \item{depnames}{Names of dependent variables}
  \item{R2All}{R2s of the best trees at stages 1 and 2 and after mean optimization}
}
}
\description{
The function \code{\link{SimRPrInDT}} applies structured subsampling for finding an optimal subsample to model
the relationship between the continuous variables with indices 'inddep' and all other factor and numerical variables
in the data frame 'data'. \cr
The substructure of the observations used for subsampling is specified by the list 'Struc' which consists of the factor variable 'name' representing the substructure,
the name 'check' of the factor variable with the information about the categories of the substructure, and the matrix 'labs' which specifies the values of 'check'
corresponding to two categories in its rows, i.e. in 'labs[1,]' and 'labs[2,]'. The names of the categories have to be specified by \code{rownames(labs)}.\cr
In structured subsampling, first 'M' repetitions of subsampling of the variable 'name' with 'nsub' different elements of each category in 'check' are realized. If 'nsub' is a list, each entry is employed individually. If 'nsub' is larger than the maximum available number of elements with a certain value of 'check', the maximum possible number of elements is used. Then,
for each of the subsamples 'N' repetitions of subsampling of 'ppre' percentages of the predictors are carried out.\cr 
Subsampling of observations can additionally be restricted to 'pobs' percentages.\cr
The optimization criterion is the goodness of fit R2 on the full sample. At stage 2, the models are optimized individually.
At stage 3, the mean of accuracies is optimized over all models.\cr
Struc=NA causes random subsampling of observations instead of structured subsampling.\cr
The trees generated from subsampling can be restricted by not accepting trees
including split results specified in the character strings of the vector 'ctestv'.\cr
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.\cr
}
\details{
See Buschfeld & Weihs (2025), Optimizing decision trees for the analysis of World Englishes and sociolinguistic data. Cambridge University Press, section 4.5.6.2, for further information.

Standard output can be produced by means of \code{print(name)} or just \code{name} as well as \code{plot(name)}, where 'name' is the output object
of the function.\cr
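
As a rough illustration of the structured subsampling step, the following base R sketch draws 'nsub' distinct elements of 'name' for each category in 'labs' (capped at the number available) and keeps all observations of the selected elements. The toy data and all object names are hypothetical, and the sketch is not the internal implementation of the package.\cr
\preformatted{
# Hypothetical toy data: 6 elements of 'name', 3 observations each
set.seed(1)
name  <- factor(rep(paste0("child", 1:6), each = 3))
check <- factor(rep(c("A1", "A2", "B1", "B1", "B2", "B2"), each = 3))
labs  <- rbind(groupA = c("A1", "A2"), groupB = c("B1", "B2"))
nsub  <- 2

# For each category (row of 'labs'), draw 'nsub' distinct elements of 'name'
keep <- unlist(lapply(seq_len(nrow(labs)), function(i) {
  elems <- unique(as.character(name[is.element(check, labs[i, ])]))
  sample(elems, min(nsub, length(elems)))  # cap at the available number
}))

# Row indices of all observations belonging to the selected elements
rows <- which(is.element(name, keep))
}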
}
\examples{
data <- PrInDT::data_vowel
data <- na.omit(data)
CHILDvowel <- data$Nickname
data$Nickname <- NULL
# recode syllables in place; the column keeps its position, so 'inddep' stays valid
data$syllables <- 3 - data$syllables
data$speed <- data$word_duration / data$syllables
names(data)[names(data) == "target"] <- "vowel_length"
# interpretation restrictions (split exclusions)
ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions
# structure definition
name <- CHILDvowel
check <- "data$ETH"
labs <- matrix(1:6,nrow=2,ncol=3)
labs[1,] <- c("C1a","C1b","C1c")
labs[2,] <- c("C2a","C2b","C2c")
rownames(labs) <- c("children 1","children 2")
Struc <- list(name=name,check=check,labs=labs)
# column indices of dependent variables
inddep <- c(13,9) 
outSimR <- SimRPrInDT(data,ctestv=ctestv,Struc=Struc,inddep=inddep,N=3,M=2,
                   nsub=c(19,20),conf.level=0.99)
outSimR
plot(outSimR)
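# Hedged sketch: assuming the returned object is a list that carries the
# components listed under 'Value', individual results can be inspected directly.
outSimR$R2All  # R2s of the best trees
outSimR$msub   # best numbers of elements in the substructure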

}
