% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Gen_Data.R
\name{Gen_Data}
\alias{Gen_Data}
\title{Data simulator for high-dimensional data}
\usage{
Gen_Data(
  n = 200,
  p = 5000,
  sigma = 1,
  num_ctgidx = NULL,
  pos_ctgidx = NULL,
  num_truecoef = NULL,
  pos_truecoef = NULL,
  level_ctgidx = NULL,
  effect_truecoef = NULL,
  correlation = c("ID", "AR", "MA", "CS"),
  rho = 0.5,
  family = c("gaussian", "binomial", "poisson")
)
}
\arguments{
\item{n}{Sample size, i.e. the number of rows of the feature matrix to be generated.}

\item{p}{Number of columns of the feature matrix to be generated.}

\item{sigma}{Parameter for noise level.}

\item{num_ctgidx}{The number of features that are categorical. Set to FALSE for only numerical features. Default is FALSE.}

\item{pos_ctgidx}{Vector of indices denoting which columns are categorical.}

\item{num_truecoef}{The number of features (columns) that affect response. Default is 5.}

\item{pos_truecoef}{Vector of indices denoting which features (columns) affect the response variable.}

\item{level_ctgidx}{A vector to indicate the levels of categorical features in 'pos_ctgidx'. Default is 2.}

\item{effect_truecoef}{Effects for the relevant features in 'pos_truecoef'.}

\item{correlation}{Correlation structure among features: \code{correlation = 'ID'} for independent,
\code{correlation = 'MA'} for moving average, \code{correlation = 'CS'} for compound symmetry, and \code{correlation = 'AR'} for auto-regressive. Default is \code{'ID'}. For more information, see Details.}

\item{rho}{Parameter controlling the correlation strength. See details.}

\item{family}{Model used to generate the response from the synthetic features:
\code{'gaussian'} for normally distributed data, \code{'poisson'} for non-negative counts,
\code{'binomial'} for binary (0-1) responses.}
}
\value{
Returns a \code{"sdata"} object with

\item{Y}{Response variable vector of length \eqn{n}.}

\item{X}{Feature matrix or data frame (a matrix if \code{num_ctgidx = FALSE}, a data frame otherwise).}

\item{index}{Vector of column indices of X for the features that affect the response variable (relevant features).}

\item{Beta}{Vector of effects for the relevant features.}
}
\description{
This function generates synthetic datasets from GLMs with a user-specified correlation structure.
It permits both numerical and categorical features, whose quantity can be larger than the sample size.
}
\details{
Simulated data \eqn{(y_i, x_i)} for \eqn{i = 1, ..., n} are generated as follows.
First, we generate a \eqn{p \times 1} model coefficient vector \eqn{\beta} with all entries being zero, except at the positions specified in \code{pos_truecoef},
where \code{effect_truecoef} is used. When \code{pos_truecoef} is not specified, we randomly choose \code{num_truecoef} positions from the coefficient
vector. When \code{effect_truecoef} is not specified, we randomly set the strength of the true model coefficients following Chen's setting:
\deqn{(4*\frac{\log{N}}{\sqrt{N}}+U(0,1))*Z}
where \eqn{U(0,1)} is a uniform random variable on \eqn{(0,1)} and \eqn{Z} is a random sign with \eqn{P(Z=1)=1/2} and \eqn{P(Z=-1)=1/2}.
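
As a rough base-R sketch of this coefficient step (an illustration only, not the package's internal code), assuming that \eqn{N} in the formula denotes the sample size \code{n}. The variable names \code{k}, \code{pos}, \code{z}, \code{effect} and \code{beta} are hypothetical:

\preformatted{
set.seed(1)
n <- 200
p <- 5000
k <- 5                                     # plays the role of num_truecoef

# Random positions, as when pos_truecoef is not supplied
pos <- sort(sample.int(p, k))

# Random signs Z with P(Z = 1) = P(Z = -1) = 1/2
z <- sample(c(-1, 1), k, replace = TRUE)

# Effect sizes (4 * log(n) / sqrt(n) + U(0, 1)) * Z, assuming N = n
effect <- (4 * log(n) / sqrt(n) + runif(k)) * z

# Sparse coefficient vector: zero everywhere except at pos
beta <- numeric(p)
beta[pos] <- effect
}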

Next, we generate an \eqn{n \times p} feature matrix X based on the choice in
\code{correlation}, specified as follows.

Independent (ID):  all features are independently generated from \eqn{N( 0, 1)}.

Moving average (MA): candidate features \eqn{x_1,..., x_p} are joint normal,
marginally \eqn{N( 0, 1)}, with

cov\eqn{(x_j, x_{j-1}) = \rho}, cov\eqn{(x_j, x_{j-2}) = \frac{\rho}{2}} and cov\eqn{(x_j, x_h) = 0} for \eqn{|j-h| \geq 3}.

Compound symmetry (CS): candidate features \eqn{x_1,..., x_p} are joint normal, marginally \eqn{N( 0, 1)}, with cov\eqn{(x_j, x_h) = \rho} if \eqn{j} and \eqn{h}
are both in the set of important features and cov\eqn{(x_j, x_h) = \frac{\rho}{2}} when only
one of \eqn{j} or \eqn{h} is in the set of important features.

Auto-regressive (AR): candidate features \eqn{x_1,..., x_p} are joint normal, marginally \eqn{N( 0, 1)}, with

cov\eqn{(x_j, x_h) = \rho^{|j-h|}} for all \eqn{j} and \eqn{h}.
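
The AR and MA covariance matrices described above can be written down directly in base R. This is an illustrative sketch rather than the package's implementation: the names \code{lag}, \code{Sigma_AR}, \code{Sigma_MA}, \code{Z} and \code{X} are hypothetical, and the Cholesky-based sampling is one standard way to draw correlated normal rows, not necessarily the one used by \code{Gen_Data}.

\preformatted{
set.seed(1)
p <- 6
rho <- 0.5

# Matrix of index distances |j - h|
lag <- abs(outer(seq_len(p), seq_len(p), "-"))

# AR: cov(x_j, x_h) = rho^|j - h|
Sigma_AR <- rho^lag

# MA: 1 on the diagonal, rho at lag 1, rho / 2 at lag 2, 0 beyond
Sigma_MA <- (lag == 0) + rho * (lag == 1) + (rho / 2) * (lag == 2)

# Draw n rows with covariance Sigma_AR: X = Z R, where R'R = Sigma_AR
n <- 1000
Z <- matrix(rnorm(n * p), n, p)
X <- tcrossprod(Z, t(chol(Sigma_AR)))
}

With a large \code{n}, \code{cor(X)} should be close to \code{Sigma_AR}.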

Then, we generate the response variable Y according to its response type. For the Gaussian model, \eqn{Y = x^T \cdot \beta + \epsilon} where \eqn{\epsilon \sim N( 0, 1)}.
For the binary model, let \eqn{\pi = P(Y = 1|x)}; Y is sampled from Bernoulli(\eqn{\pi}), where \eqn{logit(\pi) = x^T \cdot \beta}.
Finally, for the Poisson model, Y is generated from a Poisson distribution with mean \eqn{\exp(x^T \cdot \beta)} (log link).
For more details, see the reference below.
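
The three response models can be sketched in base R as follows. This is a minimal illustration under the formulas stated above, not the package's own code; the small design and the names \code{eta}, \code{y_gauss}, \code{y_binom} and \code{y_pois} are hypothetical.

\preformatted{
set.seed(1)
n <- 100
p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- c(1.5, -2, rep(0, p - 2))          # two relevant features

# Linear predictor x^T beta for every row of X
eta <- rowSums(sweep(X, 2, beta, "*"))

y_gauss <- eta + rnorm(n)                  # Gaussian: epsilon ~ N(0, 1)
y_binom <- rbinom(n, 1, plogis(eta))       # binary: logit(pi) = x^T beta
y_pois  <- rpois(n, exp(eta))              # Poisson: mean exp(x^T beta)
}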
}
\examples{
# Simulating data with a binomial response and an independent correlation structure.
Data <- Gen_Data(family = "binomial", correlation = "ID")
cor(Data$X[, 1:5])
print(Data)
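
# A further sketch: Gaussian response with auto-regressive features on a
# smaller design. Only arguments and return components documented above are
# used; the particular values (n = 100, p = 50, rho = 0.5) are arbitrary.
Data_AR <- Gen_Data(n = 100, p = 50, family = "gaussian",
                    correlation = "AR", rho = 0.5)
Data_AR$index   # column indices of the relevant features
Data_AR$Beta    # effects of the relevant features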


}
\references{
Chen Xu and Jiahua Chen. (2014),
The Sparse MLE for Ultrahigh-Dimensional Feature Screening,
\emph{Journal of the American Statistical Association}, 109:507,
pages 1257-1269.
}
