%\VignetteIndexEntry{Synthetic microarray data}
%\VignettePackage{biocDatasets}

%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%

\documentclass[10pt, a4paper, twoside]{article}

\title{Making synthetic array data}

\begin{document}

\maketitle

\section{Introduction}

This vignette describes how to build a complete synthetic
microarray dataset for expression data.
Situations in which this can be useful are:
\begin{itemize}
  \item dataset of arbitrary size for demonstration purposes
    (minimizing the need for data packages)
  \item dataset for test / monitoring purposes (memory usage, etc)
  \item simulations
\end{itemize}

The example here is simple, if not simplistic, and supplementary
rules governing the creation of data such as sequence-dependent
characteristics of the probe signal or behavior of the signal
across the dynamic range can be added.

<<results=hide>>=
library("biocDatasets")
@ 


\section{Generating a random array design}

\subsection{Probe design}
\label{subsec:probe_design}
We first consider the number of transcripts our 
synthetic array will be designed to measure signal for and generate sequences for each transcript.


<<>>=
n_transcripts <- 250
transcripts <- randomDNASequences(n_transcripts, 
                                  sample(300:10000, n_transcripts, 
                                         replace=TRUE))
@ 

Each transcript represents the biological entity 
that the array was designed to measure.

This section outlines the creation of a short oligonucleotide array. Features or
probes in the array are created as random subsequences of transcripts. 


<<>>=
n_probes <- 10
len_probe <- 25
ir <- vector("list", length=n_transcripts)
features <- data.frame(seq = I(rep("" ,n_transcripts * n_probes)),
                       target = rep(as.integer(NA), n_transcripts * n_probes))
for (i in 1:n_transcripts) {
  s <- transcripts[[i]]
  ir[[i]] <- randomIRanges(n_probes, len_probe, 
                           1, length(s), replace=FALSE)
  features$seq[(1:n_probes) + (i-1) * n_probes] <-
    as.character(msubseq(s, ir[[i]]))
  features$target[(1:n_probes) + (i-1) * n_probes] <-
    i
}

@ 

Obtaining duplicated probe sequences from the procedure
above is very unlikely, but it is nevertheless preferable 
to verify it:
<<>>=
summary(duplicated(features$seq))

@ 

\subsection{Probe positions}

Probes on a microarray are deposited onto `spots', at given locations
on the surface of the slide.

The layout is often a grid, and creating such a layout is easy:

<<>>=
ppos <- createProbeCoords(50, 50)
@

The layout obtained can then be visualized easily:

<<fig=TRUE>>=
plot(ppos$x, ppos$y, pch=".")
@ 


\subsection{Assemble the layout}

We can now assemble the feature information
(see Section~\ref{subsec:probe_design})
with the probe positions.

<<>>=
scramble_i <- sample(seq(along=ppos$x))
array_layout <- cbind(features, ppos[scramble_i, ])
row.names(array_layout) <- seq(along=array_layout[[1]])
@ 


\section{Generating random expression-array signal}

Here we focus on generating synthetic signal for arrays.


\subsection{Replicate the same array}

Creating expression array signal for an array is straightforward:
<<>>=
x <- expression_arraywide(50*50)
@ 

Creating synthetic replicates, that is simulate replicate
arrays for the same experiment, is equally easy:
<<>>=
x2 <- replicate_arraywide(x)
@ 

Both array are stored in a matrix, which will constitute
the expression matrix at the probe level:
<<>>=
m <- cbind(x, x2)
colnames(m) <- c(1, 2)
rownames(m) <- seq(1, 50*50)
@ 

Creating an \verb+ExpressionSet+ is now straightforward:
<<>>=
library(Biobase)
eset <- new("ExpressionSet",
            featureData = new("AnnotatedDataFrame", array_layout),
            exprs = m)
@ 

We are now ready to see how to create a slightly more complex
synthetic object.

\subsection{Replicates of different arrays}

<<label=eset2dataframe>>=

eset2dataframe <- function(eset){
  pdataf <- pData(eset)
  
  dataf <- lapply(pdataf,
                  function(col) rep(col, 
                                  rep(nrow(exprs(eset)), nrow(pdataf)))
                  )
  dataf <- as.data.frame(dataf)
  dataf <- cbind(exprs = c(exprs(eset)), 
                 array=rep(rownames(pdataf), each=nrow(exprs(eset))), 
                 dataf)
  return(dataf)
}


@ 

We start by creating two separate array signals, each representing
arbitrarily different experimental conditions.
<<>>=
sample_a_seed <- expression_arraywide(50*50)

sample_b_seed <- expression_arraywide(50*50)
@ 

We can then set up a two-sample experiment by
replicating each one of the two arrays.
<<>>=
m <- vector("list", length=6)
for (i in 1:3) {
  m[[i]] <- replicate_arraywide(sample_a_seed)
  m[[i+3]] <- replicate_arraywide(sample_b_seed)
}

m <- matrix(unlist(m), ncol=6)

sample_data <- data.frame(grp = rep(c('a', 'b'), c(3, 3)))
rownames(sample_data) <- 1:6
@ 

Creating an \verb+ExpressionSet+ is again obvious:
<<>>=

eset <- new("ExpressionSet",
            featureData = new("AnnotatedDataFrame", array_layout),
            exprs = m,
            phenoData = new("AnnotatedDataFrame", 
              sample_data))

@ 


<<fig=TRUE>>=

dataf <- eset2dataframe(eset)
library(lattice)
p <- densityplot( ~ exprs | grp, 
                 groups = dataf$array, 
                 data = dataf,
                 pch = ".")
print(p)

@ 


\end{document}