Purpose

Polyploid sequence based virtual breeding (pSBVB) is a modification of SBVB software (Pérez-Enciso et al. 2016, https://doi.org/10.1534/genetics.116.194878 ) that allows simulating traits of an arbitrary genetic complexity in polyploids. Its goal is to simulate complex traits and genotype data starting with a vcf file that contains the genotypes of founder individuals and following a given pedigree. The main output are the genotypes of all individuals in the pedigree and/or molecular relationship matrices (GRM) using all sequence or a series of SNP lists, together with phenotype data. The program implements very efficient algorithms where only the recombination breakpoints for each individual are stored, therefore allowing the simulation of thousands of individuals very quickly. Most of computing time is actually spent in reading the vcf file. Future developments will optimize this step by reading and writing binary mapped files. The vcf file may not contain missing genotypes and is assumed to be phased.

General Software’ View

General Software’ View

Main features

  • Any number of traits.
  • Tool adapted to work with both, auto and allo-polyploid organisms.
  • Any number of QTNs, trait specific.
  • Any number of additive and dominant effects.
  • Can generate a correlation matrix to modelate meiosis in polyploid especies.
  • Can generate correlated allelic effects and frequencies.
  • Efficient algorithms to generate haplotypes and sample SNP genotypes.
  • Computes genomic relationship matrices for any number of SNP arrays simultaneously.
  • It allows to compute Genomic relationship matrix in several ways.
  • Any number of chromosomes, allows for sex chromosomes and varying local recombination rates, that can be sex specific.

Installation

The source code, manual and examples can be obtained from https://github.com/lauzingaretti/pSBVB

To compile:

gfortran -O3 kind.f90 ALliball.f90 aux_sub11.f90 pSBVB.f90 -o sbvb -lblas

or

make

To install in /usr/local/bin

sudo make install

The program requires blas libraries but these are standard in any unix or OS mac system. We have tested pSBVB only in linux with gfortran compiler; intel ifort seems not working, but gfortran in mac OS looks ok.

Usage

As input, the software requires a file with chromosome, position in base pairs and phased genotypes as follow:

#Chrom Pos I1_c1 I1_c2 ... I1_ch ... In_c1 In_c2 ... In_ch
1     453   1     0   ...   0   ...   1     1   ...   1
1     530   1     1   ...   0   ...   0     1   ...   1
.......................................................
.......................................................
12   5000   0     1   ...   1   ...   1     0   ...   1   

Where chromosomes and positions should be integer numbers and sorted, and I1_c1, I1_c2,…,I1_ch refere to the genotypes of the first individual. For instance, if the organism is tetraploid, it could be 0 0 1 0, or 0 1 0 0, say. Alleles should be coded 0/1. If you have the standard input file the command is

 cat file.gen | pSBVB -i sbvb.par

A python script is provided to read directly vcf files

cat file.vcf | python3 vcftogen.py -p Ploidy | pSBVB  -i file.par

If you have genotype’s file where the genotype is coded as 0…h as follow:

#Chrom Pos I1    I2   ...    In
1     453   2     0   ...    3   
1     530   2     1   ...    h   
...............................
...............................
12   5000   0     2   ...   h-1   

where phase is unknown, then the python script generates a phase configuration, and linkage disequilibrium can be obtained by generating an individual genome. To run this option:

cat file.geno | python3 vcftogen.py -p Ploidy | pSBVB  -i file.par

By default, pSBVB expects phased genotypes without missing values. If missing values are found, the user can specify for genotypes to be sampled according to allele frequency. To run this option:

cat file.vcf | python3 vcftogen.py -m True -p Ploidy | pSBVB  -i file.par

To run the program with the same random seed:

… | pSBVB -isbvb.par –seed iseed

where iseed is an integer number

Phenotype simulation

pSBVB takes ploidy into account to generate the phenotypes and incorporates several options to generate the molecular relationship matrix that are pertinent to polyploids. In a diploid organism, the phenotype for \(i_{th}\) individual can be simulated from:

\(y_{i}=\mu + \sum_{j=1}^{Q} \gamma_{ij}\alpha_{j} + \sum_{j=1}^{Q} \delta_{ij}d_{j} + \epsilon_{i}\)

Where \(\mu\) is the mean general, \(\alpha\) is the additive effect of \(_{th}\) locus, that is, half the expected difference between homozygous genotypes, \(\gamma_{ij}\) takes values -1, 0 and 1 for homozygous, heterozygous and alternative homozygous genotypes, respectively. \(d_j\) is the dominance effect of \(j_{th}\) locus, and \(\delta_{ij}\) takes value 1 if the genotype is heterozygous, 0 otherwise, and \(\epsilon_i\) is a normal residual. For polyploids, the phenotype of individual \(i\) (\(y_i\)) (equivalent equation) is simulated from:

\(y_{i}=\mu + \sum_{j=1}^{Q} \eta_{ij}\alpha_{j} + \sum_{j=1}^{Q} \phi_{ij}d_{j} + \epsilon_{i}\)

where \(\eta_{ij}\) is the number of copies of the alternative allele (coded say as 1) minus half the ploidy (\(h/2\)) for \(j\)-th locus and \(i\)-th individual, and \(\alpha_j\) is therefore the expected change in phenotype per copy of allele ‘1’ in the \(j\)-th locus. In polyploids, as many dominance coefficients as ploidy level (\(h\)) minus two can technically be defined. However, this results in an over-parameterized model that is of no practical use. Here instead we define the \(\phi_{ij}\) parameter as the minimum number of copies of allele 1 such that the expected phenotype is \(d_j\). By default, pSBVB uses \(\phi_{ij}=1\) , that is, any genotype having at least one allele ‘1’ and ‘0’ has the expected phenotypic value \(d_j\). (see Figure 1 on Zingaretti, et al., 2018)

Parameter file

The parameter file controls all pSBVB behavior. It consists of a list of sections in UPPER CASE (in any order) followed in the following line by the required data e.g.,

QTNFILE

sbvb.qtl

tells the program that QTN specifications are in sbvb.qtl file. Comments can be mixed starting with # or ! A full list of options in the parameter file is in Appendix).The main ones are:

The number of traits to be generated is especified (by default is one):

NTRAIT

ntraits

in parameter file. Otherwise this section is not needed. pSBVB requires the user to provide the list of causal SNPs (QTNs) as specified in QTNFILE section. The format of the QTN file is an orderedfile containing:

chrom pos

WARNING: chromosome ids must be integer consecutive numbers

or

chrom pos add_eff_Trait_1  add_eff_Trait_2 ... add_eff_Trait_n

or to especified additive and dominant effects:

chrom pos add_eff_Trait_1  add_eff_Trait_2 ...add_eff_Trait_n dom_eff_Trait_1  dom_ef_Trait_2  ...dom_eff_Trait_n

where chr is chromosome and pos is position in base pair, \(add_{eff}\) is additive effect, i.e the effect of homozygous alleles and \(dom_{eff}\) is the heterozygous effect.

NOTE: Irrespective of the number of traits, the complete list of QTNs must be specified- If a QTN does not affect a phenotype, add and dom effects should be set to 0 for that trait.

WARNING: QTN position must coincide with one SNP position in the vcf file, otherwise it is not considered.

If QTN effects are not provided, they can be simulated specifying using an uniform, normal or gamma distribution. QTNDISTA is to simulates adittive effects, whereas QTNDISTD controls dominant ones.

QTNDISTA

[u a b] [n mu var] [g s b]

and

QTNDISTD

[u a b] [n mu var] [g s b]

in parameter file.

where ‘u a b’ means effects are sampled from a uniform distribution \(U\sim(a,b)\), ‘n mu var’ from a normal distribution \(N\sim(mu,var)\) and ‘g s b’ from a gamma \(\gamma \sim (s,b)\). For a gamma distribution, you can specify the probability p that a derived allele decreases the phenotype with:

PSIGNQTN

p

The default value is 50%. By default, effects are sampled independently of frequency, i.e., half effects are + and the rest are -, but it is possible to generate a correlation (rho) using the following parameter:

RHOQA

rho

where rho is the desired correlation between QTN additive effects and their frequencies. This option can be useful to simulate past selection (Pérez-Enciso et al., 2016), since selection induces a negative correlation between frequency and effect. By default, rho is 0.

For instance, if you want to simulate the additive effects using a Uniform distribution (\(U\sim(0.2,0.6)\)), your parameter file should have a section as:

QTNDISTA

u 0.2 0.6

The broad sense heritability is specified as:

H2G

h2

The genotypes from the base population (in the vcf file) are used to adjust the environmental variance such that the heritability is as desired.

For multiple traits, the fields H2 or H2G, RHOQA, and QTLDISTA and QTLDISTD must be repeated, eg, for two traits:

H2G

0.5

0.23

RHOQA

0

-0.4

QTNDISTA

u -0.2 0.2

g 1 0.5

which means that the firts trait should have a heritability of \(0.5\), no correlation between genetic effects and frequencies and that addtive effects are sampled from an uniform distribution \((0.2,0.2)\), whereas the second trait have a heritability of \(0.23\), a negative correlation \(-0.4\) between genetic effects and frequencies with genetic effects following a gamma distribution with parameters \((1,0.5)\).

Recombination in polyploids

The ploidy level must be specified with section

PLOIDY

h

in the parameter file. By default, pSBVB assumes autopolyploidy and permits recombination between each homeolog chromosome pair with equal probability. Strict alloploidy is specified with

ALLOPLOIDY

Intermediate rates of recombination between homeolog pairs can be specified with

RHOPLOIDY

rho_elements

where rho_elements is a vector of hxh elements specifying the probability of recombination between i and j homeologs. The diagonal (P of recombining with itself) is set to 0 by the program. For instance, if the organism is tetraploid, you can set rho as:

\[ \begin{bmatrix} 0 & 0.5 & 0.2 & 0.3 \\ 0.5 & 0 & 0.3 & 0.2 \\ 0.2 & 0.3 & 0 & 0.5 \\ 0.3 & 0.2 & 0.5 & 0 \\ \end{bmatrix} \]

Pedigree file (PEDFILE)

The format is id id_father id_mother

where all ids must be consecutive integers, \(0\) if father or mother unknown. The number of individuals in the vcf file must be specified with section:

NBASE

nbase

in the parfile. The pedigree file must contain the first rows as

  1   0  0 
  2   0  0 
 ...  0  0 
nbase 0  0  

that is, those in vcf file are assumed to be unrelated.

Recombination map files (MAPFILE)

By default, pSBVB assumes a cM to Mb ratio of 1. This ratio can be changed genomewide with CM2MB section in the par file. In addition, local recombination rates can be specified with the MAPFILE section. The mapfile takes format

MAPFILE

chr  last_bp  local_cm2mb 

where local_cm2mb is the recombination rate between last_bp and previous bound (1 bp if first segment) , or

chr last_bp local_cm2mbMales local_cm2mb_females

SNP array files (SNPFILE)

pSBVB can compute the genomic relationship matrix (GRM) using all sequence data and/or specific SNP subsets to mimic different genotyping arrays. The lists of SNPs are specified in the SNPFILE sections. Several SNP lists can be analyzed in the same run repeating the SNPFILE section in the par file. Each SNP file has the same format as the QTN file, i.e., chromosome and base pair position, as idicated:

chrom pos 

Output

Output is controlled in the parameter file with sections OUTGFILE, OUTYFILE, OUTQFILE, OUTGFILE and OUTMFILE:

GRM files (GRMFILE)

pSBVB computes one GRM for each of the SNP files plus the sequence data. By default, equation 1 in Zingaretti, et al. (2018). if you add the command MIMIC_DIPLOID, Genomic matrix is computed mimic diploid. Only 0, 1 and 2 or more copies of a given allele can be distinguished. In this case, all genotypes with values larger than 2 are assumed to be observed as ‘2’,then molecular matrix has elements ranging between 0 and 2 and ploidy is set to 2.

if you add MIMIC_HAPLOID to parameter file, it is assumed that only one full homozygous can be distinguished for the rest of genotypes, then the molecular matrix has elements ranging between 0 and 1 and ploidy is set to 1. See Zingaretti et al for details.

OUTYFILE

Specifies the name of the phenotype / genotype file, which has format:

 id y  [(add__eff)_i , i=1,..,ntraits] [(add_eff+ dom_eff)_i, i=1,..,ntraits]

where \(add_{eff}\) is the first sum in equation of pSBVB software, shows above and \(dom_{eff}\) is the second term. For several traits, first are printed all add effects for every trait, following add+dom.

OUTQFILE format (contains QTN info):

 chr pos freq_{base} [(add_eff_i) i=1,..,ntraits] [(dom_eff_i) i=1,..,ntraits]

where \(chr\) is chromosome, \(pos\) is \(QTN\) bp position, \(freq_{base}\) is frequency in .vcf file, freq is frequency along the pedigree, plus additive, dominant effects and add variance \((2pq\alpha^2)\) contribution for each locus by trait.

OUTGFILE format (contains GRM, one per SNPFILE plus sequence)

A matrix of \(n \times n\), where \(n\) is the number of individuals in the pedigree. As many outgfiles as snpfiles are written with subscripts .1, .2 etc. .0 corresponds to sequence. To avoid using sequence, add NOSEQUENCE command in parfile.

OUTMFILE

format (contains genotypes for evey SNP file and sequence, in plink format optionally using OUTPLINK in parfile). As many outmfiles as snpfiles are written with subscripts .1, .2 etc. .0 corresponds to sequence. To avoid using sequence, NOSEQUENCE in parfile

Outqfile, outqtn, GRM and marker files are written only if the respective sections OUTQFILE, OUTGFILE and OUTMFILE appear in the .par file. Note in particular that OUTMFILE with sequence can be huge! To avoid printing sequence info, use

NOSEQUENCE

in par file.

NOTE: To compress marker output, include GZIP option in parfile.

Restart the program keeping the same haplotypes

Sometimes one can be interested in running the same experiment but with different genetic architectures or different SNP arrays. The program offers two convenient ways to do this as it may keep track of haplotypes so exactly the same genetic structure is preserved, RESTART and RESTARTQTL options in .par file.

1.With RESTART, haplotypes, phenotypes and QTN effects are preserved. This is useful to implement selection.

2.With RESTARTQTN, haplotypes are preserved but phenotypes and QTN effects are sampled again. RESTARQTN can be used to run different genetic architectures in the same haplotypes so results can be exactly comparable across models.

The program then writes a .hap file that contains all haplotype structure the first time is run. When pSBVB is called again with say another SNPFILE, then individuals have the same haplotypes as in previous runs and a new GRM can be generated with the new SNP file. An important application is to run selection. In fact, pSBVB can be run with different pedigree files and the RESTART option. pSBVB generates only new haplotypes for those individuals not in current .hap file. In a selection scheme, the user should add a new generation pedigree to current pedfile with the offspring of selected individuals. In the new run, pSBVB generates haplotypes and phenotypes for the new offspring.

IMPORTANT: The .hap file is used only if RESTART is included in parfile. If no .hap file is present, a new one is generated the first time. You can check that RESTART is in use checking, e.g, that all phenotypes are the same in different runs.

WARNING: RESTARTQTN is logically not suitable for selection, since effects are sampled anew in each run.

Expanding the base population

Very often, complete sequence is available only for very few individuals. pSBVB implements an automatic option to generate additional individuals by randomly crossing the available ones and random breeding for a pre specified number of generations. To use this feature, the pedigree file must contain larger number of individuals with unknown parents than in the vcf file. For instance, assume your vcf file contains only four individuals and the pedfile is

  1   0  0 
  2   0  0 
 ...  0  0 
 20   0  0 
 21   1  7 
 ...  .  . 

Then individuals 5-20 are generated by randomly crossing 1-4 ids, from id 21 onwards, normal pedigree gene dropping is implemented. The option in parfile is

EXPAND_BASEPOP

ntgen nfam 

which means that the new individuals are generated by crossing nfam individuals of the vcf file for ntgen generations.

Toy Examples

Prepare your files

To facilitate the software’ usability, we have written some additional functions using R and python to generate the requested files (pedigree, qtn, chip) and to transform vcf file into .gen. We also incorporate two functions to generate the numerator relationship matrix and to performed predictive ability (PA) from GBLUP model.

Function to create a pedigree to use as .ped file

Pedigree file is required to simulate. The number of founder has to be equal or higher than the number of individuals in .gen (or .vcf) file. The founders are those individuals with genomic information.

We provide a R function to generate pedigree file. You can choose any number of founders, generations and individuals by generation.

The following pedigree file has 47 founders, 8 generations, 100 individuals of each generation from 1 to 7.

source("/Additional_functions/pedigree.R")
M<-pedgenerator(100,4,c(rep(100,3),150),
   path="/path_to_file/",exclude=47)

To generate a pedigree based relationship matrix

We implemented a R function to generate a Relationship matrix to compare with genomic relationship matrix. Our function could be used to compute both, additive and dominant relationship matrices. As input, a pedigree file is needed.

source("/Additional_functions/RelationshipMatrix.R")
data<-read.table("Path_to_File_st.ped",header=FALSE)
A<-RelMatrix(data,dominance=FALSE,path= "path_to_file"")

Run pSBVB

Once you have a genotypes, a pedigree, a list with SNPs and a Chip file. You can to create several .par files to perform simulations with different options.

Here, we give two toy dataset with examples, which are in /toy_potato and /toy_strawberry folders.

You could run several examples simultaneosly, by executing the .sh files in /toy_potato and /toy_strawberry folders as indicates here:

cat toy_st.gen | sbvb -i file_1.par

or using a .sh file:

cd /path_to_/toy_strawberry (or path_to/toy_potato)
./run_toy_st.sh 

The following images show how it works:
Run the program from command line

Run the program several times simultaneosly using a bash script

Run the program several times simultaneosly using a bash script

GBLUP prediction

There are numerous GS methods that use genome-wide markers to predict breeding values and address the large p small n problem. Here, breeding values were predicted using G-BLUP, which is a genomic-based extension of traditional pedigree (P-BLUP), which is widely used to predict breeding values in animal and plant breeding (Henderson, 1984) and it only uses the pedigree information. Moreover, G-BLUP is equivalent to Ridge Regression Model (RR-BLUP). The model is:

\[\begin{equation} \mathbf{y}=\mathbf{1^T}\mu+\mathbf{Z}u+\epsilon_i \label{eq:equation5} \end{equation}\]

The R script to perform Predictive abilities is /Additional_Functions/GBlupFunction.R

Toy Potato dataset

We used a subset of 396 SNPs and 150 individuals from Enciso-Rodriguez et al. (2018) with genotypes coded between 0 and 4 (the potato ploidy level). We used these genotypes to generate a vcf- like file with random phases. SNP positions were obtained from Rosyara et al.

source("/Additional_Functions/pedigree.R")
source("/Additional_Functions/RelationshipMatrix.R")
source("/Additional_Functions/GBlupFunction.R")
source("/Additional_Functions/generate_vcf_from_gen.R")

We created a function to transform genotypes into vcf format. To do that, a dataset with genotypes (data varying between 0 and ploidy level) and a map file containing the SNP’s physical coordinates are needed. We used genotypes from potato database (https://figshare.com/articles/Supplemental_Material_for_Enciso-Rodriguez_et_al_2018/6262214). Note that our function randomly generates the phases in the vcf file. In order to generate linkage disequilibrium, we used pSBVB parameters EXPAND_BASEPOP and INDFIRST, which simultaneously exclude the initial individuals and generate a new set of founders.

Furthermore, the toy_potato/FilesGenerator.R file have all the functions needed to create a Chip, pedigree, QTNs location files and run potato example.

# go to file containers folder
# setwd("/toy_potato")
map<-read.table("Sol_tet_mapa.map",sep="\t")
#read genotype
G<-read.delim("Pot_gen.gen",sep="\t")
#select a sub-sample from whole dataset 
G=G[sample(c(1:nrow(G)),150),]
G=G[,sample(c(1:ncol(G)),500)]
G<-G[,colnames(G)%in%map[,1]]
dim(G)
## [1] 150 400
#generate genotypes vcf format 
A=GenotoVcf(G,p=4,map,path="NULL")
## Loading required package: data.table
library(ggplot2)
# data<-read.table("File_st.ped",header=FALSE)

#check the dimensions of dataset 
#generate relationship matrix
dim(data)
## [1] 700   3
A<-RelMatrix(data,dominance=FALSE)

## make sure you're in the folder toy_potato 

#G<- read.table("/results_potato/Y.grm.1")
#y<- read.table("/results_potato/Y.outy")
RelMatrix<-A[-c(1:150),-c(1:150)]


h2=0.5
ntraits=1
# split testing and training(the last 150 individuals are used to test)
make_predictions=c(401:550)
S<-GBlup_predict(G=G,y=y,h2=h2,make_predictions=make_predictions,ntraits=ntraits)
S$rho
## [1] 0.4852854
Phenohat<-S$uhat[make_predictions] + S$mean
Pheno<-S$yout_corr[make_predictions]

datos<-data.frame(Phenohat,Pheno)
colnames(datos)<-c("Yhat","Y")

#plotting predictions
ggplot(datos, aes(y=Yhat, x=Y,colour="red")) +
  ggtitle("Estimated values for testing set - Genomic matrix")+
  geom_point(size=0.6) + scale_shape_manual(values=c(2,4)) + 
  geom_abline(intercept=lm(Yhat ~ Y,data=datos)$coefficients[1], slope=lm(Yhat ~ Y,data=datos)$coefficients[2])+
  #scale_x_continuous(limits = c(min(datos[,1]), max(datos[,1])))+
  #scale_y_continuous(limits = c(min(datos[,2]), max(datos[,2])))+
  
  theme(panel.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), legend.position ="none")

#Assesing PA using Relationship Matrix
SR<-GBlup_predict(G=RelMatrix,y=y,h2=h2,make_predictions=c(401:550),ntraits=ntraits)
SR$rhoM
## [1] 0.446899
#plotting predictions with relationship matrix
Pheno<-SR$yout_corr[make_predictions]
datos_r<-data.frame(Phenohat,Pheno)
colnames(datos_r)<-c("Yhat","Y")

ggplot(datos_r, aes(y=Yhat, x=Y,colour="red")) +
  geom_point(size=0.6) + scale_shape_manual(values=c(2,4)) + 
  geom_abline(intercept=lm(Yhat ~ Y,data=datos_r)$coefficients[1], slope=lm(Yhat ~ Y,data=datos_r)$coefficients[2])+
   ggtitle("Estimated values for testing set - N Relationsip matrix")+
  theme(panel.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), legend.position ="none")

Toy Strawberry dataset

Running toy_strawberry using three heritabilities parameters

The first example (in path /toy_strawberry/file_1.par) incorporate three “H2G” parameters of 0.3, 0.4 and 0.5 and 150 causals SNPs to simulate the phenotype. The three phenotypes with additive effects are simulated simultaneosly. The Genomic Relationship matrix is generated using G- total, then the M values varying between 0 and 8. We show the PA obtained by the thirth output:

library(ggplot2)
## make sure you're in the folder toy_strawberry and you has been executed /run_toy_st.sh file, which generates all examples folder and outputs 
#G_1<- read.table("/results_example1/Y.grm.1")
#y_1<- read.table("/results_example1/Y.outy")

h2=c(0.3,0.4,0.5)
ntraits=3
#select training population 
make_predictions=c(353:503)

S<-GBlup_predict(G=G_1,y=y_1,h2=h2,make_predictions=make_predictions,ntraits=ntraits)

#Predictive ability computed with GModel 
S[[3]]$rho
## [1] 0.5219936
Phenohat<-S[[3]]$uhat[make_predictions] + S[[3]]$mean
Pheno<-S[[3]]$yout_corr[make_predictions]

datos<-data.frame(Phenohat,Pheno)
colnames(datos)<-c("Yhat","Y")

ggplot(datos, aes(y=Yhat, x=Y,colour="red")) +
  geom_point(size=0.6) + scale_shape_manual(values=c(2,4)) + 
  ggtitle("Estimated values for testing set - Genomic matrix") +
  geom_abline(intercept=lm(Yhat ~ Y,data=datos)$coefficients[1], slope=lm(Yhat ~ Y,data=datos)$coefficients[2])+
  theme(panel.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), legend.position ="none")

Compute predictive ability from model with dominant effects

This model includes dominant effects from an uniform distribution with parameters a=0.2, b=0.6

# read G and y from /toy_strawberry/results_example4/
#G_2<- read.table("Y.grm.1")
#y_2<- read.table("Y.outy")
# read relationship matrix (R) from /toy_strawberry folder 
#R<-read.table("Relationship.mat")

# to eliminate the base population (first 47 individuals) 
R<-R[c(48:550),c(48:550)]

h2=0.5
ntraits=1
# split test and train 
make_predictions=c(354:503)

#genetic and predigree based model
S<-GBlup_predict(G=G_2,y=y_2,h2=h2,make_predictions=c(354:503),ntraits=ntraits)
SR<-GBlup_predict(G=R,y=y_2,h2=h2,make_predictions=c(354:503),ntraits=ntraits)
## print the PA from GENETIC MODEL 
S$rhoM
## [1] 0.4177179
# print the PA from RELATIONSHIP MODEL
SR$rhoM
## [1] 0.5063592

Computing PA from MIMIC_DIPLOID G matrix

This example includes a heritability parameter of 0.5 and 150 QTN’s. The Genetic matrix was computed with MIMIC_DIPLOID option.

# read G and y from /toy_strawberry/results_example2/
#G_3<- read.table("Y.grm.1")
#y_3<- read.table("Y.outy")
# Numerator relationship matrix is the same than before 

h2=0.5
ntraits=1
make_predictions=c(354:503)

S<-GBlup_predict(G=G_3,y=y_3,h2=h2,make_predictions=c(354:503),ntraits=1)
SR<-GBlup_predict(G=R,y=y_3,h2=h2,make_predictions=c(354:503),ntraits=1)
#correlation with Genetic Matrix
S$rhoM
## [1] 0.4968042
#correlation with Numerator relationship matrix
SR$rhoM
## [1] 0.4671476

Computing PA from MIMIC_HAPLOID G matrix

This snippet includes h2= 0.5 and 150 QTN’s The Genetic matrix was computed with MIMIC_HAPLOID option.

# example from G generated through mimic haploid option  (results on /toy_strawberry/results_example8)
#G<- read.table("Y.grm.1")
#y<- read.table("Y.outy")
#Numerator relationship matrix is always the same. 

h2=0.5


S<-GBlup_predict(G=G_4,y=y_4,h2=h2,make_predictions=c(354:503),ntraits=1)
SR<-GBlup_predict(G=R,y=y_4,h2=h2,make_predictions=c(354:503),ntraits=1)
#Genomic PA
S$rhoM
## [1] 0.3644044
#Numerator Relationship PA
SR$rhoM
## [1] 0.3118678

Strawberry complete dataset

The complete GBS data set is in Data_strawberry folder in GitHub (https://github.com/lauzingaretti/pSBVB).

Citation

L. Zingaretti, A. Monfort, M. Pérez-Enciso. 2018. pSBVB: a versatile simulation tool to evaluate genomic selection in polyploid species. Manuscript submitted for publication.

Pérez-Enciso, M., Forneris, N., de los Campos, G., & Legarra, A. (2017). Evaluating sequence-based genomic prediction with an efficient new simulator. Genetics, 205(2), 939-953.

Appendix (full list of options in parameter file)

NTRAIT #–> specifies no. of traits (int, [0])

PLOIDY #–> specifies ploidy (int, [2])

QTLFILE !–> file with qtl posns (chr& bp) add &dom effects can be defined in cols 3 & 4 (str)

SNPFILE !–> file with genotyped snps: chr, bp, can be repeated

MAPFILE !–>recomb map file: chr, basepos, cm2Mb [cm2Mb_sex2]

HAPFILE !–> hap structure so program can be restarted with RESTART

OUTPLINK !–> prints mkr in plink tpedformat

OUTGFILE !–> GRM outfile

OUTQFILE !–> output qtl file out_q_file

OUTYFILE !–> y outfile outyfile

GZIP !–> compress output files

NBASE !–>nind which genotypes are read from .vcfor .gen file nbase

H2G !–> broad heritability, repeated if multiple traits

RHOQA !–> desired correlation between allele effect and frequency, repeated if multiple traits

SIGNQTN !–> P of derived allele being deleterious (only with gamma) [0.5]

QTLDISTA !–> QTL add effects are sampled from a distribution: u(niform), g(amma), n(ormal) [u, l_bound, u_bound] [n, mu, var] [g, s, b] ! repeated if multiple traits

QTLDISTD !–> QTL dom effects are sampled from a distribution [u, l_bound, u_bound] [n, mu, var] [g, s, b] ! repeated if multiple traits

CM2MB !–> cM to Mb rate, default cm2mb [1.0]

MXOVER !–> Max no xovers, default 3

RESTART !–> prepares files for new run of sbvb

RESTARTQTL !–> restart qtl effects but keeps haplotype structure

NOPRINTHAP !–> does not print hap file, eg, if no new haplotypes have been generated

NOSEQUENCE !–> does not use sequence for GRM,

EXPAND_BASEPOP !–> breeds new base individuals involving random mating for ntgen generations ! from nfam families

ALLOPOLYPLOID !–> if PLOIDY is higher than 2 and you want to simulate an allopolyploid organism.

MIMIC_DIPLOID !–> this option assumes than only presence or absence of the alternative allele can be ascertained for genotypes’ values higher than 2 (only should be on if organism is polyploid)

MIMIC_HAPLOID !–> Assuming that only one allele can be distinguished for the others, i.e., that a given marker allele behaves as fully dominant.