Tutorial

Introduction of main functions of G2P


This section is intended for running main functions of G2P.

Files input and data preprocessing

Format conversion of genotypic data

  • GSTransForm
  • Description
    The function can only be used in G2P-sif. By calling external software like VCFtools Plink and TASSEL etc for the genotype format conversion.
  • Usage
    GSTransForm(pattern, input, output, workdir = "./", memory = 8 )
  • Arguments
    • pattern(character) pattern of transformation, including "hmp2plink", "plink2vcf", "vcf2hmp", "vcf2plink", "plink2hmp", and "hmp2vcf".
    • input(character) name of original file, prefix for plink, full name for Hapmap format and VCF format.
    • output(character) name of original file, prefix for all.
    • workdir(character) set working directory.
    • memory(integer) set memory size.

Example

data("cubic_hmp")
setwd("./")
write.table(hmp,file = "test.hmp.txt",row.names = F,col.names = T,sep = "\t",quote = F)

## transformation ##
GSTransForm(pattern="hmp2plink",input="test.hmp.txt",output="h2p",workdir="./")

Filter genotypic data

  • GSFiltering
  • Description
    The function can only be used in G2P-sif that can raduce genotypic data by calling Plink.
  • Usage
    GSFiltering(input, output , inputformat = "hmp", missingRate = 0, maf = 0.05,window = 100,step = 5,r2 = 0.1)
  • Arguments
    • input(character) name of original file, prefix for plink, full name for Hapmap format and VCF format.
    • output(character) name of original file, prefix for all.
    • inputformat(character) file format of input file, "hmp" for hapmap,"vcf" VCF or "plink" for bed, bim, fam.
    • missingRate(numeric [0-1]) filters out all variants with missing call rates exceeding the provided value to be removed.
    • maf(numeric [0-1]) filters out all variants with minor allele frequency below the provided threshold (default 0.05).
    • window(numeric or character) window size in variant count (numeric) or kilobase (if the 'kb' modifier is present).
    • step(numeric) variant count to shift the window at the end of each step.
    • r2(numeric) pairwise r2 threshold: at each step.
    • workdir(character) set working directory.
    • memory(integer) set memory size.

Example

## load example data ##
data("cubic_hmp")
setwd("./")
write.table(hmp,file = "test.hmp.txt",row.names = F,col.names = T,sep = "\t",quote = F)

## filter 
GSFiltering(input = "test.hmp.txt", output = "filter_res", 
            inputformat = "hmp", missingRate = 0, 
            maf = 0.05,window = 100,step = 5,r2 = 0.1)

Imputation of the larger genotypic data

  • GSImputation
  • Description
    The function can only be used in G2P-sif that can execute imputation of genotypic data by calling Beagle.
  • Usage
    GSImputation(input, output, inputformat, workdir = "./", memory = 8,threads = 1,window = 40,overlap = 2)
  • Arguments
    • input(character) name of original file, prefix for plink, full name for Hapmap format and VCF format.
    • output(character) name of original file, prefix for all.
    • inputformat(character) file format of input file, "hmp" for hapmap,"vcf" VCF or "plink" for bed, bim, fam.
    • memory(numeric) set memory size, GB.
    • threads(numeric) specifies the number of computational threads that will be used.
    • window(numeric) specifies the cM length of each sliding window (default: 40).
    • overlap(numeric) specifies the cM length of overlap between adjacent sliding windows(default: 2).
    • workdir(character) set working directory.

Example

## load example data ##
data("cubic_hmp")
setwd("./")

## generating NA to data
hmp_NA <- as.matrix(hmp[,-c(1:11)])
hmp_NA[sample(1:length(hmp_NA),1000000)] <- "NN"
hmp_NA <- cbind(hmp[,1:11],hmp_NA)
write.table(hmp_NA,file = "test_NA.hmp.txt",row.names = F,col.names = T,sep = "\t",quote = F)

## imputation 
GSImputation(input = "test_NA.hmp.txt", output = "impute_res_NA",
             workdir = "./", inputformat = "hmp")

Fast read of the larger genotypic data

  • GSRead
  • Description
    The function can only be used in G2P-sif that can fast read the larger genotypic data in PLINK Hapmap and VCF format, convert it to numeric format (0 1 2) by calling Plink.
  • Usage
GSRead(input, output, inputformat, workdir = "./", memory = 8)
  • Arguments
    • input(character) name of original file, prefix for plink, full name for Hapmap format and VCF format.
    • output(character) name of original file, prefix for all.
    • inputformat(character) file format of input file, "hmp" for hapmap,"vcf" VCF or "plink" for bed,bim,fam.
    • memory(numeric) set memory size, GB.
    • workdir(character) set working directory.

Example

## load example data ##
library(G2P)
data("cubic_hmp")
setwd("./")
write.table(hmp,file = "test.hmp.txt",row.names = F,col.names = T,sep = "\t",quote = F)

## filtering 
G <- GSRead(input = "test.hmp.txt", output = "read_tmp", 
     inputformat = "hmp", workdir = "./", memory = 8)

## show genotypic in numeric
dim(G)
G[1:10,1:10]

fig 10

The example results of GSRead

Data quality control of genomic selection

  • GSDataQC
  • Description
    Function for summary and quality control of GS data.
  • Usage
    GSDataQC(markers, phenotype, impute = F, filter = F, NABound = 0.8, MAFBound = 0.05, imputeMethod = "median",hete = 1,silent =F)
  • Arguments
    • markers(numeric, matrix) genotypic data, row represents sample and column represents SNP (feature). Missing (NA) alleles are allowed
    • phenotype(data.frame) phenotypic data. The first column must be the sample name.
    • phenoPhenotypic data frame, the first column describes sample IDs.
    • impute(logical) if TRUE, imputation, default FALSE.
    • filter(logical) if TRUE, filtering genotypic data with MAF and missing rate.
    • NABound(numeric, [0,1]) max missing value percentage.
    • MAFBound(numeric, [0,1]) min MAF percentage in each marker.
    • imputeMethod(character) the method of imputation, "mean" or "median", default "median".
    • hete(numeric) if genotypic matrix coded with numeric or transformed to numeric from letter, this parameter is used for MAF summary which define the numeric code of heterozygote, default 1 among 0,1,2.
    • silent(logical) TRUE for hiding progress bar and FALSE for showing progress bar.

Example

data("cubic")
## generate missing value
misIndex <- sample(1:length(Markers),1000000)
Markers[misIndex] <- NA

## GSDataQC, not impute ##
QCRes <- GSDataQC(markers = Markers, phenotype = pheData, impute = T)

## GSDataQC, impute with methods mean ##
QCResImpute <- GSDataQC(markers = Markers, phenotype = pheData, impute = T, 
                        imputeMethod = "mean")

Genomic selection and model evaluation

Genotype-to-phenotype prediction

  • G2P
  • Description
    The function fits genomic selection models, performs prediction and exports the prediction value of testing sets.
  • Usage
    G2P(markers,data,fix = NULL,trait,trainIdx,predIdx,modelMethods ="BayesA", outputModel = FALSE,NAImpute = FALSE, nIter = 1500, burnIn = 500, thin = 5, saveAt = "", S0 = NULL, df0 =5, R2 = 0.5, weights = NULL,verbose = FALSE, rmExistingFiles = TRUE, groups=NULLimportance = FALSE,posPercentage = 0.4, BestIndividuals = c("top"),ntree = 500,nodesize = NULL,kernel = c("linear"), gamma = 1, cost = 2^(-9), K = 8,eta = 0.7,select = "pls2",fit = "simpls", scale.x = FALSE,scale.y = FALSE,eps = 1e-4,maxstep = 100, parameters,alpha = 1,X = NULL,family = gaussian(link = identity), lambda = NULL, tol.err = 1e-6, tol.conv = 1e-8,epochs = 30, neurons = 4, ...)
  • Arguments
    • markers(numeric, matrix) genootypic data, row represents sample and column represents SNP (feature).Genotypes generally be coded as {0,1,2;0 represents AA(homozygote),2 represents BB(homozygote) and 1 represents AB(heterozygote); missing (NA) alleles are not allowed.
    • data(data.frame) phenotypic data and other information for data set.
    • fix(character, array) names of data that used as fix effects, numeric, NAs are not permitted,
    • testPheno(numeric) the phenotype value of test population individual, default NULL.
    • trait(character) the trait name in data to fit model and perform genotype-to-phenotype prediction.
    • trainIdx(numeric, array) index of training set in marker and data.
    • predIdx(numeric, array) index of testing/prediction set in marker and data.
    • modelMethods(character) 16 alternative models to fit, including "BayesA", "BayesB", "BayesC", "BL", "BRR", "RKHS", "RRBLUP","LASSO", "SPLS", "SVC", "SVR", "RFR", "RFC", "RR", "RKHS", "BRNN"
    • outputModel(logical) if TRUE, return the list of trained model and prediction results, default FALSE.
    • NAImpute(logical) if true, the missing values in phenotype will be imputed by mean otherwise be removed.
    • nIter,burnIn,thin(integer) the number of iterations, burn-in and thinning,default nIter 7000,burnIn 500,thin 5.
    • saveAt(string) this may include a path and a pre-fix that will be added to the name of the files that are saved as the program runs,default "".
    • S0,df0(numeric) the scale parameter for the scaled inverse-chi squared prior assigned to the residual variance, only used with Gaussian outcomes.Default S0 NULL,df0 = 5.
    • R2(numeric, (0,1)) the proportion of variance that one expects, a priori, to be explained by the regression. Only used if the hyper-parameters are not specified.Defult 0.5
    • weights(numeric) a vector of weights, may be NULL. If weights is not NULL, the residual variance of each data-point is set to be proportional to the square of the weight. Only used with Gaussian outcomes.
    • verbose(logical) if TRUE the iteration history is printed, default FALSE.
    • rmExistingFiles(logical) if TRUE, removes existing output files from previous runs, default TRUE.
    • groups(factor) a vector of the same length of y that associates observations with groups, each group will have an associated variance component for the error term.
    • ntree(integer) ramdomforest parameter. Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times,default 500.
    • nodesize(integer) ramdomforest parameter Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time). Note that the default values are different for classification (1) and regression (5).
    • posPercentage(numeric,[0,1]) phenotype of extreme individuals which expected, default 0.4.
    • BestIndividuals(character) the position of expected phenotype in whole phenotypic data set."top","buttom" or "middle",default "top".
    • kernel(numeric) svm parameter the kernel used in training and predicting. You might consider changing some of the following parameters, depending on the kernel type.(linear,polynomial,sigmoid,radial). Default "linear".
    • gamma(numeric)svm parameter parameter needed for all kernels except linear, default 1.
    • cost(numeric)svm parameter cost of constraints violation, default: 2^(-9), it is the 'C'-constant of the regularization term in the Lagrange formulation.
    • K(integer) SPLS parameter: number of hidden components.
    • eta(numeric) SPLS parameter: thresholding parameter. eta should be between 0 and 1.
    • select(character) SPLS parameter: PLS algorithm for variable selection. Alternatives are "pls2" or "simpls". Default is "pls2".
    • fit(character) SPLS parameter: PLS algorithm for model fitting. Alternatives are "kernelpls", "widekernelpls", "simpls", or "oscorespls". Default is "simpls".
    • scale.x(character) SPLS parameter: scale predictors by dividing each predictor variable by its sample standard deviation?
    • scale.y(character) SPLS parameter: scale responses by dividing each response variable by its sample standard deviation?
    • eps(character) SPLS parameter: an effective zero. Default is 1e-4.
    • maxstep(integer) SPLS parameter: maximum number of iterations when fitting direction vectors. Default is 100.
    • alpha(numeric) LASSO model parameter: the elasticnet mixing parameter.Detail in glmnet.
    • familythe distribution family of y, see help('family') for more details.
    • lambdathe shrinkage parameter determines the amount of shrinkage. Default is NULL meaning that it is to be estimated along with other model parameters.
    • tol.errinternal tolerance level for extremely small values; default value is 1e-6.
    • tol.convtolerance level in convergence; default value is 1e-8.
    • neurons(integer) indicates the number of neurons,defult 4.
    • epochs(integer) maximum number of epochs(iterations) to train, default 30.

Example

## load example data ##
data("cubic")

## perform genotype to phenotype prediction with all integrated methods ## 
C2Pres <- G2P(markers = Markers,data = pheData,trait = "DTT",
               modelMethods = c("BayesA", "BayesB", "BayesC", "BL", "BRR", 
                                "RKHS", "RRBLUP","LASSO", "SPLS", "SVC", "SVR", "RFR", "RFC",
                                "RR", "RKHS", "BRNN"),trainIdx = 1:200,predIdx = 201:400)

## perform genotype to phenotype prediction with all integrated methods and output model ## 
C2Pres_outmodel <- G2P(markers = Markers,data = pheData,trait = "DTT",outputModel = T,
               modelMethods = c("BayesA", "BayesB", "BayesC", "BL", "BRR", 
                                "RKHS", "RRBLUP","LASSO", "SPLS", "SVC", "SVR", "RFR", "RFC",
                                "RR", "RKHS", "BRNN"),trainIdx = 1:200,predIdx = 201:400)

## add other information as fix effects ## 
## fix effects not permit NA, so impute NA first.
data <- pheData
data <- apply(data[,-1], 2,function(x){x[which(is.na(x) == T)] <- mean(x,na.rm = T);x})
data <- as.data.frame(data)
data$EW <- pheData$EW

## perform genotype to phenotype prediction ## 
C2Pres_fix <- G2P(markers = Markers,data = data,trait = "EW",outputModel = T,fix = c("DTT","PH"),
                  modelMethods = c("BayesA", "BayesB", "BayesC", "BL", "BRR", 
                                   "RKHS", "RRBLUP","LASSO", "SPLS", "SVC", "SVR", "RFR", "RFC",
                                   "RR", "RKHS", "BRNN"),trainIdx = 1:200,predIdx = 201:400)

G2P cross-validation

  • G2PCrossValidation
  • Description
    The function achieves cross-validation for integrated GS models.
  • Usage
    G2PCrossValidation(cvSampleList = NULL,cross = 10,times = 1, seed = 1,cpus = 1,markers,data,trait,modelMethods,...)
  • Arguments
    • cvSampleList(list) specified cross-validation combination, or a list from function \code{\link{cvSampleIndex.Default NULL and execute a random grouping.
    • cross(integer) fold of N-fold cross-validation. For example cross = 10 for 10-fold cross-validation.
    • times(integer) times of repetition. For example, cross = 10 and times = 5 for five times 10-fold cross-validation.
    • seed(integer) random seed for sampling of cross-validation.
    • cpus(integer) number of cpu cores to parallelizing cross-validation(only available in UNIX-like operating systems), default 1.
    • markers(numeric, matrix) row represents sample and column represents SNP (feature).Genotypes generally be coded as {0,1,2;0 represents AA(homozygote),2 represents BB(homozygote) and 1 represents AB(heterozygote); missing (NA) alleles are not allowed.
    • data(data.frame) phenotypic data and other information for data set.
    • trait(character) name of aim trait to perform cross-validation.
    • modelMethods(character) 16 alternative models to fit, including "BayesA, "BayesB, "BayesC", "BL", "BRR", "RKHS", "RRBLUP","LASSO", "SPLS", "SVC", "SVR", "RFR", "RFC", "RR", "RKHS", "BRNN"
    • Other parameters are same to G2P

Example

## load example data ##
data("cubic")

## perform a 2 times 5-fold cross-validation  ##
predlist <- G2PCrossValidation(cross = 5,seed = 1, cpus = 1, times = 2, markers = Markers[1:200,],
                data = pheData[1:200,], trait = "DTT", modelMethods = c("rrBLUP", "spls"),
                outputModel = FALSE)

Evaluation

  • G2PEvaluation
  • Description
    The function consolidates 13 evaluation metrics to evaluate the performance of genomic selection models.
  • Usage
    G2PEvaluation(realScores, predScores, Probability = FALSE, evalMethod = "RE", Beta = 1, BestIndividuals = "top", topAlpha = 1:90,probIndex = NULL)
  • Arguments
    • realScores(numeric,vector) observation values of validation data.
    • predScores(numeric,vector or matrix) prediction values from genomic selection models.
    • evalMethod(character) alternative evaluation metrics, including "pearson", "kendall", "spearman", "MSE", "R2", "RE", "Kappa", "AUC", "AUCpr", "accuracy", "F1", "meanNDCG", "NDCG".
    • Beta(numeric) the parameter of F-measure or F-score, when beta = 1, meaning F1, Default F1.
    • BestIndividuals(character) the position of expected phenotype in whole phenotypic data set."top","buttom" or "middle",default "top".
    • topAlpha(numeric, array) the proportion of considering excellent samples, default 1:90.
    • globalAlpha(logical) if global metrics (pearson, kendall, spearman, MSE and R2) evaluate by threshold (topAlpha),default FALSE.
    • probIndex(integer) indicate the column index which prediction result is probability. For example, classification model "RFC" locates the fifth column in prediction matrix and then probIndex = 5.

Example

data("cubic")
########## prediction ############
C2Pres <- G2P(markers = Markers,data = pheData,trait = "EW",
              modelMethods = c("BayesA", "BRR", "RRBLUP",
                               "SPLS", "SVC", "RFR"),trainIdx = 1:200,predIdx = 201:400)
########## evaluation ############
evalres <- G2PEvaluation(realScores = C2Pres[,1], predScores = C2Pres[,2:7], 
                           evalMethod = c("pearson", "kendall","spearman","RE","Kappa",
                                          "AUC","AUCpr","NDCG","meanNDCG",
                                          "MSE","R2","F1","accuracy"),topAlpha = 1:90, probIndex = 5)

Integration of prediction results

Two results integration and user-defined integration (more than two models results)

  • GSMerge
  • Description
    The function provides a strategy to integrate the results of two or more methods.
  • Usage
    GSMerge(predResMat, ratio, autoOptimize = F)
  • Arguments
    • predResMat(numeric, matrix) prediction results of algorithms which you want to merge, the first column is the real value of trait.
    • ratio(numeric,array) weights of every algorithms.
    • autoOptimize(logical) if true, two group results of methods from multiple-results will be selected and then get average (wight 1:1), default FALSE.

Example

data("cubic")

## Genotype-to-phenotype ##
G2Pres <- G2P(markers = Markers,data = pheData,trait = "EW",
              modelMethods = c("BayesA", "BRR", "RRBLUP",
                               "SPLS", "RFR"),trainIdx = 1:200,predIdx = 201:400)

## merge 2 model prediction results ## 
GSMergeRes <- GSMerge(predResMat = G2Pres[,c("realPhenScore","RRBLUP","SPLS")], autoOptimize = F,ratio = c(3,7))

## merge 2 model prediction results automatically ## 
GSMergeRes <- GSMerge(predResMat = G2Pres, autoOptimize = T)

The function provides a strategy to integrate the results of multiple GS methods.

Multiple results integration

  • GSEnsemble
  • Description
    Results integration of multiple GS methods
  • Usage
    GSEnsemble(predMat, nrandom = 10, evalMethods, by = 0.1, evaluation = T, topAlpha = 15)
  • Arguments
    • predMat(numeric, matrix) prediction results of algorithms which you want to merge, the first column is the real value of trait.
    • nrandom(integer) repeat number of stacking, default 10.
    • evalMethods(character) ensemble base which evaluation methods.
    • by(numeric,(0,1)) window of ensemble, the smaller "by", the higher accuracy of ensemble. Default 0.1.
    • evaluation(logical) if evaluate finalMat with evalMethods, default TRUE.
    • topAlpha(numeric) parameter of threshold evaluation methods, see also function evaluateGS. In this function, indicates the best ensemble base threshold when evalMethods is threshold methods.

Example

## load example data ##
data("cubic")

## Genotype-to-phenotype ##
G2Pres <- G2P(markers = Markers,data = pheData,trait = "EW",
              modelMethods = c("BayesA", "BRR", "RRBLUP",
                               "SPLS", "RFR"),trainIdx = 1:200,predIdx = 201:400)

## ensemble more than 2 prediction results of models automatically ## 
GSEnsembleRes <- GSEnsemble(predMat = G2Pres, nrandom = 10, evalMethods = "pearson",
                          by = 0.1, evaluation = T)
## new prediction of genotype-to-phenotype ##
G2PresNew <- G2P(markers = Markers,data = pheData,trait = "EW",
              modelMethods = c("BayesA", "BRR", "RRBLUP",
                               "SPLS", "RFR"),trainIdx = 1:200,predIdx = 401:600)

## apply GSEnsemble rotio to new prediction ##                       
GSEnsembleModelRes <- GSMerge(predResMat = G2PresNew,ratio = GSEnsembleRes$BestWeight, autoOptimize = F)

Refinement design of training set

  • TSRefine
  • Description
    The function can reduce the cost of phenotyping through refining the training set.
  • Usage
    TSRefine(markers,candidates,test = NULL,ntosel,method = "PEVmean", npop = 100, nelite =5 ,mutprob = .8,niterations = 500, cores = 1,lambda = NULL,sequent = FALSE,visulization = F)
  • Arguments
    • markers(numeric, matrix) row represents sample and column represents SNP (feature).Genotypes generally be coded as {0,1,2;0 represents AA(homozygote),2 represents BB(homozygote) and 1 represents AB(heterozygote); missing (NA) alleles are not allowed.
    • candidates(charcter array) names of condidates population.
    • test(character, array) names of testing or predicted set, defult NULL, if NULL, execute reference-free refinement, else execute reference-based refinement.
    • ntosel(integer) number of samples to select in refinement.
    • method(character) alternativeal algorithms for training set refinement or selection, including "PEVmean", "CDmean", "sim", or "RD". Default is "PEVmean".
    • npop(integer) genetic algorithm parameter, number of solutions at each iteration.
    • nelite(integer) genetic algorithm parameter, number of solutions selected as elite parents which will generate the next set of solutions.
    • mutprob(numeric) genetic algorithm parameter, probability of mutation for each generated soluation.
    • niterations(integer) genetic algorithm parameter, number of iterations.
    • cores(integer) number of CPU cores to use.
    • lambda(numeric) scalar shrinkage parameter (λ>0).
    • sequent(logical) phenotype selection parameter, if TRUE, perform sequential optimization, default FALSE.

Example

## load example data ##
data("cubic")

## reference-free refinement ##
G <- Markers[1:300,]
TSR_SP <- TSRefine(markers = G, candidates = rownames(G),
              test = NULL, ntosel = 100,cores = 1,visulization = T)

## reference-based refinement (PEV-mean) ##
TSR_PEVmean <- TSRefine(markers = G, method = "PEVmean", candidates = rownames(G)[1:200],
                       test = rownames(Markers)[201:300], ntosel = 50,cores = 1,visulization = T)