It’s been a while exploring lots of resources to understand how eQTL works and this is to summarize what I have found (or understood) and not to repeat the whole thing to save my time/memory.

Data prep: TCGA RNA-seq offers gene and isoform data which are processed using the rsem program.  Here my interest is in utilizing isoform data.  A couple of files are provided: 1)*.isoforms.results 2)*.isoforms.normalized_results

To perform eQTL, I began with the *.isoforms.results files that have two columns –raw_count and scaled_estimate: raw_count is equivalent to raw count except it includes uncertain mapping count (which would not change much the reads count); scaled_estimate is similar to FPKM or TPM values which are already processed.

I paired the raw_count with caret and DESeq for DESeq is reported as one of the best tools for RNA-seq data normalization [1]:

step 1: using a nearZeroVar function offered by caret, screen transcripts

nzv <- nearZeroVar(PRAD_T) ## remove near zeros; row: samples, column: genes
filteredPRAD_T <- PRAD_T[, -nzv]
filteredPRAD_T <-as.data.frame(t(filteredPRAD_T))

step 2: prepare normalized counts by:

cds = newCountDataSet(filteredPRAD_T, conditions=surv1$race); # input: row(gene), column(sample)
cds = estimateSizeFactors(cds)
cts = counts(cds, normalized=TRUE) ## expression data prepared
cts<-as.data.frame(cts)

Prepare SNP array by combining them and get gene location by running biomart in r and SNP location from a local SNP reference file.

Mapping using matrix eQTL

This requires a predefined format for each file (gene expression, SNP, Covariates (e.g., race, age, gender), gene location, SNP location).

To accelerate a large matrix computation, openBLAS is a must.

[1] Dillies MA et al., A comprehensive evaluation of normalization methods for Illumina high-throughput RNAsequencing data analysis, Brief Bioinform. 2013; 14(6):671-83 PMID:22988256

Advertisements