The whole-genome-sequencing pipeline is a modular pipeline for processing WGS data. It takes a FASTQ file as input and produces haplotype calls, annotations, and visualizations, following the GATK pipeline.
First, raw read data in FASTQ format, with well-calibrated base error estimates, are mapped to the reference genome. BWA is used to map reads to the human reference genome, allowing for two mismatches in 30-base seeds, and to generate alignments in the technology-independent SAM/BAM file format. Next, duplicate fragments are marked and removed with Picard (http://picard.sourceforge.net), mapping quality is assessed and low-quality mapped reads are filtered out, and paired-read information is checked to ensure that mate information is in sync between the two reads of each pair. We then refine the initial alignments by local realignment and identify suspicious regions. Using this information as a covariate, along with other technical covariates and known sites of variation, GATK base quality score recalibration (BQSR) is carried out. Germline SNPs and indels are then called via local re-assembly of haplotypes, using the recalibrated and realigned BAM files. Finally, we provide somalier, a tool to quickly assess relatedness from sequencing data in BAM, CRAM, or VCF format.
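To make the duplicate and mapping-quality filtering step concrete, here is a minimal Python sketch that works on plain-text SAM records. It is an illustration only, not the pipeline's actual implementation, which relies on Picard and GATK. The function name `filter_sam_records` and the `min_mapq=20` threshold are assumptions made for this example. The FLAG bits (0x400 for duplicates, 0x4 for unmapped reads) and the column positions follow the SAM format specification.

```python
from typing import Iterable, Iterator

# FLAG bits defined by the SAM specification.
UNMAPPED_FLAG = 0x4     # segment unmapped
DUPLICATE_FLAG = 0x400  # PCR or optical duplicate


def filter_sam_records(lines: Iterable[str], min_mapq: int = 20) -> Iterator[str]:
    """Yield header lines plus alignments that are mapped, not flagged as
    duplicates, and have MAPQ >= min_mapq.

    min_mapq=20 is an assumed threshold for illustration, not a value
    taken from the pipeline.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.split("\t")
        flag = int(fields[1])  # column 2: bitwise FLAG
        mapq = int(fields[4])  # column 5: mapping quality
        if flag & UNMAPPED_FLAG or flag & DUPLICATE_FLAG:
            continue
        if mapq < min_mapq:
            continue
        yield line


if __name__ == "__main__":
    # Toy records, invented for this example.
    toy_sam = [
        "@HD\tVN:1.6\tSO:coordinate",
        "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
        "r2\t1123\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",  # duplicate
        "r3\t99\tchr1\t300\t5\t4M\t=\t400\t104\tACGT\tIIII",     # low MAPQ
        "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",                # unmapped
    ]
    for record in filter_sam_records(toy_sam):
        print(record.split("\t")[0])
```

In practice the same filtering would be applied to BAM files with dedicated tools. The sketch is only meant to show which SAM fields drive the decision.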