
The only cloud-based integrated data analysis service in Korea

High-speed analysis of genomic big data using the big data platform built with KOBIC's technology

How To Use

Introduction to Bio-Express Web Service

Web-based Large Data Analysis Service

  • It is a web-based service that can be used anywhere online
  • Improved accessibility: open pipelines help novice researchers gain analysis experience
  • Real-time confirmation of user's system usage status
  • Expansion of information sharing and exchange opportunities through the provision of community services among users
  • Checking of basic analysis information through open programs and pipelines
[Figure 1.1-1]

[Figure 1.1-2]

Download Bio-Express Workbench

2.1.1 Windows

Go to “https://www.kobic.re.kr/bioexpress/”.

[Figure 2.1.1-1]

[Figure 2.1.1-2]

Select Workbench for Windows via the quick menu on the right side of the main screen and download.

[Figure 2.1.1-3]

Run Bioexpress.exe after decompressing the downloaded .zip file to the desired directory.

2.1.2 Linux

[Figure 2.1.2-1]

Download to the desired directory via the wget command.

$ wget -O Bio-Express.workbench-linux.gtk-x86_64.tar.gz \
  "https://www.kobic.re.kr/bioexpress/common/related_file_down?file_name=Bio-Express.workbench-linux.gtk-x86_64.tar.gz"
[Figure 2.1.2-2]

Run Bioexpress after decompressing the downloaded tar.gz file with the tar -zxvf command.

$ tar -zxvf Bio-Express.workbench-linux.gtk-x86_64.tar.gz
$ cd Bio-Express.workbench-linux.gtk-x86_64
$ ./Bioexpress

2.1.3 MacOS

Go to “https://www.kobic.re.kr/bioexpress/”.

[Figure 2.1.3-1]

[Figure 2.1.3-2]

Select Workbench for MacOS via the quick menu on the right side of the main screen and download.

[Figure 2.1.3-3]

[Figure 2.1.3-4]

After decompressing the downloaded .zip file, move it to the Applications folder and run Bioexpress.

Run Bio-Express Workbench

2.2.1. Login to Bio-Express Workbench

[Figure 2.2.1-1]

Log in with the ID registered in the KOBIC integrated login (SSO).

If you do not have an ID, sign up as a member using the ‘Sign up’ button at the bottom left, and then log in.

2.2.2. Bio-Express Workbench start screen

[Figure 2.2.2-1]

When the CLOSHA intro page is displayed, clicking the card area for each service opens the corresponding Bio-Express web page, where you can view a detailed introduction and usage information.

[Figure 2.2.2-2]

You can access frequently used functions and update the Workbench via the [File], [Window], and [Help] menus.

  • - [File] > [New]: Create a new workspace, category, pipeline, or program
  • - [File] > [Register]: Pipeline registration, program registration
  • - [Window] > [Open Perspective]: Switch between CLOSHA / G-Box perspective
  • - [Window] > [Reset Perspective]: Revert to the perspective set to Default.
  • - [Window] > [Preference]: Change Bio-Express Workbench settings
  • - [Help] > [Help Contents]: Bio-Express User Manual
  • - [Help] > [Community]: Go to the Bio-Express web community page
  • - [Help] > [Feedback]: Go to the Bio-Express web feedback page
  • - [Help] > [About Bio-Express]: Go to the Bio-Express web introduction page
  • - [Help] > [Update CLOSHA Workbench]: Update Bio-Express Workbench
[Figure 2.2.2-3]

Using the menu on the far right of the toolbar, you can switch between the G-Box perspective (the big data high-speed transfer system) and the CLOSHA perspective (the cloud-based integrated analysis workbench).
Control menus for the workspace, pipeline, program, and category are available on the toolbar.

Introduction to the GBox screen and functions

The GBox high-speed transfer system transfers the user's large volumes of data at high speed using software technology alone, without additional hardware such as an accelerator or a separate plug-in such as ActiveX. User data uploaded through GBox is available in CLOSHA.

[Figure 3.1-1]

GBox is designed with an easy and intuitive user interface, so users can run it without training.

Big data high-speed transfer

[Figure 3.2-1]

Upload method 1: Select the folder or file you want to transfer in the [File Explorer] view, and drag and drop it to the [GBox Explorer] view.

[Figure 3.2-2]

Upload method 2: Select the folder or file you want to transfer in the [File Explorer] view, and click the [Upload] button in the context menu to transfer it.

[Figure 3.2-3]

Download method 1: Select the folder or file you want to transfer in the [GBox Explorer] view, and drag and drop it to the [File Explorer] view.

[Figure 3.2-4]

Download method 2: Select the folder or file you want to transfer in the [GBox Explorer] view, and click the [Download] button in the context menu to transfer it.

[Figure 3.2-5]

Once the file transfer begins, the real-time transfer status can be monitored in the [Progress] view at the bottom of the application, as shown in [Figure 3.2-5].

High-speed synchronization of big data file system

[Figure 3.3-1]

GBox → HDFS: Select the folder or file you want to synchronize in the [GBox Explorer] view, and click the [Sync] button in the context menu to synchronize it.

[Figure 3.3-2]

HDFS → GBox: Select the folder or file you want to synchronize in the [HDFS Tree Browser] view, and click the [Sync] button in the context menu to synchronize it.

[Figure 3.3-3]

You can check the bidirectional file synchronization record in the [Synchronized History] view.

Data file management function

[Figure 3.4-1]

You can manage data using the context menu of the [File Explorer] view and the [GBox Explorer] view. ([New], [Rename], [Copy], [Paste], [Delete], [Refresh])

You can move directories and use shortcuts via the toolbar menu at the top of the [File Explorer] view and the [GBox Explorer] view.

Introduction to screens and functions

Bio-Express Workbench is a cloud-based program that provides services for analyzing a large volume of data. It has a user-friendly interface, so users can easily design a complex bio-information analysis process as a pipeline.

[Figure 4.1-1]

Register a user-defined program

4.2.1. Write a user analysis code

To register a user-defined program, you must first write the user analysis code.

[Figure 4.2.1-1]

Click the [New Folder] button in the context menu of the [Script] view to create a script folder for registering a user analysis code.

[Figure 4.2.1-2]

After entering the name of the new folder, click the [OK] button to create the folder.

[Figure 4.2.1-3]

Click the [New Script] button in the context menu of the created script folder, as shown in [Figure 4.2.1-3], to create a user analysis code. Python, Bash, and R file formats are supported for user analysis code.

[Figure 4.2.1-4]

After entering the name of the script to be created, click the [OK] button to create an empty script.

[Figure 4.2.1-5-1]

[Figure 4.2.1-5-2]

[Figure 4.2.1-5-3]

#!/bin/bash
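# Runs fastp on every paired-end *1.fastq.gz / *2.fastq.gz pair found in the input directory.
# Arguments: $1 = input directory, $2 = output directory.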

INPUT_DIR=$1
OUTPUT_DIR=$2

mkdir -p $OUTPUT_DIR

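# in_array: echoes "true" if the first argument occurs among the remaining arguments, "false" otherwise.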
in_array() {
	local needle array value
	needle="${1}"; shift; array=("${@}")
	for value in ${array[@]};
		do
			[ "${value}" == "${needle}" ] && echo "true" && return;
		done
	echo "false"
}

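# Collect the basenames of all *.fastq.gz files found under INPUT_DIR.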
array=(`find $INPUT_DIR/ -name "*.fastq.gz"`)
for i in "${array[@]}"
do
	FILE=${i##*/}
	IFS='' read -r -a array <<< "$FILE"
	len=(${#array[@]}-1)
	FASTP_IN_FILE_LIST+=("${array[$len]}")
done


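# For each read-1 file (*1.fastq.gz) that has a matching read-2 file (*2.fastq.gz), run fastp on the pair.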
for i in "${FASTP_IN_FILE_LIST[@]}"
do
	basename=$i
	FILE1=${basename%1.fastq.gz}

	IFS='' read -r -a file1_array <<< "$FILE1"
	file1_len=(${#file1_array[@]}-1)

	array_check=`in_array ${file1_array[$file1_len]}2.fastq.gz ${FASTP_IN_FILE_LIST[@]}`

	if [ "${array_check}" == "true" ]; then

		IFS='-' read -r -a SEQ_ID <<< "$FILE1"
		 
		FASTP='/BiO/program/fastp/current/fastp'

		time $FASTP -i ${INPUT_DIR}/${file1_array[$file1_len]}1.fastq.gz -I ${INPUT_DIR}/${file1_array[$file1_len]}2.fastq.gz \
		-o ${OUTPUT_DIR}/${file1_array[$file1_len]}1.fastq.gz -O ${OUTPUT_DIR}/${file1_array[$file1_len]}2.fastq.gz \
		--length_required 20 --average_qual 20 --detect_adapter_for_pe --correction \
		-h ${OUTPUT_DIR}/${SEQ_ID[0]}.html -j ${OUTPUT_DIR}/${SEQ_ID[0]}.json
	fi
done

You can write or edit the created user analysis code in the editor by double-clicking it in the [Script] view, as shown in [Figure 4.2.1-5-1].

4.2.2. Register a user program

Register a user program to run the user analysis code you have written.

[Figure 4.2.2-1]

[Figure 4.2.2-2]

Click the [New Category] button of the context menu in the [Program] view to create the root category and subcategory for registering the user-defined program.

The workspace, category, pipeline, and program names may only contain letters, numbers, and some special characters (-_.), and the first character must be a letter. Other characters or spaces are not allowed.

If a subcategory exists in the root category, or if a program or pipeline exists in a subcategory, the delete and edit functions cannot be used.

[Figure 4.2.2-3]

Create a subcategory under the root category using the context menu.

[Figure 4.2.2-4]

Click the [New Program] button in the context menu of the created subcategory to register a user program.

[Figure 4.2.2-5]

Register a user program by setting its basic information, additional information, and parameters.

  • Basic information : name, category, version, keyword, related URL, description

    You can enter multiple keywords by delimiting them with commas (,).

  • Additional information : script type, single/multi core, related program
  • Parameters : Input, Output, Option
    • - Input : Set the file or path required for analysis
    • - Output : Set the path where the analysis results will be written
    • - Option : Set the options supported by the program

Parameters consist of Input, Output, and Option. The set values are default values, and the user can set the values again in the program execution stage.

[Figure 4.2.2-6]

Click the Add Parameter button to create the Input parameter of the fastp program. You can edit or delete the created parameter.

[Figure 4.2.2-7]

Enter the default value after specifying the type of parameter to be created. If the parameter type is Folder or File, you can import the files and folders of GBox using the Browser button.

[Figure 4.2.2-8]

You can check the created parameters, as shown in [Figure 4.2.2-8].

[Figure 4.2.2-9]

You can create the Output and Option parameters in the same manner.

Develop an analysis pipeline

4.3.1. Create a workspace

Create a workspace by clicking the [New Workspace] button in the context menu of the [Workspace] view, as shown in [Figure 4.3.1-1].

[Figure 4.3.1-1]

Enter the name and description of the workspace to be created. Register the related keywords and then click the [OK] button.

[Figure 4.3.1-2]

Create an analysis pipeline by clicking the [New Pipeline] button in the context menu of the created workspace.

[Figure 4.3.1-3]

Create a user-defined pipeline according to the pipeline type or create an instance pipeline by selecting an existing registered pipeline.

[Figure 4.3.1-4]

- [Develop a new pipeline] : Create a user-defined pipeline

- [Develop an instance pipeline] : Select an open pipeline

[Figure 4.3.1-5]

You can check the user-defined pipeline created through the [Workspace] view, as shown in [Figure 4.3.1-5].

4.3.2. Design an analysis pipeline

4.3.2-1. Design a user-defined pipeline

[Figure 4.3.2-1]

Find the analysis program required for the analysis pipeline design in the ‘Palette’ of the editor.

[Figure 4.3.2-2]

After selecting the program, drag and drop it onto the canvas using the mouse.

[Figure 4.3.2-3-1]

Save the pipeline designed with the required analysis program, and double-click each analysis program to set the required options for execution.

[Figure 4.3.2-3-2]

After setting the analysis pipeline execution option, create a single pipeline by connecting the analysis programs located on the canvas.

The order of connecting the analysis programs and setting their options does not affect functionality. However, when analysis programs are connected, the output of the previous stage is automatically set as the input of the next, so it is recommended to set the options before connecting the programs.

[Figure 4.3.2-4]

Design the user-defined pipeline by dragging the circle-shaped connection point at the right end of an analysis program's output parameter and dropping it onto the connection point at the left end of the input parameter of the next-stage analysis program.

[Figure 4.3.2-5]

The user can modify the shape of the connecting line freely by dragging the mouse, as shown in [Figure 4.3.2-5].

4.3.2-2. Design an instance pipeline

You can select an existing registered pipeline in the pipeline creation stage by selecting an instance pipeline type, as shown in [Figure 4.3.2-6].

[Figure 4.3.2-6]

You can also change the settings required to run each program in an instance pipeline, similar to a user-defined pipeline.

4.3.3. Share an analysis pipeline

[Figure 4.3.3-1]

The designed analysis pipeline can be shared between members who have joined KOBIC SSO.

Click [Shared Pipeline] in the context menu of the pipeline you want to share.

[Figure 4.3.3-2]

You can view information about the pipeline to be shared, together with a table for selecting the members with whom to share it.

[Figure 4.3.3-3]

[Figure 4.3.3-4]

  • Go to the [Share Member] tab and click the [Add] button to see a list of members with whom sharing is possible.
  • You can quickly find a member if you search by ID in the search box of the Search Member window.
  • You can check the shared pipelines on the Bio-Express web.

Run an analysis pipeline

4.4.1. Run an analysis pipeline

You can run the designed analysis pipeline by clicking the [Run] button at the top of the editor.

[Figure 4.4.1-1]

When the analysis is run, you can check the progress of the analysis pipeline through the [Pipeline] view, the [Log] view, and the editor. You can also check the real-time progress of the analysis pipeline by clicking the [Refresh] button on each view and the editor. When the analysis pipeline is run, the analysis programs are executed sequentially.

[Figure 4.4.1-2-1]

[Figure 4.4.1-2-2]

The running programs are in the run state and shown in green on the editor, whereas the completed programs are in the complete state and shown in blue.

4.4.2. Run a single analysis program

[Figure 4.4.2-1]

You can also run each analysis program individually. Click the [Start Node] button in the shortcut menu that appears when you hover the mouse over the analysis program you want to run.

4.4.3. Check data analysis results

[Figure 4.4.3-1]

Check the log information for each analysis program through the [Log] view.

[Figure 4.4.3-2]

If you double-click a log entry, you can check the log of the analysis process and any errors in the [Console] view.

[Figure 4.4.3-3]

Go to the Output path specified for each analysis program and check the results using the [GBox Browse] view.

[Figure 4.4.3-4]

[Figure 4.4.3-4] shows the analysis results of fastp, an open program in the Bioinformatics category.

Register an open service

4.5.1. Register a pipeline to make it open to the public

To share an analysis pipeline with other users, you can register the analysis pipeline as open to the public.

[Figure 4.5.1-1]

Select the desired analysis pipeline in the [Workspace] view and click the [Register Pipeline] button in the context menu, as shown in [Figure 4.5.1-1].

[Figure 4.5.1-2]

An analysis pipeline submitted for public registration goes through an administrator approval process. Once approved, it can be used as an open analysis pipeline.

4.5.2. Register a program to make it open to the public

An analysis program submitted for public registration goes through an administrator approval process. Once approved, it is registered as an open analysis program.

[Figure 4.5.2-1]

Select the desired analysis program in the [Program] view and click the [Register Program] button in the context menu to submit it for public registration, as shown in [Figure 4.5.2-1].

Users cannot delete or edit open categories, programs, or pipelines.

Introduction to the genomic analysis pipeline

5.1.1. WGS

Pipeline

[Figure 5.1.1-1]

Application area

Whole-Genome Sequencing

Function summary

This pipeline takes the DNA sequences of normal and cancerous tissues as input and performs processes such as Read Map to Reference, Read Merge, and CleanUp to create a bam file. It uses GATK4 to find somatic SNP & InDel variants quickly, and then performs variant annotation.

Detailed description

The genomic analysis pipeline consists of 10 stages: Fastp, Cutadapt, BWA (mem), SortSam, MarkDuplicates, CountBase, BaseRecalibrator, ApplyBQSR, HaplotypeCaller, and Somalier. The analysis process at each stage is as follows.

The input to Fastp, the first stage, is the path of the folder containing the paired-end fastq.gz files. Fastp preprocesses the fastq reads, and the quality control and filtering results are visualized as JSON and HTML reports. Next, Cutadapt receives the fastq.gz files preprocessed by Fastp as input; it finds adapter sequences, primers, poly-A tails, and other unwanted sequences in the high-throughput sequencing reads and removes them. The fastq files resulting from this adaptive trimming are sent to BWA (mem), the next analysis stage. BWA maps low-divergence sequences against large genome references such as the human genome; BWA (mem) seeds alignments of 70 bp~1 Mbp query sequences with maximal exact matches against the genome reference and extends them to create a SAM file.

The SAM file is then sorted by coordinate, query name, or other attributes of the SAM records through SortSam, which returns a BAM file. Next, MarkDuplicates tags duplicate reads in the BAM file and passes the result to CountBase, BaseRecalibrator, and ApplyBQSR. CountBase outputs a file containing the total number of bases. BaseRecalibrator performs the first of the two stages of Base Quality Score Recalibration (BQSR): it builds a recalibration table based on covariates such as read group, reported quality score, machine cycle, and nucleotide context. ApplyBQSR performs the second stage of BQSR: it recalibrates the base quality of the input reads using the table produced by BaseRecalibrator and creates a recalibrated BAM file.

Next, HaplotypeCaller calls SNPs and InDels by locally reassembling the reads in active regions of haplotypes. The reassembled reads are used to create a VCF file, which is compressed into the gvcf.gz format and sent to the next stage. Somalier, the final analysis stage, extracts the informative sites, evaluates relatedness from the sequencing data, and saves the evaluation results in the Somalier file format.

NGS data analysis pipeline design process

5.2.1. NGS data preprocessing

Preprocessing workflow

[Figure 5.2.1-1]

FASTQ format

[Figure 5.2.1-2]
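For reference, each FASTQ record occupies four lines: a header beginning with “@” and carrying the read ID, the base calls, a “+” separator line, and a quality string encoding one Phred score per base. A minimal record (the read ID and sequence are hypothetical):

@SEQ_001/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65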

FASTQ format cont

Phred quality score Q for a base-calling error probability P:

Q = -10 log10(P)

For example, P = 0.001 gives Q = 30, i.e., one expected error per 1,000 base calls.

[Figure 5.2.1-3]

[Figure 5.2.1-4]

FASTA format

  • One or more sequences per file
  • “>” denotes the beginning of a sequence or contig
  • Subsequent lines up to the next “>” define the sequence
  • A lowercase base denotes a repeat-masked base
  • The contig ID may have comments delimited by “|” (see the example after [Figure 5.2.1-5])
[Figure 5.2.1-5]
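As an illustration of the points above, a minimal FASTA entry (the contig ID, comment, and sequence are hypothetical):

>contig_1|example comment
ACGTACGTTTGCAAGGCATTCCAGGGTACCGTAACGTTAGGCCATTAGCAGT
acgtacgtttgcaaggCATTCCAGGGTACCGTAACGTTAGGCCATTAGCAGT

The “>” line starts the contig, “|” delimits comments in the ID line, and the lowercase run marks repeat-masked bases.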

SAM format

[Figure 5.2.1-6]
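As a sketch of the layout (the read name, coordinates, and bases below are hypothetical): a SAM file begins with “@”-prefixed header lines, followed by one tab-separated alignment record per read with 11 mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL):

@HD	VN:1.6	SO:coordinate
@SQ	SN:chr1	LN:248956422
read001	99	chr1	10468	60	8M	=	10542	82	ACGTACGT	FFFFFFFF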

SAM format cont

[Figure 5.2.1-7]

[Figure 5.2.1-8]

[Figure 5.2.1-9]

Preprocessing stage

  • Quality Check
    • - FastQC is an excellent program for visualizing the overall quality of all reads in a fastq file (see the sample command after the figures below)
    • - Input : FASTQ
    • - Output : QC Report (html)
    • - Software : FastQC ...
[Figure 5.2.1-10]

[Figure 5.2.1-11]

[Figure 5.2.1-12]
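For reference, a minimal sketch of running the quality check from a shell (assuming FastQC is installed and on the PATH; the file and directory names are placeholders):

$ mkdir -p qc_report
$ fastqc sample_1.fastq.gz sample_2.fastq.gz -o qc_report

FastQC writes one HTML report (plus a zip archive of the underlying metrics) per input file into the directory given with -o, which must already exist.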

Trimming

  • Adapter trimming
    • - May increase mapping rates
    • - Absolutely essential for small RNA
    • - Probably improves de novo assemblies
  • Quality trimming
    • - May increase mapping rates
    • - May also lead to loss of information
  • Input: FASTQ
  • Output: Trimmed FASTQ
  • Software: Sickle, Cutadapt, Fastx Toolkit ... (see the sample command after the figure below)
[Figure 5.2.1-13]
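As a hedged sketch of the trimming step (assuming cutadapt is installed; the adapter sequence and file names below are placeholders, not values prescribed by this manual):

$ cutadapt -q 30 -a AGATCGGAAGAGC --minimum-length 70 \
  -o trimmed_1.fastq input_1.fastq.gz

Here -q trims low-quality bases from the 3' end, -a specifies the adapter to remove, and --minimum-length discards reads that become too short. For paired-end data, add -A and -p for the second read, as in the full Cutadapt script in section 5.2.2.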

Alignment

  • The process of arranging DNA or RNA sequences against one another or a reference to determine how the reads fit together
  • Alignment approaches are divided into mapping and assembly
  • Input: FASTQ, Fasta
  • Output: Sam / Bam
  • Software: bowtie, bwa, star, soap ... (see the sample command after the figure below)
[Figure 5.2.1-14]
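As a minimal sketch of mapping reads to a reference with BWA (assuming bwa is installed; file names are placeholders), the reference is indexed once and the reads are then aligned:

$ bwa index ref.fasta
$ bwa mem -t 4 ref.fasta sample_1.fastq sample_2.fastq > sample.sam

bwa index builds the reference index; bwa mem aligns the read pair and writes SAM records to standard output.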

Preprocessing stage cont

Alignment – Aligning to a Reference

[Figure 5.2.1-15]

[Figure 5.2.1-16]

5.2.2. Whole Genome Sequencing Pipeline

Pipeline summary

  • Pipeline Diagram
    [Figure 5.2.2-1]

  • Description

    This pipeline takes the DNA sequences of normal and cancerous tissues as input and performs processes such as Read Map to Reference, Read Merge, and CleanUp to create a bam file. It uses GATK4 to find somatic SNP & InDel variants quickly, and then performs variant annotation.

  • Programs used

    Fastp, Cutadapt, BWA, GATK4, Somalier

  • Input files
    • - Fastq Data file
    • - Reference FASTA file
    • - Reference & Resource VCF file
  • Final result file

    Variant annotation file of somatic SNPs & InDels

Registered pipeline

[Figure 5.2.2-2]

Step-by-step description

1) Quality Check - Quality Control & Adaptive Trimming

Fastp

  • - Version and License: 0.20.1 / MIT License
  • - Description: A program that creates an HTML-based report by performing all-in-one preprocessing for Fastq files.
  • - Input parameter
    • [input]: The path where the fastq.gz file that has been pair-ended with raw data is located (path of *.fastq.gz)
    • [output]: The path where the preprocessing result report will be created (path of *.html, *.json, *.fastq.gz)
  • - Script details
    • fastp program written in Bash shell script.
    • Receives the input and output values as a folder, and selects and analyzes a set of fastq.gz files in the folder.
  • #!/bin/bash
    
    INPUT_DIR=$1
    OUTPUT_DIR=$2
    
    mkdir -p $OUTPUT_DIR
    
    in_array() {
        local needle array value
        needle="${1}"; shift; array=("${@}")
        for value in ${array[@]};
            do
                [ "${value}" == "${needle}" ] && echo "true" && return;
            done
        echo "false"
    }
    
    array=(`find $INPUT_DIR/ -name "*.fastq.gz"`)
    for i in "${array[@]}"
    do
        FILE=${i##*/}
        IFS='' read -r -a array <<< "$FILE"
        len=(${#array[@]}-1)
        FASTP_IN_FILE_LIST+=("${array[$len]}")
    done
    
    
    for i in "${FASTP_IN_FILE_LIST[@]}"
    do
        basename=$i
        FILE1=${basename%1.fastq.gz}
    
        IFS='' read -r -a file1_array <<< "$FILE1"
        file1_len=(${#file1_array[@]}-1)
    
        array_check=`in_array ${file1_array[$file1_len]}2.fastq.gz ${FASTP_IN_FILE_LIST[@]}`
    
        if [ "${array_check}" == "true" ]; then
    
            IFS='-' read -r -a SEQ_ID <<< "$FILE1"
             
            FASTP='/BiO/program/fastp/current/fastp'
    
            time $FASTP -i ${INPUT_DIR}/${file1_array[$file1_len]}1.fastq.gz -I ${INPUT_DIR}/${file1_array[$file1_len]}2.fastq.gz \
    		-o ${OUTPUT_DIR}/${file1_array[$file1_len]}1.fastq.gz -O ${OUTPUT_DIR}/${file1_array[$file1_len]}2.fastq.gz \
    		--length_required 20 --average_qual 20 --detect_adapter_for_pe --correction \
    		-h ${OUTPUT_DIR}/${SEQ_ID[0]}.html -j ${OUTPUT_DIR}/${SEQ_ID[0]}.json
        fi
    done
  • - Pipeline connection
    [Figure 5.2.2-3]

2) Trimming - Quality Control & Adaptive Trimming

Cutadapt

  • - Version and License: 3.4 / MIT License
  • - Description: A program that generates purified fastq files by finding and removing adapter sequences, primers, poly-A tails, and other unwanted sequences from high-throughput sequencing reads
  • - Input parameter
    • [input]: The path where the fastq.gz file preprocessed by Fastp is located
    • [output]: The path where the fastq file that is the result of adaptive trimming will be created
  • - Script details
    • cutadapt program written in Bash shell script.
    • Receives the input and output values as a folder, and selects and analyzes a set of fastq.gz files in the folder.
  • #!/bin/bash
    
    INPUT_DIR=$1
    OUTPUT_DIR=$2
    
    mkdir -p $OUTPUT_DIR
    
    in_array() {
        local needle array value
        needle="${1}"; shift; array=("${@}")
        for value in ${array[@]};
            do
                [ "${value}" == "${needle}" ] && echo "true" && return;
            done
        echo "false"
    }
    
    array=(`find $INPUT_DIR/ -name "*.fastq.gz"`) 
    for i in "${array[@]}"
    do
        FILE=${i##*/}
        IFS='' read -r -a array <<< "$FILE"
        len=(${#array[@]}-1)
        echo "${array[$len]}"
        CUTADAPT_IN_FILE_LIST+=("${array[$len]}")
    done
    
    index=1
    
    for i in "${CUTADAPT_IN_FILE_LIST[@]}"
    do
        basename=$i
        FILE1=${basename%1.fastq.gz}
    
        IFS='' read -r -a file1_array <<< "$FILE1"
        file1_len=(${#file1_array[@]}-1)
    
        array_check=`in_array ${file1_array[$file1_len]}2.fastq.gz ${CUTADAPT_IN_FILE_LIST[@]}`
    
        if [ "${array_check}" == "true" ]; then
    
            READ_1=$INPUT_DIR/${file1_array[$file1_len]}1.fastq.gz
            READ_2=$INPUT_DIR/${file1_array[$file1_len]}2.fastq.gz
    
            READ_1_FILE_NAME=$(basename $READ_1) 
            READ_2_FILE_NAME=$(basename $READ_2)
    
            IFS='.' read -r -a split_name_1 <<< "$READ_1_FILE_NAME" 
            IFS='.' read -r -a split_name_2 <<< "$READ_2_FILE_NAME" 
    
            READ_1_NAME=${split_name_1[0]} 
            READ_2_NAME=${split_name_2[0]}
    
           	CUTADAPT='/BiO/program/python/current/bin/cutadapt'
           
            time $CUTADAPT -q 30 -j 2 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --minimum-length 70 --pair-filter=any -o $OUTPUT_DIR/${READ_1_NAME}.fastq -p $OUTPUT_DIR/${READ_2_NAME}.fastq  $READ_1 $READ_2
    
            index=$(($index+1))
        fi
    done
  • - Pipeline connection
    [Figure 5.2.2-4]

3) Alignment - Read Map to Reference

BWA mem

  • - Version and License: 0.7.15 / GNU General Public License version 3.0
  • - Description: A program that seeds alignments of 70 bp~1 Mbp query sequences with maximal exact matches against the genome reference and extends them to produce a sam file
  • - Input parameter
    • [input]: The path where the fastq files adaptively trimmed by Cutadapt are located
    • [reference]: The path where the reference fasta file and the index built by bwa index are located
    • [output]: The path where the 'Read Map to Reference' sam file will be created
  • - Script details
    • BWA mem program written in bash shell script.
    • Receives the reference values as a folder and only selects fasta files in the folder. Receives the input and output values as a folder, and selects and analyzes the fastq files in the folder.
  • #!/bin/bash
    
    INPUT_DIR=$1
    REF_DIR=$2
    OUTPUT_DIR=$3 
    
    mkdir -p $OUTPUT_DIR
    
    REF_FILE=(`find $REF_DIR/ -name "*.fasta"`)
    
    in_array() {
        local needle array value
        needle="${1}"; shift; array=("${@}")
        for value in ${array[@]};
            do
                [ "${value}" == "${needle}" ] && echo "true" && return;
            done
        echo "false"
    }
    
    array=(`find $INPUT_DIR/ -name "*.fastq"`) 
    
    for i in "${array[@]}"
    do
        FILE=${i##*/}
        IFS='' read -r -a array <<< "$FILE"
        len=(${#array[@]}-1)
        echo "${array[$len]}"
        BWA_IN_FILE_LIST+=("${array[$len]}")
    done
    
    for i in "${BWA_IN_FILE_LIST[@]}"
    do
        basename=$i
        FILE1=${basename%1.fastq}
    
        IFS='' read -r -a file1_array <<< "$FILE1"
        file1_len=(${#file1_array[@]}-1)
    
        array_check=`in_array ${file1_array[$file1_len]}2.fastq ${BWA_IN_FILE_LIST[@]}`
    
        if [ "${array_check}" == "true" ]; then
    
            READ_1=$INPUT_DIR/${file1_array[$file1_len]}1.fastq 
            READ_2=$INPUT_DIR/${file1_array[$file1_len]}2.fastq 
    
            INPUT_FILE_NAME=$(basename  $READ_1 .fastq)
    
            IFS='_' read -r -a NAME <<< "$INPUT_FILE_NAME"
            IFS='-' read -r -a FASTQ_2D_ID <<< "${NAME[0]}"
            
            OUTPUT_SAM=$OUTPUT_DIR/${NAME}.sam
    	
            BWA='/BiO/program/bwa/current/bwa'
            $BWA mem \
            -M -R "@RG\tID:${INPUT_DIR}\tPL:Nova6000S4\tLB:Truseq_PCRFREE\tSM:${FASTQ_2D_ID}" -v 1 \
            -t 1 $REF_FILE $READ_1 $READ_2 > $OUTPUT_SAM
            
        fi
    
    done
  • - Pipeline connection
    [Figure 5.2.2-5]

4) Alignment - Read Merge

GATK SortSam

  • - Version and License: GATK 4.2.0.0 / MIT License
  • - Description: A program that sorts the sam or bam files by coordinate, query name, or other attributes of the sam record
  • - Input parameter
    • [input]: The path where the sam file created by BWA mem is located
    • [output]: The path where the bam files sorted by attributes will be created
  • - Script details
    • SortSam program written in bash shell script.
    • Receives the input and output values as a folder, and selects and analyzes a set of sam files in the folder.
  • #!/bin/bash
    
    INPUT_DIR=$1
    OUTPUT_DIR=$2 
    
    mkdir -p $OUTPUT_DIR
    
    array=(`find $INPUT_DIR/ -name "*.sam"`) 
    
    for i in "${array[@]}"
    do
    	FILE=${i##*/}
    	IFS='' read -r -a array <<< "$FILE"
    	len=(${#array[@]}-1)
    	echo "${array[$len]}"
    	SORT_IN_FILE_LIST+=("${array[$len]}")
    done
    
    for i in "${SORT_IN_FILE_LIST[@]}" 
    do
        basename=$(basename  $i)
        INPUT_FILE_NAME=$(basename  $i .sam)
        EXTENSION=${basename#*.}
    
        if [ $EXTENSION == 'sam' ]; then
    
            IFS='-' read -r -a FNAME <<< "$INPUT_FILE_NAME"
            IFS='_' read -r -a SEQ_ID <<< "$FNAME"
    
            INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.sam
            
        SORTSAM='/BiO/program/gatk/current/gatk SortSam'
    
        time $SORTSAM -I $INPUT_FILE -O $OUTPUT_DIR/${INPUT_FILE_NAME}.bam -SO coordinate
    
        fi
    done
  • - Pipeline connection
    [Figure 5.2.2-6]

5) Alignment - CleanUp

GATK MarkDuplicates

  • - Version and License: GATK 4.2.0.0 / MIT License
  • - Description: A program that compares the sequences in five key locations of the input file in order to find and tag duplicate reads
  • - Input parameter
    • [input]: The path where the bam files sorted by attributes are located
    • [output]: The path where the bam file in which the duplicate reads have been found and tagged will be created
  • - Script details
    • MarkDuplicates program written in Bash shell script.
    • Receives the input and output values as a folder, and selects and analyzes a set of BAM files in the folder.
  • #!/bin/bash
    INPUT_DIR=$1 
    OUTPUT_DIR=$2
    
    mkdir -p $OUTPUT_DIR
    
    array=(`find $INPUT_DIR/ -name "*.bam"`)
    
    for i in "${array[@]}"
    do
        FILE=${i##*/}
        IFS='' read -r -a array <<< "$FILE"
        len=(${#array[@]}-1)
        echo "${array[$len]}"
        MARK_IN_FILE_LIST+=("${array[$len]}")
    done
    
    for i in "${MARK_IN_FILE_LIST[@]}"
    do
        basename=$(basename  $i)
        INPUT_FILE_NAME=$(basename  $i .bam)
        EXTENSION=${basename#*.}
    
        if [ $EXTENSION == 'bam' ]; then
    
            INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
            
        MARKDUPLICATES='/BiO/program/gatk/current/gatk MarkDuplicates'
    
        time $MARKDUPLICATES \
            -I $INPUT_FILE \
            -O $OUTPUT_DIR/${INPUT_FILE_NAME}.bam \
            -M $OUTPUT_DIR/${INPUT_FILE_NAME}.table \
            --REMOVE_SEQUENCING_DUPLICATES false
    
        echo "$MARKDUPLICATES -I $INPUT_FILE -O $OUTPUT_DIR/${INPUT_FILE_NAME}.bam -M $OUTPUT_DIR/${INPUT_FILE_NAME}.table --REMOVE_SEQUENCING_DUPLICATES true"
        
        fi
    done
  • - Pipeline connection
    [Figure 5.2.2-7]

6) Alignment - Base Count

GATK CountBases

  • - Version and License: GATK 4.2.0.0 / MIT License
  • - Description: A program that counts the total number of bases in SAM/BAM/CRAM files and prints the count to standard output
  • - Input parameter
    • [input]: The path where the BAM file created by MarkDuplicates is located.
    • [intervals]: The path where the genomic interval file is located
    • [output]: The path where the count file to which the total number of bases in the bam file has been output will be created
  • - Script details
    • CountBases program written in bash shell script.
    • Receives the reference values as a file. Receives the input and output values as a folder, and selects and analyzes the bam files in the folder.
  • #!/bin/bash
    
    INPUT_DIR=$1
    REF_FILE=$2
    OUTPUT_DIR=$3
    
    mkdir -p $OUTPUT_DIR
    
    array=(`find $INPUT_DIR/ -name "*.bam"`)
    
    for i in "${array[@]}"
    do
        FILE=${i##*/}
        IFS='' read -r -a array <<< "$FILE"
        len=(${#array[@]}-1)
        echo "${array[$len]}"
        COUNT_IN_FILE_LIST+=("${array[$len]}")
    done
    
    for i in "${COUNT_IN_FILE_LIST[@]}"
    do
        basename=$(basename  $i)
        INPUT_FILE_NAME=$(basename  $i .bam)
        EXTENSION=${basename#*.}
    
        if [ $EXTENSION == 'bam' ]; then
    
            INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
    
            /samtools/1.6/samtools index $INPUT_FILE
        
        COUNTBASES='/BiO/program/gatk/current/gatk CountBases'
    
        time $COUNTBASES \
            -I $INPUT_FILE \
            -O $OUTPUT_DIR/${INPUT_FILE_NAME}.count \
            -RF NotSecondaryAlignmentReadFilter \
            -L $REF_FILE
        
        fi
    done
  • - Pipeline connection
    [Figure 5.2.2-8]

7) Alignment - Base Quality Score Recalibration

GATK BaseRecalibrator

  • - Version and License: GATK 4.2.0.0 / MIT License
  • - Description: A program that creates a table by re-calibrating the base quality score based on various covariates, such as read group, reported quality score, machine cycle, and nucleotide context
  • - Input parameter
    • [input]: The path where the bam file created by MarkDuplicates is located
    • [reference]: The path where the reference sequence fasta file is located
    • [known sites1~3]: The path where the vcf file, which is ‘known polymorphic sites’, is located
    • [output]: The path where the table file created by re-calibrating the base quality score is located
  • - Script details
    • BaseRecalibrator program written in Bash shell script.
    • Receives the reference values as a file. Receives the input and output values as a folder, and selects and analyzes the bam files in the folder.
  • #!/bin/bash
    
    INPUT_DIR=$1
    REF_FILE1=$2
    REF_FILE2=$3
    REF_FILE3=$4
    REF_FILE4=$5
    OUTPUT_DIR=$6
    
    mkdir -p $OUTPUT_DIR
    
    array=(`find $INPUT_DIR/ -name "*.bam"`)
    
    for i in "${array[@]}"
    do
        FILE=${i##*/}
        IFS='' read -r -a array <<< "$FILE"
        len=(${#array[@]}-1)
        echo "${array[$len]}"
        BASE_IN_FILE_LIST+=("${array[$len]}")
    done
    
    for i in "${BASE_IN_FILE_LIST[@]}"
    do
        basename=$(basename  $i)
        INPUT_FILE_NAME=$(basename  $i .bam)
        EXTENSION=${basename#*.}
    
        if [ $EXTENSION == 'bam' ]; then
    
            INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
            
        BASERECALIBRATOR='/BiO/program/gatk/current/gatk BaseRecalibrator'
    
        time $BASERECALIBRATOR \
            -I $INPUT_DIR/${INPUT_FILE_NAME}.bam \
            -R ${REF_FILE1} \
            --known-sites ${REF_FILE2} \
            --known-sites ${REF_FILE3} \
            --known-sites ${REF_FILE4} \
            -O $OUTPUT_DIR/${INPUT_FILE_NAME}.table
    
        fi
    done
  • - Pipeline connection
    [Figure 5.2.2-9]

8) Alignment - Apply BQSR

GATK ApplyBQSR

  • - Version and License: GATK 4.2.0.0 / MIT License
  • - Description: A program that creates bam or cram files by re-calibrating the base quality of the input reads based on the recalibration table created with the BaseRecalibrator tool
  • - Input parameter
    • [input]: The path where the bam file created by MarkDuplicates is located
    • [reference]: The path where the reference sequence fasta file is located
    • [bqsr recal file]: The path where the table file created by BaseRecalibrator is located
    • [output]: The path where the re-calibrated bam file will be created based on the table file created by BaseRecalibrator
  • - Script details
    • ApplyBQSR program written in Bash shell script.
    • Receives the reference values as a file and the table values as a folder with the table files in it. Receives the input and output values as a folder, and selects and analyzes the bam files in the folder.
  • #!/bin/bash
    
    INPUT_DIR=$1 
    REF_FILE=$2
    REF_DIR=$3
    OUTPUT_DIR=$4
    
    mkdir -p $OUTPUT_DIR
    
    array=(`find $INPUT_DIR/ -name "*.bam"`) 
    
    for i in "${array[@]}"
    do
        FILE=${i##*/}
        IFS='' read -r -a array <<< "$FILE"
        len=(${#array[@]}-1)
        echo "${array[$len]}"
        BQSR_IN_FILE_LIST+=("${array[$len]}")
    done
    
    for i in "${BQSR_IN_FILE_LIST[@]}"
    do
        basename=$(basename  $i)
        INPUT_FILE_NAME=$(basename  $i .bam)
        EXTENSION=${basename#*.}
    
        if [ $EXTENSION == 'bam' ]; then
     
            IFS='-' read -r -a FNAME <<< "$INPUT_FILE_NAME"
            IFS='_' read -r -a SEQ_ID <<< "$FNAME"
    
            echo $INPUT_FILE_NAME
            echo $FNAME
            
            INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
    
        APPLYBQSR='/BiO/program/gatk/current/gatk ApplyBQSR'
    
        time $APPLYBQSR \
            -R $REF_FILE \
            -I $INPUT_FILE \
            -bqsr $REF_DIR/${INPUT_FILE_NAME}.table \
            -O $OUTPUT_DIR/${SEQ_ID[0]}_v1.0.0.bam \
            --static-quantized-quals 10 \
            --static-quantized-quals 20 \
            --static-quantized-quals 30 
      
       fi
    done
  • - Pipeline connection
    [Figure 5.2.2-10]

9) Alignment - Extract&Filter of SNPs, Indels

GATK HaplotypeCaller

  • - Version and License: GATK 4.2.0.0 / MIT License
  • - Description: A program that calls SNPs and indels via local reassembly of haplotypes in active regions. The reassembled reads are used to create a vcf file, which is compressed into gvcf.gz.
  • - Input parameter
    • [input]: The path where the bam file created by ApplyBQSR is located
    • [reference]: The path where the reference sequence fasta file is located
    • [intervals]: The path where the genomic intervals list file is located
    • [output]: The path where the gvcf.gz file created by HaplotypeCaller is located
  • - Script details
    • HaplotypeCaller program written in Bash shell script.
    • Receives the reference values as a file. Receives the input and output values as a folder, and selects and analyzes the bam files in the folder.
  • #!/bin/bash
    
    echo "Total Param= $#,  PROG: $0, param1 = $1, param2 = $2, param3 = $3, param4 = $4";
    
    if [ "$#" -lt 4 ]; then
        echo "$# is Illegal number of parameters."
        echo "Usage: $0 input dir, reference file, output dir"
    	exit 1
    fi
    args=("$@")
    
    for (( i=0; i<$#; i++ ))
    do
        echo "$i th parameter = ${args[$i]}";
    done
    
    INPUT_DIR=$1   # directory of input BAM files
    REF_FILE1=$2
    REF_FILE2=$3
    OUTPUT_DIR=$4  # directory for output VCF files
    
    mkdir -p $OUTPUT_DIR
    
    array=(`find $INPUT_DIR/ -name "*.bam"`)
    
    for i in "${array[@]}"
    do
        FILE=${i##*/}
        IFS='' read -r -a array <<< "$FILE"
        len=(${#array[@]}-1)
        echo "${array[$len]}"
        HAPLO_IN_FILE_LIST+=("${array[$len]}")
    done
    
    
    for i in "${HAPLO_IN_FILE_LIST[@]}"
    do
    
            basename=$(basename  $i)
            INPUT_FILE_NAME=${basename%.*}
            EXTENSION=${basename##*.}
    
        if [ $EXTENSION == 'bam' ]; then
    
            IFS='-' read -r -a FNAME <<< "$INPUT_FILE_NAME"
            IFS='_' read -r -a SEQ_ID <<< "$FNAME"
    
            INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
    
        HAPLOTYPECALLER='/BiO/program/gatk/current/gatk HaplotypeCaller'
        
        time $HAPLOTYPECALLER -I $INPUT_FILE -R $REF_FILE1 -O $OUTPUT_DIR/${SEQ_ID}_v1.0.0.vcf -OVI true --emit-ref-confidence GVCF -L $REF_FILE2
        echo "$HAPLOTYPECALLER -I $INPUT_FILE -R $REF_FILE1 -O $OUTPUT_DIR/${SEQ_ID}_v1.0.0.vcf -OVI true --emit-ref-confidence GVCF -L $REF_FILE2"
            
        
        mv $OUTPUT_DIR/${SEQ_ID}_v1.0.0.vcf $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf
        echo "mv $OUTPUT_DIR/${SEQ_ID}_v1.0.0.vcf $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf"
    
        BGZIP='/BiO/program/htslib/current/bgzip'
        $BGZIP -c $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf > $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf.gz
        echo "$BGZIP -c $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf > $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf.gz"
        
        TABIX='/BiO/program/htslib/current/tabix'
    	$TABIX -p vcf $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf.gz
        echo "$TABIX -p vcf $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf.gz"
    
    
        fi
    done
  • - Pipeline connection
    [Figure 5.2.2-11]

10) Alignment - Variant Annotation

Somalier

  • - Version and License: 0.2.12 / MIT License
  • - Description: A program that extracts the informative sites from the BAM/CRAM/VCF files and evaluates the relatedness according to sequencing
  • - Input parameter
    • [input]: The path where the gvcf.gz file created by HaplotypeCaller is located
    • [reference]: The path where the reference sequence fasta file is located
    • [sites]: The path where the vcf file listing the variant sites to extract is located
    • [output]: The path where the somalier file will be created by Somalier
  • - Script details
    • Somalier program written in Bash shell script.
    • Receives the reference values as a file. Receives the input and output values as a folder, and selects and analyzes the gvcf.gz files in the folder.
  • #!/bin/bash
    
    INPUT_DIR=$1
    REF_FILE1=$2
    REF_FILE2=$3
    OUTPUT_DIR=$4
    
    mkdir -p $OUTPUT_DIR
    
    array=(`find $INPUT_DIR/ -name "*.gvcf.gz"`)
    
    for i in "${array[@]}"
    do
        FILE=${i##*/}
        IFS='' read -r -a array <<< "$FILE"
        len=(${#array[@]}-1)
        echo "${array[$len]}"
        SOMAL_IN_FILE_LIST+=("${array[$len]}")
    done
    
    for i in "${SOMAL_IN_FILE_LIST[@]}"
    do
    
        INPUT_FILE_NAME=`basename "$i" | cut -d'.' -f-3`
        EXTENSION=`echo "$i" | cut -d'.' -f4-`
    
        if [ $EXTENSION == 'gvcf.gz' ]; then
    
            INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.gvcf.gz
    		
            IFS='-' read -r -a FNAME <<< "$INPUT_FILE_NAME"
            IFS='_' read -r -a SEQ_ID <<< "$FNAME"
    
    	SOMALIER='/somalier/0.2.12/somalier'
        
        time $SOMALIER extract -d $OUTPUT_DIR -s ${REF_FILE2} -f ${REF_FILE1} $INPUT_FILE
    
        fi
    done
  • - Pipeline connection
    [Figure 5.2.2-12]

The result screen

  • Fastp Results Screen
    [Figure 5.2.2-13]

    [Figure 5.2.2-14]

    [Figure 5.2.2-15]

    [Figure 5.2.2-16]

  • GATK_Cutadapt Results Screen
    @ERR792543.3 HWI-ST333_0255_FC:4:1101:1235:2392/1
    CTGCCAATTCCCAATTATATTCTTGGAGAAGCTCATTAGGATAATGGAAATCAATCACTTTGGTTGATCTATCGAAACTT
    +
    CBCFFFFFHGHHGIJJIJIIJJJJJJGIIIJJGIGIJIJJJJGIIEIFEIHGIJJJJJJJFGGHIJIEHJGIJGIGHBEC
    @ERR792543.7 HWI-ST333_0255_FC:4:1101:1228:2438/1
    CTCCGGAGTGTAGGCTCCCCATATCTCAGCAAGTGGCCAAAGGTTCGTAATGCCATCTCTGCACNAATCTCCTCCCCCAT
    +
    @@@DFFFFCDHF8EHIIIIIIIIIIIIIIIIIGHGHIIIIIII?DDF;BFG<FHHCHEIFGCHH#-5;?DEEEEEBBDDC
    @ERR792543.8 HWI-ST333_0255_FC:4:1101:1179:2464/1
    GCCAGTCATGCCTGCCTGAGACGCCCCGCGGTTGGTGCCCATCTGTAACCCGATCACGTTCTTGNCCTCTTNCAGCTGGT
    +
    @CCFFFFFHHHHHJJJJJIJJJIIJJJIIJIDHIJFHIIJEHFHEHHHFFFFDCBDDBBDDDDC#,58??C#+8<<@CCC
    @ERR792543.10 HWI-ST333_0255_FC:4:1101:1400:2280/1
    CACAAACCTTGTTCTAATTTATGTGCACTTCTGAATAAAAATAACACCCTTTTGCTTTAATGCAATATTGATTTTATCAC
    +
    @CCDFFDFFHHHHIIGCHIIIIFIGGGGIIHIGIJIIJJIJIIJIJIGGIJIJJJJIJGGDEHGGIJIIHGIJJIHHIEH
    @ERR792543.12 HWI-ST333_0255_FC:4:1101:1389:2303/1
    ATGCGCTCCGCCGCTGTCCTGGCTCTTCTGCTCTGCGCCGGGCAAGTCACTGCGCTCCCTGTGAACAGCCCTATGAATAA
    +
    @@CFDFFFFHFD:FGGFHIIIIIIIIIIIGIIGGIIIIIIIFHFFFEACCEECCCC:@B?:@CEDCCCB?@CCCACDEDC
    @ERR792543.14 HWI-ST333_0255_FC:4:1101:1495:2309/1
    TCTCTCTCCTAATGAAAAAGTTTGTTTCTCTTTTACATTAGTAACGTATGTAACGGGTGGTATGTCTTTCTTTGCTTTGG
    +
    CCCFFFFFHHHDFIJJJJJJJJIJJIJJJJJJIJIGHIJGGIJJJJJJJIFGGIJFG<AH4=EEGGIJHGHHHGFCDFDD
    @ERR792543.15 HWI-ST333_0255_FC:4:1101:1274:2316/1
    GCCCAACTGTCCTGCTGATCTGAAGTAAGACAACTAGCCTTCTACATCAAATAATTCTGTGAGGCAATTTGAGTCAG
    +
    @@@DD?DDAC:3,<CGBFHHHIBCGHFEDEHF;EFEIFII@BFBGGI<DFIIGF@4BF>84<BDEFEEDGIIFFDF@
    @ERR792543.16 HWI-ST333_0255_FC:4:1101:1467:2316/1
    TGCAGTATAGATGGGATATGATGCATAGGCTTGGAGAACCACAGGCAAGGATGAGAGAAGAGAATATGGAAAGGATTG
  • BWA_mem Results Screen
    @SQ	SN:chr3	LN:198295559
    @SQ	SN:chr4	LN:190214555
    @SQ	SN:chr5	LN:181538259
    @SQ	SN:chr6	LN:170805979
    @SQ	SN:chr7	LN:159345973
    @SQ	SN:chr8	LN:145138636
    @SQ	SN:chr9	LN:138394717
    @SQ	SN:chr10	LN:133797422
    @SQ	SN:chr11	LN:135086622
    @SQ	SN:chr12	LN:133275309
    @SQ	SN:chr13	LN:114364328
    @SQ	SN:chr14	LN:107043718
    @SQ	SN:chr15	LN:101991189
    @SQ	SN:chr16	LN:90338345
    @SQ	SN:chr17	LN:83257441
    @SQ	SN:chr18	LN:80373285
    @SQ	SN:chr19	LN:58617616
    @SQ	SN:chr20	LN:64444167
    @SQ	SN:chr21	LN:46709983
    @SQ	SN:chr22	LN:50818468
    @SQ	SN:chrX	LN:156040895
    @SQ	SN:chrY	LN:57227415
    @SQ	SN:chrM	LN:16569
    @SQ	SN:chr1_KI270706v1_random	LN:175055
    @SQ	SN:chr1_KI270707v1_random	LN:32032
    @SQ	SN:chr1_KI270708v1_random	LN:127682
    @SQ	SN:chr1_KI270709v1_random	LN:66860
    @SQ	SN:chr1_KI270710v1_random	LN:40176
    @SQ	SN:chr1_KI270711v1_random	LN:42210
    @SQ	SN:chr1_KI270712v1_random	LN:1760438
                                                            
  • GATK_BaseRecalibrator Results Screen
    #:GATKTable:Arguments:Recalibration argument collection values used in this run
    Argument                    Value                                                                   
    binary_tag_name             null                                                                    
    covariate                   ReadGroupCovariate,QualityScoreCovariate,ContextCovariate,CycleCovariate
    default_platform            null                                                                    
    deletions_default_quality   45                                                                      
    force_platform              null                                                                    
    indels_context_size         3                                                                       
    insertions_default_quality  45                                                                      
    low_quality_tail            2                                                                       
    maximum_cycle_value         500                                                                     
    mismatches_context_size     2                                                                       
    mismatches_default_quality  -1                                                                      
    no_standard_covs            false                                                                   
    quantizing_levels           16                                                                      
    recalibration_report        null                                                                    
    run_without_dbsnp           false                                                                   
    solid_nocall_strategy       THROW_EXCEPTION                                                         
    solid_recal_mode            SET_Q_ZERO
    #:GATKTable:3:94:%d:%d:%d:;
    #:GATKTable:Quantized:Quality quantization map
    QualityScore  Count     QuantizedScore
               0         0               6
               1         0               6
               2         0               6
               3         0               6
               4         0               6
               5         0               6
               6      3784               6
                                                            
  • GATK4_HaplotypeCaller Results Screen
    ##fileformat=VCFv4.2
    ##ALT=<ID=NON_REF,Description="Represents any possible alternative allele not already represented at this location by REF and ALT">
    ##FILTER=<ID=LowQual,Description="Low quality">
    ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
    ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
    ##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
    ##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
    ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
    ##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
    ##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
    ##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --emit-ref-confidence GVCF --output /bioex/output/haplotype_caller/ERR792543_v1.0.0.vcf --intervals /bioex/reference/references_hg38_v0_wgs_metrics_intervals.interval_list --input /bioex/output/apply_bqsr/ERR792543_v1.0.0.bam --reference /bioex/reference/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta --create-output-variant-index true --use-posteriors-to-calculate-qual false --dont-use-dragstr-priors false --use-new-qual-calculator true --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --
                                                            

The Somalier result file

Somalier saves the results as a [filename].somalier file, as shown in [Figure 5.2.2-17].

[Figure 5.2.2-17]