The only cloud-based integrated data analysis service in Korea
High-speed analysis of genomic big data using the big data platform built with KOBIC's technology
Introduction to Bio-Express Web Service
Web-based Large Data Analysis Service
[Figure 1.1-1]
[Figure 1.1-2]
Download Bio-Express Workbench
2.1.1 Windows
Go to “https://www.kobic.re.kr/bioexpress/”.
[Figure 2.1.1-1]
[Figure 2.1.1-2]
Select Workbench for Windows via the quick menu on the right side of the main screen and download.
[Figure 2.1.1-3]
Run Bioexpress.exe after decompressing the downloaded .zip file to the desired directory.
2.1.2 Linux
[Figure 2.1.2-1]
Download to the desired directory via the wget command.
$ wget -O Bio-Express.workbench-linux.gtk-x86_64.tar.gz \
"https://www.kobic.re.kr/bioexpress/common/related_file_down?file_name=Bio-Express.workbench-linux.gtk-x86_64.tar.gz"
[Figure 2.1.2-2]
Run Bioexpress after decompressing the downloaded tar.gz file with the tar -zxvf command.
$ tar -zxvf Bio-Express.workbench-linux.gtk-x86_64.tar.gz
$ cd Bio-Express.workbench-linux.gtk-x86_64
$ ./Bioexpress
2.1.3 MacOS
Go to “https://www.kobic.re.kr/bioexpress/”.
[Figure 2.1.3-1]
[Figure 2.1.3-2]
Select Workbench for MacOS via the quick menu on the right side of the main screen and download.
[Figure 2.1.3-3]
[Figure 2.1.3-4]
After decompressing the downloaded .zip file, move it to the Applications folder and run Bioexpress.
Run Bio-Express Workbench
2.2.1. Login to Bio-Express Workbench
[Figure 2.2.1-1]
Log in with an ID registered with the KOBIC integrated login (SSO).
If you do not have an ID, sign up as a member using the ‘Sign up’ button at the bottom left, and then log in.
2.2.2. Bio-Express Workbench start screen
[Figure 2.2.2-1]
When the CLOSHA intro page is displayed, clicking the card area for each service opens the Bio-Express web page, where you can view a detailed introduction and usage information for that service.
[Figure 2.2.2-2]
You can access general functions and update the Workbench via the [File], [Window], and [Help] menus.
[Figure 2.2.2-3]
Using the menu on the far right of the toolbar, you can switch between the GBox perspective (the big data high-speed transfer system) and the CLOSHA perspective (the cloud-based integrated analysis workbench).
Control menus for workspace, pipeline, program, and category can be used through the toolbar.
Introduction to the GBox screen and functions
The GBox high-speed transfer system transfers the user's large volumes of data at high speed using software technology alone, without additional hardware such as an accelerator or a separate plug-in such as ActiveX. User data uploaded through GBox are available in CLOSHA.
[Figure 3.1-1]
GBox has been designed with an easy and intuitive user interface. Hence, users can run GBox without training.
Big data high-speed transfer
[Figure 3.2-1]
Upload method 1: Select the folder or file you want to transfer in the [File Explorer] view, and drag and drop it to the [GBox Explorer] view.
[Figure 3.2-2]
Upload method 2: Select the folder or file you want to transfer in the [File Explorer] view, and click the [Upload] button in the context menu to transfer it.
[Figure 3.2-3]
Download method 1: Select the folder or file you want to transfer in the [GBox Explorer] view, and drag and drop it to the [File Explorer] view.
[Figure 3.2-4]
Download method 2: Select the folder or file you want to transfer in the [GBox Explorer] view, and click the [Download] button in the context menu to transfer it.
[Figure 3.2-5]
Once the file transfer begins, the real-time transfer status can be monitored in the [Progress] view at the bottom of the application, as shown in [Figure 3.2-5].
High-speed synchronization of big data file system
[Figure 3.3-1]
GBox → HDFS: Select the folder or file you want to synchronize in the [GBox Explorer] view, and click the [Sync] button in the context menu to synchronize it.
[Figure 3.3-2]
HDFS → GBox: Select the folder or file you want to synchronize in the [HDFS Tree Browser] view, and click the [Sync] button in the context menu to synchronize it.
[Figure 3.3-3]
You can check the bidirectional file synchronization record in the [Synchronized History] view.
Data file management function
[Figure 3.4-1]
You can manage data using the context menu of the [File Explorer] view and the [GBox Explorer] view. ([New], [Rename], [Copy], [Paste], [Delete], [Refresh])
You can move directories and use shortcuts via the toolbar menu at the top of the [File Explorer] view and the [GBox Explorer] view.
Introduction to screens and functions
Bio-Express Workbench is a cloud-based program that provides services for analyzing a large volume of data. It has a user-friendly interface, so users can easily design a complex bio-information analysis process as a pipeline.
[Figure 4.1-1]
Register a user-defined program
4.2.1. Write a user analysis code
To register a user-defined program, you must first write a user analysis code.
[Figure 4.2.1-1]
Click the [New Folder] button in the context menu of the [Script] view to create a script folder for registering a user analysis code.
[Figure 4.2.1-2]
After entering the name of the new folder, click the [OK] button to create the folder.
[Figure 4.2.1-3]
Click the [New Script] button in the context menu of the created script folder, as shown in [Figure 4.2.1-3], to create a user analysis code. Python, Bash, and R file formats are supported for user analysis code.
[Figure 4.2.1-4]
After entering the name of the script to be created, click the [OK] button to create an empty script.
[Figure 4.2.1-5-1]
[Figure 4.2.1-5-2]
[Figure 4.2.1-5-3]
#!/bin/bash
# Quality control and adapter trimming with fastp for paired-end FASTQ files.
# $1: input directory containing *.fastq.gz, $2: output directory
INPUT_DIR=$1
OUTPUT_DIR=$2
mkdir -p $OUTPUT_DIR

# Return "true" if the first argument occurs in the array passed after it.
in_array() {
    local needle array value
    needle="${1}"; shift; array=("${@}")
    for value in "${array[@]}"; do
        [ "${value}" == "${needle}" ] && echo "true" && return
    done
    echo "false"
}

# Collect the file names (without paths) of all *.fastq.gz inputs.
array=(`find $INPUT_DIR/ -name "*.fastq.gz"`)
for i in "${array[@]}"
do
    FASTP_IN_FILE_LIST+=("${i##*/}")
done

# For every read-1 file that has a matching read-2 file, run fastp in
# paired-end mode and write the trimmed reads plus HTML/JSON reports.
for i in "${FASTP_IN_FILE_LIST[@]}"
do
    FILE1=${i%1.fastq.gz}                  # common prefix of the read pair
    array_check=`in_array ${FILE1}2.fastq.gz ${FASTP_IN_FILE_LIST[@]}`
    if [ "${array_check}" == "true" ]; then
        IFS='-' read -r -a SEQ_ID <<< "$FILE1"   # sample ID before the first '-'
        FASTP='/BiO/program/fastp/current/fastp'
        time $FASTP -i ${INPUT_DIR}/${FILE1}1.fastq.gz -I ${INPUT_DIR}/${FILE1}2.fastq.gz \
            -o ${OUTPUT_DIR}/${FILE1}1.fastq.gz -O ${OUTPUT_DIR}/${FILE1}2.fastq.gz \
            --length_required 20 --average_qual 20 --detect_adapter_for_pe --correction \
            -h ${OUTPUT_DIR}/${SEQ_ID[0]}.html -j ${OUTPUT_DIR}/${SEQ_ID[0]}.json
    fi
done
You can write or edit the created user analysis code in the editor by double-clicking it in the [Script] view, as shown in [Figure 4.2.1-5].
4.2.2. Register a user program
Register a user program to run the user analysis code you have written.
[Figure 4.2.2-1]
[Figure 4.2.2-2]
Click the [New Category] button of the context menu in the [Program] view to create the root category and subcategory for registering the user-defined program.
The workspace, category, pipeline, and program names may only contain letters, numbers, and some special characters (-_.), and the first character must be a letter. Other characters or spaces are not allowed.
A root category that contains a subcategory, or a subcategory that contains a program or pipeline, cannot be deleted or edited.
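The naming rule above can be expressed as a simple pattern check. A minimal sketch (this regular expression is our illustration of the stated rule, not taken from the Workbench itself):
#!/bin/bash
# Hypothetical validity check: a leading letter followed by letters,
# digits, or the special characters - _ . and no spaces.
is_valid_name() {
    [[ "$1" =~ ^[A-Za-z][A-Za-z0-9._-]*$ ]]
}
is_valid_name "fastp_test-v1.0" && echo "valid"
is_valid_name "1_fastp" || echo "invalid: must start with a letter"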
[Figure 4.2.2-3]
Create a subcategory under the root category using the context menu.
[Figure 4.2.2-4]
Click the [New Program] button in the context menu of the created subcategory to register a user program.
[Figure 4.2.2-5]
Register a user program by setting its basic information, additional information, and parameters.
You can enter multiple keywords by delimiting them with commas (,).
Parameters consist of Input, Output, and Option. The set values are default values, and the user can set the values again in the program execution stage.
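For reference, the example scripts in this manual read their parameter values as positional arguments ($1, $2, ...) in the order the parameters are defined, as in the fastp example in 4.2.1. A minimal sketch of the same convention (a hypothetical script, not a registered program):
#!/bin/bash
# Hypothetical user program: each registered parameter arrives as a
# positional argument, in the order the parameters were defined.
INPUT_DIR=$1          # Input parameter (e.g., a GBox folder)
OUTPUT_DIR=$2         # Output parameter (results folder)
MIN_QUAL=${3:-20}     # Option parameter; 20 stands in for the default value
mkdir -p "$OUTPUT_DIR"
echo "input=$INPUT_DIR output=$OUTPUT_DIR min_qual=$MIN_QUAL"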
[Figure 4.2.2-6]
Click the [Add Parameter] button to create the Input parameter of the fastp program. You can edit or delete the created parameter.
[Figure 4.2.2-7]
Enter the default value after specifying the type of the parameter to be created. If the parameter type is Folder or File, you can import files and folders from GBox using the [Browser] button.
[Figure 4.2.2-8]
You can check the created parameters, as shown in [Figure 4.2.2-8].
[Figure 4.2.2-9]
You can create the Output and Option parameters in the same manner.
Develop an analysis pipeline
4.3.1. Create a workspace
Create a workspace by clicking the [New Workspace] button in the context menu of the [Workspace] view, as shown in [Figure 4.3.1-1].
[Figure 4.3.1-1]
Enter the name and description of the workspace to be created. Register the related keywords and then click the [OK] button.
[Figure 4.3.1-2]
Create an analysis pipeline by clicking the [New Pipeline] button in the context menu of the created workspace.
[Figure 4.3.1-3]
Create a user-defined pipeline according to the pipeline type or create an instance pipeline by selecting an existing registered pipeline.
[Figure 4.3.1-4]
- [Develop a new pipeline]: Create a user-defined pipeline
- [Develop an instance pipeline]: Select an open pipeline
[Figure 4.3.1-5]
You can check the user-defined pipeline created through the [Workspace] view, as shown in [Figure 4.3.1-5].
4.3.2. Design an analysis pipeline
4.3.2-1. Design a user-defined pipeline
[Figure 4.3.2-1]
Find the analysis program required for the analysis pipeline design in the ‘Palette’ of the editor.
[Figure 4.3.2-2]
After selecting the program, drag and drop it onto the canvas using the mouse.
[Figure 4.3.2-3-1]
Save the pipeline designed with the required analysis program, and double-click each analysis program to set the required options for execution.
[Figure 4.3.2-3-2]
After setting the analysis pipeline execution option, create a single pipeline by connecting the analysis programs located on the canvas.
Changing the order of connecting the analysis programs and setting the options does not cause functional problems. However, when the analysis programs are connected, the output of the program in the previous stage is automatically set as the input. Therefore, it is recommended to connect the analysis programs after setting the options.
[Figure 4.3.2-4]
Design the user-defined pipeline by dragging and dropping the circle-shaped connection point located at the right end of the output parameter of the analysis program to the connection point at the left end of the input parameter of the next-stage analysis program.
[Figure 4.3.2-5]
The user can modify the shape of the connecting line freely by dragging the mouse, as shown in [Figure 4.3.2-5].
4.3.2-2. Design an instance pipeline
You can select an existing registered pipeline in the pipeline creation stage by selecting an instance pipeline type, as shown in [Figure 4.3.2-6].
[Figure 4.3.2-6]
You can also change the settings required to run each program in an instance pipeline, similar to a user-defined pipeline.
4.3.3. Share an analysis pipeline
[Figure 4.3.3-1]
The designed analysis pipeline can be shared between members who have joined KOBIC SSO.
Click [Shared Pipeline] in the context menu of the pipeline you want to share.
[Figure 4.3.3-2]
The dialog shows information about the pipeline to be shared, along with a table in which you can select the members to share it with.
[Figure 4.3.3-3]
[Figure 4.3.3-4]
Run an analysis pipeline
4.4.1. Run an analysis pipeline
You can run the designed analysis pipeline by clicking the [Run] button at the top of the editor.
[Figure 4.4.1-1]
When the analysis is run, you can check the progress of the analysis pipeline through the [Pipeline] view, the [Log] view, and the editor. You can also check the real-time progress of the analysis pipeline by clicking the [Refresh] button on each view and the editor. When the analysis pipeline is run, the analysis programs are executed sequentially.
[Figure 4.4.1-2-1]
[Figure 4.4.1-2-2]
The running programs are in the run state and shown in green on the editor, whereas the completed programs are in the complete state and shown in blue.
4.4.2. Run a single analysis program
[Figure 4.4.2-1]
You can also run each analysis program individually. Click the [Start Node] button in the shortcut menu that appears when you hover the mouse pointer over the analysis program you want to run.
4.4.3. Check data analysis results
[Figure 4.4.3-1]
Check the log information for each analysis program through the [Log] view.
[Figure 4.4.3-2]
If you double-click a log entry, you can check the log of the analysis process and any errors in the [Console] view.
[Figure 4.4.3-3]
Go to the Output path specified for each analysis program and check the results using the [GBox Browse] view.
[Figure 4.4.3-4]
[Figure 4.4.3-4] shows the analysis results of fastp, which is an open program in the Bioinformatics category.
Register an open service
4.5.1. Register a pipeline to make it open to the public
To share an analysis pipeline with all users, you can register it as an open, public pipeline.
[Figure 4.5.1-1]
Select the desired analysis pipeline in the [Workspace] view and click the [Register Pipeline] button in the context menu, as shown in [Figure 4.5.1-1].
[Figure 4.5.1-2]
A pipeline submitted for public registration goes through an administrator approval process. Once approved, it can be used as an open analysis pipeline.
4.5.2. Register a program to make it open to the public
Similarly, a program submitted for public registration goes through an administrator approval process. Once approved, it is registered as an open analysis program.
[Figure 4.5.2-1]
You can select a desired analysis program in the [Program] view and click the [Register Program] button in the context menu to register the program to make it open to the public, as shown in [Figure 4.5.2-1].
Users cannot delete or edit open categories, programs, or pipelines.
Introduction to the genomic analysis pipeline
5.1.1. WGS
Pipeline
[Figure 5.1.1-1]
Application area
Whole-Genome Sequencing
Function summary
This pipeline receives the DNA sequences of normal and cancerous tissues as input and performs processes such as Read Map to Reference, Read Merge, and CleanUp to create a BAM file. It then uses GATK4 to call somatic SNP & InDel variants quickly and performs variant annotation.
Detailed description
The genomic analysis pipeline consists of 10 stages: Fastp, Cutadapt, BWA (mem), SortSam, MarkDuplicates, CountBases, BaseRecalibrator, ApplyBQSR, HaplotypeCaller, and Somalier. The analysis process at each stage is as follows.

The input of Fastp, the first stage, is the path of the folder containing the paired-end fastq.gz files. Fastp preprocesses the FASTQ reads, and the quality-control and filtering results are visualized in JSON and HTML reports.

Next, Cutadapt receives the fastq.gz files preprocessed by Fastp as input. Cutadapt finds adapter sequences, primers, poly-A tails, and other unwanted sequences in the high-throughput sequencing reads and removes them. The FASTQ files resulting from this adaptive trimming are sent to BWA (mem), the next analysis stage.

BWA maps low-divergence sequences to large genome references such as the human genome. BWA (mem) seeds query sequences of 70 bp to 10 Mbp with maximal exact matches against the genome reference and aligns them to create a SAM file. The SAM file is then sorted by coordinate, queryname, or other attributes of the SAM records through SortSam, which returns a BAM file.

Next, MarkDuplicates tags duplicate reads in the BAM file and sends the result on to CountBases, BaseRecalibrator, and ApplyBQSR. CountBases produces an output file that counts the total number of bases. BaseRecalibrator performs the first of the two stages of Base Quality Score Recalibration (BQSR): it creates a recalibration table based on covariates such as read group, reported quality score, machine cycle, and nucleotide context. ApplyBQSR performs the second stage of BQSR: it recalibrates the base quality of the input reads based on the table produced by BaseRecalibrator and creates a recalibrated BAM file.

Next, HaplotypeCaller reassembles the reads in the active regions and calls the SNPs and InDels of haplotypes. It uses the reassembled reads to create a VCF file, compresses it into the gvcf.gz format, and sends it to the next stage. Somalier, the final analysis stage, extracts informative sites from the sequencing data and evaluates relatedness, saving the evaluation results in the .somalier file format.
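To make the data flow concrete, the sketch below chains the ten stages so that each stage reads the previous stage's output directory. The wrapper-script names, folder layout, and reference file names are hypothetical placeholders; the actual registered scripts are shown step by step in section 5.2.2.
#!/bin/bash
# Hypothetical outline of the 10-stage WGS pipeline; each wrapper script
# stands for one registered program described in section 5.2.2.
RAW=/gbox/input; REF=/gbox/reference; OUT=/gbox/output
bash fastp.sh            $RAW              $OUT/01_fastp
bash cutadapt.sh         $OUT/01_fastp     $OUT/02_cutadapt
bash bwa_mem.sh          $OUT/02_cutadapt  $REF $OUT/03_bwa
bash sortsam.sh          $OUT/03_bwa       $OUT/04_sortsam
bash markduplicates.sh   $OUT/04_sortsam   $OUT/05_markdup
bash countbases.sh       $OUT/05_markdup   $REF/wgs.interval_list $OUT/06_count
bash baserecalibrator.sh $OUT/05_markdup   $REF/genome.fasta \
     $REF/dbsnp.vcf.gz $REF/mills.vcf.gz $REF/known_indels.vcf.gz $OUT/07_tables
bash applybqsr.sh        $OUT/05_markdup   $REF/genome.fasta $OUT/07_tables $OUT/08_bqsr
bash haplotypecaller.sh  $OUT/08_bqsr      $REF/genome.fasta $REF/wgs.interval_list $OUT/09_gvcf
bash somalier.sh         $OUT/09_gvcf      $REF/genome.fasta $REF/sites.vcf.gz $OUT/10_somalier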
NGS data analysis pipeline design process
5.2.1. NGS data preprocessing
Preprocessing workflow
[Figure 5.2.1-1]
FASTQ format
[Figure 5.2.1-2]
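A FASTQ record occupies four lines: a read identifier beginning with '@', the base sequence, a '+' separator line, and one quality character per base. An illustrative record (a hypothetical read, not pipeline output):
@SEQ_ID_001/1
GATTTGGGGTTCAAAGCAGT
+
!''*((((***+))%%%++(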
FASTQ format (cont.)
Phred quality score Q for a base-calling error probability P:
Q = -10 log10(P)
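For example, an error probability of P = 0.01 corresponds to Q = -10 log10(0.01) = 20, and P = 0.001 to Q = 30. In FASTQ files each score is stored as a single ASCII character (Phred+33 encoding), so Q20 is written as '5' and Q30 as '?'.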
[Figure 5.2.1-3]
[Figure 5.2.1-4]
FASTQ format
[Figure 5.2.1-5]
SAM format
[Figure 5.2.1-6]
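A SAM file starts with header lines beginning with '@' (such as @HD and the @SQ reference-sequence entries), followed by tab-separated alignment lines with 11 mandatory fields: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, and QUAL. An illustrative example (a hypothetical read, not pipeline output):
@HD    VN:1.6    SO:coordinate
@SQ    SN:chr1    LN:248956422
read001    99    chr1    10468    60    20M    =    10568    120    GATTTGGGGTTCAAAGCAGT    IIIIIIIIIIIIIIIIIIII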
SAM format (cont.)
[Figure 5.2.1-7]
[Figure 5.2.1-8]
[Figure 5.2.1-9]
Preprocessing stage
[Figure 5.2.1-10]
[Figure 5.2.1-11]
[Figure 5.2.1-12]
Trimming
[Figure 5.2.1-13]
Alignment
[Figure 5.2.1-14]
Preprocessing stage (cont.)
Alignment – Aligning to a Reference
[Figure 5.2.1-15]
[Figure 5.2.1-16]
5.2.2. Whole Genome Sequencing Pipeline
Pipeline summary
[Figure 5.2.2-1]
This pipeline receives the DNA sequences of normal and cancerous tissues as input and performs processes such as Read Map to Reference, Read Merge, and CleanUp to create a BAM file. It then uses GATK4 to call somatic SNP & InDel variants quickly and performs variant annotation.
Programs used: Fastp, Cutadapt, BWA, GATK4, Somalier
Output: variant annotation file of somatic SNP & InDel variants
Registered pipeline
[Figure 5.2.2-2]
Step-by-step description
1) Quality Check - Quality Control & Adaptive Trimming
Fastp
#!/bin/bash
# Quality control and adapter trimming with fastp for paired-end FASTQ files.
# $1: input directory containing *.fastq.gz, $2: output directory
INPUT_DIR=$1
OUTPUT_DIR=$2
mkdir -p $OUTPUT_DIR

# Return "true" if the first argument occurs in the array passed after it.
in_array() {
    local needle array value
    needle="${1}"; shift; array=("${@}")
    for value in "${array[@]}"; do
        [ "${value}" == "${needle}" ] && echo "true" && return
    done
    echo "false"
}

# Collect the file names (without paths) of all *.fastq.gz inputs.
array=(`find $INPUT_DIR/ -name "*.fastq.gz"`)
for i in "${array[@]}"
do
    FASTP_IN_FILE_LIST+=("${i##*/}")
done

# For every read-1 file that has a matching read-2 file, run fastp in
# paired-end mode and write the trimmed reads plus HTML/JSON reports.
for i in "${FASTP_IN_FILE_LIST[@]}"
do
    FILE1=${i%1.fastq.gz}                  # common prefix of the read pair
    array_check=`in_array ${FILE1}2.fastq.gz ${FASTP_IN_FILE_LIST[@]}`
    if [ "${array_check}" == "true" ]; then
        IFS='-' read -r -a SEQ_ID <<< "$FILE1"   # sample ID before the first '-'
        FASTP='/BiO/program/fastp/current/fastp'
        time $FASTP -i ${INPUT_DIR}/${FILE1}1.fastq.gz -I ${INPUT_DIR}/${FILE1}2.fastq.gz \
            -o ${OUTPUT_DIR}/${FILE1}1.fastq.gz -O ${OUTPUT_DIR}/${FILE1}2.fastq.gz \
            --length_required 20 --average_qual 20 --detect_adapter_for_pe --correction \
            -h ${OUTPUT_DIR}/${SEQ_ID[0]}.html -j ${OUTPUT_DIR}/${SEQ_ID[0]}.json
    fi
done
[Figure 5.2.2-3]
2) Trimming - Quality Control & Adaptive Trimming
Cutadapt
#!/bin/bash
# Adapter trimming with Cutadapt for paired-end FASTQ files.
# $1: input directory containing *.fastq.gz, $2: output directory
INPUT_DIR=$1
OUTPUT_DIR=$2
mkdir -p $OUTPUT_DIR

# Return "true" if the first argument occurs in the array passed after it.
in_array() {
    local needle array value
    needle="${1}"; shift; array=("${@}")
    for value in "${array[@]}"; do
        [ "${value}" == "${needle}" ] && echo "true" && return
    done
    echo "false"
}

# Collect the file names (without paths) of all *.fastq.gz inputs.
# (The original script appended these to SORT_IN_FILE_LIST while the loop
# below read CUTADAPT_IN_FILE_LIST, so no file was ever processed; fixed here.)
array=(`find $INPUT_DIR/ -name "*.fastq.gz"`)
for i in "${array[@]}"
do
    echo "${i##*/}"
    CUTADAPT_IN_FILE_LIST+=("${i##*/}")
done

# For every read-1 file with a matching read-2 file, trim Illumina TruSeq
# adapters (read 1: -a, read 2: -A) and low-quality tails, keeping pairs
# at least 70 bp long.
for i in "${CUTADAPT_IN_FILE_LIST[@]}"
do
    FILE1=${i%1.fastq.gz}
    array_check=`in_array ${FILE1}2.fastq.gz ${CUTADAPT_IN_FILE_LIST[@]}`
    if [ "${array_check}" == "true" ]; then
        READ_1=$INPUT_DIR/${FILE1}1.fastq.gz
        READ_2=$INPUT_DIR/${FILE1}2.fastq.gz
        READ_1_NAME=$(basename $READ_1 .fastq.gz)
        READ_2_NAME=$(basename $READ_2 .fastq.gz)
        CUTADAPT='/BiO/program/python/current/bin/cutadapt'
        time $CUTADAPT -q 30 -j 2 \
            -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
            --minimum-length 70 --pair-filter=any \
            -o $OUTPUT_DIR/${READ_1_NAME}.fastq -p $OUTPUT_DIR/${READ_2_NAME}.fastq \
            $READ_1 $READ_2
    fi
done
[Figure 5.2.2-4]
3) Alignment - Read Map to Reference
BWA mem
#!/bin/bash
# Map trimmed paired-end reads to the reference genome with BWA-MEM.
# $1: input directory containing *.fastq, $2: reference directory
# containing the *.fasta reference, $3: output directory
INPUT_DIR=$1
REF_DIR=$2
OUTPUT_DIR=$3
mkdir -p $OUTPUT_DIR
REF_FILE=(`find $REF_DIR/ -name "*.fasta"`)

# Return "true" if the first argument occurs in the array passed after it.
in_array() {
    local needle array value
    needle="${1}"; shift; array=("${@}")
    for value in "${array[@]}"; do
        [ "${value}" == "${needle}" ] && echo "true" && return
    done
    echo "false"
}

# Collect the file names (without paths) of all *.fastq inputs.
# (The original script appended these to SORT_IN_FILE_LIST while the loop
# below read BWA_IN_FILE_LIST, so no file was ever processed; fixed here.)
array=(`find $INPUT_DIR/ -name "*.fastq"`)
for i in "${array[@]}"
do
    echo "${i##*/}"
    BWA_IN_FILE_LIST+=("${i##*/}")
done

# Align each read pair and write a SAM file with a read-group (@RG) header.
for i in "${BWA_IN_FILE_LIST[@]}"
do
    FILE1=${i%1.fastq}
    array_check=`in_array ${FILE1}2.fastq ${BWA_IN_FILE_LIST[@]}`
    if [ "${array_check}" == "true" ]; then
        READ_1=$INPUT_DIR/${FILE1}1.fastq
        READ_2=$INPUT_DIR/${FILE1}2.fastq
        INPUT_FILE_NAME=$(basename $READ_1 .fastq)
        IFS='_' read -r -a NAME <<< "$INPUT_FILE_NAME"
        IFS='-' read -r -a FASTQ_2D_ID <<< "${NAME[0]}"   # sample ID before the first '-'
        OUTPUT_SAM=$OUTPUT_DIR/${NAME}.sam
        BWA='/BiO/program/bwa/current/bwa'
        $BWA mem \
            -M -R "@RG\tID:${INPUT_DIR}\tPL:Nova6000S4\tLB:Truseq_PCRFREE\tSM:${FASTQ_2D_ID}" -v 1 \
            -t 1 $REF_FILE $READ_1 $READ_2 > $OUTPUT_SAM
    fi
done
[Figure 5.2.2-5]
4) Alignment - Read Merge
GATK SortSam
#!/bin/bash
# Sort the SAM files produced by BWA by coordinate and convert them to BAM.
# $1: input directory containing *.sam, $2: output directory
INPUT_DIR=$1
OUTPUT_DIR=$2
mkdir -p $OUTPUT_DIR

# Collect the file names (without paths) of all *.sam inputs.
array=(`find $INPUT_DIR/ -name "*.sam"`)
for i in "${array[@]}"
do
    echo "${i##*/}"
    SORT_IN_FILE_LIST+=("${i##*/}")
done

# Sort each SAM file by coordinate and write a BAM file.
for i in "${SORT_IN_FILE_LIST[@]}"
do
    basename=$(basename $i)
    INPUT_FILE_NAME=$(basename $i .sam)
    EXTENSION=${basename#*.}
    if [ $EXTENSION == 'sam' ]; then
        INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.sam
        SORTSAM='/BiO/program/gatk/current/gatk SortSam'
        time $SORTSAM -I $INPUT_FILE -O $OUTPUT_DIR/${INPUT_FILE_NAME}.bam -SO coordinate
    fi
done
[Figure 5.2.2-6]
5) Alignment - CleanUp
GATK MarkDuplicates
#!/bin/bash
# Tag duplicate reads in each BAM file with GATK MarkDuplicates.
# $1: input directory containing *.bam, $2: output directory
INPUT_DIR=$1
OUTPUT_DIR=$2
mkdir -p $OUTPUT_DIR

# Collect the file names (without paths) of all *.bam inputs.
array=(`find $INPUT_DIR/ -name "*.bam"`)
for i in "${array[@]}"
do
    echo "${i##*/}"
    MARK_IN_FILE_LIST+=("${i##*/}")
done

# Mark (but do not remove) duplicate reads and write a duplication metrics table.
for i in "${MARK_IN_FILE_LIST[@]}"
do
    basename=$(basename $i)
    INPUT_FILE_NAME=$(basename $i .bam)
    EXTENSION=${basename#*.}
    if [ $EXTENSION == 'bam' ]; then
        INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
        MARKDUPLICATES='/BiO/program/gatk/current/gatk MarkDuplicates'
        time $MARKDUPLICATES \
            -I $INPUT_FILE \
            -O $OUTPUT_DIR/${INPUT_FILE_NAME}.bam \
            -M $OUTPUT_DIR/${INPUT_FILE_NAME}.table \
            --REMOVE_SEQUENCING_DUPLICATES false
        echo "$MARKDUPLICATES -I $INPUT_FILE -O $OUTPUT_DIR/${INPUT_FILE_NAME}.bam -M $OUTPUT_DIR/${INPUT_FILE_NAME}.table --REMOVE_SEQUENCING_DUPLICATES false"
    fi
done
[Figure 5.2.2-7]
6) Alignment - Base Count
GATK CountBases
#!/bin/bash
# Count the total number of bases in each BAM file with GATK CountBases.
# $1: input directory containing *.bam, $2: interval file passed to -L,
# $3: output directory
INPUT_DIR=$1
REF_FILE=$2
OUTPUT_DIR=$3
mkdir -p $OUTPUT_DIR

# Collect the file names (without paths) of all *.bam inputs.
array=(`find $INPUT_DIR/ -name "*.bam"`)
for i in "${array[@]}"
do
    echo "${i##*/}"
    COUNT_IN_FILE_LIST+=("${i##*/}")
done

# Index each BAM file, then count its bases over the given intervals,
# excluding secondary alignments.
for i in "${COUNT_IN_FILE_LIST[@]}"
do
    basename=$(basename $i)
    INPUT_FILE_NAME=$(basename $i .bam)
    EXTENSION=${basename#*.}
    if [ $EXTENSION == 'bam' ]; then
        INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
        /samtools/1.6/samtools index $INPUT_FILE
        COUNTBASES='/BiO/program/gatk/current/gatk CountBases'
        time $COUNTBASES \
            -I $INPUT_FILE \
            -O $OUTPUT_DIR/${INPUT_FILE_NAME}.count \
            -RF NotSecondaryAlignmentReadFilter \
            -L $REF_FILE
    fi
done
[Figure 5.2.2-8]
7) Alignment - Base Quality Score Recalibration
GATK BaseRecalibrator
#!/bin/bash
# First BQSR stage: build a recalibration table for each BAM file.
# $1: input directory containing *.bam, $2: reference FASTA,
# $3, $4, $5: known-sites files, $6: output directory
INPUT_DIR=$1
REF_FILE1=$2
REF_FILE2=$3
REF_FILE3=$4
REF_FILE4=$5
OUTPUT_DIR=$6
mkdir -p $OUTPUT_DIR

# Collect the file names (without paths) of all *.bam inputs.
array=(`find $INPUT_DIR/ -name "*.bam"`)
for i in "${array[@]}"
do
    echo "${i##*/}"
    BASE_IN_FILE_LIST+=("${i##*/}")
done

# Create a recalibration table per BAM based on the known variant sites.
for i in "${BASE_IN_FILE_LIST[@]}"
do
    basename=$(basename $i)
    INPUT_FILE_NAME=$(basename $i .bam)
    EXTENSION=${basename#*.}
    if [ $EXTENSION == 'bam' ]; then
        BASERECALIBRATOR='/BiO/program/gatk/current/gatk BaseRecalibrator'
        time $BASERECALIBRATOR \
            -I $INPUT_DIR/${INPUT_FILE_NAME}.bam \
            -R ${REF_FILE1} \
            --known-sites ${REF_FILE2} \
            --known-sites ${REF_FILE3} \
            --known-sites ${REF_FILE4} \
            -O $OUTPUT_DIR/${INPUT_FILE_NAME}.table
    fi
done
[Figure 5.2.2-9]
8) Alignment - Apply BQSR
GATK ApplyBQSR
#!/bin/bash
# Second BQSR stage: apply the recalibration table to each BAM file.
# $1: input directory containing *.bam, $2: reference FASTA, $3: directory
# containing the *.table files from BaseRecalibrator, $4: output directory
INPUT_DIR=$1
REF_FILE=$2
REF_DIR=$3
OUTPUT_DIR=$4
mkdir -p $OUTPUT_DIR

# Collect the file names (without paths) of all *.bam inputs.
array=(`find $INPUT_DIR/ -name "*.bam"`)
for i in "${array[@]}"
do
    echo "${i##*/}"
    BQSR_IN_FILE_LIST+=("${i##*/}")
done

# Recalibrate base qualities and write <sample>_v1.0.0.bam per input.
for i in "${BQSR_IN_FILE_LIST[@]}"
do
    basename=$(basename $i)
    INPUT_FILE_NAME=$(basename $i .bam)
    EXTENSION=${basename#*.}
    if [ $EXTENSION == 'bam' ]; then
        IFS='-' read -r -a FNAME <<< "$INPUT_FILE_NAME"
        IFS='_' read -r -a SEQ_ID <<< "$FNAME"   # sample ID before the first '-' and '_'
        echo $INPUT_FILE_NAME
        echo $FNAME
        INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
        APPLYBQSR='/BiO/program/gatk/current/gatk ApplyBQSR'
        time $APPLYBQSR \
            -R $REF_FILE \
            -I $INPUT_FILE \
            -bqsr $REF_DIR/${INPUT_FILE_NAME}.table \
            -O $OUTPUT_DIR/${SEQ_ID[0]}_v1.0.0.bam \
            --static-quantized-quals 10 \
            --static-quantized-quals 20 \
            --static-quantized-quals 30
    fi
done
[Figure 5.2.2-10]
9) Alignment - Extract & Filter of SNPs, InDels
GATK HaplotypeCaller
#!/bin/bash
echo "Total Param= $#, PROG: $0, param1 = $1, param2 = $2, param3 = $3, param4 = $4";
if [ "$#" -lt 4 ]; then
    echo "$# is an illegal number of parameters."
    echo "Usage: $0 <input_dir> <reference_fasta> <interval_list> <output_dir>"
    exit 1
fi
args=("$@")
for (( i=0; i<$#; i++ ))
do
    echo "$i th parameter = ${args[$i]}";
done

INPUT_DIR=$1    # BAM files
REF_FILE1=$2    # reference FASTA
REF_FILE2=$3    # interval list
OUTPUT_DIR=$4   # VCF files
mkdir -p $OUTPUT_DIR

# Collect the file names (without paths) of all *.bam inputs.
array=(`find $INPUT_DIR/ -name "*.bam"`)
for i in "${array[@]}"
do
    echo "${i##*/}"
    HAPLO_IN_FILE_LIST+=("${i##*/}")
done

# Call variants per BAM in GVCF mode, then rename, bgzip, and index the result.
for i in "${HAPLO_IN_FILE_LIST[@]}"
do
    basename=$(basename $i)
    INPUT_FILE_NAME=${basename%.*}
    EXTENSION=${basename##*.}
    if [ $EXTENSION == 'bam' ]; then
        IFS='-' read -r -a FNAME <<< "$INPUT_FILE_NAME"
        IFS='_' read -r -a SEQ_ID <<< "$FNAME"   # sample ID before the first '-' and '_'
        INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
        HAPLOTYPECALLER='/BiO/program/gatk/current/gatk HaplotypeCaller'
        time $HAPLOTYPECALLER -I $INPUT_FILE -R $REF_FILE1 -O $OUTPUT_DIR/${SEQ_ID}_v1.0.0.vcf -OVI true --emit-ref-confidence GVCF -L $REF_FILE2
        echo "$HAPLOTYPECALLER -I $INPUT_FILE -R $REF_FILE1 -O $OUTPUT_DIR/${SEQ_ID}_v1.0.0.vcf -OVI true --emit-ref-confidence GVCF -L $REF_FILE2"
        mv $OUTPUT_DIR/${SEQ_ID}_v1.0.0.vcf $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf
        echo "mv $OUTPUT_DIR/${SEQ_ID}_v1.0.0.vcf $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf"
        BGZIP='/BiO/program/htslib/current/bgzip'
        $BGZIP -c $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf > $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf.gz
        echo "$BGZIP -c $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf > $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf.gz"
        TABIX='/BiO/program/htslib/current/tabix'
        $TABIX -p vcf $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf.gz
        echo "$TABIX -p vcf $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf.gz"
    fi
done
[Figure 5.2.2-11]
10) Alignment - Variant Annotation
Somalier
#!/bin/bash
# Extract informative sites from each gVCF with Somalier for relatedness checks.
# $1: input directory containing *.gvcf.gz, $2: reference FASTA,
# $3: Somalier sites file, $4: output directory
INPUT_DIR=$1
REF_FILE1=$2
REF_FILE2=$3
OUTPUT_DIR=$4
mkdir -p $OUTPUT_DIR

# Collect the file names (without paths) of all *.gvcf.gz inputs.
array=(`find $INPUT_DIR/ -name "*.gvcf.gz"`)
for i in "${array[@]}"
do
    echo "${i##*/}"
    SOMAL_IN_FILE_LIST+=("${i##*/}")
done

# Run "somalier extract" on each gVCF; results are written as *.somalier files.
for i in "${SOMAL_IN_FILE_LIST[@]}"
do
    INPUT_FILE_NAME=`basename "$i" | cut -d'.' -f-3`   # e.g. ERR792543_v1.0.0
    EXTENSION=`echo "$i" | cut -d'.' -f4-`
    if [ $EXTENSION == 'gvcf.gz' ]; then
        INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.gvcf.gz
        SOMALIER='/somalier/0.2.12/somalier'
        time $SOMALIER extract -d $OUTPUT_DIR -s ${REF_FILE2} -f ${REF_FILE1} $INPUT_FILE
    fi
done
[Figure 5.2.2-12]
The result screen
[Figure 5.2.2-13]
[Figure 5.2.2-14]
[Figure 5.2.2-15]
[Figure 5.2.2-16]
The screens above show excerpts of the intermediate result files: trimmed paired-end FASTQ records from fastp ([Figure 5.2.2-13]), the @SQ reference-sequence lines of the SAM/BAM header ([Figure 5.2.2-14]), the BQSR recalibration table produced by BaseRecalibrator ([Figure 5.2.2-15]), and the gVCF header written by HaplotypeCaller ([Figure 5.2.2-16]).
The Somalier result file
Somalier saves its results as a [filename].somalier file, as shown in [Figure 5.2.2-17].
[Figure 5.2.2-17]
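The extracted .somalier files can then be compared to estimate pairwise relatedness. A minimal sketch, assuming the same somalier binary path as in the script above (the relate subcommand is part of Somalier, but this exact invocation is our illustration, not part of the registered pipeline):
# Hypothetical follow-up: estimate pairwise relatedness from the *.somalier
# files produced by the extract step.
/somalier/0.2.12/somalier relate $OUTPUT_DIR/*.somalier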