About Bio-Express
Introduction to Bio-Express Web Service
Install Bio-Express Workbench
Download Bio-Express Workbench
Go to “https://www.kobic.re.kr/bioexpress/”.
Select Workbench for Windows via the quick menu on the right side of the main screen and download.
Run Bioexpress.exe after decompressing the downloaded .zip file to the desired directory.
Download to the desired directory with the wget command.
$ wget -O Bio-Express.workbench-linux.gtk-x86_64.tar.gz \
"https://www.kobic.re.kr/bioexpress/common/related_file_down?file_name=Bio-Express.workbench-linux.gtk-x86_64&file_ext=tar.gz"
Run Bioexpress after decompressing the downloaded tar.gz file with the tar -zxvf command.
$ tar -zxvf Bio-Express.workbench-linux.gtk-x86_64.tar.gz
$ cd Bio-Express.workbench-linux.gtk-x86_64
$ ./Bioexpress
Go to “https://www.kobic.re.kr/bioexpress/”.
Select Workbench for MacOS via the quick menu on the right side of the main screen and download.
After decompressing the downloaded .zip file, move it to the Applications folder and run Bioexpress.
Run Bio-Express Workbench
Log in with the ID registered in KOBIC integrated login (SSO).
If you do not have an ID, log in after signing up as a member using the ‘Sign up’ button at the bottom left.
When the CLOSHA intro page is displayed, clicking the card area for each service connects you to the Bio-Express web page, where you can view detailed introductions and usage information.
You can access user-friendly functions and update the Workbench via the [File], [Window], and [Help] menus.
Using the menu at the rightmost side of the toolbar, you can switch between the big data high-speed transfer system (GBox) perspective and the cloud-based integrated analysis workbench (CLOSHA) perspective.
Introduction to big data high-speed transfer system functions
Introduction to Screens and Functions
The GBox high-speed transfer system transmits the user's large-capacity data at high speed using software technology alone, without additional hardware such as accelerators or the installation of separate plug-ins such as ActiveX. Data uploaded through GBox is available in CLOSHA.
GBox is designed with an easy and intuitive user interface, so it can be used without any training.
High-speed transfer of Big Data
Upload method 1: Select the folder or file to be transferred from the [File Explorer] view and drag & drop it to the [GBox Explorer] view.
Upload method 2: In the [File Explorer] view, select the folder or file you want to transfer and transfer it using the [Upload] button in the Context menu.
Download method 1: Select the folder or file to be transferred from [GBox Explorer] and drag & drop it to the [File Explorer] view.
Download method 2: Select the folder or file you want to transfer in [GBox Explorer] and transfer it with the Context menu [Download] button.
When the file transfer starts, real-time transfer status monitoring is possible with the [Progress] view at the bottom of the application.
High-speed synchronization of Big Data file system
GBox -> HDFS: Select the folder or file to synchronize in the [GBox Explorer] view and synchronize it using the Context menu [Sync] button.
HDFS -> GBox: Select the folder or file to synchronize in the [HDFS Tree Browser] view and synchronize it by pressing the [Sync] button in the Context menu.
You can check the bidirectional file synchronization history in the [Synchronized History] view.
Data file management function
You can manage data in the Context menu of the [File Explorer] view and the [GBox Explorer] view. ([New], [Rename], [Copy], [Paste], [Delete], [Refresh])
You can use the ‘go to directory’ and ‘shortcut’ functions in the toolbar menu at the top of the [File Explorer] view and [GBox Explorer] view.
Introduction on the Detailed Features of Cloud-based Open Integrated Analytics Workbench
Introduction to Screens and Functions
Bio-Express Workbench is a cloud-based program that provides large-scale data analysis services. It offers a user-friendly interface that makes it easy to design complex bioinformatics analysis processes as pipelines.
Registration for a user-defined program
To register a user-defined program, you must first register a user analysis code.
Click the [New Folder] button in the Context menu of the [Script] View to create a script folder in which to register the user analysis code.
After entering the folder name, click the [OK] button to create the folder.
As shown in [Figure 4.2.1-3], click the [New Script] button in the Context menu of the created script folder to create a user analysis code. Python, Bash, and R are the supported languages for user analysis codes.
After entering the name of the script to be created, click the [OK] button to create an empty script.
#!/bin/bash
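# Example user analysis code: finds paired *.fastq.gz files under the input directory ($1),
# runs fastp quality/adapter trimming on each pair, and writes the trimmed reads and
# HTML/JSON QC reports to the output directory ($2).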
INPUT_DIR=$1
OUTPUT_DIR=$2
mkdir -p $OUTPUT_DIR
in_array() {
local needle array value
needle="${1}"; shift; array=("${@}")
for value in ${array[@]};
do
[ "${value}" == "${needle}" ] && echo "true" && return;
done
echo "false"
}
array=(`find $INPUT_DIR/ -name "*.fastq.gz"`)
for i in "${array[@]}"
do
FILE=${i##*/}
IFS='' read -r -a array <<< "$FILE"
len=(${#array[@]}-1)
FASTP_IN_FILE_LIST+=("${array[$len]}")
done
for i in "${FASTP_IN_FILE_LIST[@]}"
do
basename=$i
FILE1=${basename%1.fastq.gz}
IFS='' read -r -a file1_array <<< "$FILE1"
file1_len=(${#file1_array[@]}-1)
array_check=`in_array ${file1_array[$file1_len]}2.fastq.gz ${FASTP_IN_FILE_LIST[@]}`
if [ "${array_check}" == "true" ]; then
IFS='-' read -r -a SEQ_ID <<< "$FILE1"
FASTP='/BiO/program/fastp/current/fastp'
time $FASTP -i ${INPUT_DIR}/${file1_array[$file1_len]}1.fastq.gz -I ${INPUT_DIR}/${file1_array[$file1_len]}2.fastq.gz \
-o ${OUTPUT_DIR}/${file1_array[$file1_len]}1.fastq.gz -O ${OUTPUT_DIR}/${file1_array[$file1_len]}2.fastq.gz \
--length_required 20 --average_qual 20 --detect_adapter_for_pe --correction \
-h ${OUTPUT_DIR}/${SEQ_ID[0]}.html -j ${OUTPUT_DIR}/${SEQ_ID[0]}.json
fi
done
The created user analysis code can be opened for writing and editing by double-clicking it in the [Script] View, as shown in [Figure 4.2.1-5]. The editor's code auto-completion function makes code writing more convenient.
Register the user program to run the created user analysis code.
In the [Program] View, click the [New Category] button in the Context menu to create the root category and subcategory to register a user-defined program.
Create a sub-category under the root category in the Context menu.
Click the [New Program] button in the Context menu of the created sub-category to register a user program.
Register the program by setting the basic information, additional information, and parameter values of the user program to be registered.
There are three types of parameter values (Input, Output, and Option), and the values set here are defaults that the user can change at the program run stage.
Click the ‘Add parameter value’ button to create the input parameters of the fastp program. Created parameters can be modified and deleted.
After specifying the type of the parameter to be created, enter the default value. If the parameter type is folder or file, you can load a file or folder from GBox by pressing the Browser button.
As shown in [Figure 4.2.2-8], you can check the generated parameter values.
Output and Option parameter values can also be created in the same way.
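In the fastp example script above, the analysis code reads its input and output locations as positional arguments ($1 and $2), so the registered parameters would correspond to them roughly as sketched below; the parameter names are illustrative assumptions, not fixed names.
# Input parameter  (type: folder) -> $1 (INPUT_DIR,  the folder containing the paired *.fastq.gz files)
# Output parameter (type: folder) -> $2 (OUTPUT_DIR, the folder that receives the trimmed reads and HTML/JSON reports)
# Option parameters can expose values that are currently hard-coded in the script,
# for example fastp's --length_required 20 and --average_qual 20.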
Analysis pipeline development
As shown in [Figure 4.3.1-1], click the [New Workspace] button in the Context menu of the [Workspace] view to create a workspace.
Write the name and description of the workspace to be created, register the related keywords, and click the [OK] button.
Click the [New Pipeline] button in the context menu of the created workspace to create an analysis pipeline.
Depending on the pipeline type, a user-defined pipeline can be created or an instance pipeline can be created by selecting an existing registered pipeline.
- [Develop a new pipeline]: Create a custom pipeline.
- [Develop an instance pipeline]: Choose a public pipeline.
As shown in [Figure 4.3.1-5], you can check the created user-defined pipeline in the [Workspace] View.
Find the analysis program required for analysis pipeline design in 'Palette' of the editor.
After selecting the program, drag and drop it onto the canvas using your mouse.
Save the pipeline designed with the required analysis program, and double-click each analysis program to set the options required for running.
After setting the options for running the analysis pipeline, one pipeline is created by connecting the analysis program located on the canvas.
Functionally, it does not matter whether you connect the analysis programs first or set their options first; however, when analysis programs are connected, the output of the previous program is automatically set as the input of the next, so it is recommended to set each program's options before connecting them.
Design a user-defined pipeline by dragging and dropping the circle-shaped connection area located at the right end of the output parameter of the analysis program to the left connection area of the input parameter of the next stage analysis program.
As shown in [Figure 4.3.2-5], the shape of the connecting line can be freely modified by the user by dragging the mouse.
As shown in [Figure 4.3.2-6], you can select an existing registered pipeline by selecting the instance pipeline type in the pipeline creation stage.
Like user-defined pipelines, the settings required for running each program can also be changed for instance pipelines.
The designed analysis pipeline can be shared among members who have joined KOBIC SSO.
Click Shared Pipeline in the Context menu of the pipeline you want to share.
You can check the information of the pipeline to be shared and the table where you can select the members to share it with.
Go to the Share Member tab and click Add to see the list of members you can share with.
Search for a member by ID in the search bar to find them quickly.
The shared pipeline can be checked via the Bio-Express web.
Running Analysis Pipelines
You can run a designed analysis pipeline by clicking the [Run] button at the top of the editor.
When the analysis is run, the progress of the analysis pipeline can be checked in the [Pipeline] View, [Log] View and the editor. You can check the real-time progress of the analysis pipeline by pressing the [Refresh] button in each view and the editor. When the analysis pipeline is run, the analysis program proceeds sequentially.
Running programs are in the run state and displayed in green in the editor, and programs in the complete state after running are displayed in blue.
It is also possible to run each analysis program individually. Click the [Start Node] button in the shortcut menu that appears when you hover the mouse over the analysis program on the canvas.
Check the log information of each analysis program in the [Log] View.
If you double-click the log, you can check the log of the analysis process and errors in the [Console] View.
In the [GBox Browser] View, you can check the results by moving to the output path specified for each analysis program.
[Figure 4.4.3-4] shows the analysis results of fastp, an open program in the Bioinformatics category.
Register for public services
To share the analysis pipeline with unspecified users, it is possible to register the analysis pipeline publicly.
As shown in [Figure 4.5.1-1], select the desired analysis pipeline in the [Workspace] view and click the [Register Pipeline] button in the Context menu.
The analysis pipeline for which public registration is requested can be used as an open analysis pipeline through the administrator approval process.
The analysis program requested for public registration can be registered as an open analysis program through the administrator approval process.
As shown in [Figure 4.5.2-1], you can select a desired analysis program in the [Program] View and publicly register it by pressing the [Register Program] button in the Context menu.
Tutorial for Using Public Analysis Pipelines
Introduction to the Genomic Analysis Pipeline
Whole-Genome Sequencing
This pipeline receives DNA sequences from normal and cancerous tissues, creates BAM files through the Read Map to Reference, Read Merge, and CleanUp processes, and uses GATK4 to rapidly call somatic SNP & InDel variants and perform variant annotation.
It consists of 10 steps: Fastp, Cutadapt, BWA (mem), SortSam, MarkDuplicates, CountBases, BaseRecalibrator, ApplyBQSR, HaplotypeCaller, and somalier. The analysis performed at each stage is as follows.

Set the folder path containing the paired-end fastq.gz files as the input data of the first stage, fastp. The fastp stage preprocesses the fastq reads, and the results of quality control and filtering are visualized in JSON and HTML reports. Cutadapt then receives the fastq.gz files preprocessed by fastp and finds and removes adapter sequences, primers, poly-A tails, and other unwanted sequences from the high-throughput sequencing reads. The fastq files resulting from this adapter trimming are transferred to the next analysis step, BWA (mem).

BWA maps low-divergence sequences to large genome references such as the human genome. BWA (mem) generates a SAM file by seeding alignments of 70 bp–10 Mbp query sequences with maximal exact matches and extending them against the genome reference. The SAM file is then sorted by coordinate (or by queryname or other SAM record properties) with SortSam and returned as a BAM file. MarkDuplicates next tags the duplicate reads in the BAM file and passes the result on to CountBases, BaseRecalibrator, and ApplyBQSR. CountBases writes an output file containing the total number of bases.

BaseRecalibrator performs the first of the two BQSR (Base Quality Score Recalibration) steps and creates a recalibration table based on covariates such as read group, reported quality score, machine cycle, and nucleotide context. ApplyBQSR performs the second BQSR step, recalibrating the base qualities of the input reads based on the BaseRecalibrator table and producing a recalibrated BAM file.

HaplotypeCaller then calls SNPs and InDels from haplotypes in the active regions, creates a VCF file as the result of local read reassembly, compresses it in gvcf.gz format, and delivers it to the next step. In the final analysis stage, somalier extracts informative sites and stores the relatedness evaluation results for the sequenced samples in a .somalier file.
NGS data analysis pipeline design process
Phred quality score Q for a base-calling error probability P:
Q = -10 × log10(P)
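For example, Q = 20 corresponds to P = 10^-2, i.e., a 1% chance that the base call is incorrect, and Q = 30 corresponds to P = 10^-3 (0.1%).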
#!/bin/bash
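# Step 1 (fastp): preprocess the paired *.fastq.gz reads in $1 (input directory) and write
# the trimmed reads plus HTML/JSON QC reports to $2 (output directory).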
INPUT_DIR=$1
OUTPUT_DIR=$2
mkdir -p $OUTPUT_DIR
in_array() {
local needle array value
needle="${1}"; shift; array=("${@}")
for value in ${array[@]};
do
[ "${value}" == "${needle}" ] && echo "true" && return;
done
echo "false"
}
array=(`find $INPUT_DIR/ -name "*.fastq.gz"`)
for i in "${array[@]}"
do
FILE=${i##*/}
IFS='' read -r -a array <<< "$FILE"
len=(${#array[@]}-1)
FASTP_IN_FILE_LIST+=("${array[$len]}")
done
for i in "${FASTP_IN_FILE_LIST[@]}"
do
basename=$i
FILE1=${basename%1.fastq.gz}
IFS='' read -r -a file1_array <<< "$FILE1"
file1_len=(${#file1_array[@]}-1)
array_check=`in_array ${file1_array[$file1_len]}2.fastq.gz ${FASTP_IN_FILE_LIST[@]}`
if [ "${array_check}" == "true" ]; then
IFS='-' read -r -a SEQ_ID <<< "$FILE1"
FASTP='/BiO/program/fastp/current/fastp'
time $FASTP -i ${INPUT_DIR}/${file1_array[$file1_len]}1.fastq.gz -I ${INPUT_DIR}/${file1_array[$file1_len]}2.fastq.gz \
-o ${OUTPUT_DIR}/${file1_array[$file1_len]}1.fastq.gz -O ${OUTPUT_DIR}/${file1_array[$file1_len]}2.fastq.gz \
--length_required 20 --average_qual 20 --detect_adapter_for_pe --correction \
-h ${OUTPUT_DIR}/${SEQ_ID[0]}.html -j ${OUTPUT_DIR}/${SEQ_ID[0]}.json
fi
done
#!/bin/bash
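# Step 2 (Cutadapt): remove adapter and other unwanted sequences from the paired *.fastq.gz
# reads in $1 and write the trimmed .fastq files to $2.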
INPUT_DIR=$1
OUTPUT_DIR=$2
mkdir -p $OUTPUT_DIR
in_array() {
local needle array value
needle="${1}"; shift; array=("${@}")
for value in ${array[@]};
do
[ "${value}" == "${needle}" ] && echo "true" && return;
done
echo "false"
}
array=(`find $INPUT_DIR/ -name "*.fastq.gz"`)
for i in "${array[@]}"
do
FILE=${i##*/}
IFS='' read -r -a array <<< "$FILE"
len=(${#array[@]}-1)
echo "${array[$len]}"
CUTADAPT_IN_FILE_LIST+=("${array[$len]}")
done
index=1
for i in "${CUTADAPT_IN_FILE_LIST[@]}"
do
basename=$i
FILE1=${basename%1.fastq.gz}
IFS='' read -r -a file1_array <<< "$FILE1"
file1_len=(${#file1_array[@]}-1)
array_check=`in_array ${file1_array[$file1_len]}2.fastq.gz ${CUTADAPT_IN_FILE_LIST[@]}`
if [ "${array_check}" == "true" ]; then
READ_1=$INPUT_DIR/${file1_array[$file1_len]}1.fastq.gz
READ_2=$INPUT_DIR/${file1_array[$file1_len]}2.fastq.gz
READ_1_FILE_NAME=$(basename $READ_1)
READ_2_FILE_NAME=$(basename $READ_2)
IFS='.' read -r -a split_name_1 <<< "$READ_1_FILE_NAME"
IFS='.' read -r -a split_name_2 <<< "$READ_2_FILE_NAME"
READ_1_NAME=${split_name_1[0]}
READ_2_NAME=${split_name_2[0]}
CUTADAPT='/BiO/program/python/current/bin/cutadapt'
time $CUTADAPT -q 30 -j 2 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --minimum-length 70 --pair-filter=any -o $OUTPUT_DIR/${READ_1_NAME}.fastq -p $OUTPUT_DIR/${READ_2_NAME}.fastq $READ_1 $READ_2
index=$(($index+1))
fi
done
#!/bin/bash
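# Step 3 (BWA mem): map the paired .fastq reads in $1 to the reference *.fasta found in $2
# (REF_DIR) and write one SAM file per sample pair to $3.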
INPUT_DIR=$1
REF_DIR=$2
OUTPUT_DIR=$3
mkdir -p $OUTPUT_DIR
REF_FILE=(`find $REF_DIR/ -name "*.fasta"`)
in_array() {
local needle array value
needle="${1}"; shift; array=("${@}")
for value in ${array[@]};
do
[ "${value}" == "${needle}" ] && echo "true" && return;
done
echo "false"
}
array=(`find $INPUT_DIR/ -name "*.fastq"`)
for i in "${array[@]}"
do
FILE=${i##*/}
IFS='' read -r -a array <<< "$FILE"
len=(${#array[@]}-1)
echo "${array[$len]}"
BWA_IN_FILE_LIST+=("${array[$len]}")
done
for i in "${BWA_IN_FILE_LIST[@]}"
do
basename=$i
FILE1=${basename%1.fastq}
IFS='' read -r -a file1_array <<< "$FILE1"
file1_len=(${#file1_array[@]}-1)
array_check=`in_array ${file1_array[$file1_len]}2.fastq ${BWA_IN_FILE_LIST[@]}`
if [ "${array_check}" == "true" ]; then
READ_1=$INPUT_DIR/${file1_array[$file1_len]}1.fastq
READ_2=$INPUT_DIR/${file1_array[$file1_len]}2.fastq
INPUT_FILE_NAME=$(basename $READ_1 .fastq)
IFS='_' read -r -a NAME <<< "$INPUT_FILE_NAME"
IFS='-' read -r -a FASTQ_2D_ID <<< "${NAME[0]}"
OUTPUT_SAM=$OUTPUT_DIR/${NAME}.sam
BWA='/BiO/program/bwa/current/bwa'
$BWA mem \
-M -R "@RG\tID:${INPUT_DIR}\tPL:Nova6000S4\tLB:Truseq_PCRFREE\tSM:${FASTQ_2D_ID}" -v 1 \
-t 1 $REF_FILE $READ_1 $READ_2 > $OUTPUT_SAM
fi
done
#!/bin/bash
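# Step 4 (SortSam): coordinate-sort each *.sam file in $1 with GATK SortSam and write the
# sorted output as a BAM file to $2.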
INPUT_DIR=$1
OUTPUT_DIR=$2
mkdir -p $OUTPUT_DIR
array=(`find $INPUT_DIR/ -name "*.sam"`)
for i in "${array[@]}"
do
FILE=${i##*/}
IFS='' read -r -a array <<< "$FILE"
len=(${#array[@]}-1)
echo "${array[$len]}"
SORT_IN_FILE_LIST+=("${array[$len]}")
done
for i in "${SORT_IN_FILE_LIST[@]}"
do
basename=$(basename $i)
INPUT_FILE_NAME=$(basename $i .sam)
EXTENSION=${basename#*.}
if [ $EXTENSION == 'sam' ]; then
IFS='-' read -r -a FNAME <<< "$INPUT_FILE_NAME"
IFS='_' read -r -a SEQ_ID <<< "$FNAME"
INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.sam
SORTSAM='/BiO/program/gatk/current/gatk SortSam'
time $SORTSAM -I $INPUT_FILE -O $OUTPUT_DIR/${INPUT_FILE_NAME}.bam -SO coordinate
fi
done
#!/bin/bash
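# Step 5 (MarkDuplicates): tag duplicate reads in each *.bam file in $1 and write the marked
# BAM and the duplication metrics table to $2.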
INPUT_DIR=$1
OUTPUT_DIR=$2
mkdir -p $OUTPUT_DIR
array=(`find $INPUT_DIR/ -name "*.bam"`)
for i in "${array[@]}"
do
FILE=${i##*/}
IFS='' read -r -a array <<< "$FILE"
len=(${#array[@]}-1)
echo "${array[$len]}"
MARK_IN_FILE_LIST+=("${array[$len]}")
done
for i in "${MARK_IN_FILE_LIST[@]}"
do
basename=$(basename $i)
INPUT_FILE_NAME=$(basename $i .bam)
EXTENSION=${basename#*.}
if [ $EXTENSION == 'bam' ]; then
INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
MARKDUPLICATES='/BiO/program/gatk/current/gatk MarkDuplicates'
time $MARKDUPLICATES \
-I $INPUT_FILE \
-O $OUTPUT_DIR/${INPUT_FILE_NAME}.bam \
-M $OUTPUT_DIR/${INPUT_FILE_NAME}.table \
--REMOVE_SEQUENCING_DUPLICATES false
echo "$MARKDUPLICATES -I $INPUT_FILE -O $OUTPUT_DIR/${INPUT_FILE_NAME}.bam -M $OUTPUT_DIR/${INPUT_FILE_NAME}.table --REMOVE_SEQUENCING_DUPLICATES true"
fi
done
#!/bin/bash
# Step 6 (CountBases): index each *.bam file in $1 with samtools, then count the total number
# of bases over the intervals in $2 (excluding secondary alignments) and write the .count file to $3.
INPUT_DIR=$1
REF_FILE=$2
OUTPUT_DIR=$3
mkdir -p $OUTPUT_DIR
array=(`find $INPUT_DIR/ -name "*.bam"`)
for i in "${array[@]}"
do
FILE=${i##*/}
IFS='' read -r -a array <<< "$FILE"
len=(${#array[@]}-1)
echo "${array[$len]}"
COUNT_IN_FILE_LIST+=("${array[$len]}")
done
for i in "${COUNT_IN_FILE_LIST[@]}"
do
basename=$(basename $i)
INPUT_FILE_NAME=$(basename $i .bam)
EXTENSION=${basename#*.}
if [ $EXTENSION == 'bam' ]; then
INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
/samtools/1.6/samtools index $INPUT_FILE
COUNTBASES='/BiO/program/gatk/current/gatk CountBases'
time $COUNTBASES \
-I $INPUT_FILE \
-O $OUTPUT_DIR/${INPUT_FILE_NAME}.count \
-RF NotSecondaryAlignmentReadFilter \
-L $REF_FILE
fi
done
#!/bin/bash
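# Step 7 (BaseRecalibrator): build a BQSR recalibration table for each *.bam in $1 using the
# reference ($2) and known-sites files ($3-$5), writing the .table files to $6.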
INPUT_DIR=$1
REF_FILE1=$2
REF_FILE2=$3
REF_FILE3=$4
REF_FILE4=$5
OUTPUT_DIR=$6
mkdir -p $OUTPUT_DIR
array=(`find $INPUT_DIR/ -name "*.bam"`)
for i in "${array[@]}"
do
FILE=${i##*/}
IFS='' read -r -a array <<< "$FILE"
len=(${#array[@]}-1)
echo "${array[$len]}"
BASE_IN_FILE_LIST+=("${array[$len]}")
done
for i in "${BASE_IN_FILE_LIST[@]}"
do
basename=$(basename $i)
INPUT_FILE_NAME=$(basename $i .bam)
EXTENSION=${basename#*.}
if [ $EXTENSION == 'bam' ]; then
INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
BASERECALIBRATOR='/BiO/program/gatk/current/gatk BaseRecalibrator'
time $BASERECALIBRATOR \
-I $INPUT_DIR/${INPUT_FILE_NAME}.bam \
-R ${REF_FILE1} \
--known-sites ${REF_FILE2} \
--known-sites ${REF_FILE3} \
--known-sites ${REF_FILE4} \
-O $OUTPUT_DIR/${INPUT_FILE_NAME}.table
fi
done
#!/bin/bash
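# Step 8 (ApplyBQSR): recalibrate base qualities of each *.bam in $1 against the reference ($2)
# using the matching .table file in $3, writing the recalibrated BAM to $4.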
INPUT_DIR=$1
REF_FILE=$2
REF_DIR=$3
OUTPUT_DIR=$4
mkdir -p $OUTPUT_DIR
array=(`find $INPUT_DIR/ -name "*.bam"`)
for i in "${array[@]}"
do
FILE=${i##*/}
IFS='' read -r -a array <<< "$FILE"
len=(${#array[@]}-1)
echo "${array[$len]}"
BQSR_IN_FILE_LIST+=("${array[$len]}")
done
for i in "${BQSR_IN_FILE_LIST[@]}"
do
basename=$(basename $i)
INPUT_FILE_NAME=$(basename $i .bam)
EXTENSION=${basename#*.}
if [ $EXTENSION == 'bam' ]; then
IFS='-' read -r -a FNAME <<< "$INPUT_FILE_NAME"
IFS='_' read -r -a SEQ_ID <<< "$FNAME"
echo $INPUT_FILE_NAME
echo $FNAME
INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
APPLYBQSR='/BiO/program/gatk/current/gatk ApplyBQSR'
time $APPLYBQSR \
-R $REF_FILE \
-I $INPUT_FILE \
-bqsr $REF_DIR/${INPUT_FILE_NAME}.table \
-O $OUTPUT_DIR/${SEQ_ID[0]}_v1.0.0.bam \
--static-quantized-quals 10 \
--static-quantized-quals 20 \
--static-quantized-quals 30
fi
done
#!/bin/bash
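# Step 9 (HaplotypeCaller): call SNPs and InDels for each *.bam in $1 against the reference ($2)
# over the intervals in $3, then rename, bgzip, and tabix-index the resulting GVCF in $4.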
echo "Total Param= $#, PROG: $0, param1 = $1, param2 = $2, param3 = $3, param4 = $4";
if [ "$#" -lt 4 ]; then
echo "$# is Illegal number of parameters."
echo "Usage: $0 input dir, reference file, output dir"
exit 1
fi
args=("$@")
for (( i=0; i<$#; i++ ))
do
echo "$i th parameter = ${args[$i]}";
done
INPUT_DIR=$1 # BAM files (input directory)
REF_FILE1=$2
REF_FILE2=$3
OUTPUT_DIR=$4 # VCF files (output directory)
mkdir -p $OUTPUT_DIR
array=(`find $INPUT_DIR/ -name "*.bam"`)
for i in "${array[@]}"
do
FILE=${i##*/}
IFS='' read -r -a array <<< "$FILE"
len=(${#array[@]}-1)
echo "${array[$len]}"
HAPLO_IN_FILE_LIST+=("${array[$len]}")
done
for i in "${HAPLO_IN_FILE_LIST[@]}"
do
basename=$(basename $i)
INPUT_FILE_NAME=${basename%.*}
EXTENSION=${basename##*.}
if [ $EXTENSION == 'bam' ]; then
IFS='-' read -r -a FNAME <<< "$INPUT_FILE_NAME"
IFS='_' read -r -a SEQ_ID <<< "$FNAME"
INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
HAPLOTYPECALLER='/BiO/program/gatk/current/gatk HaplotypeCaller'
time $HAPLOTYPECALLER -I $INPUT_FILE -R $REF_FILE1 -O $OUTPUT_DIR/${SEQ_ID}_v1.0.0.vcf -OVI true --emit-ref-confidence GVCF -L $REF_FILE2
echo "$HAPLOTYPECALLER -I $INPUT_FILE -R $REF_FILE1 -O $OUTPUT_DIR/${SEQ_ID}_v1.0.0.vcf -OVI true --emit-ref-confidence GVCF -L $REF_FILE2"
mv $OUTPUT_DIR/${SEQ_ID}_v1.0.0.vcf $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf
echo "mv $OUTPUT_DIR/${SEQ_ID}_v1.0.0.vcf $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf"
BGZIP='/BiO/program/htslib/current/bgzip'
$BGZIP -c $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf > $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf.gz
echo "$BGZIP -c $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf > $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf.gz"
TABIX='/BiO/program/htslib/current/tabix'
$TABIX -p vcf $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf.gz
echo "$TABIX -p vcf $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf.gz"
fi
done
#!/bin/bash
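# Step 10 (somalier): extract informative sites from each *.gvcf.gz in $1 using the reference
# fasta ($2) and sites file ($3), writing the .somalier output to $4.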
INPUT_DIR=$1
REF_FILE1=$2
REF_FILE2=$3
OUTPUT_DIR=$4
mkdir -p $OUTPUT_DIR
array=(`find $INPUT_DIR/ -name "*.gvcf.gz"`)
for i in "${array[@]}"
do
FILE=${i##*/}
IFS='' read -r -a array <<< "$FILE"
len=(${#array[@]}-1)
echo "${array[$len]}"
SOMAL_IN_FILE_LIST+=("${array[$len]}")
done
for i in "${SOMAL_IN_FILE_LIST[@]}"
do
INPUT_FILE_NAME=`basename "$i" | cut -d'.' -f-3`
EXTENSION=`echo "$i" | cut -d'.' -f4-`
if [ $EXTENSION == 'gvcf.gz' ]; then
INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.gvcf.gz
IFS='-' read -r -a FNAME <<< "$INPUT_FILE_NAME"
IFS='_' read -r -a SEQ_ID <<< "$FNAME"
SOMALIER='/somalier/0.2.12/somalier'
time $SOMALIER extract -d $OUTPUT_DIR -s ${REF_FILE2} -f ${REF_FILE1} $INPUT_FILE
fi
done
@ERR792543.3 HWI-ST333_0255_FC:4:1101:1235:2392/1
CTGCCAATTCCCAATTATATTCTTGGAGAAGCTCATTAGGATAATGGAAATCAATCACTTTGGTTGATCTATCGAAACTT
+
CBCFFFFFHGHHGIJJIJIIJJJJJJGIIIJJGIGIJIJJJJGIIEIFEIHGIJJJJJJJFGGHIJIEHJGIJGIGHBEC
@ERR792543.7 HWI-ST333_0255_FC:4:1101:1228:2438/1
CTCCGGAGTGTAGGCTCCCCATATCTCAGCAAGTGGCCAAAGGTTCGTAATGCCATCTCTGCACNAATCTCCTCCCCCAT
+
@@@DFFFFCDHF8EHIIIIIIIIIIIIIIIIIGHGHIIIIIII?DDF;BFG<FHHCHEIFGCHH#-5;?DEEEEEBBDDC
@ERR792543.8 HWI-ST333_0255_FC:4:1101:1179:2464/1
GCCAGTCATGCCTGCCTGAGACGCCCCGCGGTTGGTGCCCATCTGTAACCCGATCACGTTCTTGNCCTCTTNCAGCTGGT
+
@CCFFFFFHHHHHJJJJJIJJJIIJJJIIJIDHIJFHIIJEHFHEHHHFFFFDCBDDBBDDDDC#,58??C#+8<<@CCC
@ERR792543.10 HWI-ST333_0255_FC:4:1101:1400:2280/1
CACAAACCTTGTTCTAATTTATGTGCACTTCTGAATAAAAATAACACCCTTTTGCTTTAATGCAATATTGATTTTATCAC
+
@CCDFFDFFHHHHIIGCHIIIIFIGGGGIIHIGIJIIJJIJIIJIJIGGIJIJJJJIJGGDEHGGIJIIHGIJJIHHIEH
@ERR792543.12 HWI-ST333_0255_FC:4:1101:1389:2303/1
ATGCGCTCCGCCGCTGTCCTGGCTCTTCTGCTCTGCGCCGGGCAAGTCACTGCGCTCCCTGTGAACAGCCCTATGAATAA
+
@@CFDFFFFHFD:FGGFHIIIIIIIIIIIGIIGGIIIIIIIFHFFFEACCEECCCC:@B?:@CEDCCCB?@CCCACDEDC
@ERR792543.14 HWI-ST333_0255_FC:4:1101:1495:2309/1
TCTCTCTCCTAATGAAAAAGTTTGTTTCTCTTTTACATTAGTAACGTATGTAACGGGTGGTATGTCTTTCTTTGCTTTGG
+
CCCFFFFFHHHDFIJJJJJJJJIJJIJJJJJJIJIGHIJGGIJJJJJJJIFGGIJFG<AH4=EEGGIJHGHHHGFCDFDD
@ERR792543.15 HWI-ST333_0255_FC:4:1101:1274:2316/1
GCCCAACTGTCCTGCTGATCTGAAGTAAGACAACTAGCCTTCTACATCAAATAATTCTGTGAGGCAATTTGAGTCAG
+
@@@DD?DDAC:3,<CGBFHHHIBCGHFEDEHF;EFEIFII@BFBGGI<DFIIGF@4BF>84<BDEFEEDGIIFFDF@
@ERR792543.16 HWI-ST333_0255_FC:4:1101:1467:2316/1
TGCAGTATAGATGGGATATGATGCATAGGCTTGGAGAACCACAGGCAAGGATGAGAGAAGAGAATATGGAAAGGATTG
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259
@SQ SN:chr6 LN:170805979
@SQ SN:chr7 LN:159345973
@SQ SN:chr8 LN:145138636
@SQ SN:chr9 LN:138394717
@SQ SN:chr10 LN:133797422
@SQ SN:chr11 LN:135086622
@SQ SN:chr12 LN:133275309
@SQ SN:chr13 LN:114364328
@SQ SN:chr14 LN:107043718
@SQ SN:chr15 LN:101991189
@SQ SN:chr16 LN:90338345
@SQ SN:chr17 LN:83257441
@SQ SN:chr18 LN:80373285
@SQ SN:chr19 LN:58617616
@SQ SN:chr20 LN:64444167
@SQ SN:chr21 LN:46709983
@SQ SN:chr22 LN:50818468
@SQ SN:chrX LN:156040895
@SQ SN:chrY LN:57227415
@SQ SN:chrM LN:16569
@SQ SN:chr1_KI270706v1_random LN:175055
@SQ SN:chr1_KI270707v1_random LN:32032
@SQ SN:chr1_KI270708v1_random LN:127682
@SQ SN:chr1_KI270709v1_random LN:66860
@SQ SN:chr1_KI270710v1_random LN:40176
@SQ SN:chr1_KI270711v1_random LN:42210
@SQ SN:chr1_KI270712v1_random LN:1760438
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument Value
binary_tag_name null
covariate ReadGroupCovariate,QualityScoreCovariate,ContextCovariate,CycleCovariate
default_platform null
deletions_default_quality 45
force_platform null
indels_context_size 3
insertions_default_quality 45
low_quality_tail 2
maximum_cycle_value 500
mismatches_context_size 2
mismatches_default_quality -1
no_standard_covs false
quantizing_levels 16
recalibration_report null
run_without_dbsnp false
solid_nocall_strategy THROW_EXCEPTION
solid_recal_mode SET_Q_ZERO
#:GATKTable:3:94:%d:%d:%d:;
#:GATKTable:Quantized:Quality quantization map
QualityScore Count QuantizedScore
0 0 6
1 0 6
2 0 6
3 0 6
4 0 6
5 0 6
6 3784 6
##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele not already represented at this location by REF and ALT">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --emit-ref-confidence GVCF --output /bioex/output/haplotype_caller/ERR792543_v1.0.0.vcf --intervals /bioex/reference/references_hg38_v0_wgs_metrics_intervals.interval_list --input /bioex/output/apply_bqsr/ERR792543_v1.0.0.bam --reference /bioex/reference/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta --create-output-variant-index true --use-posteriors-to-calculate-qual false --dont-use-dragstr-priors false --use-new-qual-calculator true --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --
As shown in [Figure 5.2.2-17], the somalier results are saved as [file name].somalier.
INPUT
Name | Required | Value | Type
---|---|---|---
data area 01 | data area 01 | data area 01 | data area 01

OUTPUT
Name | Required | Value | Type
---|---|---|---
content area 01 | content area 01 | content area 01 | content area 01
content area 02 | content area 02 | content area 02 | content area 02

OPTION
Name | Required | Value | Type
---|---|---|---
content area 01 | content area 01 | content area 01 | content area 01
content area 03 | content area 03 | content area 03 | content area 03
content area 04 | content area 04 | content area 04 | content area 04