  • About Bio-Express
  • Install Bio-Express Workbench
  • Introduction to Big Data High-speed Transfer System Functions
  • Introduction to the Detailed Features of Cloud-based Open Integrated Analytics Workbench
  • Tutorial for Using Public Analysis Pipelines

About Bio-Express

  • 1.1. Introduction to Bio-Express Web Service

Introduction to Bio-Express Web Service

  • Web-based Large Data Analysis Service
    • It is a web-based service that can be used anywhere a network is available.
    • Improved accessibility through open pipelines that broaden the analysis experience of novice researchers
    • Real-time confirmation of the user's system usage status
    • Expansion of information sharing and exchange opportunities through the provision of community services among users
    • Checking of basic analysis information through open programs and pipelines
  • Figure 1.1-1
    [Figure 1.1-1]
  • Figure 1.1-2
    [Figure 1.1-2]

Install Bio-Express Workbench

  • 2.1. Download Bio-Express Workbench
    • 2.1.1. Windows
    • 2.1.2. Linux
    • 2.1.3. MacOS
  • 2.2. Run Bio-Express Workbench
    • 2.2.1. Login to Bio-Express Workbench
    • 2.2.2. Bio-Express Workbench Start Screen

Download Bio-Express Workbench

  • 2.1.1 Windows
  • Go to “ https://www.kobic.re.kr/bioexpress/”.

    Figure 2.1.1-1
    [Figure 2.1.1-1]

    Select Workbench for Windows via the quick menu on the right side of the main screen and download.

  • Figure 2.1.1-2
    [Figure 2.1.1-2]

    Run Bioexpress.exe after decompressing the downloaded .zip file to the desired directory.

  • 2.1.2 Linux
  • Figure 2.1.2-1
    [Figure 2.1.2-1]

    Download to the desired directory with the wget command.

$ wget -O Bio-Express.workbench-linux.gtk-x86_64.tar.gz \
"https://www.kobic.re.kr/bioexpress/common/related_file_down?file_name=Bio-Express.workbench-linux.gtk-x86_64&file_ext=tar.gz"
  • Figure 2.1.2-2
    [Figure 2.1.2-2]

    Run Bioexpress after decompressing the downloaded tar.gz file with the tar -zxvf command.

$ tar -zxvf Bio-Express.workbench-linux.gtk-x86_64.tar.gz
$ cd Bio-Express.workbench-linux.gtk-x86_64
$ ./Bioexpress
  • 2.1.3 MacOS
  • Go to “ https://www.kobic.re.kr/bioexpress/”.

    Figure 2.1.3-1
    [Figure 2.1.3-1]

    Select Workbench for MacOS via the quick menu on the right side of the main screen and download.

  • Figure 2.1.3-2
    [Figure 2.1.3-2]

    Figure 2.1.3-3
    [Figure 2.1.3-3]

    After decompressing the downloaded .zip file, move it to the Applications folder and run Bioexpress.

Run Bio-Express Workbench

  • 2.2.1. Login to Bio-Express Workbench
  • Figure 2.2.1-1
    [Figure 2.2.1-1]

    Log in with the ID registered in KOBIC integrated login (SSO).

    If you do not have an ID, log in after signing up as a member using the ‘Sign up’ button at the bottom left.

  • 2.2.2. Bio-Express Workbench start screen
  • Figure 2.2.2-1
    [Figure 2.2.2-1]

    When the CLOSHA intro page is displayed, clicking the card area for each service opens the Bio-Express web page, where you can view a detailed introduction and information on how to use each service.

  • Figure 2.2.2-2
    [Figure 2.2.2-2]

    You can access user-friendly functions and update the Workbench via the [File], [Window], and [Help] menus.

      • [File] > [New]: Create a new workspace, category, pipeline, or program
      • [File] > [Register]: Pipeline registration, program registration
      • [Window] > [Open Perspective]: Switch between CLOSHA / G-Box perspective
      • [Window] > [Reset Perspective]: Revert the perspective to its default layout
      • [Window] > [Preference]: Change Bio-Express Workbench settings
      • [Help] > [Help Contents]: Bio-Express User Manual
      • [Help] > [Community]: Go to the Bio-Express web community page
      • [Help] > [Feedback]: Go to the Bio-Express web feedback page
      • [Help] > [About Bio-Express]: Go to the Bio-Express web introduction page
      • [Help] > [Update CLOSHA Workbench]: Update Bio-Express Workbench
  • Figure 2.2.2-3
    [Figure 2.2.2-3]

    Using the menu on the rightmost side of the toolbar, you can switch between the big data high-speed transfer system (G-Box) perspective and the cloud-based integrated analysis workbench (CLOSHA) perspective.

Introduction to Big Data High-speed Transfer System Functions

  • 3.1. Introduction to Screens and Functions
  • 3.2. High-speed transfer of Big Data
  • 3.3. High-speed synchronization of Big Data file system
  • 3.4. Data file management function

Introduction to Screens and Functions

The GBox high-speed transfer system transmits the user's large-capacity data at high speed using software technology alone, without additional hardware such as accelerators or separate plug-ins such as ActiveX. Data uploaded through GBox is available in CLOSHA.

  • Figure 3.1 - 1
    [Figure 3.1-1]

    GBox is designed with an easy and intuitive user interface, so it can be used without any training.

High-speed transfer of Big Data

  • Figure 3.2 - 1
    [Figure 3.2-1]

    Upload method 1: Select the folder or file to be transferred from the [File Explorer] view and drag & drop it to the [GBox Explorer] view.

  • Figure 3.2 - 2
    [Figure 3.2-2]

    Upload method 2: In the [File Explorer] view, select the folder or file you want to transfer and transfer it using the [Upload] button in the Context menu.

  • Figure 3.2 - 3
    [Figure 3.2-3]

    Download method 1: Select the folder or file to be transferred from [GBox Explorer] and drag & drop it to the [File Explorer] view.

  • Figure 3.2 - 4
    [Figure 3.2-4]

    Download method 2: Select the folder or file you want to transfer in [GBox Explorer] and transfer it with the Context menu [Download] button.

  • Figure 3.2 - 5
    [Figure 3.2-5]

    When the file transfer starts, real-time transfer status monitoring is possible with the [Progress] view at the bottom of the application.

High-speed synchronization of Big Data file system

  • Figure 3.3 - 1
    [Figure 3.3-1]

    GBox -> HDFS: Select the folder or file to synchronize in the [GBox Explorer] view and synchronize it using the Context menu [Sync] button.

  • Figure 3.3 - 2
    [Figure 3.3-2]

    HDFS -> GBox: Select the folder or file to synchronize in the [HDFS Tree Browser] view and synchronize it by pressing the [Sync] button in the Context menu.

  • Figure 3.3 - 3
    [Figure 3.3-3]

    You can check the bidirectional file synchronization history in the [Synchronized History] view.

Data file management function

  • Figure 3.4 - 1
    [Figure 3.4-1]

    You can manage data in the Context menu of the [File Explorer] view and the [GBox Explorer] view ([New], [Rename], [Copy], [Paste], [Delete], [Refresh]).

    You can use the ‘go to directory’ and ‘shortcut’ functions in the toolbar menu at the top of the [File Explorer] view and [GBox Explorer] view.

Introduction to the Detailed Features of Cloud-based Open Integrated Analytics Workbench

  • 4.1. Introduction to Screens and Functions
  • 4.2. Registration for a User-Defined program
    • 4.2.1. Writing User Analysis Code
    • 4.2.2. User Program Registration
  • 4.3. Analysis Pipeline Development
    • 4.3.1. Workspace Creation
    • 4.3.2. Analysis Pipeline Design
    • 4.3.3. Share Analytics Pipeline
  • 4.4. Running Analysis Pipelines
    • 4.4.1. Running Analysis Pipelines
    • 4.4.2. Run a Single Analysis Program
    • 4.4.3. Check Data Analysis Results
  • 4.5. Register for Public Services
    • 4.5.1. Pipeline Public Registration
    • 4.5.2. Program public registration

Introduction to Screens and Functions

Bio-Express Workbench is a cloud-based program that provides large-scale data analysis services. It has a user-friendly interface that makes it easy to design a complex bioinformatics analysis process as a pipeline.

  • Figure 4.1-1
    [Figure 4.1-1]

Registration for a user-defined program

  • 4.2.1. Writing user analysis code

To register a user-defined program, you must first register a user analysis code.

  • Figure 4.2.1-1
    [Figure 4.2.1-1]

    Click the [New Folder] button in the [Script] View context menu to create a script folder in which to register the user analysis code.

  • Figure 4.2.1-2
    [Figure 4.2.1-2]

    After entering the folder name, click the [OK] button to create the folder.

  • Figure 4.2.1-3
    [Figure 4.2.1-3]

    As shown in [Figure 4.2.1-3], click the [New Script] button in the Context menu of the created script folder to create a user analysis code. Python, Bash, and R are the supported languages for user analysis codes.

  • Figure 4.2.1 - 4
    [Figure 4.2.1-4]

    After entering the name of the script to be created, click the [OK] button to create an empty script.

  • Figure 4.2.1 - 5
    [Figure 4.2.1-5]
    #!/bin/bash
    
    INPUT_DIR=$1
    OUTPUT_DIR=$2
    
    mkdir -p $OUTPUT_DIR
    
    in_array() {
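        # echo "true" if the first argument occurs among the remaining arguments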
        local needle array value
        needle="${1}"; shift; array=("${@}")
        for value in ${array[@]};
            do
                [ "${value}" == "${needle}" ] && echo "true" && return;
            done
        echo "false"
    }
    
    array=(`find $INPUT_DIR/ -name "*.fastq.gz"`)
    for i in "${array[@]}"
    do
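        # ${i##*/} strips the directory path, leaving the bare file name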
        FILE=${i##*/}
        IFS='' read -r -a array <<< "$FILE"
        len=(${#array[@]}-1)
        FASTP_IN_FILE_LIST+=("${array[$len]}")
    done
    
    
    for i in "${FASTP_IN_FILE_LIST[@]}"
    do
        basename=$i
        FILE1=${basename%1.fastq.gz}
    
        IFS='' read -r -a file1_array <<< "$FILE1"
        file1_len=(${#file1_array[@]}-1)
    
        array_check=`in_array ${file1_array[$file1_len]}2.fastq.gz ${FASTP_IN_FILE_LIST[@]}`
    
        if [ "${array_check}" == "true" ]; then
    
            IFS='-' read -r -a SEQ_ID <<< "$FILE1"
             
            FASTP='/BiO/program/fastp/current/fastp'
    
            time $FASTP -i ${INPUT_DIR}/${file1_array[$file1_len]}1.fastq.gz -I ${INPUT_DIR}/${file1_array[$file1_len]}2.fastq.gz \
    		-o ${OUTPUT_DIR}/${file1_array[$file1_len]}1.fastq.gz -O ${OUTPUT_DIR}/${file1_array[$file1_len]}2.fastq.gz \
    		--length_required 20 --average_qual 20 --detect_adapter_for_pe --correction \
    		-h ${OUTPUT_DIR}/${SEQ_ID[0]}.html -j ${OUTPUT_DIR}/${SEQ_ID[0]}.json
        fi
    done

    The created user analysis code can be written and edited in the editor by double-clicking it in the [Script] View, as shown in [Figure 4.2.1-5]. The editor's code auto-completion function makes writing code more convenient.

  • 4.2.2. User program registration

Register the user program to run the created user analysis code.

  • Figure 4.2.2 - 1
    [Figure 4.2.2-1]
  • Figure 4.2.2 - 2
    [Figure 4.2.2-2]

In the [Program] View, click the [New Category] button in the Context menu to create the root category and subcategory to register a user-defined program.

Workspace, category, pipeline, and program names can contain only alphanumeric characters and the special characters hyphen (-), underscore (_), and period (.), and the first character must be a Roman letter. Other characters, including spaces, are not allowed.
A root category cannot be deleted or modified while it contains a subcategory, and a subcategory cannot be deleted or modified while it contains a program or pipeline.
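As a rough illustration, the naming rule corresponds to a shell check such as the following (the function is hypothetical and not part of the Workbench):

    #!/bin/bash
    # Hypothetical check for the naming rule above:
    # first character a Roman letter, the rest alphanumeric or - _ .
    is_valid_name() {
        [[ "$1" =~ ^[A-Za-z][A-Za-z0-9._-]*$ ]]
    }

    is_valid_name "RNAseq-v1.0" && echo "valid"        # valid name
    is_valid_name "1st_pipeline" || echo "invalid"     # first character is a digit
    is_valid_name "my pipeline" || echo "invalid"      # contains a space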
  • Figure 4.2.2 - 3
    [Figure 4.2.2-3]

Create a sub-category under the root category in the Context menu.

  • Figure 4.2.2 - 4
    [Figure 4.2.2-4]

Click the [New Program] button in the Context menu of the created sub-category to register a user program.

  • Figure 4.2.2 - 5
    [Figure 4.2.2-5]

Register the program by setting the basic information, additional information, and parameter values of the user program to be registered.

    • Basic information: Name, category, version, keyword, related URL, description
    • You can enter multiple keywords by separating them with a comma (,).
    • Additional information: Script type, single/multi-core, related programs
    • Parameter values: Input, Output, Option
      • Input: Set the file or path required for analysis.
      • Output: Set the path to which the results of analysis will be output.
      • Option: Set options supported by the program.

These are the three kinds of parameter values; the values set here are defaults that the user can change at the program run stage.
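For reference, the registered scripts shown in this manual (for example, the fastp script in 4.2.1) receive these parameter values as positional arguments when they run. A minimal sketch of that convention, assuming a hypothetical program with one value of each type:

    #!/bin/bash
    # Sketch: the Workbench passes the configured parameter values to the
    # script as positional arguments (the Option slot here is an assumption).
    INPUT_DIR=$1     # [Input]  folder containing the files to analyze
    OUTPUT_DIR=$2    # [Output] folder where the results are written
    OPTION_VALUE=$3  # [Option] e.g. a quality threshold

    mkdir -p $OUTPUT_DIR
    echo "input=$INPUT_DIR output=$OUTPUT_DIR option=$OPTION_VALUE"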

  • Figure 4.2.2 - 6
    [Figure 4.2.2-6]

Click the ‘Add parameter value’ button to create the input parameters of the fastp program. Created parameters can be modified and deleted.

  • Figure 4.2.2 - 7
    [Figure 4.2.2-7]

After specifying the type of the parameter to be created, enter the default value. If the parameter type is folder or file, you can load files and folders from GBox by pressing the Browser button.

  • Figure 4.2.2 - 8
    [Figure 4.2.2-8]

    As shown in [Figure 4.2.2-8], you can check the generated parameter values.

  • Figure 4.2.2 - 9
    [Figure 4.2.2-9]

    Output and Option parameter values can also be created in the same way.

Analysis pipeline development

  • 4.3.1. Workspace creation

As shown in [Figure 4.3.1-1], click the [New Workspace] button in the Context menu of the [Workspace] view to create a workspace.

  • Figure 4.3.1 - 1
    [Figure 4.3.1-1]

    Write the name and description of the workspace to be created, register the related keywords, and click the [OK] button.

  • Figure 4.3.1 - 2
    [Figure 4.3.1-2]

    Click the [New Pipeline] button in the context menu of the created workspace to create an analysis pipeline.

  • Figure 4.3.1 - 3
    [Figure 4.3.1-3]

    Depending on the pipeline type, a user-defined pipeline can be created or an instance pipeline can be created by selecting an existing registered pipeline.

  • Figure 4.3.1 - 4
    [Figure 4.3.1-4]

    - [Develop a new pipeline]: Create a custom pipeline.

    - [Develop an instance pipeline]: Choose a public pipeline.

  • Figure 4.3.1 - 5
    [Figure 4.3.1-5]

    As shown in [Figure 4.3.1-5], you can check the created user-defined pipeline in the [Workspace] View.

  • 4.3.2. Analysis pipeline design
  • 4.3.2-1. User-defined pipeline design
  • Figure 4.3.2 - 1
    [Figure 4.3.2-1]

    Find the analysis program required for analysis pipeline design in 'Palette' of the editor.

  • Figure 4.3.2 - 2
    [Figure 4.3.2-2]

    After selecting the program, drag and drop it onto the canvas using your mouse.

  • Figure 4.3.2 - 3
    [Figure 4.3.2-3-1]

    Save the pipeline designed with the required analysis program, and double-click each analysis program to set the options required for running.

  • Figure 4.3.2 - 3
    [Figure 4.3.2-3-2]

    After setting the options for running the analysis pipeline, one pipeline is created by connecting the analysis programs placed on the canvas.

    Changing the order in which programs are connected and options are set does not affect functionality; however, because connecting two programs automatically sets the output of the previous program as the input of the next, it is recommended to configure each program before connecting them.

  • Figure 4.3.2 - 4
    [Figure 4.3.2-4]

    Design a user-defined pipeline by dragging the circular connection point at the right end of an output parameter onto the connection point at the left of the input parameter of the next-stage analysis program.

  • Figure 4.3.2 - 5
    [Figure 4.3.2-5]

    As shown in [Figure 4.3.2-5], the shape of the connecting line can be freely modified by the user by dragging the mouse.

  • 4.3.2-2. Instance Pipeline Design

As shown in [Figure 4.3.2-6], you can select an existing registered pipeline by selecting the instance pipeline type in the pipeline creation stage.

  • Figure 4.3.2 - 6
    [Figure 4.3.2-6]

    Like user-defined pipelines, the settings required for running each program can also be changed for instance pipelines.

  • 4.3.3. Share Analytics Pipeline
  • Figure 4.3.3-1
    [Figure 4.3.3-1]

    The designed analysis pipeline can be shared among members who have joined KOBIC SSO.

    Click Shared Pipeline in the Context menu of the pipeline you want to share.

  • Figure 4.3.3-2
    [Figure 4.3.3-2]

    You can check the information of the pipeline you want to share and a table for selecting the members to share it with.

  • Figure 4.3.3-3
    [Figure 4.3.3-3]
  • Figure 4.3.3-4
    [Figure 4.3.3-4]

Go to the Share Member tab and click Add to see the list of members you can share with.

Search members by ID in the search bar to find a member quickly.

The shared pipeline can be checked via the Bio-Express web.

Running Analysis Pipelines

  • 4.4.1. Running Analysis Pipelines

You can run a designed analysis pipeline by clicking the [Run] button at the top of the editor.

  • Figure 4.4.1-1
    [Figure 4.4.1-1]

    When the analysis is run, the progress of the analysis pipeline can be checked in the [Pipeline] View, [Log] View and the editor. You can check the real-time progress of the analysis pipeline by pressing the [Refresh] button in each view and the editor. When the analysis pipeline is run, the analysis program proceeds sequentially.

  • Figure 4.4.1-2-1
  • Figure 4.4.1-2-2
    [Figure 4.4.1-2]

    Programs that are running are shown in green in the editor, and programs that have completed are shown in blue.

  • 4.4.2. Run a single analysis program
  • Figure 4.4.2-1
    [Figure 4.4.2-1]

    It is also possible to run each analysis program individually. Click the [Start Node] button in the shortcut menu that appears when you hover the mouse over the analysis program on the canvas.

  • 4.4.3. Check data analysis results
  • Figure 4.4.3-1
    [Figure 4.4.3-1]

    Check the log information of each analysis program in the [Log] View.

  • Figure 4.4.3-2
    [Figure 4.4.3-2]

    If you double-click the log, you can check the log of the analysis process and errors in the [Console] View.

  • Figure 4.4.3-3
    [Figure 4.4.3-3]

    In the [GBox Browser] View, you can check the results by moving to the output path specified for each analysis program.

  • Figure 4.4.3-4
    [Figure 4.4.3-4]

    [Figure 4.4.3-4] shows the analysis results of fastp, an open program in the Bioinformatics category.

Register for public services

  • 4.5.1. Pipeline public registration

An analysis pipeline can be registered publicly so that it can be shared with other users.

  • Figure 4.5.1-1
    [Figure 4.5.1-1]

    As shown in [Figure 4.5.1-1], select the desired analysis pipeline in the [Workspace] view and click the [Register Pipeline] button in the Context menu.

  • Figure 4.5.1-2
    [Figure 4.5.1-2]

    The analysis pipeline for which public registration is requested can be used as an open analysis pipeline through the administrator approval process.

  • 4.5.2. Program public registration

The analysis program requested for public registration can be registered as an open analysis program through the administrator approval process.

  • Figure 4.5.2-1
    [Figure 4.5.2-1]

    As shown in [Figure 4.5.2-1], you can select a desired analysis program in the [Program] View and publicly register it by pressing the [Register Program] button in the Context menu.

    Users cannot delete or modify published categories, programs, and pipelines.

Tutorial for Using Public Analysis Pipelines

  • 5.1 Introduction to the Genomic Analysis Pipeline
    • 5.1.1. WGS
  • 5.2 NGS Data Analysis Pipeline Design Process
    • 5.2.1 NGS Data Preprocessing
    • 5.2.2 Whole Genome Sequencing Pipeline

Introduction to the Genomic Analysis Pipeline

  • 5.1.1. WGS
    • Pipeline
  • Figure 5.1.1-1
    [Figure 5.1.1-1]
    • Field of Use

Whole-Genome Sequencing

    • Summary of Features

This pipeline receives DNA sequences from normal and cancerous tissues, creates BAM files through the Read Map to Reference, Read Merge, and CleanUp processes, and uses GATK4 to quickly find somatic SNP & InDel variants and perform variant annotation.

    • Detailed Description

The pipeline consists of 10 steps: Fastp, Cutadapt, BWA (mem), SortSam, MarkDuplicates, CountBase, BaseRecalibrator, ApplyBQSR, HaplotypeCaller, and Somalier. The analysis performed at each stage is as follows.

Set the folder containing the pair-end fastq.gz files as the input of the first stage, fastp. The fastp stage preprocesses the fastq files; the results of quality control and filtering are visualized in JSON and HTML reports. Cutadapt then receives the fastq.gz files preprocessed by fastp and finds and removes adapter sequences, primers, poly-A tails, and other unwanted sequences from the high-throughput sequencing reads. The adapter-trimmed fastq files are passed to the next analysis step, BWA (mem).

BWA maps low-divergence sequences against large genome references such as the human genome. BWA (mem) seeds query sequences of 70 bp to 1 Mbp with maximal exact matches and aligns them against the genome reference to generate a SAM file. SortSam then sorts the SAM file by coordinate, queryname, or other SAM record properties and returns it as a BAM file. MarkDuplicates tags the duplicate reads in the BAM file and passes it on to CountBase, BaseRecalibrator, and ApplyBQSR. CountBase produces an output file containing the total number of bases.

BaseRecalibrator performs the first of the two steps of BQSR (Base Quality Score Recalibration), creating a recalibration table based on covariates such as read group, reported quality score, machine cycle, and nucleotide context. ApplyBQSR performs the second step of BQSR, recalibrating the base quality of the input reads according to the table from BaseRecalibrator and producing a recalibrated BAM file. HaplotypeCaller then calls the SNPs and InDels of haplotypes in the active region, writes the reassembled reads to a VCF file, compresses it into gvcf.gz format, and delivers it to the next step. In the final stage, Somalier extracts informative sites and stores the results of evaluating relatedness across sequencing runs in a SOMALIER file.
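The data flow across the ten stages can be summarized as in the sketch below; all script names and paths are illustrative, and in the Workbench each stage runs as a registered program whose output folder becomes the next stage's input (the actual scripts appear in section 5.2.2):

    #!/bin/bash
    # Illustrative sketch of the ten-stage WGS data flow (names and paths hypothetical)
    bash fastp.sh     data/raw       out/01_fastp                   # 1) QC + filtering
    bash cutadapt.sh  out/01_fastp   out/02_trimmed                 # 2) adapter trimming
    bash bwa_mem.sh   out/02_trimmed ref out/03_sam                 # 3) read map to reference -> SAM
    bash sortsam.sh   out/03_sam     out/04_bam                     # 4) coordinate-sorted BAM
    bash markdup.sh   out/04_bam     out/05_dedup                   # 5) tag duplicate reads
    bash countbase.sh out/05_dedup   intervals out/06_count         # 6) total base counts
    bash baserecal.sh out/05_dedup   ref ks1 ks2 ks3 out/07_table   # 7) BQSR step 1: recalibration table
    bash applybqsr.sh out/05_dedup   ref out/07_table out/08_bqsr   # 8) BQSR step 2: recalibrated BAM
    bash haplo.sh     out/08_bqsr    ref intervals out/09_gvcf      # 9) SNP/InDel calls -> gvcf.gz
    bash somalier.sh  out/09_gvcf    ref sites out/10_somalier      # 10) relatedness -> .somalier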

NGS data analysis pipeline design process

  • 5.2.1. NGS data preprocessing
    • Preprocessing Workflow
      • Figure 5.2.1-1
        [Figure 5.2.1-1]
    • FASTQ format
      • Figure 5.2.1-2
        [Figure 5.2.1-2]
    • FASTQ format cont
    • Phred* quality score Q with base-calling error probability P

      Q = -10 log₁₀(P)

      • Figure 5.2.1-3
        [Figure 5.2.1-3]
      • Figure 5.2.1-4
        [Figure 5.2.1-4]
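    • Worked example: since $P = 10^{-Q/10}$, a score of $Q = 20$ gives $P = 10^{-2} = 0.01$ (a 1% base-calling error probability, 99% accuracy), and $Q = 30$ gives $P = 10^{-3}$ (99.9% accuracy).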
    • FASTA format
      • One or more sequences per file
      • ">" denotes the beginning of a sequence or contig
      • Subsequent lines up to the next ">" define the sequence
      • A lowercase base denotes a repeat-masked base
      • Contig ID may have comments delimited by “|”
      • Figure 5.2.1-5
        [Figure 5.2.1-5]
    • SAM format
      • Figure 5.2.1-6
        [Figure 5.2.1-6]
    • SAM format cont
      • Figure 5.2.1-7
        [Figure 5.2.1-7]
      • Figure 5.2.1-8
        [Figure 5.2.1-8]
      • Figure 5.2.1-9
        [Figure 5.2.1-9]
    • Preprocessing stage
      • Quality Check
        • - FastQC is an excellent program for visualizing the overall quality of all reads in a fastq file
        • - Input : FASTQ
        • - Output : QC Report (html)
        • - Software : FastQC ...
      • Figure 5.2.1-10
        [Figure 5.2.1-10]
      • Figure 5.2.1-11
        [Figure 5.2.1-11]
      • Figure 5.2.1-12
        [Figure 5.2.1-12]
    • Trimming
      • Adapter trimming
        • - May increase mapping rates
        • - Absolutely essential for small RNA
        • - Probably improves de novo assemblies
      • Quality trimming
        • - May increase mapping rates
        • - May also lead to loss of information
      • Input: FASTQ
      • Output: Trimmed FASTQ
      • Software: Sickle, Cutadapt, Fastx Toolkit ...
      • Figure 5.2.1-13
        [Figure 5.2.1-13]
    • Alignment
      • The process of arranging sequencing reads of DNA and RNA to reconstruct a sequence
      • Alignment approaches divide into mapping (against a reference) and assembly
      • Input: FASTQ, Fasta
      • Output: Sam / Bam
      • Software: bowtie, bwa, star, soap ...
      • Figure 5.2.1-14
        [Figure 5.2.1-14]
    • Preprocessing Stage cont
      • Alignment – Aligning to a Reference
      • Figure 5.2.1-15
        [Figure 5.2.1-15]
      • Figure 5.2.1-16
        [Figure 5.2.1-16]
  • 5.2.2. Whole Genome Sequencing Pipeline
    • Pipeline summary
      • Pipeline schematic diagram
        • Figure 5.2.2-1
          [Figure 5.2.2-1]
      • Description
        • This pipeline receives DNA sequences from normal and cancerous tissues, creates BAM files through the Read Map to Reference, Read Merge, and CleanUp processes, and uses GATK4 to quickly find somatic SNV & InDel variants and perform variant annotation.
      • Program used
        • Fastp, Cutadapt, BWA, GATK4, Somalier
      • Input file
        • - Fastq data file
        • - Reference FASTA file
        • - Reference & Resource VCF file
      • Final results file
        • Variant annotate file of Somatic SNP & InDel
    • Registered pipeline
      • Figure 5.2.2-2
        [Figure 5.2.2-2]
    • Description of each stage
      • 1) Quality Check - Quality Control & Adaptive Trimming
        • Fastp
          • - Version and License: 0.20.1 / MIT License
          • - Description: A program that creates HTML-based reports by performing all-in-one preprocessing of Fastq files.
          • - Input parameters
            • [input]: Path of pair-ended fastq.gz file with raw data (path of *.fastq.gz)
            • [output]: Path where the preprocessing results report will be generated (path of *.html, *.json, *.fastq.gz)
          • - Script Details
            • A fastp program written in a bash shell script.
            • Receives the input and output values as folder paths and selects the set of fastq.gz files in the input folder for analysis.
            #!/bin/bash
            
            INPUT_DIR=$1
            OUTPUT_DIR=$2
            
            mkdir -p $OUTPUT_DIR
            
            in_array() {
                local needle array value
                needle="${1}"; shift; array=("${@}")
                for value in ${array[@]};
                    do
                        [ "${value}" == "${needle}" ] && echo "true" && return;
                    done
                echo "false"
            }
            
            array=(`find $INPUT_DIR/ -name "*.fastq.gz"`)
            for i in "${array[@]}"
            do
                FILE=${i##*/}
                IFS='' read -r -a array <<< "$FILE"
                len=(${#array[@]}-1)
                FASTP_IN_FILE_LIST+=("${array[$len]}")
            done
            
            
            for i in "${FASTP_IN_FILE_LIST[@]}"
            do
                basename=$i
                FILE1=${basename%1.fastq.gz}
            
                IFS='' read -r -a file1_array <<< "$FILE1"
                file1_len=(${#file1_array[@]}-1)
            
                array_check=`in_array ${file1_array[$file1_len]}2.fastq.gz ${FASTP_IN_FILE_LIST[@]}`
            
                if [ "${array_check}" == "true" ]; then
            
                    IFS='-' read -r -a SEQ_ID <<< "$FILE1"
                     
                    FASTP='/BiO/program/fastp/current/fastp'
            
                    time $FASTP -i ${INPUT_DIR}/${file1_array[$file1_len]}1.fastq.gz -I ${INPUT_DIR}/${file1_array[$file1_len]}2.fastq.gz \
            		-o ${OUTPUT_DIR}/${file1_array[$file1_len]}1.fastq.gz -O ${OUTPUT_DIR}/${file1_array[$file1_len]}2.fastq.gz \
            		--length_required 20 --average_qual 20 --detect_adapter_for_pe --correction \
            		-h ${OUTPUT_DIR}/${SEQ_ID[0]}.html -j ${OUTPUT_DIR}/${SEQ_ID[0]}.json
                fi
            done
          • - Pipeline connection
            • Figure 5.2.2-3
              [Figure 5.2.2-3]
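          • - Usage note: when run outside the Workbench, the script takes its two folder paths as positional arguments, for example (script file name and paths hypothetical):

                $ bash fastp.sh /BiO/data/raw_fastq /BiO/data/fastp_out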
      • 2) Trimming - Quality Control & Adaptive Trimming
        • Cutadapt
          • - Version and License: 3.4 / MIT License
          • - Description: A program that generates purified fastq files by finding and removing adapter sequences, primers, poly-A tails, and unwanted sequences from high-throughput sequencing reads
          • - Input parameters
            • [input]: The path where the fastq.gz file preprocessed by Fastp is located
            • [output]: The path where the fastq file that is the result of adaptive trimming will be created
          • - Script Details
            • A cutadapt program written in a bash shell script.
            • Receives the input and output values as folder paths and selects the set of fastq.gz files in the input folder for analysis
            #!/bin/bash
            
            INPUT_DIR=$1
            OUTPUT_DIR=$2
            
            mkdir -p $OUTPUT_DIR
            
            in_array() {
                local needle array value
                needle="${1}"; shift; array=("${@}")
                for value in ${array[@]};
                    do
                        [ "${value}" == "${needle}" ] && echo "true" && return;
                    done
                echo "false"
            }
            
            array=(`find $INPUT_DIR/ -name "*.fastq.gz"`) 
            for i in "${array[@]}"
            do
                FILE=${i##*/}
                IFS='' read -r -a array <<< "$FILE"
                len=(${#array[@]}-1)
                echo "${array[$len]}"
                CUTADAPT_IN_FILE_LIST+=("${array[$len]}")
            done
            
            index=1
            
            for i in "${CUTADAPT_IN_FILE_LIST[@]}"
            do
                basename=$i
                FILE1=${basename%1.fastq.gz}
            
                IFS='' read -r -a file1_array <<< "$FILE1"
                file1_len=(${#file1_array[@]}-1)
            
                array_check=`in_array ${file1_array[$file1_len]}2.fastq.gz ${CUTADAPT_IN_FILE_LIST[@]}`
            
                if [ "${array_check}" == "true" ]; then
            
                    READ_1=$INPUT_DIR/${file1_array[$file1_len]}1.fastq.gz
                    READ_2=$INPUT_DIR/${file1_array[$file1_len]}2.fastq.gz
            
                    READ_1_FILE_NAME=$(basename $READ_1) 
                    READ_2_FILE_NAME=$(basename $READ_2)
            
                    IFS='.' read -r -a split_name_1 <<< "$READ_1_FILE_NAME" 
                    IFS='.' read -r -a split_name_2 <<< "$READ_2_FILE_NAME" 
            
                    READ_1_NAME=${split_name_1[0]} 
                    READ_2_NAME=${split_name_2[0]}
            
                   	CUTADAPT='/BiO/program/python/current/bin/cutadapt'
                   
                    time $CUTADAPT -q 30 -j 2 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT --minimum-length 70 --pair-filter=any -o $OUTPUT_DIR/${READ_1_NAME}.fastq -p $OUTPUT_DIR/${READ_2_NAME}.fastq  $READ_1 $READ_2
            
                    index=$(($index+1))
                fi
            done
          • - Pipeline connection
            • Figure 5.2.2-4
              [Figure 5.2.2-4]
      • 3) Alignment – Read Map to Reference
        • BWA mem
          • - Version and License: 0.7.15 / GNU General Public License version 3.0
          • - Description: A program that seeds query sequences of 70 bp to 1 Mbp with maximal exact matches and then aligns them to produce a SAM file.
          • - Input parameters
            • [input]: The path where the fastq file subject to adaptive trimming by Cutadapt is located
            • [reference]: The path where the index of the reference fasta file indexed by the BWA index is created
            • [output]: The path where the SAM file produced by mapping reads to the reference is created
          • - Script Details
            • A BWA mem program written in a bash shell script.
            • The reference value is received as a folder from which only the fasta files are selected; the input and output values are also received as folders, and the fastq files in the input folder are selected and analyzed.
            #!/bin/bash
            
            INPUT_DIR=$1
            REF_DIR=$2
            OUTPUT_DIR=$3 
            
            mkdir -p $OUTPUT_DIR
            
            REF_FILE=(`find $REF_DIR/ -name "*.fasta"`)
            
            in_array() {
                local needle array value
                needle="${1}"; shift; array=("${@}")
                for value in ${array[@]};
                    do
                        [ "${value}" == "${needle}" ] && echo "true" && return;
                    done
                echo "false"
            }
            
            array=(`find $INPUT_DIR/ -name "*.fastq"`) 
            
            for i in "${array[@]}"
            do
                FILE=${i##*/}
                IFS='' read -r -a array <<< "$FILE"
                len=(${#array[@]}-1)
                echo "${array[$len]}"
                BWA_IN_FILE_LIST+=("${array[$len]}")
            done
            
            for i in "${BWA_IN_FILE_LIST[@]}"
            do
                basename=$i
                FILE1=${basename%1.fastq}
            
                IFS='' read -r -a file1_array <<< "$FILE1"
                file1_len=(${#file1_array[@]}-1)
            
                array_check=`in_array ${file1_array[$file1_len]}2.fastq ${BWA_IN_FILE_LIST[@]}`
            
                if [ "${array_check}" == "true" ]; then
            
                    READ_1=$INPUT_DIR/${file1_array[$file1_len]}1.fastq 
                    READ_2=$INPUT_DIR/${file1_array[$file1_len]}2.fastq 
            
                    INPUT_FILE_NAME=$(basename  $READ_1 .fastq)
            
                    IFS='_' read -r -a NAME <<< "$INPUT_FILE_NAME"
                    IFS='-' read -r -a FASTQ_2D_ID <<< "${NAME[0]}"
                    
                    OUTPUT_SAM=$OUTPUT_DIR/${NAME}.sam
            	
                    BWA='/BiO/program/bwa/current/bwa'
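                    # -M marks shorter split hits as secondary; -R sets the read group header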
                    $BWA mem \
                    -M -R "@RG\tID:${INPUT_DIR}\tPL:Nova6000S4\tLB:Truseq_PCRFREE\tSM:${FASTQ_2D_ID}" -v 1 \
                    -t 1 $REF_FILE $READ_1 $READ_2 > $OUTPUT_SAM
                    
                fi
            
            done
          • - Pipeline connection
            • Figure 5.2.2-5
              [Figure 5.2.2-5]
      • 4) Alignment – Read Merge
        • GATK SortSam
          • - Version and License: GATK 4.2.0.0 / MIT License
          • - Description: Program to sort SAM or BAM files by coordinate, query name, or any other attributes of the SAM records.
          • - Input parameters
            • [input]: The path where the SAM file generated by BWA mem is located
            • [output]: The path where BAM files sorted by properties are created
          • - Script Details
            • A SortSam program written in a bash shell script.
            • Receives the input and output values as folder paths and selects the set of SAM files in the input folder for analysis
            #!/bin/bash
            
            INPUT_DIR=$1
            OUTPUT_DIR=$2 
            
            mkdir -p $OUTPUT_DIR
            
            array=(`find $INPUT_DIR/ -name "*.sam"`) 
            
            for i in "${array[@]}"
            do
                FILE=${i##*/}
                IFS='' read -r -a array <<< "$FILE"
                len=(${#array[@]}-1)
                echo "${array[$len]}"
                SORT_IN_FILE_LIST+=("${array[$len]}")
            done
            
            for i in "${SORT_IN_FILE_LIST[@]}" 
            do
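                # derive the plain name and extension; only .sam files are processed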
                basename=$(basename  $i)
                INPUT_FILE_NAME=$(basename  $i .sam)
                EXTENSION=${basename#*.}
            
                if [ $EXTENSION == 'sam' ]; then
            
                    IFS='-' read -r -a FNAME <<< "$INPUT_FILE_NAME"
                    IFS='_' read -r -a SEQ_ID <<< "$FNAME"
            
                    INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.sam
                    
                SORTSAM='/BiO/program/gatk/current/gatk SortSam'
            
                time $SORTSAM -I $INPUT_FILE -O $OUTPUT_DIR/${INPUT_FILE_NAME}.bam -SO coordinate
            
                fi
            done
          • - Pipeline connection
            • Figure 5.2.2-6
              [Figure 5.2.2-6]
      • 5) Alignment – CleanUp
        • GATK MarkDuplicates
          • - Version and License: GATK 4.2.0.0 / MIT License
          • - Description: Program that locates and tags duplicate reads in a BAM or SAM file.
          • - Input parameters
            • [input]: The path where the BAM file sorted by SortSam is located
            • [output]: The path where the duplicate-marked BAM file and the duplication-metrics table are created
          • - Script Details
            • A MarkDuplicates program written in a bash shell script.
            • Receives the input and output values as folder paths and selects the set of BAM files in the input folder for analysis
            #!/bin/bash
            INPUT_DIR=$1 
            OUTPUT_DIR=$2
            
            mkdir -p $OUTPUT_DIR
            
            array=(`find $INPUT_DIR/ -name "*.bam"`)
            
            for i in "${array[@]}"
            do
                FILE=${i##*/}
                IFS='' read -r -a array <<< "$FILE"
                len=(${#array[@]}-1)
                echo "${array[$len]}"
                MARK_IN_FILE_LIST+=("${array[$len]}")
            done
            
            for i in "${MARK_IN_FILE_LIST[@]}"
            do
                basename=$(basename  $i)
                INPUT_FILE_NAME=$(basename  $i .bam)
                EXTENSION=${basename#*.}
            
                if [ $EXTENSION == 'bam' ]; then
            
                    INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
                    
                MARKDUPLICATES='/BiO/program/gatk/current/gatk MarkDuplicates'
            
                time $MARKDUPLICATES \
                    -I $INPUT_FILE \
                    -O $OUTPUT_DIR/${INPUT_FILE_NAME}.bam \
                    -M $OUTPUT_DIR/${INPUT_FILE_NAME}.table \
                    --REMOVE_SEQUENCING_DUPLICATES false
            
                echo "$MARKDUPLICATES -I $INPUT_FILE -O $OUTPUT_DIR/${INPUT_FILE_NAME}.bam -M $OUTPUT_DIR/${INPUT_FILE_NAME}.table --REMOVE_SEQUENCING_DUPLICATES false"
                
                fi
            done
          • - Pipeline connection
            • Figure 5.2.2-7
              [Figure 5.2.2-7]
      • 6) Alignment – Base Count
        • GATK CountBases
          • - Version and License: GATK 4.2.0.0 / MIT License
          • - Description: Program that counts the total number of bases in a SAM/BAM file.
          • - Input parameters
            • [input]: The path where the BAM file created by MarkDuplicates is located.
            • [intervals]: The path where the Genomic Interval file is located
            • [output]: The path where the count file with the total number of bases of the BAM file is created
          • - Script Details
            • A CountBases program written in a bash shell script.
            • Receives the intervals value as a file and the input and output values as folder paths, and selects the set of BAM files in the input folder for analysis
            #!/bin/bash

            INPUT_DIR=$1
            REF_FILE=$2
            OUTPUT_DIR=$3
            
            mkdir -p $OUTPUT_DIR
            
            array=(`find $INPUT_DIR/ -name "*.bam"`)
            
            for i in "${array[@]}"
            do
                FILE=${i##*/}
                IFS='' read -r -a array <<< "$FILE"
                len=(${#array[@]}-1)
                echo "${array[$len]}"
                COUNT_IN_FILE_LIST+=("${array[$len]}")
            done
            
            for i in "${COUNT_IN_FILE_LIST[@]}"
            do
                basename=$(basename  $i)
                INPUT_FILE_NAME=$(basename  $i .bam)
                EXTENSION=${basename#*.}
            
                if [ $EXTENSION == 'bam' ]; then
            
                    INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
            
                    # index the BAM first: GATK requires an indexed BAM for interval queries
                    /samtools/1.6/samtools index $INPUT_FILE
                
                COUNTBASES='/BiO/program/gatk/current/gatk CountBases'
            
                time $COUNTBASES \
                    -I $INPUT_FILE \
                    -O $OUTPUT_DIR/${INPUT_FILE_NAME}.count \
                    -RF NotSecondaryAlignmentReadFilter \
                    -L $REF_FILE
                
                fi
            done
          • - Pipeline connection
            • Figure 5.2.2-8
              [Figure 5.2.2-8]
      • 7) Alignment – Base Quality Score Recalibration
        • GATK BaseRecalibrator
          • - Version and License: GATK 4.2.0.0 / MIT License
          • - Description: A program that builds a recalibration table by recalibrating base quality scores based on covariates such as read group, reported quality score, machine cycle, and nucleotide context.
          • - Input parameters
            • [input]: The path where the BAM file created by MarkDuplicates is located.
            • [reference]: The path where the reference sequence fasta file is located
            • [known sites1~3]: The paths where the vcf files of known polymorphic sites are located
            • [output]: The path where the table file created by re-calibrating the base quality score will be created
          • - Script Details
            • A BaseRecalibrator program written in a bash shell script.
            • Receives the reference and known-sites values as files and the input and output values as folder paths, and selects the set of BAM files in the input folder for analysis
            #!/bin/bash
            
            INPUT_DIR=$1
            REF_FILE1=$2
            REF_FILE2=$3
            REF_FILE3=$4
            REF_FILE4=$5
            OUTPUT_DIR=$6
            
            mkdir -p $OUTPUT_DIR
            
            array=(`find $INPUT_DIR/ -name "*.bam"`)
            
            for i in "${array[@]}"
            do
                FILE=${i##*/}
                IFS='' read -r -a array <<< "$FILE"
                len=(${#array[@]}-1)
                echo "${array[$len]}"
                BASE_IN_FILE_LIST+=("${array[$len]}")
            done
            
            for i in "${BASE_IN_FILE_LIST[@]}"
            do
                basename=$(basename  $i)
                INPUT_FILE_NAME=$(basename  $i .bam)
                EXTENSION=${basename#*.}
            
                if [ $EXTENSION == 'bam' ]; then
            
                    INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
                    
                BASERECALIBRATOR='/BiO/program/gatk/current/gatk BaseRecalibrator'
            
                time $BASERECALIBRATOR \
                    -I $INPUT_DIR/${INPUT_FILE_NAME}.bam \
                    -R ${REF_FILE1} \
                    --known-sites ${REF_FILE2} \
                    --known-sites ${REF_FILE3} \
                    --known-sites ${REF_FILE4} \
                    -O $OUTPUT_DIR/${INPUT_FILE_NAME}.table
            
                fi
            done
          • - Pipeline connection
            • Figure 5.2.2-9
              [Figure 5.2.2-9]
      • 8) Alignment – Apply BQSR
        • GATK ApplyBQSR
          • - Version and License: GATK 4.2.0.0 / MIT License
          • - Description: Program to generate BAM or CRAM files by recalibrating the base quality of input reads based on the recalibration table generated by the BaseRecalibrator tool
          • - Input parameters
            • [input]: The path where the BAM file created by MarkDuplicates is located.
            • [reference]: The path where the reference sequence fasta file is located
            • [bqsr recal file]: The path where the table file is created by BaseRecalibrator
            • [output]: The path where the recalibrated BAM file is created based on the table file created by BaseRecalibrator
          • - Script Details
            • An ApplyBQSR program written in a bash shell script.
            • Receives the reference value as a file and the bqsr recal value as the folder containing the table files; the input and output values are received as folder paths, and the BAM files in the input folder are selected and analyzed.
            #!/bin/bash
            
            INPUT_DIR=$1 
            REF_FILE=$2
            REF_DIR=$3
            OUTPUT_DIR=$4
            
            mkdir -p $OUTPUT_DIR
            
            array=(`find $INPUT_DIR/ -name "*.bam"`) 
            
            for i in "${array[@]}"
            do
                FILE=${i##*/}
                IFS='' read -r -a array <<< "$FILE"
                len=(${#array[@]}-1)
                echo "${array[$len]}"
                BQSR_IN_FILE_LIST+=("${array[$len]}")
            done
            
            for i in "${BQSR_IN_FILE_LIST[@]}"
            do
                basename=$(basename  $i)
                INPUT_FILE_NAME=$(basename  $i .bam)
                EXTENSION=${basename#*.}
            
                if [ $EXTENSION == 'bam' ]; then
             
                    IFS='-' read -r -a FNAME <<< "$INPUT_FILE_NAME"
                    IFS='_' read -r -a SEQ_ID <<< "$FNAME"
            
                    echo $INPUT_FILE_NAME
                    echo $FNAME
                    
                    INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
            
                APPLYBQSR='/BiO/program/gatk/current/gatk ApplyBQSR'
            
                time $APPLYBQSR \
                    -R $REF_FILE \
                    -I $INPUT_FILE \
                    -bqsr $REF_DIR/${INPUT_FILE_NAME}.table \
                    -O $OUTPUT_DIR/${SEQ_ID[0]}_v1.0.0.bam \
                    --static-quantized-quals 10 \
                    --static-quantized-quals 20 \
                    --static-quantized-quals 30 
              
               fi
            done
          • - Pipeline connection
            • Figure 5.2.2-10
              [Figure 5.2.2-10]
      • 9) Alignment – Extract&Filter of SNPs, Indels
        • GATK HaplotypeCaller
          • - Version and License: GATK 4.2.0.0 / MIT License
          • - Description: A program that calls SNPs and indels via local reassembly of haplotypes in the active region, writes the reassembled reads to a vcf file, and compresses the result into gvcf.gz format.
          • - Input parameters
            • [input]: The path where the BAM file created by ApplyBQSR is located.
            • [reference]: The path where the reference sequence fasta file is located
            • [intervals]: The path where the Genomic Intervals list is located
            • [output]: The path where the gvcf.gz file will be created by HaplotypeCaller
          • - Script Details
            • A HaplotypeCaller program written in a bash shell script.
            • Receives the reference and intervals values as files and the input and output values as folder paths, and selects the set of BAM files in the input folder for analysis
            #!/bin/bash
            
            echo "Total Param= $#,  PROG: $0, param1 = $1, param2 = $2, param3 = $3, param4 = $4";
            
            if [ "$#" -lt 4 ]; then
                echo "$# is Illegal number of parameters."
                echo "Usage: $0 input dir, reference file, output dir"
            	exit 1
            fi
            args=("$@")
            
            for (( i=0; i<$#; i++ ))
            do
                echo "$i th parameter = ${args[$i]}";
            done
            
            INPUT_DIR=$1 # BAM files
            REF_FILE1=$2
            REF_FILE2=$3
            OUTPUT_DIR=$4 # VCF files
            
            mkdir -p $OUTPUT_DIR
            
            array=(`find $INPUT_DIR/ -name "*.bam"`)
            
            for i in "${array[@]}"
            do
                FILE=${i##*/}
                IFS='' read -r -a array <<< "$FILE"
                len=(${#array[@]}-1)
                echo "${array[$len]}"
                HAPLO_IN_FILE_LIST+=("${array[$len]}")
            done
            
            
            for i in "${HAPLO_IN_FILE_LIST[@]}"
            do
            
                    basename=$(basename  $i)
                    INPUT_FILE_NAME=${basename%.*}
                    EXTENSION=${basename##*.}
            
                if [ $EXTENSION == 'bam' ]; then
            
                    IFS='-' read -r -a FNAME <<< "$INPUT_FILE_NAME"
                    IFS='_' read -r -a SEQ_ID <<< "$FNAME"
            
                    INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.bam
            
                HAPLOTYPECALLER='/BiO/program/gatk/current/gatk HaplotypeCaller'
                
                time $HAPLOTYPECALLER -I $INPUT_FILE -R $REF_FILE1 -O $OUTPUT_DIR/${SEQ_ID}_v1.0.0.vcf -OVI true --emit-ref-confidence GVCF -L $REF_FILE2
                echo "$HAPLOTYPECALLER -I $INPUT_FILE -R $REF_FILE1 -O $OUTPUT_DIR/${SEQ_ID}_v1.0.0.vcf -OVI true --emit-ref-confidence GVCF -L $REF_FILE2"
                    
                
                mv $OUTPUT_DIR/${SEQ_ID}_v1.0.0.vcf $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf
                echo "mv $OUTPUT_DIR/${SEQ_ID}_v1.0.0.vcf $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf"
            
                BGZIP='/BiO/program/htslib/current/bgzip'
                $BGZIP -c $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf > $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf.gz
                echo "$BGZIP -c $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf > $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf.gz"
                
                TABIX='/BiO/program/htslib/current/tabix'
            	$TABIX -p vcf $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf.gz
                echo "$TABIX -p vcf $OUTPUT_DIR/${SEQ_ID}_v1.0.0.gvcf.gz"
            
            
                fi
            done
          • - Pipeline connection
            • Figure 5.2.2-11
              [Figure 5.2.2-11]
      • 10) Alignment – Variant Annotation
        • Somalier
          • - Version and License: 0.2.12 / MIT License
          • - Description: A program that extracts informative sites from BAM/CRAM/VCF files and evaluates relatedness across sequencing runs
          • - Input parameters
            • [input]: The path where the gvcf.gz file created through HaplotypeCaller is located
            • [reference]: The path where the reference sequence fasta file is located
            • [sites] : The path where the vcf file, which contains the sites of the variant to be extracted, is located
            • [output]: The path where the somalier file is created by Somalier
          • - Script Details
            • A Somalier program written in a bash shell script.
            • Receives the reference and sites values as files and the input and output values as folder paths, and selects the set of gvcf.gz files in the input folder for analysis
            #!/bin/bash
            
            INPUT_DIR=$1
            REF_FILE1=$2
            REF_FILE2=$3
            OUTPUT_DIR=$4
            
            mkdir -p $OUTPUT_DIR
            
            array=(`find $INPUT_DIR/ -name "*.gvcf.gz"`)
            
            for i in "${array[@]}"
            do
                FILE=${i##*/}
                IFS='' read -r -a array <<< "$FILE"
                len=(${#array[@]}-1)
                echo "${array[$len]}"
                SOMAL_IN_FILE_LIST+=("${array[$len]}")
            done
            
            for i in "${SOMAL_IN_FILE_LIST[@]}"
            do
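                # the first three dot-separated fields form the file name (e.g. SAMPLE_v1.0.0),
                # the remaining fields form the extension ("gvcf.gz")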
            
                INPUT_FILE_NAME=`basename "$i" | cut -d'.' -f-3`
                EXTENSION=`echo "$i" | cut -d'.' -f4-`
            
                if [ $EXTENSION == 'gvcf.gz' ]; then
            
                    INPUT_FILE=$INPUT_DIR/${INPUT_FILE_NAME}.gvcf.gz
            		
                    IFS='-' read -r -a FNAME <<< "$INPUT_FILE_NAME"
                    IFS='_' read -r -a SEQ_ID <<< "$FNAME"
            
            	SOMALIER='/somalier/0.2.12/somalier'
                
                time $SOMALIER extract -d $OUTPUT_DIR -s ${REF_FILE2} -f ${REF_FILE1} $INPUT_FILE
            
                fi
            done
          • - Pipeline connection
            • Figure 5.2.2-12
              [Figure 5.2.2-12]
    • Results Screen
      • Fastp Results Screen
        • Figure 5.2.2-13
          [Figure 5.2.2-13]
        • Figure 5.2.2-14
          [Figure 5.2.2-14]
        • Figure 5.2.2-15
          [Figure 5.2.2-15]
        • Figure 5.2.2-16
          [Figure 5.2.2-16]
      • GATK_Cutadapt Result Screen
      • @ERR792543.3 HWI-ST333_0255_FC:4:1101:1235:2392/1
        CTGCCAATTCCCAATTATATTCTTGGAGAAGCTCATTAGGATAATGGAAATCAATCACTTTGGTTGATCTATCGAAACTT
        +
        CBCFFFFFHGHHGIJJIJIIJJJJJJGIIIJJGIGIJIJJJJGIIEIFEIHGIJJJJJJJFGGHIJIEHJGIJGIGHBEC
        @ERR792543.7 HWI-ST333_0255_FC:4:1101:1228:2438/1
        CTCCGGAGTGTAGGCTCCCCATATCTCAGCAAGTGGCCAAAGGTTCGTAATGCCATCTCTGCACNAATCTCCTCCCCCAT
        +
        @@@DFFFFCDHF8EHIIIIIIIIIIIIIIIIIGHGHIIIIIII?DDF;BFG<FHHCHEIFGCHH#-5;?DEEEEEBBDDC
        @ERR792543.8 HWI-ST333_0255_FC:4:1101:1179:2464/1
        GCCAGTCATGCCTGCCTGAGACGCCCCGCGGTTGGTGCCCATCTGTAACCCGATCACGTTCTTGNCCTCTTNCAGCTGGT
        +
        @CCFFFFFHHHHHJJJJJIJJJIIJJJIIJIDHIJFHIIJEHFHEHHHFFFFDCBDDBBDDDDC#,58??C#+8<<@CCC
        @ERR792543.10 HWI-ST333_0255_FC:4:1101:1400:2280/1
        CACAAACCTTGTTCTAATTTATGTGCACTTCTGAATAAAAATAACACCCTTTTGCTTTAATGCAATATTGATTTTATCAC
        +
        @CCDFFDFFHHHHIIGCHIIIIFIGGGGIIHIGIJIIJJIJIIJIJIGGIJIJJJJIJGGDEHGGIJIIHGIJJIHHIEH
        @ERR792543.12 HWI-ST333_0255_FC:4:1101:1389:2303/1
        ATGCGCTCCGCCGCTGTCCTGGCTCTTCTGCTCTGCGCCGGGCAAGTCACTGCGCTCCCTGTGAACAGCCCTATGAATAA
        +
        @@CFDFFFFHFD:FGGFHIIIIIIIIIIIGIIGGIIIIIIIFHFFFEACCEECCCC:@B?:@CEDCCCB?@CCCACDEDC
        @ERR792543.14 HWI-ST333_0255_FC:4:1101:1495:2309/1
        TCTCTCTCCTAATGAAAAAGTTTGTTTCTCTTTTACATTAGTAACGTATGTAACGGGTGGTATGTCTTTCTTTGCTTTGG
        +
        CCCFFFFFHHHDFIJJJJJJJJIJJIJJJJJJIJIGHIJGGIJJJJJJJIFGGIJFG<AH4=EEGGIJHGHHHGFCDFDD
        @ERR792543.15 HWI-ST333_0255_FC:4:1101:1274:2316/1
        GCCCAACTGTCCTGCTGATCTGAAGTAAGACAACTAGCCTTCTACATCAAATAATTCTGTGAGGCAATTTGAGTCAG
        +
        @@@DD?DDAC:3,<CGBFHHHIBCGHFEDEHF;EFEIFII@BFBGGI<DFIIGF@4BF>84<BDEFEEDGIIFFDF@
        @ERR792543.16 HWI-ST333_0255_FC:4:1101:1467:2316/1
        TGCAGTATAGATGGGATATGATGCATAGGCTTGGAGAACCACAGGCAAGGATGAGAGAAGAGAATATGGAAAGGATTG
      • BWA_mem Results Screen
      • @SQ	SN:chr3	LN:198295559
        @SQ	SN:chr4	LN:190214555
        @SQ	SN:chr5	LN:181538259
        @SQ	SN:chr6	LN:170805979
        @SQ	SN:chr7	LN:159345973
        @SQ	SN:chr8	LN:145138636
        @SQ	SN:chr9	LN:138394717
        @SQ	SN:chr10	LN:133797422
        @SQ	SN:chr11	LN:135086622
        @SQ	SN:chr12	LN:133275309
        @SQ	SN:chr13	LN:114364328
        @SQ	SN:chr14	LN:107043718
        @SQ	SN:chr15	LN:101991189
        @SQ	SN:chr16	LN:90338345
        @SQ	SN:chr17	LN:83257441
        @SQ	SN:chr18	LN:80373285
        @SQ	SN:chr19	LN:58617616
        @SQ	SN:chr20	LN:64444167
        @SQ	SN:chr21	LN:46709983
        @SQ	SN:chr22	LN:50818468
        @SQ	SN:chrX	LN:156040895
        @SQ	SN:chrY	LN:57227415
        @SQ	SN:chrM	LN:16569
        @SQ	SN:chr1_KI270706v1_random	LN:175055
        @SQ	SN:chr1_KI270707v1_random	LN:32032
        @SQ	SN:chr1_KI270708v1_random	LN:127682
        @SQ	SN:chr1_KI270709v1_random	LN:66860
        @SQ	SN:chr1_KI270710v1_random	LN:40176
        @SQ	SN:chr1_KI270711v1_random	LN:42210
        @SQ	SN:chr1_KI270712v1_random	LN:1760438
      • GATK_BaseRecalibrator Results Screen
      • #:GATKTable:Arguments:Recalibration argument collection values used in this run
        Argument                    Value                                                                   
        binary_tag_name             null                                                                    
        covariate                   ReadGroupCovariate,QualityScoreCovariate,ContextCovariate,CycleCovariate
        default_platform            null                                                                    
        deletions_default_quality   45                                                                      
        force_platform              null                                                                    
        indels_context_size         3                                                                       
        insertions_default_quality  45                                                                      
        low_quality_tail            2                                                                       
        maximum_cycle_value         500                                                                     
        mismatches_context_size     2                                                                       
        mismatches_default_quality  -1                                                                      
        no_standard_covs            false                                                                   
        quantizing_levels           16                                                                      
        recalibration_report        null                                                                    
        run_without_dbsnp           false                                                                   
        solid_nocall_strategy       THROW_EXCEPTION                                                         
        solid_recal_mode            SET_Q_ZERO

        #:GATKTable:3:94:%d:%d:%d:;
        #:GATKTable:Quantized:Quality quantization map
        QualityScore  Count     QuantizedScore
                   0         0               6
                   1         0               6
                   2         0               6
                   3         0               6
                   4         0               6
                   5         0               6
                   6      3784               6
      • GATK4_HaplotypeCaller Results Screen
      • ##fileformat=VCFv4.2
        ##ALT=<ID=NON_REF,Description="Represents any possible alternative allele not already represented at this location by REF and ALT">
        ##FILTER=<ID=LowQual,Description="Low quality">
        ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
        ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
        ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
        ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
        ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
        ##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
        ##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
        ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
        ##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
        ##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
        ##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --emit-ref-confidence GVCF --output /bioex/output/haplotype_caller/ERR792543_v1.0.0.vcf --intervals /bioex/reference/references_hg38_v0_wgs_metrics_intervals.interval_list --input /bioex/output/apply_bqsr/ERR792543_v1.0.0.bam --reference /bioex/reference/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta --create-output-variant-index true --use-posteriors-to-calculate-qual false --dont-use-dragstr-priors false --use-new-qual-calculator true --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --
    • Somalier results file
      • As shown in [Figure 5.2.2-17], Somalier results are saved as [file name].somalier.

        • Figure 5.2.2-17
          [Figure 5.2.2-17]