Korea Bioinformation Center

National center for biological research resources and information

Public Analysis Pipelines

Whole Genome Sequencing Somatic Variant Analysis Pipeline

The Bio-Express Somatic WGS Pipeline is a modular analysis pipeline designed to detect somatic mutations from whole-genome sequencing data. It takes raw FASTQ files as input and provides comprehensive somatic mutation calling results, quality assessment, and visualization based on tumor-normal pair analysis.

Sequencing quality is first assessed with FastQC. Adapter removal and quality trimming are then performed using Cutadapt, and the reads are mapped to a reference genome sequence using the BWA-MEM2 alignment tool to generate BAM format alignment files. Subsequently, the GATK pipeline is employed for duplicate removal, mapping quality assessment, and filtering of low-quality reads, ensuring that all pair information is consistent. Coordinate-based sorting is performed using SAMtools, and PCR duplicates are removed via GATK MarkDuplicates. Base quality score recalibration is then conducted using GATK BaseRecalibrator and ApplyBQSR, leveraging known variant site information.

The recalibrated BAM files first go through a comprehensive quality control and sample validation step. Sample relatedness is verified using Somalier, sample identity is confirmed through variant-SNP marker integration analysis with SNPmatch, sample contamination is assessed with VerifyBamID2, and coverage analysis is conducted using Mosdepth to evaluate the quality and reliability of the sequencing data.

Next, the pipeline proceeds to the tumor-normal pair analysis stage, where normal-tumor pair compatibility is verified and cross-sample contamination levels are estimated using Conpair. Single nucleotide variants and insertion/deletion mutations are detected in parallel using Strelka2 and Mutect2 to maximize the sensitivity and specificity of somatic mutation detection. Finally, tumor purity analysis is performed using TINC, structural variants are called with Manta, and copy number variation analysis is conducted with Canvas. Together these steps quantify somatic genomic alterations and provide essential information for cancer genomics research and precision medicine.

Default genome: hg38

[IMPORTANT] Sample Type Identification Rule:
  • Tumor tissue samples: FASTQ filenames must contain "_T"
  • Normal tissue samples: FASTQ filenames must contain "_N"

Example:
    patient001_T_R1.fastq.gz    # Tumor sample, Read 1
    patient001_T_R2.fastq.gz    # Tumor sample, Read 2
    patient001_N_R1.fastq.gz    # Normal sample, Read 1
    patient001_N_R2.fastq.gz    # Normal sample, Read 2
#wgs
#whole-genome sequencing
#somatic mutation
#tumor-normal pair analysis
#cancer genomics
#precision medicine
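
The sample type identification rule for the somatic pipeline can be checked locally before uploading files. The following is a minimal Python sketch, not the pipeline's own implementation: the function names `classify_sample` and `group_by_type` are hypothetical, and the choice to reject filenames that contain both markers or neither is an assumption, since the rule as stated does not say how such names are handled.

```python
def classify_sample(filename: str) -> str:
    """Return "tumor" or "normal" using the "_T" / "_N" filename rule."""
    has_tumor = "_T" in filename
    has_normal = "_N" in filename
    if has_tumor == has_normal:
        # Assumption: names with both markers or with neither are rejected,
        # because the stated rule cannot decide their sample type.
        raise ValueError(f"cannot determine sample type: {filename}")
    return "tumor" if has_tumor else "normal"


def group_by_type(filenames):
    """Group FASTQ filenames into tumor and normal lists, keeping input order."""
    groups = {"tumor": [], "normal": []}
    for name in filenames:
        groups[classify_sample(name)].append(name)
    return groups


files = [
    "patient001_T_R1.fastq.gz",
    "patient001_T_R2.fastq.gz",
    "patient001_N_R1.fastq.gz",
    "patient001_N_R2.fastq.gz",
]
groups = group_by_type(files)
print(groups)
```

Running a check like this on a directory listing makes it easy to spot a file that would not be recognized as either tumor or normal.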

Whole Genome Sequencing Germline Variant Analysis Pipeline

This Bio-Express WGS pipeline is a modular analysis pipeline designed to detect germline variants from whole-genome sequencing data. It takes raw FASTQ files as input and provides comprehensive germline variant calling results and quality assessment.

Sequencing quality is first assessed with FastQC. Adapter removal and quality trimming are then performed using Cutadapt, and the reads are mapped to a reference genome sequence using the BWA-MEM2 alignment tool to generate BAM format alignment files. Subsequently, the GATK pipeline is employed for duplicate removal, mapping quality assessment, and filtering of low-quality reads, ensuring that all pair information is consistent. Coordinate-based sorting is performed using SAMtools, and PCR duplicates are removed via GATK MarkDuplicates. Base quality score recalibration is then conducted using GATK BaseRecalibrator and ApplyBQSR, leveraging known variant site information.

The recalibrated BAM files first go through a comprehensive quality control and sample validation step. Sample relatedness is verified using Somalier, sample contamination is assessed with VerifyBamID2, and coverage analysis is conducted using Mosdepth to evaluate the quality and reliability of the sequencing data.

Next, GVCF files are generated using GATK HaplotypeCaller, and germline SNV/Indel variants are called in standard VCF format using GenotypeGVCFs. Finally, comprehensive variant statistics are computed using BCFtools, and structural variants are detected using Manta.

Default genome: hg38
#wgs
#whole-genome sequencing
#germline mutation
#individual genomic analysis
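
The two-step germline calling stage (GVCF generation followed by genotyping) can be sketched as a pair of GATK command lines. This Python sketch only builds and prints the commands and does not run them. The file names are hypothetical, and the options actually used by the Bio-Express pipeline are not stated on this page, so only the basic required arguments of each tool appear here.

```python
import shlex


def haplotypecaller_cmd(reference: str, bam: str, gvcf: str) -> list:
    """Build a GATK HaplotypeCaller command that emits a GVCF (-ERC GVCF)."""
    return ["gatk", "HaplotypeCaller",
            "-R", reference, "-I", bam, "-O", gvcf, "-ERC", "GVCF"]


def genotype_gvcfs_cmd(reference: str, gvcf: str, vcf: str) -> list:
    """Build a GATK GenotypeGVCFs command that converts a GVCF to a VCF."""
    return ["gatk", "GenotypeGVCFs", "-R", reference, "-V", gvcf, "-O", vcf]


# Hypothetical file names, for illustration only.
reference = "hg38.fasta"
commands = [
    haplotypecaller_cmd(reference, "sample.recal.bam", "sample.g.vcf.gz"),
    genotype_gvcfs_cmd(reference, "sample.g.vcf.gz", "sample.vcf.gz"),
]
for cmd in commands:
    print(shlex.join(cmd))
```

Keeping each command as a list of arguments, rather than a single string, means it could later be passed to `subprocess.run` without shell quoting issues.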

ChIP-seq Analysis Pipeline

The Bio-Express ChIP-seq Analysis Pipeline is a modular analysis pipeline designed to detect protein-DNA binding sites from Chromatin Immunoprecipitation Sequencing (ChIP-Seq) data. It uses raw FASTQ files as input and provides comprehensive epigenomic binding site calling results, quality assessment, and visualization based on transcription factor binding sites, histone modification regions, and chromatin structure analysis.

Sequencing quality is first evaluated with FastQC. Low-quality bases are then filtered using FASTX-Toolkit, and the reads are mapped to a reference genome sequence using the Bowtie2 alignment tool to generate SAM format alignment files. The preprocessed alignment files then enter the epigenomic signal analysis stage. Statistically significant peaks are called using MACS2 (Model-based Analysis of ChIP-Seq) to identify protein-DNA binding sites, providing high-resolution binding regions in narrowPeak format.

Finally, comprehensive downstream analysis is conducted using HOMER. The annotatePeaks function provides genomic location annotations and nearby gene information for the detected peaks. The makeUCSCfile function generates bedGraph and bigWig format visualization files compatible with the UCSC Genome Browser, enabling intuitive visualization of the genome-wide distribution of chromatin immunoprecipitation signals.

Default genome: hg38

[IMPORTANT] Sample Type Identification Rule:
  • Control files: must start with "CONTROL_" (required prefix for automatic identification)
  • Treatment/ChIP files: no specific filename requirements

Example:
    CONTROL_input_R1.fastq.gz    # Valid control, Read 1
    CONTROL_input_R2.fastq.gz    # Valid control, Read 2
    ChIP_H3K4me3_R1.fastq.gz     # Valid treatment, Read 1
    ChIP_H3K4me3_R2.fastq.gz     # Valid treatment, Read 2
#chip-seq
#protein-dna binding
#epigenomics
#tfbs
#transcription factor binding sites
#histone modification
#chromatin structure
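
The control-file naming rule for the ChIP-seq pipeline can be checked the same way. This is a minimal Python sketch under stated assumptions: the helper names `is_control` and `split_controls` are hypothetical, the prefix match is assumed to be case-sensitive, and the rule is assumed to apply to the file name itself rather than to any directory in its path.

```python
import os


def is_control(path: str) -> bool:
    """Return True if the file name starts with the "CONTROL_" prefix."""
    # Assumption: the match is case-sensitive and ignores directory names.
    return os.path.basename(path).startswith("CONTROL_")


def split_controls(paths):
    """Split paths into (control, treatment) lists, keeping input order."""
    control = [p for p in paths if is_control(p)]
    treatment = [p for p in paths if not is_control(p)]
    return control, treatment


files = [
    "CONTROL_input_R1.fastq.gz",
    "CONTROL_input_R2.fastq.gz",
    "ChIP_H3K4me3_R1.fastq.gz",
    "ChIP_H3K4me3_R2.fastq.gz",
]
control, treatment = split_controls(files)
print("control:", control)
print("treatment:", treatment)
```

Because every file without the prefix is treated as a treatment sample, a misspelled prefix would silently move a control file into the treatment group, so checking the control list before submission is worthwhile.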
Bio-Express

Bio-Express is the only cloud-based integrated data analysis service in Korea. It supports big data analysis in scientific fields through a dynamic, container-based automated workflow analysis platform and a high-speed data transfer service.

Download

Please download the workbench and high-speed transfer service for the OS that matches your environment.

  • Users: 6,470
  • Workspaces: 1,129 cases
  • Execution Tasks: 91,339 cases
Korea BioData Station

Bio research data refers to all types of data produced through national R&D projects in the life sciences. As research methods that make use of this data gain attention, bio data is emerging as a key driver of R&D innovation. To support this, the Korea BioData Station was established to integrate and provide data scattered across ministries, projects, and researchers, with the aim of creating a data-driven bio research environment.

Registration Status by Data Type

  • Bio Project: 2,219 cases
  • Bio Sample: 113,245 cases
  • Registered Data: 2,375,075 cases

Bio Project Registration Status (chart: cumulative number of registrations, in cases)
National Genome Project

Bio big data, the foundation of precision medicine, is becoming increasingly important as the focus of healthcare shifts from post-treatment care to personalized treatment and prevention. The bioindustry rewards first movers, so proactive investment is necessary, and major countries are already building large-scale bio big data systems. This project was launched to build national bio big data that can lead future healthcare. Its goal is to establish a national foundation for collecting, storing, and utilizing bio big data, the core of the precision medicine era, and thereby to help foster new industries and improve public health.

Collection of Clinical Information

Sixteen rare disease collaboration institutions are designated and operated to recruit rare disease patients and collect clinical information.

Data Analysis

Samples collected from rare disease patients are transported to resource production institutions, where genomic data is produced and analyzed.

Data Sharing

The collected clinical information and genomic data are shared through a consortium formed by three institutions.

Data Utilization

The analyzed data is used for rare disease patient counseling, diagnosis, and research activities.

  • Genomic Data: 25,000
  • Variant Analysis Data: 25,000
  • Clinical Information: 25,000
  • Cohort: 7
Infectious Disease Data Portal

The Infectious Disease Data Portal is a portal service that integrates and provides research data on infectious disease viruses from around the world. To help researchers understand infectious diseases and develop treatments and vaccines in a rapidly changing environment, KOBIC integrates global infectious disease research data and makes both the data and the results available for sharing.

Sequence Dashboard

88,386 Domestic Genomic Sequences
1,354 Domestic Protein Sequences
19,685,177 Overseas Genomic Sequences
35,837,682 Overseas Protein Sequences
19,764,289 COVID-19 Genomic Sequences
35,333,179 COVID-19 Protein Sequences
Virus

Provides integrated information on viruses including disease overview, particle and genomic structure, life cycle, epidemiology, mutations, etc.

Data

Provides quality-analyzed genomic and protein sequences, and protein structures collected from around the world.

Statistics

Offers various statistical services on virus data, such as outbreak timing, region, mutations, etc.

Analysis Tools

Provides a simple web-based BLAST service for standard genomic sequences of infectious diseases.