Korea Bioinformation Center (KOBIC)

The Bio-Express Somatic WGS Pipeline is a modular analysis pipeline designed to detect somatic mutations from whole-genome sequencing data. It takes raw FASTQ files as input and provides comprehensive somatic mutation calling results, quality assessment, and visualization based on tumor-normal pair analysis.

After assessing sequencing quality with FastQC, adapter removal and quality trimming are performed using Cutadapt, followed by mapping to a reference genome sequence using the BWA-MEM2 alignment tool to generate BAM format alignment files. Subsequently, the GATK pipeline is employed for duplicate removal, mapping quality assessment, and filtering of low-quality reads, ensuring consistency of all pair information. Coordinate-based sorting is performed using SAMtools, and PCR duplicates are removed via GATK MarkDuplicates. Base quality score recalibration is then conducted using GATK BaseRecalibrator and ApplyBQSR, leveraging known variant site information as covariates.

For the recalibrated BAM files, a comprehensive quality control and sample validation step is performed first. Sample relatedness is verified using Somalier, sample identity is confirmed through variant-SNP marker integration analysis with SNPmatch, sample contamination is assessed with VerifyBamID2, and coverage analysis is conducted using Mosdepth to comprehensively evaluate the quality and reliability of the sequencing data.

Next, the pipeline proceeds to the tumor-normal pair analysis stage, where Normal-Tumor pair compatibility is verified and cross-sample contamination levels are estimated using Conpair. Single nucleotide variants and insertion/deletion mutations are detected in parallel using Strelka2 and Mutect2 to maximize the sensitivity and specificity of somatic mutation detection.

Finally, tumor purity analysis is performed using TINC, structural variants are called with Manta, and copy number variation analysis is conducted with Canvas, quantifying comprehensive somatic genomic alterations to provide essential information for cancer genomics research and precision medicine.

> Default genome: hg38

[IMPORTANT] Sample Type Identification Rule:
- Tumor tissue samples: FASTQ filenames must contain "_T"
- Normal tissue samples: FASTQ filenames must contain "_N"

(e.g.)
patient001_T_R1.fastq.gz  # Tumor sample, Read 1
patient001_T_R2.fastq.gz  # Tumor sample, Read 2
patient001_N_R1.fastq.gz  # Normal sample, Read 1
patient001_N_R2.fastq.gz  # Normal sample, Read 2
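The sample type identification rule above can be checked locally before uploading files. The following is a minimal sketch, not the pipeline's own implementation: the function name `classify_sample` is hypothetical, and it assumes the rule is a plain case-sensitive substring match on the filename, which the description does not spell out. Names that carry both markers or neither are rejected rather than guessed.

```python
from pathlib import Path


def classify_sample(fastq_name: str) -> str:
    """Return 'tumor' or 'normal' from the documented "_T" / "_N" markers.

    Assumption: a plain, case-sensitive substring match on the file name.
    """
    name = Path(fastq_name).name  # ignore any directory part
    has_tumor = "_T" in name
    has_normal = "_N" in name
    if has_tumor == has_normal:
        # Both markers present, or neither: the type cannot be decided.
        raise ValueError(f"cannot determine sample type for {name!r}")
    return "tumor" if has_tumor else "normal"


# Group the example files from the description by sample type.
example_files = [
    "patient001_T_R1.fastq.gz",
    "patient001_T_R2.fastq.gz",
    "patient001_N_R1.fastq.gz",
    "patient001_N_R2.fastq.gz",
]
groups = {"tumor": [], "normal": []}
for fastq in example_files:
    groups[classify_sample(fastq)].append(fastq)
```

Running a check like this up front catches files that would be ambiguous under the naming rule, for example a name containing both "_T" and "_N".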

Version 1.0

Execution Count 0

Genomics

Variant-analysis Whole Genome Sequencing Germline Variant Analysis Pipeline

bio-workflow

Version 1.0

Execution Count 0

Epigenomics

DNA-binding-protein-based-analysis ChIP-seq Analysis Pipeline

bio-workflow

The Bio-Express ChIP-seq Analysis Pipeline is a modular analysis pipeline designed to detect protein-DNA binding sites from Chromatin Immunoprecipitation Sequencing (ChIP-Seq) data. It uses raw FASTQ files as input and provides comprehensive epigenomic binding site calling results, quality assessment, and visualization based on transcription factor binding sites, histone modification regions, and chromatin structure analysis.

After evaluating sequencing quality with FastQC, low-quality base filtering is performed using FASTX-Toolkit, followed by mapping to a reference genome sequence using the Bowtie2 alignment tool to generate SAM format alignment files. Subsequently, the preprocessed alignment files are used to enter the epigenomic signal analysis stage. Statistically significant peak calling is performed using MACS2 (Model-based Analysis of ChIP-Seq) to accurately identify protein-DNA binding sites, providing high-resolution binding regions in narrowPeak format.

Finally, comprehensive downstream analysis is conducted using Homer. The annotatePeaks function provides genomic location annotations and nearby gene information for the detected peaks, and the makeUCSCfile function generates bedGraph and bigWig format visualization files compatible with the UCSC Genome Browser, enabling intuitive visualization of the genome-wide distribution patterns of chromatin immunoprecipitation signals.

> Default genome: hg38

[IMPORTANT] Sample Type Identification Rule:
- Control Files: Must start with "CONTROL_" (Required prefix for automatic identification)
- Treatment/ChIP Files: No specific filename requirements

(e.g.)
CONTROL_input_R1.fastq.gz  # Valid control, Read 1
CONTROL_input_R2.fastq.gz  # Valid control, Read 2
ChIP_H3K4me3_R1.fastq.gz   # Valid treatment, Read 1
ChIP_H3K4me3_R2.fastq.gz   # Valid treatment, Read 2
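The control/treatment naming rule above can be illustrated with a short sketch. This is not the pipeline's own code: the helper names `is_control` and `split_chip_inputs` are hypothetical, and the sketch assumes the "CONTROL_" prefix is matched case-sensitively against the file name only (not the directory path), which the description does not state explicitly.

```python
from pathlib import Path

CONTROL_PREFIX = "CONTROL_"  # prefix documented in the rule above


def is_control(fastq_name: str) -> bool:
    """True if the file name starts with the documented control prefix.

    Assumption: case-sensitive match on the base name only.
    """
    return Path(fastq_name).name.startswith(CONTROL_PREFIX)


def split_chip_inputs(fastq_names):
    """Split FASTQ names into (controls, treatments), preserving order."""
    controls, treatments = [], []
    for name in fastq_names:
        (controls if is_control(name) else treatments).append(name)
    return controls, treatments


# The example files from the description.
controls, treatments = split_chip_inputs([
    "CONTROL_input_R1.fastq.gz",
    "CONTROL_input_R2.fastq.gz",
    "ChIP_H3K4me3_R1.fastq.gz",
    "ChIP_H3K4me3_R2.fastq.gz",
])
```

Because treatment files have no naming requirement, anything without the prefix falls into the treatment group; a control file that is missing the prefix would therefore be treated as a ChIP sample, so it is worth checking that the control list is not empty before submitting.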

Version 1.0

Execution Count 0

Transcriptomics

Single-cell-transcriptomics Single-cell RNA Sequencing Pipeline

bio-workflow

The single-cell RNA sequencing pipeline is designed to analyze gene expression data at the single-cell level using 10x Genomics' Cell Ranger along with the Python-based Scanpy package. It takes raw sequencing data in FASTQ format as input, along with metadata describing the samples from which each library was derived. The pipeline performs a comprehensive set of analytical steps: count matrix generation, data preprocessing, normalization and scaling, dimensionality reduction and clustering, differential gene expression analysis, and functional analysis. As output, it produces a cell-by-gene expression matrix, analytical result tables derived from this matrix, and a variety of visualization plots to support biological interpretation.

Version 2.0

Execution Count 0



Public Analysis Pipelines