IBDC-INDA Curation and Data Validation Workflow

The IBDC-INDA curation and data validation workflow ensures submission of high-quality, standardized biological sequence data through rigorous, multi-step validation.

Overview

  • The IBDC-INDA pipeline validates raw sequence data, annotated sequences (features), and genome assemblies in adherence to INSDC and EMBL-EBI guidelines for nucleotide data archival [1] [2].

  • Taxonomy entry validation is assisted by the NCBI Taxonomy API.

  • Metadata fields follow controlled vocabularies and ontologies, ensuring global interoperability and scientific rigor.


Types of Data Submission to IBDC-INDA


1. Raw Data Validation

Raw sequencing files: FASTQ, BAM, CRAM (single, paired, long-read). Data originates directly from sequencing platforms.

Step                      Description
Taxonomy Validation       Cross-verify taxon IDs and scientific names using the NCBI Taxonomy REST API
Missing Value Reporting   Use INSDC standardized missing value protocols for missing metadata
Metadata Standards        Apply ENA sample checklists for metadata compliance with the minimum standards defined by INSDC
Format & Integrity        Automated checks of file structure, md5sum, mandatory fields, and file size
Read Name Length          Ensure read names are ≤256 characters, as per international standards
Base Counting             Frequencies for A, T, G, C (DNA/RNA), ambiguous bases (N, R, Y, etc.), and total base count
Quality Assessment        Check read quality, read statistics, adapter sequences, and length distribution
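
The integrity, read-name, and base-counting checks listed for raw data can be illustrated with a short Python sketch. It is not the IBDC-INDA implementation: the function names are hypothetical, the FASTQ input is assumed to use the plain four-line record layout, and the read name is assumed to be the first whitespace-delimited token after "@".

```python
import hashlib
from collections import Counter

MAX_READ_NAME_LENGTH = 256  # limit stated for the Read Name Length step


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan_fastq(lines):
    """Scan FASTQ text lines and return (base_counts, over_long_read_names).

    Assumes four lines per record (header, sequence, '+', qualities).
    """
    base_counts = Counter()
    long_names = []
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        position = index % 4
        if position == 0:  # header line: '@' followed by the read name
            parts = line[1:].split()
            name = parts[0] if parts else ""
            if len(name) > MAX_READ_NAME_LENGTH:
                long_names.append(name)
        elif position == 1:  # sequence line: count every base symbol
            base_counts.update(line.upper())
    return base_counts, long_names
```

The digest returned by md5_of_file can be compared with the checksum supplied by the submitter, and sum(base_counts.values()) gives the total base count.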


2. Annotated Sequence Validation

All submissions containing annotated features (genes, CDS, regulatory elements) together with sample, feature, and taxonomy metadata.

Step               Description
Feature QC         Features validated against INSDC Feature Table V11.3 (keys, location, qualifiers)
Annotation QC      Verify gene models, translation, overlaps, start/stop codons
Taxonomy Context   Confirm annotation linkage to correct taxon
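
As an illustration of the start/stop codon part of Annotation QC, the following Python sketch checks a single CDS nucleotide sequence. It assumes the standard genetic code and ATG as the only start codon; the real validator may accept alternative translation tables and partial features, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
START_CODONS = {"ATG"}               # alternative start codons are ignored here


def check_cds(sequence):
    """Return a list of problems found in a CDS nucleotide sequence."""
    seq = sequence.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Split into complete codons, dropping any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons or codons[0] not in START_CODONS:
        problems.append("missing start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems
```

An empty list means the sequence passed these three checks; it says nothing about overlaps or gene-model consistency, which need the full annotation.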

3. Assembly Data Validation

Assembly Types

Assembly Type            Detailed Types
Genome Assembly          Includes plasmids, organelles, complete viral genomes, viral segments/replicons, bacteriophages, prokaryotic and eukaryotic genomes.
Transcriptome Assembly   Includes isolate and metatranscriptome assemblies

Assembly Levels

  • Contig: Basic, contiguous sequence blocks

  • Scaffold: Ordered contigs, potential gaps

  • Chromosome: Chromosome-level, complete or partial, organelle chromosomes.

Step                     Description
Format and Metadata QC   Validate FASTA, GFF3, flat files, unique naming, study-sample linkage
Taxonomy Validation      Confirm sample taxid with NCBI API
Assembly Metrics         N50/L50, total length, depth, duplication rates
Raw Read Mapping         Link the assembly to the submitted run IDs (if available)
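
For the Assembly Metrics step, N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly, and L50 is the number of contigs in that set. A minimal Python sketch, with a hypothetical function name:

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    if total == 0:
        raise ValueError("assembly contains no sequence")
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        # Stop at the first contig that brings the cumulative length to half the total.
        if 2 * running >= total:
            return length, count
```

For example, contigs of lengths 80, 70, 50, 40, 30, and 20 total 290 bases; the two longest cover 150, so N50 is 70 and L50 is 2.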

Annotated Assembly QC

  • GFF/EMBL Conversion: Consistent format conversion validated for scaffolds and chromosomes.

  • Annotation Integrity: Validation of gene and feature annotations across all assembly levels.

  • Contextual Metadata: Standardized sample and feature metadata throughout the submission.

  • Common errors encountered during annotated assembly submission:
    1. ERROR: “intron” Features locations are duplicated - consider merging qualifiers. [ line: 37201 of abc.embl.gz, line: 37162 of abc.embl.gz]

    2. Abutting features cannot be adjacent between neighbouring exons. [ line: 50620 of abc.embl.gz]

    3. “intron” usually expected to be at least “10” nt long. Please check the accuracy. [ line: 17559548 of abc.embl.gz]
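
Two of the messages above (duplicated intron locations and introns shorter than 10 nt) can be caught before submission with a simple pre-check. The Python sketch below is not the validator's own logic: it assumes features have already been parsed into hypothetical (key, start, end) tuples with 1-based inclusive coordinates on a single sequence.

```python
from collections import Counter

MIN_INTRON_LENGTH = 10  # threshold quoted in the validator message


def check_introns(features):
    """Return warnings for duplicated or very short intron features.

    features: iterable of (key, start, end) with 1-based inclusive coordinates.
    """
    messages = []
    introns = [(start, end) for key, start, end in features if key == "intron"]
    # Report any intron location that appears more than once.
    for (start, end), occurrences in Counter(introns).items():
        if occurrences > 1:
            messages.append(f"duplicate intron location {start}..{end}")
    # Report introns shorter than the minimum expected length.
    for start, end in sorted(set(introns)):
        if end - start + 1 < MIN_INTRON_LENGTH:
            messages.append(f"short intron {start}..{end}")
    return messages
```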

Taxonomy Validation Using the NCBI API

  • Automated/Manual Lookup: Validates taxon ID and scientific name.

  • Integration: Ensures globally recognized taxonomy and organism linkage for every record submitted to IBDC-INDA.
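
A taxon ID and scientific name pair could be cross-checked roughly as follows. The document does not say which NCBI endpoint IBDC-INDA calls, so this Python sketch assumes the public E-utilities esummary service and assumes that the returned record carries the name in a "scientificname" field; both function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: NCBI E-utilities esummary (not confirmed by the source).
ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def fetch_taxon_summary(taxid):
    """Fetch the summary record for one taxon ID (performs a network call)."""
    query = urllib.parse.urlencode(
        {"db": "taxonomy", "id": str(taxid), "retmode": "json"}
    )
    with urllib.request.urlopen(f"{ESUMMARY_URL}?{query}", timeout=30) as response:
        payload = json.load(response)
    return payload.get("result", {}).get(str(taxid), {})


def name_matches(record, submitted_name):
    """Compare the submitted name with the record, ignoring case and extra spaces."""
    def normalize(name):
        return " ".join(name.split()).casefold()

    official = record.get("scientificname", "")  # field name is an assumption
    return bool(official) and normalize(official) == normalize(submitted_name)
```

Keeping the comparison separate from the network call lets the matching rule be tested offline, for example name_matches(fetch_taxon_summary(9606), "Homo sapiens") in production and a plain dictionary in tests.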

Controlled Vocabulary

  • INSDC controlled vocabulary and ontologies applied throughout all submission entities for data harmonization and interoperability.
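
A controlled-vocabulary check, combined with the missing-value reporting mentioned under Raw Data Validation, might look like the Python sketch below. The missing-value set is an illustrative subset of the INSDC terms rather than the full list, and the field names and allowed values in the usage example are hypothetical.

```python
# Illustrative subset of INSDC missing-value terms (assumption: not exhaustive).
MISSING_VALUE_TERMS = {
    "not applicable",
    "not collected",
    "not provided",
    "restricted access",
}


def find_vocabulary_violations(metadata, allowed_values):
    """Return {field: value} for values outside the controlled vocabulary.

    metadata: dict mapping field name to submitted value.
    allowed_values: dict mapping field name to the set of permitted values.
    Fields without an entry in allowed_values are not checked.
    """
    violations = {}
    for field, allowed in allowed_values.items():
        value = metadata.get(field, "").strip()
        if value in allowed or value.lower() in MISSING_VALUE_TERMS:
            continue
        violations[field] = value
    return violations
```

For instance, with allowed_values = {"sex": {"male", "female"}}, a sample reporting "not collected" passes, while "unknown" or an empty value is flagged for correction.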

Update Cycle

  • Standards, sample checklists, and metadata requirements are updated in line with the latest INSDC announcements.

Submission Feedback

  • Automated Reporting: Failed QC checks trigger feedback reports for submitter correction and resubmission.

  • Iterative Corrections: Enables robust, reproducible data submission practices.

This documentation ensures clarity, comprehensive validation, and transparent feedback for all stages of IBDC-INDA biocuration and submission. Each step aligns with recognized standards for interoperability and long-term data preservation.

References