IBDC-INDA Curation and Data Validation Workflow
The IBDC-INDA curation and data validation workflow ensures submission of high-quality, standardized biological sequence data through rigorous, multi-step validation.
Overview
The IBDC-INDA pipeline validates raw sequence data, annotated sequences (features), and genome assemblies in adherence to INSDC and EMBL-EBI guidelines for nucleotide data archival [1] [2].
Taxonomy entry validation is assisted by the NCBI Taxonomy API.
Metadata fields follow the INSDC controlled vocabulary and ontologies, ensuring global interoperability and scientific rigor.
Types of Data Submission to IBDC-INDA
1. Raw Data Validation
Raw sequencing files: FASTQ, BAM, CRAM (single, paired, long-read). Data originates directly from sequencing platforms.
| Step | Description |
|---|---|
| Taxonomy Validation | Cross-verify taxon IDs and scientific names using the NCBI Taxon REST API |
| Missing Value Reporting | Use of INSDC standardized missing value protocols for missing metadata |
| Metadata Standards | ENA sample checklists for metadata compliance as per minimum standards defined by INSDC |
| Format & Integrity | Automated checks of file structure, md5sum, mandatory fields, and file size |
| Read Name Length | Ensure read names are ≤256 characters as per international standards |
| Base Counting | Frequencies for A, T, G, C (DNA/RNA), ambiguous bases (N, R, Y, etc.), and total base count |
| Quality Assessment | Check read quality, read statistics, adapter sequences, and length distribution |
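The integrity, read-name, and base-counting steps can be illustrated with a minimal sketch. It is not the IBDC-INDA implementation: the function names `md5_of_bytes` and `check_fastq` are hypothetical, and the code assumes uncompressed FASTQ text in the common four-line record layout.

```python
import hashlib
from collections import Counter

MAX_READ_NAME_LENGTH = 256  # limit quoted in the validation steps above


def md5_of_bytes(data: bytes) -> str:
    """Return the hex MD5 digest, comparable with `md5sum` output."""
    return hashlib.md5(data).hexdigest()


def check_fastq(text: str) -> dict:
    """Count bases and flag over-long read names in FASTQ text.

    Assumption: each record is exactly four lines
    (header, sequence, '+' separator, qualities).
    """
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text is not a multiple of four lines")

    counts = Counter()
    long_names = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # The read name is the first whitespace-delimited token after '@'.
        fields = header[1:].split()
        name = fields[0] if fields else ""
        if len(name) > MAX_READ_NAME_LENGTH:
            long_names.append(name)
        counts.update(seq.upper())  # tallies A/C/G/T and ambiguity codes alike

    return {
        "base_counts": dict(counts),
        "total_bases": sum(counts.values()),
        "long_read_names": long_names,
    }
```

Calling `check_fastq` on the text of a small FASTQ file returns per-base frequencies, the total base count, and any read names exceeding the limit. A production check would stream compressed files rather than hold the whole text in memory.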
2. Annotated Sequence Validation
All submissions containing annotated features (genes, CDS, regulatory elements) together with sample, feature, and taxonomy metadata.
| Step | Description |
|---|---|
| Feature QC | Features validated against INSDC Feature Table V11.3 (keys, location, qualifiers) |
| Annotation QC | Verify gene models, translation, overlaps, start/stop codons |
| Taxonomy Context | Confirm annotation linkage to correct taxon |
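As an illustration of the start/stop codon part of Annotation QC, the sketch below checks a single CDS nucleotide sequence. It is a simplified example, not the pipeline's validator: the function name `check_cds` is hypothetical, and it assumes the standard genetic code with `ATG` as the only start codon, which does not hold for every organism or translation table.

```python
from typing import List

# Assumption: standard genetic code only; alternative start codons
# and other translation tables are deliberately ignored here.
START_CODONS = {"ATG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(seq: str) -> List[str]:
    """Return a list of problems found in a CDS; an empty list means it passed."""
    problems = []
    seq = seq.upper()
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")

    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]

    if not codons or codons[0] not in START_CODONS:
        problems.append("missing start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems
```

Returning a list of messages rather than raising on the first failure lets a feedback report show every problem for a feature at once.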
3. Assembly Data Validation
Assembly Types
| Assembly Type | Detailed Types |
|---|---|
| Genome Assembly | Includes plasmids, organelles, complete viral genomes, viral segments/replicons, bacteriophages, prokaryotic and eukaryotic genomes. |
| Transcriptome Assembly | Includes isolate and metatranscriptome assemblies |
Assembly Levels
Contig: Basic, contiguous sequence blocks
Scaffold: Ordered contigs, potential gaps
Chromosome: Chromosome-level sequences, complete or partial, including organelle chromosomes.
| Step | Description |
|---|---|
| Format and Metadata QC | Validate FASTA, GFF3, flat files, unique naming, study-sample linkage |
| Taxonomy Validation | Confirm sample taxid with NCBI API |
| Assembly Metrics | N50/L50, total length, depth, duplication rates |
| Raw Read Mapping | Linking of the assembly with submitted run IDs (if available) |
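The N50/L50 metrics listed under Assembly Metrics follow the usual definition: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length, and L50 is the number of sequences in that set. A minimal sketch, with the hypothetical function name `n50_l50` and contig lengths supplied as plain integers:

```python
from typing import Iterable, Tuple


def n50_l50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")

    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Stop at the first sequence that brings coverage to >= 50 %.
        if running * 2 >= total:
            return length, count
    raise AssertionError("unreachable: cumulative sum always reaches the total")
```

For lengths 80, 70, 50, 40, 30, and 20 (total 290), the two longest sequences cover 150 bases, so N50 is 70 and L50 is 2.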
Annotated Assembly QC
GFF/EMBL Conversion: Consistent format conversion validated for scaffolds and chromosomes.
Annotation Integrity: Validation of gene and feature annotations across all assembly levels.
Contextual Metadata: Standardized sample and feature metadata throughout the submission.
- Common errors encountered during annotated assembly submission:
  - ERROR: “intron” Features locations are duplicated - consider merging qualifiers. [ line: 37201 of abc.embl.gz, line: 37162 of abc.embl.gz]
  - Abutting features cannot be adjacent between neighbouring exons. [ line: 50620 of abc.embl.gz]
  - “intron” usually expected to be at least “10” nt long. Please check the accuracy. [ line: 17559548 of abc.embl.gz]
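Each message above ends with one or more `line: N of FILE` references. When a report contains many such messages, it can help to extract those references programmatically. The sketch below is an assumption based only on the message format shown here; the helper name `extract_line_refs` is hypothetical.

```python
import re
from typing import List, Tuple

# Matches references of the form "line: 37201 of abc.embl.gz".
LINE_REF = re.compile(r"line:\s*(\d+)\s+of\s+([^\s,\]]+)")


def extract_line_refs(message: str) -> List[Tuple[str, int]]:
    """Return (file name, line number) pairs found in a validation message."""
    return [(file_name, int(number)) for number, file_name in LINE_REF.findall(message)]
```

The resulting pairs can be sorted or grouped by file to locate the affected feature entries in the flat file.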
Taxonomy Validation Using the NCBI API
Automated/Manual Lookup: Validates taxon ID and scientific name.
Integration: Ensures globally recognized taxonomy and organism linkage for every record submitted to IBDC-INDA.
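The cross-check between a submitted taxon ID and scientific name can be sketched as follows. This is an illustration only: the call to the NCBI service is not shown, and the `lookup` parameter, the `validate_taxonomy` function, and the small `STUB_TAXONOMY` table standing in for the remote response are all hypothetical.

```python
from typing import Callable, List, Optional


def normalise(name: str) -> str:
    """Collapse repeated whitespace so spacing differences are ignored."""
    return " ".join(name.split())


def validate_taxonomy(
    tax_id: int,
    scientific_name: str,
    lookup: Callable[[int], Optional[str]],
) -> List[str]:
    """Compare a submitted name with the name the lookup returns for tax_id.

    `lookup` is an injected function returning the official scientific
    name, or None when the taxon ID is unknown.
    """
    official = lookup(tax_id)
    if official is None:
        return [f"taxon ID {tax_id} not found"]
    if normalise(scientific_name) != normalise(official):
        return [
            f"scientific name '{scientific_name}' does not match "
            f"'{official}' for taxon ID {tax_id}"
        ]
    return []


# Stand-in for the remote taxonomy service (assumption, for illustration).
STUB_TAXONOMY = {9606: "Homo sapiens", 562: "Escherichia coli"}
stub_lookup = STUB_TAXONOMY.get
```

Injecting the lookup keeps the comparison logic testable offline; in practice the function passed in would wrap the taxonomy service query.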
Controlled Vocabulary
INSDC controlled vocabulary and ontologies applied throughout all submission entities for data harmonization and interoperability.
Update Cycle
Standards, sample checklists, and metadata requirements are updated as per latest INSDC announcements.
Submission Feedback
Automated Reporting: Failed QC checks trigger feedback reports for submitter correction and resubmission.
Iterative Corrections: Enables robust, reproducible data submission practices.
This documentation ensures clarity, comprehensive validation, and transparent feedback for all stages of IBDC-INDA biocuration and submission. Each step aligns with recognized standards for interoperability and long-term data preservation.