IBDC-INDA Curation and Data Validation Workflow
===============================================

The IBDC-INDA curation and data validation workflow ensures submission of high-quality, standardized biological sequence data through rigorous, multi-step validation.

Overview
--------

- The IBDC-INDA pipeline validates raw sequence data, annotated sequences (features), and genome assemblies, in adherence to INSDC and EMBL-EBI guidelines for nucleotide data archival [1]_ [2]_.
- Taxonomy entry validation is assisted by the NCBI Taxonomy API.
- Metadata fields follow the controlled vocabulary and ontologies, ensuring global interoperability and scientific rigor.

------------------------------------------------

Type of Data Submission to IBDC-INDA
------------------------------------

.. image:: _images/datatypes.png
   :width: 500px
   :align: center

-----------------------------------------------

1. Raw Data Validation
~~~~~~~~~~~~~~~~~~~~~~

Raw sequencing files: FASTQ, BAM, CRAM (single, paired, long-read). Data originates directly from sequencing platforms.

.. image:: _images/raw-data.svg
   :width: 600px
   :align: center

.. list-table::
   :header-rows: 1
   :widths: 20 10

   * - Step
     - Description
   * - Taxonomy Validation
     - Cross-verify taxon IDs and scientific names using the NCBI Taxonomy REST API
   * - Missing Value Reporting
     - Use INSDC standardized missing value terms for missing metadata
   * - Metadata Standards
     - ENA sample checklists [3]_ for metadata compliance with the minimum standards defined by INSDC
   * - Format & Integrity
     - Automated checks of file structure, md5sum, mandatory fields, and file size
   * - Read Name Length
     - Ensure read names are ≤256 characters, as per international standards
   * - Base Counting
     - Frequencies for A, T, G, C (DNA/RNA), ambiguous bases (N, R, Y, etc.), and total base count
   * - Quality Assessment
     - Check read quality, read statistics, adapter sequences, and length distribution

-----------------------------------------------

2. Annotated Sequence Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Submissions containing annotated sequences (genes, CDS, regulatory elements) together with sample, feature, and taxonomy metadata.

.. image:: _images/ann-seq.svg
   :width: 600px
   :align: right

.. list-table::
   :header-rows: 1
   :widths: 20 30

   * - Step
     - Description
   * - Feature QC
     - Features validated against INSDC Feature Table V11.3 [4]_ (keys, location, qualifiers)
   * - Annotation QC
     - Verify gene models, translation, overlaps, and start/stop codons
   * - Taxonomy Context
     - Confirm annotation linkage to the correct taxon

3. Assembly Data Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Assembly Types
^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 30 40

   * - Assembly Type
     - Detailed Types
   * - Genome Assembly
     - Includes plasmids, organelles, complete viral genomes, viral segments/replicons, bacteriophages, and prokaryotic and eukaryotic genomes
   * - Transcriptome Assembly
     - Includes isolate and metatranscriptome assemblies

Assembly Levels
^^^^^^^^^^^^^^^

- **Contig:** Basic, contiguous sequence blocks.
- **Scaffold:** Ordered contigs, potentially with gaps.
- **Chromosome:** Chromosome-level sequences, complete or partial, including organelle chromosomes.

.. image:: _images/assembly.svg
   :width: 600px
   :align: center

.. list-table::
   :header-rows: 1
   :widths: 30 30

   * - Step
     - Description
   * - Format and Metadata QC
     - Validate FASTA, GFF3, and flat files, unique naming, and study-sample linkage
   * - Taxonomy Validation
     - Confirm the sample taxon ID with the NCBI API
   * - Assembly Metrics
     - N50/L50, total length, depth, duplication rates
   * - Raw Read Mapping
     - Link the assembly to the submitted run IDs (if available)

Annotated Assembly QC
^^^^^^^^^^^^^^^^^^^^^

- **GFF/EMBL Conversion:** Consistent format conversion validated for scaffolds and chromosomes.
- **Annotation Integrity:** Validation of gene and feature annotations across all assembly levels.
- **Contextual Metadata:** Standardized sample and feature metadata throughout the submission.
- **Common errors encountered during annotated assembly submission:**

  1. ERROR: "intron" Features locations are duplicated - consider merging qualifiers. [ line: 37201 of abc.embl.gz, line: 37162 of abc.embl.gz]
  2. Abutting features cannot be adjacent between neighbouring exons. [ line: 50620 of abc.embl.gz]
  3. "intron" usually expected to be at least "10" nt long. Please check the accuracy. [ line: 17559548 of abc.embl.gz]

Taxonomy Validation Using the NCBI API
--------------------------------------

- **Automated/Manual Lookup:** Validates the taxon ID and scientific name.
- **Integration:** Ensures globally recognized taxonomy and organism linkage for every record submitted to IBDC-INDA.

Controlled Vocabulary
---------------------

- INSDC controlled vocabulary and ontologies are applied throughout all submission entities for data harmonization and interoperability.

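
As a rough illustration of how a controlled-vocabulary check with missing-value handling might look, the sketch below validates a metadata record against a small set of allowed terms. It is not the IBDC-INDA implementation: the field names (``assembly_level``, ``library_layout``), the function ``check_controlled_fields``, and the short list of missing-value terms are assumptions made for this example, and the authoritative terms should be taken from the current INSDC and ENA checklists.

```python
# Minimal sketch of a controlled-vocabulary check (illustrative only).
# Field names and term lists are assumptions for this example, not the
# actual IBDC-INDA or INSDC checklist definitions.

# A few INSDC-style missing-value terms (assumed subset, not exhaustive).
MISSING_VALUE_TERMS = {
    "not applicable",
    "not collected",
    "not provided",
    "restricted access",
}

# Hypothetical controlled fields and their allowed values.
CONTROLLED_FIELDS = {
    "assembly_level": {"contig", "scaffold", "chromosome"},
    "library_layout": {"single", "paired"},
}


def _normalise(value):
    """Return a lower-cased, stripped string; None becomes an empty string."""
    return "" if value is None else str(value).strip().lower()


def check_controlled_fields(record, mandatory=("assembly_level",)):
    """Return a list of human-readable problems found in one metadata record."""
    problems = []

    # Mandatory fields must carry a value or an explicit missing-value term.
    for field in mandatory:
        if not _normalise(record.get(field)):
            problems.append(f"{field}: mandatory field is empty")

    # Non-empty controlled fields must use an allowed or missing-value term.
    for field, allowed in CONTROLLED_FIELDS.items():
        value = _normalise(record.get(field))
        if not value or value in MISSING_VALUE_TERMS:
            continue
        if value not in allowed:
            problems.append(
                f"{field}: '{value}' is not in the controlled vocabulary"
            )
    return problems


if __name__ == "__main__":
    sample = {"assembly_level": "draft", "library_layout": "not provided"}
    for problem in check_controlled_fields(sample):
        print(problem)
```

A record that passes returns an empty list, so the same function could feed a feedback report by collecting the returned messages per record.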
Update Cycle
------------

- Standards, sample checklists, and metadata requirements are updated in line with the latest INSDC announcements.

Submission Feedback
-------------------

- **Automated Reporting:** Failed QC checks trigger feedback reports for submitter correction and resubmission.
- **Iterative Corrections:** Enables robust, reproducible data submission practices.

This documentation describes validation and feedback at every stage of IBDC-INDA biocuration and submission. Each step aligns with recognized standards for interoperability and long-term data preservation.

References
==========

.. [1] Minimum Metadata Standards: https://www.insdc.org/news/insdc-spatiotemporal-metadata-minimum-standards-update-03-03-2023/
.. [2] Genome Assembly Standards: https://www.insdc.org/submitting-standards/insdc-standards-genome-assembly-submission/
.. [3] Sample Checklists: https://www.ebi.ac.uk/ena/browser/checklists
.. [4] Feature Table: https://www.insdc.org/submitting-standards/feature-table/