Glossary
========

* **Adapter Sequences**
    * Short synthetic DNA/RNA sequences ligated to the ends of DNA/RNA fragments during library preparation for sequencing. They are used for primer binding, sample indexing, and attachment to sequencing platforms. Their presence in sequenced reads can interfere with downstream analysis, so they are typically identified and trimmed during quality control.
* **Ambiguous Bases (N, R, Y, etc.)**
    * In DNA/RNA sequencing, 'N' represents any nucleotide (A, T/U, G, C), indicating an unknown base. 'R' represents a purine (A or G), and 'Y' represents a pyrimidine (C or T/U). These codes are used when a specific base cannot be called definitively at a given position.
* **Annotated Sequences**
    * Sequences (like DNA or RNA) that have had specific biological features (such as genes, coding regions, regulatory elements) identified and described.
* **Assembly Levels (Contig, Scaffold, Chromosome)**
    * These are defined as categories for assembly data.
        * **Contig**: Basic, contiguous sequence blocks.
        * **Scaffold**: Ordered contigs, with potential gaps.
        * **Chromosome**: Chromosome-level, complete or partial, or organelle chromosomes.
    * These terms describe increasing levels of completeness and organization in a genome assembly. A **contig** is a continuous stretch of DNA sequence with no gaps. **Scaffolds** are created by ordering and orienting multiple contigs using various evidence (like paired-end reads), with gaps often represented by Ns. **Chromosome-level** assemblies represent the highest level, where scaffolds are further arranged into complete or nearly complete chromosomal structures.
* **BAM (Binary Alignment Map)**
    * A binary format for storing sequencing reads aligned against a reference genome. It is the compressed binary equivalent of a SAM (Sequence Alignment/Map) file and, once coordinate-sorted, can be indexed for fast retrieval of reads from a given region, making it efficient for storage and access.
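
As a small illustration of the Ambiguous Bases entry above, the sketch below counts ambiguity codes in a sequence. The mapping follows the standard IUPAC nucleotide codes; the function names `ambiguous_base_counts` and `ambiguous_fraction` are hypothetical and not part of any described workflow.

```python
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes and the bases each may stand for.
IUPAC_AMBIGUITY = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def ambiguous_base_counts(sequence: str) -> Counter:
    """Count each ambiguity code in a DNA/RNA sequence (case-insensitive)."""
    # Treat RNA and DNA alike by mapping U to T before counting.
    normalized = sequence.upper().replace("U", "T")
    return Counter(base for base in normalized if base in IUPAC_AMBIGUITY)


def ambiguous_fraction(sequence: str) -> float:
    """Return the fraction of positions that hold an ambiguity code."""
    if not sequence:
        return 0.0
    return sum(ambiguous_base_counts(sequence).values()) / len(sequence)
```

A QC step might, for example, flag sequences whose `ambiguous_fraction` exceeds a chosen threshold; any such threshold would be a local policy decision rather than something defined here.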
* **Biocuration**
    * The process of extracting, interpreting, and integrating biological information from scientific literature and experimental data into structured, standardized databases. It involves expert review and annotation to ensure data quality and usability.
* **CDS (Coding Sequence)**
    * The region of a gene that codes for a protein. It starts with a start codon and ends with a stop codon.
* **Controlled Vocabulary and Ontologies**
    * **Controlled Vocabulary**: A standardized list of terms used to describe data, ensuring consistency and preventing ambiguity.
    * **Ontologies**: More structured than controlled vocabularies, they define concepts and their relationships within a specific domain (e.g., biological processes, cell types), providing a formal framework for knowledge representation and semantic interoperability.
* **CRAM**
    * A compressed sequence alignment format designed to be more efficient than BAM, often achieving higher compression ratios while retaining data integrity.
* **Depth (Assembly Metric)**
    * Refers to sequencing depth or coverage, which is the average number of times a particular nucleotide in the genome has been sequenced. Higher depth generally correlates with higher confidence in the assembled sequence.
* **Duplication Rates (Assembly Metric)**
    * In the context of genome assembly, duplication rates refer to the frequency of redundant or over-represented sequences within the assembly, which can indicate issues like assembly artifacts or repetitive regions being incorrectly collapsed or expanded.
* **EMBL-EBI (European Molecular Biology Laboratory - European Bioinformatics Institute)**
    * A leading center for bioinformatics research and services, hosting a wide range of biological databases and tools, including the European Nucleotide Archive (ENA).
* **FASTA**
    * A text-based format for representing nucleotide or amino acid sequences, in which each sequence is preceded by a single-line description (header).
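
To make the FASTA layout described above concrete, here is a minimal parsing sketch. It assumes the common convention that header lines begin with `>` and that a sequence may wrap over several lines; `parse_fasta` is a hypothetical helper written for illustration, not a reference to any particular tool.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its full sequence."""
    chunks: Dict[str, List[str]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:]
            chunks[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first header line")
        else:
            # Wrapped sequence lines are collected and joined afterwards.
            chunks[header].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}
```

Passing an open file handle works as well as a list of strings, since both are iterables of lines.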
* **FASTQ**
    * A text-based format for storing both nucleotide sequences and their corresponding quality scores, typically used for raw sequencing reads.
* **Feature Table V11.3 (keys, location, qualifiers)**
    * A standardized specification by INSDC that defines how biological features (like genes, promoters, coding sequences) should be annotated and described within sequence data. It includes defined "keys" (types of features, e.g., "gene", "CDS"), their "location" on the sequence, and "qualifiers" (additional descriptive information, e.g., "product", "note").
* **Flat Files**
    * Simple text files that store data in a plain, unformatted way, often with one record per line and fields separated by delimiters (like tabs or commas), or in a fixed-width format. In bioinformatics, they can be used for various data types, including sequence annotations or metadata.
* **GFF3 (General Feature Format version 3)**
    * A file format used to describe genomic features (like genes, exons, CDS) and their locations on a sequence. It is designed to be a standardized, hierarchical, and extensible format for genome annotation.
* **Genome Assemblies**
    * The process of taking a large number of short DNA sequencing reads and putting them back together to create a representation of the original long DNA sequences (e.g., a chromosome or entire genome). The "assembly" itself is the reconstructed sequence.
* **IBDC-INDA (Indian Biological Data Centre - Indian Nucleotide Data Archive)**
    * The specific biological data center and nucleotide data archive for which the described workflow is implemented.
* **INSDC (International Nucleotide Sequence Database Collaboration)**
    * A collaboration between the National Center for Biotechnology Information (NCBI) in the USA, the European Bioinformatics Institute (EMBL-EBI) in Europe, and the DNA Data Bank of Japan (DDBJ) in Japan. They maintain a shared set of international standards for nucleotide sequence data, ensuring global consistency and accessibility.
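
The FASTQ entry above pairs each read with per-base quality scores. The sketch below shows one way such records might be read, assuming the widely used four-line layout (`@name`, sequence, `+`, quality string) and Phred+33 quality encoding. Both are assumptions about the input rather than facts stated in this glossary, and `read_fastq` is a hypothetical helper.

```python
from typing import Iterable, Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger-style) quality encoding


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_name, sequence, phred_scores) from four-line FASTQ records."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(stripped), 4):
        header, sequence, separator, quality = stripped[i:i + 4]
        record_number = i // 4 + 1
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record {record_number}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in record {record_number}")
        # Each quality character encodes one Phred score as an ASCII offset.
        scores = [ord(ch) - PHRED_OFFSET for ch in quality]
        yield header[1:], sequence, scores
```

The length of the yielded read name can also be inspected here, which is one place a read-name-length check could plug in.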
* **Interoperability**
    * The ability of different information systems, devices, or applications to connect, communicate, and exchange data in a coherent and transparent manner, without special effort from the end user. In data curation, it means data can be easily shared and understood across different platforms and analyses.
* **md5sum check**
    * MD5 is a checksum algorithm that produces a 128-bit hash, usually written as a 32-character hexadecimal string; `md5sum` is the common command-line utility that computes it for a file. The hash acts as a digital fingerprint of the file contents. An md5sum check verifies file integrity by comparing the computed hash of a file with a previously recorded hash; if they match, the file has almost certainly not been altered or corrupted in transfer or storage.
* **Metadata**
    * "Data about data." It provides descriptive information about a dataset, such as who created it, when, what standards were used, the biological context (e.g., organism, sample collection details), and other relevant details, which are crucial for understanding and reusing the primary data.
* **N50/L50 (Assembly Metrics)**
    * Common metrics to assess the quality and contiguity of a genome assembly.
        * **N50**: The length such that 50% of the total assembly length is contained in contigs/scaffolds of this length or greater. A higher N50 indicates a more contiguous assembly.
        * **L50**: The minimum number of contigs/scaffolds whose lengths sum up to at least 50% of the total assembly length. A lower L50 indicates a more contiguous assembly.
* **NCBI (National Center for Biotechnology Information)**
    * A part of the United States National Library of Medicine, NCBI is a major resource for bioinformatics tools and biological information, hosting many databases including GenBank (for DNA sequences) and the NCBI Taxonomy database.
* **Nucleotide Data Archival**
    * The process of collecting, organizing, storing, and preserving DNA and RNA sequence data (nucleotide data) in stable, publicly accessible databases for long-term use and future research.
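
The N50 and L50 definitions above can be sketched as a single pass over contig or scaffold lengths sorted from longest to shortest. This is a minimal illustration of the definitions as worded in this glossary; the function name `n50_l50` is hypothetical.

```python
from typing import Iterable, Tuple


def n50_l50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of contig/scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # 'length' is the N50; 'count' sequences were needed, so it is the L50.
            return length, count
    raise AssertionError("unreachable: the running total always reaches half")
```

For lengths 80, 70, 50, 40, 30, 20 and 10 (total 300), the two longest sequences already cover 150 bases, so the N50 is 70 and the L50 is 2.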
* **QC (Quality Control)**
    * A process by which the quality of all factors involved in production is inspected. In this context, it refers to systematic checks performed at various stages of data processing to ensure that the data meets predefined standards for accuracy, integrity, and usability.
* **Read Name Length**
    * The number of characters in the identifier (name) given to an individual sequencing read. This length can be critical for compatibility with various bioinformatics tools and databases, some of which limit how long a read name may be.
* **run-ids (Run Identifiers)**
    * Unique identifiers assigned to individual sequencing runs or experiments. These IDs are crucial for tracing the raw data (reads) that were used to generate a particular assembly.
* **Start/Stop Codons**
    * Specific three-nucleotide sequences (codons) in an mRNA molecule that signal the beginning (start codon, typically AUG) and end (stop codons, UAA, UAG, UGA) of protein translation. They are fundamental for gene annotation.
* **Taxon IDs / Taxon REST API / Taxonomy API / Taxonomy Validation**
    * **Taxonomy Validation**: The process of confirming the correct scientific classification (taxonomy) of an organism associated with submitted biological data.
    * **Taxon IDs**: Unique numerical identifiers assigned to each taxonomic unit (e.g., species, genus) in a taxonomic database like NCBI Taxonomy.
    * **Taxonomy API / Taxon REST API**: Application Programming Interfaces provided by databases like NCBI that allow automated access to their taxonomic data for validation and lookup purposes.
* **Translation (Annotation QC)**
    * In the context of gene annotation, "translation" refers to the process by which the genetic information encoded in messenger RNA (mRNA) is converted into a sequence of amino acids to form a protein. Annotation QC ensures that the predicted protein sequences from coding regions are biologically plausible.
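
The start/stop codon and translation entries above suggest a few simple plausibility checks on an annotated coding sequence. The sketch below is one possible set of checks, assuming the standard genetic code with ATG as the only accepted start codon; real annotation QC may accept alternative start codons or other translation tables, and `check_cds` is a hypothetical helper, not the workflow's actual validator.

```python
from typing import List

START_CODONS = {"ATG"}                # assumption: standard start codon only
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard stop codons (DNA alphabet)


def check_cds(cds: str) -> List[str]:
    """Return a list of problems found in a coding sequence (empty if none)."""
    seq = cds.upper().replace("U", "T")  # accept mRNA input as well as DNA
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of three")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons or codons[0] not in START_CODONS:
        problems.append("missing start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems
```

An empty list means the sequence passed these particular checks; it does not by itself show that the predicted protein is biologically correct.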