Study Complete Details




Project Accession: IBIAP_1000000009
Title: High-resolution AI image dataset for diagnosing oral submucous fibrosis and squamous cell carcinoma
Description: Oral cancer is a global health challenge, and the accurate histopathological interpretation of oral cancer tissue samples remains difficult. Early diagnosis is particularly challenging because of a shortage of experienced pathologists and inter-observer variability in diagnosis. Applying artificial intelligence (deep learning algorithms) to oral cancer histology images is promising for rapid diagnosis, but building AI models requires a high-quality annotated dataset. We present ORCHID (ORal Cancer Histology Image Database), a specialized database generated to advance research in AI-based histology image analytics of oral cancer and precancer. The ORCHID database is an extensive multicenter collection of high-resolution images captured at 1000X effective magnification (100X objective lens), covering oral cancer and precancer categories such as oral submucous fibrosis (OSMF) and oral squamous cell carcinoma (OSCC). It also contains grade-level sub-classifications for OSCC: well-differentiated (WD), moderately differentiated (MD), and poorly differentiated (PD). The database aims to aid the development of AI-based rapid diagnostics for OSMF and OSCC, along with their subtypes.
Publications: https://doi.org/10.1038/s41597-024-03836-6
Funding agency: N/A
Grant Number: N/A
Ethics Statement: Download
Any Other Information: The original version of this dataset is available at Zenodo (https://zenodo.org/records/12636426; https://zenodo.org/records/12646943). The Zenodo citations for the training set and the test/validation set are, respectively: Chaudhary, N. et al. High-resolution AI image dataset for diagnosing oral submucous fibrosis and squamous cell carcinoma. Zenodo. https://doi.org/10.5281/zenodo.12636426 (2024), and Chaudhary, N., & Ahmad, T. Validation and Test Datasets for “High-resolution AI image dataset for diagnosing oral submucous fibrosis and squamous cell carcinoma”. Zenodo. https://doi.org/10.5281/zenodo.12646943 (2024).
Additional File: N/A
Acknowledgments: N.C. is the recipient of a senior research fellowship from the Indian Council of Medical Research (3/1/2(1)/Oral/2021-NCD-II), New Delhi, India. This work was also supported by the Science and Engineering Research Board (CRG/2020/002294) and the Indian Council of Medical Research (ICMR) (GIA/2019/000274/PRCGIA (Ver-1)), New Delhi, India. We also acknowledge the computing support from the Mphasis F1 Foundation and the Center for Bioinformatics and Computational Biology (B.I.C.) (BT/PR40220/BTIS/137/22/2021) facility at Ashoka University. We thank Farhat Zeba and Sumra Khan for their help with the imaging.

Sr.No First name Last name Email Organization Designation
1 Nisha Chaudhary N/A Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India Research Scholar
2 Arpita Rai N/A Rajendra Institute of Medical Sciences, Ranchi, Jharkhand, India Research Scholar
3 Aakash Rao N/A Department of Computer Science, Ashoka University, Sonipat, Haryana, India Research Scholar
4 Md Faizan N/A Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India Research Scholar
5 Jeyaseelan Augustine N/A Maulana Azad Institute of Dental Sciences, New Delhi, India Research Scholar
6 Akhilanand Chaurasia N/A King George Medical University, Lucknow, Uttar Pradesh, India Research Scholar
7 Deepika Mishra N/A All India Institute of Medical Sciences, New Delhi, India Research Scholar
8 Akhilesh Chandra N/A Banaras Hindu University, Banaras, Uttar Pradesh, India Research Scholar
9 Varnit Chauhan N/A Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India Research Scholar
10 Tanveer Ahmad tahmad7@jmi.ac.in Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India Principal Investigator

Study Accession: HISTOS_1000000013
Title: High-resolution AI image dataset for diagnosing oral submucous fibrosis and squamous cell carcinoma
Imaging Type: Histopathology (HISTO)
Imaging Sub-type: Diagnostic Pathology
Summary: The images are organized into five classes (folders): Normal, OSMF, WDOSCC, MDOSCC, and PDOSCC. Each class folder consists of subfolders representing different tissue slides collected from different patients. We have made an initial attempt to provide a comprehensive image database for two of the most prominent oral conditions, OSCC and OSMF. We believe that more such databases will be made publicly available in the near future. These comprehensive image databases will facilitate the development of accurate AI-based diagnostic tools for oral diseases, ultimately improving patient care and outcomes in oral healthcare. In the future, integrating databases of molecular markers, transcriptome, metabolome, and other biomarkers with oral histological images through advanced AI-driven imaging techniques holds great promise for improving diagnostic accuracy and precision. This potential has already been observed in the diagnosis of lung and breast cancers. Such an expansion will aid in developing a more comprehensive AI-driven diagnostic tool.
Keywords: Oral cancer; Oral submucous fibrosis (OSMF); Oral squamous cell carcinoma (OSCC); Artificial intelligence
Additional / Any Other Information: N/A
Release Date: Jan. 13, 2025
Access Licence Type: Open Access

Table 1. The sample types registered under this study are as follows:
Sample Type ID Organism Taxon ID Biological Entity Laterality Source Tissue Source Cell/Cell-line Cell Organelle
HISTOSMT_10000000037 Homo sapiens 9606 Oral cavity Not Applicable Buccal mucosa N/A N/A

Table 2. The samples registered under this study are as follows:
Sample Type ID Sample ID Method used for Sample Collection Cell Phenotype Studied De-identified Patient ID ICD-11 Code (patient health condition) Sample Source Tissue Phenotype Studied
HISTOSMT_10000000037 HISTOSM_10000047448 Tissue biopsy N/A 13 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047449 Tissue biopsy N/A 13 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047450 Tissue biopsy N/A 13 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047451 Tissue biopsy N/A 13 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047452 Tissue biopsy N/A 13 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047453 Tissue biopsy N/A 13 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047454 Tissue biopsy N/A 13 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047455 Tissue biopsy N/A 13 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047456 Tissue biopsy N/A 13 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047457 Tissue biopsy N/A 13 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047458 Tissue biopsy N/A 13 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047460 Tissue biopsy N/A 13 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047461 Tissue biopsy N/A 14 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047462 Tissue biopsy N/A 14 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047463 Tissue biopsy N/A 14 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047464 Tissue biopsy N/A 14 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047465 Tissue biopsy N/A 14 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047466 Tissue biopsy N/A 14 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047467 Tissue biopsy N/A 14 DA02.2 Not specified osmf
HISTOSMT_10000000037 HISTOSM_10000047468 Tissue biopsy N/A 15 DA02.2 Not specified osmf

Table 3. The experiment types registered under this study are as follows:
Experiment Type ID Instrument Name Instrument Type Manufacturer Model
HISTOET_10000000008 Microscope Bright-field Leedz Microimaging Ltd (LMI) NA


Experimental Design Summary (HISTOET_10000000008)
The ORCHID database is an extensive multicenter collection of high-resolution images captured at 1000X effective magnification (100X objective lens). Tissue slides were collected with ethics committee approval from the participating hospitals and research institutions. The buccal mucosa tissue samples were collected for three classes (normal, OSMF, and OSCC), with grade-wise annotation from the pathologists at each hospital. Biopsy samples of normal, OSMF, and OSCC tissues underwent H&E staining, conducted either in-house or outsourced to different laboratories. To account for staining variations across different laboratories, the preparation of H&E slides involved five histopathology labs, each using its own independently developed and optimized staining protocol. Following staining, a skilled histopathologist examined the samples under a microscope to assess cellular morphology and tissue architecture and to identify any distinctive features or abnormalities specific to each sample type. This evaluation involved grading the tissue slides for OSCC and OSMF, as well as differentiating between normal and diseased tissue sections. Images of the H&E-stained slides were acquired at 1000X magnification (100X objective lens) on a Leedz Microimaging (LMI) bright-field microscope. To capture the images consistently, we used ToupView imaging software configured to adjust white balance and camera settings automatically, which standardizes image acquisition across slides and minimizes human intervention and the variability it introduces.
This approach is intended to make the images consistent and reproducible in other laboratory settings, provided similar equipment and software settings are used. We collected approximately 100–150 images per tissue slide, which were stored in PNG file format.
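The folder layout and the per-slide image count described in this record can be checked after download with a short script. The following is a minimal sketch, assuming the layout <split>/<class>/<slide>/<image>.png implied by the file paths in this record; the function names and the default 100 to 150 range check are illustrative and are not part of the dataset or its tooling.

```python
from collections import Counter
from pathlib import Path


def count_images(split_dir):
    """Count PNG files per (class, slide) under one split directory.

    Assumes the layout <split_dir>/<class>/<slide>/<image>.png.
    """
    counts = Counter()
    for png in Path(split_dir).glob("*/*/*.png"):
        slide_dir = png.parent
        counts[(slide_dir.parent.name, slide_dir.name)] += 1
    return counts


def slides_outside_range(counts, low=100, high=150):
    """Return the slides whose image count falls outside [low, high]."""
    return {key: n for key, n in counts.items() if not low <= n <= high}
```

For example, count_images("ORCHID_DB/train") would return one counter entry per slide folder. Because the stated figure is approximate, slides reported by slides_outside_range are worth a manual look and are not necessarily errors.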

Acquired Images Annotation Description (HISTOET_10000000008)
The data included in the ORCHID database underwent rigorous expert annotation and validation to ensure a high level of quality and accuracy. In our expert validation process, whether an image had ‘sufficient detail’ to qualify was determined by several criteria. First, the image must clearly depict the necessary histological structures, such as cellular details and tissue architecture. Second, it should be free from artifacts that could interfere with accurate interpretation (e.g., folds, tears, excessive staining). Third, it must be in focus, with appropriate contrast and resolution to discern pathological features. Our team of pathologists and histopathology experts independently assessed each image against these criteria so that only high-quality images were included. Images that were blurry or lacked sufficient detail were dismissed, as they would not provide accurate or reliable information. Next, the experts evaluated the annotations that accompany the images, scrutinizing them for consistency and for accurate representation of the disease conditions depicted. The slides were labeled manually by trained pathology experts, who carefully reviewed each slide to identify and label the specific disease conditions present. Slides that showed staining artifacts were also rejected, because such artifacts, which can arise during slide preparation, alter the appearance of the tissue and can lead to misinterpretation or incorrect diagnosis. As such, only slides that were free from such errors and provided a clear and accurate representation of the oral pathology were included in the database. These standardization processes help ensure that AI models are trained and validated on data that consistently represent the true pathological features.
Standardized and validated data enhance the model’s ability to generalize findings across different datasets and real-world scenarios.

Table 4. The experiments registered under this study are as follows:
Sample ID Experiment Type ID Experiment ID Image type (Original / Derived / Unknown) Any Other Information Staining Type Images Magnification Tissue / Tumor Fixative Used Data Repository Name (If already deposited in another repository) Dataset Split Type (Training / Validation / Test) Licence Type (original source) Stain Normalization Method
HISTOSM_10000050051 HISTOET_10000000008 HISTOE_10000050051 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050054 HISTOET_10000000008 HISTOE_10000050054 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050068 HISTOET_10000000008 HISTOE_10000050068 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050071 HISTOET_10000000008 HISTOE_10000050071 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050085 HISTOET_10000000008 HISTOE_10000050085 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050088 HISTOET_10000000008 HISTOE_10000050088 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050102 HISTOET_10000000008 HISTOE_10000050102 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050105 HISTOET_10000000008 HISTOE_10000050105 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050119 HISTOET_10000000008 HISTOE_10000050119 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050122 HISTOET_10000000008 HISTOE_10000050122 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050136 HISTOET_10000000008 HISTOE_10000050136 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050139 HISTOET_10000000008 HISTOE_10000050139 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050153 HISTOET_10000000008 HISTOE_10000050153 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050156 HISTOET_10000000008 HISTOE_10000050156 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050170 HISTOET_10000000008 HISTOE_10000050170 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050173 HISTOET_10000000008 HISTOE_10000050173 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050187 HISTOET_10000000008 HISTOE_10000050187 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050190 HISTOET_10000000008 HISTOE_10000050190 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050204 HISTOET_10000000008 HISTOE_10000050204 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000050207 HISTOET_10000000008 HISTOE_10000050207 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo train CC-BY 4.0 license Reinhard stain normalization technique
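The "Reinhard stain normalization technique" listed in Table 4 refers to statistical color transfer: each channel of a source image is shifted and scaled so that its mean and standard deviation match those of a reference image in a decorrelated color space. The sketch below illustrates the idea in pure Python on lists of RGB pixels, using the l-alpha-beta space and RGB-to-LMS matrix from Reinhard et al.'s color-transfer paper. It is an assumption-laden illustration, not the dataset authors' pipeline: this record does not state which color space, reference image, or software they used, and the function names and the EPS floor are hypothetical. A practical version would use an array library for speed.

```python
import math
from statistics import fmean, pstdev

# RGB -> LMS cone-response matrix used by Reinhard et al. for color transfer.
RGB_TO_LMS = (
    (0.3811, 0.5783, 0.0402),
    (0.1967, 0.7244, 0.0782),
    (0.0241, 0.1288, 0.8444),
)
EPS = 1e-6  # hypothetical floor that avoids log10(0) for black pixels


def _invert3(m):
    """Invert a 3x3 matrix with the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return (
        ((e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det),
        ((f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det),
        ((d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det),
    )


LMS_TO_RGB = _invert3(RGB_TO_LMS)


def _matvec(m, v):
    return tuple(sum(x * y for x, y in zip(row, v)) for row in m)


def rgb_to_lab(rgb):
    """Map one RGB pixel (floats in [0, 1]) to l-alpha-beta coordinates."""
    L, M, S = (math.log10(max(c, EPS)) for c in _matvec(RGB_TO_LMS, rgb))
    return ((L + M + S) / math.sqrt(3),
            (L + M - 2 * S) / math.sqrt(6),
            (L - M) / math.sqrt(2))


def lab_to_rgb(lab):
    """Inverse of rgb_to_lab, with the result clamped to [0, 1]."""
    l, a, b = lab[0] / math.sqrt(3), lab[1] / math.sqrt(6), lab[2] / math.sqrt(2)
    lms = (10 ** (l + a + b), 10 ** (l + a - b), 10 ** (l - 2 * a))
    return tuple(min(1.0, max(0.0, c)) for c in _matvec(LMS_TO_RGB, lms))


def match_statistics(source, target):
    """Shift and scale each channel of source to the mean/std of target."""
    channels = []
    for src, tgt in zip(zip(*source), zip(*target)):
        s_mean, s_std = fmean(src), pstdev(src)
        t_mean, t_std = fmean(tgt), pstdev(tgt)
        scale = t_std / s_std if s_std > 0 else 1.0
        channels.append([(v - s_mean) * scale + t_mean for v in src])
    return list(zip(*channels))


def reinhard_normalize(source_rgb, target_rgb):
    """Color-normalize source pixels toward the statistics of target pixels."""
    source_lab = [rgb_to_lab(p) for p in source_rgb]
    target_lab = [rgb_to_lab(p) for p in target_rgb]
    return [lab_to_rgb(p) for p in match_statistics(source_lab, target_lab)]
```

Both arguments to reinhard_normalize are flat lists of (r, g, b) tuples scaled to [0, 1]. The inverse matrix is computed rather than hard-coded so that rgb_to_lab and lab_to_rgb round-trip closely.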

Experiment ID Image File Name (with path) Image Size
HISTOE_10000053590 ORCHID_DB/train/osmf/o-1-00-24/o-1-00-24-085.png 5.0M
HISTOE_10000053591 ORCHID_DB/train/osmf/o-1-00-24/o-1-00-24-086.png 4.3M
HISTOE_10000053592 ORCHID_DB/train/osmf/o-1-00-24/o-1-00-24-087.png 4.7M
HISTOE_10000053593 ORCHID_DB/train/osmf/o-1-00-24/o-1-00-24-088.png 3.0M
HISTOE_10000053594 ORCHID_DB/train/osmf/o-1-00-24/o-1-00-24-089.png 4.6M
HISTOE_10000053595 ORCHID_DB/train/osmf/o-1-00-24/o-1-00-24-090.png 4.3M
HISTOE_10000053596 ORCHID_DB/train/osmf/o-1-00-24/o-1-00-24-092.png 4.7M
HISTOE_10000053597 ORCHID_DB/train/osmf/o-1-00-24/o-1-00-24-094.png 4.5M
HISTOE_10000053598 ORCHID_DB/train/osmf/o-1-00-24/o-1-00-24-096.png 3.7M
HISTOE_10000053599 ORCHID_DB/train/osmf/o-1-00-24/o-1-00-24-098.png 4.9M
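
The image paths listed above appear to encode the dataset split, the class folder, and the slide folder, in that order. A minimal sketch for recovering those fields and grouping files by slide follows; the ImageRecord type and both function names are hypothetical, and the code deliberately does not interpret the internal parts of a slide name such as o-1-00-24, whose meaning this record does not explain.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import NamedTuple


class ImageRecord(NamedTuple):
    split: str     # e.g. "train"
    label: str     # class folder, e.g. "osmf"
    slide: str     # slide folder, e.g. "o-1-00-24"
    filename: str  # e.g. "o-1-00-24-085.png"


def parse_image_path(path):
    """Split a path of the form .../<split>/<class>/<slide>/<file>."""
    parts = PurePosixPath(path).parts
    if len(parts) < 4:
        raise ValueError(f"expected at least 4 path components: {path!r}")
    return ImageRecord(*parts[-4:])


def group_by_slide(paths):
    """Map (split, label, slide) to the list of file names on that slide."""
    groups = defaultdict(list)
    for path in paths:
        rec = parse_image_path(path)
        groups[(rec.split, rec.label, rec.slide)].append(rec.filename)
    return dict(groups)
```

Grouping by slide in this way is useful when building custom splits, since it keeps all images from one slide in the same partition.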