Study Complete Details




Project Accession: IBIAP_1000000009
Title: High-resolution AI image dataset for diagnosing oral submucous fibrosis and squamous cell carcinoma
Description: Oral cancer is a global health challenge, and accurate histopathological interpretation of oral cancer tissue samples remains difficult. Early diagnosis is especially challenging because experienced pathologists are scarce and diagnoses vary between observers. Applying artificial intelligence (deep learning algorithms) to oral cancer histology images is promising for rapid diagnosis, but building AI models requires a high-quality annotated dataset. We present ORCHID (ORal Cancer Histology Image Database), a specialized database created to advance research in AI-based histology image analysis of oral cancer and precancer. ORCHID is an extensive multicenter collection of high-resolution images captured at 1000X effective magnification (100X objective lens), covering oral cancer and precancer categories such as oral submucous fibrosis (OSMF) and oral squamous cell carcinoma (OSCC). It also contains grade-level sub-classifications for OSCC: well-differentiated (WD), moderately differentiated (MD), and poorly differentiated (PD). The database is intended to aid the development of AI-based rapid diagnostics for OSMF and OSCC, including the OSCC subtypes.
Publications: https://doi.org/10.1038/s41597-024-03836-6
Funding agency: N/A
Grant Number: N/A
Ethics Statement: Download
Any Other Information: The original version of this dataset is available at Zenodo (https://zenodo.org/records/12636426; https://zenodo.org/records/12646943). The Zenodo citations for the training set and for the validation/test sets are, respectively: Chaudhary, N. et al. High-resolution AI image dataset for diagnosing oral submucous fibrosis and squamous cell carcinoma. Zenodo. https://doi.org/10.5281/zenodo.12636426 (2024), and Chaudhary, N., & Ahmad, T. Validation and Test Datasets for “High-resolution AI image dataset for diagnosing oral submucous fibrosis and squamous cell carcinoma”. Zenodo. https://doi.org/10.5281/zenodo.12646943 (2024).
Additional File: N/A
Acknowledgments: N.C. is the recipient of a senior research fellowship from the Indian Council of Medical Research (3/1/2(1)/Oral/2021-NCD-II), New Delhi, India. This work was also supported by the Science and Engineering Research Board (CRG/2020/002294) and the Indian Council of Medical Research (ICMR) (GIA/2019/000274/PRCGIA (Ver-1)), New Delhi, India. We also acknowledge computing support from the Mphasis F1 Foundation and the Center for Bioinformatics and Computational Biology (B.I.C.) (BT/PR40220/BTIS/137/22/2021) facility at Ashoka University. We thank Farhat Zeba and Sumra Khan for their help with the imaging.

Sr.No First name Last name Email Organization Designation
1 Nisha Chaudhary N/A Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India Research Scholar
2 Arpita Rai N/A Rajendra Institute of Medical Sciences, Ranchi, Jharkhand, India Research Scholar
3 Aakash Rao N/A Department of Computer Science, Ashoka University, Sonipat, Haryana, India Research Scholar
4 Md Faizan N/A Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India Research Scholar
5 Jeyaseelan Augustine N/A Maulana Azad Institute of Dental Sciences, New Delhi, India Research Scholar
6 Akhilanand Chaurasia N/A King George Medical University, Lucknow, Uttar Pradesh, India Research Scholar
7 Deepika Mishra N/A All India Institute of Medical Sciences, New Delhi, India Research Scholar
8 Akhilesh Chandra N/A Banaras Hindu University, Banaras, Uttar Pradesh, India Research Scholar
9 Varnit Chauhan N/A Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India Research Scholar
10 Tanveer Ahmad tahmad7@jmi.ac.in Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India Principal Investigator

Study Accession: HISTOS_1000000013
Title: High-resolution AI image dataset for diagnosing oral submucous fibrosis and squamous cell carcinoma
Imaging Type: Histopathology (HISTO)
Imaging Sub-type: Diagnostic Pathology
Summary: The images are organized into five classes (folders): Normal, OSMF, WDOSCC, MDOSCC, and PDOSCC. Each class folder contains subfolders representing different tissue slides collected from different patients. This is an initial attempt to provide a comprehensive image database for two of the most prominent oral conditions, OSCC and OSMF. We believe that more such databases will be made publicly available in the near future. Comprehensive image databases of this kind will facilitate the development of accurate AI-based diagnostic tools for oral diseases, ultimately improving patient care and outcomes in oral healthcare. In the future, integrating databases of molecular markers, transcriptome, metabolome, and other biomarkers with oral histological images through advanced AI-driven imaging techniques holds great promise for improving diagnostic accuracy and precision. This potential has already been observed in the diagnosis of lung and breast cancers. Such an expansion will aid in developing a more comprehensive AI-driven diagnostic tool.
Keywords: Oral cancer; Oral submucous fibrosis (OSMF); Oral squamous cell carcinoma (OSCC); Artificial intelligence
Additional / Any Other Information: N/A
Release Date: Jan. 13, 2025
Access Licence Type: Open Access
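
The folder layout described in the Summary (class folders containing one subfolder per tissue slide) can be checked after download with a short script. The following is a minimal sketch, not part of the dataset's own tooling: the function name `count_images` and the lowercase class folder names are assumptions (only `mdoscc` appears in the file paths of this record), so adjust them to match your local copy.

```python
from collections import Counter
from pathlib import Path

# Assumed folder names: lowercase forms of the five classes in the Summary.
# Only "mdoscc" is confirmed by the file paths shown in this record.
CLASSES = ("normal", "osmf", "wdoscc", "mdoscc", "pdoscc")


def count_images(split_dir):
    """Count PNG images per class and per (class, slide) under one split folder."""
    per_class = Counter()
    per_slide = Counter()
    for class_dir in sorted(Path(split_dir).iterdir()):
        label = class_dir.name.lower()
        if not class_dir.is_dir() or label not in CLASSES:
            continue  # ignore stray files and unknown folders
        for slide_dir in sorted(class_dir.iterdir()):
            if not slide_dir.is_dir():
                continue
            # Only files with a .png extension are counted as images.
            n = sum(1 for p in slide_dir.iterdir() if p.suffix.lower() == ".png")
            per_class[label] += n
            per_slide[(label, slide_dir.name)] = n
    return per_class, per_slide
```

Calling `count_images("ORCHID_DB/test")` on a local copy would return the two counters. Comparing the per-slide counts against the roughly 100–150 images per slide reported in the experimental design can flag incomplete downloads, although a test split may hold only a subset of each slide's images.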

Table 1. The sample types registered under this study are as follows:
Sample Type ID Organism Taxon ID Biological Entity Laterality Source Tissue Source Cell/Cell-line Cell Organelle
HISTOSMT_10000000037 Homo sapiens 9606 Oral cavity Not Applicable Buccal mucosa N/A N/A

Table 2. The samples registered under this study are as follows:
Sample Type ID Sample ID Method used for Sample Collection Cell Phenotype Studied De-identified Patient ID ICD-11 Code (patient health condition) Sample Source Tissue Phenotype Studied
HISTOSMT_10000000037 HISTOSM_10000048761 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048763 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048765 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048767 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048769 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048771 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048773 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048775 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048777 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048779 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048781 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048783 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048785 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048787 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048789 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048791 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048793 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048795 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048797 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc
HISTOSMT_10000000037 HISTOSM_10000048799 Tissue biopsy N/A 07 2B6E.0 Not specified mdoscc

Table 3. The experiment types registered under this study are as follows:
Experiment Type ID Instrument Name Instrument Type Manufacturer Model
HISTOET_10000000008 Microscope Bright-field Leedz Microimaging Ltd (LMI) NA


Experimental Design Summary (HISTOET_10000000008)
The ORCHID database is an extensive multicenter collection of high-resolution images captured at 1000X effective magnification (100X objective lens). Tissue slides were collected with ethics committee approval from the participating hospitals and research institutions. Buccal mucosa tissue samples were collected for three classes (normal, OSMF, and OSCC), with grade-wise annotation by the pathologists at each hospital. Biopsy samples of normal, OSMF, and OSCC tissues underwent H&E staining, conducted either in-house or outsourced to different laboratories. To account for staining variation across laboratories, the H&E slides were prepared by five histopathology labs, each using its own independently developed and optimized staining protocol. After staining, a skilled histopathologist examined the samples under a microscope to assess cellular morphology and tissue architecture and to identify any distinctive features or abnormalities specific to each sample type. This evaluation involved grading the tissue slides for OSCC and OSMF and differentiating between normal and diseased tissue sections. Images of the H&E-stained slides were acquired with a Leedz Microimaging (LMI) bright-field microscope at 1000X magnification (100X objective lens). To capture images consistently, we used ToupView imaging software configured to adjust white balance and camera settings automatically, which standardizes acquisition across slides and minimizes human intervention and the variability it introduces. This approach is intended to make the images consistent and reproducible in other laboratory settings, provided similar equipment and software settings are used. We collected approximately 100–150 images per tissue slide, stored in PNG format.

Acquired Images Annotation Description (HISTOET_10000000008)
The data included in the ORCHID database underwent rigorous expert annotation and validation to ensure a high level of quality and accuracy. In our expert validation process, ‘sufficient detail’ for an image to qualify was determined by several key criteria. First, the image must clearly depict the necessary histological structures, such as cellular details and tissue architecture. Second, it should be free from artifacts that could interfere with accurate interpretation (e.g., folds, tears, excessive staining). Third, it must be in focus, with appropriate contrast and resolution to discern pathological features. Our team of pathologists and histopathology experts independently assessed each image against these criteria so that only high-quality images were included in our study. Images that were blurry or lacked sufficient detail were rejected, as they would not provide accurate or reliable information. Next, the experts evaluated the annotations that accompany the images, checking them for consistency and for whether they accurately represented the disease conditions depicted. The slides were labeled manually by trained pathology experts, who carefully reviewed each slide to identify and label the specific disease conditions present. Slides that showed staining artifacts were also rejected. Staining artifacts can occur during slide preparation and can alter the appearance of the tissue, potentially leading to misinterpretation or incorrect diagnosis. As such, only slides that were free from such errors and provided a clear and accurate representation of the oral pathology were included in the database. These standardization steps are intended to ensure that AI models are trained and validated on data that consistently represent the true pathological features. Standardized and validated data enhance a model’s ability to generalize across different datasets and real-world scenarios.

Table 4. The experiments registered under this study are as follows:
Sample ID Experiment Type ID Experiment ID Image type (Original / Derived / Unknown) Any Other Information Staining Type Images Magnification Tissue / Tumor Fixative Used Data Repository Name (If already deposited in another repository) Dataset Split Type (Training / Validation / Test) Licence Type (original source) Stain Normalization Method
HISTOSM_10000047617 HISTOET_10000000008 HISTOE_10000047693 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047618 HISTOET_10000000008 HISTOE_10000047694 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047619 HISTOET_10000000008 HISTOE_10000047695 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047620 HISTOET_10000000008 HISTOE_10000047696 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047621 HISTOET_10000000008 HISTOE_10000047697 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047622 HISTOET_10000000008 HISTOE_10000047698 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047623 HISTOET_10000000008 HISTOE_10000047699 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047624 HISTOET_10000000008 HISTOE_10000047700 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047625 HISTOET_10000000008 HISTOE_10000047701 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047626 HISTOET_10000000008 HISTOE_10000047702 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047627 HISTOET_10000000008 HISTOE_10000047703 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047628 HISTOET_10000000008 HISTOE_10000047704 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047630 HISTOET_10000000008 HISTOE_10000047706 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047631 HISTOET_10000000008 HISTOE_10000047707 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047633 HISTOET_10000000008 HISTOE_10000047709 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047634 HISTOET_10000000008 HISTOE_10000047710 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047635 HISTOET_10000000008 HISTOE_10000047711 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047636 HISTOET_10000000008 HISTOE_10000047712 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047637 HISTOET_10000000008 HISTOE_10000047713 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
HISTOSM_10000047638 HISTOET_10000000008 HISTOE_10000047714 Derived N/A Haematoxylin and eosin staining (H&E) 1000X N/A Zenodo test CC-BY 4.0 license Reinhard stain normalization technique
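
Table 4 lists Reinhard stain normalization for these derived images. As general background, the Reinhard method converts the source and a reference image to a decorrelated colour space (lαβ in the original formulation, often CIELAB in practice), then shifts and scales each channel of the source so that its mean and standard deviation match those of the reference. The sketch below illustrates only that statistics-matching step on plain lists of channel values. It assumes the colour-space conversion has been done elsewhere, and the function names are illustrative. This record does not state which reference image or colour space the authors used.

```python
from statistics import fmean, pstdev


def match_channel_stats(source, target_mean, target_std):
    """Shift and scale one colour channel so its mean and std match the target's."""
    src_mean = fmean(source)
    src_std = pstdev(source)
    if src_std == 0:
        # Flat channel: nothing to scale, so move every value to the target mean.
        return [target_mean for _ in source]
    scale = target_std / src_std
    return [(v - src_mean) * scale + target_mean for v in source]


def reinhard_transfer(source_channels, target_channels):
    """Apply the per-channel match to each channel pair (e.g. l, alpha, beta).

    Both arguments are sequences of flat lists of pixel values, one list per
    channel, already converted to the working colour space (an assumption of
    this sketch).
    """
    return [
        match_channel_stats(src, fmean(tgt), pstdev(tgt))
        for src, tgt in zip(source_channels, target_channels)
    ]
```

After the transfer, the result would be converted back to RGB. In a real pipeline an array library would replace the list comprehensions, but the arithmetic per channel is the same.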

Experiment ID Image File Name (with path) Image Size
HISTOE_10000047170 ORCHID_DB/test/mdoscc/o-3-00-32/o-3-00-32-009.png 6.9M
HISTOE_10000047171 ORCHID_DB/test/mdoscc/o-3-00-32/o-3-00-32-015.png 6.5M
HISTOE_10000047172 ORCHID_DB/test/mdoscc/o-3-00-32/o-3-00-32-019.png 5.6M
HISTOE_10000047173 ORCHID_DB/test/mdoscc/o-3-00-32/o-3-00-32-025.png 6.3M
HISTOE_10000047174 ORCHID_DB/test/mdoscc/o-3-00-32/o-3-00-32-029.png 6.8M
HISTOE_10000047175 ORCHID_DB/test/mdoscc/o-3-00-32/o-3-00-32-055.png 5.7M
HISTOE_10000047176 ORCHID_DB/test/mdoscc/o-3-00-32/o-3-00-32-066.png 6.0M
HISTOE_10000047177 ORCHID_DB/test/mdoscc/o-3-00-32/o-3-00-32-084.png 6.6M
HISTOE_10000047178 ORCHID_DB/test/mdoscc/o-3-00-33/o-3-00-33-006.png 2.5M
HISTOE_10000047179 ORCHID_DB/test/mdoscc/o-3-00-33/o-3-00-33-012.png 2.5M
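
The file paths listed above appear to follow the pattern `ORCHID_DB/<split>/<class>/<slide>/<slide>-<tile>.png`. The sketch below parses a path into those parts, which is useful for grouping images by slide so that all images from one slide stay in the same split. The pattern is inferred from the ten paths shown here and may not hold for every file. The `ImageRecord` type and `parse_image_path` function are illustrative names, not part of the dataset.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ImageRecord(NamedTuple):
    split: str   # e.g. "test"
    label: str   # class folder, e.g. "mdoscc"
    slide: str   # slide folder, e.g. "o-3-00-32"
    tile: str    # image index within the slide, kept as text, e.g. "009"


def parse_image_path(path):
    """Split 'ORCHID_DB/<split>/<class>/<slide>/<slide>-<tile>.png' into fields."""
    parts = PurePosixPath(path).parts
    if len(parts) != 5 or parts[0] != "ORCHID_DB":
        raise ValueError(f"unexpected path layout: {path}")
    _, split, label, slide, filename = parts
    stem = PurePosixPath(filename).stem
    if not stem.startswith(slide + "-"):
        raise ValueError(f"file name does not start with slide id: {path}")
    # Everything after "<slide>-" is treated as the tile index.
    return ImageRecord(split, label, slide, stem[len(slide) + 1:])
```

For example, `parse_image_path("ORCHID_DB/test/mdoscc/o-3-00-33/o-3-00-33-006.png")` yields the split `test`, the class `mdoscc`, the slide `o-3-00-33`, and the tile `006`.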