
Study Complete Details




Project Accession: IBIAP_1000000009
Title: High-resolution AI image dataset for diagnosing oral submucous fibrosis and squamous cell carcinoma
Description: Oral cancer is a global health challenge, and accurate histopathological interpretation of oral cancer tissue samples remains difficult. Early diagnosis is particularly challenging because of a shortage of experienced pathologists and inter-observer variability in diagnosis. Applying artificial intelligence (deep learning algorithms) to oral cancer histology images is promising for rapid diagnosis, but building AI models requires a high-quality annotated dataset. We present ORCHID (ORal Cancer Histology Image Database), a specialized database generated to advance research in AI-based histology image analytics of oral cancer and precancer. The ORCHID database is an extensive multicenter collection of high-resolution images captured at 1000X effective magnification (100X objective lens), covering oral cancer and precancer categories such as oral submucous fibrosis (OSMF) and oral squamous cell carcinoma (OSCC). It also contains grade-level sub-classifications for OSCC: well-differentiated (WD), moderately differentiated (MD), and poorly differentiated (PD). The database seeks to aid the development of innovative AI-based rapid diagnostics for OSMF and OSCC, along with their subtypes.
Publications: https://doi.org/10.1038/s41597-024-03836-6
Funding agency: N/A
Grant Number: N/A
Ethics Statement: Download
Any Other Information: The original version of this dataset is available at Zenodo (https://zenodo.org/records/12636426; https://zenodo.org/records/12646943). The Zenodo citations for the training set and the test/validation set are, respectively: Chaudhary, N. et al. High-resolution AI image dataset for diagnosing oral submucous fibrosis and squamous cell carcinoma. Zenodo. https://doi.org/10.5281/zenodo.12636426 (2024); and Chaudhary, N., & Ahmad, T. Validation and Test Datasets for “High-resolution AI image dataset for diagnosing oral submucous fibrosis and squamous cell carcinoma”. Zenodo. https://doi.org/10.5281/zenodo.12646943 (2024).
Additional File: N/A
Acknowledgments: N.C. is the recipient of a senior research fellowship from the Indian Council of Medical Research (3/1/2(1)/Oral/2021-NCD-II), New Delhi, India. This work was also supported by the Science and Engineering Research Board (CRG/2020/002294) and the Indian Council of Medical Research (ICMR) (GIA/2019/000274/PRCGIA (Ver-1)), New Delhi, India. We also acknowledge the computing support from the Mphasis F1 Foundation and the Center for Bioinformatics and Computational Biology (B.I.C.) (BT/PR40220/BTIS/137/22/2021) facility at Ashoka University. We thank Farhat Zeba and Sumra Khan for their help with the imaging.

Sr.No | First name | Last name | Email | Organization | Designation
1 | Nisha | Chaudhary | N/A | Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India | Research Scholar
2 | Arpita | Rai | N/A | Rajendra Institute of Medical Sciences, Ranchi, Jharkhand, India | Research Scholar
3 | Aakash | Rao | N/A | Department of Computer Science, Ashoka University, Sonipat, Haryana, India | Research Scholar
4 | Md | Faizan | N/A | Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India | Research Scholar
5 | Jeyaseelan | Augustine | N/A | Maulana Azad Institute of Dental Sciences, New Delhi, India | Research Scholar
6 | Akhilanand | Chaurasia | N/A | King George Medical University, Lucknow, Uttar Pradesh, India | Research Scholar
7 | Deepika | Mishra | N/A | All India Institute of Medical Sciences, New Delhi, India | Research Scholar
8 | Akhilesh | Chandra | N/A | Banaras Hindu University, Banaras, Uttar Pradesh, India | Research Scholar
9 | Varnit | Chauhan | N/A | Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India | Research Scholar
10 | Tanveer | Ahmad | tahmad7@jmi.ac.in | Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India | Principal Investigator

Study Accession: HISTOS_1000000013
Title: High-resolution AI image dataset for diagnosing oral submucous fibrosis and squamous cell carcinoma
Imaging Type: Histopathology (HISTO)
Imaging Sub-type: Diagnostic Pathology
Summary: The images are organized into five classes (folders): Normal, OSMF, WDOSCC, MDOSCC, and PDOSCC. Each class folder consists of subfolders representing different tissue slides collected from different patients. We have made an initial attempt to provide a comprehensive image database for two of the most prominent oral conditions, OSCC and OSMF. We believe that more such databases will be made publicly available in the near future. Such databases will facilitate the development of accurate AI-based diagnostic tools for oral diseases, ultimately improving patient care and outcomes in oral healthcare. In the future, integrating databases of molecular markers, transcriptome, metabolome, and other biomarkers with oral histological images through advanced AI-driven imaging techniques holds great promise for improving diagnostic accuracy and precision. This potential has already been observed in the diagnosis of lung and breast cancers. Such an expansion will aid in developing a more comprehensive AI-driven diagnostic tool.
Keywords: Oral cancer; Oral submucous fibrosis (OSMF); Oral squamous cell carcinoma (OSCC); Artificial intelligence
Additional / Any Other Information: N/A
Release Date: Jan. 13, 2025
Access Licence Type: Open Access

Table 1. The sample types registered under this study are as follows:
Sample Type ID | Organism | Taxon ID | Biological Entity | Laterality | Source Tissue | Source Cell/Cell-line | Cell Organelle
HISTOSMT_10000000037 | Homo sapiens | 9606 | Oral cavity | Not Applicable | Buccal mucosa | N/A | N/A

Table 2. The samples registered under this study are as follows:
Sample Type ID | Sample ID | Method used for Sample Collection | Cell Phenotype Studied | De-identified Patient ID | ICD-11 Code (patient health condition) | Sample Source | Tissue Phenotype Studied
HISTOSMT_10000000037 | HISTOSM_10000047778 | Tissue biopsy | N/A | 08 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047780 | Tissue biopsy | N/A | 08 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047782 | Tissue biopsy | N/A | 08 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047784 | Tissue biopsy | N/A | 08 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047786 | Tissue biopsy | N/A | 08 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047788 | Tissue biopsy | N/A | 08 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047790 | Tissue biopsy | N/A | 08 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047792 | Tissue biopsy | N/A | 09 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047794 | Tissue biopsy | N/A | 09 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047796 | Tissue biopsy | N/A | 09 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047798 | Tissue biopsy | N/A | 09 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047800 | Tissue biopsy | N/A | 09 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047802 | Tissue biopsy | N/A | 09 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047804 | Tissue biopsy | N/A | 10 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047805 | Tissue biopsy | N/A | 10 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047807 | Tissue biopsy | N/A | 10 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047808 | Tissue biopsy | N/A | 10 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047810 | Tissue biopsy | N/A | 10 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047811 | Tissue biopsy | N/A | 10 | 2B6E.0 | Not specified | pdoscc
HISTOSMT_10000000037 | HISTOSM_10000047812 | Tissue biopsy | N/A | 10 | 2B6E.0 | Not specified | pdoscc

Table 3. The experiment types registered under this study are as follows:
Experiment Type ID | Instrument Name | Instrument Type | Manufacturer | Model
HISTOET_10000000008 | Microscope | Bright-field | Leedz Microimaging Ltd (LMI) | NA


Experimental Design Summary (HISTOET_10000000008)
The ORCHID database is an extensive multicenter collection of high-resolution images captured at 1000X effective magnification (100X objective lens). Tissue slides were collected from the participating hospitals and research institutions with ethical committee approval. Buccal mucosa tissue samples were collected for three classes (normal, OSMF, and OSCC), with grade-wise annotation from the pathologists at each hospital. Biopsy samples of normal, OSMF, and OSCC tissues underwent H&E staining, which was conducted either in-house or outsourced to different laboratories. To account for staining variations across laboratories, the H&E slides were prepared in five histopathology labs, each using its own independently developed and optimized staining protocol. Following staining, a skilled histopathologist examined the samples under a microscope to assess cellular morphology and tissue architecture and to identify any distinctive features or abnormalities specific to each sample type. This evaluation involved grading the tissue slides for OSCC and OSMF, as well as differentiating between normal and diseased tissue sections. Images of the H&E-stained slides were acquired at 1000X magnification (100X objective lens) on a Leedz Microimaging (LMI) bright-field microscope. To capture the images consistently, we used ToupView imaging software configured to adjust white balance and camera settings automatically, which standardizes image acquisition across slides and minimizes human intervention and the variability it introduces. This approach is intended to make the images consistent and reproducible in other laboratory settings, provided similar equipment and software settings are used. We collected approximately 100–150 images per tissue slide, stored in PNG format.
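
The per-slide image count stated above can be checked on a local copy of the dataset. The following is a minimal sketch, not part of the ORCHID tooling. It assumes the files are laid out as `<root>/<split>/<class>/<slide>/<image>.png`, a layout inferred from the file names listed in this record. The function names `images_per_slide` and `slides_outside_range` are hypothetical.

```python
from collections import Counter
from pathlib import Path


def images_per_slide(root):
    """Count PNG files in each <split>/<class>/<slide> folder under root.

    Assumption: the four-level layout is inferred from the file names in
    this record; adjust the glob pattern if a local copy differs.
    """
    counts = Counter()
    for png in Path(root).glob("*/*/*/*.png"):
        # Key each image by its slide folder, relative to the dataset root.
        slide = png.parent.relative_to(root).as_posix()
        counts[slide] += 1
    return counts


def slides_outside_range(counts, low=100, high=150):
    """Return the slides whose image count falls outside [low, high]."""
    return {slide: n for slide, n in counts.items() if not low <= n <= high}
```

Because the record gives the range as approximate, a slide reported by `slides_outside_range` is a candidate for inspection rather than evidence of an error.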

Acquired Images Annotation Description (HISTOET_10000000008)
The data included in the ORCHID database underwent rigorous expert annotation and validation to ensure a high level of quality and accuracy. In our expert validation process, an image qualified as having ‘sufficient detail’ if it met several key criteria. First, the histological features had to be clear, showing the necessary histological structures such as cellular details and tissue architecture. Second, the image had to be free from artifacts that could interfere with accurate interpretation (e.g., folds, tears, excessive staining). Third, the image had to be in focus, with appropriate contrast and resolution to discern pathological features. Our team of pathologists and histopathology experts independently assessed each image against these criteria to ensure that only high-quality images were included in our study. Images that were blurry or lacked sufficient detail were rejected because they would not provide accurate or reliable information. Next, the experts evaluated the annotations that accompany the images, checking them for consistency and for accurate representation of the disease conditions depicted. The slides were labeled manually by trained pathology experts, who carefully reviewed each slide to identify and label the specific disease conditions present. Slides that showed staining artifacts were also rejected. Staining artifacts can occur during slide preparation and can alter the appearance of the tissue, potentially leading to misinterpretation or incorrect diagnosis. As such, only slides that were free from such errors and provided a clear and accurate representation of the oral pathology were included in the database. These standardization processes help ensure that AI models are trained and validated on data that consistently represent the true pathological features. Standardized and validated data enhance the model’s ability to generalize findings across different datasets and real-world scenarios.

Table 4. The experiments registered under this study are as follows:
Sample ID | Experiment Type ID | Experiment ID | Image type (Original / Derived / Unknown) | Any Other Information | Staining Type | Images Magnification | Tissue / Tumor Fixative Used | Data Repository Name (If already deposited in another repository) | Dataset Split Type (Training / Validation / Test) | Licence Type (original source) | Stain Normalization Method
HISTOSM_10000046869 | HISTOET_10000000008 | HISTOE_10000046945 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046870 | HISTOET_10000000008 | HISTOE_10000046946 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046871 | HISTOET_10000000008 | HISTOE_10000046947 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046872 | HISTOET_10000000008 | HISTOE_10000046948 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046873 | HISTOET_10000000008 | HISTOE_10000046949 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046874 | HISTOET_10000000008 | HISTOE_10000046950 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046875 | HISTOET_10000000008 | HISTOE_10000046951 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046876 | HISTOET_10000000008 | HISTOE_10000046952 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046877 | HISTOET_10000000008 | HISTOE_10000046953 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046878 | HISTOET_10000000008 | HISTOE_10000046954 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046879 | HISTOET_10000000008 | HISTOE_10000046955 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046880 | HISTOET_10000000008 | HISTOE_10000046956 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046882 | HISTOET_10000000008 | HISTOE_10000046958 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046883 | HISTOET_10000000008 | HISTOE_10000046959 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046885 | HISTOET_10000000008 | HISTOE_10000046961 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046886 | HISTOET_10000000008 | HISTOE_10000046962 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046887 | HISTOET_10000000008 | HISTOE_10000046963 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046888 | HISTOET_10000000008 | HISTOE_10000046964 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046889 | HISTOET_10000000008 | HISTOE_10000046965 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
HISTOSM_10000046890 | HISTOET_10000000008 | HISTOE_10000046966 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | test | CC-BY 4.0 license | Reinhard stain normalization technique
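
Table 4 names the Reinhard stain normalization technique but gives no implementation details. As a rough illustration only, the sketch below shows the statistics-matching step at the core of Reinhard-style colour transfer: each channel of a source image is shifted and scaled so that its mean and standard deviation match those of a reference image. It assumes the channels have already been converted to a perceptual colour space (such as lαβ or CIELAB) and flattened into lists of floats. The colour conversion is omitted, and the function names are hypothetical rather than taken from the dataset's pipeline.

```python
from statistics import fmean, pstdev


def match_channel_stats(source, target):
    """Shift and scale `source` values to the mean and std of `target`."""
    src_mean, src_std = fmean(source), pstdev(source)
    tgt_mean, tgt_std = fmean(target), pstdev(target)
    if src_std == 0:
        # A flat channel has no spread to rescale; move it to the target mean.
        return [tgt_mean for _ in source]
    scale = tgt_std / src_std
    return [(value - src_mean) * scale + tgt_mean for value in source]


def reinhard_transfer(source_channels, target_channels):
    """Apply the statistics matching independently to each colour channel.

    Assumption: both arguments are sequences of channels that are already
    in the same colour space, with each channel a flat list of floats.
    """
    return [
        match_channel_stats(src, tgt)
        for src, tgt in zip(source_channels, target_channels)
    ]
```

After the transfer, each output channel has the reference channel's mean and standard deviation, which is what lets slides stained in different labs be brought to a common colour appearance.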

Experiment ID | Image File Name (with path) | Image Size
HISTOE_10000046840 | ORCHID_DB/train/mdoscc/o-3-00-09/o-3-00-09-018.png | 8.5M
HISTOE_10000046841 | ORCHID_DB/train/mdoscc/o-3-00-09/o-3-00-09-019.png | 7.8M
HISTOE_10000046842 | ORCHID_DB/train/mdoscc/o-3-00-09/o-3-00-09-020.png | 7.3M
HISTOE_10000046843 | ORCHID_DB/train/mdoscc/o-3-00-09/o-3-00-09-021.png | 7.3M
HISTOE_10000046844 | ORCHID_DB/train/mdoscc/o-3-00-09/o-3-00-09-022.png | 7.3M
HISTOE_10000046845 | ORCHID_DB/train/mdoscc/o-3-00-09/o-3-00-09-023.png | 7.6M
HISTOE_10000046846 | ORCHID_DB/test/mdoscc/o-3-00-01/o-3-00-01-009.png | 7.4M
HISTOE_10000046847 | ORCHID_DB/test/mdoscc/o-3-00-01/o-3-00-01-013.png | 7.5M
HISTOE_10000046848 | ORCHID_DB/test/mdoscc/o-3-00-01/o-3-00-01-023.png | 8.3M
HISTOE_10000046849 | ORCHID_DB/test/mdoscc/o-3-00-01/o-3-00-01-025.png | 7.7M
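
The file paths in this listing appear to encode the dataset split, class label, slide folder, and image number. The sketch below parses a path into those fields and groups images by slide, which can help when building patient-level or slide-level splits. The pattern is an assumption inferred from the ten paths shown here and is not a documented naming scheme. The names `PATH_PATTERN`, `parse_image_path`, and `group_by_slide` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout: <root>/<split>/<class>/<slide>/<slide>-<index>.png
PATH_PATTERN = re.compile(
    r"^(?P<root>[^/]+)/(?P<split>[^/]+)/(?P<label>[^/]+)/"
    r"(?P<slide>[^/]+)/(?P=slide)-(?P<index>\d+)\.png$"
)


def parse_image_path(path):
    """Split a listing path into root, split, class label, slide and index."""
    match = PATH_PATTERN.match(path)
    if match is None:
        raise ValueError(f"unexpected path layout: {path}")
    fields = match.groupdict()
    fields["index"] = int(fields["index"])
    return fields


def group_by_slide(paths):
    """Map (split, label, slide) to the sorted image indices found for it."""
    groups = defaultdict(list)
    for path in paths:
        fields = parse_image_path(path)
        key = (fields["split"], fields["label"], fields["slide"])
        groups[key].append(fields["index"])
    return {key: sorted(indices) for key, indices in groups.items()}
```

Raising `ValueError` on an unmatched path is a deliberate choice: a silent skip would hide files that do not follow the assumed naming.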