| Project Accession: | IBIAP_1000000009 |
| Title: | High-resolution AI image dataset for diagnosing oral submucous fibrosis and squamous cell carcinoma |
| Description: | Oral cancer is a global health challenge, and accurate histopathological interpretation of oral cancer tissue samples remains difficult. Early diagnosis is particularly challenging because of a shortage of experienced pathologists and inter-observer variability in diagnosis. Applying artificial intelligence (deep learning algorithms) to oral cancer histology images is promising for rapid diagnosis, but building AI models requires a high-quality annotated dataset. We present ORCHID (ORal Cancer Histology Image Database), a specialized database generated to advance research in AI-based histology image analytics of oral cancer and precancer. The ORCHID database is an extensive multicenter collection of high-resolution images captured at 1000X effective magnification (100X objective lens), covering oral cancer and precancer categories such as oral submucous fibrosis (OSMF) and oral squamous cell carcinoma (OSCC). It also contains grade-level sub-classifications for OSCC: well-differentiated (WD), moderately differentiated (MD), and poorly differentiated (PD). The database aims to support the development of AI-based rapid diagnostics for OSMF and OSCC, along with its subtypes. |
| Publications: | https://doi.org/10.1038/s41597-024-03836-6 |
| Funding agency: | N/A |
| Grant Number: | N/A |
| Ethics Statement: | Download |
| Any Other Information: | The original version of this dataset is available at Zenodo (https://zenodo.org/records/12636426; https://zenodo.org/records/12646943). The Zenodo citations for the training set and the validation/test sets are, respectively: Chaudhary, N. et al. High-resolution AI image dataset for diagnosing oral submucous fibrosis and squamous cell carcinoma. Zenodo. https://doi.org/10.5281/zenodo.12636426 (2024); and Chaudhary, N., & Ahmad, T. Validation and Test Datasets for “High-resolution AI image dataset for diagnosing oral submucous fibrosis and squamous cell carcinoma”. Zenodo. https://doi.org/10.5281/zenodo.12646943 (2024). |
| Additional File: | N/A |
| Acknowledgments: | N.C. is the recipient of a senior research fellowship from the Indian Council of Medical Research (3/1/2(1)/Oral/2021-NCD-II), New Delhi, India. This work was also supported by the Science and Engineering Research Board (CRG/2020/002294) and the Indian Council of Medical Research (ICMR) (GIA/2019/000274/PRCGIA (Ver-1)), New Delhi, India. We also acknowledge computing support from the Mphasis F1 Foundation and the Center for Bioinformatics and Computational Biology (B.I.C.) (BT/PR40220/BTIS/137/22/2021) facility at Ashoka University. We thank Farhat Zeba and Sumra Khan for their help with the imaging. |
| Sr.No | First name | Last name | Email | Organization | Designation |
|---|---|---|---|---|---|
| 1 | Nisha | Chaudhary | N/A | Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India | Research Scholar |
| 2 | Arpita | Rai | N/A | Rajendra Institute of Medical Sciences, Ranchi, Jharkhand, India | Research Scholar |
| 3 | Aakash | Rao | N/A | Department of Computer Science, Ashoka University, Sonipat, Haryana, India | Research Scholar |
| 4 | Md | Faizan | N/A | Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India | Research Scholar |
| 5 | Jeyaseelan | Augustine | N/A | Maulana Azad Institute of Dental Sciences, New Delhi, India | Research Scholar |
| 6 | Akhilanand | Chaurasia | N/A | King George Medical University, Lucknow, Uttar Pradesh, India | Research Scholar |
| 7 | Deepika | Mishra | N/A | All India Institute of Medical Sciences, New Delhi, India | Research Scholar |
| 8 | Akhilesh | Chandra | N/A | Banaras Hindu University, Banaras, Uttar Pradesh, India | Research Scholar |
| 9 | Varnit | Chauhan | N/A | Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India | Research Scholar |
| 10 | Tanveer | Ahmad | tahmad7@jmi.ac.in | Multidisciplinary Centre for Advanced Research and Studies, Jamia Millia Islamia, New Delhi, India | Principal Investigator |
| Study Accession: | HISTOS_1000000013 |
| Title: | High-resolution AI image dataset for diagnosing oral submucous fibrosis and squamous cell carcinoma |
| Imaging Type: | Histopathology (HISTO) |
| Imaging Sub-type: | Diagnostic Pathology |
| Summary: | The images are organized into five classes (folders): Normal, OSMF, WDOSCC, MDOSCC, and PDOSCC. Each class folder consists of subfolders representing different tissue slides collected from different patients. We have made an initial attempt to provide a comprehensive image database for two of the most prominent oral conditions, OSCC and OSMF. We believe that more such databases will be made publicly available in the near future. These comprehensive image databases will facilitate the development of accurate AI-based diagnostic tools for oral diseases, ultimately improving patient care and outcomes in oral healthcare. In future, integrating databases of molecular markers, transcriptome, metabolome, and other biomarkers with oral histological images through advanced AI-driven imaging techniques holds great promise for improving diagnostic accuracy and precision. This potential has already been observed in the diagnosis of lung and breast cancers. Such an expansion will aid in developing a more comprehensive AI-driven diagnostic tool. |
| Keywords: | Oral cancer; Oral submucous fibrosis (OSMF); Oral squamous cell carcinoma (OSCC); Artificial intelligence |
| Additional / Any Other Information: | N/A |
| Release Date: | Jan. 13, 2025 |
| Access Licence Type: | Open Access |
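
The folder layout described in the Summary (class folders containing one subfolder per tissue slide) can be tallied with a short script. The following is a minimal sketch, not part of the original record: it assumes a local copy laid out as `<root>/<split>/<class>/<slide>/<image>.png`, which matches the file names listed in this record, and the function names `count_images` and `per_class_totals` are hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_images(root):
    """Count PNG files per (split, class, slide) under an ORCHID_DB-style tree.

    Assumes the layout <root>/<split>/<class>/<slide>/<image>.png; adjust the
    glob pattern if a local copy is organized differently.
    """
    counts = Counter()
    for png in Path(root).glob("*/*/*/*.png"):
        # The first three path components below the root identify the slide.
        split, cls, slide = png.relative_to(root).parts[:3]
        counts[(split, cls, slide)] += 1
    return counts


def per_class_totals(counts):
    """Collapse slide-level counts into totals per (split, class)."""
    totals = Counter()
    for (split, cls, _slide), n in counts.items():
        totals[(split, cls)] += n
    return totals
```

Calling `per_class_totals(count_images("ORCHID_DB"))`, where `"ORCHID_DB"` stands for wherever the archive was unpacked, gives per-class image counts that can be compared against the numbers reported in the publication.
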
| Sample Type ID | Organism | Taxon ID | Biological Entity | Laterality | Source Tissue | Source Cell/Cell-line | Cell Organelle |
|---|---|---|---|---|---|---|---|
| HISTOSMT_10000000037 | Homo sapiens | 9606 | Oral cavity | Not Applicable | Buccal mucosa | N/A | N/A |
| Sample Type ID | Sample ID | Method used for Sample Collection | Cell Phenotype Studied | De-identified Patient ID | ICD-11 Code (patient health condition) | Sample Source | Tissue Phenotype Studied |
|---|---|---|---|---|---|---|---|
| HISTOSMT_10000000037 | HISTOSM_10000058913 | Tissue biopsy | N/A | 17 | 2B6E.0 | Not specified | mdoscc |
| HISTOSMT_10000000037 | HISTOSM_10000058944 | Tissue biopsy | N/A | 18 | 2B6E.0 | Not specified | mdoscc |
| HISTOSMT_10000000037 | HISTOSM_10000058975 | Tissue biopsy | N/A | 20 | 2B6E.0 | Not specified | mdoscc |
| HISTOSMT_10000000037 | HISTOSM_10000059006 | Tissue biopsy | N/A | 21 | 2B6E.0 | Not specified | mdoscc |
| HISTOSMT_10000000037 | HISTOSM_10000059037 | Tissue biopsy | N/A | 24 | 2B6E.0 | Not specified | mdoscc |
| HISTOSMT_10000000037 | HISTOSM_10000059068 | Tissue biopsy | N/A | 26 | 2B6E.0 | Not specified | mdoscc |
| HISTOSMT_10000000037 | HISTOSM_10000059099 | Tissue biopsy | N/A | 27 | 2B6E.0 | Not specified | mdoscc |
| HISTOSMT_10000000037 | HISTOSM_10000059130 | Tissue biopsy | N/A | 29 | 2B6E.0 | Not specified | mdoscc |
| HISTOSMT_10000000037 | HISTOSM_10000059161 | Tissue biopsy | N/A | 31 | 2B6E.0 | Not specified | mdoscc |
| HISTOSMT_10000000037 | HISTOSM_10000059192 | Tissue biopsy | N/A | 32 | 2B6E.0 | Not specified | mdoscc |
| HISTOSMT_10000000037 | HISTOSM_10000059223 | Tissue biopsy | N/A | 35 | 2B6E.0 | Not specified | mdoscc |
| HISTOSMT_10000000037 | HISTOSM_10000059254 | Tissue biopsy | N/A | 36 | 2B6E.0 | Not specified | mdoscc |
| HISTOSMT_10000000037 | HISTOSM_10000059285 | Tissue biopsy | N/A | 39 | 2B6E.0 | Not specified | mdoscc |
| HISTOSMT_10000000037 | HISTOSM_10000059316 | Tissue biopsy | N/A | 41 | 2B6E.0 | Not specified | mdoscc |
| HISTOSMT_10000000037 | HISTOSM_10000059347 | Tissue biopsy | N/A | 42 | 2B6E.0 | Not specified | mdoscc |
| HISTOSMT_10000000037 | HISTOSM_10000059378 | Tissue biopsy | N/A | 02 | N/A | Not specified | normal |
| HISTOSMT_10000000037 | HISTOSM_10000059409 | Tissue biopsy | N/A | 04 | N/A | Not specified | normal |
| HISTOSMT_10000000037 | HISTOSM_10000059440 | Tissue biopsy | N/A | 05 | N/A | Not specified | normal |
| HISTOSMT_10000000037 | HISTOSM_10000059471 | Tissue biopsy | N/A | 06 | N/A | Not specified | normal |
| HISTOSMT_10000000037 | HISTOSM_10000059502 | Tissue biopsy | N/A | 08 | N/A | Not specified | normal |
| Experiment Type ID | Instrument Name | Instrument Type | Manufacturer | Model |
|---|---|---|---|---|
| HISTOET_10000000008 | Microscope | Bright-field | Leedz Microimaging Ltd (LMI) | NA |
| Experimental Design Summary (HISTOET_10000000008) |
|---|
| The ORCHID database is an extensive multicenter collection of high-resolution images captured at 1000X effective magnification (100X objective lens). Tissue slides were collected with ethics committee approval from the participating hospitals and research institutions. The buccal mucosa tissue samples were collected for three classes (normal, OSMF, and OSCC), with grade-wise annotation from the pathologists at each hospital. Biopsy samples of normal, OSMF, and OSCC tissues underwent H&E staining, conducted either in-house or outsourced to different laboratories. To account for staining variation across laboratories, H&E slide preparation involved five histopathology labs, each using its own independently developed and optimized staining protocol. Following staining, a skilled histopathologist examined the samples under a microscope to assess cellular morphology and tissue architecture and to identify any distinctive features or abnormalities specific to each sample type. This evaluation involved grading the tissue slides for OSCC and OSMF, as well as differentiating between normal and diseased tissue sections. Images of the H&E-stained slides were acquired at 1000X magnification (100X objective lens) on a Leedz Microimaging (LMI) bright-field microscope. To capture the images consistently, we used ToupView imaging software configured to adjust white balance and camera settings automatically, which standardizes image acquisition across slides and minimizes human intervention and the variability it introduces. This approach is intended to make the images consistent and replicable in other laboratory settings, provided similar equipment and software settings are used. We collected approximately 100–150 images per tissue slide, stored in PNG format. |
| Acquired Images Annotation Description (HISTOET_10000000008) |
|---|
| The data included in the ORCHID database underwent rigorous expert annotation and validation to ensure a high level of quality and accuracy. In our expert validation process, whether an image had ‘sufficient detail’ to qualify was determined by several key criteria. First, the image must clearly depict the necessary histological structures, such as cellular details and tissue architecture. Second, it should be free from artifacts that could interfere with accurate interpretation (e.g., folds, tears, excessive staining). Third, it must be in focus, with appropriate contrast and resolution to discern pathological features. Our team of pathologists and histopathology experts independently assessed each image against these criteria so that only high-quality images were included. Images that were blurry or lacked sufficient detail were discarded, as they would not provide accurate or reliable information. Next, the experts scrutinized the annotations accompanying the images for consistency and accuracy, to ensure that they correctly represented the disease conditions depicted. Slides were labeled manually by trained pathology experts, who carefully reviewed each slide to identify and label the specific disease conditions present. Slides that showed staining artifacts were also rejected: such artifacts can arise during slide preparation and alter the appearance of the tissue, potentially leading to misinterpretation or incorrect diagnosis. As such, only slides that were free from such errors and provided a clear and accurate representation of the oral pathology were included in the database. These standardization steps help ensure that AI models are trained and validated on data that consistently represent the true pathological features. Standardized and validated data enhance a model’s ability to generalize across different datasets and real-world scenarios. |
| Sample ID | Experiment Type ID | Experiment ID | Image type (Original / Derived / Unknown) | Any Other Information | Staining Type | Images Magnification | Tissue / Tumor Fixative Used | Data Repository Name (If already deposited in another repository) | Dataset Split Type (Training / Validation / Test) | Licence Type (original source) | Stain Normalization Method |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HISTOSM_10000061099 | HISTOET_10000000008 | HISTOE_10000061099 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061107 | HISTOET_10000000008 | HISTOE_10000061107 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061116 | HISTOET_10000000008 | HISTOE_10000061116 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061124 | HISTOET_10000000008 | HISTOE_10000061124 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061133 | HISTOET_10000000008 | HISTOE_10000061133 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061141 | HISTOET_10000000008 | HISTOE_10000061141 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061150 | HISTOET_10000000008 | HISTOE_10000061150 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061158 | HISTOET_10000000008 | HISTOE_10000061158 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061167 | HISTOET_10000000008 | HISTOE_10000061167 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061175 | HISTOET_10000000008 | HISTOE_10000061175 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061184 | HISTOET_10000000008 | HISTOE_10000061184 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061192 | HISTOET_10000000008 | HISTOE_10000061192 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061201 | HISTOET_10000000008 | HISTOE_10000061201 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061209 | HISTOET_10000000008 | HISTOE_10000061209 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061218 | HISTOET_10000000008 | HISTOE_10000061218 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061226 | HISTOET_10000000008 | HISTOE_10000061226 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061235 | HISTOET_10000000008 | HISTOE_10000061235 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061243 | HISTOET_10000000008 | HISTOE_10000061243 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061252 | HISTOET_10000000008 | HISTOE_10000061252 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
| HISTOSM_10000061260 | HISTOET_10000000008 | HISTOE_10000061260 | Derived | N/A | Haematoxylin and eosin staining (H&E) | 1000X | N/A | Zenodo | val | CC-BY 4.0 license | Reinhard stain normalization technique |
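
The table above lists Reinhard stain normalization for the derived images. The record does not say which implementation was used, so the following is only a minimal sketch of the statistic-matching step at the core of that technique: each channel of a source image is shifted and scaled so that its mean and standard deviation match those of a reference image. It assumes the pixel values have already been converted to a decorrelated colour space (Reinhard's method uses lαβ; LAB is a common substitute) and are supplied as flat lists per channel. The function names are hypothetical.

```python
from statistics import fmean, pstdev


def match_channel(source, target_mean, target_std):
    """Shift and scale one channel so its mean/std match the target's.

    Implements v' = (v - mean_src) * (std_tgt / std_src) + mean_tgt.
    A flat channel (zero spread) is mapped to the target mean.
    """
    mu = fmean(source)
    sigma = pstdev(source)
    if sigma == 0:
        return [target_mean for _ in source]
    scale = target_std / sigma
    return [(v - mu) * scale + target_mean for v in source]


def reinhard_match(source_channels, target_stats):
    """Apply match_channel to each channel.

    source_channels: list of per-channel value lists (already in the
                     working colour space; conversion is NOT done here).
    target_stats:    list of (mean, std) pairs from the reference image.
    """
    return [
        match_channel(channel, mean, std)
        for channel, (mean, std) in zip(source_channels, target_stats)
    ]
```

A complete pipeline would convert RGB to the working colour space before this step and back to RGB afterwards; established image-processing libraries provide those conversions and are a better choice than hand-written code for real images.
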
| Experiment ID | Image File Name (with path) | Image Size |
|---|---|---|
| HISTOE_10000054100 | ORCHID_DB/train/osmf/o-1-00-30/o-1-00-30-098.png | 3.5M |
| HISTOE_10000054101 | ORCHID_DB/train/osmf/o-1-00-30/o-1-00-30-100.png | 4.3M |
| HISTOE_10000054102 | ORCHID_DB/train/osmf/o-1-00-30/o-1-00-30-102.png | 4.0M |
| HISTOE_10000054103 | ORCHID_DB/train/osmf/o-1-00-30/o-1-00-30-104.png | 3.6M |
| HISTOE_10000054104 | ORCHID_DB/train/osmf/o-1-00-30/o-1-00-30-105.png | 3.6M |
| HISTOE_10000054105 | ORCHID_DB/train/osmf/o-1-00-30/o-1-00-30-106.png | 3.9M |
| HISTOE_10000054106 | ORCHID_DB/train/osmf/o-1-00-30/o-1-00-30-107.png | 3.9M |
| HISTOE_10000054107 | ORCHID_DB/train/osmf/o-1-00-30/o-1-00-30-109.png | 3.5M |
| HISTOE_10000054108 | ORCHID_DB/train/osmf/o-1-00-30/o-1-00-30-114.png | 3.8M |
| HISTOE_10000054109 | ORCHID_DB/train/osmf/o-1-00-30/o-1-00-30-115.png | 3.9M |
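
The file names above appear to encode the dataset split, class, slide, and image index. The following is a minimal sketch for recovering those fields from a path. It assumes the pattern `<root>/<split>/<class>/<slide>/<slide>-<index>.png` inferred from the rows listed here, which may not hold for every file in the archive; `ImageRef` and `parse_image_path` are hypothetical names.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ImageRef(NamedTuple):
    split: str   # e.g. "train"
    label: str   # class folder, e.g. "osmf"
    slide: str   # slide folder, e.g. "o-1-00-30"
    index: int   # image number within the slide


def parse_image_path(path):
    """Split '<root>/<split>/<class>/<slide>/<slide>-<index>.png' into fields.

    Raises ValueError if the path does not follow the assumed pattern.
    """
    p = PurePosixPath(path)
    if p.suffix.lower() != ".png" or len(p.parts) < 4:
        raise ValueError(f"unexpected image path: {path}")
    split, label, slide = p.parts[-4:-1]
    prefix = slide + "-"
    if not p.stem.startswith(prefix):
        raise ValueError(f"file name does not start with slide id: {path}")
    return ImageRef(split, label, slide, int(p.stem[len(prefix):]))
```

Grouping images by the `slide` field rather than by individual file is one way to keep all images from a single tissue slide in the same partition when building custom splits.
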