Large Dataset of Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images

Version 1.0.0
Description

This dataset was designed to support research in AI-based medical image analysis, with a focus on retinal and pulmonary conditions. It includes thousands of expertly labeled OCT and Chest X-Ray images sourced from independent patients, with the OCT images categorized into four classes: CNV, DME, DRUSEN, and NORMAL. The dataset mirrors the imaging data described in the publication 'Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning' and is structured to facilitate reproducibility and benchmarking in deep learning workflows.

Keywords
OCT Imaging, Chest X-Ray, AI in Medical Imaging, Retinal Disease Classification, Pneumonia Detection
Conditions
Choroidal Neovascularization, Diabetic Macular Edema, Drusen, Pneumonia
License

Creative Commons Attribution 4.0 International

Overview of the study

The Artificial Intelligence Ready and Exploratory Atlas for Diabetes Insights (AI-READI) project seeks to create a flagship, ethically sourced dataset to enable future generations of artificial intelligence/machine learning (AI/ML) research to provide critical insights into type 2 diabetes mellitus (T2DM), including salutogenic pathways to return to health. The ability to understand and affect the course of complex, multi-organ diseases such as T2DM has been limited by a lack of well-designed, high-quality, large, and inclusive multimodal datasets. The AI-READI team of investigators aims to collect a cross-sectional dataset of 4,000 people and longitudinal data from 10% of the study cohort across the US. The study cohort will be balanced for diabetes disease stage. Data collection will be specifically designed to permit downstream pseudo-time manifold analysis, an approach used to predict disease trajectories by collecting and learning from complex, multimodal data from participants with differing disease severity (normal to insulin-dependent T2DM). The long-term objective of this project is to develop a foundational dataset in T2DM, agnostic to existing classification criteria or biases, which can be used to reconstruct a temporal atlas of T2DM development and reversal towards health (i.e., salutogenesis). Data will be optimized for downstream AI/ML research and made publicly available.

Description of the dataset

This dataset contains data from 1,067 participants, collected between July 19, 2023 and July 31, 2024. Data from multiple modalities are included; a full list is provided in the Data Standards section below. The data in this dataset contain no protected health information (PHI). Information related to the sex and race/ethnicity of the participants, as well as the medications they used, has also been removed.

The dataset contains 165,051 files and is around 2.01 TB in size.

A detailed description of the dataset is available in the AI-READI documentation for v2.0.0 of the dataset at docs.aireadi.org.

Protocol

The protocol followed for collecting the data can be found in the AI-READI documentation for v2.0.0 of the dataset at docs.aireadi.org.

Dataset access/restrictions

Accessing the dataset requires several steps, including:

  • Logging in through a verified ID system.
  • Agreeing to use the data only for type 2 diabetes related research.
  • Agreeing to the license terms, which set certain restrictions and obligations for data usage (see License section below).

Data standards followed

This dataset is organized following the Clinical Dataset Structure (CDS) v0.1.1; see the CDS documentation for more details. Briefly, data are organized at the root level into one directory per datatype (see the Table below). Within each datatype folder, there is one folder per modality. Within each modality folder, there is one folder per device used to collect that modality. Within each device folder, there is one folder per participant. Each datatype, modality, and device folder is given a descriptive name. Each participant folder is named after the participant's ID number used in the study. For each datatype, the data files follow the standards listed in the Table below. More details are available in the dataset_structure_description.json metadata file included in this dataset.
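
The four-level layout described above (datatype, modality, device, participant) can be traversed with a few lines of Python. The following is a minimal sketch, not part of the dataset's tooling: the names CdsLocation, parse_cds_path, and iter_participant_dirs are hypothetical, and the folder names in the usage note are invented placeholders. The actual folder names are given in dataset_structure_description.json and the CDS documentation.

```python
from pathlib import Path, PurePosixPath
from typing import Iterator, NamedTuple, Tuple


class CdsLocation(NamedTuple):
    """The four folder levels described above (hypothetical helper type)."""
    datatype: str
    modality: str
    device: str
    participant_id: str


def parse_cds_path(relative_path: str) -> CdsLocation:
    """Split a path relative to the dataset root into its four folder levels.

    Assumes the layout datatype/modality/device/participant/...; any deeper
    components (for example file names) are ignored.
    """
    parts = PurePosixPath(relative_path).parts
    if len(parts) < 4:
        raise ValueError(
            f"expected at least 4 path components, got {len(parts)}: {relative_path!r}"
        )
    return CdsLocation(*parts[:4])


def iter_participant_dirs(root: Path) -> Iterator[Tuple[CdsLocation, Path]]:
    """Yield (location, directory) for every participant folder under root.

    Files found at the fourth level are skipped; only directories count as
    participant folders. Root-level metadata files are never matched because
    the glob pattern requires four path components.
    """
    for path in sorted(root.glob("*/*/*/*")):
        if path.is_dir():
            yield parse_cds_path(path.relative_to(root).as_posix()), path
```

For example, with the invented placeholder path "retinal_oct/structural_oct/device_a/1001/scan.dcm", parse_cds_path would return a CdsLocation whose participant_id is "1001". Grouping the results of iter_participant_dirs by participant_id is one way to collect all modalities available for a single participant.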