To maximize the utility of the data HTAN generates, we have defined a structured schema covering all data, the associated metadata, and their relationships. HTAN has been fortunate to start at a time when many strong data modeling examples are available from other projects, including The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC) and the NCI Genomic Data Commons (GDC). HTAN follows in those footsteps and strives to remain compatible with other relevant platforms, such as the Human Cell Atlas (HCA) Data Coordination Platform and the Human Biomolecular Atlas Program (HuBMAP).
HTAN data is generated by 12 different atlases. An atlas is a group of people from one or more institutes who study a specific cancer type. All data is associated with an atlas.
The clinical metadata on Research Participants and BioSpecimens follows a tiered approach: the first tier contains the most common metadata, and higher tiers are more specific and depend on particular use cases. See e.g. the organization of clinical data:
The other types of data follow a leveled approach, similar to TCGA, where raw data is level 1 and higher levels contain progressively more processed data. E.g. for single-cell RNA-seq:
| Level | Description |
|---|---|
| Level 1 | Raw primary data, e.g. FASTQs and unaligned BAMs |
| Level 2 | Aligned primary data, e.g. aligned BAMs |
| Level 3 | Derived biomolecular data, e.g. gene expression matrix file |
| Level 4 | Sample level summary, e.g. t-SNE plot coordinates |
| Level 5 | Cohort level summary, e.g. significantly mutated genes |
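
The level scheme in the table can be sketched as an ordered enumeration. This is only an illustration, not part of the HTAN tooling: the names `DataLevel`, `DESCRIPTIONS`, `describe` and `is_more_processed` are hypothetical, and the descriptions are copied from the table above.

```python
from enum import IntEnum


class DataLevel(IntEnum):
    """Hypothetical encoding of the five data levels; names are assumptions."""
    RAW = 1
    ALIGNED = 2
    DERIVED = 3
    SAMPLE_SUMMARY = 4
    COHORT_SUMMARY = 5


# Descriptions taken from the level table for single-cell RNA-seq.
DESCRIPTIONS = {
    DataLevel.RAW: "Raw primary data",
    DataLevel.ALIGNED: "Aligned primary data",
    DataLevel.DERIVED: "Derived biomolecular data",
    DataLevel.SAMPLE_SUMMARY: "Sample level summary",
    DataLevel.COHORT_SUMMARY: "Cohort level summary",
}


def describe(level: int) -> str:
    """Return the description for a level number; raises ValueError if unknown."""
    return DESCRIPTIONS[DataLevel(level)]


def is_more_processed(a: DataLevel, b: DataLevel) -> bool:
    """True if level `a` is further processed than level `b`."""
    return a > b
```

Using an `IntEnum` keeps the levels comparable, so "higher level means further processed" can be checked directly with `>`.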
HTAN uses Bioschemas to define the data model. Bioschemas extends schema.org, a community effort used by many search engines that provides a way to describe information with properties. Bioschemas defines profiles over types that state which properties must be used (minimum), should be used (recommended), and could be used (optional). HTAN and other consortia, including the Human Cell Atlas and HuBMAP, are working together to provide common shared schemas. More information is available at bioschemas.org.
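
To illustrate how a profile's minimum, recommended and optional properties could be applied to a metadata record, here is a minimal sketch. It does not use any Bioschemas or HTAN library; `Profile`, `check_record` and `EXAMPLE_PROFILE` are hypothetical names, and the choice of properties in the example profile is an assumption made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical stand-in for a profile over a type."""
    name: str
    minimum: frozenset                  # properties that must be used
    recommended: frozenset = frozenset()  # properties that should be used
    optional: frozenset = frozenset()     # properties that could be used


def check_record(record: dict, profile: Profile) -> dict:
    """Compare the keys of a record against the profile's property sets."""
    present = set(record)
    known = profile.minimum | profile.recommended | profile.optional
    return {
        "missing_minimum": sorted(profile.minimum - present),
        "missing_recommended": sorted(profile.recommended - present),
        "unknown": sorted(present - known),
    }


# Illustrative profile only; not an actual HTAN or Bioschemas profile.
EXAMPLE_PROFILE = Profile(
    name="ExampleBiospecimen",
    minimum=frozenset({"identifier", "name"}),
    recommended=frozenset({"description"}),
    optional=frozenset({"url"}),
)

report = check_record({"identifier": "X1", "description": "demo"}, EXAMPLE_PROFILE)
is_valid = not report["missing_minimum"]
```

In this sketch a record counts as valid only when no minimum property is missing; missing recommended properties are reported but do not invalidate the record.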