HTAN Sequencing Data

HTAN supports multiple sequencing modalities including Single Cell and Single Nucleus RNA Seq (sc/snRNASeq), Single Cell ATAC Seq, Bulk RNA Seq and Bulk DNA Seq.

The HTAN standard for gene annotations is GENCODE Version 34. GENCODE is used for gene definitions by many consortia, including ENCODE, NCI Genomic Data Commons, Human Cell Atlas, and PCAWG (Pan-Cancer Analysis of Whole Genomes). Ensembl gene content is essentially identical to that of GENCODE (see the GENCODE FAQ), and interconversion between the two is possible.
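As an illustration of the interconversion mentioned above, the following minimal sketch reduces a GENCODE-style gene identifier to an unversioned Ensembl-style stable ID. It assumes the GENCODE convention of a version suffix (for example `.17`) and an optional `_PAR_Y` tag on Y-chromosome pseudoautosomal copies. The function name `to_stable_id` and the identifiers used to exercise it are hypothetical choices for this example, not part of any HTAN tooling.

```python
import re

# Assumed identifier layout: "ENS" + feature letters + 11 digits,
# optionally followed by ".<version>" and then "_PAR_Y".
_GENCODE_ID = re.compile(
    r"^(?P<stable>ENS[A-Z]*\d{11})(?:\.(?P<version>\d+))?(?P<par>_PAR_Y)?$"
)


def to_stable_id(gencode_id: str) -> str:
    """Return the unversioned stable ID for a GENCODE-style identifier."""
    match = _GENCODE_ID.match(gencode_id.strip())
    if match is None:
        raise ValueError(f"Unrecognized identifier: {gencode_id!r}")
    return match.group("stable")


if __name__ == "__main__":
    # Illustrative identifiers only; the version numbers are not taken from GENCODE 34.
    for example in ("ENSG00000141510.17", "ENSG00000182378.14_PAR_Y"):
        print(example, "->", to_stable_id(example))
```

Dropping the version loses information, so a pipeline that needs exact reproducibility may prefer to keep the original identifier in a separate column.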

HTAN has adopted the GENCODE 34 comprehensive gene annotation file in Gene Transfer Format (GENCODE 34 GTF) for HTAN gene annotation, along with two filtered files: GENCODE 34 GTF with genes only, and GENCODE 34 GTF with genes only that retains only the chromosome X copy of the pseudoautosomal region. Note that HTAN also includes data generated with other gene models, as the process of implementing the standard is ongoing. Within HTAN metadata files, the reference genome used can be found in the attributes “Genomic Reference” and “Genomic Reference URL”.
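To make the two filtered variants concrete, here is a minimal sketch of how a genes-only GTF could be derived from a comprehensive one. It is not the procedure HTAN used to produce its files. It assumes the standard nine-column, tab-separated GTF layout with the feature type in the third column, and it assumes that Y-chromosome pseudoautosomal copies can be recognized by a `_PAR_Y` suffix in the attributes column. The function name `filter_gene_records` is hypothetical.

```python
from typing import Iterable, Iterator


def filter_gene_records(lines: Iterable[str], drop_par_y: bool = False) -> Iterator[str]:
    """Yield header comments and 'gene' records from GTF lines.

    When drop_par_y is True, gene records whose attributes contain the
    assumed "_PAR_Y" marker are skipped, so only the chromosome X copy
    of each pseudoautosomal gene remains.
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header/comment lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # skip malformed rows and non-gene features
        if drop_par_y and "_PAR_Y" in fields[8]:
            continue
        yield line


if __name__ == "__main__":
    # Synthetic records for illustration; not real GENCODE content.
    sample = [
        "##description: example\n",
        'chrX\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "ENSG00000000001.1";\n',
        'chrX\tHAVANA\ttranscript\t100\t200\t.\t+\t.\tgene_id "ENSG00000000001.1";\n',
        'chrY\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "ENSG00000000001.1_PAR_Y";\n',
    ]
    for record in filter_gene_records(sample, drop_par_y=True):
        print(record, end="")
```

For a real file, the same generator can be fed an open file handle (for example one returned by `gzip.open(path, "rt")`), which keeps memory use constant.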

In alignment with The Cancer Genome Atlas and the NCI Genomic Data Commons, sequencing data are divided into four levels:

| Level | Definition | Example Data |
|-------|------------|--------------|
| 1 | Raw data | FASTQs, unaligned BAMs |
| 2 | Aligned primary data | Aligned BAMs |
| 3 | Derived biomolecular data | Gene expression matrix files, VCFs, etc. |
| 4 | Sample level summary data | t-SNE plot coordinates, etc. |
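The level definitions above can be encoded as a small lookup. The sketch below maps a filename to the levels it could plausibly belong to, using only the example data types listed in the table. The suffix mapping and the names `Level` and `candidate_levels` are assumptions made for illustration. Note that a BAM file cannot be assigned to Level 1 or Level 2 from its name alone, since both unaligned and aligned BAMs exist.

```python
from enum import IntEnum
from typing import Dict, FrozenSet, Set


class Level(IntEnum):
    RAW = 1        # raw data
    ALIGNED = 2    # aligned primary data
    DERIVED = 3    # derived biomolecular data
    SUMMARY = 4    # sample level summary data


# Hypothetical suffix-to-level mapping based on the table's examples.
_SUFFIX_LEVELS: Dict[str, FrozenSet[Level]] = {
    ".fastq.gz": frozenset({Level.RAW}),
    ".fastq": frozenset({Level.RAW}),
    ".fq.gz": frozenset({Level.RAW}),
    ".fq": frozenset({Level.RAW}),
    ".bam": frozenset({Level.RAW, Level.ALIGNED}),  # unaligned or aligned
    ".vcf.gz": frozenset({Level.DERIVED}),
    ".vcf": frozenset({Level.DERIVED}),
}


def candidate_levels(filename: str) -> Set[Level]:
    """Return the levels a file could belong to, or an empty set if unknown."""
    name = filename.lower()
    for suffix, levels in _SUFFIX_LEVELS.items():
        if name.endswith(suffix):
            return set(levels)
    return set()


if __name__ == "__main__":
    for fname in ("sample_R1.fastq.gz", "sample.bam", "calls.vcf", "notes.txt"):
        print(fname, sorted(level.name for level in candidate_levels(fname)))
```

A filename heuristic like this is only a first pass; the authoritative level of a file is the one recorded in its metadata.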
Data Schema:

| Attribute | Description |
|-----------|-------------|
| scRNA-seq Level 1 | Single-cell RNA-seq [EFO_0008913] |
| scRNA-seq Level 2 | Alignment workflows downstream of scRNA-seq Level 1 |
| scRNA-seq Level 3 | Gene and Isoform expression files |
| scRNA-seq Level 4 | Data represents the relationships between cells derived from Level 3 expression data and shown as tSNE or UMAP coordinates per cell, plus all other cell-specific meta information (e.g., cell type) |
| scATAC-seq Level 1 | scATAC-seq files containing sequence read information, with or without alignment, as FASTQ or BAM files |
| Bulk DNA Level 1 | Bulk Whole Exome Sequencing raw files |
| Bulk DNA Level 2 | Bulk Whole Exome Sequencing aligned files and QC |
| Bulk DNA Level 3 | Bulk Whole Exome Sequencing called variants |
| Bulk RNA-seq Level 1 | Bulk RNA-seq [EFO_0003738] |
| Bulk RNA-seq Level 2 | Bulk RNA-seq alignment protocol description |
| Bulk RNA-seq Level 3 | Bulk RNA-seq gene expression matrices |
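The schema attribute names listed above follow an "assay, then Level, then number" pattern. The sketch below splits such a name into its assay and level so that files can be grouped by either one. The names `Component` and `parse_component` are hypothetical, and the pattern is inferred only from the names shown in this section.

```python
import re
from typing import NamedTuple


class Component(NamedTuple):
    assay: str
    level: int


# Inferred pattern: "<assay> Level <1-4>", e.g. "scRNA-seq Level 3".
_COMPONENT = re.compile(r"^(?P<assay>.+?)\s+Level\s+(?P<level>[1-4])$")


def parse_component(name: str) -> Component:
    """Split a schema attribute name into its assay and numeric level."""
    match = _COMPONENT.match(name.strip())
    if match is None:
        raise ValueError(f"Not a recognized component name: {name!r}")
    return Component(match.group("assay"), int(match.group("level")))


if __name__ == "__main__":
    for name in ("scRNA-seq Level 3", "Bulk DNA Level 2", "scATAC-seq Level 1"):
        print(parse_component(name))
```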