Genomic Data Science Core Data Storage Policy

Data storage agreement

Last updated: Nov 2020

Typical analyses of high-dimensional genomics datasets produce a large number of files, many of which are very large. Some are intermediate files (e.g. read alignments in BAM format) that hold information from a single sample and may not be required for downstream data analysis. Others summarize the final output of a pipeline or analysis workflow and may aggregate data from all samples in a given experiment (e.g. a matrix of raw read counts for an RNA-seq experiment). The table below gives examples of each category for several data analysis workflows.

Due to the large size of many files generated during standard analysis pipelines, the GDSC is unable to store intermediate and final output files indefinitely. From the date we return the results of an analysis to a core user, we will retain intermediate files for 1 month and final output files for 1 year, after which these files may be deleted. It is the responsibility of individual users to develop a long-term storage plan for their data.
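As an illustration of the retention windows above, the following minimal Python sketch computes the earliest date on which each category of file may be deleted. It assumes that "1 month" and "1 year" mean calendar months counted from the date results are returned, and that a day missing from the target month is clamped to that month's last day; the policy does not specify either detail. The function names and the example date are hypothetical.

```python
import calendar
from datetime import date

# Retention windows from the policy, expressed in calendar months.
RETENTION_MONTHS = {"intermediate": 1, "final": 12}


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    Assumption: if the target month is shorter than the start day,
    the day is clamped to the last day of the target month.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def earliest_deletion_date(returned_on: date, category: str) -> date:
    """Earliest date a file of the given category may be deleted."""
    return add_months(returned_on, RETENTION_MONTHS[category])


# Hypothetical example: results returned on 30 Nov 2020.
returned = date(2020, 11, 30)
print(earliest_deletion_date(returned, "intermediate"))  # 2020-12-30
print(earliest_deletion_date(returned, "final"))         # 2021-11-30
```

Users planning long-term storage would want to copy intermediate files well before the first date and final files before the second.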

Generally, our analyses are performed on Dartmouth’s high-performance computing cluster and stored using the DartFS file management system. We commonly use DartFS to share the results of an analysis with users, and encourage Dartmouth users to sign up for a personal DartFS account through Research Computing. In addition to 50GB of personal storage space, independent Dartmouth research labs have access to an additional free terabyte of storage, which includes periodic snapshots and offsite backups. Additional space is available for purchase from Research Computing. Visit the Research Computing website for their current policies and pricing.

For DartFS users, the GDSC is happy to facilitate the transfer of analysis files and results to their local DartFS directories. Furthermore, given the integrated nature of the DartFS system, GDSC personnel can be granted access to lab directories, enabling the core to easily continue work on a project after the initial results have been returned.

All summary reports (e.g. QC or analysis summaries) and the analysis code used to generate the results will be kept indefinitely. These can be used to regenerate files from a previous analysis that have since been deleted; however, there may be a cost associated with doing so.

Note: Raw next-generation sequencing or array data generated by the Genomics Core, in FASTQ or IDAT format respectively, is stored by the Norris Cotton Cancer Center (NCCC) Genomics Shared Resource (GSR) on DartFS. It is kept on high-speed network storage for 1 year, after which it is moved to the slower standard-performance network storage for an additional 4 years before being deleted.
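The raw data schedule in the note above can be sketched the same way. This is a hedged illustration only: it assumes the clock starts on the date the data were generated, which the note does not state, and the function names and tier labels are hypothetical.

```python
from datetime import date


def add_years(start: date, years: int) -> date:
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # Feb 29 does not exist in the target year.
        return start.replace(year=start.year + years, day=28)


def raw_data_tier(generated_on: date, today: date) -> str:
    """Classify raw data by storage tier: 1 year fast, then 4 more years standard."""
    fast_until = add_years(generated_on, 1)       # end of high-speed storage
    standard_until = add_years(generated_on, 5)   # 1 + 4 years in total
    if today < fast_until:
        return "high-speed"
    if today < standard_until:
        return "standard"
    return "past retention"


# Hypothetical example: data generated on 1 Nov 2020, checked on 1 Jan 2023.
print(raw_data_tier(date(2020, 11, 1), date(2023, 1, 1)))  # standard
```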

Analysis type: RNA-seq (preprocessing)
Intermediate files:
· Processed reads (.FASTQ)
· Alignments (.BAM)
· Quantification results (HTSeq-count or RSEM output)
Final files:
· Quality control report
· Matrix of raw read counts

Analysis type: RNA-seq (differential expression)
Intermediate files:
· SummarizedExperiment format objects used to perform DE analyses (.Rdata/.RDS)
Final files:
· Normalized read counts
· Spreadsheets of differential expression results
· Summary level results figures (e.g. volcano plots)

Analysis type: ATAC-seq
Intermediate files:
· Processed sequence reads (.FASTQ)
· Standard alignments (.BAM) and Tn5 offset alignments
Final files:
· Quality control report
· Matrix of raw read counts
· Called peaks (.BED)
· Normalized signal tracks (.BIGWIG)

Analysis type: Variant calling (SNV & INDELs)
Intermediate files:
· Processed reads (.FASTQ)
· Read alignments (.BAM)
· Unfiltered variant calls (.VCF)
Final files:
· Quality control report
· Filtered variant calls (.VCF)

Analysis type: DNA methylation arrays (EPIC)
Intermediate files:
· Intermediate probe intensity normalization and background correction values (.Rdata)
Final files:
· Quality control report
· Final beta-value matrix
· Spreadsheets of differential methylation results
· Summary level results figures