Large File Formats

Ensembl supports a number of large (multi-gigabyte) file formats for datasets that are too large to upload directly to the Ensembl servers. The file remains on the remote server and is accessed via its URL, with only small chunks of data being requested by the Ensembl webcode at any one time.

All of these formats can be viewed in Ensembl by creating a custom data track using the 'Custom tracks' function.

When you select one of these formats, the form button will change to read 'Attach'.

YouTube channel

Track Hubs

In addition to individual files, we also support track hubs (also known as data hubs) - collections of remote data files that uses a simple text-based configuration.

For how to set up a hub, see the documentation provided by UCSC, the originators of this data service:

Please note that we do not currently support all the options documented by UCSC, though we intend to implement as many as are appropriate to our displays.

Note also that if you change the configuration of your hub, you will need to shift-reload the Ensembl page to load the new file.

BAM and CRAM formats

BAM is a compressed version of the SAM (Sequence Alignment/Map) binary format. BAM uses an index file to give fast access to small sections of the file. CRAM is an improvement on BAM, offering significantly better lossless compression whilst maintaining backwards compatibility with BAM.

Both formats use the new HTSLIB codebase - more information is available from the SAMtools project on github.

bigBED format

BigBED is an indexed format created from standard BED files using the bedToBigBed utility program. It allows large datasets to be processed much faster than a conventional BED file.

BigWig format

The BigWig format is designed for dense, continuous data that is intended to be displayed as a graph. Files can be created from WIG or BedGraph files using the appropriate utility program.

Ensembl currently allows the following configuration options to be set when you upload the file:

  • Track colour (10 choices)

VCF format

The VCF format is a tab delimited format for storing variant calls and and individual genotypes. It is able to store all variant calls from single nucleotide variants to large scale insertions and deletions.

More information on this format, which is still under development, can be obtained from the 1000 Genomes Wiki.

Please note that if you are attaching a large VCF file from a URL, you will need to have an index file with the extension .vcf.gz.tbi in the same directory as your data file (and with the same name)

In order to produce the indexed vcf file with the .gz.tbi extension you must follow the following steps:

  • Compress your vcf file using bgzip
  • Index the vcf.gz file using tabix. Use will need to pass the option -p vcf to tabix, for example "/usr/bin/tabix -p vcf my_file.vcf.gz"