SARS-CoV-2 SRA sequences search

What is it?

This service allowed users to search for the presence and abundance of any sequence within publicly available raw next-generation sequencing datasets (and genomes) of SARS-CoV-2.
The service has been discontinued (as of 2021), and no alternative exists. If you are interested in performing sequence searches, please feel free to contact us!

Search

Input a DNA sequence to search:

Examples

Query sequence	Origin	Remarks
TCAAATTGGATGACAAAGATCCAAATTTCAA	NC_045512v2:29283-29313	Just a chunk of the SARS-CoV-2 genome, illustrating that most datasets indeed have it at high abundances
AAAAAAAAAAAAAAAAAAAAAAAA	a poly-A tail	Likely to be found in RNA-Seq datasets
CTTTATCAGGATGTTAACTGC	NC_045512v2:23403	famous variant site More info
CTTTATCAGGATGTTAACTGC	NC_045512v2:23405	NOT a famous variant site More info
CTTTATCAGGATGTTAACTGC	NC_045512v2:23406	also NOT a famous variant site
GAAGGTCTTAATGACAACCTT	NC_045512v2:1605-1607	famous deletion site More info

Results, sequencing data

Plot controls: X-axis, labels, and sort order. Metadata fields: Date, SRA IDs, Seq Technology, Sample Type, Number of reads, Country, Continent.

Results, assembled genomes

Technical details

Sequence searches are performed as follows: the query sequence is broken down into all of its overlapping 21-mers, and if any of those 21-mers is absent from a dataset, the whole sequence is reported as absent from that dataset.
Otherwise, the sequence is considered present. For raw sequencing datasets, we report the median abundance across all 21-mers of the query in the dataset. For genomes, we only report the presence/absence of the query.
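
The search rule above can be sketched in a few lines of Python. This is only an illustrative toy, not the service's implementation: the plain dictionary index and the names `kmers`, `build_index` and `search` are assumptions made for the example, abundances are simple occurrence counts, and strand (reverse complements) is ignored.

```python
from statistics import median

K = 21  # k-mer size stated in the text


def kmers(seq, k=K):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(reads, k=K):
    """Toy index (assumption): k-mer -> number of occurrences across all reads."""
    counts = {}
    for read in reads:
        for km in kmers(read, k):
            counts[km] = counts.get(km, 0) + 1
    return counts


def search(query, index, k=K):
    """Return None if the query is absent, else its median k-mer abundance."""
    distinct = set(kmers(query, k))  # the query is treated as a set of k-mers
    if not distinct:
        raise ValueError("query must be at least k bases long")
    # A single missing k-mer makes the whole query absent.
    if any(km not in index for km in distinct):
        return None
    return median(index[km] for km in distinct)


# Toy dataset: two full copies of the first example query and one truncated copy.
chunk = "TCAAATTGGATGACAAAGATCCAAATTTCAA"
index = build_index([chunk, chunk, chunk[:25]])
print(search(chunk, index))       # median abundance over the query's 21-mers
print(search("A" * 21, index))    # None: this 21-mer is not in the toy dataset
```

For an assembled genome, the same check would stop at the presence test and skip the median.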

Thus, sequence searches are exact in the sense that they allow for no mutations between a query sequence and matching sequences in datasets.
However, this is not the same as doing a 'grep': a query is essentially seen as an unordered, de-duplicated set of 21-mers. For example, in the poly-A example query, all the constituent 21-mers are identical, so the query is performed as if it were only a single 21-mer (disregarding the original query length).
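
To make the difference with 'grep' concrete, here is a small hedged sketch using the poly-A example. The helper names `kmer_set`, `kmer_match` and `grep_match` and the single toy read are hypothetical and chosen only for illustration.

```python
K = 21  # k-mer size stated in the text


def kmer_set(seq, k=K):
    """Unordered, de-duplicated set of the k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_match(query, reads, k=K):
    """True if every distinct k-mer of the query occurs somewhere in the reads."""
    wanted = kmer_set(query, k)
    if not wanted:
        raise ValueError("query must be at least k bases long")
    present = set()
    for read in reads:
        present |= kmer_set(read, k)
    return wanted <= present


def grep_match(query, reads):
    """True if the whole query occurs as a substring of some read."""
    return any(query in read for read in reads)


poly_a = "A" * 24                    # the poly-A example query
reads = ["CG" + "A" * 21 + "GT"]     # toy read with a run of exactly 21 A's

print(len(kmer_set(poly_a)))         # 1: all four overlapping 21-mers are identical
print(kmer_match(poly_a, reads))     # True: the single 21-mer is present
print(grep_match(poly_a, reads))     # False: no read contains 24 consecutive A's
```

The same effect applies to any query whose 21-mers are all present in a dataset but never occur contiguously in one read.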

Misc

Get the k-mer centered at a given position in the SARS-CoV-2 genome (k=21, NC_045512v2)
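
With the reference sequence at hand, the same extraction can be done locally. The sketch below assumes 1-based genome coordinates and an odd k, so that the position sits exactly in the middle with 10 bases on each side for k=21. The function name `centered_kmer` and the toy genome are hypothetical, and the real NC_045512v2 sequence would have to be loaded separately.

```python
def centered_kmer(genome, position, k=21):
    """Return the k-mer of genome centered at a 1-based position (assumed convention)."""
    if k % 2 == 0:
        raise ValueError("k must be odd for the k-mer to have a central base")
    half = k // 2
    start = position - 1 - half   # 0-based start of the window
    end = position + half         # exclusive end of the window
    if start < 0 or end > len(genome):
        raise ValueError("position is too close to a genome end for this k")
    return genome[start:end]


# Toy genome (not the real reference): one of the example 21-mers padded on both sides.
toy_genome = "G" * 5 + "CTTTATCAGGATGTTAACTGC" + "T" * 5
kmer = centered_kmer(toy_genome, 16)
print(kmer)        # the embedded 21-mer
print(kmer[10])    # its central base
```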

Contact

The technology behind this service is REINDEER (pre-print).
Contact for this website: rayan.chikhi@pasteur.fr
Department of Computational Biology, Institut Pasteur.
Website hosted by Information Systems at Institut Pasteur.
Project funded by ANR Transipedia (University Paris-Orsay, INSERM Montpellier, CNRS, Institut Pasteur) and INCEPTION (PIA/ANR-16-CONV-0005).
Part of the PANGAIA H2020-MSCA-RISE-2019 network.
Contributors