SARS-CoV-2 SRA sequences search

What is it?

This service made it possible to search for the presence and abundance of any sequence within publicly available raw next-generation sequencing datasets (and genomes) of SARS-CoV-2.
The service has been discontinued (as of 2021), and no alternative exists. If you are interested in performing sequence searches, please feel free to contact us!


Input a DNA sequence to search:


Query sequence | Origin | Remarks
TCAAATTGGATGACAAAGATCCAAATTTCAA | NC_045512v2:29283-29313 | Just a chunk of the SARS-CoV-2 genome, illustrating that most datasets indeed have it at high abundances
AAAAAAAAAAAAAAAAAAAAAAAA | a poly-A tail | Likely to be found in RNA-Seq datasets
CTTTATCAGGATGTTAACTGC | NC_045512v2:23403 | famous variant site
CTTTATCAGGATGTTAACTGC | NC_045512v2:23405 | NOT a famous variant site
CTTTATCAGGATGTTAACTGC | NC_045512v2:23406 | also NOT a famous variant site
GAAGGTCTTAATGACAACCTT | NC_045512v2:1605-1607 | famous deletion site

Results, sequencing data


Results, assembled genomes

Technical details

Sequence searches are performed as follows: the query sequence is broken down into all its overlapping 21-mers, and if any of those 21-mers is absent from a dataset, the whole sequence is reported as absent from that dataset.
Otherwise, it is considered to be present. For raw sequencing datasets, we report the median abundance across all 21-mers of the query in the dataset. For genomes, we only report the presence/absence of the query.
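
The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the REINDEER implementation: the `index` dictionary (mapping each k-mer to its abundance in one dataset) is a hypothetical stand-in for the real index, the function names are ours, and strand handling (reverse complements) is ignored.

```python
from statistics import median

K = 21  # k-mer size used by the service


def kmers(seq, k=K):
    """Return all overlapping k-mers of seq, in order."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def query_abundance(query, index, k=K):
    """Return the median abundance of the query's k-mers, or None if absent.

    `index` is a hypothetical dict {k-mer: abundance} for a single dataset.
    The query is treated as a de-duplicated set of k-mers.
    """
    distinct = set(kmers(query, k))
    if not distinct:
        return None  # query shorter than k: nothing to look up
    counts = []
    for km in distinct:
        if km not in index:
            return None  # one missing k-mer makes the whole query absent
        counts.append(index[km])
    return median(counts)
```

For a genome, the same lookup would stop at the presence test and report only present/absent rather than a median.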

Thus, sequence searches are exact in the sense that they allow for no mutations between a query sequence and matching sequences in datasets.
However, this is not the same as doing a 'grep': a query is essentially seen as an unordered, de-duplicated set of 21-mers. For example, in the long poly-A sample query, all the constituent 21-mers are identical, so the query is performed as if it were a single 21-mer (disregarding the original query length).
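
The difference from a 'grep' can be made concrete with a small sketch. The code below is an illustration only: the helper names are ours, the poly-A length of 24 is arbitrary (any length of 21 or more behaves the same), and the second part uses toy reads with k=3 purely for readability.

```python
K = 21  # k-mer size used by the service


def kmer_set(seq, k=K):
    """Return the de-duplicated, unordered set of k-mers of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reported_present(query, dataset_kmers, k=K):
    """True if every k-mer of the query occurs somewhere in the dataset."""
    q = kmer_set(query, k)
    return bool(q) and q <= dataset_kmers


# A poly-A query of any length >= 21 collapses to the same single 21-mer.
same_query = kmer_set("A" * 24) == kmer_set("A" * 21)

# Toy example with k=3: the query's k-mers come from two separate reads.
reads = ["ACGTT", "GTTCA"]
dataset = set().union(*(kmer_set(r, 3) for r in reads))
query = "ACGTTCA"
found_by_kmers = reported_present(query, dataset, 3)  # all k-mers exist
found_by_grep = any(query in r for r in reads)        # no read contains it
```

In the toy case the k-mer test reports the query as present even though no single read contains it as a substring, which is exactly the sense in which the search is not a 'grep'.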


Get the k-mer centered at a given position in the Covid19 genome (k=21, NC_045512v2)
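
Since the service is no longer online, the same k-mer can be extracted locally. The following is a minimal sketch, assuming the reference genome has already been loaded into a Python string and that positions are 1-based, as in the NC_045512v2 coordinates above; the function name `centered_kmer` is hypothetical.

```python
def centered_kmer(genome, pos, k=21):
    """Return the k-mer centered on 1-based position `pos`.

    Returns None when the window would run past either end of the genome.
    `k` must be odd so that the position sits exactly in the middle.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd to center a k-mer on one position")
    half = k // 2
    start = pos - 1 - half  # convert to 0-based and step back half a window
    end = start + k
    if start < 0 or end > len(genome):
        return None
    return genome[start:end]
```

With k=21 the result contains the 10 bases before the position, the base itself, and the 10 bases after it.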


The technology behind this service is REINDEER (pre-print).
Contact for this website:
Department of Computational Biology, Institut Pasteur.
Website hosted by Information Systems at Institut Pasteur.
Project funded by ANR Transipedia (University Paris-Orsay, INSERM Montpellier, CNRS, Institut Pasteur) and INCEPTION (PIA/ANR-16-CONV-0005).
Part of the PANGAIA H2020-MSCA-RISE-2019 network.