
SVFilter

Introduction

Genomic structural variations (SVs), including large deletions, insertions, inversions, duplications and translocations, constitute an important source of genetic diversity. Recent advances in next-generation sequencing (NGS) technologies and computational algorithms have enabled the genome-wide mapping of SVs at a fine resolution. However, the false discovery rate of current SV discovery programs remains high. We have developed the following five filters that can be used to efficiently identify false SVs.
  • Ratio filter - filter based on the ratio of normal and abnormal reads. Within an anchoring window where abnormal reads cluster and form an SV, if a substantial number of normal reads that share the same orientation as the abnormal reads are also present, then this SV is discarded.

  • SNV filter - filter based on the SNVs between normal and abnormal reads. Within an anchoring window, if SNV(s) can be detected between normal and abnormal reads, then this SV is removed.

  • Gap filter - filter based on gaps in the identified SVs. If an SV spans a genomic region that contains gap(s), then this SV is discarded. Such an SV spans at least two contigs or even two scaffolds.

  • Read coverage filter - filter based on the read coverage of the potential SV by normal reads. This filter can only be applied to deletion events. If a substantial fraction of the deleted region is covered by normal reads, then this candidate deletion is removed.

  • Sequencing depth filter - filter based on the sequencing depth of the potential SV region. This filter can be applied to tandem duplication events. If the average sequencing depth over the duplicated segment is not significantly higher than the genome-wide average, then the predicted tandem duplication is rejected.
Please check this figure for further explanation of these filters.

System requirement and dependencies

  • Linux (required) - Mac OS X is not supported
  • samtools

Installation

Download the SVFilter and unzip the downloaded file.

$ tar -xzvf SVFilter-1.0.tar.gz

This will generate a directory named "SVFilter-1.0". The directory contains three subdirectories:
  • bin directory: includes all executables.
  • test_files directory: includes all necessary input files to test the SV filter programs
  • src directory: includes C++ source codes.
The executables under the "bin" directory were pre-compiled on a 64-bit Linux machine. On a 32-bit Linux machine, users need to compile the C++ source code and then move the executables to the "bin" directory. This can be done by running the "install.sh" shell script (sh install.sh) provided in the package.

Next, add the "bin" directory to the PATH environment variable.

SV file format

Each of the SV filters requires an input file in tab-delimited text format that contains a list of SVs (deletion, insertion, inversion, duplication, etc.). Here is an example of the file. Each SV record contains 11 fields, explained below:

| Field No. | Field name | Example value | Notes |
|---|---|---|---|
| 1 | chromosome/scaffold ID | chr1 | Fields 1 to 4 define the left anchoring window of the SV: its location (chromosome/scaffold, start and end coordinates) and the orientation of the abnormal reads mapped within the window (R for reverse, F for forward) |
| 2 | start position | 18099267 | |
| 3 | end position | 18099607 | |
| 4 | read strand | R | |
| 5 | chromosome/scaffold ID | chr1 | Fields 5 to 8 define the right anchoring window of the SV |
| 6 | start position | 18100733 | |
| 7 | end position | 18101053 | |
| 8 | read strand | F | |
| 9 | number of abnormal pairs | 4 | |
| 10 | abnormal read IDs | (FC42CA5AAXX:5:51:1514:1044#0, FC42CA5AAXX:5:95:188:1715#0, FC42CA5AAXX:5:23:1419:1684#0, FC42CA5AAXX:5:78:1132:1599#0) | This field lists the IDs of the abnormal read pairs, one for each pair. The IDs must be enclosed in parentheses and separated by commas |
| 11 | SV type | DELETION | The type of the SV, including DELETION, INSERTION, INVERSION and LARGE_DUPLI |
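
As an illustration of the 11-field layout, the following minimal Python sketch parses one tab-delimited SV line into a small record. The names `AnchorWindow`, `SVRecord` and `parse_sv_line` are hypothetical and are not part of SVFilter; the sketch only assumes the field order described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnchorWindow:
    chrom: str
    start: int
    end: int
    strand: str  # "F" (forward) or "R" (reverse)


@dataclass
class SVRecord:
    left: AnchorWindow
    right: AnchorWindow
    n_abnormal_pairs: int
    read_ids: List[str]
    sv_type: str


def parse_sv_line(line: str) -> SVRecord:
    """Parse one tab-delimited SV line (hypothetical helper, not SVFilter code)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 tab-delimited fields, got {len(fields)}")
    ids_field = fields[9].strip()
    if not (ids_field.startswith("(") and ids_field.endswith(")")):
        raise ValueError("read IDs must be enclosed in parentheses")
    # Strip the parentheses, then split the comma-separated IDs.
    read_ids = [rid.strip() for rid in ids_field[1:-1].split(",") if rid.strip()]
    return SVRecord(
        left=AnchorWindow(fields[0], int(fields[1]), int(fields[2]), fields[3]),
        right=AnchorWindow(fields[4], int(fields[5]), int(fields[6]), fields[7]),
        n_abnormal_pairs=int(fields[8]),
        read_ids=read_ids,
        sv_type=fields[10],
    )


# The example values from the field table, joined with tabs.
example = "\t".join([
    "chr1", "18099267", "18099607", "R",
    "chr1", "18100733", "18101053", "F",
    "4",
    "(FC42CA5AAXX:5:51:1514:1044#0, FC42CA5AAXX:5:95:188:1715#0, "
    "FC42CA5AAXX:5:23:1419:1684#0, FC42CA5AAXX:5:78:1132:1599#0)",
    "DELETION",
])
record = parse_sv_line(example)
print(record.sv_type, record.left.start, len(record.read_ids))
```

A line with a missing field or with unparenthesized read IDs raises `ValueError`, which can help catch formatting problems before a filter is run.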

Run SVFilter

The five filters implemented in SVFilter are run separately. Each filter generates two output files, one containing the list of SVs that pass the filter (kept) and one containing the list of SVs that are discarded by the filter. The file containing the kept SVs is in the format described above and can be used as the input for other filters. Click here for detailed description of the files containing discarded SVs.

Ratio filter - run the program:

$ ratiofilter test_SV normPair.sam 0.2 75

Explanation of the parameters

| Parameter | Example value | Description |
|---|---|---|
| 1 | test_SV | SV input file containing a list of SVs in the format described above. |
| 2 | normPair.sam | File containing the alignments of normal paired-end reads in SAM format. |
| 3 | 0.2 | Cutoff ratio between the numbers of normal reads and abnormal reads within an anchoring window. The normal reads must have the same orientation as the abnormal reads. The SV input file specifies the number of abnormal read pairs supporting each SV. Within one (or both) of the two anchoring windows, if the ratio between the number of normal reads and abnormal reads with the same orientation exceeds the cutoff value, then the SV is considered a false positive and discarded. |
| 4 | 75 | Read length (bp). |
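
The decision rule for parameter 3 can be sketched in Python as follows. This is an illustration only, not SVFilter's implementation: the function name is hypothetical, the normal-read counts per window are taken as given (how reads are assigned to windows, and how the read length is used, is not modeled), and the number of abnormal reads in a window is assumed to equal the number of abnormal pairs in field 9.

```python
def ratio_filter_discards(n_abnormal_pairs: int,
                          normal_left: int,
                          normal_right: int,
                          cutoff: float) -> bool:
    """Return True if the SV would be discarded under the ratio rule.

    normal_left / normal_right: numbers of normal reads in the left and right
    anchoring windows that share the orientation of the abnormal reads.
    Hypothetical helper; the counts are assumed to be computed elsewhere.
    """
    if n_abnormal_pairs <= 0:
        raise ValueError("the SV must be supported by at least one abnormal pair")
    # Discard if the ratio exceeds the cutoff in one (or both) windows.
    return any(n / n_abnormal_pairs > cutoff for n in (normal_left, normal_right))


# 4 abnormal pairs, cutoff 0.2: a single same-orientation normal read in the
# left window gives 1 / 4 = 0.25, which exceeds the cutoff.
print(ratio_filter_discards(4, 1, 0, 0.2))
print(ratio_filter_discards(4, 0, 0, 0.2))
```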


SNV filter - run the program:

$ SNVfilter test_sv genome.fa abnorm-pair.sam normPair.sam 2

Explanation of the parameters

| Parameter | Example value | Description |
|---|---|---|
| 1 | test_SV | SV input file containing a list of SVs in the format described above. |
| 2 | genome.fa | Sequences of the reference genome in FASTA format. |
| 3 | abnormPair.sam | File containing the alignments of abnormal paired-end reads in SAM format. |
| 4 | normPair.sam | File containing the alignments of normal paired-end reads in SAM format. |
| 5 | 2 | Minimum read depth to detect SNVs between abnormal and normal reads. If at least one SNV is detected between abnormal and normal reads in one (or both) anchoring windows, the SV is considered a false positive and discarded. |


Gap filter - run the program:

$ gapfilter test_sv genome.fa 1 0.1

Explanation of the parameters

| Parameter | Example value | Description |
|---|---|---|
| 1 | test_SV | SV input file containing a list of SVs in the format described above. |
| 2 | genome.fa | Sequences of the reference genome in FASTA format. |
| 3 | 1 | Cutoff of gap size (number of 'N's) within the SV. |
| 4 | 0.1 | Minimum fraction of the SV region occupied by gaps (ratio between gap size and SV size). |

For example, if an SV is 2500 bp long and 300 bp within the SV region are 'N's (gaps), the SV is discarded because 300 > 1 and 300 / 2500 = 0.12 > 0.1. If parameter 3 is set to 1 and parameter 4 is set to a negative value, the SV is discarded whenever the SV region contains a gap.
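
The worked example above can be reproduced with a short Python sketch. The function names are hypothetical and this is not SVFilter's code. Both cutoffs are treated as inclusive minimums (`>=`), which is an assumption: the text writes the example with strict inequalities and does not say how exact ties are handled. Counting lowercase 'n' as a gap base is also an assumption.

```python
def count_gap_bases(sequence: str) -> int:
    """Count gap bases in a reference sequence (upper- or lowercase N)."""
    return sum(1 for base in sequence if base in "Nn")


def gap_filter_discards(sv_size: int,
                        gap_size: int,
                        min_gap_size: int,
                        min_gap_fraction: float) -> bool:
    """Return True if the SV would be discarded under the gap rule."""
    if sv_size <= 0:
        raise ValueError("SV size must be positive")
    # Both conditions must hold: enough gap bases, and a large enough fraction.
    return gap_size >= min_gap_size and gap_size / sv_size >= min_gap_fraction


# The example from the text: 2500 bp SV containing 300 gap bases.
sv_sequence = "A" * 2200 + "N" * 300
gaps = count_gap_bases(sv_sequence)
print(gap_filter_discards(len(sv_sequence), gaps, 1, 0.1))

# Parameter 3 = 1 with a negative parameter 4: any gap triggers the discard.
print(gap_filter_discards(2500, 1, 1, -1.0))
print(gap_filter_discards(2500, 0, 1, -1.0))
```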


Read coverage filter - run the program (this filter is only applicable to deletion events):

$ coveragefilter test_sv genome.fa test.pileup 6 0.05 2

Explanation of the parameters

| Parameter | Example value | Description |
|---|---|---|
| 1 | test_SV | SV input file containing a list of SVs in the format described above. |
| 2 | genome.fa | Sequences of the reference genome in FASTA format. |
| 3 | test.pileup | Pileup file generated only from normal reads. The file is used to check whether the putative deleted region contains normally mapped reads. The file is generated from a BAM file using the "samtools mpileup" utility. |
| 4 | 6 | Minimum number of base pairs in the deleted region that are covered by normal reads. |
| 5 | 0.05 | Minimum fraction of the deleted region that is covered by normal reads. |
| 6 | 2 | Minimum depth of normal reads in the deleted region. |

For example, if a deleted region is 2500 bp long and 300 bp within it are covered by normal reads at a depth of at least 2X, the SV is discarded because 300 > 6 and 300 / 2500 = 0.12 > 0.05.
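
A minimal Python sketch of this rule is shown below. It is an illustration under stated assumptions rather than SVFilter's implementation: the function name is hypothetical, the per-base depths of normal reads over the deleted region are assumed to have been extracted from the pileup file already, and the cutoffs are treated as inclusive minimums.

```python
from typing import Sequence


def coverage_filter_discards(depths: Sequence[int],
                             min_covered_bp: int,
                             min_covered_fraction: float,
                             min_depth: int) -> bool:
    """Return True if the candidate deletion would be discarded.

    depths: depth of normal reads at each base of the putative deleted region
    (positions absent from the pileup are assumed to have depth 0).
    """
    if not depths:
        raise ValueError("the deleted region must contain at least one base")
    # A base counts as covered when its depth reaches the minimum depth.
    covered = sum(1 for d in depths if d >= min_depth)
    return (covered >= min_covered_bp
            and covered / len(depths) >= min_covered_fraction)


# The example from the text: 2500 bp region, 300 bp covered at 2X.
region = [2] * 300 + [0] * 2200
print(coverage_filter_discards(region, 6, 0.05, 2))

# The same region covered only at 1X does not reach the minimum depth of 2.
shallow = [1] * 300 + [0] * 2200
print(coverage_filter_discards(shallow, 6, 0.05, 2))
```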


Sequencing depth filter - run the program:

$ depthfilter test_sv chr-length test.pileup 1.5 1.5

Explanation of the parameters

| Parameter | Example value | Description |
|---|---|---|
| 1 | test_SV | SV input file containing a list of SVs in the format described above. |
| 2 | chr-length | File listing the length of each chromosome/scaffold. Each line consists of a chromosome/scaffold ID and its length, delimited by a tab. |
| 3 | test.pileup | Pileup file generated only from normal reads. The file is used to calculate sequencing depths. The file is generated from a BAM file using the "samtools mpileup" utility. |
| 4 | 1.5 | Minimum ratio between the average sequencing depth in the duplicated region and that over the entire genome. |
| 5 | 1.5 | Minimum ratio between the median sequencing depth in the duplicated region and that over the entire genome. |

The filter calculates the average and median sequencing depth of the duplicated region and of the entire genome. If the ratio of the average depths (duplicated region to entire genome) is below its cutoff (1.5), or the ratio of the median depths is below its cutoff (1.5), then the duplication event is discarded.
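
The two-ratio test can be sketched in Python as follows. The function name is hypothetical and the genome-wide average and median depths are passed in as precomputed numbers; how SVFilter derives them from the pileup and chromosome-length files is not modeled here.

```python
from statistics import mean, median
from typing import Sequence


def depth_filter_discards(region_depths: Sequence[float],
                          genome_mean: float,
                          genome_median: float,
                          min_mean_ratio: float,
                          min_median_ratio: float) -> bool:
    """Return True if the predicted duplication would be discarded."""
    if genome_mean <= 0 or genome_median <= 0:
        raise ValueError("genome-wide depths must be positive")
    mean_ratio = mean(region_depths) / genome_mean
    median_ratio = median(region_depths) / genome_median
    # Failing either ratio is enough to reject the duplication.
    return mean_ratio < min_mean_ratio or median_ratio < min_median_ratio


# Genome-wide depth of 30X (mean and median), both cutoffs 1.5.
print(depth_filter_discards([60] * 100, 30, 30, 1.5, 1.5))   # both ratios 2.0
print(depth_filter_discards([40] * 100, 30, 30, 1.5, 1.5))   # both ratios about 1.33

# A few very deep positions raise the mean (78 / 30 = 2.6) but not the
# median (30 / 30 = 1.0), so the median test still rejects the call.
spiky = [30] * 60 + [150] * 40
print(depth_filter_discards(spiky, 30, 30, 1.5, 1.5))
```

Requiring both ratios to pass makes the call more robust to a small number of positions with extreme depth, as the last case shows.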

Frequently Asked Questions (FAQs)

  1. I obtained a set of SVs using breakpoint-based approaches such as Pindel. Can I use SVFilter to identify potential false SVs?
    Answer: Yes, SVFilter can be used. In this case, the start and end positions of the left and right anchoring windows are collapsed to the left and right breakpoints, respectively. Because both anchoring windows are collapsed into single positions, the ratio and SNV filters cannot be used, while the gap, coverage and depth filters can be applied to SVs identified with breakpoint-based approaches. In the SV input file, it is not necessary to specify read IDs. You can just put "(NA)" in the corresponding field. Here is an example (fields are tab-delimited):

    Chr1	884482	884482	F	Chr1	886071	886071	R	6	(NA)	DELETION

  2. Do I have to provide read IDs in the SV input file?
    Answer: Only the SNV filter requires read IDs, because it has to derive genotype information from the abnormal reads. Read IDs are not required for the other four filters, and you can just put "(NA)" in the read ID field when using them.
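
As a rough illustration of the first answer, the Python sketch below builds an SV input line from a breakpoint-based call by collapsing each anchoring window to a single position and writing "(NA)" for the read IDs. The function name and its arguments are hypothetical; the strands and the support count must be supplied by the caller, and mapping a breakpoint caller's support count to field 9 is an assumption.

```python
def breakpoint_sv_line(chrom: str,
                       left_breakpoint: int,
                       right_breakpoint: int,
                       left_strand: str,
                       right_strand: str,
                       n_support: int,
                       sv_type: str) -> str:
    """Format a breakpoint-based SV as an 11-field tab-delimited line."""
    fields = [
        chrom, left_breakpoint, left_breakpoint, left_strand,     # left window
        chrom, right_breakpoint, right_breakpoint, right_strand,  # right window
        n_support,
        "(NA)",  # read IDs are not needed for gap, coverage and depth filters
        sv_type,
    ]
    return "\t".join(str(f) for f in fields)


# Values taken from the FAQ example above.
line = breakpoint_sv_line("Chr1", 884482, 886071, "F", "R", 6, "DELETION")
print(line)
```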

Download

Download SVFilter from the ftp server

Contact

For questions and suggestions, please contact us at bioinfo@cornell.edu