NYU Bioinformatics Lab Courant Institute of Mathematical Sciences

Feature-Response Curve

Assembly Metric and Analysis Tool

Description

Inspired by the standard receiver operating characteristic (ROC) curve, the Feature-Response curve characterizes the sensitivity (coverage) of the sequence assembler output (contigs) as a function of its discrimination threshold (number of features/errors). The AMOS package provides an automated assembly validation pipeline called amosvalidate that analyzes the output of an assembler using a variety of assembly quality metrics (or features). Examples of features include: (M) mate-pair orientations and separations, (K) repeat content by k-mer analysis, (C) depth-of-coverage, (P) correlated polymorphism in the read alignments, and (B) read alignment breakpoints to identify structurally suspicious regions of the assembly. After running amosvalidate on the output of the assembler, each contig is assigned a number of features that correspond to doubtful regions of the sequence. Given any such set of features, the response (quality) of the assembler output is then analyzed as a function of the maximum number of possible errors (features) allowed in the contigs. More specifically, for a fixed feature threshold φ, the contigs are sorted by size and, starting from the longest, only those contigs are tallied if their sum of features is ≤ φ. For this set of contigs, the corresponding approximate genome coverage is computed, yielding a single point of the Feature-Response curve.

Some of its properties are:

  • The FRC can be used as a metric to compare the assembly quality of multiple assemblers.
  • The FRC does not require any reference sequence (except an estimate of the genome size) to be used for validation, thus making it a very useful tool in de novo sequencing projects.
  • Separate FRCs can be generated for each feature type, making it possible to scrutinize the relative strengths and weaknesses of different assemblers.
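
The thresholding procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code shipped with AMOS: the names frc_point and frc_curve and the toy contig list are hypothetical, and the sketch assumes that "sum of features" means the running total over the contigs tallied so far, with the walk stopping at the first contig that would push the total above φ. If the threshold is instead meant per contig, the filter changes accordingly.

```python
from typing import List, Sequence, Tuple

# A contig is modelled as (length in bp, number of features).
# This representation is an assumption made for the sketch.
Contig = Tuple[int, int]


def frc_point(contigs: Sequence[Contig], phi: int, genome_size: int) -> float:
    """Approximate genome coverage reached at feature threshold phi."""
    total_features = 0
    covered = 0
    # Walk the contigs from longest to shortest.
    for length, features in sorted(contigs, key=lambda c: c[0], reverse=True):
        # Assumed stopping rule: stop once the running feature total
        # would exceed the threshold.
        if total_features + features > phi:
            break
        total_features += features
        covered += length
    return covered / genome_size


def frc_curve(contigs: Sequence[Contig], thresholds: Sequence[int],
              genome_size: int) -> List[Tuple[int, float]]:
    """One (threshold, coverage) point per threshold."""
    return [(phi, frc_point(contigs, phi, genome_size)) for phi in thresholds]


# Toy data, invented for illustration only.
contigs = [(500, 3), (300, 0), (200, 1)]
curve = frc_curve(contigs, [0, 3, 4], genome_size=1000)
print(curve)  # [(0, 0.0), (3, 0.8), (4, 1.0)]
```

Only the estimated genome size enters the computation, which is why no reference sequence is needed.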

Examples

The figure below shows the Feature-Response Curve generated for the Minimus assembly pipeline on the Brucella suis genome using the benchmark dataset available here.

FRCurve for the contigs assembled by Minimus on the Brucella suis dataset.

Documentation

Following the AMOS philosophy, the FRCurve is implemented as a pipeline that consists of two steps:

  1. invocation of the amosvalidate tool to compute the features for the set of contigs;
  2. invocation of the FRC module.

The name of the pipeline in the AMOS distribution is "FRCurve_pipeline". Documentation on how to run FRCurve is obtained by typing:

  FRCurve_pipeline -h

The usage message is:

Feature-Response Curve pipeline
Usage:
     FRCurve_pipeline [params] \
               -D GENOME_SIZE=<n>              - Genome size (number of bps)
               -D BANK=<n>                     - AMOS bank name
Description:
               The Feature-Response curve characterizes the sensitivity (coverage)
               of the sequence assembler as a function of its discrimination threshold (number of features).
               Given any set of features compute by the amosvalidate pipeline, the response (quality)
               of the assembler output is analyzed as a function of the maximum number of possible
               errors (features) allowed in the contigs.
Output:
               The Feature-Response curve (FRC) is saved in file "FRC.txt", while
               FRCs for each feature type are saved respectively in:
               "FRC_coverage.txt", "FRC_polymorphism.txt", "FRC_breakpoint.txt",
               "FRC_kmer.txt", "FRC_matepair.txt" and "FRC_misassembly.txt"
Output file format:
               Each file contains the FRCs in 3-columns format
               - column 1 = feature threshold T;
               - column 2 = contigs' N50 associated to the threshold T in column 1;
               - column 3 = cumulative size of the contigs whose number of features is <= T;
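
Given the three-column format described in the usage message, the output files can be loaded with a short script. The following is a minimal sketch under stated assumptions: the columns are taken to be whitespace-separated and numeric, the names FRCRow, parse_frc and coverage_fraction are hypothetical, and the sample rows are invented for illustration rather than taken from a real run.

```python
from typing import Iterable, List, NamedTuple, Tuple


class FRCRow(NamedTuple):
    threshold: float        # column 1: feature threshold T
    n50: float              # column 2: contig N50 at threshold T
    cumulative_size: float  # column 3: cumulative contig size at threshold T


def parse_frc(lines: Iterable[str]) -> List[FRCRow]:
    """Parse FRC rows, assuming whitespace-separated numeric columns."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        rows.append(FRCRow(*(float(f) for f in fields)))
    return rows


def coverage_fraction(rows: Iterable[FRCRow],
                      genome_size: float) -> List[Tuple[float, float]]:
    """Convert cumulative size into approximate genome coverage."""
    return [(r.threshold, r.cumulative_size / genome_size) for r in rows]


# Invented sample rows; with a real run, pass the open "FRC.txt" file instead.
sample = ["0 1200 3400", "2 1500 5100", ""]
rows = parse_frc(sample)
points = coverage_fraction(rows, genome_size=10000)
```

With real output, the same functions apply to an open file, for example parse_frc(open("FRC.txt")), and the genome size should match the GENOME_SIZE value passed to the pipeline.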

Availability

The FRCurve is available as part of the AMOS assembly software package. This documentation is also available on the AMOS wiki.

Notice: The FRCurve will be part of the next release of AMOS, which is still under preparation. In the meantime, the FRCurve can be downloaded as part of the beta release of AMOS, which is available by anonymous CVS access here. Please contact the AMOS group if you have problems accessing the code from the CVS repository.

References

  • Narzisi G. and Mishra B.:
    Comparing De Novo Genome Assembly: The Long and Short of It.
    PLoS ONE 6(4):e19175. April 2011 (DOI: 10.1371/journal.pone.0019175).

Acknowledgement

Research reported here was supported by grants from the NSF CDI program and Abraxis BioScience, LLC.