From PyroCleaner Wiki.
The pyrocleaner is intended to clean the reads contained in an sff file in order to ease the assembly process. It filters sequences on several criteria: length, complexity, number of undetermined bases (which has been shown to correlate with poor quality), and multiple-copy reads. It can also clean paired-end sff files, generating on one side an sff file with the validated paired-ends and on the other the sequences that can be used as shotgun reads. To install the pyrocleaner, please refer to the Installation guide.
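As a rough illustration of one of these criteria, the sketch below filters reads on their percentage of undetermined bases. It is not taken from the pyrocleaner source: the function names `n_percent` and `passes_ns_filter` are hypothetical, the 4% limit is the default documented for the --ns_percent parameter, and treating a read exactly at the limit as valid is an assumption.

```python
def n_percent(seq):
    """Return the percentage of undetermined bases (N) in a read."""
    if not seq:
        return 0.0
    return 100.0 * seq.upper().count("N") / len(seq)


def passes_ns_filter(seq, ns_percent=4.0):
    """Return True when the read should be kept.

    Assumption: a read is rejected only when its N percentage is
    strictly above the limit.
    """
    return n_percent(seq) <= ns_percent


# Hypothetical reads, used only to exercise the filter.
reads = {
    "read1": "ACGTACGTACGTACGTACGT",  # no N
    "read2": "ACGTNNNNACGTACGTACGT",  # 4 N out of 20 bases (20%)
}

kept = [name for name, seq in reads.items() if passes_ns_filter(seq)]
print(kept)
```

With these two example reads, only read1 stays below the limit.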
- pyrocleaner_galaxy_tool for pyrocleaner version 1.3 - provides all files needed to run the pyrocleaner within the Galaxy workflow manager. Also available from the Galaxy Tool Shed.
- pyrocleaner_ergatis_component for pyrocleaner version 1.3 - provides all files needed to run the pyrocleaner within the Ergatis workflow manager.
- pyrocleaner v1.3 - Use Seqio instead of Bio.SeqIO as the sequence input/output library to speed up the process and remove the Biopython dependency.
- pyrocleaner v1.2 - Fix a bug when using --clean-pairends.
- pyrocleaner v1.1 - Add the --aggressive option to keep only 1 read per cluster, and the --clean-quality cleaning option to filter reads based on base quality.
- pyrocleaner v1.0 - First packaged version of the tool.
Usage: pyrocleaner -i file -o output -f format --clean-pairends --clean-length-std --clean-ns --clean-duplicated-reads --clean-complexity-win --clean-quality
  --version             show program's version number and exit
  -h, --help            show this help message and exit
Input files options:
  -i FILE, --in=FILE    The file to clean, can be [sff|fastq|fasta]
  -q FILE, --qual=FILE  The quality file to use if input file is fasta
  -f FORMAT, --format=FORMAT
                        The format of the input file [sff|fastq|fasta]
                        default is sff
Output files options:
  -o OUTPUT, --out=OUTPUT
                        The output folder where to store results
  -g FILE, --log=FILE   The log file name (default:pyrocleaner.log)
  -z, --split-pairends  Write split pairends sequences into a fasta file
Cleaning options:
  -p, --clean-pairends  Clean pairends
  -l, --clean-length-std
                        Filter reads shorter than mean minus x*standard
                        deviation and reads longer than mean plus
                        x*standard deviation
  -w, --clean-length-win
                        Filter reads with a length in between [x:y]
  -n, --clean-ns        Filter reads with too many Ns
  -d, --clean-duplicated-reads
                        Filter duplicated reads
  -c, --clean-complexity-win
                        Filter low complexity reads computed on a sliding
                        window
  -u, --clean-complexity-full
                        Filter low complexity reads computed on the whole
                        sequence
  -k, --clean-quality   Filter low quality reads
Processing options:
  -a NB_CPUS, --acpus=NB_CPUS
                        Number of processors to use
  -r RECURSION, --recursion=RECURSION
                        Recursion limit when computing duplicated reads
Cleaning parameters:
  -b BORDER_LIMIT, --border-limit=BORDER_LIMIT
                        Minimal length between the spacer and the read
                        extremity (used with --clean-pairends option,
                        default=70)
  -m, --aggressive      Filter all duplicated reads gathered in a cluster
                        to keep only one (used with
                        --clean-duplicated-reads, default=False)
  -e MISSMATCH, --missmatch=MISSMATCH
                        Limit of mismatched nucleotides (used with
                        --clean-pairends option, default=10)
  -j STD, --std=STD     Number of standard deviations to use (used with
                        --clean-length-std option, default=2)
  -1 MIN, --min=MIN     Minimal length (used with --clean-length-win
                        option, default=200)
  -2 MAX, --max=MAX     Maximal length (used with --clean-length-win
                        option, default=600)
  -3 QUALITY_THRESHOLD, --quality-threshold=QUALITY_THRESHOLD
                        At least one base pair has to be equal to or
                        higher than this value (used with --clean-quality,
                        default=35)
  -s NS_PERCENT, --ns_percent=NS_PERCENT
                        Percentage of N to use to filter reads (used with
                        --clean-ns option, default=4)
  -t DUPLICATION_LIMIT, --duplication_limit=DUPLICATION_LIMIT
                        Limit size difference (used with
                        --clean-duplicated-reads, default=70)
  -v WINDOW, --window=WINDOW
                        The window size (used with --clean-complexity-win,
                        default=100)
  -x STEP, --step=STEP  The window step (used with --clean-complexity-win,
                        default=5)
  -y COMPLEXITY, --complexity=COMPLEXITY
                        Minimal complexity/length ratio (used with
                        --clean-complexity-win and
                        --clean-complexity-full, default=40)
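To make the length and complexity parameters above more concrete, here is a minimal Python sketch of what --clean-length-std and --clean-complexity-win could compute with their documented defaults (std=2, window=100, step=5, complexity=40). It does not reproduce the pyrocleaner code. All function names are hypothetical. The use of the population standard deviation, the definition of complexity as the zlib-compressed size divided by the raw size times 100, and the rule that a single low-complexity window is enough to reject a read are assumptions made for illustration.

```python
import statistics
import zlib


def length_std_bounds(lengths, std=2.0):
    """Return (low, high) = mean -/+ std * standard deviation of lengths.

    Assumption: the population standard deviation is used.
    """
    mean = statistics.mean(lengths)
    sd = statistics.pstdev(lengths)
    return mean - std * sd, mean + std * sd


def filter_by_length_std(reads, std=2.0):
    """Keep the reads whose length lies within the computed bounds."""
    low, high = length_std_bounds([len(s) for s in reads.values()], std)
    return {name: s for name, s in reads.items() if low <= len(s) <= high}


def complexity(seq):
    """Assumed complexity/length ratio: compressed size over raw size, times 100."""
    if not seq:
        return 0.0
    raw = seq.encode("ascii")
    return 100.0 * len(zlib.compress(raw)) / len(raw)


def has_low_complexity_window(seq, window=100, step=5, min_complexity=40.0):
    """Return True if any sliding window falls below the minimal ratio.

    Reads shorter than the window are evaluated as a whole (assumption).
    """
    if len(seq) <= window:
        return complexity(seq) < min_complexity
    for start in range(0, len(seq) - window + 1, step):
        if complexity(seq[start:start + window]) < min_complexity:
            return True
    return False


# Hypothetical data: nine reads of 100 bases and one outlier of 1000 bases.
example_reads = {"r%d" % i: "ACGT" * 25 for i in range(9)}
example_reads["long"] = "ACGT" * 250

print(length_std_bounds([len(s) for s in example_reads.values()]))
print(sorted(filter_by_length_std(example_reads)))
print(has_low_complexity_window("A" * 300))
```

In this example the 1000-base read falls above the upper bound and is dropped, and the homopolymer read is flagged as low complexity. A compression-based ratio depends on the compressor and on the window size, so the threshold of 40 is only meaningful relative to the tool's own computation.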
How to cite
Mariette J, Noirot C, Klopp C. Assessment of replicate bias in 454 pyrosequencing and a multi-purpose read-filtering tool. BMC Research Notes 2011, 4:149.