Difference between revisions of "Fastq mcf"
==Introduction==
fastq-mcf attempts to:
* Detect & remove sequencing adapters and primers
* Detect limited skewing at the ends of reads and clip
* Detect poor quality at the ends of reads and clip
* Detect Ns, and remove from ends
* Remove reads with CASAVA 'Y' flag (purity filtering)
* Discard sequences that are too short after all of the above
* Keep multiple mate-reads in sync while doing all of the above
==Usage==
<code>
Usage: fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...]<br>
</code>
Detects levels of adapter presence, computes likelihoods and
locations (start, end) of the adapters. Removes the adapter
sequences from the fastq file(s).

Stats go to stderr unless <code>-o</code> is specified, in which case they go to stdout.

Specify <code>-0</code> to turn off all default settings.

If you specify multiple 'paired-end' inputs, then a <code>-o</code> option is
required for each, e.g. <code>-o read1.clip.fq -o read2.clip.fq</code>
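
The calling convention above can be sketched as a small Python helper that assembles the argument list for a paired-end run. This is only an illustration: the helper name <code>build_fastq_mcf_command</code> and the file names are hypothetical, and the only fastq-mcf flags used are <code>-o</code> and <code>-l</code> from the help text. The command is printed, not executed.

```python
import shlex


def build_fastq_mcf_command(adapters, reads, outputs, options=()):
    """Assemble argv as: fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...]."""
    if len(reads) > 1 and len(outputs) != len(reads):
        # The help text requires one -o per input when several inputs are given.
        raise ValueError("need one -o output file per input file")
    argv = ["fastq-mcf", *options]
    for out in outputs:
        argv += ["-o", out]
    argv.append(adapters)
    argv.extend(reads)
    return argv


# Hypothetical file names, for illustration only.
cmd = build_fastq_mcf_command(
    "adapters.fa",
    ["read1.fq", "read2.fq"],
    ["read1.clip.fq", "read2.clip.fq"],
    options=["-l", "19"],
)
print(shlex.join(cmd))
```

The resulting list could be handed to <code>subprocess.run</code> on a machine where fastq-mcf is installed.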

====Options====
 -h      This help
 -o FIL  Output file (stats to stdout)
 -s N.N  Log scale for adapter minimum-length-match (2.2)
 -t N    % occurrence threshold before adapter clipping (0.25)
 -m N    Minimum clip length, overrides scaled auto (1)
 -p N    Maximum adapter difference percentage (10)
 -l N    Minimum remaining sequence length (19)
 -L N    Maximum remaining sequence length (none)
 -D N    Remove duplicate reads : Read_1 has an identical N bases (0)
 -k N    sKew percentage-less-than causing cycle removal (2)
 -x N    'N' (Bad read) percentage causing cycle removal (20)
 -q N    quality threshold causing base removal (10)
 -w N    window-size for quality trimming (1)
 -H      remove >95% homopolymer reads (no)
 -0      Set all default parameters to zero/do nothing
 -U|u    Force disable/enable Illumina PF filtering (auto)
 -P N    Phred-scale (auto)
 -R      Don't remove Ns from the fronts/ends of reads
 -n      Don't clip, just output what would be done
 -C N    Number of reads to use for subsampling (300k)
 -S      Save all discarded reads to '.skip' files
 -d      Output lots of random debugging stuff

====Quality adjustment options====
 --cycle-adjust CYC,AMT    Adjust cycle CYC (negative = offset from end) by amount AMT
 --phred-adjust SCORE,AMT  Adjust score SCORE by amount AMT

====Filtering options====
 --[mate-]qual-mean NUM    Minimum mean quality score
 --[mate-]qual-gt NUM,THR  At least NUM quals > THR
 --[mate-]max-ns NUM       Maximum N-calls in a read (can be a %)
 --[mate-]min-len NUM      Minimum remaining length (same as -l)
 --hompolymer-pct PCT      Homopolymer filter percent (95)

If the mate- prefix is used, the option applies to the second non-barcode read only.
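
As a rough illustration of what the filtering options test, the sketch below applies the same kinds of checks to a single read in Python. It is not fastq-mcf's implementation: the function names are hypothetical, a Phred offset of 33 is assumed, and <code>max_ns</code> is handled as an absolute count only (the percentage form is not covered).

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer scores (offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in quality]


def passes_filters(seq, quality, qual_mean=None, qual_gt=None,
                   max_ns=None, min_len=None, offset=33):
    """Return True if the read passes every filter that was requested."""
    scores = phred_scores(quality, offset)
    if min_len is not None and len(seq) < min_len:
        return False  # like --min-len NUM
    if qual_mean is not None:
        if not scores or sum(scores) / len(scores) < qual_mean:
            return False  # like --qual-mean NUM
    if qual_gt is not None:
        num, thr = qual_gt
        if sum(1 for s in scores if s > thr) < num:
            return False  # like --qual-gt NUM,THR: at least NUM quals > THR
    if max_ns is not None and seq.upper().count("N") > max_ns:
        return False  # like --max-ns NUM (absolute count only in this sketch)
    return True


# 'I' decodes to 40 and '#' to 2 with offset 33.
print(passes_filters("ACGTN", "IIII#", qual_mean=30, qual_gt=(4, 30), max_ns=1))
```

In a paired-end setting the same function could be called once per read, with a separate set of thresholds standing in for the mate- variants.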

Adapter files are 'fasta' formatted:

Specify n/a to turn off adapter clipping, and just use filters.

Increasing the scale (<code>-s</code>) makes recognition-lengths longer; a scale
of 100 will force full-length recognition of adapters.

Adapter sequences with _5p in their label will match 'end's,
and sequences with _3p in their label will match 'start's;
otherwise the 'end' is auto-determined.
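
A minimal Python sketch of producing such an adapter file follows. The labels and sequences are made-up placeholders rather than real adapters, and <code>format_adapter_fasta</code> is a hypothetical helper; only the _5p/_3p label suffixes come from the text above.

```python
def format_adapter_fasta(adapters):
    """Render (label, sequence) pairs as FASTA text, one record per adapter."""
    lines = []
    for label, sequence in adapters:
        lines.append(f">{label}")
        lines.append(sequence.upper())
    return "\n".join(lines) + "\n"


# Placeholder labels and sequences, for illustration only.
adapters = [
    ("example_adapter_3p", "ACGTACGTACGT"),  # label carries the _3p suffix
    ("example_primer_5p", "TTGCATTGCA"),     # label carries the _5p suffix
    ("example_auto", "ggccaatt"),            # no suffix: end is auto-determined
]

fasta_text = format_adapter_fasta(adapters)
print(fasta_text, end="")
```

Writing <code>fasta_text</code> to a file such as <code>adapters.fa</code> gives the first positional argument shown in the usage line.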

Skew is when one cycle is poor, 'skewed' toward a particular base.
If any nucleotide is less than the skew percentage (<code>-k</code>), then the
whole cycle is removed. Disable for methyl-seq, etc.
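
The skew rule can be pictured with a short Python sketch that flags cycles in which any of A, C, G or T falls below the skew percentage. This is an assumption-laden illustration, not fastq-mcf's code: it reports 0-based cycle indices for every position, whereas the tool itself only clips limited skew at the ends of reads and works from a subsample.

```python
from collections import Counter


def skewed_cycles(reads, skew_pct=2.0):
    """Return 0-based cycle indices where some nucleotide is below skew_pct percent."""
    if not reads:
        return []
    flagged = []
    for cycle in range(max(len(r) for r in reads)):
        # Count only called bases at this cycle; Ns and short reads are ignored.
        bases = (r[cycle].upper() for r in reads if len(r) > cycle)
        counts = Counter(b for b in bases if b in "ACGT")
        total = sum(counts.values())
        if total == 0:
            continue
        if any(100.0 * counts[b] / total < skew_pct for b in "ACGT"):
            flagged.append(cycle)
    return flagged


reads = ["AACG", "ACGT", "AGTA", "ATAC"]
print(skewed_cycles(reads))        # cycle 0 is all 'A', so it is flagged
print(skewed_cycles(reads, 0.0))   # a threshold of 0 flags nothing
```

The second call shows why a threshold of 0 acts as an off switch: no percentage can be below zero.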

Set the skew (<code>-k</code>) or N-pct (<code>-x</code>) to 0 to turn it off (should be done
for miRNA, amplicon and other low-complexity situations!)

Duplicate read filtering (<code>-D</code>) is appropriate for assembly tasks, and
never when read length < expected coverage. -D 50 will use
4.5GB RAM on 100m DNA reads - be careful. Great for RNA assembly.
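
The idea behind <code>-D N</code> (treating reads as duplicates when Read_1 shares the same first N bases) can be sketched in Python as below. The set-based bookkeeping is an assumption made for illustration, not a description of fastq-mcf's internals; it does show why memory grows with the number of distinct prefixes kept.

```python
def filter_duplicates(pairs, n):
    """Yield (read1, read2) pairs, skipping pairs whose read1 prefix was already seen."""
    if n <= 0:
        # n == 0 mirrors the default: no duplicate filtering.
        yield from pairs
        return
    seen = set()  # one entry per distinct prefix, hence the memory cost
    for read1, read2 in pairs:
        key = read1[:n]
        if key in seen:
            continue
        seen.add(key)
        yield read1, read2


pairs = [("ACGTAA", "TTTT"), ("ACGTCC", "GGGG"), ("GGGGAA", "CCCC")]
kept = list(filter_duplicates(pairs, 4))
print(kept)
```

Because the mate travels with its Read_1, dropping a duplicate keeps the paired files in sync.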

Quality filters are evaluated after clipping/trimming.

==Links==

[https://github.com/ExpressionAnalysis/ea-utils/blob/wiki/FastqMcf.md fastq-mcf on GitHub]
Latest revision as of 17:35, 25 March 2019