Command Reference
This page provides comprehensive documentation for all modtector commands and their options.
Overview
modtector provides ten main commands:
count - Generate pileup data from BAM files
norm - Normalize and filter signals
reactivity - Calculate reactivity scores
compare - Compare samples and identify differences
evaluate - Evaluate accuracy using secondary structure
plot - Generate visualization plots
duet - Decompose normalized reactivity into dynamic ensembles using read-level co-variation
extract - Extract gene regions from count results using GTF annotation
convert - Convert various input formats to modtector pileup CSV format
correct - Apply PCR bias correction to pileup CSV files
count - Data Processing
Generate pileup data from BAM files by counting stop and mutation signals.
Usage
modtector count [OPTIONS] --bam <BAM> --fasta <FASTA> --output <OUTPUT>
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --bam | -b | String | Required | BAM file path (or glob pattern for batch/single-cell mode) |
| --fasta | -f | String | Required | Reference FASTA file path |
| --output | -o | String | Required | Output CSV path (or directory for batch/single-cell mode) |
|  |  | String | “+/-” | Strand filter |
|  | -t | Integer | 1 | Number of parallel threads |
|  | -l | String | None | Log file path (optional) |
|  | -w | Integer | None | Window size for genome segmentation (bases) |
| --batch |  | Flag | false | Enable batch mode: process multiple BAM files sequentially |
| --single-cell |  | Flag | false | Enable single-cell unified mode: unified processing with cell labels |
|  |  | Integer | 0 | Minimum base quality score (Phred score) to count a mutation. Only mutations with quality >= threshold are counted. Recommended: 20 (same as RNAFramework rf-count). Set to 0 to disable. |
|  |  | Flag | false | Enable PCR bias correction for depth calculation. When enabled, effective_depth is calculated and used instead of raw depth. |
|  |  | Float | 1.0 | Weight for increasing depth correction (used when depth is too low) |
|  |  | Float | 0.5 | Weight for decreasing depth correction (used when depth is too high) |
Examples
# Basic usage (standard mode)
modtector count -b sample.bam -f reference.fa -o output.csv
# With multiple threads
modtector count -b sample.bam -f reference.fa -o output.csv -t 8
# With windowing
modtector count -b sample.bam -f reference.fa -o output.csv -w 1000
# With logging
modtector count -b sample.bam -f reference.fa -o output.csv -l count.log
# Batch mode: process multiple BAM files sequentially
modtector count --batch \
-b "/path/to/bam/*sort.bam" \
-f reference.fa \
-o output_dir/ \
-t 8 \
-w 10000
# Single-cell unified mode: unified processing with cell labels
modtector count --single-cell \
-b "/path/to/single_cell/*sort.bam" \
-f reference.fa \
-o output_dir/ \
-t 8 \
-w 10000 \
-l batch.log
Batch and Single-cell Modes
Batch Mode (--batch):
Processes multiple BAM files sequentially
Each file is processed independently
Suitable for scenarios requiring separate processing per file
Output: Each file generates a separate CSV file in the output directory
Example glob patterns:
*sort.bam,*_RHX*.bam,sample_*.bam
Single-cell Unified Mode (--single-cell):
True unified processing strategy for single-cell data
Processing Flow:
For each window, collects all reads from all BAM files at once
Performs unified pileup processing (all reads together)
Tracks cell labels during processing, splits results by cell
Outputs separate CSV file for each cell
Performance Benefits (2-3x speedup):
Skips data distribution scanning (direct processing of all reference sequences)
Unified read collection and pileup (reduces I/O overhead significantly)
Better parallelization efficiency (cross-file parallel processing)
Window-based memory management (prevents memory accumulation)
Progress Reporting:
Real-time display: chunks processed, percentage, speed (chunks/s), ETA
CPU usage recommendations based on BAM file count
Output: Each cell generates a separate CSV file (cell label extracted from filename)
Cell Label Extraction: Supports the RHX pattern (e.g., *RHX672.sort.bam → RHX672) or the last underscore-separated part of the filename
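The label rule above can be sketched in Python. This is a minimal illustration, not modtector's implementation: the function name `cell_label`, the `RHX\d+` regular expression, and the choice to drop everything from the first dot of the filename are assumptions.

```python
import re
from pathlib import Path


def cell_label(bam_path: str) -> str:
    """Derive a cell label from a BAM filename (illustrative rule only)."""
    # Drop directories and everything from the first dot (".sort.bam", ".bam").
    stem = Path(bam_path).name.split(".", 1)[0]
    # Prefer an RHX-style identifier if one is present (assumed pattern).
    match = re.search(r"RHX\d+", stem)
    if match:
        return match.group(0)
    # Otherwise fall back to the last underscore-separated part.
    return stem.rsplit("_", 1)[-1]


print(cell_label("/path/to/single_cell/in_vivo_DMSO_RHX672.sort.bam"))
print(cell_label("sample_cellA.sort.bam"))
```

The first call prints RHX672 and the second prints cellA, which matches the two cases described in the list.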
Output Format
The output CSV contains the following columns:
ChrID: Chromosome/contig identifier
pipe_truncation_Strand: Strand (+ or -)
pipe_truncation_ChrPos: Position (1-based)
rf_mutation_Base: Reference base
rf_mutation_Count: Mutation count
pipe_truncation_count: Stop count
depth: Total read depth
rf_mutation_ins: Insertion count
rf_mutation_del: Deletion count
base_A, base_C, base_G, base_T: Base counts
norm - Signal Normalization
Normalize and filter signals to remove noise and outliers.
Usage
modtector norm [OPTIONS] --input <INPUT> --output <OUTPUT> --method <METHOD>
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --input | -i | String | Required | Input CSV file |
| --output | -o | String | Required | Output CSV file |
| --method | -m | String | Required | Normalization method |
|  |  | Float | 0.25 | SNP threshold for filtering mutation signals |
| --bases |  | String | “ACGT” | Reactive base types (e.g., “AC” for A and C) |
|  |  | Integer | 0 | Fixed window size (fixed sliding window mode) |
|  |  | Integer | 0 | Window offset (fixed sliding window mode) |
| --dynamic |  | Flag | False | Enable dynamic sliding window mode |
|  |  | Integer | 50 | Number of reactive bases required per window |
| --linear |  | Flag | False | Apply piecewise linear mapping |
|  |  | String | None | Log file path (optional) |
Normalization Methods
percentile28: 28th percentile normalization
winsor90: 90th percentile winsorization
boxplot: Boxplot-based outlier removal
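The page names the normalization methods without giving their formulas. As a rough illustration of what 90th percentile winsorization usually means, the sketch below clips values at the 90th percentile and rescales so that the cap maps to 1.0. The function names, the linear-interpolation percentile, and the clipping of negative values to zero are all assumptions; the exact behaviour of `winsor90` in modtector may differ.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac


def winsorize_90(values):
    """Clip values to [0, 90th percentile] and scale so the cap becomes 1.0."""
    if not values:
        return []
    cap = percentile(values, 90)
    if cap <= 0:
        # No positive signal to scale by; return zeros (assumed convention).
        return [0.0 for _ in values]
    return [min(max(v, 0.0), cap) / cap for v in values]


signal = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(winsorize_90(signal))
```

With this toy input the cap is 9.0, so both 9 and 10 map to 1.0 and the outlier no longer stretches the scale.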
Examples
# Basic normalization
modtector norm -i input.csv -o output.csv -m winsor90
# Target specific bases
modtector norm -i input.csv -o output.csv -m winsor90 --bases AC
# With dynamic windowing
modtector norm -i input.csv -o output.csv -m winsor90 --dynamic
# With linear mapping
modtector norm -i input.csv -o output.csv -m winsor90 --linear
reactivity - Reactivity Calculation
Calculate reactivity scores by comparing modified and unmodified samples.
Usage
modtector reactivity [OPTIONS] --mod <MOD_CSV> --unmod <UNMOD_CSV> --output <OUTPUT>
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --mod | -M | String | Required | Modified sample CSV |
| --unmod | -U | String | Optional | Unmodified sample CSV (optional for mod-only mode) |
| --output | -O | String | Required | Reactivity output file |
|  | -s | String | “kfactor” | Stop signal reactivity method: kfactor, ding, or rouskin |
|  | -m | String | “kfactor” | Mutation signal reactivity method: kfactor, siegfried, or zubradt |
| --pseudocount |  | Float | 1.0 | Pseudocount parameter for logarithmic calculations (Ding method) |
| --maxscore |  | Float | 10.0 | Maximum score limit parameter (Ding method) |
| --snp-cutoff |  | Float | 0.25 | SNP threshold for filtering positions (positions where the unmod sample mutation rate >= cutoff are filtered out) |
| --k-prediction-method |  | String | “background” | K-factor prediction method: background, distribution, or recursive |
| --structure-file |  | String | None | Reference secondary structure file (required for recursive k-factor method) |
| --k-background-gene-id |  | String | None | Gene ID to use for k-factor calculation (only positions with this gene_id will be used as background regions) |
|  |  | Integer | 1 | Number of parallel threads |
|  |  | String | None | Log file path (optional) |
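The SNP threshold rule in the table (filter positions where the unmodified sample's mutation rate is at or above the cutoff) can be sketched as follows. This is a hedged illustration: it assumes the mutation rate is `rf_mutation_Count / depth` from the count CSV, and the zero-depth handling and function name are choices made for the example, not documented behaviour.

```python
def passes_snp_filter(unmod_mutations, unmod_depth, cutoff=0.25):
    """Return True if a position is kept, False if it looks like a SNP."""
    if unmod_depth == 0:
        # No coverage in the unmodified sample: nothing to judge (assumption).
        return True
    # Positions with rate >= cutoff are filtered out.
    return unmod_mutations / unmod_depth < cutoff


# Hypothetical unmodified-sample rows using the count CSV column names.
rows = [
    {"pipe_truncation_ChrPos": 1, "rf_mutation_Count": 2, "depth": 100},
    {"pipe_truncation_ChrPos": 2, "rf_mutation_Count": 25, "depth": 100},
    {"pipe_truncation_ChrPos": 3, "rf_mutation_Count": 60, "depth": 100},
]
kept = [r["pipe_truncation_ChrPos"] for r in rows
        if passes_snp_filter(r["rf_mutation_Count"], r["depth"])]
print(kept)
```

Only position 1 survives here: position 2 sits exactly on the default cutoff of 0.25 and is removed because the comparison is inclusive.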
Reactivity Methods
Stop Signal Methods
kfactor (default): K-factor method using difference between modified and unmodified samples
ding: Ding et al. method with pseudocount and maxscore parameters
rouskin: Rouskin et al. method (uses only modified sample stop signal)
Mutation Signal Methods
kfactor (default): K-factor method using difference between modified and unmodified samples
siegfried: Siegfried et al. method (allows negative reactivity values)
zubradt: Zubradt et al. method
K-Factor Prediction Methods
background (default): Predict k-factor from background regions (low mutation rate positions)
distribution: Predict k-factor using statistical distribution analysis (improved accuracy for many sample types)
recursive: Predict k-factor recursively using reference secondary structure (requires --structure-file)
Examples
# Basic reactivity calculation (with unmodified sample)
modtector reactivity -M mod.csv -U unmod.csv -O reactivity.csv
# Mod-only mode (no unmodified sample, for smartSHAPE datasets)
modtector reactivity -M mod.csv -O reactivity.csv
# With specific methods
modtector reactivity -M mod.csv -U unmod.csv -O reactivity.csv -s ding -m siegfried
# With custom parameters
modtector reactivity -M mod.csv -U unmod.csv -O reactivity.csv --pseudocount 0.5 --maxscore 5
# Using distribution-based k-factor prediction
modtector reactivity -M mod.csv -U unmod.csv -O reactivity.csv --k-prediction-method distribution
# Using recursive k-factor prediction (requires structure file)
modtector reactivity -M mod.csv -U unmod.csv -O reactivity.csv \
--k-prediction-method recursive \
--structure-file structure.dp \
--k-background-gene-id gene_id
# With SNP filtering
modtector reactivity -M mod.csv -U unmod.csv -O reactivity.csv --snp-cutoff 0.2
compare - Sample Comparison
Compare modified and unmodified samples to identify significant differences.
Usage
modtector compare [OPTIONS] --output <OUTPUT>
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --mode |  | String | “mod-vs-unmod” | Comparison mode |
|  | -M | String | None | Modified sample CSV (mod-vs-unmod mode) |
|  | -U | String | None | Unmodified sample CSV (mod-vs-unmod mode) |
|  |  | String | None | Group 1 file list (biological-replicates mode) |
|  |  | String | None | Group 2 file list (biological-replicates mode) |
| --reactivity-group1 |  | String | None | Group 1 reactivity files (reactivity-groups mode) |
| --reactivity-group2 |  | String | None | Group 2 reactivity files (reactivity-groups mode) |
|  |  | String | None | First reactivity result file (reactivity-results mode) |
|  |  | String | None | Second reactivity result file (reactivity-results mode) |
| --output | -o | String | Required | Output CSV path |
|  | -d | Integer | 10 | Minimum depth |
|  | -f | Float | 2 | Minimum fold change |
|  | -t | String | “t-test” | Statistical test type |
|  |  | Float | 0.05 | Significance level |
|  |  | String | None | Log file path (optional) |
Comparison Modes
mod-vs-unmod: Compare modified vs unmodified samples
reactivity-groups: Compare two groups of reactivity results
biological-replicates: Compare biological replicates
reactivity-results: Compare two reactivity result files
Statistical Tests
t-test: Student’s t-test
mann-whitney: Mann-Whitney U test
wilcoxon: Wilcoxon signed-rank test
chi-square: Chi-square test
continuity: Continuity correction
diffscan: Differential scanning
deltashape: DeltaSHAPE method
Examples
# Basic comparison
modtector compare -M mod.csv -U unmod.csv -o comparison.csv
# With custom thresholds
modtector compare -M mod.csv -U unmod.csv -o comparison.csv -d 20 -f 1.5
# Using different statistical test
modtector compare -M mod.csv -U unmod.csv -o comparison.csv -t mann-whitney
# Compare reactivity groups
modtector compare --mode reactivity-groups \
--reactivity-group1 group1_reactivity.csv \
--reactivity-group2 group2_reactivity.csv \
-o comparison.csv
evaluate - Accuracy Evaluation
Evaluate reactivity accuracy using known secondary structure.
Usage
modtector evaluate [OPTIONS] --reactivity <REACTIVITY_FILE> --structure <STRUCTURE_FILE> --output <OUTPUT_DIR> --gene-id <GENE_ID>
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --reactivity | -r | String | Required | Reactivity signal file |
| --structure | -s | String | Required | Secondary structure file |
| --output | -o | String | Required | Output directory |
|  |  | String | None | Log file path (defaults to the result directory) |
|  | -t | String | “stop” | Signal type |
| --gene-id | -g | String | Required | Gene ID (for base matching) |
|  |  | String | “+” | Strand information |
|  |  | Flag | True | Use base matching (default: true) |
|  |  | Flag | True | Use auto-shift correction (default: true) |
|  |  | String | “ACGT” | Reactive bases for AUC/ROC calculation |
|  | -O | Flag | False | Use optimized version |
Examples
# Basic evaluation
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id
# Evaluate both signal types
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id -t both
# Without auto-shift correction
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id --no-auto-shift
# Using optimized version
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id -O
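To make the AUC/ROC idea concrete, the sketch below scores how well reactivity separates unpaired from paired positions. It is an illustration under stated assumptions: the structure is taken to be a dot-bracket string, unpaired positions (`.`) are treated as the positive class, and the AUC is computed with the pairwise (Mann-Whitney) definition. None of this is confirmed as modtector's internal procedure.

```python
def unpaired_labels(dot_bracket):
    """True for unpaired positions ('.'), False for paired ones."""
    return [ch == "." for ch in dot_bracket]


def auc(scores, labels):
    """Probability that a positive position outscores a negative one."""
    pos = [s for s, flag in zip(scores, labels) if flag]
    neg = [s for s, flag in zip(scores, labels) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one paired and one unpaired position")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


structure = "((..))"
reactivity = [0.1, 0.2, 0.9, 0.8, 0.0, 0.3]
print(auc(reactivity, unpaired_labels(structure)))
```

In this toy hairpin both loop positions are more reactive than every stem position, so the AUC is 1.0; a value near 0.5 would mean the signal carries no structural information.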
plot - Visualization
Generate signal distribution plots and reactivity visualizations, including RNA structure SVG plots.
Usage
Regular Mode
modtector plot [OPTIONS] --mod <MOD_CSV> --unmod <UNMOD_CSV> --output <OUTPUT>
SVG-Only Mode
modtector plot [OPTIONS] --output <OUTPUT> --svg-template <SVG_TEMPLATE> --reactivity <REACTIVITY_CSV>
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --mod | -M | String | Optional* | Modified sample CSV (required for regular mode) |
| --unmod | -U | String | Optional* | Unmodified sample CSV (required for regular mode) |
| --output | -o | String | Required | Output directory |
|  | -c | Float | 0.2 | Coverage threshold |
|  | -d | Integer | 50 | Depth threshold (reads) |
| --reactivity | -r | String | None | Reactivity CSV file (optional) |
|  | -g | String | None | Genome annotation GFF/GTF file (optional) |
|  |  | Integer | 1 | Number of parallel threads |
|  |  | String | None | Log file path (optional) |
SVG Plotting Options
| Option | Type | Default | Description |
|---|---|---|---|
| --svg-template | String | None | SVG template file for RNA structure visualization |
| --svg-bases | String | “ACGT” | Bases to include in SVG plot (e.g., “AC” for DMS-seq) |
| --svg-signal | String | “all” | Signal type to plot |
| --svg-strand | String | “+” | Strand to include |
| --svg-ref | String | None | Reference sequence file for alignment (optional) |
| --svg-max-shift | Integer | 5 | Maximum shift for alignment search |
| --svg-circle-filled | Flag | False | Circle fill type |
| --svg-font-color | String | None | Font color for text elements (e.g., “black”, “red”, “#000000”) |
| --svg-legend-width | Float | 30.0 | Legend item width for horizontal layout |
| --svg-legend-height | Float | 15.0 | Legend item height for vertical layout |
| --interactive | Flag | False | Generate interactive HTML visualization instead of static SVG |
Note: -M and -U parameters are optional when using SVG-only mode (when both --svg-template and --reactivity are provided).
Examples
Regular Plotting
# Basic plotting
modtector plot -M mod.csv -U unmod.csv -o plots/
# With reactivity data
modtector plot -M mod.csv -U unmod.csv -o plots/ -r reactivity.csv
# With custom thresholds
modtector plot -M mod.csv -U unmod.csv -o plots/ -c 0.3 -d 100
# With genome annotation
modtector plot -M mod.csv -U unmod.csv -o plots/ -g annotation.gff
SVG-Only Mode
# Basic SVG plotting (simplified command)
modtector plot -o output/ --svg-template template.svg --reactivity data.csv
# Multi-signal SVG plotting
modtector plot -o output/ \
--svg-template template.svg \
--reactivity data.csv \
--svg-bases ATGC \
--svg-signal all \
--svg-strand +
# Single signal SVG plotting
modtector plot -o output/ \
--svg-template template.svg \
--reactivity data.csv \
--svg-bases AC \
--svg-signal stop \
--svg-strand +
# SVG plotting with alignment
modtector plot -o output/ \
--svg-template template.svg \
--reactivity data.csv \
--svg-bases ATGC \
--svg-signal all \
--svg-strand + \
--svg-ref reference.fa \
--svg-max-shift 10
# SVG plotting with custom styling
modtector plot -o output/ \
--svg-template template.svg \
--reactivity data.csv \
--svg-circle-filled \
--svg-font-color "black" \
--svg-legend-width 40.0 \
--svg-legend-height 20.0
# Interactive HTML visualization
modtector plot -o output/ \
--svg-template template.svg \
--reactivity data.csv \
--interactive
Combined Mode (Regular + SVG)
# Generate both regular plots and SVG plots
modtector plot -M mod.csv -U unmod.csv -o output/ \
--svg-template template.svg \
--reactivity data.csv \
--svg-bases ATGC \
--svg-signal all
SVG Plotting Features
Signal Type Support
Auto-detection: Automatically detects signal types from CSV headers
Multiple signals: Supports plotting multiple signal types (stop, mutation, etc.)
Single signal: Works with single signal files
Custom naming: Output files named by signal type (e.g., rna_structure_colored_stop.svg)
Strand Selection
Positive strand: --svg-strand + (default)
Negative strand: --svg-strand -
Both strands: --svg-strand both
Base Filtering
All bases: --svg-bases ATGC (default)
DMS-seq: --svg-bases AC (A and C only)
Custom: Any combination of A, T, C, G
Alignment Support
Reference sequence: Use --svg-ref for sequence alignment
Shift calculation: Automatic calculation of optimal shift values
Max shift: Control maximum shift search range with --svg-max-shift
Output Files
Regular Mode Output
overall_scatter.png: Signal distribution plot
high/: High coverage gene plots
low/: Low coverage gene plots
SVG Mode Output
rna_structure_colored_[signal].svg: SVG files for each signal type
Multiple SVG files for multi-signal data
Single SVG file for single-signal data
Combined Mode Output
All regular mode files
All SVG mode files
duet - Dynamic Ensemble Decomposition
Duet performs a sliding-window analysis that merges normalized stop/mutation reactivity with read-level co-variation to detect alternative ensembles without predefining cluster counts. Within each window, read-level stop/mutation co-occurrence vectors are clustered with DBSCAN to highlight dynamic structural mixtures.
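For readers unfamiliar with DBSCAN, the following is a minimal pure-Python sketch of the clustering step on toy 2-D feature vectors. It is not Duet's implementation: the feature vectors, the Euclidean distance, and counting a point as its own neighbour are assumptions made for the example. The parameters `eps` and `min_samples` play the roles of the ε radius and minimum-neighbour settings described in the options.

```python
import math
from collections import deque

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster index, or NOISE (-1)."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is a core point: keep growing
    return labels


# Two tight groups of hypothetical per-read feature vectors plus one outlier.
reads = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 10)]
print(dbscan(reads, eps=0.5, min_samples=3))
```

The number of clusters is not specified in advance: it emerges from the density of the data, which is why the approach suits detecting an unknown number of ensembles per window.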
Usage
modtector duet [OPTIONS] --input <INPUT> --bam <BAM> --fasta <FASTA> --output <OUTPUT>
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --input | -i | String | Required | Normalized reactivity CSV file |
| --bam | -b | String | Required | Sorted BAM file for co-variation analysis |
| --fasta | -f | String | Required | Reference FASTA used for alignment |
| --output | -o | String | Required | Window-level Duet results CSV |
| --epsilon |  | Float | 0.75 | DBSCAN ε (radius) in standardized feature space |
| --min-samples |  | Integer | 5 | Minimum neighbours required to form a DBSCAN core read |
| --window-size |  | Integer | 100 | Sliding-window size in nucleotides |
| --window-step |  | Integer | 50 | Sliding-window step in nucleotides |
|  | -t | Integer | System CPU count | Number of threads for parallel processing |
| --summary-output |  | String |  | Optional per-window/per-ensemble summary CSV |
|  |  | String | None | Log file path (optional) |
Outputs
Window CSV: Each window annotated with status (OK, InsufficientReads, or LowDensity), read counts, detected ensemble count, primary ensemble occupancy, noise fraction, dual-signal fraction, window-averaged reactivities, a concise cluster summary string (including per-cluster confidence), and a GlobalEnsembles mapping column.
Summary CSV: Per window/per ensemble (including DBSCAN noise) details covering occupancy, cluster confidence, average stop/mutation counts per read, dual-signal fractions, and the associated global ensemble ID. Generated automatically unless --summary-output is provided.
Global Ensemble CSV (<output>_global.csv): Summarises each global ensemble with aggregated read counts, stop/mutation totals, dual-signal read counts, contributing window/cluster counts, unique position counts, and a global confidence score.
Global Per-base CSV (<output>_global_per_base.csv): Lists every position associated with each global ensemble alongside ensemble-specific read support, normalized reactivity values, and per-base confidence scores.
Progress Reporting
During execution, duet provides detailed statistics:
Dual-signal reads: Number and percentage of reads carrying both stop and mutation signals
High-confidence positions: Number and percentage of positions assigned high confidence (reactivity >= 0.7)
Average confidence: Mean position confidence score across all analyzed positions
These statistics help assess data quality and the reliability of ensemble detection.
Examples
# Sliding-window Duet analysis (100-nt windows, 50-nt step, 16 threads)
modtector duet \
-i signal/03_norm/iso-1_rep-1_norm.csv \
-b signal/00_bam/iso-1_rep-1.sort.bam \
-f reference/PDL1.fa \
-o signal/04_duet/iso-1_rep-1_windows.csv \
-t 16
# Custom density and window configuration with explicit summary output
modtector duet \
-i signal/03_norm/sample_norm.csv \
-b signal/00_bam/sample.sort.bam \
-f reference/transcriptome.fa \
-o signal/04_duet/sample_windows.csv \
--epsilon 0.9 \
--min-samples 8 \
--window-size 120 \
--window-step 30 \
--summary-output signal/04_duet/sample_windows_summary.csv
extract - Extract Gene Regions from Count Results
Extract count results for specific gene regions using GTF/GFF annotation files. This command allows you to filter count CSV files to include only positions within specified gene regions, such as 18S rRNA or other target genes.
Usage
modtector extract [OPTIONS] --input <INPUT> --gtf <GTF> --target-gene <TARGET_GENE> --output <OUTPUT>
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --input | -i | String | Required | Count result CSV file from modtector count |
| --gtf | -g | String | Required | GTF/GFF annotation file |
| --target-gene | -t | String | Required | Target gene name or ID (e.g., “18S”, “RN18S1”, “rRNA”) |
| --output | -o | String | Required | Output file prefix (each gene region saved as: {prefix}_{gene_name}_depth{avg_depth}.csv) |
| --input-format |  | String | “csv” | Input format: “csv” (default) or “bam-readcount” |
| --threads |  | Integer | 1 | Number of parallel threads for processing multiple gene regions |
| --relative-position |  | Flag | false | Use relative position (1-based, relative to gene start) instead of absolute genomic position |
|  | -l | String | None | Log file path (optional) |
How It Works
Parse GTF file: Reads the GTF/GFF annotation file and identifies gene regions matching the target gene name or ID
Match gene regions: Searches for genes where gene_id or gene_name contains the target string
Extract positions: For each matched gene region, filters the count CSV to include only positions within that region
Calculate statistics: Computes average depth for each gene region
Output separate files: Creates a separate CSV file for each gene region with filename format: {prefix}_{gene_name}_depth{avg_depth}.csv
Gene Matching
The command matches genes by:
gene_id: Exact match or substring match
gene_name: Exact match or substring match
For example, searching for “18S” will match:
RN18S1 (gene_id)
18S_rRNA (gene_name)
Any gene containing “18S” in its ID or name
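The substring matching described above can be sketched as follows. This is a hedged example: it parses only GTF-style `key "value";` attributes (GFF3 `key=value` pairs are not handled), it is case-sensitive, and the helper names are hypothetical rather than taken from modtector.

```python
import re

# Matches GTF attribute pairs such as: gene_id "RN18S1";
ATTRIBUTE_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_attributes(column9):
    """Turn the ninth GTF column into a {key: value} dictionary."""
    return dict(ATTRIBUTE_RE.findall(column9))


def gene_matches(column9, target):
    """True if the target string occurs in gene_id or gene_name."""
    attrs = parse_attributes(column9)
    return any(target in attrs.get(key, "") for key in ("gene_id", "gene_name"))


print(gene_matches('gene_id "RN18S1"; gene_name "18S_rRNA";', "18S"))
print(gene_matches('gene_id "ACTB"; gene_name "ACTB";', "18S"))
```

Because the match is a substring test, a short target such as “rRNA” can select several genes at once, each of which is written to its own output file.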
Examples
# Extract 18S rRNA gene regions from count results (CSV format)
# If multiple regions match, each will be saved as a separate file
modtector extract \
-i benchmark/sc/count/single_cell_transcripts_in_hESC_NAIN3_RHE1581_b1.sort.csv \
-g annotation.gtf \
-t "18S" \
-o 18S_rRNA \
-l extract.log
# This will create files like:
# - 18S_rRNA_RN18S1_depth1234.5.csv
# - 18S_rRNA_RN18S2_depth987.2.csv
# (if multiple 18S genes are found)
# Extract from bam-readcount output format with multithreading
modtector extract \
-i /data1/bioinfo/07_people/lxt/Modector/PRJNA946273_fixed_bamreadcount/in_vivo_single_cell_Mut_transfected_RNAs_in_HEK293T_DMSO_RHX672.sort.readcount.txt \
-g annotation.gtf \
-t "18S" \
--input-format bam-readcount \
--threads 8 \
-o 18S_rRNA \
-l extract.log
# Extract with multithreading (CSV format)
modtector extract \
-i count_results.csv \
-g annotation.gtf \
-t "18S" \
--threads 8 \
-o 18S_output
# Extract with relative position (1-based, relative to gene start)
# Position will be converted from absolute genomic position to relative position
# Example: If gene starts at position 125931, absolute position 125931 becomes relative position 1
modtector extract \
-i count_results.csv \
-g annotation.gtf \
-t "18S" \
--relative-position \
-o 18S_rRNA_relpos
# Extract specific gene by gene_id
modtector extract \
-i count_results.csv \
-g annotation.gtf \
-t "RN18S1" \
-o RN18S1 \
-l extract.log
# Extract gene by partial name match (may match multiple genes)
modtector extract \
-i count_results.csv \
-g annotation.gtf \
-t "rRNA" \
-o rRNA_genes
Output Format
Each output CSV file maintains the same format as the input count CSV:
ChrID,pipe_truncation_Strand,pipe_truncation_ChrPos,rf_mutation_Base,rf_mutation_Count,pipe_truncation_count,depth,rf_mutation_ins,rf_mutation_del,base_A,base_C,base_G,base_T
Only rows matching the specific gene region are included in each output file.
Position Format:
Default (absolute position): pipe_truncation_ChrPos contains the absolute genomic position (e.g., 125931)
With --relative-position: pipe_truncation_ChrPos contains the relative position (1-based, relative to gene start). For example, if a gene starts at position 125931, the first position in the gene will be 1, the second will be 2, etc.
Example:
Gene region: NT_187388.1:125931-127799
Absolute position 125931 → Relative position 1 (with --relative-position)
Absolute position 125932 → Relative position 2 (with --relative-position)
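The conversion in this example is a one-line offset. The sketch below spells it out, with a bounds check added for illustration; the function name is hypothetical, and strand is not considered here because the source describes positions only as relative to the gene start.

```python
def to_relative(abs_pos, gene_start, gene_end):
    """Convert an absolute 1-based position to a 1-based in-gene position."""
    if not gene_start <= abs_pos <= gene_end:
        raise ValueError(f"position {abs_pos} is outside {gene_start}-{gene_end}")
    return abs_pos - gene_start + 1


# Gene region from the example: NT_187388.1:125931-127799
print(to_relative(125931, 125931, 127799))
print(to_relative(125932, 125931, 127799))
print(to_relative(127799, 125931, 127799))
```

The last call shows that the final base of the region maps to 1869, which is also the length of the region.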
File Naming
Output files are named using the pattern: {prefix}_{gene_name}_depth{avg_depth}.csv
{prefix}: The output prefix specified with -o
{gene_name}: The gene name from GTF (or gene_id if gene_name is not available)
{avg_depth}: The average depth calculated for that gene region (rounded to 1 decimal place)
Example filenames:
18S_rRNA_RN18S1_depth1234.5.csv
rRNA_genes_28S_rRNA_depth987.2.csv
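The naming pattern can be reproduced with a short helper. This is an illustrative sketch: the function name is hypothetical, and the set of sanitized characters (`/`, `:`, and whitespace) is taken from the examples this page gives, so the real list of replaced characters may be longer.

```python
import re


def output_filename(prefix, gene_name, avg_depth):
    """Build {prefix}_{gene_name}_depth{avg_depth}.csv with a sanitized name."""
    # Replace characters that are awkward in filenames with underscores.
    safe_name = re.sub(r"[/:\s]", "_", gene_name)
    # Average depth is rounded to one decimal place.
    return f"{prefix}_{safe_name}_depth{avg_depth:.1f}.csv"


print(output_filename("18S_rRNA", "RN18S1", 1234.5))
print(output_filename("rRNA_genes", "28S rRNA", 987.2))
```

These two calls reproduce the example filenames listed above.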
Input Formats
CSV Format (default)
Standard CSV format from modtector count command with the following columns:
ChrID,pipe_truncation_Strand,pipe_truncation_ChrPos,rf_mutation_Base,rf_mutation_Count,pipe_truncation_count,depth,rf_mutation_ins,rf_mutation_del,base_A,base_C,base_G,base_T
bam-readcount Format
Tab-separated format from bam-readcount software:
Format:
chr position reference_base depth base:count:avg_mapping_quality:...
Each base (A, C, G, T, N, =, +, -) has colon-separated statistics
Note: bam-readcount format does not include stop signals (pipe_truncation_count), so this value will be set to 0 in the output
Strand information is taken from the GTF file (not available in bam-readcount output)
Performance Optimization
Multithreading: Use --threads to enable parallel processing of multiple gene regions
Recommended: 4-8 threads for optimal performance
Automatically falls back to sequential processing for single region or single thread
Provides 6-40× speedup depending on format and thread count
bam-readcount Format: File is read once into memory, then processed in parallel
Significantly faster than sequential processing (40× speedup with 8 threads)
Memory usage: Entire file loaded into memory (consider file size)
Notes
The GTF file should contain gene_id and/or gene_name attributes
When using bam-readcount format, stop signals are not available and will be set to 0
Both GTF and GFF3 formats are supported
Gene coordinates are 1-based (GTF standard)
Each matching gene region is saved to a separate file
If multiple genes match, each will have its own output file with its gene name and average depth in the filename
Gene names in filenames are sanitized (special characters like /, :, and spaces are replaced with _)
Files with zero extracted rows are not created
Multithreading: Parallel processing is most effective when extracting multiple gene regions
convert - Format Conversion
Convert various input formats to modtector pileup CSV format. Supports multiple input formats including bamreadcount, rf-rctools, rf-norm, shapemapper2, samtools-mpileup, icSHAPE-rt, and bedgraph formats.
Usage
modtector convert [OPTIONS] --input <INPUT> --output <OUTPUT>
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --input | -i | String | Required* | Input file (or use --input-mutation/--input-stop) |
| --output | -o | String | Required | Output pileup CSV file |
|  | -f | String | Auto-detect | Input format (see Supported Input Formats) |
|  | -s | String | “+” | Strand orientation (‘+’ for forward, ‘-’ for reverse, or ‘None’ for all strands) |
| --rf-count-mutations |  | Flag | false | For rf-rctools format: set when the input was generated in mutation-counting mode |
| --ref-fasta |  | String | None | Reference FASTA file path (required for shapemapper2, icSHAPE-rt, and bedgraph formats) |
|  |  | String | None | Chromosome/contig name (for formats with missing metadata) |
| --input-mutation |  | String | None | Optional mutation input file (rf-rctools format, mutation mode) |
| --input-stop |  | String | None | Optional stop input file (rf-rctools format, stop mode) |
| --filter-strand |  | String | “+” | Filter by strand orientation (‘+’ or ‘-’ or ‘None’ to process all strands) |
|  | -l | String | None | Log file path (optional) |
*Note: In dual input mode (--input-mutation/--input-stop), --input can be empty.
Supported Input Formats
bamreadcount: Tab-separated format from bam-readcount software
rf-rctools: RNA Framework / rctools counting output (with optional --rf-count-mutations flag)
rf-norm: rf-norm CSV output (reactivity-level data)
rf-norm-xml: rf-norm XML output
shapemapper2: ShapeMapper2 profile files (requires --ref-fasta)
shapemapper-profile: ShapeMapper2 profile format with dynamic column detection
samtools-mpileup: Samtools mpileup output (10-column tab-separated format)
icSHAPE-rt: icSHAPE-pipe RT file format (requires --ref-fasta)
bedgraph: Bedgraph format (5-column: chr, start, end, coverage, strand; requires --ref-fasta)
Examples
# Basic usage (bamreadcount format, auto-detected)
modtector convert -i bamreadcount.txt -o pileup.csv
# With strand specification
modtector convert -i bamreadcount.txt -o pileup.csv -s +
# Convert rf-rctools format (mutation mode)
modtector convert -i mutation.tsv -o pileup.csv -f rf-rctools --rf-count-mutations
# Convert rf-rctools format (stop mode)
modtector convert -i stop.tsv -o pileup.csv -f rf-rctools
# Dual input mode: merge mutation and stop files
modtector convert --input-mutation mutation.tsv --input-stop stop.tsv -o pileup.csv -f rf-rctools
# Convert shapemapper2 format (requires reference)
modtector convert -i shapemapper.profile -o pileup.csv -f shapemapper-profile --ref-fasta reference.fa
# Convert icSHAPE RT format
modtector convert -i input.rt -o pileup.csv -f icSHAPE-rt --ref-fasta reference.fa --filter-strand "+"
# Convert bedgraph format
modtector convert -i input.bedgraph -o pileup.csv -f bedgraph --ref-fasta reference.fa --filter-strand "+"
# Convert samtools mpileup format
modtector convert -i mpileup.txt -o pileup.csv -f samtools-mpileup -s +
# With logging
modtector convert -i input.txt -o pileup.csv -s + -l convert.log
Input Format (bamreadcount)
Tab-separated format from bam-readcount software:
EC_16S 1 A 77 A:77:29.25:33.82:... C:0:0.00:... G:0:0.00:... T:0:0.00:...
Fields:
Column 1: Chromosome/contig name
Column 2: Position (1-based)
Column 3: Reference base
Column 4: Total depth
Column 5+: Base counts in format base:count:quality:...
Output Format (modtector pileup)
CSV format compatible with modtector reactivity command:
ChrID,pipe_truncation_Strand,pipe_truncation_ChrPos,rf_mutation_Base,rf_mutation_Count,pipe_truncation_count,depth,rf_mutation_ins,rf_mutation_del,base_A,base_C,base_G,base_T
EC_16S,+,1,A,0,0,77,0,0,77,0,0,0
Conversion Details
Mutation Count: Calculated as sum of non-reference bases (A, C, G, T)
Stop Signals: pipe_truncation_count is set to 0 (not available in bamreadcount format)
Strand Information: Uses the specified strand parameter (default: ‘+’)
Insertions/Deletions: Extracted from bamreadcount format (+ for insertions, - for deletions)
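The rules above can be sketched as a single-line converter. This is an illustration, not the code behind the convert command: the function name is hypothetical, fields other than A/C/G/T and `+`/`-` entries are simply ignored, and insertion and deletion counts are assumed to be the sum of all `+...` and `-...` entries on the line.

```python
def convert_line(line, strand="+"):
    """Convert one bam-readcount line into a modtector pileup CSV row."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, depth = fields[0], int(fields[1]), fields[2].upper(), int(fields[3])
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    insertions = deletions = 0
    for field in fields[4:]:
        parts = field.split(":")
        key, count = parts[0], int(parts[1])
        if key in counts:
            counts[key] = count
        elif key.startswith("+"):
            insertions += count
        elif key.startswith("-"):
            deletions += count
    # Mutation count: sum of the non-reference base counts.
    mutations = sum(c for base, c in counts.items() if base != ref)
    stop_count = 0  # stop signals are not available in this format
    row = [chrom, strand, pos, ref, mutations, stop_count, depth,
           insertions, deletions,
           counts["A"], counts["C"], counts["G"], counts["T"]]
    return ",".join(str(value) for value in row)


example = "EC_16S\t1\tA\t77\tA:77:29.25:33.82\tC:0:0.00:0.00\tG:0:0.00:0.00\tT:0:0.00:0.00"
print(convert_line(example))
```

For the example input line shown earlier (with the elided statistics shortened), this prints the same row as the documented output: EC_16S,+,1,A,0,0,77,0,0,77,0,0,0.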
Performance
Streaming Processing: Processes files line-by-line without loading entire file into memory
Large File Support: Tested with files up to 35GB
Processing Speed: ~1-2 million lines/second (depends on I/O)
Memory Usage: O(1) constant regardless of file size
NFS Optimization: Uses 8MB buffer for better network filesystem performance
Progress Reporting
The command reports progress every 1 million lines:
Total lines processed
Valid lines converted
Error count (invalid lines skipped)
Error Handling
Invalid lines are skipped and error count is tracked
First 10 errors are logged for debugging
Processing continues despite errors
Final statistics show total, valid, and error counts
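The streaming, skip-on-error behaviour described in the last three sections can be sketched as a small loop. This is a hedged illustration: `stream_convert` and `parse_depth_line` are hypothetical names, the toy parser keeps only the first four columns, and the progress message format is invented for the example.

```python
import io


def parse_depth_line(line):
    """Toy parser: validate and return the first four columns as CSV."""
    fields = line.rstrip("\n").split("\t")
    return ",".join([fields[0], str(int(fields[1])), fields[2], str(int(fields[3]))])


def stream_convert(src, dst, convert, max_logged=10, progress_every=1_000_000):
    """Convert line by line, skipping invalid lines and counting errors."""
    stats = {"total": 0, "valid": 0, "errors": 0}
    logged = []
    for line in src:  # one line in memory at a time
        stats["total"] += 1
        try:
            dst.write(convert(line) + "\n")
            stats["valid"] += 1
        except (ValueError, IndexError) as exc:
            stats["errors"] += 1
            if len(logged) < max_logged:  # keep only the first few errors
                logged.append(f"line {stats['total']}: {exc}")
        if stats["total"] % progress_every == 0:
            print(f"processed={stats['total']} valid={stats['valid']} errors={stats['errors']}")
    return stats, logged


source = io.StringIO("c\t1\tA\t5\nbroken\nc\t2\tC\tx\n")
target = io.StringIO()
final_stats, first_errors = stream_convert(source, target, parse_depth_line)
print(final_stats)
```

Because the loop never holds more than one line, memory use does not grow with file size, which is the property the Performance section describes.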
Use Cases
Workflow Integration: Convert bamreadcount output for use with modtector reactivity command
Format Standardization: Standardize bamreadcount data to modtector pileup format
Pipeline Compatibility: Enable bamreadcount-based workflows to use modtector reactivity features
Notes
Output file must have .csv extension
Strand parameter must be either ‘+’ or ‘-’ (or ‘None’ for all strands)
Input file must exist and be readable
The conversion preserves all position and depth information
Base counts (A, C, G, T) are extracted from input format
Insertions and deletions are included if present in input format
Format auto-detection works by checking file extension, headers, and data patterns
NaN values are preserved in output for formats that support them (rf-norm)
correct - PCR Bias Correction
Apply PCR bias correction to pileup CSV files using Chi-Square distribution fitting. This command corrects for PCR amplification bias where high depths cause mutation rate dilution, similar to shapemapper2’s effective_depth calculation.
Usage
modtector correct [OPTIONS] --input <INPUT> --output <OUTPUT>
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --input | -i | String | Required | Input pileup CSV file |
| --output | -o | String | Required | Output pileup CSV file (with corrected effective_depth) |
| --weight-increase | | Float | 1.0 | Weight for increasing depth correction (used when depth is too low) |
| --weight-decrease | | Float | 0.5 | Weight for decreasing depth correction (used when depth is too high) |
| --species | | String | None | Species name for filtering (optional) |
How It Works
Data Binning: Groups depth and mutation rate data into bins
Chi-Square Fitting: Fits a Chi-Square distribution (df=2) to the depth-mutation rate relationship
Correction Factor Calculation: Computes correction factors based on the fitted distribution
Effective Depth Calculation: Applies correction factors to adjust effective_depth values
Output: Writes corrected pileup CSV with updated effective_depth column
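As an illustration of the data binning step, the sketch below groups positions into fixed-width depth bins and averages the mutation rate in each bin. The fixed bin width, the function name, and the use of a plain mean are assumptions made for this example; the binning scheme modtector actually uses is not specified here.

```python
from collections import defaultdict


def bin_mutation_rates(rows, bin_width=100):
    """Group (depth, mutation_count) pairs into depth bins.

    Returns {bin_start: mean mutation rate}, ordered by bin_start.
    Positions with zero depth are skipped because their rate is undefined.
    """
    rate_sums = defaultdict(float)
    counts = defaultdict(int)
    for depth, mutations in rows:
        if depth <= 0:
            continue
        start = (depth // bin_width) * bin_width
        rate_sums[start] += mutations / depth
        counts[start] += 1
    return {start: rate_sums[start] / counts[start] for start in sorted(rate_sums)}


# Two positions fall in the 0-99 bin, one in the 100-199 bin, one is skipped.
binned = bin_mutation_rates([(50, 5), (80, 4), (150, 3), (0, 0)])
```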
Method Details
The correction uses a Chi-Square distribution model:
Distribution: Chi-Square with 2 degrees of freedom
Fitting: Brent’s method for 1D minimization to find optimal scale parameter
Correction: Applies weights for increasing/decreasing depth based on fitted model
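To make the fitting idea concrete, here is a small sketch of estimating the scale parameter of a Chi-Square distribution with 2 degrees of freedom by 1D minimization. Several things are assumptions: the objective (negative log-likelihood over a sample of values) is chosen for illustration, golden-section search is used as a simpler stand-in for Brent's method, and `blend_depth` with its `model_depth` argument is a hypothetical way the two weights could act asymmetrically. None of this is taken from modtector's source.

```python
import math


def chi2_df2_nll(scale, samples):
    """Negative log-likelihood under a scaled Chi-Square (df=2) model.

    pdf(x; scale) = exp(-x / (2 * scale)) / (2 * scale) for x >= 0.
    """
    return sum(math.log(2.0 * scale) + x / (2.0 * scale) for x in samples)


def golden_section_minimize(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function on [lo, hi] (stand-in for Brent's method)."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while abs(b - a) > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d   # minimum lies in [a, d]
        else:
            a = c   # minimum lies in [c, b]
    return (a + b) / 2.0


def fit_scale(samples):
    """Fit the scale parameter; the optimum (mean / 2) lies below max(samples)."""
    if not samples or max(samples) <= 0:
        raise ValueError("need at least one positive sample")
    return golden_section_minimize(lambda s: chi2_df2_nll(s, samples),
                                   1e-9, max(samples))


def blend_depth(depth, model_depth, weight_increase=1.0, weight_decrease=0.5):
    """Hypothetical asymmetric correction toward a model-derived depth."""
    delta = model_depth - depth
    weight = weight_increase if delta > 0 else weight_decrease
    return depth + weight * delta


fitted = fit_scale([100.0, 200.0, 300.0, 400.0])  # analytic optimum is 125.0
raised = blend_depth(100, 150)    # depth too low: full step up
lowered = blend_depth(200, 100)   # depth too high: half step down
```

For this particular objective the optimum has a closed form (the sample mean divided by 2), which makes the numerical result easy to check.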
Examples
# Basic usage
modtector correct -i pileup.csv -o pileup_corrected.csv
# With custom weights
modtector correct -i pileup.csv -o pileup_corrected.csv \
--weight-increase 1.2 \
--weight-decrease 0.4
# With species filtering
modtector correct -i pileup.csv -o pileup_corrected.csv \
--species "human"
Output Format
The output CSV maintains the same format as input, with the effective_depth column updated:
Original depth column: Unchanged (total read depth)
effective_depth column: Corrected depth based on PCR bias model
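The output contract described above (same columns, depth untouched, effective_depth rewritten) can be sketched with Python's csv module. The function name and the `correct` callback are hypothetical, and appending an effective_depth column when the input lacks one is an assumption of this example rather than documented behavior.

```python
import csv
import io


def rewrite_effective_depth(src, dst, correct):
    """Copy a pileup CSV, replacing effective_depth with correct(depth).

    All other columns, including depth, are written back unchanged.
    For real files, open both with newline="" as the csv module recommends.
    """
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or [])
    if "effective_depth" not in fieldnames:
        fieldnames.append("effective_depth")  # assumption: add it if missing
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["effective_depth"] = f"{correct(float(row['depth'])):.2f}"
        writer.writerow(row)


# Demo with a placeholder correction that simply halves the depth.
demo_in = io.StringIO("ChrID,depth,effective_depth\nEC_16S,200,200\n")
demo_csv = io.StringIO()
rewrite_effective_depth(demo_in, demo_csv, lambda depth: depth * 0.5)
corrected_csv = demo_csv.getvalue()
```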
Use Cases
Quality Filtering Enhancement: Combine with --min-base-qual in count command for comprehensive depth correction
High-Depth Data: Correct for mutation rate dilution in high-coverage regions
Workflow Integration: Optional step in analysis pipeline (can be disabled in default workflow)
Notes
This is an optional advanced feature
Can be combined with quality filtering (--min-base-qual) for best results
Correction factors are calculated per-species if --species is specified
The method is based on Chi-Square distribution fitting, similar to shapemapper2’s approach
Global Options
All commands support these global options:
| Option | Short | Description |
|---|---|---|
| --help | | Print help information |
| | | Print version information |
Output Files
Common Output Formats
CSV Files: Comma-separated values with headers
SVG Files: Scalable vector graphics for plots
Text Files: Plain text for evaluation results
Log Files: Processing logs and error messages
File Naming Conventions
Input files: *.bam, *.fa, *.csv
Output files: *_count.csv, *_norm.csv, *_reactivity.csv
Plots: *_distribution.svg, *_reactivity.svg, *_roc.svg
Evaluation: *_evaluation.txt, *_comprehensive.txt
Best Practices
Use appropriate thread counts: Match -t to your CPU cores
Monitor memory usage: Large datasets may require significant RAM
Check log files: Always review logs for warnings and errors
Validate inputs: Ensure BAM files are properly aligned
Use consistent parameters: Keep parameters consistent across runs
Backup results: Save intermediate results for reproducibility
Troubleshooting
Common Issues
Memory errors: Reduce thread count or use smaller datasets
File not found: Check file paths and permissions
Low coverage: Increase sequencing depth or adjust thresholds
Poor results: Check data quality and parameter settings
Getting Help
Use --help for command-specific help
Check log files for detailed error messages
Refer to the Troubleshooting Guide
Open an issue on the GitHub repository