Command Reference

This page provides comprehensive documentation for all modtector commands and their options.

Overview

modtector provides ten main commands:

count - Generate pileup data from BAM files
norm - Normalize and filter signals
reactivity - Calculate reactivity scores
compare - Compare samples and identify differences
evaluate - Evaluate accuracy using secondary structure
plot - Generate visualization plots
duet - Decompose normalized reactivity into dynamic ensembles using read-level co-variation
extract - Extract gene regions from count results using GTF annotation
convert - Convert various input formats to modtector pileup CSV format
correct - Apply PCR bias correction to pileup CSV file

count - Data Processing

Generate pileup data from BAM files by counting stop and mutation signals.

Usage

modtector count [OPTIONS] --bam <BAM> --fasta <FASTA> --output <OUTPUT>

Options

Option	Short	Type	Default	Description
`--bam`	`-b`	String	Required	BAM file path (or glob pattern for batch/single-cell mode)
`--fasta`	`-f`	String	Required	Reference FASTA file path
`--output`	`-o`	String	Required	Output CSV path (or directory for batch/single-cell mode)
`--strand`	`-s`	String	“+/-”	Strand filter (`+`, `-`, or `+/-`)
`--threads`	`-t`	Integer	1	Number of parallel threads
`--log`	`-l`	String	None	Log file path (optional)
`--window`	`-w`	Integer	None	Window size for genome segmentation (bases)
`--batch`		Flag	false	Enable batch mode: process multiple BAM files sequentially
`--single-cell`		Flag	false	Enable single-cell unified mode: unified processing with cell labels
`--min-base-qual`		Integer	0	Minimum base quality score (Phred score) to count a mutation. Only mutations with quality >= threshold are counted. Recommended: 20 (same as RNAFramework rf-count). Set to 0 to disable.
`--pcr-bias-correction`		Flag	false	Enable PCR bias correction for depth calculation. When enabled, effective_depth will be calculated and used instead of raw depth.
`--weight-increase`		Float	1.0	Weight for increasing depth correction (used when depth is too low)
`--weight-decrease`		Float	0.5	Weight for decreasing depth correction (used when depth is too high)

Examples

# Basic usage (standard mode)
modtector count -b sample.bam -f reference.fa -o output.csv

# With multiple threads
modtector count -b sample.bam -f reference.fa -o output.csv -t 8

# With windowing
modtector count -b sample.bam -f reference.fa -o output.csv -w 1000

# With logging
modtector count -b sample.bam -f reference.fa -o output.csv -l count.log

# Batch mode: process multiple BAM files sequentially
modtector count --batch \
  -b "/path/to/bam/*sort.bam" \
  -f reference.fa \
  -o output_dir/ \
  -t 8 \
  -w 10000

# Single-cell unified mode: unified processing with cell labels
modtector count --single-cell \
  -b "/path/to/single_cell/*sort.bam" \
  -f reference.fa \
  -o output_dir/ \
  -t 8 \
  -w 10000 \
  -l batch.log

Batch and Single-cell Modes

Batch Mode (--batch):

Processes multiple BAM files sequentially
Each file is processed independently
Suitable for scenarios requiring separate processing per file
Output: Each file generates a separate CSV file in the output directory
Example glob patterns: *sort.bam, *_RHX*.bam, sample_*.bam

Single-cell Unified Mode (--single-cell):

True unified processing strategy for single-cell data
Processing Flow:
1. For each window, collects all reads from all BAM files at once
2. Performs unified pileup processing (all reads together)
3. Tracks cell labels during processing, splits results by cell
4. Outputs separate CSV file for each cell
Performance Benefits (2-3x speedup):
- Skips data distribution scanning (direct processing of all reference sequences)
- Unified read collection and pileup (reduces I/O overhead significantly)
- Better parallelization efficiency (cross-file parallel processing)
- Window-based memory management (prevents memory accumulation)
Progress Reporting:
- Real-time display: chunks processed, percentage, speed (chunks/s), ETA
- CPU usage recommendations based on BAM file count
Output: Each cell generates a separate CSV file (cell label extracted from filename)
Cell Label Extraction: Supports RHX pattern (e.g., *RHX672.sort.bam → RHX672) or last underscore-separated part

Output Format

The output CSV contains the following columns:

ChrID: Chromosome/contig identifier
pipe_truncation_Strand: Strand (+ or -)
pipe_truncation_ChrPos: Position (1-based)
rf_mutation_Base: Reference base
rf_mutation_Count: Mutation count
pipe_truncation_count: Stop count
depth: Total read depth
rf_mutation_ins: Insertion count
rf_mutation_del: Deletion count
base_A, base_C, base_G, base_T: Base counts

norm - Signal Normalization

Normalize and filter signals to remove noise and outliers.

Usage

modtector norm [OPTIONS] --input <INPUT> --output <OUTPUT> --method <METHOD>

Options

Option	Short	Type	Default	Description
`--input`	`-i`	String	Required	Input CSV file
`--output`	`-o`	String	Required	Output CSV file
`--method`	`-m`	String	Required	Normalization method
`--cutoff`		Float	0.25	SNP threshold for filtering mutation signals
`--bases`		String	“ACGT”	Reactive base types (e.g., “AC” for A and C)
`--window`		Integer	0	Fixed window size (fixed sliding window mode)
`--window-offset`		Integer	0	Window offset (fixed sliding window mode)
`--dynamic`		Flag	False	Enable dynamic sliding window mode
`--window-size-dynamic`		Integer	50	Number of reactive bases required per window
`--linear`		Flag	False	Apply piecewise linear mapping
`--log`	`-l`	String	None	Log file path (optional)

Normalization Methods

percentile28: 28th percentile normalization
winsor90: 90th percentile winsorization
boxplot: Boxplot-based outlier removal

Examples

# Basic normalization
modtector norm -i input.csv -o output.csv -m winsor90

# Target specific bases
modtector norm -i input.csv -o output.csv -m winsor90 --bases AC

# With dynamic windowing
modtector norm -i input.csv -o output.csv -m winsor90 --dynamic

# With linear mapping
modtector norm -i input.csv -o output.csv -m winsor90 --linear

reactivity - Reactivity Calculation

Calculate reactivity scores by comparing modified and unmodified samples.

Usage

modtector reactivity [OPTIONS] --mod <MOD_CSV> --unmod <UNMOD_CSV> --output <OUTPUT>

Options

Option	Short	Type	Default	Description
`--mod`	`-M`	String	Required	Modified sample CSV
`--unmod`	`-U`	String	Optional	Unmodified sample CSV (optional for mod-only mode)
`--output`	`-O`	String	Required	Reactivity output file
`--stop-method`	`-s`	String	“kfactor”	Stop signal reactivity method: `kfactor`, `ding`, `rouskin`
`--mutation-method`	`-m`	String	“kfactor”	Mutation signal reactivity method: `kfactor`, `siegfried`, `zubradt`
`--pseudocount`		Float	1.0	Pseudocount parameter for logarithmic calculations (Ding method)
`--maxscore`		Float	10.0	Maximum score limit parameter (Ding method)
`--snp-cutoff`		Float	0.25	SNP threshold for filtering positions (filter positions where unmod sample mutation rate >= cutoff)
`--k-prediction-method`		String	“background”	K-factor prediction method: `background` (default), `distribution`, `recursive`
`--structure-file`		String	None	Reference secondary structure file (required for recursive k-factor method)
`--k-background-gene-id`		String	None	Gene ID to use for k-factor calculation (only positions with this gene_id will be used as background regions)
`--threads`	`-t`	Integer	1	Number of parallel threads
`--log`	`-l`	String	None	Log file path (optional)

Reactivity Methods

Stop Signal Methods

kfactor (default): K-factor method using difference between modified and unmodified samples
ding: Ding et al. method with pseudocount and maxscore parameters
rouskin: Rouskin et al. method (uses only modified sample stop signal)

Mutation Signal Methods

kfactor (default): K-factor method using difference between modified and unmodified samples
siegfried: Siegfried et al. method (allows negative reactivity values)
zubradt: Zubradt et al. method

K-Factor Prediction Methods

background (default): Predict k-factor from background regions (low mutation rate positions)
distribution: Predict k-factor using statistical distribution analysis (improved accuracy for many sample types)
recursive: Predict k-factor recursively using reference secondary structure (requires --structure-file)

Examples

# Basic reactivity calculation (with unmodified sample)
modtector reactivity -M mod.csv -U unmod.csv -O reactivity.csv

# Mod-only mode (no unmodified sample, for smartSHAPE datasets)
modtector reactivity -M mod.csv -O reactivity.csv

# With specific methods
modtector reactivity -M mod.csv -U unmod.csv -O reactivity.csv -s ding -m siegfried

# With custom parameters
modtector reactivity -M mod.csv -U unmod.csv -O reactivity.csv --pseudocount 0.5 --maxscore 5

# Using distribution-based k-factor prediction
modtector reactivity -M mod.csv -U unmod.csv -O reactivity.csv --k-prediction-method distribution

# Using recursive k-factor prediction (requires structure file)
modtector reactivity -M mod.csv -U unmod.csv -O reactivity.csv \
    --k-prediction-method recursive \
    --structure-file structure.dp \
    --k-background-gene-id gene_id

# With SNP filtering
modtector reactivity -M mod.csv -U unmod.csv -O reactivity.csv --snp-cutoff 0.2

compare - Sample Comparison

Compare modified and unmodified samples to identify significant differences.

Usage

modtector compare [OPTIONS] --output <OUTPUT>

Options

Option	Short	Type	Default	Description
`--mode`	`-m`	String	“mod-vs-unmod”	Comparison mode
`--mod`	`-M`	String	None	Modified sample CSV (mod-vs-unmod mode)
`--unmod`	`-U`	String	None	Unmodified sample CSV (mod-vs-unmod mode)
`--group1`		String	None	Group 1 file list (biological-replicates mode)
`--group2`		String	None	Group 2 file list (biological-replicates mode)
`--reactivity-group1`		String	None	Group 1 reactivity files (reactivity-groups mode)
`--reactivity-group2`		String	None	Group 2 reactivity files (reactivity-groups mode)
`--reactivity1`		String	None	First reactivity result file (reactivity-results mode)
`--reactivity2`		String	None	Second reactivity result file (reactivity-results mode)
`--output`	`-o`	String	Required	Output CSV path
`--min-depth`	`-d`	Integer	10	Minimum depth
`--min-fold`	`-f`	Float	2	Minimum fold change
`--test`	`-t`	String	“t-test”	Statistical test type
`--alpha`	`-a`	Float	0.05	Significance level
`--log`	`-l`	String	None	Log file path (optional)

Comparison Modes

mod-vs-unmod: Compare modified vs unmodified samples
reactivity-groups: Compare two groups of reactivity results
biological-replicates: Compare biological replicates
reactivity-results: Compare two reactivity result files

Statistical Tests

t-test: Student’s t-test
mann-whitney: Mann-Whitney U test
wilcoxon: Wilcoxon signed-rank test
chi-square: Chi-square test
continuity: Continuity correction
diffscan: Differential scanning
deltashape: DeltaSHAPE method

Examples

# Basic comparison
modtector compare -M mod.csv -U unmod.csv -o comparison.csv

# With custom thresholds
modtector compare -M mod.csv -U unmod.csv -o comparison.csv -d 20 -f 1.5

# Using different statistical test
modtector compare -M mod.csv -U unmod.csv -o comparison.csv -t mann-whitney

# Compare reactivity groups
modtector compare --mode reactivity-groups \
    --reactivity-group1 group1_reactivity.csv \
    --reactivity-group2 group2_reactivity.csv \
    -o comparison.csv

evaluate - Accuracy Evaluation

Evaluate reactivity accuracy using known secondary structure.

Usage

modtector evaluate [OPTIONS] --reactivity <REACTIVITY_FILE> --structure <STRUCTURE_FILE> --output <OUTPUT_DIR> --gene-id <GENE_ID>

Options

Option	Short	Type	Default	Description
`--reactivity`	`-r`	String	Required	Reactivity signal file
`--structure`	`-s`	String	Required	Secondary structure file
`--output`	`-o`	String	Required	Output directory
`--log`	`-l`	String	None	Log file path (default to result directory)
`--signal-type`	`-t`	String	“stop”	Signal type (`stop`, `mutation`, or `both`)
`--gene-id`	`-g`	String	Required	Gene ID (for base matching)
`--strand`	`-S`	String	“+”	Strand information (`+` or `-`)
`--base-matching`		Flag	True	Use base matching (default: true)
`--auto-shift`		Flag	True	Use auto-shift correction (default: true)
`--reactive-bases`		String	“ACGT”	Reactive bases for AUC/ROC calculation (e.g., `AC` for DMS; default `ACGT` for all bases)
`--optimized`	`-O`	Flag	False	Use optimized version

Examples

# Basic evaluation
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id

# Evaluate both signal types
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id -t both

# Without auto-shift correction
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id --no-auto-shift

# Using optimized version
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id -O

plot - Visualization

Generate signal distribution plots and reactivity visualizations, including RNA structure SVG plots.

Usage

Regular Mode

modtector plot [OPTIONS] --mod <MOD_CSV> --unmod <UNMOD_CSV> --output <OUTPUT>

SVG-Only Mode

modtector plot [OPTIONS] --output <OUTPUT> --svg-template <SVG_TEMPLATE> --reactivity <REACTIVITY_CSV>

Options

Option	Short	Type	Default	Description
`--mod`	`-M`	String	Optional*	Modified sample CSV (required for regular mode)
`--unmod`	`-U`	String	Optional*	Unmodified sample CSV (required for regular mode)
`--output`	`-o`	String	Required	Output directory
`--coverage`	`-c`	Float	0.2	Coverage threshold
`--depth`	`-d`	Integer	50	Depth threshold (reads)
`--reactivity`	`-r`	String	None	Reactivity CSV file (optional)
`--gff`	`-g`	String	None	Genome annotation GFF/GTF file (optional)
`--threads`	`-t`	Integer	1	Number of parallel threads
`--log`	`-l`	String	None	Log file path (optional)

SVG Plotting Options

Option	Type	Default	Description
`--svg-template`	String	None	SVG template file for RNA structure visualization
`--svg-bases`	String	“ACGT”	Bases to include in SVG plot (e.g., “AC” for DMS-seq)
`--svg-signal`	String	“all”	Signal type to plot (`all`, `stop`, `mutation`, `score`)
`--svg-strand`	String	“+”	Strand to include (`+`, `-`, `both`)
`--svg-ref`	String	None	Reference sequence file for alignment (optional)
`--svg-max-shift`	Integer	5	Maximum shift for alignment search
`--svg-circle-filled`	Flag	False	Circle fill type: `true` for filled circles, `false` for hollow rings
`--svg-font-color`	String	None	Font color for text elements (e.g., “black”, “red”, “#000000”)
`--svg-legend-width`	Float	30.0	Legend item width for horizontal layout
`--svg-legend-height`	Float	15.0	Legend item height for vertical layout
`--interactive`	Flag	False	Generate interactive HTML visualization instead of static SVG

Note: -M and -U parameters are optional when using SVG-only mode (when both --svg-template and --reactivity are provided).

Examples

Regular Plotting

# Basic plotting
modtector plot -M mod.csv -U unmod.csv -o plots/

# With reactivity data
modtector plot -M mod.csv -U unmod.csv -o plots/ -r reactivity.csv

# With custom thresholds
modtector plot -M mod.csv -U unmod.csv -o plots/ -c 0.3 -d 100

# With genome annotation
modtector plot -M mod.csv -U unmod.csv -o plots/ -g annotation.gff

SVG-Only Mode

# Basic SVG plotting (simplified command)
modtector plot -o output/ --svg-template template.svg --reactivity data.csv

# Multi-signal SVG plotting
modtector plot -o output/ \
    --svg-template template.svg \
    --reactivity data.csv \
    --svg-bases ATGC \
    --svg-signal all \
    --svg-strand +

# Single signal SVG plotting
modtector plot -o output/ \
    --svg-template template.svg \
    --reactivity data.csv \
    --svg-bases AC \
    --svg-signal stop \
    --svg-strand +

# SVG plotting with alignment
modtector plot -o output/ \
    --svg-template template.svg \
    --reactivity data.csv \
    --svg-bases ATGC \
    --svg-signal all \
    --svg-strand + \
    --svg-ref reference.fa \
    --svg-max-shift 10

# SVG plotting with custom styling
modtector plot -o output/ \
    --svg-template template.svg \
    --reactivity data.csv \
    --svg-circle-filled \
    --svg-font-color "black" \
    --svg-legend-width 40.0 \
    --svg-legend-height 20.0

# Interactive HTML visualization
modtector plot -o output/ \
    --svg-template template.svg \
    --reactivity data.csv \
    --interactive

Combined Mode (Regular + SVG)

# Generate both regular plots and SVG plots
modtector plot -M mod.csv -U unmod.csv -o output/ \
    --svg-template template.svg \
    --reactivity data.csv \
    --svg-bases ATGC \
    --svg-signal all

SVG Plotting Features

Signal Type Support

Auto-detection: Automatically detects signal types from CSV headers
Multiple signals: Supports plotting multiple signal types (stop, mutation, etc.)
Single signal: Works with single signal files
Custom naming: Output files named by signal type (e.g., rna_structure_colored_stop.svg)

Strand Selection

Positive strand: --svg-strand + (default)
Negative strand: --svg-strand -
Both strands: --svg-strand both

Base Filtering

All bases: --svg-bases ATGC (default)
DMS-seq: --svg-bases AC (A and C only)
Custom: Any combination of A, T, C, G

Alignment Support

Reference sequence: Use --svg-ref for sequence alignment
Shift calculation: Automatic calculation of optimal shift values
Max shift: Control maximum shift search range with --svg-max-shift

Output Files

Regular Mode Output

overall_scatter.png: Signal distribution plot
high/: High coverage gene plots
low/: Low coverage gene plots

SVG Mode Output

rna_structure_colored_[signal].svg: SVG files for each signal type
Multiple SVG files for multi-signal data
Single SVG file for single-signal data

Combined Mode Output

All regular mode files
All SVG mode files

duet - Dynamic Ensemble Decomposition

Duet performs a sliding-window analysis that merges normalized stop/mutation reactivity with read-level co-variation to detect alternative ensembles without predefining cluster counts. Within each window, read-level stop/mutation co-occurrence vectors are clustered with DBSCAN to highlight dynamic structural mixtures.

Usage

modtector duet [OPTIONS] --input <INPUT> --bam <BAM> --fasta <FASTA> --output <OUTPUT>

Options

Option	Short	Type	Default	Description
`--input`	`-i`	String	Required	Normalized reactivity CSV file
`--bam`	`-b`	String	Required	Sorted BAM file for co-variation analysis
`--fasta`	`-f`	String	Required	Reference FASTA used for alignment
`--output`	`-o`	String	Required	Window-level Duet results CSV
`--epsilon`		Float	0.75	DBSCAN ε (radius) in standardized feature space
`--min-samples`		Integer	5	Minimum neighbours required to form a DBSCAN core read
`--window-size`		Integer	100	Sliding-window size in nucleotides
`--window-step`		Integer	50	Sliding-window step in nucleotides
`--threads`	`-t`	Integer	System CPU count	Number of threads for parallel processing
`--summary-output`		String	`<output>_summary.csv`	Optional per-window/per-ensemble summary CSV
`--log`	`-l`	String	None	Log file path (optional)

Outputs

Window CSV: Each window annotated with status (OK, InsufficientReads, or LowDensity), read counts, detected ensemble count, primary ensemble occupancy, noise fraction, dual-signal fraction, window-averaged reactivities, a concise cluster summary string (including per-cluster confidence), and a GlobalEnsembles mapping column.
Summary CSV: Per window/per ensemble (including DBSCAN noise) details covering occupancy, cluster confidence, average stop/mutation counts per read, dual-signal fractions, and the associated global ensemble ID. Generated automatically unless --summary-output is provided.
Global Ensemble CSV: (<output>_global.csv) summarises each global ensemble with aggregated read counts, stop/mutation totals, dual-signal read counts, contributing window/cluster counts, unique position counts, and a global confidence score.
Global Per-base CSV: (<output>_global_per_base.csv) lists every position associated with each global ensemble alongside ensemble-specific read support, normalized reactivity values, and per-base confidence scores.

Progress Reporting

During execution, duet provides detailed statistics:

Dual-signal reads: Number and percentage of reads carrying both stop and mutation signals
High-confidence positions: Number and percentage of positions assigned high confidence (reactivity >= 0.7)
Average confidence: Mean position confidence score across all analyzed positions

These statistics help assess data quality and the reliability of ensemble detection.

Examples

# Sliding-window Duet analysis (100-nt windows, 50-nt step, 16 threads)
modtector duet \
  -i signal/03_norm/iso-1_rep-1_norm.csv \
  -b signal/00_bam/iso-1_rep-1.sort.bam \
  -f reference/PDL1.fa \
  -o signal/04_duet/iso-1_rep-1_windows.csv \
  -t 16

# Custom density and window configuration with explicit summary output
modtector duet \
  -i signal/03_norm/sample_norm.csv \
  -b signal/00_bam/sample.sort.bam \
  -f reference/transcriptome.fa \
  -o signal/04_duet/sample_windows.csv \
  --epsilon 0.9 \
  --min-samples 8 \
  --window-size 120 \
  --window-step 30 \
  --summary-output signal/04_duet/sample_windows_summary.csv

extract - Extract Gene Regions from Count Results

Extract count results for specific gene regions using GTF/GFF annotation files. This command allows you to filter count CSV files to include only positions within specified gene regions, such as 18S rRNA or other target genes.

Usage

modtector extract [OPTIONS] --input <INPUT> --gtf <GTF> --target-gene <TARGET_GENE> --output <OUTPUT>

Options

Option	Short	Type	Default	Description
`--input`	`-i`	String	Required	Count result CSV file from `count` command, or bam-readcount output file
`--gtf`	`-g`	String	Required	GTF/GFF annotation file
`--target-gene`	`-t`	String	Required	Target gene name or ID (e.g., “18S”, “RN18S1”, “rRNA”)
`--output`	`-o`	String	Required	Output file prefix (each gene region saved as: `{prefix}_{gene_name}_depth{avg_depth}.csv`)
`--input-format`		String	“csv”	Input format: “csv” (default) or “bam-readcount”
`--threads`		Integer	1	Number of parallel threads for processing multiple gene regions
`--relative-position`		Flag	false	Use relative position (1-based, relative to gene start) instead of absolute genomic position
`--log`	`-l`	String	None	Log file path (optional)

How It Works

Parse GTF file: Reads the GTF/GFF annotation file and identifies gene regions matching the target gene name or ID
Match gene regions: Searches for genes where gene_id or gene_name contains the target string
Extract positions: For each matched gene region, filters the count CSV to include only positions within that region
Calculate statistics: Computes average depth for each gene region
Output separate files: Creates a separate CSV file for each gene region with filename format: {prefix}_{gene_name}_depth{avg_depth}.csv

Gene Matching

The command matches genes by:

gene_id: Exact match or substring match
gene_name: Exact match or substring match

For example, searching for “18S” will match:

RN18S1 (gene_id)
18S_rRNA (gene_name)
Any gene containing “18S” in its ID or name

Examples

# Extract 18S rRNA gene regions from count results (CSV format)
# If multiple regions match, each will be saved as a separate file
modtector extract \
  -i benchmark/sc/count/single_cell_transcripts_in_hESC_NAIN3_RHE1581_b1.sort.csv \
  -g annotation.gtf \
  -t "18S" \
  -o 18S_rRNA \
  -l extract.log

# This will create files like:
# - 18S_rRNA_RN18S1_depth1234.5.csv
# - 18S_rRNA_RN18S2_depth987.2.csv
# (if multiple 18S genes are found)

# Extract from bam-readcount output format with multithreading
modtector extract \
  -i /data1/bioinfo/07_people/lxt/Modector/PRJNA946273_fixed_bamreadcount/in_vivo_single_cell_Mut_transfected_RNAs_in_HEK293T_DMSO_RHX672.sort.readcount.txt \
  -g annotation.gtf \
  -t "18S" \
  --input-format bam-readcount \
  --threads 8 \
  -o 18S_rRNA \
  -l extract.log

# Extract with multithreading (CSV format)
modtector extract \
  -i count_results.csv \
  -g annotation.gtf \
  -t "18S" \
  --threads 8 \
  -o 18S_output

# Extract with relative position (1-based, relative to gene start)
# Position will be converted from absolute genomic position to relative position
# Example: If gene starts at position 125931, absolute position 125931 becomes relative position 1
modtector extract \
  -i count_results.csv \
  -g annotation.gtf \
  -t "18S" \
  --relative-position \
  -o 18S_rRNA_relpos

# Extract specific gene by gene_id
modtector extract \
  -i count_results.csv \
  -g annotation.gtf \
  -t "RN18S1" \
  -o RN18S1 \
  -l extract.log

# Extract gene by partial name match (may match multiple genes)
modtector extract \
  -i count_results.csv \
  -g annotation.gtf \
  -t "rRNA" \
  -o rRNA_genes

Output Format

Each output CSV file maintains the same format as the input count CSV:

ChrID,pipe_truncation_Strand,pipe_truncation_ChrPos,rf_mutation_Base,rf_mutation_Count,pipe_truncation_count,depth,rf_mutation_ins,rf_mutation_del,base_A,base_C,base_G,base_T

Only rows matching the specific gene region are included in each output file.

Position Format:

Default (absolute position): pipe_truncation_ChrPos contains the absolute genomic position (e.g., 125931)
With --relative-position: pipe_truncation_ChrPos contains the relative position (1-based, relative to gene start). For example, if a gene starts at position 125931, the first position in the gene will be 1, the second will be 2, etc.

Example:

Gene region: NT_187388.1:125931-127799
Absolute position 125931 → Relative position 1 (with --relative-position)
Absolute position 125932 → Relative position 2 (with --relative-position)

File Naming

Output files are named using the pattern: {prefix}_{gene_name}_depth{avg_depth}.csv

{prefix}: The output prefix specified with -o
{gene_name}: The gene name from GTF (or gene_id if gene_name is not available)
{avg_depth}: The average depth calculated for that gene region (rounded to 1 decimal place)

Example filenames:

18S_rRNA_RN18S1_depth1234.5.csv
rRNA_genes_28S_rRNA_depth987.2.csv

Input Formats

CSV Format (default)

Standard CSV format from modtector count command with the following columns:

ChrID, pipe_truncation_Strand, pipe_truncation_ChrPos, rf_mutation_Base, rf_mutation_Count, pipe_truncation_count, depth, rf_mutation_ins, rf_mutation_del, base_A, base_C, base_G, base_T

bam-readcount Format

Tab-separated format from bam-readcount software:

Format: chr position reference_base depth base:count:avg_mapping_quality:...
Each base (A, C, G, T, N, =, +, -) has colon-separated statistics
Note: bam-readcount format does not include stop signals (pipe_truncation_count), so this value will be set to 0 in the output
Strand information is taken from the GTF file (not available in bam-readcount output)

Performance Optimization

Multithreading: Use --threads to enable parallel processing of multiple gene regions
- Recommended: 4-8 threads for optimal performance
- Automatically falls back to sequential processing for single region or single thread
- Provides 6-40× speedup depending on format and thread count
bam-readcount Format: File is read once into memory, then processed in parallel
- Significantly faster than sequential processing (40× speedup with 8 threads)
- Memory usage: Entire file loaded into memory (consider file size)

Notes

The GTF file should contain gene_id and/or gene_name attributes
When using bam-readcount format, stop signals are not available and will be set to 0
Both GTF and GFF3 formats are supported
Gene coordinates are 1-based (GTF standard)
Each matching gene region is saved to a separate file
If multiple genes match, each will have its own output file with its gene name and average depth in the filename
Gene names in filenames are sanitized (special characters like /, :, spaces are replaced with _)
Files with zero extracted rows are not created
Multithreading: Parallel processing is most effective when extracting multiple gene regions

convert - Format Conversion

Convert various input formats to modtector pileup CSV format. Supports multiple input formats including bamreadcount, rf-rctools, rf-norm, shapemapper2, samtools-mpileup, icSHAPE-rt, and bedgraph formats.

Usage

modtector convert [OPTIONS] --input <INPUT> --output <OUTPUT>

Options

Option	Short	Type	Default	Description
`--input`	`-i`	String	Required*	Input file (or use `--input-mutation`/`--input-stop` for dual input mode)
`--output`	`-o`	String	Required	Output pileup CSV file
`--format`	`-f`	String	Auto-detect	Input format: `bamreadcount`, `rf-rctools`, `rf-norm`, `rf-norm-xml`, `shapemapper2`, `shapemapper-profile`, `samtools-mpileup`, `icSHAPE-rt`, or `bedgraph`
`--strand`	`-s`	String	“+”	Strand orientation (‘+’ for forward, ‘-’ for reverse, or ‘None’ for all strands)
`--rf-count-mutations`		Flag	false	For rf-rctools format: if true, input was generated with `rf-count -m` (mutation mode)
`--ref-fasta`		String	None	Reference FASTA file path (required for shapemapper2, icSHAPE-rt, and bedgraph formats)
`--chr-name`		String	None	Chromosome/contig name (for formats with missing metadata)
`--input-mutation`		String	None	Optional mutation input file (rf-rctools format with `-m` flag)
`--input-stop`		String	None	Optional stop input file (rf-rctools format without `-m` flag)
`--filter-strand`		String	“+”	Filter by strand orientation (‘+’ or ‘-’ or ‘None’ to process all strands)
`--log`	`-l`	String	None	Log file path (optional)

*Note: In dual input mode (--input-mutation/--input-stop), --input can be empty.

Supported Input Formats

bamreadcount: Tab-separated format from bam-readcount software
rf-rctools: RNA Framework / rctools counting output (with optional --rf-count-mutations flag)
rf-norm: rf-norm CSV output (reactivity-level data)
rf-norm-xml: rf-norm XML output
shapemapper2: ShapeMapper2 profile files (requires --ref-fasta)
shapemapper-profile: ShapeMapper2 profile format with dynamic column detection
samtools-mpileup: Samtools mpileup output (10-column tab-separated format)
icSHAPE-rt: icSHAPE-pipe RT file format (requires --ref-fasta)
bedgraph: Bedgraph format (5-column: chr, start, end, coverage, strand; requires --ref-fasta)

Examples

# Basic usage (bamreadcount format, auto-detected)
modtector convert -i bamreadcount.txt -o pileup.csv

# With strand specification
modtector convert -i bamreadcount.txt -o pileup.csv -s +

# Convert rf-rctools format (mutation mode)
modtector convert -i mutation.tsv -o pileup.csv -f rf-rctools --rf-count-mutations

# Convert rf-rctools format (stop mode)
modtector convert -i stop.tsv -o pileup.csv -f rf-rctools

# Dual input mode: merge mutation and stop files
modtector convert --input-mutation mutation.tsv --input-stop stop.tsv -o pileup.csv -f rf-rctools

# Convert shapemapper2 format (requires reference)
modtector convert -i shapemapper.profile -o pileup.csv -f shapemapper-profile --ref-fasta reference.fa

# Convert icSHAPE RT format
modtector convert -i input.rt -o pileup.csv -f icSHAPE-rt --ref-fasta reference.fa --filter-strand "+"

# Convert bedgraph format
modtector convert -i input.bedgraph -o pileup.csv -f bedgraph --ref-fasta reference.fa --filter-strand "+"

# Convert samtools mpileup format
modtector convert -i mpileup.txt -o pileup.csv -f samtools-mpileup -s +

# With logging
modtector convert -i input.txt -o pileup.csv -s + -l convert.log

Input Format (bamreadcount)

Tab-separated format from bam-readcount software:

EC_16S	1	A	77	A:77:29.25:33.82:...	C:0:0.00:...	G:0:0.00:...	T:0:0.00:...

Fields:

Column 1: Chromosome/contig name
Column 2: Position (1-based)
Column 3: Reference base
Column 4: Total depth
Column 5+: Base counts in format base:count:quality:...

Output Format (modtector pileup)

CSV format compatible with modtector reactivity command:

ChrID,pipe_truncation_Strand,pipe_truncation_ChrPos,rf_mutation_Base,rf_mutation_Count,pipe_truncation_count,depth,rf_mutation_ins,rf_mutation_del,base_A,base_C,base_G,base_T
EC_16S,+,1,A,0,0,77,0,0,77,0,0,0

Conversion Details

Mutation Count: Calculated as sum of non-reference bases (A, C, G, T)
Stop Signals: pipe_truncation_count is set to 0 (not available in bamreadcount format)
Strand Information: Uses the specified strand parameter (default: ‘+’)
Insertions/Deletions: Extracted from bamreadcount format (+ for insertions, - for deletions)

Performance

Streaming Processing: Processes files line-by-line without loading entire file into memory
Large File Support: Tested with files up to 35GB
Processing Speed: ~1-2 million lines/second (depends on I/O)
Memory Usage: O(1) constant regardless of file size
NFS Optimization: Uses 8MB buffer for better network filesystem performance

Progress Reporting

The command reports progress every 1 million lines:

Total lines processed
Valid lines converted
Error count (invalid lines skipped)

Error Handling

Invalid lines are skipped and error count is tracked
First 10 errors are logged for debugging
Processing continues despite errors
Final statistics show total, valid, and error counts

Use Cases

Workflow Integration: Convert bamreadcount output for use with modtector reactivity command
Format Standardization: Standardize bamreadcount data to modtector pileup format
Pipeline Compatibility: Enable bamreadcount-based workflows to use modtector reactivity features

Notes

Output file must have .csv extension
Strand parameter must be either ‘+’ or ‘-’ (or ‘None’ for all strands)
Input file must exist and be readable
The conversion preserves all position and depth information
Base counts (A, C, G, T) are extracted from input format
Insertions and deletions are included if present in input format
Format auto-detection works by checking file extension, headers, and data patterns
NaN values are preserved in output for formats that support them (rf-norm)

correct - PCR Bias Correction

Apply PCR bias correction to pileup CSV files using Chi-Square distribution fitting. This command corrects for PCR amplification bias where high depths cause mutation rate dilution, similar to shapemapper2’s effective_depth calculation.

Usage

modtector correct [OPTIONS] --input <INPUT> --output <OUTPUT>

Options

Option	Short	Type	Default	Description
`--input`	`-i`	String	Required	Input pileup CSV file
`--output`	`-o`	String	Required	Output pileup CSV file (with corrected effective_depth)
`--weight-increase`		Float	1.0	Weight for increasing depth correction (used when depth is too low)
`--weight-decrease`		Float	0.5	Weight for decreasing depth correction (used when depth is too high)
`--species`		String	None	Species name for filtering (optional)

How It Works

Data Binning: Groups depth and mutation rate data into bins
Chi-Square Fitting: Fits a Chi-Square distribution (df=2) to the depth-mutation rate relationship
Correction Factor Calculation: Computes correction factors based on the fitted distribution
Effective Depth Calculation: Applies correction factors to adjust effective_depth values
Output: Writes corrected pileup CSV with updated effective_depth column

Method Details

The correction uses a Chi-Square distribution model:

Distribution: Chi-Square with 2 degrees of freedom
Fitting: Brent’s method for 1D minimization to find optimal scale parameter
Correction: Applies weights for increasing/decreasing depth based on fitted model

Examples

# Basic usage
modtector correct -i pileup.csv -o pileup_corrected.csv

# With custom weights
modtector correct -i pileup.csv -o pileup_corrected.csv \
    --weight-increase 1.2 \
    --weight-decrease 0.4

# With species filtering
modtector correct -i pileup.csv -o pileup_corrected.csv \
    --species "human"

Output Format

The output CSV maintains the same format as input, with the effective_depth column updated:

Original depth column: Unchanged (total read depth)
effective_depth column: Corrected depth based on PCR bias model

Use Cases

Quality Filtering Enhancement: Combine with --min-base-qual in count command for comprehensive depth correction
High-Depth Data: Correct for mutation rate dilution in high-coverage regions
Workflow Integration: Optional step in analysis pipeline (can be disabled in default workflow)

Notes

This is an optional advanced feature
Can be combined with quality filtering (--min-base-qual) for best results
Correction factors are calculated per-species if --species is specified
The method is based on Chi-Square distribution fitting, similar to shapemapper2’s approach

Global Options

All commands support these global options:

Option	Short	Description
`--help`	`-h`	Print help information
`--version`	`-V`	Print version information

Output Files

Common Output Formats

CSV Files: Comma-separated values with headers
SVG Files: Scalable vector graphics for plots
Text Files: Plain text for evaluation results
Log Files: Processing logs and error messages

File Naming Conventions

Input files: *.bam, *.fa, *.csv
Output files: *_count.csv, *_norm.csv, *_reactivity.csv
Plots: *_distribution.svg, *_reactivity.svg, *_roc.svg
Evaluation: *_evaluation.txt, *_comprehensive.txt

Best Practices

Use appropriate thread counts: Match -t to your CPU cores
Monitor memory usage: Large datasets may require significant RAM
Check log files: Always review logs for warnings and errors
Validate inputs: Ensure BAM files are properly aligned
Use consistent parameters: Keep parameters consistent across runs
Backup results: Save intermediate results for reproducibility