User Guide
This comprehensive guide covers all aspects of using modtector, from basic concepts to advanced features.
Understanding RNA Modifications
What are RNA Modifications?
RNA modifications are chemical alterations to RNA nucleotides that can affect:
RNA structure and stability
Protein-RNA interactions
Translation efficiency
RNA localization
Gene expression regulation
Common RNA Modifications
m6A: N6-methyladenosine (the most abundant internal modification in eukaryotic mRNA)
m1A: N1-methyladenosine
m5C: 5-methylcytosine
Ψ: Pseudouridine
D: Dihydrouridine
Detection Methods
modtector supports detection based on:
Stop signals: Premature termination (truncation) of reverse transcription
Mutation signals: Base mutations during reverse transcription
modtector Workflow
Overview
modtector follows a systematic workflow:
Raw BAM Files → Pileup Analysis → Normalization → Reactivity Calculation → Evaluation
      ↓                ↓                ↓                   ↓                  ↓
  Alignment      Signal Counts    Filtered Data    Modification Scores     Accuracy
Step-by-Step Process
Data Input: BAM files, reference sequences, structure files
Pileup Analysis: Count stop and mutation signals
Normalization: Filter noise and outliers
Reactivity Calculation: Compare modified vs unmodified samples
Visualization: Generate plots and charts
Evaluation: Assess accuracy using known structures
Data Preparation
Input Requirements
BAM Files
Format: BAM format with proper alignment
Quality: High-quality alignments with minimal mismatches
Coverage: Sufficient read depth (recommended >50x)
Paired samples: Modified and unmodified samples
Reference Sequences
Format: FASTA format
Quality: High-quality reference sequences
Matching: Must match your BAM file alignments
Completeness: Include all target regions
Secondary Structure Files
Format: Dot-bracket notation (.dp files)
Source: Experimentally determined or predicted structures
Accuracy: High-confidence structures for evaluation
Coverage: Cover all regions of interest
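Dot-bracket notation marks each base as unpaired (`.`) or as one side of a base pair (`(` and `)`). As a minimal sketch of how such a string can be turned into per-position pairing information, the hypothetical helper `pair_table` below matches brackets with a stack. It is an illustration in Python, not modtector's own parser, and it assumes plain dot-bracket strings without pseudoknot symbols such as `[` and `]`.

```python
from typing import List


def pair_table(dot_bracket: str) -> List[int]:
    """Return, for each position, the 0-based index of its partner, or -1 if unpaired."""
    partners = [-1] * len(dot_bracket)
    stack = []  # indices of '(' still waiting for a matching ')'
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            partners[i], partners[j] = j, i
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return partners


if __name__ == "__main__":
    table = pair_table("((..))")
    print(table)                     # partner index per position
    print([p == -1 for p in table])  # True where the base is unpaired
```

A structure file that fails this kind of check (unbalanced brackets, unexpected symbols) is worth fixing before it is used for evaluation.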
Data Quality Checks
Before running modtector, verify:
BAM File Quality:
samtools flagstat sample.bam
samtools view -c sample.bam
Reference Sequence:
samtools faidx reference.fa
Alignment Quality:
samtools view sample.bam | head -100
BAM Index Files (Required for count command):
# Check if index exists
ls -lh sample.bam.bai
# Create index if missing
samtools index -b sample.bam -o sample.bam.bai -@ 8
Batch Processing and Single-cell Mode
Batch Mode (--batch)
Batch mode allows you to process multiple BAM files sequentially, each file independently.
Use Cases:
Processing multiple samples that need separate analysis
Bulk RNA-seq data with multiple replicates
When each file should be treated as an independent sample
Example:
modtector count --batch \
-b "/path/to/data/*sort.bam" \
-f reference.fa \
-o output_dir/ \
-t 8 \
-w 10000
Requirements:
BAM files must be sorted and indexed (.bai files must exist)
Glob pattern must match at least one file
Output path must be a directory
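The requirements above can be checked before a long batch run starts. The following is a minimal sketch in Python; `preflight` is a hypothetical helper, not part of modtector. It assumes an index sits next to its BAM file as either `sample.bam.bai` or `sample.bai`, and it does not verify that the BAM files are sorted, which would require reading the BAM header.

```python
import glob
import os
from typing import List


def preflight(bam_glob: str, output_dir: str) -> List[str]:
    """Return the BAM files matched by bam_glob, or raise if a requirement is not met."""
    bam_files = sorted(glob.glob(bam_glob))
    if not bam_files:
        raise FileNotFoundError(f"glob pattern matched no files: {bam_glob}")
    # Assumption: the index is named <file>.bam.bai or <file>.bai.
    missing = [
        bam for bam in bam_files
        if not (os.path.exists(bam + ".bai")
                or os.path.exists(os.path.splitext(bam)[0] + ".bai"))
    ]
    if missing:
        raise FileNotFoundError("missing .bai index for: " + ", ".join(missing))
    # The output path may not exist yet, but it must not be a regular file.
    if os.path.exists(output_dir) and not os.path.isdir(output_dir):
        raise NotADirectoryError(f"output path is not a directory: {output_dir}")
    return bam_files
```

Calling `preflight("/path/to/data/*sort.bam", "output_dir/")` would then either return the list of files to process or fail early with a message naming the problem.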
Single-cell Unified Mode (--single-cell)
Single-cell unified mode is optimized for single-cell RNA-seq data and provides a 2-3x performance improvement through unified processing.
Key Features:
Unified data distribution scanning (once instead of per-file)
Automatic cell label extraction from filenames
Cross-file parallel processing
Reduced I/O overhead
Cell Label Extraction:
Supports RHX pattern: *RHX672.sort.bam → RHX672
Falls back to last underscore-separated part: sample_cell123.bam → cell123
Final fallback: base filename without extension
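The three-step fallback described above can be sketched as follows. This is an illustration in Python, not modtector's implementation: the regular expression for the RHX pattern and the treatment of `.sort.bam` as the extension are assumptions, and `extract_cell_label` is a hypothetical name.

```python
import os
import re


def extract_cell_label(path: str) -> str:
    """Derive a cell label from a BAM filename using the three documented fallbacks."""
    name = os.path.basename(path)
    # Assumption: ".bam" and an optional preceding ".sort" count as the extension.
    stem = re.sub(r"(\.sort)?\.bam$", "", name)
    # 1. RHX pattern (the exact expression modtector uses is assumed here).
    match = re.search(r"RHX\d+", stem)
    if match:
        return match.group(0)
    # 2. Last underscore-separated part.
    if "_" in stem:
        return stem.rsplit("_", 1)[1]
    # 3. Base filename without extension.
    return stem


if __name__ == "__main__":
    for example in ["plate1_RHX672.sort.bam", "sample_cell123.bam", "control.bam"]:
        print(example, "->", extract_cell_label(example))
```

Running a check like this over the input filenames beforehand makes it easy to spot two files that would map to the same label.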
Example:
modtector count --single-cell \
-b "/path/to/single_cell/*sort.bam" \
-f reference.fa \
-o output_dir/ \
-t 8 \
-w 10000 \
-l batch.log
Output:
Each cell generates a separate CSV file: RHX672.csv, RHX673.csv, etc.
Log files are also generated per cell
Performance Comparison:
Batch mode: ~N × single_file_time (sequential processing)
Single-cell unified mode: ~(N × single_file_time) / 2 to (N × single_file_time) / 3 (unified processing)
Signal Types
Stop Signals
Stop signals occur when reverse transcription is truncated at modification sites.
Characteristics
Detection: Reverse transcription truncation events
Sensitivity: High sensitivity for certain modifications
Specificity: Good specificity with proper controls
Background: Low background noise
Analysis
modtector count -b sample.bam -f reference.fa -o output.csv
Mutation Signals
Mutation signals occur when reverse transcription introduces errors at modification sites.
Characteristics
Detection: Base mutation events
Sensitivity: Moderate sensitivity
Specificity: Good specificity with proper controls
Background: Moderate background noise
Analysis
modtector count -b sample.bam -f reference.fa -o output.csv
Combined Analysis
modtector can analyze both signal types simultaneously:
modtector reactivity -M mod.csv -U unmod.csv -O reactivity.csv -t both
Normalization Methods
Purpose
Normalization removes systematic biases and noise from the data:
Technical noise: Sequencing artifacts
Biological noise: Background signal
Systematic bias: Sample preparation effects
Available Methods
1. Percentile28 Normalization
Method: 28th percentile scaling
Use case: General purpose normalization
Advantages: Robust to outliers
Disadvantages: May be conservative
modtector norm -i input.csv -o output.csv -m percentile28
2. Winsor90 Normalization
Method: 90th percentile winsorization
Use case: High-quality data
Advantages: Preserves signal distribution
Disadvantages: Sensitive to outliers
modtector norm -i input.csv -o output.csv -m winsor90
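To make the idea of percentile winsorization concrete, here is a minimal sketch in Python. It assumes that values above the 90th percentile are capped at that percentile and that all values are then divided by the cap; whether modtector also treats the lower tail, or scales differently, is not stated in this guide. `percentile` and `winsorize_upper` are hypothetical helpers.

```python
from typing import List


def percentile(values: List[float], q: float) -> float:
    """Percentile with linear interpolation between closest ranks (q in 0..100)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def winsorize_upper(values: List[float], q: float = 90.0) -> List[float]:
    """Cap values above the q-th percentile, then scale so the cap becomes 1.0."""
    cap = percentile(values, q)
    if cap <= 0:
        # Nothing to scale by; treat every position as having no signal.
        return [0.0 for _ in values]
    return [min(v, cap) / cap for v in values]


if __name__ == "__main__":
    signal = [float(v) for v in range(11)]  # 0.0, 1.0, ..., 10.0
    print(winsorize_upper(signal))
```

In this sketch the largest values all collapse to 1.0, so a single extreme position can no longer stretch the scale for the rest of the transcript.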
3. Boxplot Normalization
Method: Boxplot-based outlier removal
Use case: Data with many outliers
Advantages: Effective outlier removal
Disadvantages: May remove valid signals
modtector norm -i input.csv -o output.csv -m boxplot
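Boxplot-based outlier removal is commonly defined through the interquartile range (IQR): values above Q3 + 1.5 × IQR are treated as outliers. The sketch below, in Python, shows that rule only; it assumes the conventional 1.5 whisker factor and simple linear-interpolation quartiles, and it is not taken from modtector's source. `quartiles`, `upper_fence`, and `remove_outliers` are hypothetical names.

```python
from typing import List, Tuple


def quartiles(values: List[float]) -> Tuple[float, float]:
    """Return (Q1, Q3) using linear interpolation between closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")

    def at(fraction: float) -> float:
        rank = (len(ordered) - 1) * fraction
        lower = int(rank)
        upper = min(lower + 1, len(ordered) - 1)
        return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

    return at(0.25), at(0.75)


def upper_fence(values: List[float], whisker: float = 1.5) -> float:
    """Upper boxplot fence: Q3 + whisker * IQR."""
    q1, q3 = quartiles(values)
    return q3 + whisker * (q3 - q1)


def remove_outliers(values: List[float], whisker: float = 1.5) -> List[float]:
    """Keep only the values at or below the upper fence."""
    fence = upper_fence(values, whisker)
    return [v for v in values if v <= fence]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
    print(upper_fence(signal))
    print(remove_outliers(signal))
```

The listed disadvantage follows directly from the rule: a genuinely strong signal that lies above the fence is removed along with the noise.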
Window-Based Normalization
Fixed Windows
modtector norm -i input.csv -o output.csv -m winsor90 --window 1000
Dynamic Windows
modtector norm -i input.csv -o output.csv -m winsor90 --dynamic
Sliding Windows
modtector norm -i input.csv -o output.csv -m winsor90 --window 500 --window-offset 100
Reactivity Calculation
Purpose
Reactivity calculation quantifies the difference between modified and unmodified samples to identify modification sites.
Calculation Methods
Stop Signal Methods
Current Method:
modtector reactivity -M mod.csv -U unmod.csv -O output.csv -s current
Ding Method:
modtector reactivity -M mod.csv -U unmod.csv -O output.csv -s ding
Rouskin Method:
modtector reactivity -M mod.csv -U unmod.csv -O output.csv -s rouskin
Mutation Signal Methods
Current Method:
modtector reactivity -M mod.csv -U unmod.csv -O output.csv -m current
Siegfried Method:
modtector reactivity -M mod.csv -U unmod.csv -O output.csv -m siegfried
Zubradt Method:
modtector reactivity -M mod.csv -U unmod.csv -O output.csv -m zubradt
Parameters
Pseudocount
Purpose: Avoid zero values in logarithmic calculations
Default: 1
Range: 0.1 - 10
Recommendation: Use 0.5 for sparse data
modtector reactivity -M mod.csv -U unmod.csv -O output.csv --pseudocount 0.5
Maximum Score
Purpose: Limit upper bound of reactivity values
Default: 10
Range: 5 - 50
Recommendation: Use 5 for conservative analysis
modtector reactivity -M mod.csv -U unmod.csv -O output.csv --maxscore 5
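The roles of the pseudocount and the maximum score can be illustrated with a generic log-ratio score. This is a sketch under stated assumptions, in Python: the formula below (a log2 ratio of modified to unmodified signal, clamped to the range 0 to maxscore) is chosen for illustration and is not the formula of the current, Ding, Rouskin, Siegfried, or Zubradt methods, which this guide does not spell out. `capped_log_ratio` is a hypothetical name.

```python
import math


def capped_log_ratio(mod_signal: float, unmod_signal: float,
                     pseudocount: float = 1.0, maxscore: float = 10.0) -> float:
    """Illustrative score: log2 ratio with a pseudocount, clamped to [0, maxscore]."""
    if pseudocount <= 0:
        raise ValueError("pseudocount must be positive to avoid log(0) and division by zero")
    # The pseudocount keeps both numerator and denominator strictly positive.
    ratio = (mod_signal + pseudocount) / (unmod_signal + pseudocount)
    # Negative values (more signal in the control) are floored at 0,
    # and very large values are capped at maxscore.
    return min(maxscore, max(0.0, math.log2(ratio)))


if __name__ == "__main__":
    print(capped_log_ratio(3, 0))                 # signal only in the modified sample
    print(capped_log_ratio(0, 0))                 # no signal in either sample
    print(capped_log_ratio(10**6, 0, maxscore=5)) # extreme ratio hits the cap
```

In this toy formula a smaller pseudocount makes low-count positions more sensitive to small differences, while a lower maximum score compresses the strongest sites.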
Evaluation Metrics
Purpose
Evaluation metrics assess the accuracy of modification detection using known secondary structures.
Available Metrics
1. Area Under the Curve (AUC)
Range: 0 - 1
Interpretation:
> 0.8: Excellent
0.7 - 0.8: Good
0.6 - 0.7: Fair
< 0.6: Poor
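AUC has a direct probabilistic reading: it is the probability that a randomly chosen positive position receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below, in Python, computes it that way. Which positions count as positive is an assumption left to the caller (for example, unpaired bases taken from the structure file); `roc_auc` is a hypothetical helper, not modtector's evaluation code.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative position")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    reactivity = [0.9, 0.2, 0.8, 0.1]
    is_positive = [True, True, False, False]
    print(roc_auc(reactivity, is_positive))
```

The pairwise loop is quadratic in the number of positions, which is fine for a single transcript; a rank-based formulation would be preferable for very long sequences.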
2. F1-Score
Range: 0 - 1
Interpretation:
> 0.8: Excellent
0.6 - 0.8: Good
0.4 - 0.6: Fair
< 0.4: Poor
3. Accuracy
Range: 0 - 1
Interpretation:
> 0.9: Excellent
0.8 - 0.9: Good
0.7 - 0.8: Fair
< 0.7: Poor
4. Sensitivity (Recall)
Range: 0 - 1
Interpretation: Proportion of true modifications detected
5. Specificity
Range: 0 - 1
Interpretation: Proportion of true non-modifications correctly identified
6. Positive Predictive Value (PPV)
Range: 0 - 1
Interpretation: Proportion of predicted modifications that are true
7. Negative Predictive Value (NPV)
Range: 0 - 1
Interpretation: Proportion of predicted non-modifications that are true
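Metrics 2 through 7 all derive from the four cells of a confusion matrix: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). As a short sketch in Python using the standard definitions, the hypothetical helper `classification_metrics` computes them from those counts; how modtector thresholds reactivity to obtain the counts is not covered here.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute the standard confusion-matrix metrics; undefined ratios become NaN."""

    def ratio(numerator: float, denominator: float) -> float:
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),            # recall
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),                    # precision
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),        # harmonic mean of PPV and sensitivity
    }


if __name__ == "__main__":
    for name, value in classification_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.3f}")
```

Because accuracy mixes both classes, it can look good on a transcript where one class dominates; sensitivity, specificity, and F1 are worth reading alongside it.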
Evaluation Process
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id
Output Files
Comprehensive Results:
*_comprehensive.txtROC Curves:
*_roc.svgPR Curves:
*_pr.svgCombined Plots:
*_combined_roc.svg
Visualization
Purpose
Visualization helps interpret results and identify patterns in the data.
Available Plots
1. Signal Distribution Plots
Purpose: Show signal distribution across positions
Format: SVG
Content: Stop and mutation signals
2. Reactivity Plots
Purpose: Display reactivity scores
Format: SVG
Content: Modification sites and scores
3. ROC Curves
Purpose: Show classification performance
Format: SVG
Content: True positive rate vs false positive rate
4. PR Curves
Purpose: Show precision-recall performance
Format: SVG
Content: Precision vs recall
5. Comparison Plots
Purpose: Compare different samples or methods
Format: SVG
Content: Side-by-side comparisons
6. RNA Structure SVG Plots
Purpose: Visualize reactivity data on RNA secondary structure
Format: SVG
Content: Colored circles mapped to structure positions
Features:
Multi-signal support (stop, mutation, etc.)
Strand selection (+, -, both)
Base filtering (A, T, C, G)
Alignment support
Color-coded reactivity scores
7. Interactive HTML Visualizations
Purpose: Interactive web-based visualization of reactivity data on RNA structure
Format: HTML (with embedded SVG and JavaScript)
Content: Interactive RNA structure with reactivity data overlay
Features:
Hover Tooltips: Display position, base type, and reactivity values on hover
Zoom and Pan: Mouse wheel zoom and click-drag panning
Filtering Controls:
Reactivity threshold filtering (min/max sliders)
Base type filtering (A, T, C, G, or all)
Highlighting: Click circles to highlight specific positions
Export: Export the current view as SVG
Reset View: Reset zoom and filters to default
Usage: Add the --interactive flag to the plot command to generate HTML instead of static SVG
SVG Plotting
Overview
RNA structure SVG plots provide an intuitive way to visualize reactivity data directly on the RNA secondary structure. This helps researchers identify modification sites and understand the relationship between structure and reactivity.
SVG-Only Mode
When you only need SVG plots without regular distribution plots, use the simplified command:
modtector plot \
-o svg_output/ \
--svg-template rna_structure.svg \
--reactivity reactivity_data.csv
Multi-Signal SVG Plotting
For data with multiple signal types (stop, mutation), plot all signals:
modtector plot \
-o svg_output/ \
--svg-template rna_structure.svg \
--reactivity reactivity_data.csv \
--svg-bases ATGC \
--svg-signal all \
--svg-strand +
Interactive HTML Visualization
Generate interactive HTML visualizations with zoom, pan, filtering, and tooltip features:
modtector plot \
-o interactive_output/ \
--svg-template rna_structure.svg \
--reactivity reactivity_data.csv \
--interactive \
--svg-signal stop \
--svg-bases ACGT
The interactive HTML file can be opened in any modern web browser and provides:
Mouse wheel zoom: Scroll to zoom in/out
Click and drag: Pan around the structure
Hover tooltips: See position, base, and reactivity values
Filtering: Adjust reactivity thresholds and filter by base type
Highlighting: Click circles to highlight specific positions
Export: Download the current view as SVG
Single Signal SVG Plotting
For single signal data, specify the signal type:
modtector plot \
-o svg_output/ \
--svg-template rna_structure.svg \
--reactivity reactivity_data.csv \
--svg-bases AC \
--svg-signal stop \
--svg-strand +
SVG Plotting with Alignment
When sequence alignment is needed:
modtector plot \
-o svg_output/ \
--svg-template rna_structure.svg \
--reactivity reactivity_data.csv \
--svg-bases ATGC \
--svg-signal all \
--svg-strand + \
--svg-ref reference.fa \
--svg-max-shift 10
SVG Output Files
Multi-signal:
rna_structure_colored_[signal_type].svgSingle-signal:
rna_structure_colored_score.svg
Plotting Options
Basic Plotting
modtector plot -M mod.csv -U unmod.csv -o plots/
With Reactivity Data
modtector plot -M mod.csv -U unmod.csv -o plots/ -r reactivity.csv
Custom Thresholds
modtector plot -M mod.csv -U unmod.csv -o plots/ -c 0.3 -d 100
With Genome Annotation
modtector plot -M mod.csv -U unmod.csv -o plots/ -g annotation.gff
Advanced Features
Multi-threading
modtector supports parallel processing for improved performance:
# Use 8 threads
modtector count -b sample.bam -f reference.fa -o output.csv -t 8
modtector reactivity -M mod.csv -U unmod.csv -O output.csv -t 8
modtector plot -M mod.csv -U unmod.csv -o plots/ -t 8
Window Analysis
Fixed Windows
modtector count -b sample.bam -f reference.fa -o output.csv -w 1000
Dynamic Windows
modtector norm -i input.csv -o output.csv -m winsor90 --dynamic
Base-Specific Analysis
Target specific bases for analysis:
# Analyze only A and C bases
modtector norm -i input.csv -o output.csv -m winsor90 --bases AC
# Analyze only G and T bases
modtector norm -i input.csv -o output.csv -m winsor90 --bases GT
Statistical Testing
Compare samples using different statistical tests:
# Student's t-test
modtector compare -M mod.csv -U unmod.csv -o comparison.csv -t t-test
# Mann-Whitney U test
modtector compare -M mod.csv -U unmod.csv -o comparison.csv -t mann-whitney
# Wilcoxon signed-rank test
modtector compare -M mod.csv -U unmod.csv -o comparison.csv -t wilcoxon
Auto-shift Correction
Automatically correct for sequence length differences:
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id --auto-shift
Base Matching
Use intelligent base matching for T/U equivalence:
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id --base-matching
Best Practices
Data Quality
High-quality alignments: Ensure BAM files are properly aligned
Sufficient coverage: Use at least 50x coverage
Proper controls: Include unmodified control samples
Quality filtering: Remove low-quality reads and positions
Parameter Selection
Normalization method: Choose based on data characteristics
Window size: Balance between noise reduction and signal preservation
Thresholds: Adjust based on expected signal levels
Statistical tests: Select appropriate test for your data
Performance Optimization
Thread count: Match to available CPU cores
Memory usage: Monitor RAM usage for large datasets
Disk space: Ensure sufficient storage for output files
Batch processing: Use --batch to process multiple samples sequentially in one run, or --single-cell for cross-file parallel processing
Result Interpretation
Check evaluation metrics: Ensure good performance scores
Validate with known sites: Compare with literature
Consider biological context: Interpret results in context
Reproducibility: Document parameters and methods
Troubleshooting
Low coverage: Increase sequencing depth
Poor normalization: Try different methods
Low evaluation scores: Check data quality
Memory issues: Reduce thread count or dataset size
Documentation
Record parameters: Keep track of all settings
Version control: Use version control for reproducibility
Log files: Save and review log files
Backup results: Keep copies of important results
Common Workflows
Basic Workflow
# 1. Generate pileup data
modtector count -b mod.bam -f reference.fa -o mod_count.csv -t 4
modtector count -b unmod.bam -f reference.fa -o unmod_count.csv -t 4
# 2. Normalize signals
modtector norm -i mod_count.csv -o mod_norm.csv -m winsor90 --bases AC
modtector norm -i unmod_count.csv -o unmod_norm.csv -m winsor90 --bases AC
# 3. Calculate reactivity
modtector reactivity -M mod_norm.csv -U unmod_norm.csv -O reactivity.csv -t 4
# 4. Generate plots
modtector plot -M mod_norm.csv -U unmod_norm.csv -o plots/ -r reactivity.csv -t 4
# 5. Evaluate accuracy
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id
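For repeated runs it can help to keep the five steps of the Basic Workflow in one script so that the exact parameters are recorded. The following is a minimal sketch in Python that builds the same commands shown above as argument lists and, by default, only prints them. `basic_workflow` and `run` are hypothetical helpers, and the intermediate filenames are those used in the example above.

```python
import shlex
import subprocess
from typing import List


def basic_workflow(mod_bam: str, unmod_bam: str, reference: str,
                   structure: str, gene_id: str, threads: int = 4) -> List[List[str]]:
    """Build the Basic Workflow commands as argument lists (nothing is executed)."""
    t = str(threads)
    return [
        # 1. Generate pileup data
        ["modtector", "count", "-b", mod_bam, "-f", reference, "-o", "mod_count.csv", "-t", t],
        ["modtector", "count", "-b", unmod_bam, "-f", reference, "-o", "unmod_count.csv", "-t", t],
        # 2. Normalize signals
        ["modtector", "norm", "-i", "mod_count.csv", "-o", "mod_norm.csv", "-m", "winsor90", "--bases", "AC"],
        ["modtector", "norm", "-i", "unmod_count.csv", "-o", "unmod_norm.csv", "-m", "winsor90", "--bases", "AC"],
        # 3. Calculate reactivity
        ["modtector", "reactivity", "-M", "mod_norm.csv", "-U", "unmod_norm.csv",
         "-O", "reactivity.csv", "-t", t],
        # 4. Generate plots
        ["modtector", "plot", "-M", "mod_norm.csv", "-U", "unmod_norm.csv",
         "-o", "plots/", "-r", "reactivity.csv", "-t", t],
        # 5. Evaluate accuracy
        ["modtector", "evaluate", "-r", "reactivity.csv", "-s", structure,
         "-o", "results/", "-g", gene_id],
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False, stopping on the first failure."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run(basic_workflow("mod.bam", "unmod.bam", "reference.fa", "structure.dp", "gene_id"))
```

Passing `dry_run=False` executes the steps in order; `check=True` makes the script stop at the first command that exits with an error, so later steps never run on incomplete input.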
Advanced Workflow
# 1. Generate pileup data with windowing
modtector count -b mod.bam -f reference.fa -o mod_count.csv -w 1000 -t 8
modtector count -b unmod.bam -f reference.fa -o unmod_count.csv -w 1000 -t 8
# 2. Normalize with dynamic windows
modtector norm -i mod_count.csv -o mod_norm.csv -m winsor90 --dynamic --bases AC
modtector norm -i unmod_count.csv -o unmod_norm.csv -m winsor90 --dynamic --bases AC
# 3. Calculate reactivity with custom parameters
modtector reactivity -M mod_norm.csv -U unmod_norm.csv -O reactivity.csv \
-s ding -m siegfried --pseudocount 0.5 --maxscore 5 -t 8
# 4. Compare samples
modtector compare -M mod_norm.csv -U unmod_norm.csv -o comparison.csv \
-t mann-whitney -d 20 -f 1.5
# 5. Generate comprehensive plots
modtector plot -M mod_norm.csv -U unmod_norm.csv -o plots/ \
-r reactivity.csv -c 0.3 -d 100 -t 8
# 6. Evaluate with auto-shift correction
modtector evaluate -r reactivity.csv -s structure.dp -o results/ \
-g gene_id --auto-shift --base-matching
This user guide provides comprehensive information for using modtector effectively. For specific command details, refer to the Command Reference.