User Guide

This guide covers how to use modtector, from basic concepts to advanced features.

Table of Contents

Understanding RNA Modifications
modtector Workflow
Data Preparation
Signal Types
Normalization Methods
Reactivity Calculation
Evaluation Metrics
Visualization
Advanced Features
Best Practices

Understanding RNA Modifications

What are RNA Modifications?

RNA modifications are chemical alterations to RNA nucleotides that can affect:

RNA structure and stability
Protein-RNA interactions
Translation efficiency
RNA localization
Gene expression regulation

Common RNA Modifications

m6A: N6-methyladenosine (most common)
m1A: N1-methyladenosine
m5C: 5-methylcytosine
Ψ: Pseudouridine
D: Dihydrouridine

Detection Methods

modtector supports detection based on:

Stop signals: Premature termination (truncation) of reverse transcription
Mutation signals: Base mutations during reverse transcription

modtector Workflow

Overview

modtector follows a systematic workflow:

Raw BAM Files → Pileup Analysis → Normalization → Reactivity Calculation → Evaluation
     ↓              ↓                ↓                    ↓                ↓
  Alignment      Signal Counts    Filtered Data    Modification Scores   Accuracy

Step-by-Step Process

Data Input: BAM files, reference sequences, structure files
Pileup Analysis: Count stop and mutation signals
Normalization: Filter noise and outliers
Reactivity Calculation: Compare modified vs unmodified samples
Visualization: Generate plots and charts
Evaluation: Assess accuracy using known structures

Data Preparation

Input Requirements

BAM Files

Format: BAM format with proper alignment
Quality: High-quality alignments with minimal mismatches
Coverage: Sufficient read depth (recommended >50x)
Paired samples: Modified and unmodified samples

Reference Sequences

Format: FASTA format
Quality: High-quality reference sequences
Matching: Must match your BAM file alignments
Completeness: Include all target regions

Secondary Structure Files

Format: Dot-bracket notation (.dp files)
Source: Experimentally determined or predicted structures
Accuracy: High-confidence structures for evaluation
Coverage: Cover all regions of interest
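
As a rough illustration of what dot-bracket notation encodes, the Python sketch below pairs each opening bracket with its closing partner and lists the unpaired positions. It handles only `(`, `)`, and `.` in a bare structure string. The exact layout of modtector's .dp files (header or sequence lines, extended bracket types) is not described in this guide, and the function name `parse_dot_bracket` is hypothetical.

```python
def parse_dot_bracket(structure: str) -> list[int]:
    """Return the 0-based partner index for each position, or -1 if unpaired."""
    partners = [-1] * len(structure)
    stack = []  # indices of '(' still waiting for a matching ')'
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            partners[i], partners[j] = j, i
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return partners


partners = parse_dot_bracket("((..))")
unpaired = [i for i, partner in enumerate(partners) if partner == -1]
```

A quick check like this before evaluation can catch unbalanced brackets, which would otherwise distort any paired/unpaired labeling.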

Data Quality Checks

Before running modtector, verify:

BAM File Quality:

```
samtools flagstat sample.bam
samtools view -c sample.bam
```

Reference Sequence:

```
samtools faidx reference.fa
```

Alignment Quality:

```
samtools view sample.bam | head -100
```

BAM Index Files (Required for count command):

```
# Check if index exists
ls -lh sample.bam.bai

# Create index if missing
samtools index -b sample.bam -o sample.bam.bai -@ 8
```

Batch Processing and Single-cell Mode

Batch Mode (`--batch`)

Batch mode allows you to process multiple BAM files sequentially, each file independently.

Use Cases:

Processing multiple samples that need separate analysis
Bulk RNA-seq data with multiple replicates
When each file should be treated as an independent sample

Example:

modtector count --batch \
  -b "/path/to/data/*sort.bam" \
  -f reference.fa \
  -o output_dir/ \
  -t 8 \
  -w 10000

Requirements:

BAM files must be sorted and indexed (.bai files must exist)
Glob pattern must match at least one file
Output path must be a directory
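
The requirements above can be checked before starting a long batch run. The following Python sketch is one possible pre-flight check, not part of modtector. It assumes each index sits next to its BAM file as `<name>.bam.bai` (the naming used in the index example in this guide), and `find_unindexed_bams` is a hypothetical helper name.

```python
import glob
import os


def find_unindexed_bams(pattern: str) -> list[str]:
    """Return BAM paths matching `pattern` that have no `<path>.bai` file beside them."""
    bam_paths = sorted(glob.glob(pattern))
    if not bam_paths:
        # Mirrors the requirement that the glob pattern matches at least one file.
        raise FileNotFoundError(f"glob pattern matched no files: {pattern}")
    return [path for path in bam_paths if not os.path.exists(path + ".bai")]
```

For example, `find_unindexed_bams("/path/to/data/*sort.bam")` returns the files that still need `samtools index`; an empty list means every matched file has an index.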

Single-cell Unified Mode (`--single-cell`)

Single-cell unified mode is optimized for single-cell RNA-seq data, providing 2-3x performance improvement through unified processing.

Key Features:

Unified data distribution scanning (once instead of per-file)
Automatic cell label extraction from filenames
Cross-file parallel processing
Reduced I/O overhead

Cell Label Extraction:

Supports RHX pattern: *RHX672.sort.bam → RHX672
Falls back to last underscore-separated part: sample_cell123.bam → cell123
Final fallback: base filename without extension
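
The fallback order above can be sketched in Python as follows. This is an illustration of the three rules as described, not modtector's actual implementation. The regular expression `RHX\d+` is an assumption about what an RHX label looks like, and treating everything after the first dot as the extension is also an assumption.

```python
import re
from pathlib import Path

RHX_PATTERN = re.compile(r"RHX\d+")  # assumed shape of an RHX label


def extract_cell_label(path: str) -> str:
    """Derive a cell label from a BAM filename using the three documented rules."""
    name = Path(path).name
    # Rule 1: an RHX label anywhere in the filename.
    match = RHX_PATTERN.search(name)
    if match:
        return match.group(0)
    # Drop everything from the first dot onward (".bam", ".sort.bam", ...).
    stem = name.split(".", 1)[0]
    # Rule 2: the last underscore-separated part, if there is a non-empty one.
    last_part = stem.rsplit("_", 1)[-1]
    if "_" in stem and last_part:
        return last_part
    # Rule 3: the base filename without extension.
    return stem
```

Because each label becomes an output filename, it is worth confirming that the labels are unique across the input files before running.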

Example:

modtector count --single-cell \
  -b "/path/to/single_cell/*sort.bam" \
  -f reference.fa \
  -o output_dir/ \
  -t 8 \
  -w 10000 \
  -l batch.log

Output:

Each cell generates a separate CSV file: RHX672.csv, RHX673.csv, etc.
Log files are also generated per cell

Performance Comparison:

Batch mode: ~N × single_file_time (files are processed one after another)
Single-cell unified mode: between ~(N × single_file_time) / 3 and ~(N × single_file_time) / 2 (unified processing)

Signal Types

Stop Signals

Stop signals occur when reverse transcription is truncated at modification sites.

Characteristics

Detection: Reverse transcription truncation events
Sensitivity: High sensitivity for certain modifications
Specificity: Good specificity with proper controls
Background: Low background noise

Analysis

modtector count -b sample.bam -f reference.fa -o output.csv

Mutation Signals

Mutation signals occur when reverse transcription introduces errors at modification sites.

Characteristics

Detection: Base mutation events
Sensitivity: Moderate sensitivity
Specificity: Good specificity with proper controls
Background: Moderate background noise

Analysis

modtector count -b sample.bam -f reference.fa -o output.csv

Combined Analysis

modtector can analyze both signal types simultaneously:

modtector reactivity -M mod.csv -U unmod.csv -O reactivity.csv -t both

Normalization Methods

Purpose

Normalization removes systematic biases and noise from the data:

Technical noise: Sequencing artifacts
Biological noise: Background signal
Systematic bias: Sample preparation effects

Available Methods

1. Percentile28 Normalization

Method: 28th percentile scaling
Use case: General purpose normalization
Advantages: Robust to outliers
Disadvantages: May be conservative

modtector norm -i input.csv -o output.csv -m percentile28

2. Winsor90 Normalization

Method: 90th percentile winsorization
Use case: High-quality data
Advantages: Preserves signal distribution
Disadvantages: Sensitive to outliers

modtector norm -i input.csv -o output.csv -m winsor90
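
To make the idea of winsorization concrete, here is a minimal Python sketch that caps values at an upper percentile and then scales by that cap. It follows the "90th percentile" wording above literally. Whether modtector caps one or both tails, which percentile convention it uses, and how it scales afterwards are not stated in this guide, so treat every detail here as an assumption. Both function names are hypothetical.

```python
def nearest_rank_percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


def winsorize_and_scale(values: list[float], pct: int = 90) -> list[float]:
    """Cap values at the pct-th percentile, then divide by the cap."""
    cap = nearest_rank_percentile(values, pct)
    if cap <= 0:
        return [0.0 for _ in values]
    return [min(value, cap) / cap for value in values]


signals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
scaled = winsorize_and_scale(signals)
```

In this toy input the extreme value 100 is pulled down to the cap of 9, so it no longer dominates the scale of the remaining positions.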

3. Boxplot Normalization

Method: Boxplot-based outlier removal
Use case: Data with many outliers
Advantages: Effective outlier removal
Disadvantages: May remove valid signals

modtector norm -i input.csv -o output.csv -m boxplot
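
The outlier step of a boxplot-style method can be sketched as below. This shows only the conventional rule of flagging values above Q3 + 1.5 × IQR; the multiplier 1.5, the quartile method, and what happens to the remaining values afterwards are assumptions, since this guide does not specify how modtector implements it.

```python
import statistics


def boxplot_upper_fence(values: list[float], k: float = 1.5) -> float:
    """Upper fence of a boxplot: third quartile plus k times the interquartile range."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 + k * (q3 - q1)


def flag_outliers(values: list[float], k: float = 1.5) -> list[bool]:
    """Mark each value that lies above the upper fence."""
    fence = boxplot_upper_fence(values, k)
    return [value > fence for value in values]


flags = flag_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

This also illustrates the disadvantage listed above: a genuinely strong signal at a single position would be flagged in the same way as an artifact.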

Window-Based Normalization

Fixed Windows

modtector norm -i input.csv -o output.csv -m winsor90 --window 1000

Dynamic Windows

modtector norm -i input.csv -o output.csv -m winsor90 --dynamic

Sliding Windows

modtector norm -i input.csv -o output.csv -m winsor90 --window 500 --window-offset 100

Reactivity Calculation

Purpose

Reactivity calculation quantifies the difference between modified and unmodified samples to identify modification sites.
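
In its simplest generic form, this comparison is a background subtraction: the signal rate in the modified sample minus the rate in the unmodified control. The Python sketch below shows that idea only. It is not the formula behind any of the named methods in this section, and the function and parameter names are hypothetical.

```python
def background_subtracted_rate(
    mod_count: int, mod_depth: int, unmod_count: int, unmod_depth: int
) -> float:
    """Signal rate in the modified sample minus the control rate, floored at zero."""
    if mod_depth <= 0 or unmod_depth <= 0:
        return 0.0  # no coverage in one sample, so no estimate is possible
    mod_rate = mod_count / mod_depth
    unmod_rate = unmod_count / unmod_depth
    return max(0.0, mod_rate - unmod_rate)
```

Flooring at zero reflects that a position with more signal in the control than in the treated sample carries no evidence of modification.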

Calculation Methods

Stop Signal Methods

Current Method:

modtector reactivity -M mod.csv -U unmod.csv -O output.csv -s current

Ding Method:

modtector reactivity -M mod.csv -U unmod.csv -O output.csv -s ding

Rouskin Method:

modtector reactivity -M mod.csv -U unmod.csv -O output.csv -s rouskin

Mutation Signal Methods

Current Method:

modtector reactivity -M mod.csv -U unmod.csv -O output.csv -m current

Siegfried Method:

modtector reactivity -M mod.csv -U unmod.csv -O output.csv -m siegfried

Zubradt Method:

modtector reactivity -M mod.csv -U unmod.csv -O output.csv -m zubradt

Parameters

Pseudocount

Purpose: Avoid zero values in logarithmic calculations
Default: 1
Range: 0.1 - 10
Recommendation: Use 0.5 for sparse data

modtector reactivity -M mod.csv -U unmod.csv -O output.csv --pseudocount 0.5

Maximum Score

Purpose: Limit upper bound of reactivity values
Default: 10
Range: 5 - 50
Recommendation: Use 5 for conservative analysis

modtector reactivity -M mod.csv -U unmod.csv -O output.csv --maxscore 5
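
The roles of the two parameters can be illustrated with a generic log-ratio score. This is a sketch under assumptions, not modtector's formula: the use of log base 2, the floor at zero, and the function name are all hypothetical. It shows why a pseudocount is needed (a zero count would otherwise make the ratio undefined) and how a maximum score bounds the result.

```python
import math


def log_ratio_score(
    mod_count: float,
    unmod_count: float,
    pseudocount: float = 1.0,
    max_score: float = 10.0,
) -> float:
    """Log2 ratio of pseudocount-adjusted counts, clamped to the range [0, max_score]."""
    ratio = (mod_count + pseudocount) / (unmod_count + pseudocount)
    return min(max(math.log2(ratio), 0.0), max_score)
```

With a smaller pseudocount, low counts move the score more: 1 read against 0 gives log2(2) = 1 with a pseudocount of 1, but log2(3) ≈ 1.58 with a pseudocount of 0.5.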

Evaluation Metrics

Purpose

Evaluation metrics assess the accuracy of modification detection using known secondary structures.

Available Metrics

1. Area Under the Curve (AUC)

Range: 0 - 1
Interpretation:
- > 0.8: Excellent
- 0.7 - 0.8: Good
- 0.6 - 0.7: Fair
- < 0.6: Poor
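
For readers who want to see what AUC measures, the Python sketch below computes it as the probability that a randomly chosen positive position scores higher than a randomly chosen negative one, counting ties as one half. Which positions count as positive (for instance, unpaired bases in the reference structure) is an assumption here, and the function is an illustration rather than modtector's evaluation code.

```python
def roc_auc(scores: list[float], labels: list[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 means the scores rank positives no better than chance, which is why values near 0.6 are rated poorly above. The pairwise loop is quadratic, so it suits short transcripts rather than genome-scale data.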

2. F1-Score

Range: 0 - 1
Interpretation:
- > 0.8: Excellent
- 0.6 - 0.8: Good
- 0.4 - 0.6: Fair
- < 0.4: Poor

3. Accuracy

Range: 0 - 1
Interpretation:
- > 0.9: Excellent
- 0.8 - 0.9: Good
- 0.7 - 0.8: Fair
- < 0.7: Poor

4. Sensitivity (Recall)

Range: 0 - 1
Interpretation: Proportion of true modifications detected

5. Specificity

Range: 0 - 1
Interpretation: Proportion of true non-modifications correctly identified

6. Positive Predictive Value (PPV)

Range: 0 - 1
Interpretation: Proportion of predicted modifications that are true

7. Negative Predictive Value (NPV)

Range: 0 - 1
Interpretation: Proportion of predicted non-modifications that are true
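
Metrics 2 through 7 all derive from the same four confusion-matrix counts. The Python sketch below uses the standard textbook definitions; the function name and the choice to return 0.0 when a denominator is zero are assumptions for illustration.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute standard metrics from true/false positive/negative counts."""

    def ratio(numerator: float, denominator: float) -> float:
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),  # recall
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),          # precision
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


metrics = classification_metrics(tp=8, fp=2, tn=85, fn=5)
```

In this example accuracy is 0.93 while sensitivity is only about 0.62, which shows why accuracy alone can mislead when true sites are rare.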

Evaluation Process

modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id

Output Files

Comprehensive Results: *_comprehensive.txt
ROC Curves: *_roc.svg
PR Curves: *_pr.svg
Combined Plots: *_combined_roc.svg

Visualization

Purpose

Visualization helps interpret results and identify patterns in the data.

Available Plots

1. Signal Distribution Plots

Purpose: Show signal distribution across positions
Format: SVG
Content: Stop and mutation signals

2. Reactivity Plots

Purpose: Display reactivity scores
Format: SVG
Content: Modification sites and scores

3. ROC Curves

Purpose: Show classification performance
Format: SVG
Content: True positive rate vs false positive rate

4. PR Curves

Purpose: Show precision-recall performance
Format: SVG
Content: Precision vs recall

5. Comparison Plots

Purpose: Compare different samples or methods
Format: SVG
Content: Side-by-side comparisons

6. RNA Structure SVG Plots

Purpose: Visualize reactivity data on RNA secondary structure
Format: SVG
Content: Colored circles mapped to structure positions
Features:
- Multi-signal support (stop, mutation, etc.)
- Strand selection (+, -, both)
- Base filtering (A, T, C, G)
- Alignment support
- Color-coded reactivity scores

7. Interactive HTML Visualizations

Purpose: Interactive web-based visualization of reactivity data on RNA structure
Format: HTML (with embedded SVG and JavaScript)
Content: Interactive RNA structure with reactivity data overlay
Features:
- Hover Tooltips: Display position, base type, and reactivity values on hover
- Zoom and Pan: Mouse wheel zoom and click-drag panning
- Filtering Controls:
  - Reactivity threshold filtering (min/max sliders)
  - Base type filtering (A, T, C, G, or all)
- Highlighting: Click circles to highlight specific positions
- Export: Export the current view as SVG
- Reset View: Reset zoom and filters to default
Usage: Add --interactive flag to the plot command to generate HTML instead of static SVG

SVG Plotting

Overview

RNA structure SVG plots provide an intuitive way to visualize reactivity data directly on the RNA secondary structure. This helps researchers identify modification sites and understand the relationship between structure and reactivity.

SVG-Only Mode

When you only need SVG plots without regular distribution plots, use the simplified command:

modtector plot \
    -o svg_output/ \
    --svg-template rna_structure.svg \
    --reactivity reactivity_data.csv

Multi-Signal SVG Plotting

For data with multiple signal types (stop, mutation), plot all signals:

modtector plot \
    -o svg_output/ \
    --svg-template rna_structure.svg \
    --reactivity reactivity_data.csv \
    --svg-bases ATGC \
    --svg-signal all \
    --svg-strand +

Interactive HTML Visualization

Generate interactive HTML visualizations with zoom, pan, filtering, and tooltip features:

modtector plot \
    -o interactive_output/ \
    --svg-template rna_structure.svg \
    --reactivity reactivity_data.csv \
    --interactive \
    --svg-signal stop \
    --svg-bases ACGT

The interactive HTML file can be opened in any modern web browser and provides:

Mouse wheel zoom: Scroll to zoom in/out
Click and drag: Pan around the structure
Hover tooltips: See position, base, and reactivity values
Filtering: Adjust reactivity thresholds and filter by base type
Highlighting: Click circles to highlight specific positions
Export: Download the current view as SVG

Single Signal SVG Plotting

For single signal data, specify the signal type:

modtector plot \
    -o svg_output/ \
    --svg-template rna_structure.svg \
    --reactivity reactivity_data.csv \
    --svg-bases AC \
    --svg-signal stop \
    --svg-strand +

SVG Plotting with Alignment

When sequence alignment is needed:

modtector plot \
    -o svg_output/ \
    --svg-template rna_structure.svg \
    --reactivity reactivity_data.csv \
    --svg-bases ATGC \
    --svg-signal all \
    --svg-strand + \
    --svg-ref reference.fa \
    --svg-max-shift 10

SVG Output Files

Multi-signal: rna_structure_colored_[signal_type].svg
Single-signal: rna_structure_colored_score.svg

Plotting Options

Basic Plotting

modtector plot -M mod.csv -U unmod.csv -o plots/

With Reactivity Data

modtector plot -M mod.csv -U unmod.csv -o plots/ -r reactivity.csv

Custom Thresholds

modtector plot -M mod.csv -U unmod.csv -o plots/ -c 0.3 -d 100

With Genome Annotation

modtector plot -M mod.csv -U unmod.csv -o plots/ -g annotation.gff

Advanced Features

Multi-threading

modtector supports parallel processing for improved performance:

# Use 8 threads
modtector count -b sample.bam -f reference.fa -o output.csv -t 8
modtector reactivity -M mod.csv -U unmod.csv -O output.csv -t 8
modtector plot -M mod.csv -U unmod.csv -o plots/ -t 8

Window Analysis

Fixed Windows

modtector count -b sample.bam -f reference.fa -o output.csv -w 1000

Dynamic Windows

modtector norm -i input.csv -o output.csv -m winsor90 --dynamic

Base-Specific Analysis

Target specific bases for analysis:

# Analyze only A and C bases
modtector norm -i input.csv -o output.csv -m winsor90 --bases AC

# Analyze only G and T bases
modtector norm -i input.csv -o output.csv -m winsor90 --bases GT

Statistical Testing

Compare samples using different statistical tests:

# Student's t-test
modtector compare -M mod.csv -U unmod.csv -o comparison.csv -t t-test

# Mann-Whitney U test
modtector compare -M mod.csv -U unmod.csv -o comparison.csv -t mann-whitney

# Wilcoxon signed-rank test
modtector compare -M mod.csv -U unmod.csv -o comparison.csv -t wilcoxon

Auto-shift Correction

Automatically correct for sequence length differences:

modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id --auto-shift
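
One way to think about shift correction is as a search for the offset that best lines up two sequences. The Python sketch below tries every offset in a bounded range and keeps the one with the most matching bases. It is a conceptual illustration only: how modtector actually chooses the shift is not described in this guide, and `best_shift` and its default range are hypothetical.

```python
def best_shift(query: str, reference: str, max_shift: int = 10) -> int:
    """Return the offset in [-max_shift, max_shift] that maximizes matching bases.

    A shift of s means query[i] is compared with reference[i + s].
    Ties are broken in favor of the smallest absolute shift.
    """

    def matches(shift: int) -> int:
        count = 0
        for i, base in enumerate(query):
            j = i + shift
            if 0 <= j < len(reference) and reference[j] == base:
                count += 1
        return count

    candidates = range(-max_shift, max_shift + 1)
    return max(candidates, key=lambda s: (matches(s), -abs(s)))


reference = "GGGACGUACGUUAGC"
query = reference[3:]  # the query lacks the first three bases
shift = best_shift(query, reference)
```

Bounding the search keeps a short or repetitive sequence from aligning to a distant, spurious position.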

Base Matching

Enable base matching that treats T and U as equivalent:

modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id --base-matching
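
T/U equivalence matters because reference FASTA files are often written in DNA letters while structure files use RNA letters. A minimal Python sketch of such a comparison follows; it is an illustration under the assumption that matching is case-insensitive, and both function names are hypothetical.

```python
def bases_match(a: str, b: str) -> bool:
    """Compare two bases case-insensitively, treating T and U as the same base."""

    def normalize(base: str) -> str:
        upper = base.upper()
        return "U" if upper == "T" else upper

    return normalize(a) == normalize(b)


def sequence_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions that match under T/U equivalence."""
    length = min(len(seq_a), len(seq_b))
    if length == 0:
        return 0.0
    same = sum(bases_match(a, b) for a, b in zip(seq_a, seq_b))
    return same / length
```

Without this normalization, a DNA-alphabet reference compared against an RNA-alphabet structure file would mismatch at every T/U position.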

Best Practices

Data Quality

High-quality alignments: Ensure BAM files are properly aligned
Sufficient coverage: Use at least 50x coverage
Proper controls: Include unmodified control samples
Quality filtering: Remove low-quality reads and positions

Parameter Selection

Normalization method: Choose based on data characteristics
Window size: Balance between noise reduction and signal preservation
Thresholds: Adjust based on expected signal levels
Statistical tests: Select appropriate test for your data

Performance Optimization

Thread count: Match to available CPU cores
Memory usage: Monitor RAM usage for large datasets
Disk space: Ensure sufficient storage for output files
Batch processing: Use --batch or --single-cell to process multiple samples in one run

Result Interpretation

Check evaluation metrics: Ensure good performance scores
Validate with known sites: Compare with literature
Consider biological context: Interpret results in context
Reproducibility: Document parameters and methods

Troubleshooting

Low coverage: Increase sequencing depth
Poor normalization: Try different methods
Low evaluation scores: Check data quality
Memory issues: Reduce thread count or dataset size

Documentation

Record parameters: Keep track of all settings
Version control: Use version control for reproducibility
Log files: Save and review log files
Backup results: Keep copies of important results

Common Workflows

Basic Workflow

# 1. Generate pileup data
modtector count -b mod.bam -f reference.fa -o mod_count.csv -t 4
modtector count -b unmod.bam -f reference.fa -o unmod_count.csv -t 4

# 2. Normalize signals
modtector norm -i mod_count.csv -o mod_norm.csv -m winsor90 --bases AC
modtector norm -i unmod_count.csv -o unmod_norm.csv -m winsor90 --bases AC

# 3. Calculate reactivity
modtector reactivity -M mod_norm.csv -U unmod_norm.csv -O reactivity.csv -t 4

# 4. Generate plots
modtector plot -M mod_norm.csv -U unmod_norm.csv -o plots/ -r reactivity.csv -t 4

# 5. Evaluate accuracy
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id

Advanced Workflow

# 1. Generate pileup data with windowing
modtector count -b mod.bam -f reference.fa -o mod_count.csv -w 1000 -t 8
modtector count -b unmod.bam -f reference.fa -o unmod_count.csv -w 1000 -t 8

# 2. Normalize with dynamic windows
modtector norm -i mod_count.csv -o mod_norm.csv -m winsor90 --dynamic --bases AC
modtector norm -i unmod_count.csv -o unmod_norm.csv -m winsor90 --dynamic --bases AC

# 3. Calculate reactivity with custom parameters
modtector reactivity -M mod_norm.csv -U unmod_norm.csv -O reactivity.csv \
    -s ding -m siegfried --pseudocount 0.5 --maxscore 5 -t 8

# 4. Compare samples
modtector compare -M mod_norm.csv -U unmod_norm.csv -o comparison.csv \
    -t mann-whitney -d 20 -f 1.5

# 5. Generate comprehensive plots
modtector plot -M mod_norm.csv -U unmod_norm.csv -o plots/ \
    -r reactivity.csv -c 0.3 -d 100 -t 8

# 6. Evaluate with auto-shift correction
modtector evaluate -r reactivity.csv -s structure.dp -o results/ \
    -g gene_id --auto-shift --base-matching

This user guide provides comprehensive information for using modtector effectively. For specific command details, refer to the Command Reference.
