User Guide

This guide covers how to use modtector, from basic concepts to advanced features.

Table of Contents

  1. Understanding RNA Modifications

  2. modtector Workflow

  3. Data Preparation

  4. Signal Types

  5. Normalization Methods

  6. Reactivity Calculation

  7. Evaluation Metrics

  8. Visualization

  9. Advanced Features

  10. Best Practices

Understanding RNA Modifications

What are RNA Modifications?

RNA modifications are chemical alterations to RNA nucleotides that can affect:

  • RNA structure and stability

  • Protein-RNA interactions

  • Translation efficiency

  • RNA localization

  • Gene expression regulation

Common RNA Modifications

  • m6A: N6-methyladenosine (the most abundant internal modification in eukaryotic mRNA)

  • m1A: N1-methyladenosine

  • m5C: 5-methylcytidine

  • Ψ: Pseudouridine

  • D: Dihydrouridine

Detection Methods

modtector supports detection based on:

  • Stop signals: Premature termination (truncation) of reverse transcription at modified sites

  • Mutation signals: Base mutations during reverse transcription
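
The difference between the two signal types can be pictured with a toy counter. The Python sketch below is only an illustration, not modtector's pileup code: the function name, the (start, sequence) read representation, and the choice to credit a stop to the read's leftmost coordinate are all assumptions (some protocols assign the stop to the position one base upstream instead).

```python
from collections import Counter


def count_signals(reference, reads):
    """Count toy stop and mutation signals per reference position.

    reads: iterable of (start, sequence) pairs with a 0-based start,
    no gaps, all on the forward strand (hypothetical simplified input).
    """
    stops = Counter()
    mutations = Counter()
    depth = Counter()
    for start, sequence in reads:
        # Reverse transcription moves toward the RNA 5' end, so a truncated
        # cDNA maps with its leftmost coordinate where transcription stopped.
        stops[start] += 1
        for offset, base in enumerate(sequence):
            position = start + offset
            depth[position] += 1
            if base != reference[position]:
                # A mismatch against the reference is a mutation signal.
                mutations[position] += 1
    return stops, mutations, depth


stops, mutations, depth = count_signals(
    "ACGUACGU", [(2, "GUAC"), (2, "GAAC"), (4, "ACGU")]
)
```

In this toy input two reads start at position 2 (two stop events there), and the second read carries one mismatch at position 3 (one mutation event).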

modtector Workflow

Overview

modtector follows a systematic workflow:

Raw BAM Files → Pileup Analysis → Normalization → Reactivity Calculation → Evaluation
      ↓                ↓                ↓                   ↓                  ↓
  Alignment      Signal Counts    Filtered Data    Modification Scores      Accuracy

Step-by-Step Process

  1. Data Input: BAM files, reference sequences, structure files

  2. Pileup Analysis: Count stop and mutation signals

  3. Normalization: Filter noise and outliers

  4. Reactivity Calculation: Compare modified vs unmodified samples

  5. Visualization: Generate plots and charts

  6. Evaluation: Assess accuracy using known structures

Data Preparation

Input Requirements

BAM Files

  • Format: BAM format with proper alignment

  • Quality: High-quality alignments with minimal mismatches

  • Coverage: Sufficient read depth (recommended >50x)

  • Paired samples: Modified and unmodified samples

Reference Sequences

  • Format: FASTA format

  • Quality: High-quality reference sequences

  • Matching: Must match your BAM file alignments

  • Completeness: Include all target regions

Secondary Structure Files

  • Format: Dot-bracket notation (.dp files)

  • Source: Experimentally determined or predicted structures

  • Accuracy: High-confidence structures for evaluation

  • Coverage: Cover all regions of interest
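
In dot-bracket notation each base pair is written as a matching "(" and ")" and each unpaired base as ".". As a rough illustration of how such a string maps to per-position labels, here is a minimal Python sketch; the function name is hypothetical and this is not modtector's own parser.

```python
def paired_mask(dot_bracket):
    """Return a list of booleans, True where the position is base-paired.

    Raises ValueError if the parentheses are unbalanced.
    """
    open_positions = []
    mask = [False] * len(dot_bracket)
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            open_positions.append(i)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {i + 1}")
            partner = open_positions.pop()
            mask[i] = mask[partner] = True
        # Any other symbol (such as '.') is treated as unpaired.
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1] + 1}")
    return mask


mask = paired_mask("((..))..")
```

A balance check like this is a cheap way to catch truncated or hand-edited structure files before evaluation.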

Data Quality Checks

Before running modtector, verify:

  1. BAM File Quality:

    samtools flagstat sample.bam
    samtools view -c sample.bam
    
  2. Reference Sequence:

    samtools faidx reference.fa
    
  3. Alignment Quality:

    samtools view sample.bam | head -100
    
  4. BAM Index Files (Required for count command):

    # Check if index exists
    ls -lh sample.bam.bai
    
    # Create index if missing
    samtools index -b -@ 8 sample.bam sample.bam.bai
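
When many BAM files are involved, the index check can be scripted. The following Python sketch is one possible helper, assuming the index sits next to each BAM file as <name>.bam.bai (the naming used above); the function name is hypothetical.

```python
from pathlib import Path


def missing_indexes(bam_paths):
    """Return the BAM paths that have no '<name>.bam.bai' file beside them."""
    missing = []
    for bam in map(Path, bam_paths):
        index = Path(str(bam) + ".bai")
        if not index.exists():
            missing.append(bam)
    return missing


# Example call (paths are placeholders):
# missing_indexes(sorted(Path("/path/to/data").glob("*sort.bam")))
```

Any path it returns still needs samtools index before the count command is run.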
    

Batch Processing and Single-cell Mode

Batch Mode (--batch)

Batch mode processes multiple BAM files sequentially, treating each file as an independent sample.

Use Cases:

  • Processing multiple samples that need separate analysis

  • Bulk RNA-seq data with multiple replicates

  • When each file should be treated as an independent sample

Example:

modtector count --batch \
  -b "/path/to/data/*sort.bam" \
  -f reference.fa \
  -o output_dir/ \
  -t 8 \
  -w 10000

Requirements:

  • BAM files must be sorted and indexed (.bai files must exist)

  • Glob pattern must match at least one file

  • Output path must be a directory

Single-cell Unified Mode (--single-cell)

Single-cell unified mode is optimized for single-cell RNA-seq data and provides a roughly 2-3x performance improvement through unified processing.

Key Features:

  • Unified data distribution scanning (once instead of per-file)

  • Automatic cell label extraction from filenames

  • Cross-file parallel processing

  • Reduced I/O overhead

Cell Label Extraction:

  • Supports RHX pattern: *RHX672.sort.bam → RHX672

  • Falls back to last underscore-separated part: sample_cell123.bam → cell123

  • Final fallback: base filename without extension
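
The three rules above can be read as a simple fallback chain. The Python sketch below mirrors that reading for illustration only; the function name, the regular expression "RHX followed by digits", and the stripping of a ".sort" suffix are assumptions, and modtector's actual extraction may differ in edge cases.

```python
import re
from pathlib import Path


def cell_label(bam_path):
    """Derive a cell label from a BAM filename using the documented fallbacks."""
    name = Path(bam_path).name
    # Drop ".bam" and an optional ".sort" in front of it (assumed convention).
    stem = re.sub(r"(\.sort)?\.bam$", "", name)
    # Rule 1: an RHX identifier anywhere in the name.
    match = re.search(r"RHX\d+", stem)
    if match:
        return match.group(0)
    # Rule 2: the last underscore-separated part.
    last_part = stem.rsplit("_", 1)[-1]
    if "_" in stem and last_part:
        return last_part
    # Rule 3: the base filename without extension.
    return stem


labels = [cell_label(p) for p in ("lib1_RHX672.sort.bam", "sample_cell123.bam", "control.bam")]
```

For the three example filenames this yields RHX672, cell123, and control.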

Example:

modtector count --single-cell \
  -b "/path/to/single_cell/*sort.bam" \
  -f reference.fa \
  -o output_dir/ \
  -t 8 \
  -w 10000 \
  -l batch.log

Output:

  • Each cell generates a separate CSV file: RHX672.csv, RHX673.csv, etc.

  • Log files are also generated per cell

Performance Comparison:

  • Batch mode: ~N × single_file_time (sequential processing)

  • Single-cell unified mode: roughly (N × single_file_time) / 2 to (N × single_file_time) / 3 (unified processing)

Signal Types

Stop Signals

Stop signals occur when reverse transcription is truncated at modification sites.

Characteristics

  • Detection: Reverse transcription truncation events

  • Sensitivity: High sensitivity for certain modifications

  • Specificity: Good specificity with proper controls

  • Background: Low background noise

Analysis

modtector count -b sample.bam -f reference.fa -o output.csv

Mutation Signals

Mutation signals occur when reverse transcription introduces errors at modification sites.

Characteristics

  • Detection: Base mutation events

  • Sensitivity: Moderate sensitivity

  • Specificity: Good specificity with proper controls

  • Background: Moderate background noise

Analysis

modtector count -b sample.bam -f reference.fa -o output.csv

Combined Analysis

modtector can analyze both signal types simultaneously:

modtector reactivity -M mod.csv -U unmod.csv -O reactivity.csv -t both

Normalization Methods

Purpose

Normalization removes systematic biases and noise from the data:

  • Technical noise: Sequencing artifacts

  • Biological noise: Background signal

  • Systematic bias: Sample preparation effects

Available Methods

1. Percentile28 Normalization

  • Method: 28th percentile scaling

  • Use case: General purpose normalization

  • Advantages: Robust to outliers

  • Disadvantages: May be conservative

modtector norm -i input.csv -o output.csv -m percentile28

2. Winsor90 Normalization

  • Method: 90th percentile winsorization

  • Use case: High-quality data

  • Advantages: Preserves signal distribution

  • Disadvantages: Sensitive to outliers

modtector norm -i input.csv -o output.csv -m winsor90
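
Winsorization replaces values beyond chosen percentile bounds with the bounds themselves instead of discarding them. The Python sketch below is a generic illustration, not modtector's winsor90 code; the helper names, the linear percentile interpolation, and the 5th/95th percentile bounds used in the example are assumptions.

```python
def percentile(ordered, pct):
    """Linearly interpolated percentile of a sorted, non-empty list."""
    rank = (len(ordered) - 1) * pct / 100.0
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def winsorize(values, lower_pct, upper_pct):
    """Clip every value into the [lower_pct, upper_pct] percentile range."""
    ordered = sorted(values)
    lower = percentile(ordered, lower_pct)
    upper = percentile(ordered, upper_pct)
    return [min(max(v, lower), upper) for v in values]


signal = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
clipped = winsorize(signal, 5, 95)  # the extreme value 100 is pulled down to 54.5
```

Because the extreme value is clipped rather than removed, the number of positions stays the same, which keeps downstream per-position comparisons aligned.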

3. Boxplot Normalization

  • Method: Boxplot-based outlier removal

  • Use case: Data with many outliers

  • Advantages: Effective outlier removal

  • Disadvantages: May remove valid signals

modtector norm -i input.csv -o output.csv -m boxplot
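
Boxplot-based filtering is usually built on the interquartile range (IQR). As a hedged illustration, the Python sketch below applies the conventional upper fence Q3 + 1.5 × IQR; the function names, the 1.5 multiplier, and the quantile method are assumptions and may not match what modtector does internally.

```python
import statistics


def upper_fence(values, k=1.5):
    """Return Q3 + k * IQR, the usual upper whisker limit of a boxplot."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def drop_high_outliers(values, k=1.5):
    """Keep only the values at or below the upper fence."""
    fence = upper_fence(values, k)
    return [v for v in values if v <= fence]


signal = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
kept = drop_high_outliers(signal)  # 50 lies above the fence and is removed
```

The same rule also explains the listed disadvantage: a genuinely strong site that sits far above the bulk of the data is removed just like a technical artifact.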

Window-Based Normalization

Fixed Windows

modtector norm -i input.csv -o output.csv -m winsor90 --window 1000

Dynamic Windows

modtector norm -i input.csv -o output.csv -m winsor90 --dynamic

Sliding Windows

modtector norm -i input.csv -o output.csv -m winsor90 --window 500 --window-offset 100

Reactivity Calculation

Purpose

Reactivity calculation quantifies the difference between modified and unmodified samples to identify modification sites.

Calculation Methods

Stop Signal Methods

  1. Current Method:

    modtector reactivity -M mod.csv -U unmod.csv -O output.csv -s current
    
  2. Ding Method:

    modtector reactivity -M mod.csv -U unmod.csv -O output.csv -s ding
    
  3. Rouskin Method:

    modtector reactivity -M mod.csv -U unmod.csv -O output.csv -s rouskin
    

Mutation Signal Methods

  1. Current Method:

    modtector reactivity -M mod.csv -U unmod.csv -O output.csv -m current
    
  2. Siegfried Method:

    modtector reactivity -M mod.csv -U unmod.csv -O output.csv -m siegfried
    
  3. Zubradt Method:

    modtector reactivity -M mod.csv -U unmod.csv -O output.csv -m zubradt
    

Parameters

Pseudocount

  • Purpose: Avoid zero values in logarithmic calculations

  • Default: 1

  • Range: 0.1 - 10

  • Recommendation: Use 0.5 for sparse data

modtector reactivity -M mod.csv -U unmod.csv -O output.csv --pseudocount 0.5

Maximum Score

  • Purpose: Limit upper bound of reactivity values

  • Default: 10

  • Range: 5 - 50

  • Recommendation: Use 5 for conservative analysis

modtector reactivity -M mod.csv -U unmod.csv -O output.csv --maxscore 5
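
To see what the two parameters do, consider a ratio-style score. The Python sketch below is not one of modtector's named methods; the function name and the log2-ratio formula are assumptions chosen only to show how a pseudocount keeps the logarithm defined and how a maximum score caps extreme values.

```python
import math


def capped_log_ratio(mod_signal, unmod_signal, pseudocount=1.0, maxscore=10.0):
    """Illustrative score: log2 of the pseudocount-adjusted ratio, capped at maxscore."""
    if pseudocount <= 0:
        raise ValueError("pseudocount must be positive")
    ratio = (mod_signal + pseudocount) / (unmod_signal + pseudocount)
    return min(math.log2(ratio), maxscore)


default_score = capped_log_ratio(3, 0)  # log2(4 / 1) = 2.0
sparse_score = capped_log_ratio(1, 0, pseudocount=0.5)  # log2(1.5 / 0.5)
capped_score = capped_log_ratio(10000, 0, pseudocount=0.5, maxscore=5.0)  # hits the cap
```

With a single modified-sample count and no control counts, a pseudocount of 1 gives log2(2) while 0.5 gives log2(3), so a smaller pseudocount dampens sparse signals less.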

Evaluation Metrics

Purpose

Evaluation metrics assess the accuracy of modification detection using known secondary structures.

Available Metrics

1. Area Under the Curve (AUC)

  • Range: 0 - 1

  • Interpretation:

    • > 0.8: Excellent

    • 0.7 - 0.8: Good

    • 0.6 - 0.7: Fair

    • < 0.6: Poor
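
AUC can be read as the probability that a randomly chosen positive position receives a higher score than a randomly chosen negative one. The Python sketch below computes it by direct pairwise comparison; the function name is hypothetical, and which positions count as positives (for example, unpaired bases in the reference structure) is an assumption that depends on the evaluation setup.

```python
def roc_auc(scores, labels):
    """Pairwise-comparison AUC; ties between a positive and a negative count as half."""
    positives = [s for s, is_positive in zip(scores, labels) if is_positive]
    negatives = [s for s, is_positive in zip(scores, labels) if not is_positive]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative position")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0.9, 0.2, 0.8, 0.1], [True, True, False, False])  # 3 of 4 pairs ranked correctly
```

A value near 0.5 means the scores rank positives no better than chance. The pairwise loop is quadratic, so it suits short transcripts and teaching examples rather than genome-scale data.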

2. F1-Score

  • Range: 0 - 1

  • Interpretation:

    • > 0.8: Excellent

    • 0.6 - 0.8: Good

    • 0.4 - 0.6: Fair

    • < 0.4: Poor

3. Accuracy

  • Range: 0 - 1

  • Interpretation:

    • > 0.9: Excellent

    • 0.8 - 0.9: Good

    • 0.7 - 0.8: Fair

    • < 0.7: Poor

4. Sensitivity (Recall)

  • Range: 0 - 1

  • Interpretation: Proportion of true modifications detected

5. Specificity

  • Range: 0 - 1

  • Interpretation: Proportion of true non-modifications correctly identified

6. Positive Predictive Value (PPV)

  • Range: 0 - 1

  • Interpretation: Proportion of predicted modifications that are true

7. Negative Predictive Value (NPV)

  • Range: 0 - 1

  • Interpretation: Proportion of predicted non-modifications that are true
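
The threshold-based metrics above (F1-score, accuracy, sensitivity, specificity, PPV, NPV) all derive from the same four confusion-matrix counts. The Python sketch below shows the standard definitions; the function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this illustration.

```python
def classification_metrics(predicted, actual):
    """Compute standard metrics from two equal-length sequences of booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fn = sum(1 for p, a in pairs if not p and a)

    def ratio(numerator, denominator):
        # Assumed convention: report 0.0 instead of failing on empty classes.
        return numerator / denominator if denominator else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, len(pairs)),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


metrics = classification_metrics(
    predicted=[1, 1, 1, 0, 0, 0, 0, 0],
    actual=[1, 1, 0, 1, 0, 0, 0, 0],
)
```

In this example there are 2 true positives, 1 false positive, 1 false negative, and 4 true negatives, giving an accuracy of 0.75 and a specificity of 0.8.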

Evaluation Process

modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id

Output Files

  1. Comprehensive Results: *_comprehensive.txt

  2. ROC Curves: *_roc.svg

  3. PR Curves: *_pr.svg

  4. Combined Plots: *_combined_roc.svg

Visualization

Purpose

Visualization helps interpret results and identify patterns in the data.

Available Plots

1. Signal Distribution Plots

  • Purpose: Show signal distribution across positions

  • Format: SVG

  • Content: Stop and mutation signals

2. Reactivity Plots

  • Purpose: Display reactivity scores

  • Format: SVG

  • Content: Modification sites and scores

3. ROC Curves

  • Purpose: Show classification performance

  • Format: SVG

  • Content: True positive rate vs false positive rate

4. PR Curves

  • Purpose: Show precision-recall performance

  • Format: SVG

  • Content: Precision vs recall

5. Comparison Plots

  • Purpose: Compare different samples or methods

  • Format: SVG

  • Content: Side-by-side comparisons

6. RNA Structure SVG Plots

  • Purpose: Visualize reactivity data on RNA secondary structure

  • Format: SVG

  • Content: Colored circles mapped to structure positions

  • Features:

    • Multi-signal support (stop, mutation, etc.)

    • Strand selection (+, -, both)

    • Base filtering (A, T, C, G)

    • Alignment support

    • Color-coded reactivity scores

7. Interactive HTML Visualizations

  • Purpose: Interactive web-based visualization of reactivity data on RNA structure

  • Format: HTML (with embedded SVG and JavaScript)

  • Content: Interactive RNA structure with reactivity data overlay

  • Features:

    • Hover Tooltips: Display position, base type, and reactivity values on hover

    • Zoom and Pan: Mouse wheel zoom and click-drag panning

    • Filtering Controls:

      • Reactivity threshold filtering (min/max sliders)

      • Base type filtering (A, T, C, G, or all)

    • Highlighting: Click circles to highlight specific positions

    • Export: Export the current view as SVG

    • Reset View: Reset zoom and filters to default

  • Usage: Add the --interactive flag to the plot command to generate HTML instead of static SVG

SVG Plotting

Overview

RNA structure SVG plots provide an intuitive way to visualize reactivity data directly on the RNA secondary structure. This helps researchers identify modification sites and understand the relationship between structure and reactivity.

SVG-Only Mode

When you only need SVG plots without regular distribution plots, use the simplified command:

modtector plot \
    -o svg_output/ \
    --svg-template rna_structure.svg \
    --reactivity reactivity_data.csv

Multi-Signal SVG Plotting

For data with multiple signal types (stop, mutation), plot all signals:

modtector plot \
    -o svg_output/ \
    --svg-template rna_structure.svg \
    --reactivity reactivity_data.csv \
    --svg-bases ATGC \
    --svg-signal all \
    --svg-strand +

Interactive HTML Visualization

Generate interactive HTML visualizations with zoom, pan, filtering, and tooltip features:

modtector plot \
    -o interactive_output/ \
    --svg-template rna_structure.svg \
    --reactivity reactivity_data.csv \
    --interactive \
    --svg-signal stop \
    --svg-bases ACGT

The interactive HTML file can be opened in any modern web browser and provides:

  • Mouse wheel zoom: Scroll to zoom in/out

  • Click and drag: Pan around the structure

  • Hover tooltips: See position, base, and reactivity values

  • Filtering: Adjust reactivity thresholds and filter by base type

  • Highlighting: Click circles to highlight specific positions

  • Export: Download the current view as SVG

Single Signal SVG Plotting

For single signal data, specify the signal type:

modtector plot \
    -o svg_output/ \
    --svg-template rna_structure.svg \
    --reactivity reactivity_data.csv \
    --svg-bases AC \
    --svg-signal stop \
    --svg-strand +

SVG Plotting with Alignment

When sequence alignment is needed:

modtector plot \
    -o svg_output/ \
    --svg-template rna_structure.svg \
    --reactivity reactivity_data.csv \
    --svg-bases ATGC \
    --svg-signal all \
    --svg-strand + \
    --svg-ref reference.fa \
    --svg-max-shift 10

SVG Output Files

  • Multi-signal: rna_structure_colored_[signal_type].svg

  • Single-signal: rna_structure_colored_score.svg

Plotting Options

Basic Plotting

modtector plot -M mod.csv -U unmod.csv -o plots/

With Reactivity Data

modtector plot -M mod.csv -U unmod.csv -o plots/ -r reactivity.csv

Custom Thresholds

modtector plot -M mod.csv -U unmod.csv -o plots/ -c 0.3 -d 100

With Genome Annotation

modtector plot -M mod.csv -U unmod.csv -o plots/ -g annotation.gff

Advanced Features

Multi-threading

modtector supports parallel processing for improved performance:

# Use 8 threads
modtector count -b sample.bam -f reference.fa -o output.csv -t 8
modtector reactivity -M mod.csv -U unmod.csv -O output.csv -t 8
modtector plot -M mod.csv -U unmod.csv -o plots/ -t 8

Window Analysis

Fixed Windows

modtector count -b sample.bam -f reference.fa -o output.csv -w 1000

Dynamic Windows

modtector norm -i input.csv -o output.csv -m winsor90 --dynamic

Base-Specific Analysis

Target specific bases for analysis:

# Analyze only A and C bases
modtector norm -i input.csv -o output.csv -m winsor90 --bases AC

# Analyze only G and T bases
modtector norm -i input.csv -o output.csv -m winsor90 --bases GT

Statistical Testing

Compare samples using different statistical tests:

# Student's t-test
modtector compare -M mod.csv -U unmod.csv -o comparison.csv -t t-test

# Mann-Whitney U test
modtector compare -M mod.csv -U unmod.csv -o comparison.csv -t mann-whitney

# Wilcoxon signed-rank test
modtector compare -M mod.csv -U unmod.csv -o comparison.csv -t wilcoxon

Auto-shift Correction

Automatically correct for sequence length differences:

modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id --auto-shift
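
One way to picture a shift correction is as a search for the offset at which two sequences agree best. The Python sketch below illustrates that idea only; the function name, the scoring by identical bases, and the search range are assumptions and not a description of how --auto-shift is implemented.

```python
def best_shift(query, reference, max_shift=10):
    """Return the offset in [-max_shift, max_shift] that maximizes identical bases.

    A shift of s aligns query[i] with reference[i + s]; on ties the smaller
    absolute shift wins.
    """
    def matches(shift):
        return sum(
            1
            for i, base in enumerate(query)
            if 0 <= i + shift < len(reference) and reference[i + shift] == base
        )

    return max(
        range(-max_shift, max_shift + 1),
        key=lambda shift: (matches(shift), -abs(shift)),
    )


shift = best_shift("ACGUACGU", "GGACGUACGUAA")  # the query sits 2 bases into the reference
```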

Base Matching

Treat T and U as equivalent when matching bases between the reactivity data and the structure file:

modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id --base-matching
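
T/U equivalence matters because alignments are typically reported in the DNA alphabet while structure files use RNA bases. A minimal Python sketch of such a comparison follows; the function name and the case-insensitive handling are assumptions for illustration.

```python
def bases_match(a, b):
    """Compare two bases, treating T and U as the same letter."""
    def normalize(base):
        base = base.upper()
        return "U" if base == "T" else base

    return normalize(a) == normalize(b)


same = bases_match("T", "U")
different = bases_match("A", "U")
```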

Best Practices

Data Quality

  1. High-quality alignments: Ensure BAM files are properly aligned

  2. Sufficient coverage: Use at least 50x coverage

  3. Proper controls: Include unmodified control samples

  4. Quality filtering: Remove low-quality reads and positions

Parameter Selection

  1. Normalization method: Choose based on data characteristics

  2. Window size: Balance between noise reduction and signal preservation

  3. Thresholds: Adjust based on expected signal levels

  4. Statistical tests: Select appropriate test for your data

Performance Optimization

  1. Thread count: Match to available CPU cores

  2. Memory usage: Monitor RAM usage for large datasets

  3. Disk space: Ensure sufficient storage for output files


Result Interpretation

  1. Check evaluation metrics: Ensure good performance scores

  2. Validate with known sites: Compare with literature

  3. Consider biological context: Interpret results in context

  4. Reproducibility: Document parameters and methods

Troubleshooting

  1. Low coverage: Increase sequencing depth

  2. Poor normalization: Try different methods

  3. Low evaluation scores: Check data quality

  4. Memory issues: Reduce thread count or dataset size

Documentation

  1. Record parameters: Keep track of all settings

  2. Version control: Use version control for reproducibility

  3. Log files: Save and review log files

  4. Backup results: Keep copies of important results

Common Workflows

Basic Workflow

# 1. Generate pileup data
modtector count -b mod.bam -f reference.fa -o mod_count.csv -t 4
modtector count -b unmod.bam -f reference.fa -o unmod_count.csv -t 4

# 2. Normalize signals
modtector norm -i mod_count.csv -o mod_norm.csv -m winsor90 --bases AC
modtector norm -i unmod_count.csv -o unmod_norm.csv -m winsor90 --bases AC

# 3. Calculate reactivity
modtector reactivity -M mod_norm.csv -U unmod_norm.csv -O reactivity.csv -t 4

# 4. Generate plots
modtector plot -M mod_norm.csv -U unmod_norm.csv -o plots/ -r reactivity.csv -t 4

# 5. Evaluate accuracy
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id
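
For repeated runs, the same five steps can be assembled programmatically. The Python sketch below only builds and prints the command lines shown above, using the same flags and file names; the function name is hypothetical, and actually executing the commands (for example with subprocess.run) is left commented out.

```python
import shlex


def basic_workflow(mod_bam, unmod_bam, reference, structure, gene_id, threads=4):
    """Return the basic workflow as a list of argv lists, in run order."""
    t = str(threads)
    return [
        # 1. Generate pileup data
        ["modtector", "count", "-b", mod_bam, "-f", reference, "-o", "mod_count.csv", "-t", t],
        ["modtector", "count", "-b", unmod_bam, "-f", reference, "-o", "unmod_count.csv", "-t", t],
        # 2. Normalize signals
        ["modtector", "norm", "-i", "mod_count.csv", "-o", "mod_norm.csv", "-m", "winsor90", "--bases", "AC"],
        ["modtector", "norm", "-i", "unmod_count.csv", "-o", "unmod_norm.csv", "-m", "winsor90", "--bases", "AC"],
        # 3. Calculate reactivity
        ["modtector", "reactivity", "-M", "mod_norm.csv", "-U", "unmod_norm.csv", "-O", "reactivity.csv", "-t", t],
        # 4. Generate plots
        ["modtector", "plot", "-M", "mod_norm.csv", "-U", "unmod_norm.csv", "-o", "plots/", "-r", "reactivity.csv", "-t", t],
        # 5. Evaluate accuracy
        ["modtector", "evaluate", "-r", "reactivity.csv", "-s", structure, "-o", "results/", "-g", gene_id],
    ]


commands = basic_workflow("mod.bam", "unmod.bam", "reference.fa", "structure.dp", "gene_id")
for argv in commands:
    print(shlex.join(argv))
    # subprocess.run(argv, check=True)  # uncomment (and import subprocess) to execute
```

Keeping the commands as argument lists avoids shell-quoting problems with unusual file names and makes the parameters easy to log for reproducibility.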

Advanced Workflow

# 1. Generate pileup data with windowing
modtector count -b mod.bam -f reference.fa -o mod_count.csv -w 1000 -t 8
modtector count -b unmod.bam -f reference.fa -o unmod_count.csv -w 1000 -t 8

# 2. Normalize with dynamic windows
modtector norm -i mod_count.csv -o mod_norm.csv -m winsor90 --dynamic --bases AC
modtector norm -i unmod_count.csv -o unmod_norm.csv -m winsor90 --dynamic --bases AC

# 3. Calculate reactivity with custom parameters
modtector reactivity -M mod_norm.csv -U unmod_norm.csv -O reactivity.csv \
    -s ding -m siegfried --pseudocount 0.5 --maxscore 5 -t 8

# 4. Compare samples
modtector compare -M mod_norm.csv -U unmod_norm.csv -o comparison.csv \
    -t mann-whitney -d 20 -f 1.5

# 5. Generate comprehensive plots
modtector plot -M mod_norm.csv -U unmod_norm.csv -o plots/ \
    -r reactivity.csv -c 0.3 -d 100 -t 8

# 6. Evaluate with auto-shift correction
modtector evaluate -r reactivity.csv -s structure.dp -o results/ \
    -g gene_id --auto-shift --base-matching

This user guide provides comprehensive information for using modtector effectively. For specific command details, refer to the Command Reference.