Quick Start Guide

This guide will help you get started with modtector quickly. We’ll walk through a complete example from raw data to final results.

Prerequisites

  • modtector installed (see Installation Guide)

  • Sample BAM files (modified and unmodified samples)

  • Reference FASTA file

  • Secondary structure file (for evaluation)

Example Dataset

For this quick start, we’ll use the example data provided in the repository:

# Navigate to the example directory
cd example/

# Check available data
ls data/
ls ref/
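
The commands in the following steps write into subdirectories of signal/. This guide does not say whether modtector creates missing output directories, so creating them up front is a harmless precaution. A minimal sketch, using the directory names from the steps below:

```shell
# Create the output directories used by Steps 1-6.
# mkdir -p is a no-op for directories that already exist.
for step in 01_count 02_reactivity 03_norm 04_duet 05_plot 06_evaluate; do
    mkdir -p "signal/$step"
done
```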

Step 1: Generate Pileup Data

First, we’ll process the BAM files to generate pileup data:

# Process modified sample
modtector count \
    -b data/HEK293_mod_Human.bam \
    -f ref/Human_18S.fa \
    -o signal/01_count/mod_sample.csv \
    -t 4

# Process unmodified sample
modtector count \
    -b data/HEK293_unmod_Human.bam \
    -f ref/Human_18S.fa \
    -o signal/01_count/unmod_sample.csv \
    -t 4

Parameters explained:

  • -b: Input BAM file

  • -f: Reference FASTA file

  • -o: Output CSV file

  • -t: Number of threads for parallel processing
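
Because the two count commands differ only in the sample name, they can be wrapped in a small helper. The following is a sketch, not part of modtector: count_sample is a hypothetical function name, and the file names follow the example dataset above.

```shell
# Hypothetical helper: run `modtector count` for one sample ("mod" or "unmod").
# Uses only the flags shown above (-b, -f, -o, -t).
count_sample() {
    sample="$1"
    threads="${2:-4}"   # default to 4 threads, as in the example
    modtector count \
        -b "data/HEK293_${sample}_Human.bam" \
        -f ref/Human_18S.fa \
        -o "signal/01_count/${sample}_sample.csv" \
        -t "$threads"
}

# Usage (uncomment to run):
# count_sample mod
# count_sample unmod
```

Keeping both calls in one function makes it harder for the modified and unmodified runs to drift apart in their options.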

Step 2: Calculate Reactivity Scores

Calculate reactivity scores by comparing modified and unmodified samples:

modtector reactivity \
    -M signal/01_count/mod_sample.csv \
    -U signal/01_count/unmod_sample.csv \
    -O signal/02_reactivity/reactivity.csv \
    -s current \
    -m current \
    -t 4

Parameters explained:

  • -M: Modified sample CSV

  • -U: Unmodified sample CSV

  • -O: Output reactivity file

  • -s: Stop signal method

  • -m: Mutation signal method

  • -t: Number of threads

Step 3: Normalize Reactivity Signals

Normalize the reactivity data to reduce noise and limit the influence of outliers:

modtector norm \
    -i signal/02_reactivity/reactivity.csv \
    -o signal/03_norm/normalized_reactivity.csv \
    -m winsor90

Parameters explained:

  • -i: Input reactivity CSV (from Step 2)

  • -o: Output normalized reactivity CSV

  • -m: Normalization method (winsor90 in this example)

Step 4: Duet Ensemble Analysis

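Infer dynamic ensembles by combining normalized reactivity with read-level co-variation. The BAM and FASTA paths in this command differ from the example dataset paths used in the other steps; point them at the sorted BAM and reference for your own sample: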

modtector duet \
    -i signal/03_norm/normalized_reactivity.csv \
    -b signal/00_bam/sample.sort.bam \
    -f reference/transcript.fa \
    -o signal/04_duet/duet_windows.csv \
    --epsilon 0.85 \
    --min-samples 8 \
    --window-size 100 \
    --window-step 50

Parameters explained:

  • -i: Normalized reactivity CSV

  • -b: Sorted BAM file for the same sample

  • -f: Reference FASTA sequence

  • -o: Window-level CSV with Duet ensemble statistics

  • --epsilon: DBSCAN radius in standardized feature space

  • --min-samples: Minimum neighbours required to form an ensemble core point

  • --window-size: Sliding-window size in nucleotides (default 100)

  • --window-step: Sliding-window step in nucleotides (default 50)
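
With the default --window-size of 100 and --window-step of 50, consecutive windows overlap by 50 nucleotides. As a rough worked example, the number of full windows along a transcript can be estimated with integer arithmetic. The transcript length of 1000 is hypothetical, and the formula assumes windows start at the first nucleotide and that a trailing partial window is not counted; modtector's actual edge handling may differ.

```shell
# Estimate the number of full sliding windows.
# Assumes: windows start at position 1; trailing partial windows are ignored.
count_windows() {
    length="$1"; size="$2"; step="$3"
    if [ "$length" -lt "$size" ]; then
        echo 0
    else
        echo $(( (length - size) / step + 1 ))
    fi
}

count_windows 1000 100 50   # hypothetical 1000-nt transcript; prints 19
```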

Outputs produced:

  • <output>.csv: Window-level summary with global ensemble mappings

  • <output>_summary.csv: Per-window/per-ensemble statistics (including noise)

  • <output>_global.csv: Aggregated global ensembles (read totals, stop/mutation counts, overlap statistics)

  • <output>_global_per_base.csv: Base-level detail for each global ensemble (read support + reactivity)
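
Assuming <output> stands for the -o path with its .csv extension removed (an interpretation of the list above, not something this guide states explicitly), the four expected file names for the example command can be derived as follows. The function name duet_outputs is hypothetical.

```shell
# Assumption: <output> is the -o path without its .csv extension.
duet_outputs() {
    prefix="${1%.csv}"
    for suffix in "" _summary _global _global_per_base; do
        echo "${prefix}${suffix}.csv"
    done
}

duet_outputs signal/04_duet/duet_windows.csv
```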

Step 5: Generate Visualizations

Create plots to visualize the results:

modtector plot \
    -M signal/01_count/mod_sample.csv \
    -U signal/01_count/unmod_sample.csv \
    -o signal/05_plot/ \
    -r signal/03_norm/normalized_reactivity.csv \
    -t 4

Parameters explained:

  • -M: Modified sample CSV

  • -U: Unmodified sample CSV

  • -o: Output directory for plots

  • -r: Reactivity file (optional)

  • -t: Number of threads

Step 6: Evaluate Accuracy

Evaluate the accuracy of your results using known secondary structure:

modtector evaluate \
    -r signal/03_norm/normalized_reactivity.csv \
    -s ref/Human_18S.dp \
    -o signal/06_evaluate/ \
    -g Human_18S \
    -S +

Parameters explained:

  • -r: Reactivity file

  • -s: Secondary structure file (.dp format)

  • -o: Output directory

  • -g: Gene ID

  • -S: Strand information (+ or -)

Expected Outputs

After running all steps, you should have:

signal/
├── 01_count/          # Raw pileup data
├── 02_reactivity/     # Reactivity scores
├── 03_norm/           # Normalized reactivity data
├── 04_duet/           # Duet window/global ensemble analysis results
├── 05_plot/           # Visualization plots
└── 06_evaluate/       # Accuracy evaluation
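
A quick way to confirm that each step produced something is to look for at least one non-empty file in every directory. This is a sketch with a hypothetical helper name (check_outputs); it checks only for the presence of files, not their contents.

```shell
# Hypothetical helper: report which step directories under a root
# (default: signal) contain at least one non-empty file.
# Returns non-zero if any directory is missing or has no such file.
check_outputs() {
    root="${1:-signal}"
    status=0
    for step in 01_count 02_reactivity 03_norm 04_duet 05_plot 06_evaluate; do
        dir="$root/$step"
        if [ -d "$dir" ] && [ -n "$(find "$dir" -type f -size +0c | head -n 1)" ]; then
            echo "ok       $dir"
        else
            echo "missing  $dir"
            status=1
        fi
    done
    return "$status"
}

# Usage:
# check_outputs signal
```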

Key Output Files

  1. Pileup Data (01_count/*.csv):

    • Raw signal counts for each position

    • Stop and mutation signals

    • Coverage information

  2. Reactivity Scores (02_reactivity/reactivity.csv):

    • Calculated reactivity values

    • Signal differences between samples

  3. Normalized Reactivity (03_norm/normalized_reactivity.csv):

    • Filtered and normalized reactivity signals

    • Outlier-corrected reactivity values

  4. Duet Ensemble Analysis (04_duet/*.csv):

    • Window-level ensemble assessments with status, occupancy, noise fraction, and dual-signal metrics

    • Per-window/per-ensemble summary CSV (including DBSCAN noise fractions)

    • Global ensemble summary (*_global.csv) aggregating reads/positions across overlapping windows

    • Per-base global ensemble detail (*_global_per_base.csv) with read support and reactivity values

  5. Plots (05_plot/*.svg):

    • Signal distribution plots

    • Reactivity visualization

    • Comparison charts

  6. Evaluation Results (06_evaluate/*.txt):

    • AUC scores

    • F1-scores

    • Accuracy metrics

    • ROC/PR curves

Understanding the Results

Reactivity Scores

  • High positive values: Likely modification sites

  • Near zero: No significant modification

  • Negative values: Possible artifacts or noise

Evaluation Metrics

  • AUC > 0.7: Good performance

  • F1-score > 0.6: Reasonable accuracy

  • Accuracy > 0.8: High confidence

Common Issues and Solutions

Issue 1: Low Coverage

Problem: Insufficient read depth

Solution:

  • Increase sequencing depth

  • Adjust coverage thresholds in plotting
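
One way to quantify read depth before changing thresholds is to compute the mean per-base depth of a BAM file. The sketch below assumes samtools is installed (it is not listed in this guide's prerequisites) and uses a hypothetical helper name, mean_depth.

```shell
# Hypothetical helper: mean per-base depth of a BAM file.
# `samtools depth -a` prints one line per position (reference, position, depth),
# including positions with zero coverage.
mean_depth() {
    samtools depth -a "$1" |
        awk '{ sum += $3 } END { if (NR > 0) printf "%.1f\n", sum / NR; else print "0.0" }'
}

# Usage:
# mean_depth data/HEK293_mod_Human.bam
```

Comparing the value for the modified and unmodified samples gives a first indication of whether one of them is under-sequenced.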

Issue 2: Poor Normalization

Problem: High background noise

Solution:

  • Try different normalization methods

  • Adjust window sizes

  • Filter low-quality positions

Issue 3: Low Evaluation Scores

Problem: Poor accuracy metrics

Solution:

  • Check data quality

  • Verify reference sequence

  • Ensure proper sample preparation

Next Steps

Now that you’ve completed the quick start:

  1. Explore Advanced Features:

    • Try different normalization methods

    • Experiment with various reactivity calculation methods

    • Use different statistical tests when comparing samples

  2. Analyze Your Own Data:

    • Prepare your BAM files

    • Obtain reference sequences

    • Get secondary structure information

  3. Read the Full Documentation

Getting Help

If you encounter issues:

  1. Check the Troubleshooting Guide

  2. Review the Command Reference

  3. Look at the Examples

  4. Open an issue on the GitHub repository

Tips for Success

  1. Start Small: Begin with a small dataset to test your pipeline

  2. Check Data Quality: Ensure your BAM files are properly aligned

  3. Use Appropriate References: Match your reference sequence to your data

  4. Monitor Resources: Large datasets may require significant memory

  5. Validate Results: Always check your results against known modifications