# User Guide

This comprehensive guide covers all aspects of using modtector, from basic concepts to advanced features.

## Table of Contents

1. [Understanding RNA Modifications](#understanding-rna-modifications)
2. [modtector Workflow](#moddetector-workflow)
3. [Data Preparation](#data-preparation)
4. [Signal Types](#signal-types)
5. [Normalization Methods](#normalization-methods)
6. [Reactivity Calculation](#reactivity-calculation)
7. [Evaluation Metrics](#evaluation-metrics)
8. [Visualization](#visualization)
9. [Advanced Features](#advanced-features)
10. [Best Practices](#best-practices)

## Understanding RNA Modifications

### What are RNA Modifications?

RNA modifications are chemical alterations to RNA nucleotides that can affect:
- RNA structure and stability
- Protein-RNA interactions
- Translation efficiency
- RNA localization
- Gene expression regulation

### Common RNA Modifications

- **m6A**: N6-methyladenosine (most common)
- **m1A**: N1-methyladenosine
- **m5C**: 5-methylcytosine
- **Ψ**: Pseudouridine
- **D**: Dihydrouridine

### Detection Methods

modtector supports detection based on:
- **Stop signals**: Pipeline truncation during reverse transcription
- **Mutation signals**: Base mutations during reverse transcription

## modtector Workflow

### Overview

modtector follows a systematic workflow:

```
Raw BAM Files → Pileup Analysis → Normalization → Reactivity Calculation → Evaluation
     ↓              ↓                ↓                    ↓                ↓
  Alignment      Signal Counts    Filtered Data    Modification Scores   Accuracy
```

### Step-by-Step Process

1. **Data Input**: BAM files, reference sequences, structure files
2. **Pileup Analysis**: Count stop and mutation signals
3. **Normalization**: Filter noise and outliers
4. **Reactivity Calculation**: Compare modified vs unmodified samples
5. **Visualization**: Generate plots and charts
6. **Evaluation**: Assess accuracy using known structures

## Data Preparation

### Input Requirements

#### BAM Files
- **Format**: BAM format with proper alignment
- **Quality**: High-quality alignments with minimal mismatches
- **Coverage**: Sufficient read depth (recommended >50x)
- **Paired samples**: Modified and unmodified samples

#### Reference Sequences
- **Format**: FASTA format
- **Quality**: High-quality reference sequences
- **Matching**: Must match your BAM file alignments
- **Completeness**: Include all target regions

#### Secondary Structure Files
- **Format**: Dot-bracket notation (.dp files)
- **Source**: Experimentally determined or predicted structures
- **Accuracy**: High-confidence structures for evaluation
- **Coverage**: Cover all regions of interest

### Data Quality Checks

Before running modtector, verify:

1. **BAM File Quality**:
   ```bash
   samtools flagstat sample.bam
   samtools view -c sample.bam
   ```

2. **Reference Sequence**:
   ```bash
   samtools faidx reference.fa
   ```

3. **Alignment Quality**:
   ```bash
   samtools view sample.bam | head -100
   ```

4. **BAM Index Files** (Required for count command):
   ```bash
   # Check if index exists
   ls -lh sample.bam.bai
   
   # Create index if missing
   samtools index -b sample.bam -o sample.bam.bai -@ 8
   ```

### Batch Processing and Single-cell Mode

#### Batch Mode (`--batch`)

Batch mode allows you to process multiple BAM files sequentially, each file independently.

**Use Cases:**
- Processing multiple samples that need separate analysis
- Bulk RNA-seq data with multiple replicates
- When each file should be treated as an independent sample

**Example:**
```bash
modtector count --batch \
  -b "/path/to/data/*sort.bam" \
  -f reference.fa \
  -o output_dir/ \
  -t 8 \
  -w 10000
```

**Requirements:**
- BAM files must be sorted and indexed (`.bai` files must exist)
- Glob pattern must match at least one file
- Output path must be a directory

#### Single-cell Unified Mode (`--single-cell`)

Single-cell unified mode is optimized for single-cell RNA-seq data, providing 2-3x performance improvement through unified processing.

**Key Features:**
- Unified data distribution scanning (once instead of per-file)
- Automatic cell label extraction from filenames
- Cross-file parallel processing
- Reduced I/O overhead

**Cell Label Extraction:**
- Supports RHX pattern: `*RHX672.sort.bam` → `RHX672`
- Falls back to last underscore-separated part: `sample_cell123.bam` → `cell123`
- Final fallback: base filename without extension

**Example:**
```bash
modtector count --single-cell \
  -b "/path/to/single_cell/*sort.bam" \
  -f reference.fa \
  -o output_dir/ \
  -t 8 \
  -w 10000 \
  -l batch.log
```

**Output:**
- Each cell generates a separate CSV file: `RHX672.csv`, `RHX673.csv`, etc.
- Log files are also generated per cell

**Performance Comparison:**
- Batch mode: ~N × single_file_time (sequential processing)
- Single-cell unified mode: ~(N × single_file_time) / 2-3 (unified processing)

## Signal Types

### Stop Signals

Stop signals occur when reverse transcription is truncated at modification sites.

#### Characteristics
- **Detection**: Pipeline truncation events
- **Sensitivity**: High sensitivity for certain modifications
- **Specificity**: Good specificity with proper controls
- **Background**: Low background noise

#### Analysis
```bash
modtector count -b sample.bam -f reference.fa -o output.csv
```

### Mutation Signals

Mutation signals occur when reverse transcription introduces errors at modification sites.

#### Characteristics
- **Detection**: Base mutation events
- **Sensitivity**: Moderate sensitivity
- **Specificity**: Good specificity with proper controls
- **Background**: Moderate background noise

#### Analysis
```bash
modtector count -b sample.bam -f reference.fa -o output.csv
```

### Combined Analysis

modtector can analyze both signal types simultaneously:

```bash
modtector reactivity -M mod.csv -U unmod.csv -O reactivity.csv -t both
```

## Normalization Methods

### Purpose

Normalization removes systematic biases and noise from the data:
- **Technical noise**: Sequencing artifacts
- **Biological noise**: Background signal
- **Systematic bias**: Sample preparation effects

### Available Methods

#### 1. Percentile28 Normalization
- **Method**: 28th percentile scaling
- **Use case**: General purpose normalization
- **Advantages**: Robust to outliers
- **Disadvantages**: May be conservative

```bash
modtector norm -i input.csv -o output.csv -m percentile28
```

#### 2. Winsor90 Normalization
- **Method**: 90th percentile winsorization
- **Use case**: High-quality data
- **Advantages**: Preserves signal distribution
- **Disadvantages**: Sensitive to outliers

```bash
modtector norm -i input.csv -o output.csv -m winsor90
```

#### 3. Boxplot Normalization
- **Method**: Boxplot-based outlier removal
- **Use case**: Data with many outliers
- **Advantages**: Effective outlier removal
- **Disadvantages**: May remove valid signals

```bash
modtector norm -i input.csv -o output.csv -m boxplot
```

### Window-Based Normalization

#### Fixed Windows
```bash
modtector norm -i input.csv -o output.csv -m winsor90 --window 1000
```

#### Dynamic Windows
```bash
modtector norm -i input.csv -o output.csv -m winsor90 --dynamic
```

#### Sliding Windows
```bash
modtector norm -i input.csv -o output.csv -m winsor90 --window 500 --window-offset 100
```

## Reactivity Calculation

### Purpose

Reactivity calculation quantifies the difference between modified and unmodified samples to identify modification sites.

### Calculation Methods

#### Stop Signal Methods

1. **Current Method**:
   ```bash
   modtector reactivity -M mod.csv -U unmod.csv -O output.csv -s current
   ```

2. **Ding Method**:
   ```bash
   modtector reactivity -M mod.csv -U unmod.csv -O output.csv -s ding
   ```

3. **Rouskin Method**:
   ```bash
   modtector reactivity -M mod.csv -U unmod.csv -O output.csv -s rouskin
   ```

#### Mutation Signal Methods

1. **Current Method**:
   ```bash
   modtector reactivity -M mod.csv -U unmod.csv -O output.csv -m current
   ```

2. **Siegfried Method**:
   ```bash
   modtector reactivity -M mod.csv -U unmod.csv -O output.csv -m siegfried
   ```

3. **Zubradt Method**:
   ```bash
   modtector reactivity -M mod.csv -U unmod.csv -O output.csv -m zubradt
   ```

### Parameters

#### Pseudocount
- **Purpose**: Avoid zero values in logarithmic calculations
- **Default**: 1
- **Range**: 0.1 - 10
- **Recommendation**: Use 0.5 for sparse data

```bash
modtector reactivity -M mod.csv -U unmod.csv -O output.csv --pseudocount 0.5
```

#### Maximum Score
- **Purpose**: Limit upper bound of reactivity values
- **Default**: 10
- **Range**: 5 - 50
- **Recommendation**: Use 5 for conservative analysis

```bash
modtector reactivity -M mod.csv -U unmod.csv -O output.csv --maxscore 5
```

## Evaluation Metrics

### Purpose

Evaluation metrics assess the accuracy of modification detection using known secondary structures.

### Available Metrics

#### 1. Area Under the Curve (AUC)
- **Range**: 0 - 1
- **Interpretation**: 
  - > 0.8: Excellent
  - 0.7 - 0.8: Good
  - 0.6 - 0.7: Fair
  - < 0.6: Poor

#### 2. F1-Score
- **Range**: 0 - 1
- **Interpretation**:
  - > 0.8: Excellent
  - 0.6 - 0.8: Good
  - 0.4 - 0.6: Fair
  - < 0.4: Poor

#### 3. Accuracy
- **Range**: 0 - 1
- **Interpretation**:
  - > 0.9: Excellent
  - 0.8 - 0.9: Good
  - 0.7 - 0.8: Fair
  - < 0.7: Poor

#### 4. Sensitivity (Recall)
- **Range**: 0 - 1
- **Interpretation**: Proportion of true modifications detected

#### 5. Specificity
- **Range**: 0 - 1
- **Interpretation**: Proportion of true non-modifications correctly identified

#### 6. Positive Predictive Value (PPV)
- **Range**: 0 - 1
- **Interpretation**: Proportion of predicted modifications that are true

#### 7. Negative Predictive Value (NPV)
- **Range**: 0 - 1
- **Interpretation**: Proportion of predicted non-modifications that are true

### Evaluation Process

```bash
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id
```

### Output Files

1. **Comprehensive Results**: `*_comprehensive.txt`
2. **ROC Curves**: `*_roc.svg`
3. **PR Curves**: `*_pr.svg`
4. **Combined Plots**: `*_combined_roc.svg`

## Visualization

### Purpose

Visualization helps interpret results and identify patterns in the data.

### Available Plots

#### 1. Signal Distribution Plots
- **Purpose**: Show signal distribution across positions
- **Format**: SVG
- **Content**: Stop and mutation signals

#### 2. Reactivity Plots
- **Purpose**: Display reactivity scores
- **Format**: SVG
- **Content**: Modification sites and scores

#### 3. ROC Curves
- **Purpose**: Show classification performance
- **Format**: SVG
- **Content**: True positive rate vs false positive rate

#### 4. PR Curves
- **Purpose**: Show precision-recall performance
- **Format**: SVG
- **Content**: Precision vs recall

#### 5. Comparison Plots
- **Purpose**: Compare different samples or methods
- **Format**: SVG

#### 6. RNA Structure SVG Plots
- **Purpose**: Visualize reactivity data on RNA secondary structure
- **Format**: SVG
- **Content**: Colored circles mapped to structure positions
- **Features**: 
  - Multi-signal support (stop, mutation, etc.)
  - Strand selection (+, -, both)
  - Base filtering (A, T, C, G)
  - Alignment support
  - Color-coded reactivity scores
- **Content**: Side-by-side comparisons

#### 7. Interactive HTML Visualizations
- **Purpose**: Interactive web-based visualization of reactivity data on RNA structure
- **Format**: HTML (with embedded SVG and JavaScript)
- **Content**: Interactive RNA structure with reactivity data overlay
- **Features**:
  - **Hover Tooltips**: Display position, base type, and reactivity values on hover
  - **Zoom and Pan**: Mouse wheel zoom and click-drag panning
  - **Filtering Controls**:
    - Reactivity threshold filtering (min/max sliders)
    - Base type filtering (A, T, C, G, or all)
  - **Highlighting**: Click circles to highlight specific positions
  - **Export**: Export the current view as SVG
  - **Reset View**: Reset zoom and filters to default
- **Usage**: Add `--interactive` flag to the plot command to generate HTML instead of static SVG

### SVG Plotting

#### Overview
RNA structure SVG plots provide an intuitive way to visualize reactivity data directly on the RNA secondary structure. This helps researchers identify modification sites and understand the relationship between structure and reactivity.

#### SVG-Only Mode
When you only need SVG plots without regular distribution plots, use the simplified command:

```bash
modtector plot \
    -o svg_output/ \
    --svg-template rna_structure.svg \
    --reactivity reactivity_data.csv
```

#### Multi-Signal SVG Plotting
For data with multiple signal types (stop, mutation), plot all signals:

```bash
modtector plot \
    -o svg_output/ \
    --svg-template rna_structure.svg \
    --reactivity reactivity_data.csv \
    --svg-bases ATGC \
    --svg-signal all \
    --svg-strand +
```

#### Interactive HTML Visualization
Generate interactive HTML visualizations with zoom, pan, filtering, and tooltip features:

```bash
modtector plot \
    -o interactive_output/ \
    --svg-template rna_structure.svg \
    --reactivity reactivity_data.csv \
    --interactive \
    --svg-signal stop \
    --svg-bases ACGT
```

The interactive HTML file can be opened in any modern web browser and provides:
- **Mouse wheel zoom**: Scroll to zoom in/out
- **Click and drag**: Pan around the structure
- **Hover tooltips**: See position, base, and reactivity values
- **Filtering**: Adjust reactivity thresholds and filter by base type
- **Highlighting**: Click circles to highlight specific positions
- **Export**: Download the current view as SVG

#### Single Signal SVG Plotting
For single signal data, specify the signal type:

```bash
modtector plot \
    -o svg_output/ \
    --svg-template rna_structure.svg \
    --reactivity reactivity_data.csv \
    --svg-bases AC \
    --svg-signal stop \
    --svg-strand +
```

#### SVG Plotting with Alignment
When sequence alignment is needed:

```bash
modtector plot \
    -o svg_output/ \
    --svg-template rna_structure.svg \
    --reactivity reactivity_data.csv \
    --svg-bases ATGC \
    --svg-signal all \
    --svg-strand + \
    --svg-ref reference.fa \
    --svg-max-shift 10
```

#### SVG Output Files
- **Multi-signal**: `rna_structure_colored_[signal_type].svg`
- **Single-signal**: `rna_structure_colored_score.svg`

### Plotting Options

#### Basic Plotting
```bash
modtector plot -M mod.csv -U unmod.csv -o plots/
```

#### With Reactivity Data
```bash
modtector plot -M mod.csv -U unmod.csv -o plots/ -r reactivity.csv
```

#### Custom Thresholds
```bash
modtector plot -M mod.csv -U unmod.csv -o plots/ -c 0.3 -d 100
```

#### With Genome Annotation
```bash
modtector plot -M mod.csv -U unmod.csv -o plots/ -g annotation.gff
```

## Advanced Features

### Multi-threading

modtector supports parallel processing for improved performance:

```bash
# Use 8 threads
modtector count -b sample.bam -f reference.fa -o output.csv -t 8
modtector reactivity -M mod.csv -U unmod.csv -O output.csv -t 8
modtector plot -M mod.csv -U unmod.csv -o plots/ -t 8
```

### Window Analysis

#### Fixed Windows
```bash
modtector count -b sample.bam -f reference.fa -o output.csv -w 1000
```

#### Dynamic Windows
```bash
modtector norm -i input.csv -o output.csv -m winsor90 --dynamic
```

### Base-Specific Analysis

Target specific bases for analysis:

```bash
# Analyze only A and C bases
modtector norm -i input.csv -o output.csv -m winsor90 --bases AC

# Analyze only G and T bases
modtector norm -i input.csv -o output.csv -m winsor90 --bases GT
```

### Statistical Testing

Compare samples using different statistical tests:

```bash
# Student's t-test
modtector compare -M mod.csv -U unmod.csv -o comparison.csv -t t-test

# Mann-Whitney U test
modtector compare -M mod.csv -U unmod.csv -o comparison.csv -t mann-whitney

# Wilcoxon signed-rank test
modtector compare -M mod.csv -U unmod.csv -o comparison.csv -t wilcoxon
```

### Auto-shift Correction

Automatically correct for sequence length differences:

```bash
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id --auto-shift
```

### Base Matching

Use intelligent base matching for T/U equivalence:

```bash
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id --base-matching
```

## Best Practices

### Data Quality

1. **High-quality alignments**: Ensure BAM files are properly aligned
2. **Sufficient coverage**: Use at least 50x coverage
3. **Proper controls**: Include unmodified control samples
4. **Quality filtering**: Remove low-quality reads and positions

### Parameter Selection

1. **Normalization method**: Choose based on data characteristics
2. **Window size**: Balance between noise reduction and signal preservation
3. **Thresholds**: Adjust based on expected signal levels
4. **Statistical tests**: Select appropriate test for your data

### Performance Optimization

1. **Thread count**: Match to available CPU cores
2. **Memory usage**: Monitor RAM usage for large datasets
3. **Disk space**: Ensure sufficient storage for output files
4. **Batch processing**: Process multiple samples in parallel

### Result Interpretation

1. **Check evaluation metrics**: Ensure good performance scores
2. **Validate with known sites**: Compare with literature
3. **Consider biological context**: Interpret results in context
4. **Reproducibility**: Document parameters and methods

### Troubleshooting

1. **Low coverage**: Increase sequencing depth
2. **Poor normalization**: Try different methods
3. **Low evaluation scores**: Check data quality
4. **Memory issues**: Reduce thread count or dataset size

### Documentation

1. **Record parameters**: Keep track of all settings
2. **Version control**: Use version control for reproducibility
3. **Log files**: Save and review log files
4. **Backup results**: Keep copies of important results

## Common Workflows

### Basic Workflow
```bash
# 1. Generate pileup data
modtector count -b mod.bam -f reference.fa -o mod_count.csv -t 4
modtector count -b unmod.bam -f reference.fa -o unmod_count.csv -t 4

# 2. Normalize signals
modtector norm -i mod_count.csv -o mod_norm.csv -m winsor90 --bases AC
modtector norm -i unmod_count.csv -o unmod_norm.csv -m winsor90 --bases AC

# 3. Calculate reactivity
modtector reactivity -M mod_norm.csv -U unmod_norm.csv -O reactivity.csv -t 4

# 4. Generate plots
modtector plot -M mod_norm.csv -U unmod_norm.csv -o plots/ -r reactivity.csv -t 4

# 5. Evaluate accuracy
modtector evaluate -r reactivity.csv -s structure.dp -o results/ -g gene_id
```

### Advanced Workflow
```bash
# 1. Generate pileup data with windowing
modtector count -b mod.bam -f reference.fa -o mod_count.csv -w 1000 -t 8
modtector count -b unmod.bam -f reference.fa -o unmod_count.csv -w 1000 -t 8

# 2. Normalize with dynamic windows
modtector norm -i mod_count.csv -o mod_norm.csv -m winsor90 --dynamic --bases AC
modtector norm -i unmod_count.csv -o unmod_norm.csv -m winsor90 --dynamic --bases AC

# 3. Calculate reactivity with custom parameters
modtector reactivity -M mod_norm.csv -U unmod_norm.csv -O reactivity.csv \
    -s ding -m siegfried --pseudocount 0.5 --maxscore 5 -t 8

# 4. Compare samples
modtector compare -M mod_norm.csv -U unmod_norm.csv -o comparison.csv \
    -t mann-whitney -d 20 -f 1.5

# 5. Generate comprehensive plots
modtector plot -M mod_norm.csv -U unmod_norm.csv -o plots/ \
    -r reactivity.csv -c 0.3 -d 100 -t 8

# 6. Evaluate with auto-shift correction
modtector evaluate -r reactivity.csv -s structure.dp -o results/ \
    -g gene_id --auto-shift --base-matching
```

This user guide provides comprehensive information for using modtector effectively. For specific command details, refer to the [Command Reference](commands.md).