# Quick Start Guide

This guide walks through a complete modtector example, from raw data to final results.

## Prerequisites

- modtector installed (see [Installation Guide](installation.md))
- Sample BAM files (modified and unmodified samples)
- Reference FASTA file
- Secondary structure file (for evaluation)

## Example Dataset

This quick start uses the example data provided in the repository:

```bash
# Navigate to the example directory
cd example/

# Check available data
ls data/
ls ref/
```

## Step 1: Generate Pileup Data

First, process the BAM files to generate pileup data:

```bash
# Process modified sample
modtector count \
    -b data/HEK293_mod_Human.bam \
    -f ref/Human_18S.fa \
    -o signal/01_count/mod_sample.csv \
    -t 4

# Process unmodified sample
modtector count \
    -b data/HEK293_unmod_Human.bam \
    -f ref/Human_18S.fa \
    -o signal/01_count/unmod_sample.csv \
    -t 4
```

**Parameters explained:**

- `-b`: Input BAM file
- `-f`: Reference FASTA file
- `-o`: Output CSV file
- `-t`: Number of threads for parallel processing

## Step 2: Calculate Reactivity Scores

Calculate reactivity scores by comparing the modified and unmodified samples:

```bash
modtector reactivity \
    -M signal/01_count/mod_sample.csv \
    -U signal/01_count/unmod_sample.csv \
    -O signal/02_reactivity/reactivity.csv \
    -s current \
    -m current \
    -t 4
```

**Parameters explained:**

- `-M`: Modified sample CSV
- `-U`: Unmodified sample CSV
- `-O`: Output reactivity file
- `-s`: Stop signal method
- `-m`: Mutation signal method
- `-t`: Number of threads

## Step 3: Normalize Reactivity Signals

Normalize the reactivity data to remove noise and outliers:

```bash
modtector norm \
    -i signal/02_reactivity/reactivity.csv \
    -o signal/03_norm/normalized_reactivity.csv \
    -m winsor90
```

**Parameters explained:**

- `-i`: Input reactivity CSV (the output of Step 2)
- `-o`: Output normalized reactivity CSV
- `-m`: Normalization method (here `winsor90`)

## Step 4: Duet Ensemble Analysis

Infer dynamic ensembles by combining normalized reactivity with read-level co-variation. Note that the BAM and FASTA paths in this command differ from those used in Steps 1 to 3; substitute the files that match your sample.

```bash
modtector duet \
    -i signal/03_norm/normalized_reactivity.csv \
    -b signal/00_bam/sample.sort.bam \
    -f reference/transcript.fa \
    -o signal/04_duet/duet_windows.csv \
    --epsilon 0.85 \
    --min-samples 8 \
    --window-size 100 \
    --window-step 50
```

**Parameters explained:**

- `-i`: Normalized reactivity CSV
- `-b`: Sorted BAM file for the same sample
- `-f`: Reference FASTA sequence
- `-o`: Window-level CSV with Duet ensemble statistics
- `--epsilon`: DBSCAN radius in standardized feature space
- `--min-samples`: Minimum neighbours required to form an ensemble core point
- `--window-size`: Sliding-window size in nucleotides (default `100`)
- `--window-step`: Sliding-window step in nucleotides (default `50`)

**Outputs produced:**

- `.csv`: Window-level summary with global ensemble mappings
- `_summary.csv`: Per-window/per-ensemble statistics (including noise)
- `_global.csv`: Aggregated global ensembles (read totals, stop/mutation counts, overlap statistics)
- `_global_per_base.csv`: Base-level detail for each global ensemble (read support + reactivity)

## Step 5: Generate Visualizations

Create plots to visualize the results:

```bash
modtector plot \
    -M signal/01_count/mod_sample.csv \
    -U signal/01_count/unmod_sample.csv \
    -o signal/05_plot/ \
    -r signal/03_norm/normalized_reactivity.csv \
    -t 4
```

**Parameters explained:**

- `-M`: Modified sample CSV
- `-U`: Unmodified sample CSV
- `-o`: Output directory for plots
- `-r`: Reactivity file (optional)
- `-t`: Number of threads

## Step 6: Evaluate Accuracy

Evaluate the accuracy of your results against a known secondary structure:

```bash
modtector evaluate \
    -r signal/03_norm/normalized_reactivity.csv \
    -s ref/Human_18S.dp \
    -o signal/06_evaluate/ \
    -g Human_18S \
    -S +
```

**Parameters explained:**

- `-r`: Reactivity file
- `-s`: Secondary structure file (.dp format)
- `-o`: Output directory
- `-g`: Gene ID
- `-S`: Strand information (+ or -)

## Expected Outputs

After running all steps, you should have:

```
signal/
├── 01_count/        # Raw pileup data
├── 02_reactivity/   # Reactivity scores
├── 03_norm/         # Normalized reactivity data
├── 04_duet/         # Duet window/global ensemble analysis results
├── 05_plot/         # Visualization plots
└── 06_evaluate/     # Accuracy evaluation
```

### Key Output Files

1. **Pileup Data** (`01_count/*.csv`):
   - Raw signal counts for each position
   - Stop and mutation signals
   - Coverage information

2. **Reactivity Scores** (`02_reactivity/reactivity.csv`):
   - Calculated reactivity values
   - Signal differences between samples

3. **Normalized Reactivity** (`03_norm/normalized_reactivity.csv`):
   - Filtered and normalized reactivity signals
   - Outlier-corrected reactivity values

4. **Duet Ensemble Analysis** (`04_duet/*.csv`):
   - Window-level ensemble assessments with status, occupancy, noise fraction, and dual-signal metrics
   - Per-window/per-ensemble summary CSV (including DBSCAN noise fractions)
   - Global ensemble summary (`*_global.csv`) aggregating reads/positions across overlapping windows
   - Per-base global ensemble detail (`*_global_per_base.csv`) with read support and reactivity values

5. **Plots** (`05_plot/*.svg`):
   - Signal distribution plots
   - Reactivity visualization
   - Comparison charts

6. **Evaluation Results** (`06_evaluate/*.txt`):
   - AUC scores
   - F1-scores
   - Accuracy metrics
   - ROC/PR curves

## Understanding the Results

### Reactivity Scores

- **High positive values**: Likely modification sites
- **Near zero**: No significant modification
- **Negative values**: Possible artifacts or noise

### Evaluation Metrics

- **AUC > 0.7**: Good performance
- **F1-score > 0.6**: Reasonable accuracy
- **Accuracy > 0.8**: High confidence

## Common Issues and Solutions

### Issue 1: Low Coverage

**Problem**: Insufficient read depth

**Solution**:

- Increase sequencing depth
- Adjust coverage thresholds in plotting

### Issue 2: Poor Normalization

**Problem**: High background noise

**Solution**:

- Try different normalization methods
- Adjust window sizes
- Filter low-quality positions

### Issue 3: Low Evaluation Scores

**Problem**: Poor accuracy metrics

**Solution**:

- Check data quality
- Verify the reference sequence
- Ensure proper sample preparation

## Next Steps

Now that you've completed the quick start:

1. **Explore Advanced Features**:
   - Try different normalization methods
   - Experiment with various reactivity calculation methods
   - Use different statistical tests in comparison

2. **Analyze Your Own Data**:
   - Prepare your BAM files
   - Obtain reference sequences
   - Get secondary structure information

3. **Read the Full Documentation**:
   - [User Guide](user-guide.md) for detailed explanations
   - [Command Reference](commands.md) for all options
   - [Examples](examples.md) for more use cases

## Getting Help

If you encounter issues:

1. Check the [Troubleshooting Guide](troubleshooting.md)
2. Review the [Command Reference](commands.md)
3. Look at the [Examples](examples.md)
4. Open an issue on the [GitHub repository](https://github.com/TongZhou2017/modtector/issues)

## Tips for Success

1. **Start Small**: Begin with a small dataset to test your pipeline
2. **Check Data Quality**: Ensure your BAM files are properly aligned
3. **Use Appropriate References**: Match your reference sequence to your data
4. **Monitor Resources**: Large datasets may require significant memory
5. **Validate Results**: Always check your results against known modifications
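
When testing the pipeline on a small dataset, it can help to chain the steps in one script so that a failure stops the run early. The following is a minimal sketch, not an official wrapper: it reuses only the commands and flags shown in this guide, is meant to be run from the `example/` directory, and skips the Duet step because that step's BAM and FASTA inputs are sample-specific. The `OUT` and `THREADS` variables are hypothetical conveniences, and creating the output directories up front rests on the assumption that modtector may not create them itself.

```bash
#!/usr/bin/env bash
# Sketch: chain Steps 1-3, 5 and 6 of this guide (run from example/).
set -euo pipefail

OUT=signal    # hypothetical variable: output root used throughout the guide
THREADS=4     # hypothetical variable: mirrors "-t 4" in the commands above

# Assumption: modtector may not create missing output directories,
# so create the whole layout first (harmless if they already exist).
for step in 01_count 02_reactivity 03_norm 04_duet 05_plot 06_evaluate; do
    mkdir -p "$OUT/$step"
done

if ! command -v modtector >/dev/null 2>&1; then
    echo "modtector not found on PATH; only the output directories were created" >&2
else
    # Step 1: pileup counts for the modified and unmodified samples
    for sample in mod unmod; do
        modtector count \
            -b "data/HEK293_${sample}_Human.bam" \
            -f ref/Human_18S.fa \
            -o "$OUT/01_count/${sample}_sample.csv" \
            -t "$THREADS"
    done

    # Step 2: reactivity from the two count tables
    modtector reactivity \
        -M "$OUT/01_count/mod_sample.csv" \
        -U "$OUT/01_count/unmod_sample.csv" \
        -O "$OUT/02_reactivity/reactivity.csv" \
        -s current -m current -t "$THREADS"

    # Step 3: normalization
    modtector norm \
        -i "$OUT/02_reactivity/reactivity.csv" \
        -o "$OUT/03_norm/normalized_reactivity.csv" \
        -m winsor90

    # Step 5: plots
    modtector plot \
        -M "$OUT/01_count/mod_sample.csv" \
        -U "$OUT/01_count/unmod_sample.csv" \
        -o "$OUT/05_plot/" \
        -r "$OUT/03_norm/normalized_reactivity.csv" \
        -t "$THREADS"

    # Step 6: evaluation against the known structure
    modtector evaluate \
        -r "$OUT/03_norm/normalized_reactivity.csv" \
        -s ref/Human_18S.dp \
        -o "$OUT/06_evaluate/" \
        -g Human_18S -S +
fi
```

Because of `set -euo pipefail`, the script exits at the first failing command, which makes it easier to see which step needs attention before moving on to your own data.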