🎙️ DTLN Voice Denoising
Real-time speech enhancement optimized for edge deployment with TensorFlow Lite.
🚀 Features:
- Optimized for Edge AI: Lightweight model, under 100 KB
- Real-time Processing: Low latency for streaming audio
- INT8 Quantization: Efficient deployment with 8-bit precision
- TensorFlow Lite: Ready for microcontroller deployment
🎤 Input
📥 Output
Train Your Own DTLN Model
Upload your datasets and configure training parameters.
⚠️ Note: Training requires significant computational resources and cannot run directly on Hugging Face Spaces. This interface helps you prepare your data and provides the exact commands to run training locally.
📦 Datasets
Upload a ZIP file containing clean speech WAV files
Upload a ZIP file containing noise WAV files
⚙️ Training Parameters
DTLN Architecture
The Dual-signal Transformation LSTM Network (DTLN) is a real-time speech enhancement model (a conceptual sketch of the two-stage masking follows the list):
- Two-stage processing: Magnitude estimation → Final enhancement
- LSTM-based: Captures temporal dependencies in speech
- <1M parameters: Lightweight for edge deployment
- Frequency + Time domain: Processes both domains for better quality
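To make the two-stage idea concrete, here is a minimal NumPy sketch of how a sigmoid mask gates the STFT magnitude at each stage. The random masks stand in for the LSTM + Dense outputs and are purely illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative shapes: (time frames, 257 frequency bins)
noisy_mag = np.abs(np.random.randn(100, 257))

# Stage 1: a sigmoid mask (stand-in for an LSTM + Dense output) gates each bin
mask1 = sigmoid(np.random.randn(100, 257))
enhanced1 = noisy_mag * mask1

# Stage 2: a second mask refines the stage-1 estimate
mask2 = sigmoid(np.random.randn(100, 257))
enhanced = enhanced1 * mask2
```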
Edge Hardware Acceleration
Compatible with various edge AI accelerators (a host-side inference sketch follows the list):
- NPU: Arm Ethos-U series
- CPU: Arm Cortex-M series
- Quantization: 8-bit and 16-bit integer operations
- Memory: Optimized for constrained devices
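Before targeting an accelerator, the quantized model can be exercised on the host with the standard TFLite interpreter. A sketch, assuming the model path from the Quick Start below and whatever input shape the converted model reports:

```python
import numpy as np
import tensorflow as tf

# Load the INT8 model produced by convert_to_tflite.py
interpreter = tf.lite.Interpreter(model_path="./models/dtln_ethos_u55.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Quantize a float frame using the model's own scale/zero-point
scale, zero_point = inp["quantization"]
frame = np.random.randn(*inp["shape"]).astype(np.float32)  # stand-in audio frame
q_frame = np.clip(np.round(frame / scale + zero_point), -128, 127).astype(np.int8)

interpreter.set_tensor(inp["index"], q_frame)
interpreter.invoke()
enhanced = interpreter.get_tensor(out["index"])
```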
Performance Targets
| Metric | Value |
|---|---|
| Model Size | ~100 KB (INT8) |
| Latency | 3-6 ms |
| Power | 30-40 mW |
| SNR Improvement | 10-15 dB |
⚠️ Demo Note: This Space uses spectral subtraction for demonstration. Download the full implementation to train and deploy the actual DTLN model!
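For reference, a minimal spectral-subtraction sketch in the spirit of the demo. The noise estimate from the first frames and the 5% spectral floor are assumptions, not the Space's exact code:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(audio, sr=16000, noise_frames=10):
    # Same framing as the DTLN pipeline: 512-sample frames, 128-sample hop
    _, _, Z = stft(audio, fs=sr, nperseg=512, noverlap=512 - 128)
    mag, phase = np.abs(Z), np.angle(Z)
    # Assume the first few frames are noise-only and estimate the noise floor
    noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # Subtract the estimate, keeping a small spectral floor to limit artifacts
    clean = np.maximum(mag - noise, 0.05 * mag)
    _, out = istft(clean * np.exp(1j * phase), fs=sr,
                   nperseg=512, noverlap=512 - 128)
    return out
```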
Quick Start
```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Train model
python train_dtln.py \
    --clean-dir ./data/clean_speech \
    --noise-dir ./data/noise \
    --epochs 50 \
    --batch-size 16

# 3. Convert to TFLite INT8
python convert_to_tflite.py \
    --model ./models/best_model.h5 \
    --output ./models/dtln_ethos_u55.tflite \
    --calibration-dir ./data/clean_speech

# 4. (Optional) Optimize for hardware accelerator
vela --accelerator-config ethos-u55-256 \
    --system-config Ethos_U55_High_End_Embedded \
    ./models/dtln_ethos_u55.tflite
```
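The conversion step in `convert_to_tflite.py` ships with the repo; the sketch below shows what INT8 post-training quantization with a calibration set typically looks like in TensorFlow. The (1, 512) input shape and random calibration frames are placeholders, where the real script reads frames from `--calibration-dir`:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("./models/best_model.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset():
    # In practice, yield a few hundred real frames from the calibration set;
    # random frames only keep this sketch self-contained
    for _ in range(100):
        yield [np.random.randn(1, 512).astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("./models/dtln_ethos_u55.tflite", "wb") as f:
    f.write(converter.convert())
```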
Download Full Implementation
The complete training and deployment code is available in the Files tab →
Includes:
- `dtln_ethos_u55.py` - Model architecture
- `train_dtln.py` - Training with QAT
- `convert_to_tflite.py` - TFLite conversion
- `alif_e7_voice_denoising_guide.md` - Complete guide
- `example_usage.py` - Usage examples
Resources
Model Architecture Details
Input: Raw audio waveform at 16 kHz (a framing sketch follows the list)
- Frame length: 512 samples (32 ms)
- Frame shift: 128 samples (8 ms)
- Frequency bins: 257 (FFT size 512)
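These parameters translate directly into a `tf.signal.stft` call; a quick sketch to confirm the 257 bins:

```python
import tensorflow as tf

FRAME_LENGTH = 512  # 32 ms at 16 kHz
FRAME_STEP = 128    # 8 ms hop
FFT_SIZE = 512      # 512 // 2 + 1 = 257 frequency bins

audio = tf.random.normal([1, 16000])  # one second of stand-in audio
spec = tf.signal.stft(audio, frame_length=FRAME_LENGTH,
                      frame_step=FRAME_STEP, fft_length=FFT_SIZE)
print(spec.shape)  # (1, frames, 257)
```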
Network Structure:
```
Input Audio (16 kHz)
        ↓
  STFT (512-point)
        ↓
     [Stage 1]
LSTM (128 units) → Dense (sigmoid) → Magnitude Mask 1
        ↓
Enhanced Magnitude 1
        ↓
     [Stage 2]
LSTM (128 units) → Dense (sigmoid) → Magnitude Mask 2
        ↓
Enhanced Magnitude
        ↓
      ISTFT
        ↓
Output Audio (16 kHz)
```
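A minimal Keras sketch of the structure above; it mirrors the diagram (two LSTM + sigmoid-Dense masking stages on a 257-bin magnitude input) but is illustrative rather than the repo's `dtln_ethos_u55.py`:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dtln_like(bins=257, units=128):
    mag = layers.Input(shape=(None, bins), name="stft_magnitude")
    # Stage 1: LSTM -> Dense (sigmoid) -> mask applied to the noisy magnitude
    x = layers.LSTM(units, return_sequences=True)(mag)
    mask1 = layers.Dense(bins, activation="sigmoid")(x)
    enhanced1 = layers.Multiply()([mag, mask1])
    # Stage 2: a second mask refines the stage-1 estimate
    y = layers.LSTM(units, return_sequences=True)(enhanced1)
    mask2 = layers.Dense(bins, activation="sigmoid")(y)
    enhanced = layers.Multiply()([enhanced1, mask2])
    return tf.keras.Model(mag, enhanced, name="dtln_like")

build_dtln_like().summary()
```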
Training Configuration (a sketch of the combined loss follows the list):
- Loss: Combined time + frequency domain MSE
- Optimizer: Adam (lr=0.001)
- Batch size: 16
- Epochs: 50
- Quantization: INT8 post-training quantization
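A hedged sketch of how such a combined time + frequency loss can be composed; the equal weighting is an assumption, not a published value:

```python
import tensorflow as tf

def combined_mse_loss(clean, enhanced, alpha=0.5):
    # Time-domain MSE on the raw waveforms
    time_mse = tf.reduce_mean(tf.square(clean - enhanced))
    # Frequency-domain MSE on STFT magnitudes (512/128 framing as above)
    clean_mag = tf.abs(tf.signal.stft(clean, 512, 128))
    enhanced_mag = tf.abs(tf.signal.stft(enhanced, 512, 128))
    freq_mse = tf.reduce_mean(tf.square(clean_mag - enhanced_mag))
    # alpha balances the two domains (0.5 is an assumed weight)
    return alpha * time_mse + (1.0 - alpha) * freq_mse
```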
Memory Footprint:
- Model weights: ~80 KB (INT8)
- Tensor arena: ~100 KB
- Audio buffers: ~2 KB
- Total: ~200 KB
Edge Device Deployment
Hardware Utilization:
- NPU/CPU: For LSTM inference
- CPU: For FFT operations (CMSIS-DSP)
- Memory: Optimized buffer management
- Peripherals: I2S/PDM for audio I/O
Power Profile:
- Active inference: 30-40 mW
- Idle: <1 mW
- Average (50% duty): ~15-20 mW
Real-time Constraints:
- Frame processing: 8 ms available (one frame shift)
- FFT: ~1 ms
- NPU inference: ~4 ms
- IFFT + overhead: ~2 ms
- Margin: ~1 ms
📚 Citation
If you use this model in your research, please cite:
```bibtex
@inproceedings{westhausen2020dtln,
  title={Dual-signal transformation LSTM network for real-time noise suppression},
  author={Westhausen, Nils L. and Meyer, Bernd T.},
  booktitle={Proc. Interspeech},
  year={2020}
}
```