๐ŸŽ™๏ธ DTLN Voice Denoising

Real-time speech enhancement optimized for edge deployment with TensorFlow Lite.

🚀 Features:

  • Optimized for Edge AI: Lightweight model under 100 KB
  • Real-time Processing: Low latency for streaming audio
  • INT8 Quantization: Efficient deployment with 8-bit precision
  • TensorFlow Lite: Ready for microcontroller deployment

DTLN Architecture

The Dual-signal Transformation LSTM Network (DTLN) is a real-time speech enhancement model:

  • Two-stage processing: Magnitude estimation → Final enhancement
  • LSTM-based: Captures temporal dependencies in speech
  • <1M parameters: Lightweight for edge deployment
  • Frequency + Time domain: Processes both domains for better quality

Edge Hardware Acceleration

Compatible with various edge AI accelerators:

  • NPU: Arm Ethos-U series
  • CPU: Arm Cortex-M series
  • Quantization: 8-bit and 16-bit integer operations
  • Memory: Optimized for constrained devices

Performance Targets

Metric             Value
Model Size         ~100 KB (INT8)
Latency            3-6 ms
Power              30-40 mW
SNR Improvement    10-15 dB

⚠️ Demo Note: This Space uses spectral subtraction for demonstration. Download the full implementation to train and deploy the actual DTLN model!
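
For reference, a minimal NumPy sketch of the spectral-subtraction approach the demo uses, assuming the first few frames of the recording are noise-only (the function and its parameters are illustrative, not the Space's actual code):

import numpy as np

def spectral_subtract(noisy, frame=512, hop=128, noise_frames=10, floor=0.05):
    # Toy spectral subtraction over 512/128 framing (matches the DTLN settings below).
    win = np.hanning(frame)
    n = 1 + (len(noisy) - frame) // hop
    spec = np.stack([np.fft.rfft(win * noisy[i*hop : i*hop+frame]) for i in range(n)])
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[:noise_frames].mean(axis=0)            # noise estimate from leading frames
    clean_mag = np.maximum(mag - noise_mag, floor * mag)   # subtract, keep a spectral floor
    frames = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame)
    out = np.zeros(len(noisy))
    for i in range(n):                                     # windowed overlap-add resynthesis
        out[i*hop : i*hop+frame] += win * frames[i]
    return out / 1.5                                       # Hann^2 at 75% overlap sums to ~1.5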

Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Train model
python train_dtln.py \
    --clean-dir ./data/clean_speech \
    --noise-dir ./data/noise \
    --epochs 50 \
    --batch-size 16

# 3. Convert to TFLite INT8
python convert_to_tflite.py \
    --model ./models/best_model.h5 \
    --output ./models/dtln_ethos_u55.tflite \
    --calibration-dir ./data/clean_speech

# 4. (Optional) Optimize for hardware accelerator
vela --accelerator-config ethos-u55-256 \
     --system-config Ethos_U55_High_End_Embedded \
     ./models/dtln_ethos_u55.tflite
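
Under the hood, convert_to_tflite.py presumably performs standard TensorFlow Lite post-training INT8 quantization. A minimal sketch using the public TFLiteConverter API (the representative-data generator is a hypothetical stand-in for real --calibration-dir audio, and stateful LSTMs may need extra conversion flags in practice):

import tensorflow as tf

def rep_data_gen():
    # Hypothetical calibration data; real code would yield magnitude
    # frames computed from the clean-speech calibration set.
    for _ in range(100):
        yield [tf.random.uniform((1, 1, 257), dtype=tf.float32)]

model = tf.keras.models.load_model("./models/best_model.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
open("./models/dtln_ethos_u55.tflite", "wb").write(converter.convert())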

Download Full Implementation

The complete training and deployment code is available in the Files tab →

Includes:

  • dtln_ethos_u55.py - Model architecture
  • train_dtln.py - Training with QAT
  • convert_to_tflite.py - TFLite conversion
  • alif_e7_voice_denoising_guide.md - Complete guide
  • example_usage.py - Usage examples
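
As a rough sketch of what example_usage.py-style inference could look like on a desktop host (on-device you would use TensorFlow Lite Micro in C++ instead; the zero frame below is a placeholder for real quantized audio features):

import numpy as np
import tensorflow as tf

interp = tf.lite.Interpreter(model_path="./models/dtln_ethos_u55.tflite")
interp.allocate_tensors()
inp = interp.get_input_details()[0]
out = interp.get_output_details()[0]

frame = np.zeros(inp["shape"], dtype=inp["dtype"])  # one placeholder input frame
interp.set_tensor(inp["index"], frame)
interp.invoke()                                     # run a single inference step
enhanced = interp.get_tensor(out["index"])
print(enhanced.shape)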

Model Architecture Details

Input: Raw audio waveform @ 16 kHz

  • Frame length: 512 samples (32 ms)
  • Frame shift: 128 samples (8 ms)
  • Frequency bins: 257 (FFT size 512)
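
These settings map directly onto a 512/128 STFT; a quick sanity check of the frame geometry, assuming TensorFlow's tf.signal.stft (on-device the FFT would run via CMSIS-DSP instead):

import tensorflow as tf

audio = tf.zeros(16000)  # 1 second of 16 kHz audio
spec = tf.signal.stft(audio, frame_length=512, frame_step=128, fft_length=512)
print(spec.shape)        # (122, 257): one frame per 8 ms hop, 257 frequency bins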

Network Structure:

Input Audio (16 kHz)
    ↓
STFT (512-point)
    ↓
[Stage 1]
LSTM (128 units) → Dense (sigmoid) → Magnitude Mask 1
    ↓
Enhanced Magnitude 1
    ↓
[Stage 2]
LSTM (128 units) → Dense (sigmoid) → Magnitude Mask 2
    ↓
Enhanced Magnitude
    ↓
ISTFT
    ↓
Output Audio (16 kHz)
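
A minimal Keras sketch of this two-stage masking structure, operating on magnitude spectrograms (a simplified stand-in for dtln_ethos_u55.py; the real pipeline also carries phase through the ISTFT):

import tensorflow as tf

def build_dtln_like(bins=257, lstm_units=128):
    mag = tf.keras.Input(shape=(None, bins))       # (frames, frequency bins)
    # Stage 1: LSTM -> sigmoid mask applied to the noisy magnitude
    h1 = tf.keras.layers.LSTM(lstm_units, return_sequences=True)(mag)
    mask1 = tf.keras.layers.Dense(bins, activation="sigmoid")(h1)
    enhanced1 = tf.keras.layers.Multiply()([mag, mask1])
    # Stage 2: a second LSTM/mask refines the stage-1 estimate
    h2 = tf.keras.layers.LSTM(lstm_units, return_sequences=True)(enhanced1)
    mask2 = tf.keras.layers.Dense(bins, activation="sigmoid")(h2)
    enhanced = tf.keras.layers.Multiply()([enhanced1, mask2])
    return tf.keras.Model(mag, enhanced)

build_dtln_like().summary()  # well under 1M parameters at these sizes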

Training Configuration:

  • Loss: Combined time + frequency domain MSE
  • Optimizer: Adam (lr=0.001)
  • Batch size: 16
  • Epochs: 50
  • Quantization: INT8 post-training quantization
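
A sketch of what the combined time + frequency domain MSE could look like (the 0.5 weighting and the 512/128 STFT framing are assumptions, not necessarily the repo's exact loss):

import tensorflow as tf

def combined_mse(clean, estimate, alpha=0.5):
    # Time-domain MSE on raw waveforms
    t_loss = tf.reduce_mean(tf.square(clean - estimate))
    # Frequency-domain MSE on STFT magnitudes
    c_mag = tf.abs(tf.signal.stft(clean, 512, 128, 512))
    e_mag = tf.abs(tf.signal.stft(estimate, 512, 128, 512))
    f_loss = tf.reduce_mean(tf.square(c_mag - e_mag))
    return alpha * t_loss + (1.0 - alpha) * f_loss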

Memory Footprint:

  • Model weights: ~80 KB (INT8)
  • Tensor arena: ~100 KB
  • Audio buffers: ~2 KB
  • Total: ~200 KB

Edge Device Deployment

Hardware Utilization:

  • NPU/CPU: For LSTM inference
  • CPU: For FFT operations (CMSIS-DSP)
  • Memory: Optimized buffer management
  • Peripherals: I2S/PDM for audio I/O

Power Profile:

  • Active inference: 30-40 mW
  • Idle: <1 mW
  • Average (50% duty): ~15-20 mW

Real-time Constraints:

  • Frame processing: 8 ms available
  • FFT: ~1 ms
  • NPU inference: ~4 ms
  • IFFT + overhead: ~2 ms
  • Margin: ~1 ms
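
A back-of-the-envelope check that this budget closes, using the numbers above:

hop_ms = 128 / 16000 * 1e3             # 8.0 ms of audio arrives per frame shift
spent = 1.0 + 4.0 + 2.0                # FFT + NPU inference + IFFT/overhead (ms)
print(f"margin: {hop_ms - spent} ms")  # margin: 1.0 ms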

📚 Citation

If you use this model in your research, please cite:

@inproceedings{westhausen2020dtln,
  title={Dual-signal transformation LSTM network for real-time noise suppression},
  author={Westhausen, Nils L and Meyer, Bernd T},
  booktitle={Interspeech},
  year={2020}
}

Built for Edge AI • Optimized for Microcontrollers • Original DTLN