Radio2Speech: High Quality Speech Recovery from Radio Frequency Signals

¹ The University of Hong Kong ² IIIS, Tsinghua University ³ Shanghai Qi Zhi Institute

INTERSPEECH 2022

Because microphones are easily affected by noise and soundproof materials, the radio frequency (RF) signal is a promising candidate for recovering audio: it is immune to acoustic noise and can traverse many soundproof objects. In this paper, we introduce Radio2Speech, a system that uses RF signals to recover high-quality speech from a loudspeaker. Radio2Speech can recover speech comparable in quality to that of a microphone, advancing beyond existing approaches, which recover only single-tone music or incomprehensible speech. We use Radio UNet to accurately recover speech in the time-frequency domain from an RF signal with a limited frequency band. We also incorporate a neural vocoder to synthesize the speech waveform from the estimated time-frequency representation without using the contaminated phase. Quantitative and qualitative evaluations show that in quiet, noisy and soundproof scenarios, Radio2Speech achieves state-of-the-art performance and is on par with a microphone that works in quiet scenarios.

A millimeter wave (mmWave) radar transmits RF signals to the vibrating diaphragm of a loudspeaker, and the reflected signals carry vibration characteristics associated with the speech. Radio2Speech parses the reflected RF signals to recover that speech. In quiet scenarios, Radio2Speech can recover high-quality speech like a microphone. Moreover, since RF signals are not affected by acoustic noise, our system can still recover high-quality speech in noisy scenarios, where a microphone almost fails. Soundproof materials can make a microphone deaf, but RF signals can traverse occlusions, so the speech recovered by Radio2Speech remains intelligible in soundproof scenarios. In the Radio2Speech pipeline, the input RF signal is upsampled using cubic spline interpolation and then transformed into a Mel spectrogram. The proposed Radio UNet recovers the Mel spectrogram of the speech signal from it. Finally, a neural vocoder, Parallel WaveGAN, reconstructs the natural speech waveform from the estimated Mel spectrogram.
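The two preprocessing steps of the pipeline (cubic spline upsampling, then conversion to a Mel spectrogram) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 2 kHz RF sampling rate, the FFT size, hop length, number of Mel bands, and the HTK-style Mel formula are all assumptions chosen for illustration, and a synthetic 200 Hz tone stands in for a real reflected RF signal. Only the 8 kHz target rate mirrors the source speech rate mentioned on this page.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import stft


def upsample_cubic(x, sr_in, sr_out):
    """Resample x from sr_in to sr_out using cubic spline interpolation."""
    t_in = np.arange(len(x)) / sr_in
    # Choose the output length so that no output time exceeds the last
    # input time, which avoids extrapolating beyond the data.
    n_out = int((len(x) - 1) * sr_out / sr_in) + 1
    t_out = np.arange(n_out) / sr_out
    return CubicSpline(t_in, x)(t_out)


def hz_to_mel(f):
    # HTK-style Mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the Mel scale."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(x, sr, n_fft=512, hop=128, n_mels=80):
    """Log-Mel spectrogram with shape (n_mels, n_frames)."""
    _, _, spec = stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    power = np.abs(spec) ** 2
    mel = mel_filterbank(sr, n_fft, n_mels) @ power
    return np.log(mel + 1e-10)  # small offset keeps the log finite


# Hypothetical rates: a 2 kHz RF vibration signal upsampled to 8 kHz.
rf_sr, target_sr = 2000, 8000
t = np.arange(rf_sr) / rf_sr               # one second of samples
rf = np.sin(2.0 * np.pi * 200.0 * t)       # synthetic stand-in signal

upsampled = upsample_cubic(rf, rf_sr, target_sr)
mel = log_mel_spectrogram(upsampled, target_sr)
print(upsampled.shape, mel.shape)
```

In a full system, an array like `mel` would be the input to the spectrogram-recovery network, and the network's output would go to the vocoder. Both of those stages are learned models and are outside the scope of this sketch.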

Video (3 minutes)

Speech Recovery Demo

The videos below show the experimental scenarios of RF data collection (i.e., quiet, noisy and soundproof scenarios); they were recorded with a mobile phone. The source speech (8 kHz) from LJSpeech (LJ047-0073) is provided for reference. Speech examples recorded by the microphone and recovered by Radio2Speech are also shown below; each was recorded in the corresponding scenario or recovered from data collected in that scenario. Spectrograms are presented to visualize the recovered speech.

Experimental scenario (video)

Source speech

Microphone

Radio2Speech (Ours)

Quiet scenario

Noisy scenario

Soundproof scenario

Speech Recovery Result

This section presents the comparison results of speech recovery. Speech samples recovered by WaveEar and Radio2Speech (ours), along with the corresponding samples recorded by a microphone, are provided so that readers can better evaluate our system. The source speech samples (8 kHz) are also provided. We recommend that listeners use headphones for the best audio experience. All of the following speech samples were unseen during training, and the results were selected at random. The first five samples are from the LJSpeech dataset (i.e., LJ047-0049, LJ047-0128, LJ047-0225, LJ048-0011, LJ048-0083), and the last three samples are from the TIMIT dataset (i.e., FAEM0/SX42, MGXP0/SX97, MTWH1/SX72).

(1) Quiet Scenario

The speech samples in this section were recorded in the quiet scenario or recovered from data collected in the quiet scenario.

Source speech

Microphone

WaveEar

Radio2Speech (Ours)

(2) Noisy Scenario

The speech samples in this section were recorded in the noisy scenario or recovered from data collected in the noisy scenario. Since acoustic noise does not affect RF-based systems, the speech recovered in the noisy scenario is similar to that recovered in the quiet scenario.

Source speech

Microphone

WaveEar

Radio2Speech (Ours)