
A new model to produce more natural synthesized speech

The proposed Diff-ETS framework for ETS. The dark blue blocks are trainable and the light blue block of the vocoder is frozen. ResBlock: Residual blocks, Attn: Attention, Conv: Convolutional layers. Credit: Ren et al.

Recent technological advances are enabling the development of computational tools that could significantly improve the quality of life of individuals with disabilities or sensory impairments. These include so-called electromyography-to-speech (ETS) conversion models, designed to convert electrical signals produced by skeletal muscles into speech.

Researchers at the University of Bremen and SUPSI recently introduced Diff-ETS, a model for ETS conversion that can produce more natural synthesized speech. This model, introduced in a paper posted to the preprint server arXiv, could be used to develop new systems that allow people who are unable to speak, such as patients who underwent a laryngectomy (a surgical procedure to remove part of the human voice box), to communicate with others.

Most previously introduced methods for ETS conversion have two key components: an EMG encoder and a vocoder. The electromyography (EMG) encoder converts EMG signals into acoustic speech features, while the vocoder uses these speech features to synthesize speech signals.
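The two-stage pipeline can be sketched as follows. This is a minimal shape-level illustration, not the paper's implementation: the channel count, Mel-bin count, hop length, and the stand-in encoder/vocoder functions are all assumptions chosen for clarity (real systems use trained neural networks for both stages).

```python
import numpy as np

# Illustrative dimensions (assumptions, not from the paper).
N_EMG_CHANNELS = 8   # number of EMG electrodes
N_MEL_BINS = 80      # log Mel spectrogram bins, a common choice in speech synthesis
HOP_LENGTH = 256     # audio samples produced per spectrogram frame

def emg_encoder(emg: np.ndarray) -> np.ndarray:
    """Stand-in for a learned EMG encoder: maps EMG frames of shape
    (frames, channels) to acoustic features of shape (frames, mel bins).
    In a real system this is a neural network; here it is a fixed projection."""
    rng = np.random.default_rng(0)
    projection = rng.standard_normal((N_EMG_CHANNELS, N_MEL_BINS))
    return emg @ projection

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Stand-in for a neural vocoder: turns a Mel spectrogram into a waveform.
    Here we simply emit HOP_LENGTH silent samples per frame to show the shapes."""
    n_frames = mel.shape[0]
    return np.zeros(n_frames * HOP_LENGTH)

# Stage 1: EMG signals -> acoustic speech features.
emg = np.random.default_rng(1).standard_normal((100, N_EMG_CHANNELS))
mel = emg_encoder(emg)
# Stage 2: acoustic features -> speech waveform.
audio = vocoder(mel)
```

The key design point is the intermediate acoustic representation: the encoder never has to produce raw audio, and the vocoder never has to see EMG, so each stage can be trained (or frozen) independently.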

“Due to an inadequate amount of available data and noisy signals, the synthesized speech often exhibits a low level of naturalness,” Zhao Ren, Kevin Scheck and their colleagues wrote in their paper. “In this work, we propose Diff-ETS, an ETS model which uses a score-based diffusion probabilistic model to enhance the naturalness of synthesized speech. The diffusion model is applied to improve the quality of the acoustic features predicted by an EMG encoder.”

In contrast with many other ETS conversion models developed in the past, which consist of an encoder and a vocoder, the researchers' model has three components, namely an EMG encoder, a diffusion probabilistic model and a vocoder. The diffusion probabilistic model, the second of these components, is thus a new addition, which can lead to more natural synthesized speech.

Ren, Scheck and their colleagues trained the EMG encoder to predict a so-called log Mel spectrogram (i.e., a visual representation of audio signals) and phoneme targets from EMG signals. The diffusion probabilistic model, on the other hand, was trained to enhance log Mel spectrograms, while the pre-trained vocoder translates the enhanced spectrogram into synthesized speech.
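The role of the diffusion model can be illustrated with a toy score-based refinement loop. This is a heavily simplified sketch under stated assumptions: the `score_model` below is a hand-written stand-in (in Diff-ETS the score is a trained neural network conditioned on the encoder's output), and the step size and step count are arbitrary illustrative values.

```python
import numpy as np

def score_model(x: np.ndarray, t: int) -> np.ndarray:
    """Stand-in for the learned score network. For illustration only: it pulls
    samples toward an all-zeros 'clean' spectrogram, i.e. the gradient of the
    log-density of a unit Gaussian centered at zero."""
    return -x

def diffusion_refine(mel_coarse: np.ndarray,
                     n_steps: int = 50,
                     step_size: float = 0.1) -> np.ndarray:
    """Langevin-style refinement sketch: start from the encoder's coarse
    spectrogram prediction and iteratively denoise it using the score model,
    injecting a small amount of noise at each reverse step."""
    rng = np.random.default_rng(0)
    x = mel_coarse.copy()
    for t in range(n_steps, 0, -1):
        noise = rng.standard_normal(x.shape)
        x = x + step_size * score_model(x, t) \
              + np.sqrt(2.0 * step_size) * 0.01 * noise
    return x

# Coarse spectrogram from the (hypothetical) EMG encoder: 100 frames x 80 bins.
coarse = np.random.default_rng(1).standard_normal((100, 80))
refined = diffusion_refine(coarse)
```

Because the refinement operates purely on spectrograms, it slots between the existing encoder and vocoder without changing either interface, which is what lets the authors fine-tune the diffusion model on a pre-trained encoder's predictions or train both end-to-end.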

The researchers evaluated the Diff-ETS model in a series of tests, comparing it with a baseline ETS approach. Their findings were highly promising, as the speech it synthesized was more natural and human-like than that produced by the baseline method.

“In our experiments, we evaluated fine-tuning the diffusion model on predictions of a pre-trained EMG encoder, and training both models in an end-to-end fashion,” Ren, Scheck and their colleagues wrote in their paper. “We compared Diff-ETS with a baseline ETS model without diffusion using objective metrics and a listening test. The results indicated the proposed Diff-ETS significantly improved speech naturalness over the baseline.”

In the future, the ETS conversion model developed by this team of researchers could be used to develop better technologies for the artificial generation of audible speech. These systems could allow people who are unable to speak to express their thoughts out loud, facilitating their interaction with others.

“In future efforts, one can reduce the number of model parameters using various methods, e.g., model compression and knowledge distillation, thereby generating speech samples in real-time,” the researchers wrote. “Moreover, a diffusion model can be trained together with the encoder and vocoder for further enhancing the speech quality.”

More information:
Zhao Ren et al, Diff-ETS: Learning a Diffusion Probabilistic Model for Electromyography-to-Speech Conversion, arXiv (2024). DOI: 10.48550/arxiv.2405.08021

Journal information:
arXiv


© 2024 Science X Network

Citation:
A new model to produce more natural synthesized speech (2024, May 27)
retrieved 27 May 2024
from https://techxplore.com/information/2024-05-natural-speech.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.


