LEVERAGING SELF-SUPERVISED SPEECH REPRESENTATIONS FOR DOMAIN ADAPTATION IN SPEECH ENHANCEMENT

  • Publication Date:
    March 20, 2025
  • Additional Information
    • Document Number:
      20250095666
    • Appl. No:
      18/884978
    • Application Filed:
      September 13, 2024
    • Abstract:
      A method for generating a customized speech enhancement (SE) model includes obtaining noisy-clean speech data from a source domain; obtaining noisy speech data from a target domain; obtaining raw speech data; using the noisy-clean speech data, the noisy speech data, and the raw speech data, training the customized SE model based on at least one of self-supervised representation-based adaptation (SSRA), ensemble mapping, or self-supervised adaptation loss; generating the customized SE model by denoising the noisy speech data using the trained customized SE model; and providing the customized SE model to a user device to use the denoised noisy speech data. (Illustrative code sketches of the SSRA and ensemble-mapping strategies follow this record.)
    • Assignees:
      SAMSUNG ELECTRONICS CO., LTD. (Suwon-si, KR)
    • Claim:
      1. A method for generating a customized speech enhancement (SE) model, performed by at least one processor of an electronic device, the method comprising: obtaining noisy-clean speech data from a source domain; obtaining noisy speech data from a target domain; obtaining raw speech data; using the noisy-clean speech data, the noisy speech data, and the raw speech data, training the customized SE model based on at least one of self-supervised representation-based adaptation (SSRA), ensemble mapping, or self-supervised adaptation loss; generating the customized SE model by denoising the noisy speech data using the trained customized SE model; and providing the customized SE model to a user device to use the denoised noisy speech data.
    • Claim:
      2. The method of claim 1, wherein the training the customized SE model comprises training the customized SE model based on the SSRA, and the training the customized SE model further comprises pre-training a self-supervised learning (SSL) encoder in a self-supervised manner, providing a target domain enhanced signal to the SSL encoder, and providing source domain clean signals to the SSL encoder.
    • Claim:
      3. The method of claim 1, wherein the training the customized SE model comprises training the customized SE model based on the ensemble mapping, and the training the customized SE model further comprises pseudo labeling the noisy speech data from the target domain.
    • Claim:
      4. The method of claim 1, wherein the training the customized SE model comprises training the customized SE model based on the self-supervised adaptation loss, and the training the customized SE model further comprises using a distance metric in an SSRA loss term.
    • Claim:
      5. The method of claim 1, wherein the noisy speech data is obtained from the user device in the target domain.
    • Claim:
      6. The method of claim 5, wherein the user device comprises at least one of a mobile phone, a refrigerator, a smart watch, glasses, or a television.
    • Claim:
      7. The method of claim 1, wherein the noisy speech data is obtained from a plurality of microphones corresponding to a plurality of user devices.
    • Claim:
      8. A server device comprising: a memory storing instructions; and at least one processor, wherein the instructions, when executed by the at least one processor, cause the server device to: obtain noisy-clean speech data from a source domain; obtain noisy speech data from a target domain; obtain raw speech data; using the noisy-clean speech data, the noisy speech data, and the raw speech data, train a customized SE model based on at least one of self-supervised representation-based adaptation (SSRA), ensemble mapping, or self-supervised adaptation loss; generate the customized SE model by denoising the noisy speech data using the trained customized SE model; and provide the customized SE model to a user device to use the denoised noisy speech data.
    • Claim:
      9. The server device of claim 8, wherein the instructions, when executed by the at least one processor, cause the server device to pre-train a self-supervised learning (SSL) encoder in a self-supervised manner, provide a target domain enhanced signal to the SSL encoder, and provide source domain clean signals to the SSL encoder.
    • Claim:
      10. The server device of claim 8, wherein the instructions, when executed by the at least one processor, cause the server device to train the customized SE model based on the ensemble mapping, and pseudo label the noisy speech data from the target domain.
    • Claim:
      11. The server device of claim 8, wherein the instructions, when executed by the at least one processor, cause the server device to train the customized SE model based on the self-supervised adaptation loss, and use a distance metric in an SSRA loss term.
    • Claim:
      12. The server device of claim 8, wherein the noisy speech data is obtained from the user device in the target domain.
    • Claim:
      13. The server device of claim 12, wherein the user device comprises at least one of a mobile phone, a refrigerator, a smart watch, glasses, or a television.
    • Claim:
      14. The server device of claim 8, wherein the noisy speech data is obtained from a plurality of microphones corresponding to a plurality of user devices.
    • Claim:
      15. A non-transitory computer-readable recording medium configured to store instructions for generating a customized speech enhancement (SE) model, which, when executed by at least one processor of an electronic device, cause the at least one processor to perform a method comprising: obtaining noisy-clean speech data from a source domain; obtaining noisy speech data from a target domain; obtaining raw speech data; using the noisy-clean speech data, the noisy speech data, and the raw speech data, training the customized SE model based on at least one of self-supervised representation-based adaptation (SSRA), ensemble mapping, or self-supervised adaptation loss; generating the customized SE model by denoising the noisy speech data using the trained customized SE model; and providing the customized SE model to a user device to use the denoised noisy speech data.
    • Claim:
      16. The non-transitory computer-readable recording medium of claim 15, wherein the training the customized SE model comprises training the customized SE model based on the SSRA, and the training the customized SE model further comprises pre-training a self-supervised learning (SSL) encoder in a self-supervised manner, providing a target domain enhanced signal to the SSL encoder, and providing source domain clean signals to the SSL encoder.
    • Claim:
      17. The non-transitory computer-readable recording medium of claim 15, wherein the training the customized SE model comprises training the customized SE model based on the ensemble mapping, and the training the customized SE model further comprises pseudo labeling the noisy speech data from the target domain.
    • Claim:
      18. The non-transitory computer-readable recording medium of claim 15, wherein the training the customized SE model comprises training the customized SE model based on the self-supervised adaptation loss, and the training the customized SE model further comprises using a distance metric in an SSRA loss term.
    • Claim:
      19. The non-transitory computer-readable recording medium of claim 15, wherein the noisy speech data is obtained from the user device in the target domain.
    • Claim:
      20. The non-transitory computer-readable recording medium of claim 19, wherein the user device comprises at least one of a mobile phone, a refrigerator, a smart watch, glasses, or a television.
    • Current International Class:
      10; 10; 10
    • Accession Number:
      edspap.20250095666
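
The following is a minimal sketch, not the patent's implementation, of the SSRA idea described in claims 1, 2, and 4: a pre-trained self-supervised (SSL) encoder embeds both the target-domain enhanced output and source-domain clean speech, and a distance metric between the two representations serves as the SSRA loss term alongside an ordinary supervised loss on source-domain noisy-clean pairs. The PyTorch modules, the L2 distance, and all names (TinySSLEncoder, TinySEModel, ssra_loss, training_step, ssra_weight) are assumptions made purely for illustration; the claims do not fix an architecture, an SSL model, or a particular metric.

```python
# Hedged sketch of SSRA-style adaptation: compare SSL representations of the
# target-domain enhanced signal against source-domain clean speech.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinySSLEncoder(nn.Module):
    """Stand-in for an SSL encoder pre-trained in a self-supervised manner on raw speech."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> representations: (batch, dim, frames)
        return self.net(wav.unsqueeze(1))


class TinySEModel(nn.Module):
    """Stand-in for the customized SE model (a simple waveform-masking network)."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=9, padding=4),
            nn.Sigmoid(),
        )

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        mask = self.net(noisy.unsqueeze(1)).squeeze(1)
        return mask * noisy  # enhanced waveform


def ssra_loss(ssl_encoder, enhanced_target, clean_source) -> torch.Tensor:
    """Distance metric between SSL representations of target enhanced and source clean speech."""
    with torch.no_grad():
        ref = ssl_encoder(clean_source)      # source-domain clean representations (no grad needed)
    rep = ssl_encoder(enhanced_target)       # target-domain enhanced representations
    # Mean-pool over batch and time, then compare the pooled statistics with an L2 distance.
    return F.mse_loss(rep.mean(dim=(0, 2)), ref.mean(dim=(0, 2)))


def training_step(se_model, ssl_encoder, optimizer,
                  noisy_src, clean_src, noisy_tgt, ssra_weight: float = 0.1):
    """One adaptation step: supervised loss on source pairs plus the SSRA adaptation term."""
    optimizer.zero_grad()
    supervised = F.l1_loss(se_model(noisy_src), clean_src)            # source noisy-clean pairs
    adaptation = ssra_loss(ssl_encoder, se_model(noisy_tgt), clean_src)
    loss = supervised + ssra_weight * adaptation
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    se_model, ssl_encoder = TinySEModel(), TinySSLEncoder()
    for p in ssl_encoder.parameters():       # the pre-trained encoder is not updated here
        p.requires_grad_(False)
    opt = torch.optim.Adam(se_model.parameters(), lr=1e-3)
    noisy_src, clean_src = torch.randn(4, 16000), torch.randn(4, 16000)
    noisy_tgt = torch.randn(4, 16000)        # unlabeled target-domain recordings
    print(training_step(se_model, ssl_encoder, opt, noisy_src, clean_src, noisy_tgt))
```

Mean-pooling the representations before taking the distance is one of many possible choices; the claims only require that some distance metric appear in the SSRA loss term, and that the encoder see both the target-domain enhanced signal and source-domain clean signals.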
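
A similarly hedged sketch of the ensemble-mapping path in claims 3, 10, and 17: an ensemble of already-trained SE models denoises the unlabeled target-domain noisy speech, and their aggregated outputs serve as pseudo labels for supervised adaptation of the customized SE model. The plain averaging rule and the names (make_tiny_se_model, pseudo_label, adapt_step) are illustrative assumptions; the claims state only that the target-domain noisy speech data is pseudo labeled.

```python
# Hedged sketch of ensemble-mapping adaptation: build pseudo-clean targets by
# aggregating the outputs of several SE models on target-domain noisy speech.
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_tiny_se_model(hidden: int = 32) -> nn.Module:
    """Stand-in SE network; any waveform-to-waveform enhancer fits this slot."""
    return nn.Sequential(
        nn.Conv1d(1, hidden, kernel_size=9, padding=4),
        nn.ReLU(),
        nn.Conv1d(hidden, 1, kernel_size=9, padding=4),
    )


def enhance(model: nn.Module, noisy: torch.Tensor) -> torch.Tensor:
    # (batch, samples) -> (batch, samples)
    return model(noisy.unsqueeze(1)).squeeze(1)


def pseudo_label(ensemble, noisy_target: torch.Tensor) -> torch.Tensor:
    """Aggregate (here: average) the ensemble's enhanced outputs into pseudo-clean targets."""
    with torch.no_grad():
        estimates = torch.stack([enhance(m, noisy_target) for m in ensemble], dim=0)
    return estimates.mean(dim=0)


def adapt_step(se_model, ensemble, optimizer, noisy_target: torch.Tensor) -> float:
    """One training step on (target noisy, pseudo-clean) pairs built by the ensemble."""
    targets = pseudo_label(ensemble, noisy_target)
    optimizer.zero_grad()
    loss = F.l1_loss(enhance(se_model, noisy_target), targets)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    ensemble = [make_tiny_se_model() for _ in range(3)]   # pre-trained SE models in practice
    se_model = make_tiny_se_model()
    opt = torch.optim.Adam(se_model.parameters(), lr=1e-3)
    noisy_target = torch.randn(4, 16000)                  # unlabeled target-domain recordings
    print(adapt_step(se_model, ensemble, opt, noisy_target))
```

Any aggregation that maps the ensemble's estimates to a single pseudo-clean target (median, confidence weighting, and so on) would fit the same pattern.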