LAN Chaofeng, YANG Guotao, CHEN Yingqi, GUO Xiaoxia. Research on Monophonic Speech Separation Method Using Time-Frequency Domain Multi-scale Information Interaction Strategy[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT251340

Research on Monophonic Speech Separation Method Using Time-Frequency Domain Multi-scale Information Interaction Strategy

doi: 10.11999/JEIT251340 cstr: 32379.14.JEIT251340
Funds:  The National Natural Science Foundation of China (11804068), Heilongjiang Provincial Outstanding Young Teachers Basic Research Support Programme (YQJH2024064)
  • Received Date: 2025-12-18
  • Accepted Date: 2026-03-03
  • Rev Recd Date: 2026-03-03
  • Available Online: 2026-03-14
  •   Objective  Monaural speech separation aims to extract individual speaker signals from a single-channel mixture. It is a core technology for addressing the “cocktail party problem” and has substantial application value in low-resource, low-latency scenarios such as mobile voice assistants, teleconferencing, and hearing aids. However, the lack of spatial cues in single-channel signals, together with the substantial overlap of multiple speakers in both time-domain waveforms and frequency-domain spectra, makes accurate separation highly challenging, especially when the integrity and clarity of the target speech must be preserved. Current deep learning-based models often show limitations in three closely related aspects: effective coordination of multi-scale dependencies, efficient fusion of time-frequency information, and control of computational complexity. To address these challenges, a novel Multi-Scale Attention model integrating Time-Frequency domain information (MSA-TF) is proposed to improve separation performance, computational efficiency, and generalization capability.  Methods  The MSA-TF model contains three key components. First, a lightweight Time-Frequency fusion module is designed. The module first divides the frequency band into four subbands on the basis of speech priors, such as low-frequency energy concentration and high-frequency detail sensitivity, to extract spectral features efficiently. A dynamic gating mechanism with decomposed convolutions and SiLU activation is then applied to adaptively enhance speaker-discriminative features and suppress redundant channels associated with noise. Finally, a cross-attention mechanism is used to promote deep interaction between time-domain and frequency-domain features during the encoding stage. Global semantic information from the time domain guides the selection and weighting of useful frequency-domain features, allowing mutual correction and complementarity. This module adds only 0.8 M parameters. 
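The encoding-stage fusion described above, in which a SiLU-activated gate modulates features and time-domain queries attend over subband frequency-domain features, can be sketched as follows. This is a minimal single-head NumPy illustration under assumed shapes (4 subband feature vectors, a 16-dimensional embedding); the function names and the plain residual gate are illustrative stand-ins, not the paper's decomposed-convolution implementation.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(time_feats, freq_feats):
    """Time-domain features (queries) attend over frequency-domain
    features (keys/values), so global temporal context selects and
    weights the useful spectral information."""
    d = time_feats.shape[-1]
    scores = time_feats @ freq_feats.T / np.sqrt(d)   # (T, F)
    return softmax(scores, axis=-1) @ freq_feats      # (T, d)

def gated_fuse(time_feats, freq_feats):
    """A SiLU gate derived from the attended spectral features
    modulates the fusion; residual connection keeps the time-domain
    stream dominant. (Illustrative; not the paper's module.)"""
    attended = cross_attention(time_feats, freq_feats)
    gate = silu(attended)                             # channel-wise gate
    return time_feats + gate * attended               # residual fusion

T, d = 50, 16                                         # assumed sizes
rng = np.random.default_rng(0)
time_feats = rng.standard_normal((T, d))              # frame sequence
freq_feats = rng.standard_normal((4, d))              # 4 subband vectors
fused = gated_fuse(time_feats, freq_feats)
print(fused.shape)                                    # (50, 16)
```

The residual form means the gate can only add spectral evidence on top of the time-domain representation, which mirrors the module's role of supplementing rather than replacing time-domain features.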
Second, a Multi-scale Interaction Separator is proposed to address the limitations of sequential or loosely coupled multi-scale processing in models such as SepFormer. Multi-granularity features, ranging from frame-level F1 to syllable-level semantic F4, are extracted through cascaded dilated convolutions. The separator's core is the "GF-LF (Global Flash/Local Flash) Iterative Feedback" mechanism. The Global Flash module, based on efficient FLASH attention, captures long-range dependencies and syllable-level context. This global information is upsampled and injected into local features (Fk) through residual connections. Local Flash modules, also based on FLASH attention, then process the enhanced local features (F'k) to model fine-grained structures and suppress frame-level noise. The updated local features are subsequently fed back through adaptive pooling to refine the global representation in the next iteration. This closed-loop bidirectional flow enables deep synergy between global semantics and local details. A gated fusion mechanism at the end dynamically balances the contributions of different scales. Third, to control computational complexity, an efficient hierarchical grouped attention mechanism is adopted, reducing the complexity from quadratic to nearly linear with sequence length. The overall MSA-TF architecture is end-to-end and consists of a 1D convolutional encoder, the integrated time-frequency and multi-scale modules, a mask network, and a symmetric decoder.  Results and Discussions  Extensive experiments are conducted on the standard WSJ0-2mix and Libri-2mix datasets, with Scale-Invariant Signal-to-Noise Ratio (SI-SNR) and Signal-to-Distortion Ratio (SDR) used as evaluation metrics. Ablation studies (Table 1) confirm the individual and joint contributions of the proposed modules.
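The cascaded dilated convolutions and the final gated fusion of the separator can be illustrated with a toy NumPy sketch. The smoothing kernel, the fixed energy-based gate, and the four-level cascade are assumptions standing in for the learned modules, and the FLASH-attention feedback loop is omitted for brevity.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """'Same'-padded 1-D dilated convolution over a feature sequence."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)
    return np.array([sum(kernel[j] * xp[i + j * dilation] for j in range(k))
                     for i in range(len(x))])

def multiscale_features(x, levels=4):
    """Cascaded dilated convolutions: each level doubles the dilation,
    widening the receptive field from frame-level (F1) toward
    syllable-level (F4) granularity."""
    kernel = np.array([0.25, 0.5, 0.25])   # hypothetical smoothing kernel
    feats, cur = [], x
    for level in range(levels):
        cur = dilated_conv1d(cur, kernel, dilation=2 ** level)
        feats.append(cur)
    return feats

def gated_fusion(feats):
    """Scalar gates (softmax-normalised energies) balance the scales;
    the paper's learned gate is replaced by this fixed heuristic."""
    energies = np.array([np.mean(f ** 2) for f in feats])
    w = np.exp(energies) / np.exp(energies).sum()
    return sum(wi * f for wi, f in zip(w, feats))

rng = np.random.default_rng(1)
x = np.sin(np.linspace(0, 8 * np.pi, 128)) + 0.1 * rng.standard_normal(128)
feats = multiscale_features(x)            # [F1, F2, F3, F4]
fused = gated_fusion(feats)
print(len(feats), fused.shape)            # 4 (128,)
```

With dilations 1, 2, 4, 8, the deepest level aggregates context over a span roughly eight times wider than the first, which is the mechanism by which the cascade moves from frame-level detail to longer-range structure.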
When only the time-frequency module is added to the TDANet baseline, SI-SNR increases by 0.3 dB and SDR by 0.4 dB with only a small increase in parameters, confirming its contribution to signal structure modeling, particularly for high-frequency details. When only the multi-scale interaction module is incorporated, SI-SNR increases by 2.5 dB and SDR by 2.7 dB, highlighting its central role in modeling long-term dependencies. When the time-frequency and multi-scale modules are combined in the complete MSA-TF core, a synergistic effect is obtained, reaching 17.6 dB SI-SNR, which exceeds the sum of the individual gains. This result indicates that the dual-dimensional features provided by time-frequency fusion and the deep dependency modeling enabled by multi-scale interaction strengthen each other. Spectrogram analysis (Fig. 4) further shows that the time-frequency module effectively suppresses residual high-frequency noise and produces clearer spectral contours for the target speech. On the WSJ0-2mix test set (Table 2), MSA-TF achieves state-of-the-art performance, with 17.6 dB SI-SNR and 17.8 dB SDR. It matches the performance of SuperFormer and substantially outperforms strong baselines such as Conv-TasNet by 2.3 dB SI-SNR, while maintaining a reasonable parameter count of 15.6 M. Compared with models with larger parameter sizes, such as SignPredictionNet at 55.2 M, MSA-TF shows more efficient modeling. For generalization evaluation on the completely unseen Libri-2mix dataset (Table 3), MSA-TF, trained only on WSJ0-2mix, achieves 14.2 dB SI-SNR and 14.7 dB SDR. Its performance is comparable to that of Conv-TasNet models trained specifically on Libri-2mix, which achieve 14.4 dB SI-SNR, and it outperforms BLSTM-TasNet trained on Libri-2mix. This strong cross-dataset adaptability indicates that the model captures universal time-frequency characteristics and multi-scale dependency structures in speech signals rather than overfitting to a specific dataset distribution.
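SI-SNR, the main metric reported above, is computed by projecting the estimate onto the target and comparing the power of the projected component against the residual. A standard NumPy implementation under the usual zero-mean convention:

```python
import numpy as np

def si_snr(estimate: np.ndarray, target: np.ndarray) -> float:
    """Scale-Invariant Signal-to-Noise Ratio in dB."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target: the scaled reference signal.
    s_target = np.dot(estimate, target) / np.dot(target, target) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10(np.sum(s_target ** 2) / np.sum(e_noise ** 2))

# Synthetic check: a clean signal with 10% additive noise.
rng = np.random.default_rng(0)
s = rng.standard_normal(16000)
noisy = s + 0.1 * rng.standard_normal(16000)
print(f"SI-SNR: {si_snr(noisy, s):.1f} dB")   # about 20 dB for 10% noise
```

Because both the projection and the residual scale with the estimate, rescaling the estimate leaves the metric unchanged, which is exactly the property that makes SI-SNR preferable to plain SNR for separation models whose output gain is arbitrary.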
Conclusions  An MSA-TF model is presented to address key challenges in monaural speech separation through deep integration of multi-scale time-frequency information interaction. The proposed lightweight Time-Frequency fusion module efficiently supplements time-domain features with discriminative frequency-domain information. The Multi-scale Interaction Separator, with its iterative feedback mechanism, enables dynamic bidirectional information flow across scales and substantially improves the joint modeling of short-term details and long-term dependencies. Combined with an efficient attention design, the model achieves superior performance without excessive computational cost. Experimental results show that MSA-TF achieves leading separation performance on standard benchmarks and shows strong generalization ability on unseen data distributions, confirming the effectiveness of this comprehensive design. The model provides an efficient, robust, and generalizable solution for practical low-resource application scenarios. Future work may examine advanced cross-modal fusion techniques and dynamic scale adjustment strategies to further improve robustness and performance in more complex and variable acoustic environments.
  • [1]
    LI Kai, CHEN Guo, SANG Wendi, et al. Advances in speech separation: Techniques, challenges, and future trends[J]. arXiv preprint arXiv: 2508.10830, 2025. doi: 10.48550/arXiv.2508.10830.
    [2]
    LUO Yi and MESGARANI N. Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(8): 1256–1266. doi: 10.1109/TASLP.2019.2915167.
    [3]
    ZHANG Liwen, SHI Ziqiang, HAN Jiqing, et al. FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks[C]. The 26th International Conference on Multimedia Modeling, Daejeon, South Korea, 2020: 653–665. doi: 10.1007/978-3-030-37731-1_53.
    [4]
    SHI Huiyu, CHEN Xi, KONG Tianlong, et al. GLMSnet: Single channel speech separation framework in noisy and reverberant environments[C]. 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 2021: 663–670. doi: 10.1109/ASRU51503.2021.9688217.
    [5]
    LUO Yi, CHEN Zhuo, and YOSHIOKA T. Dual-Path RNN: Efficient long sequence modeling for time-domain single-channel speech separation[C]. The ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020: 46–50. doi: 10.1109/ICASSP40776.2020.9054266.
    [6]
    SUBAKAN C, RAVANELLI M, CORNELL S, et al. Attention is all you need in speech separation[C]. The ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Canada, 2021: 21–25. doi: 10.1109/ICASSP39728.2021.9413901.
    [7]
ZHAO Yucheng, LUO Chong, ZHA Zhengjun, et al. Multi-scale group transformer for long sequence modeling in speech separation[C]. The Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Japan, 2021: 450.
    [8]
RIXEN J and RENZ M. SFSRNet: Super-resolution for single-channel audio source separation[C]. The 36th AAAI Conference on Artificial Intelligence, 2022: 11220–11228. doi: 10.1609/aaai.v36i10.21372.
    [9]
    TONG Weinan, ZHU Jiaxu, CHEN Jun, et al. TFCnet: Time-frequency domain corrector for speech separation[C]. The ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10096785.
    [10]
    ROUARD S, MASSA F, and DÉFOSSEZ A. Hybrid transformers for music source separation[C]. The ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10096956.
    [11]
TZINIS E, WANG Zhepei, and SMARAGDIS P. Sudo RM-RF: Efficient networks for universal audio source separation[C]. 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), Espoo, Finland, 2020: 1–6. doi: 10.1109/MLSP49062.2020.9231900.
    [12]
    LI Kai, YANG Runxuan, and HU Xiaolin. An efficient encoder-decoder architecture with top-down attention for speech separation[J]. arXiv preprint arXiv: 2209.15200, 2022. doi: 10.48550/arXiv.2209.15200.
    [13]
    GOEL K, GU A, DONAHUE C, et al. It’s raw! Audio generation with state-space models[C]. The 39th International Conference on Machine Learning, Baltimore, USA, 2022: 7616–7633.
    [14]
    CHEN Chen, YANG C H H, LI Kai, et al. A neural state-space modeling approach to efficient speech separation[C]. The 24th Annual Conference of the International Speech Communication Association, Dublin, Ireland, 2023: 3784–3788.
    [15]
    XU Mohan, LI Kai, CHEN Guo, et al. TIGER: Time-frequency interleaved gain extraction and reconstruction for efficient speech separation[C]. The 13th International Conference on Learning Representations, Singapore, Singapore, 2025.
    [16]
    OH H, YI J, and LEE Y. Papez: Resource-efficient speech separation with auditory working memory[C]. The ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10095136.
    [17]
    HUA Weizhe, DAI Zihang, LIU Hanxiao, et al. Transformer quality in linear time[C]. The 39th International Conference on Machine Learning, Baltimore, USA, 2022: 9099–9117.
    [18]
    ZHAO Shengkui and MA Bin. MossFormer: Pushing the performance limit of monaural speech separation using gated single-head transformer with convolution-augmented joint self-attentions[C]. The ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10096646.
    [19]
ZHAO Shengkui, MA Yukun, NI Chongjia, et al. MossFormer2: Combining transformer and RNN-free recurrent network for enhanced time-domain monaural speech separation[C]. The ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, South Korea, 2024: 10356–10360. doi: 10.1109/ICASSP48485.2024.10445985.
    [20]
HU Xiaolin, LI Kai, ZHANG Weiyi, et al. Speech separation using an asynchronous fully recurrent convolutional neural network[C]. The 35th International Conference on Neural Information Processing Systems, 2021: 1724.
    [21]
    PAN Zexu, WICHERN G, GERMAIN F G, et al. PARIS: Pseudo-AutoRegressIve Siamese training for online speech separation[C]. The 25th Annual Conference of the International Speech Communication Association, Kos, Greece, 2024.
    [22]
    TAN H M, VU D Q, and WANG J C. Selinet: A lightweight model for single channel speech separation[C]. The ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023: 1–5. doi: 10.1109/ICASSP49357.2023.10097121.
    [23]
    TZINIS E, VENKATARAMANI S, WANG Zhepei, et al. Two-step sound source separation: Training on learned latent targets[C]. The IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 2020: 31–35. doi: 10.1109/ICASSP40776.2020.9054172.
    [24]
LUO Jian, WANG Jianzong, CHENG Ning, et al. Tiny-Sepformer: A tiny time-domain transformer network for speech separation[J]. arXiv preprint arXiv: 2206.13689, 2022. doi: 10.48550/arXiv.2206.13689.
    [25]
    JIANG Yanji, QIU Youli, SHEN Xueli, et al. SuperFormer: Enhanced multi-speaker speech separation network combining channel and spatial adaptability[J]. Applied Sciences, 2022, 12(15): 7650. doi: 10.3390/app12157650.
    [26]
    LIU Debang, ZHANG Tianqi, CHRISTENSEN M G, et al. Efficient time-domain speech separation using short encoded sequence network[J]. Speech Communication, 2025, 166: 103150. doi: 10.1016/j.specom.2024.103150.
    [27]
    HOU Jin, SHENG Yaobao, and ZHANG Bo. DOA estimation of direction vector estimation algorithm based on second-order statistical properties[J]. Journal of Electronics & Information Technology, 2024, 46(2): 697–704. doi: 10.11999/JEIT230172.
    [28]
    TIAN Haoyuan, CHEN Yuxuan, CHEN Beijing, et al. Defeating voice conversion forgery by active defense with diffusion reconstruction[J]. Journal of Electronics & Information Technology, 2026, 48(2): 818–828. doi: 10.11999/JEIT250709.
    [29]
    LIU Jia, ZHANG Yangrui, CHEN Dapeng, et al. Bimodal emotion recognition method based on dual-stream attention and adversarial mutual reconstruction[J]. Journal of Electronics & Information Technology, 2026, 48(1): 277–286. doi: 10.11999/JEIT250424.


    Figures(3)  / Tables(6)
