A Gene Feature Extraction Method Based on Across-view Similarity Order Preserving

SU Shuzhi; ZHANG Kaiyu; WANG Ziying; ZHANG Maoyan

doi:10.11999/JEIT211126

cstr: 32379.14.JEIT211126

SU Shuzhi^{1, 2},
ZHANG Kaiyu¹,
WANG Ziying¹,
ZHANG Maoyan¹

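1. School of Computer Science and Engineering, Anhui University of Science & Technology, Huainan 232001, China
2. Institute of Energy, Hefei Comprehensive National Science Center (Anhui Energy Laboratory), Hefei 230031, China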

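Funds: The National Natural Science Foundation of China (61806006), China Postdoctoral Science Foundation (2019M660149), The Project of Institute of Energy, Hefei Comprehensive National Science Center (19KZS203), The International Science and Technology Cooperation Project of Key Research and Development Plan in Anhui Province (202004b11020029)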

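Author biographies:
SU Shuzhi: male, associate professor; research interests: multimodal pattern recognition, feature learning, and gene analysis

ZHANG Kaiyu: male, master's student; research interests: multimodal pattern recognition and gene analysis

WANG Ziying: female, master's student; research interests: pattern recognition and image processing

ZHANG Maoyan: male, master's student; research interest: pattern recognition

Corresponding author:
SU Shuzhi sushuzhi@foxmail.com

Chinese Library Classification (CLC) number: TN911.73; TP391.4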
Publication history
- Received: 2021-10-14
- Revised: 2022-01-10
- Accepted: 2022-01-12
- Available online: 2022-02-02
- Published: 2023-01-17

Abstract

Abstract: Gene expression data are typically high-dimensional, with few samples and an imbalanced class distribution, so extracting effective features from such data is a key problem in gene classification. Drawing on correlation analysis theory, this paper constructs a discrimination-sensitive within-view similarity order preserving scatter and constrains the discrimination-sensitive between-view similarity correlation, which yields a new gene feature extraction method, Similarity Order Preserving Across-view Correlation Analysis (SOPACA). The proposed method preserves the intra-class aggregation and the similarity order of features across different views while keeping a large separation between classes. Experimental results on cancer gene expression datasets demonstrate the effectiveness of the method.
- Gene feature extraction
- Correlation analysis theory
- Similarity order preserving
- Discrimination sensitive
- Cancer diagnosis

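Algorithm 1　Steps of the SOPACA method
Input: multi-view dataset $\{ {\boldsymbol{X}}^{(i)} = ({\boldsymbol{x}}_1^{(i)}, {\boldsymbol{x}}_2^{(i)}, \cdots, {\boldsymbol{x}}_n^{(i)}) \in {\boldsymbol{R}}^{d_i \times n} \}_{i = 1}^m$
Output: class labels of the gene samples
(1) Use Eq. (7) and Eq. (13) to construct the within-view similarity order preserving scatter matrices ${\boldsymbol{S}}_w^{(i)}$ and the between-view similarity correlation matrices ${\boldsymbol{S}}_b^{(ij)}$, respectively;
(2) Use the Lagrange function in Eq. (16) to obtain the eigenvalues $\lambda$ and the corresponding eigenvectors;
(3) Use Eq. (20) to obtain the correlation projection matrices $\{ {\boldsymbol{W}}_i = ({\boldsymbol{\alpha}}_1^{(i)}, {\boldsymbol{\alpha}}_2^{(i)}, \cdots, {\boldsymbol{\alpha}}_d^{(i)}) \}_{i = 1}^m$;
(4) Use Eq. (21) to obtain the fused discriminant vectors ${\boldsymbol{Z}}$;
(5) Classify the discriminant vectors ${\boldsymbol{Z}}$ with a nearest-neighbor classifier based on Euclidean distance to obtain the class labels of the gene samples.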
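
The flow of Algorithm 1 can be sketched roughly as follows in Python with NumPy. This is only an illustration under stated assumptions, not the authors' implementation: Eqs. (7), (13), (16), (20), and (21) are not reproduced on this page, so the sketch substitutes regularized covariance matrices for ${\boldsymbol{S}}_w^{(i)}$ and plain cross-covariance matrices for ${\boldsymbol{S}}_b^{(ij)}$ (an ordinary MCCA-style generalized eigenproblem), and it assumes serial concatenation for the fusion step. The names `fit_projections`, `fuse`, `nearest_neighbor_predict`, and the parameter `reg` are hypothetical.

```python
import numpy as np

def fit_projections(views, d, reg=1e-3):
    """views: list of (d_i, n) arrays whose columns are samples.
    Returns one (d_i, d) projection matrix per view plus the view means."""
    means = [X.mean(axis=1, keepdims=True) for X in views]
    centered = [X - mu for X, mu in zip(views, means)]
    dims = [X.shape[0] for X in centered]
    n = centered[0].shape[1]
    offsets = np.concatenate(([0], np.cumsum(dims)))
    total = int(offsets[-1])
    A = np.zeros((total, total))  # off-diagonal blocks: stand-in for S_b^{(ij)}
    B = np.zeros((total, total))  # diagonal blocks: stand-in for S_w^{(i)}
    for i, Xi in enumerate(centered):
        si = slice(offsets[i], offsets[i + 1])
        B[si, si] = Xi @ Xi.T / n + reg * np.eye(dims[i])  # ASSUMPTION: not Eq. (7)
        for j, Xj in enumerate(centered):
            if i != j:
                sj = slice(offsets[j], offsets[j + 1])
                A[si, sj] = Xi @ Xj.T / n                  # ASSUMPTION: not Eq. (13)
    # Solve A w = lambda B w by whitening with the Cholesky factor of B.
    L_inv = np.linalg.inv(np.linalg.cholesky(B))
    M = L_inv @ A @ L_inv.T
    eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2)
    top = np.argsort(eigvals)[::-1][:d]                    # d largest eigenvalues
    W = L_inv.T @ eigvecs[:, top]
    Ws = [W[offsets[i]:offsets[i + 1]] for i in range(len(views))]
    return Ws, means

def fuse(views, Ws, means):
    """Serial fusion: stack the projected views (ASSUMPTION: stand-in for Eq. (21))."""
    return np.vstack([W.T @ (X - mu) for X, W, mu in zip(views, Ws, means)])

def nearest_neighbor_predict(Z_train, y_train, Z_test):
    """1-nearest-neighbor classification with Euclidean distance; columns are samples."""
    diff = Z_test[:, :, None] - Z_train[:, None, :]        # (D, n_test, n_train)
    dist2 = (diff ** 2).sum(axis=0)                        # (n_test, n_train)
    return y_train[np.argmin(dist2, axis=1)]

# Synthetic two-view toy data (hypothetical, not gene expression data).
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 10)
signal = 2.0 * y - 1.0
X1 = np.outer(rng.standard_normal(6), signal) + 0.1 * rng.standard_normal((6, 20))
X2 = np.outer(rng.standard_normal(5), signal) + 0.1 * rng.standard_normal((5, 20))

Ws, means = fit_projections([X1, X2], d=2)
Z = fuse([X1, X2], Ws, means)
pred = nearest_neighbor_predict(Z, y, Z)
print(Z.shape, pred)
```

To move from this sketch toward the method described above, the two lines marked as assumptions inside `fit_projections` would be replaced by the matrices defined in Eq. (7) and Eq. (13) of the paper.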

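Table 1  Recognition rates on the lung cancer gene expression dataset as the number of training samples varies

Method	5 training samples	10 training samples	15 training samples	20 training samples	25 training samples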
SOPACA	98.66$ \pm $0.85	99.08$ \pm $0.91	98.70$ \pm $1.22	98.81$ \pm $0.94	99.65$ \pm $0.74
MCCA	96.08$ \pm $2.37	98.16$ \pm $1.11	97.92$ \pm $1.40	97.61$ \pm $1.00	99.30$ \pm $1.11
LDA	96.70$ \pm $2.05	98.05$ \pm $1.22	98.31$ \pm $1.23	98.51$ \pm $1.00	99.30$ \pm $0.91
GrMCCs	94.64$ \pm $3.10	96.55$ \pm $2.82	98.05$ \pm $2.31	97.61$ \pm $2.01	98.60$ \pm $1.38
LMCCA	97.01$ \pm $1.41	98.28$ \pm $1.12	98.18$ \pm $1.40	98.36$ \pm $1.10	99.30$ \pm $0.91

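Table 2  Average recognition rates on the colorectal cancer gene expression dataset

Method	2 training samples	3 training samples	4 training samples	5 training samples	6 training samples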
SOPACA	98.67$ \pm $1.72	99.29$ \pm $1.51	99.23$ \pm $1.62	99.58$ \pm $1.32	99.09$ \pm $1.92
MCCA	95.67$ \pm $3.52	97.50$ \pm $2.41	97.31$ \pm $2.60	99.17$ \pm $1.76	98.18$ \pm $2.35
LDA	95.00$ \pm $2.83	96.07$ \pm $2.03	95.77$ \pm $4.23	96.67$ \pm $2.64	97.73$ \pm $2.40
GrMCCs	93.33$ \pm $8.75	94.29$ \pm $2.50	96.92$ \pm $3.53	97.50$ \pm $2.15	98.64$ \pm $2.20
LMCCA	96.67$ \pm $2.22	96.07$ \pm $2.41	97.31$ \pm $3.17	97.50$ \pm $2.15	97.73$ \pm $2.40
