A Gene Feature Extraction Method Based on Across-view Similarity Order Preserving
Abstract: Gene expression data are typically high-dimensional, have few samples, and have imbalanced class distributions. Extracting effective features from such data is a key problem in gene classification. Drawing on correlation analysis theory, this paper constructs a discrimination-sensitive within-view similarity order preserving scatter and constrains the discrimination-sensitive between-view similarity correlation, which yields a new gene feature extraction method, Similarity Order Preserving Across-view Correlation Analysis (SOPACA). The proposed method preserves the intra-class compactness and the similarity order of features across different views while achieving large inter-class separation. Experimental results on cancer gene expression datasets demonstrate the effectiveness of the method.
Algorithm 1 Steps of the SOPACA method
Input: multi-view dataset $\{ {\boldsymbol{X}}^{(i)} = ({\boldsymbol{x}}_1^{(i)}, {\boldsymbol{x}}_2^{(i)}, \cdots, {\boldsymbol{x}}_n^{(i)}) \in {\boldsymbol{R}}^{d_i \times n} \}_{i = 1}^m$
Output: class labels of the gene samples
(1) Construct the within-view similarity order preserving scatter matrix ${\boldsymbol{S}}_w^{(i)}$ and the between-view similarity correlation matrix ${\boldsymbol{S}}_b^{(ij)}$ using Eq. (7) and Eq. (13), respectively;
(2) Solve the Lagrange function in Eq. (16) for the eigenvalues $\lambda$ and the corresponding eigenvectors;
(3) Obtain the correlation projection matrices $\{ {\boldsymbol{W}}_i = ({\boldsymbol{\alpha}}_1^{(i)}, {\boldsymbol{\alpha}}_2^{(i)}, \cdots, {\boldsymbol{\alpha}}_d^{(i)}) \}_{i = 1}^m$ using Eq. (20);
(4) Obtain the fused discriminant vector ${\boldsymbol{Z}}$ using Eq. (21);
(5) Classify the discriminant vector ${\boldsymbol{Z}}$ with a nearest-neighbor classifier based on Euclidean distance to obtain the class labels of the gene samples.

Table 1 Recognition rates (%) on the lung cancer gene expression dataset for different numbers of training samples
Method   5 training samples   10 training samples   15 training samples   20 training samples   25 training samples
SOPACA   98.66$\pm$0.85       99.08$\pm$0.91        98.70$\pm$1.22        98.81$\pm$0.94        99.65$\pm$0.74
MCCA     96.08$\pm$2.37       98.16$\pm$1.11        97.92$\pm$1.40        97.61$\pm$1.00        99.30$\pm$1.11
LDA      96.70$\pm$2.05       98.05$\pm$1.22        98.31$\pm$1.23        98.51$\pm$1.00        99.30$\pm$0.91
GrMCCs   94.64$\pm$3.10       96.55$\pm$2.82        98.05$\pm$2.31        97.61$\pm$2.01        98.60$\pm$1.38
LMCCA    97.01$\pm$1.41       98.28$\pm$1.12        98.18$\pm$1.40        98.36$\pm$1.10        99.30$\pm$0.91

Table 2 Average recognition rates (%) on the colorectal cancer gene expression dataset
Method   2 training samples   3 training samples   4 training samples   5 training samples   6 training samples
SOPACA   98.67$\pm$1.72       99.29$\pm$1.51       99.23$\pm$1.62       99.58$\pm$1.32       99.09$\pm$1.92
MCCA     95.67$\pm$3.52       97.50$\pm$2.41       97.31$\pm$2.60       99.17$\pm$1.76       98.18$\pm$2.35
LDA      95.00$\pm$2.83       96.07$\pm$2.03       95.77$\pm$4.23       96.67$\pm$2.64       97.73$\pm$2.40
GrMCCs   93.33$\pm$8.75       94.29$\pm$2.50       96.92$\pm$3.53       97.50$\pm$2.15       98.64$\pm$2.20
LMCCA    96.67$\pm$2.22       96.07$\pm$2.41       97.31$\pm$3.17       97.50$\pm$2.15       97.73$\pm$2.40
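The flow of steps (2) to (5) in Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions. Eq. (7), (13), (16), (20) and (21) are not reproduced in this section, so the sketch uses plain covariance and cross-covariance matrices as stand-ins for ${\boldsymbol{S}}_w^{(i)}$ and ${\boldsymbol{S}}_b^{(ij)}$ (which reduces it to ordinary multiset CCA). It poses step (2) as a generalized eigenproblem with a small ridge term `reg`, and fuses views by stacking the projected features. The function names `solve_projections`, `fuse` and `nearest_neighbor`, and the synthetic two-view data, are hypothetical.

```python
import numpy as np


def solve_projections(S_w, S_b, d, reg=1e-6):
    """Steps 2-3 (assumed form): solve A w = lambda * B w, keep top-d eigenvectors.

    S_w: list of m symmetric (d_i, d_i) within-view matrices.
    S_b: dict {(i, j): (d_i, d_j) matrix} for view pairs with i < j.
    Returns per-view projection matrices W_i of shape (d_i, d) and eigenvalues.
    """
    dims = [S.shape[0] for S in S_w]
    off = np.concatenate(([0], np.cumsum(dims)))
    A = np.zeros((off[-1], off[-1]))  # between-view blocks (off-diagonal)
    B = np.zeros((off[-1], off[-1]))  # within-view blocks (diagonal)
    for i, S in enumerate(S_w):
        # The ridge term keeps B positive definite (an assumption of this sketch).
        B[off[i]:off[i + 1], off[i]:off[i + 1]] = S + reg * np.eye(dims[i])
    for (i, j), S in S_b.items():
        A[off[i]:off[i + 1], off[j]:off[j + 1]] = S
        A[off[j]:off[j + 1], off[i]:off[i + 1]] = S.T
    # Whitening with B = L L^T turns the problem into a symmetric eigenproblem.
    L = np.linalg.cholesky(B)
    C = np.linalg.solve(L, np.linalg.solve(L, A).T)  # L^{-1} A L^{-T}
    lam, V = np.linalg.eigh((C + C.T) / 2)
    top = np.argsort(lam)[::-1][:d]
    W_full = np.linalg.solve(L.T, V[:, top])  # map back: w = L^{-T} v
    return [W_full[off[i]:off[i + 1]] for i in range(len(dims))], lam[top]


def fuse(W, X):
    """Step 4 (assumed fusion rule): stack the projected views row-wise."""
    return np.vstack([Wi.T @ Xi for Wi, Xi in zip(W, X)])


def nearest_neighbor(Z_train, y_train, Z_test):
    """Step 5: 1-nearest-neighbor classification with Euclidean distance."""
    diff = Z_test[:, :, None] - Z_train[:, None, :]
    dist2 = (diff ** 2).sum(axis=0)  # shape (n_test, n_train)
    return y_train[np.argmin(dist2, axis=1)]


# Synthetic two-view, two-class data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n, n_train = 60, 40
y = np.arange(n) % 2
t = 2.0 * y - 1.0  # shared class signal, -1 or +1
X = [np.outer(np.ones(5), t) + 0.3 * rng.standard_normal((5, n)),
     np.outer(np.ones(4), t) + 0.3 * rng.standard_normal((4, n))]
X_tr = [Xi[:, :n_train] for Xi in X]
means = [Xi.mean(axis=1, keepdims=True) for Xi in X_tr]
X_tr = [Xi - mu for Xi, mu in zip(X_tr, means)]
X_te = [Xi[:, n_train:] - mu for Xi, mu in zip(X, means)]

# Stand-ins for Eq. (7) and Eq. (13): plain covariance and cross-covariance.
S_w = [Xi @ Xi.T / n_train for Xi in X_tr]
S_b = {(0, 1): X_tr[0] @ X_tr[1].T / n_train}

W, lam = solve_projections(S_w, S_b, d=1)
pred = nearest_neighbor(fuse(W, X_tr), y[:n_train], fuse(W, X_te))
acc = float(np.mean(pred == y[n_train:]))
print(f"top eigenvalue {lam[0]:.3f}, test accuracy {acc:.2f}")
```

To move from this stand-in toward SOPACA, only the construction of `S_w` and `S_b` would need to change to the paper's similarity order preserving definitions. The eigen-decomposition, projection, fusion and classification stages would keep the same shape, provided the paper's Eq. (16) leads to a generalized eigenproblem of this form.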