2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

A Lp-Norm Singular Value Decomposition Method for Robust Tumor Clustering

Xiang-Zhen Kong, Jin-Xing Liu*, Chun-Hou Zheng*, Mi-Xiao Hou, Yao u
School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826 Shandong, China
* Corresponding authors

Abstract—Tumor clustering based on biomolecular data plays a very important role in cancer class discovery. To further improve the robustness, stability, and accuracy of tumor clustering, we develop a novel dimension reduction method named Lp-norm singular value decomposition (PSVD), which seeks a low-rank approximation matrix to the biomolecular data. To enhance robustness to outliers, the Lp-norm is taken as the error function and the Schatten p-norm is used as the regularization function in our optimization model. To evaluate the performance of PSVD, the K-means clustering method is then employed for tumor clustering based on the low-rank approximation matrix. Extensive experiments are performed on a gene expression dataset and on cancer genome datasets. All experimental results demonstrate that the PSVD-based method outperforms many existing methods. In particular, it is experimentally shown that the proposed method is efficient for processing higher-dimensional data, with good robustness and superior time performance.

Keywords—dimension reduction; robust tumor clustering; Schatten p-norm; Lp-norm; singular value decomposition

I. INTRODUCTION

In the past decades, many tumor clustering approaches have been developed and used to perform cancer class discovery from biomolecular data. A reliable and precise identification of the type of tumor is essential for cancer diagnosis and treatment. However, the typical characteristic of gene expression data or genomic data is high dimension, low sample size [1], which makes most standard statistical methods lose effectiveness.
Firstly, including too many variables may decrease the clustering accuracy of the samples and make traditional clustering rules difficult to set. Secondly, irrelevant or noisy variables in the original data may also degrade the performance of the estimated clustering algorithms. Despite these difficulties, much of the research on tumor clustering in statistical machine learning has demonstrated its potential power for tumor-type identification [2] [3] [4] [5] [6] [7]. Most of these works focus on data reduction and denoising. That is, the clustering proceeds in two steps. The first step extracts features from the gene expression data. The second step uses a classic classifier or clustering algorithm to classify the tumor samples based on the extracted features. Zheng et al. extracted features using nonnegative matrix factorization (NMF) [3] [7] and sparse nonnegative matrix factorization (SNMF) [4] to improve the performance of classification [5]. Lee proposed the sparse singular value decomposition (SSVD) for biclustering of gene expression data [8]. More and more dimension reduction methods, such as sparse principal component analysis (SPCA) [9] [10] and penalized matrix decomposition (PMD) [11] [12] [13], have been used successfully to analyze gene expression data. Most of the methods mentioned above take advantage of norm constraints. For example, an L0-norm penalty was used to analyze gene expression data by Journee [10]. The L1-norm was taken as the regularization function in SSVD [8] and PMD [11]. However, non-robustness, overly long convergence times, and too many iterations often make these methods unsatisfactory. In this paper, a new dimension reduction method based on singular value decomposition (SVD), named Lp-norm singular value decomposition (PSVD), is proposed.
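The motivation for replacing a squared error with an Lp error (p < 2) can be illustrated with a short, self-contained numpy sketch. The matrices and the function name `lp_error` below are invented for illustration; they are not from the paper:

```python
import numpy as np

def lp_error(residual, p):
    """Element-wise error sum(|r_ij|**p); p == 2 gives the squared Frobenius norm."""
    return float(np.sum(np.abs(residual) ** p))

# A residual matrix with one gross outlier, as a single corrupted
# expression entry might produce (values are made up for illustration).
clean = np.full((3, 3), 0.1)
corrupted = clean.copy()
corrupted[0, 0] = 100.0  # single outlier

ratio_p2 = lp_error(corrupted, 2.0) / lp_error(clean, 2.0)
ratio_p05 = lp_error(corrupted, 0.5) / lp_error(clean, 0.5)
# The outlier inflates the p = 2 loss by orders of magnitude more than the
# p = 0.5 loss, so a p < 2 error function downweights gross outliers.
assert ratio_p2 > 1000 * ratio_p05
```

This is why an Lp error with small p is commonly described as robust: a single large residual contributes |r|^p rather than r^2 to the loss.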
Compared with the SSVD proposed by Lee for biclustering [8], PSVD uses the Lp-norm in place of the squared Frobenius norm as the error function, to improve robustness to outliers in the data, and the Schatten p-norm is used as the regularization function in place of the L1-norm for the sparse vectors u and v. The augmented Lagrangian multiplier (ALM) method [14] [15] [16] and the alternating direction method (ADM) [17] are employed to deal with the nonsmooth and somewhat intractable problem in the optimization model. The technical details of PSVD are provided in Section II. To evaluate the validity of PSVD, the traditional and classic unsupervised clustering method K-means is then used for tumor clustering based on the low-rank approximation matrix obtained by PSVD. Four datasets, including one gene expression dataset and three genomic datasets from The Cancer Genome Atlas (TCGA), are used in the experiments. Whether compared with other dimension reduction methods such as SSVD, SPCA, and NMF, or compared with ordinary clustering methods such as K-means, the experimental results show that our method is efficient and feasible. The clustering accuracy is improved, the convergence time is shorter, and, especially, PSVD proves more efficient for processing higher-dimensional, unbalanced genome data from TCGA.

The rest of the paper is organized as follows. Section II describes the PSVD algorithm for seeking a low-rank approximation matrix of biomolecular data. Tumor clustering experiments based on the low-rank approximation matrix obtained by PSVD are reported in Section III. Section IV concludes the paper and outlines directions for future research.

II. METHODOLOGY

A. Definitions of the Lp-norm and the Schatten p-norm

The Lp-norm of a matrix X \in R^{n \times m} is defined as

\|X\|_p = \left( \sum_i \sum_j |x_{ij}|^p \right)^{1/p} \quad (0 < p \le 2). \quad (1)

Nie et al. used the Lp-norm (0 < p \le 2) as the error function to improve robustness to outliers in given data [18] when solving the robust matrix completion problem. The extended Schatten p-norm (0 \le p \le 1) of a matrix X \in R^{n \times m} is defined by

\|X\|_{S_p}^p = \sum_{i=1}^{\min\{n, m\}} \sigma_i^p, \quad (2)

where \sigma_i is the i-th singular value of X.
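Both norms are straightforward to compute; a minimal numpy sketch (the function names are mine, not the paper's):

```python
import numpy as np

def lp_norm_p(X, p):
    """(||X||_p)^p = sum over entries of |x_ij|^p, the element-wise Lp error."""
    return float(np.sum(np.abs(X) ** p))

def schatten_p(X, p):
    """(||X||_Sp)^p = sum of sigma_i^p over the singular values.
    With the convention 0**0 == 0, p == 0 returns the rank of X."""
    s = np.linalg.svd(X, compute_uv=False)
    if p == 0:
        return int(np.sum(s > 1e-12))  # rank, up to a small numerical tolerance
    return float(np.sum(s ** p))

X = np.diag([3.0, 4.0])          # singular values are 4 and 3
assert schatten_p(X, 1) == 7.0   # p = 1: nuclear norm, 4 + 3
assert schatten_p(X, 0) == 2     # p = 0: the rank of X
assert lp_norm_p(X, 2) == 25.0   # p = 2: squared Frobenius norm
```

For a diagonal matrix the singular values are just the absolute diagonal entries, which makes the three special cases easy to check by hand.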
Equation (2) equals the rank of X when p = 0, with the convention 0^0 = 0. When p \to 0, the Schatten p-norm of X approximates the rank of X [19]. When p = 1, the Schatten p-norm is the nuclear norm, usually written \|X\|_*.

B. The Method of PSVD

The PSVD procedure is presented in this section. Let X be an n \times m biomolecular matrix whose every row represents the expression levels of all genes in one sample. The rank-K (K \le r) approximation to X can be written as

X^{(K)} = \sum_{k=1}^{K} d_k u_k v_k^T. \quad (3)

X^{(K)} gives the rank-K matrix approximation to X in the sense that X^{(K)} minimizes the squared Frobenius norm:

X^{(K)} = \arg\min_{X^* \in A_K} \|X - X^*\|_F^2, \quad (4)

where A_K is the set of all n \times m matrices of rank K [20]. According to [20], gene expression data and genomic data always lie near a low-dimensional subspace. Therefore, PSVD is carefully designed to seek a low-rank approximation matrix of the original data. That means the rank-K approximation matrix obtained by PSVD is the summation of the first K layers, extracted in the sense of the Lp-norm. Our presentation focuses on how to extract the first PSVD layer; the subsequent layers can be extracted sequentially from the residual matrices obtained by subtracting the preceding layers. The optimization model of the first layer is as follows:

(u, d, v) = \arg\min_{u, d, v} \|X - d u v^T\|_p^p + \lambda_1 \|d u\|_{S_p}^p + \lambda_2 \|d v\|_{S_p}^p, \quad (5)

where 0 < p \le 2 according to the literature [21], d is a positive scalar, u is a unit n-vector, and v is a unit m-vector. \lambda_1 and \lambda_2 are penalty parameters that balance against the goodness-of-fit measure \|X - d u v^T\|_p^p. We note that the first PSVD layer d u v^T is the best rank-one matrix approximation of X under the Lp-norm. When p = 2 and \lambda_1 = \lambda_2 = 0, model (5) reduces to the plain SVD layers. When the error term is the squared Frobenius norm and the penalties are L1-norms, (5) is the SSVD model proposed by Lee et al. [8], who obtained the sparse vectors u and v using lasso regression [22] [23] [24] and the Bayesian information criterion (BIC) [25].

To obtain the sparse u and v in the first PSVD layer, (5) is converted into comparatively easier-to-solve forms. Fixing v, minimization of (5) with respect to \tilde{u} = d u is equivalent to minimizing

\|X - \tilde{u} v^T\|_p^p + \lambda_1 \|\tilde{u}\|_p^p. \quad (6)

Since v is a unit vector, (6) can be converted to the form

\|X v - \tilde{u}\|_p^p + \lambda_1 \|\tilde{u}\|_p^p. \quad (7)

Similarly, for fixed u, minimization of (5) with respect to \tilde{v} = d v is equivalent to minimizing

\|X - u \tilde{v}^T\|_p^p + \lambda_2 \|\tilde{v}\|_p^p, \quad (8)

and, by the same argument as for (7), since u is a unit vector, (8) is equivalent to minimizing

\|X^T u - \tilde{v}\|_p^p + \lambda_2 \|\tilde{v}\|_p^p. \quad (10)

Now, to obtain the first PSVD layer, we have converted the optimization model (5) into (7) and (10). The main task is then to solve (7) and (10). These two formulas are essentially of the same form, so we present the solution to (7) in what follows; (10) can be solved in the same manner.

Equation (7) is intractable since both of its terms are nonsmooth. The augmented Lagrangian multiplier (ALM) method [14] [15] [16] and the alternating direction method (ADM) [17] are recommended to deal with this problem [21]. The ALM method is very attractive because it has been proved that, under rather mild conditions, the ALM algorithm converges Q-linearly to the optimal solution [16]. To solve (7) using the ALM method, we write E for X v - \tilde{u}, W for \tilde{u}, and \gamma for the rescaled \lambda_1. Equation (7) can then be equivalently rewritten as

\min \|E\|_p^p + \gamma \|W\|_p^p, \quad \text{s.t. } E = X v - \tilde{u}, \; W = \tilde{u}. \quad (11)

According to ALM, the following problem needs to be solved:

\min \|E\|_p^p + \gamma \|W\|_p^p + \frac{\mu}{2} \left\| E - (X v - \tilde{u}) + \frac{\Lambda}{\mu} \right\|_2^2 + \frac{\mu}{2} \left\| \tilde{u} - W + \frac{\Omega}{\mu} \right\|_2^2. \quad (12)

Minimizing (12) jointly over the three variables \tilde{u}, E, and W is difficult and costly, so ADM is adopted: the subproblem in each variable is easily optimized when the other two are fixed. In this way, (12) results in the following three subproblems.

Problem 1: Fixing E and W, (12) can be written as

\min_{\tilde{u}} \|\tilde{u} - \zeta\|_2^2 + \|\tilde{u} - \tau\|_2^2, \quad (13)

where \zeta = X v - E - \Lambda / \mu and \tau = W - \Omega / \mu. Equation (13) amounts to minimizing a quadratic function, and the solution is easily obtained as

\tilde{u} = \frac{\zeta + \tau}{2}. \quad (14)
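The midpoint update \tilde{u} = (\zeta + \tau)/2 above can be sanity-checked numerically. The sketch below uses made-up \zeta and \tau vectors and confirms that the midpoint minimizes the sum of the two quadratics:

```python
import numpy as np

# The subproblem objective is ||u - zeta||^2 + ||u - tau||^2, a strictly
# convex quadratic in u; its unique minimizer is the midpoint (zeta + tau)/2.
rng = np.random.default_rng(0)
zeta = rng.normal(size=6)
tau = rng.normal(size=6)

def objective(u):
    return float(np.sum((u - zeta) ** 2) + np.sum((u - tau) ** 2))

u_star = (zeta + tau) / 2.0

# The closed form beats (or ties) random perturbations around it.
for _ in range(200):
    u_other = u_star + rng.normal(scale=0.1, size=6)
    assert objective(u_star) <= objective(u_other) + 1e-12
```

Setting the gradient 2(u - \zeta) + 2(u - \tau) to zero gives the same answer analytically, which is why the update is available in closed form.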
Problem 2: Fixing \tilde{u} and W, (12) is simplified to

\min_{E} \|E\|_p^p + \frac{\mu}{2} \|E - \eta\|_2^2, \quad (15)

where \eta = X v - \tilde{u} - \Lambda / \mu.

Problem 3: Fixing \tilde{u} and E, (12) can be written as

\min_{W} \gamma \|W\|_p^p + \frac{\mu}{2} \|W - \psi\|_2^2, \quad (16)

where \psi = \tilde{u} + \Omega / \mu.

The optimal solutions to the subproblems (15) and (16) are described in detail in the literature [21] [26], and we do not repeat them here.

Algorithm 1. Algorithm to extract the first PSVD layer
Input: biomolecular data matrix X \in R^{n \times m}; Lp-norm and Schatten p-norm parameter p; regularization parameters \lambda_1, \lambda_2.
Output: the first PSVD layer of X, X_{(1)}.
1. Decompose X using the standard SVD and take the first SVD layer, a triple {d_old, u_old, v_old}.
2. Set v \leftarrow v_old and \tilde{u} \leftarrow d_old u_old.
3. Set \rho > 1; initialize \mu > 0, \Lambda, \Omega, E, W.
   while not converged do
     update \tilde{u} by (14);
     update E by the optimal solution to (15);
     update W by the optimal solution to (16);
     update \Lambda \leftarrow \Lambda + \mu (E - (X v - \tilde{u})) and \Omega \leftarrow \Omega + \mu (\tilde{u} - W);
     update \mu \leftarrow \rho \mu.
   end while
4. Set u_new \leftarrow \tilde{u}; v_new can be obtained in the same manner as u_new.
5. Set \hat{u} = u_new / \|u_new\|_2, \hat{v} = v_new / \|v_new\|_2, \hat{d} = \hat{u}^T X \hat{v}, and X_{(1)} = \hat{d} \hat{u} \hat{v}^T.

After the first K PSVD layers are extracted, the rank-K approximation to the matrix X is computed as X^{(K)} = \sum_{k=1}^{K} \hat{d}_k \hat{u}_k \hat{v}_k^T.

III. EXPERIMENTS AND DISCUSSION

Four datasets are used to demonstrate the performance of PSVD, including one high-dimensional gene expression dataset and three higher-dimensional genomic datasets from TCGA. For every dataset, we employ PSVD to extract the first K layers and obtain a rank-K approximation matrix. Based on the low-rank matrix, the classical unsupervised clustering algorithm K-means is used to evaluate the performance of PSVD. As comparisons, experiments are also carried out with existing methods such as SSVD, NMF, and SPCA. The cancer type information is used only a posteriori, to interpret the analysis results.

A. Lung Cancer Data

We especially compare PSVD with SSVD on the same subset of the lung cancer gene expression dataset [27] that was used by Lee to illustrate the SSVD algorithm [8]. This dataset contains 12,625 genes for 56 samples. The samples comprise 4 types of lung cancer.
There are 20 pulmonary carcinoid samples (Carcinoid), 13 colon metastases (Colon), 17 normal lung samples (Normal), and 6 small cell lung carcinoma samples (SmallCell). Similar to PSVD, SSVD is also used to seek a low-rank approximation to the data matrix by extracting the first K layers; SSVD uses the Frobenius norm as the error function and the L1-norm as the regularization function for the sparse vectors u and v. The first three layers are extracted sequentially using PSVD and SSVD, respectively. The reason for considering only the first three layers is that the first three singular values are much bigger than the rest [8].

For every layer, we compare the convergence time of the two algorithms. Table I lists the results, obtained with a Matlab program running on a Windows 7 desktop with an Intel Core 2 Duo CPU with a clock speed of 3.30 GHz. PSVD converges within a few seconds per layer, while SSVD needs tens of seconds or more. Evidently, the time performance of our algorithm is far beyond that of SSVD, which matters greatly for dimension reduction of high-dimensional data.

To further demonstrate the performance of PSVD, we plot the entries of the first three sparse left singular vectors \hat{u}_k (k = 1, 2, 3) obtained by PSVD and by SSVD in scatter plots. The subject grouping/clustering can be seen in Fig. 1, where different colors and symbols are used to make the cancer types easy to distinguish. Comparing Fig. 1(b) with Fig. 1(a), it is obvious that the first two vectors of PSVD reveal four sample clusters, whereas the first two vectors of SSVD reveal only three, mixing Colon and SmallCell. The remaining panels of Fig. 1(b) and Fig. 1(a) also demonstrate the better discrimination of the PSVD vectors compared with the SSVD vectors.

To better analyze the advantage of the proposed method, the K-means clustering algorithm is employed to evaluate the performance of PSVD based on the sparse rank-3 approximation matrix \sum_{k=1}^{3} \hat{d}_k \hat{u}_k \hat{v}_k^T of the raw data. The experimental results are also compared with those of the competitive method SSVD.
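The evaluation pipeline, a low-rank approximation followed by K-means, can be sketched as below. The Lp-norm solver itself is not reproduced here; plain truncated SVD stands in for the PSVD layers, the minimal Lloyd's K-means is my own, and all data and names are illustrative only:

```python
import numpy as np

def rank_k_approx(X, k):
    """Plain truncated SVD, used here only as a stand-in for the PSVD layers."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k] @ Vt[:k, :]

def kmeans(X, k, n_iter=100):
    """Minimal Lloyd's algorithm with greedy farthest-point initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

# Toy stand-in for a sample-by-gene matrix: two well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 10)),
               rng.normal(8.0, 1.0, size=(20, 10))])
labels = kmeans(rank_k_approx(X, 2), k=2)
# Samples from the same group should share one cluster label.
assert len(set(labels[:20].tolist())) == 1
assert len(set(labels[20:].tolist())) == 1
```

Clustering the rank-k approximation rather than the raw matrix is the point of the pipeline: the truncation suppresses noise directions before K-means sees the data.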
The clustering accuracy of the proposed PSVD-Kmeans method is 96.43% (54/56); only two Carcinoid samples are misclassified as Colon. By contrast, SSVD-Kmeans achieves an accuracy of 83.93% (47/56). On this dataset, the number of PSVD layers extracted is chosen the same way as recommended for the SSVD method [8], and we verified by experiment that the first three layers give the best clustering results.

TABLE I. THE RUNNING TIME TO EXTRACT THE FIRST K LAYERS USING SSVD AND PSVD (layers K = 1, 2, 3; convergence time in seconds, for SSVD and PSVD).

Fig. 1. (a) Scatter plots of the entries of the first three left sparse singular vectors \hat{u}_k (k = 1, 2, 3) using SSVD. (b) Scatter plots of the entries of the first three left sparse singular vectors \hat{u}_k (k = 1, 2, 3) using PSVD.

From the experimental results on this dataset, PSVD, also a dimension reduction method based on extracting sparse vectors, can make the class structure more evident than SSVD. In both time performance and clustering accuracy, PSVD outperforms SSVD.

B. Genome Data

In the experiments above, PSVD shows advantages over other methods. In this section, we employ PSVD on higher-dimensional data to confirm its performance on genome data from TCGA. Three genome datasets are used to evaluate the PSVD-based method in comparison with NMF, SVD, and SPCA. SSVD is not taken as a competitor because its convergence time turns out to be too long on higher-dimensional genomic data, and sometimes it fails to converge at all, crashing the computer; this suggests SSVD may not be suitable for processing higher-dimensional data. The three datasets are the Colorectal Cancer (CRC) dataset, the Cholangiocarcinoma (CHO) dataset, and the Squamous Cell Carcinoma of Head and Neck (HNSC) dataset. All of them include 20,502 officially identified genes, but the numbers of samples differ: 281 (CRC), 45 (CHO), and 418 (HNSC), respectively. Every dataset is a mixture of two types of samples, positive and negative. We abbreviate positive as P, representing diseased samples, and negative as N, representing normal samples. There are 19, 9, and 20 N samples in the CRC, CHO, and HNSC datasets, respectively.

TABLE II. GENOME DATA: THE PERFORMANCE OF THE COMPETITIVE METHODS ON THREE DIFFERENT DATASETS FROM TCGA.

Genome dataset | Kmeans | NMF | SVD-Kmeans | SPCA-Kmeans | PSVD-Kmeans
CRC | 82.56% (232/281) | 82.56% (232/281) | 79.00% (222/281) | 92.88% (261/281) | 93.4% (263/281)
CHO | 100.00% | 100.00% | 100.00% | 100.00% | 100.00%
HNSC | 79.90% (334/418) | 95.45% (399/418) | 95.45% (399/418) | 93.06% (389/418) | 96.17% (402/418)

Firstly, we study the appropriate number of layers to be extracted by PSVD. The first five layers are extracted sequentially. Based on the obtained rank-K (K = 1, 2, 3, 4, 5) approximation matrices of the raw data, the PSVD-Kmeans clustering accuracies are compared. Fig. 2 shows that for the CRC and HNSC datasets the clustering accuracy is best when K = 3; for the CHO dataset, the results for different K are identical. We surmise that the P and N samples are already distinctly discriminated in the original CHO dataset.

Next, the PSVD-based method is compared with NMF, SVD, and SPCA. We also tune the parameters of the competing algorithms to obtain their best performance. According to the analyses above, the first three layers are extracted using PSVD and SVD to obtain the best K-means clustering performance on these three genomic datasets.

Finally, we investigate the time performance of PSVD on these datasets. Taking the HNSC dataset (which contains the largest number of samples of the three) as an example, it takes 4.6, 6.33, and 6.34 seconds to extract the first three layers, respectively, running the Matlab program on a Windows 7 desktop with an Intel Core 2 Duo CPU with a clock speed of 3.30 GHz. The running times on the other two datasets for the corresponding layers are shorter. This shows that PSVD is very effective for dimension reduction of higher-dimensional data.
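The accuracy rates quoted here and in Table II require matching arbitrary K-means cluster ids to tumor classes. A common way to do this, assumed here since the paper does not spell out its matching, is to take the best one-to-one mapping of cluster ids onto class labels; the function name `matched_accuracy` is mine:

```python
from itertools import permutations

import numpy as np

def matched_accuracy(true_labels, cluster_labels):
    """Clustering accuracy under the best one-to-one mapping of cluster ids
    onto class labels (brute force over permutations; fine for a handful of
    classes, and it assumes as many cluster ids as classes)."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    classes = np.unique(true_labels)
    ids = np.unique(cluster_labels)
    best = 0.0
    for perm in permutations(classes):
        mapping = dict(zip(ids, perm))
        mapped = np.array([mapping[c] for c in cluster_labels])
        best = max(best, float(np.mean(mapped == true_labels)))
    return best

# Two clusterings of 8 samples from classes 0/1: one is perfect up to a
# label swap, the other makes a single mistake.
truth = [0, 0, 0, 0, 1, 1, 1, 1]
assert matched_accuracy(truth, [1, 1, 1, 1, 0, 0, 0, 0]) == 1.0
assert matched_accuracy(truth, [1, 1, 1, 0, 0, 0, 0, 0]) == 0.875
```

Counts such as 96.43% (54/56) are exactly this matched fraction: the number of samples whose mapped cluster agrees with the true tumor type, over the total.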
As one can see in Table II, PSVD performs better than the competitive methods. On the CRC dataset, PSVD performs best, with a clustering accuracy of 93.4%. On the CHO dataset, all methods distinguish the P and N samples correctly. On the HNSC dataset, PSVD achieves the best accuracy, 96.17%, over the other methods.

Fig. 2. Genome data: the accuracy rate of clustering by extracting K (K = 1, 2, 3, 4, 5) PSVD layers on the three genomic datasets CRC, CHO, and HNSC.

IV. CONCLUSION AND FUTURE WORK

In this article, we propose a novel dimension reduction method based on the Lp-norm and the Schatten p-norm, named PSVD. The effectiveness of PSVD has been demonstrated on biomolecular data, including gene expression data and genome data. Experiments show that the proposed method is especially suitable for processing higher-dimensional data, with good robustness and excellent time performance. There are a few interesting directions for future research. Firstly, it would be interesting to evaluate PSVD as a dimension reduction tool in combination with other clustering or classification methods. Secondly, we notice that every extracted PSVD layer has a checkerboard structure; like SSVD, PSVD may therefore also be employed for biclustering. We will continue to explore the performance of PSVD in future research.

ACKNOWLEDGMENT

This work is supported in part by grants from the National Science Foundation of China, Nos. 6507, 65783, and 64058, and by the Foundation of the Science and Technology Project of Qufu Normal University, No. xkj06.

REFERENCES

[1] M. West, "Bayesian factor regression models in the large p, small n paradigm," in Bayesian Statistics 7, 2003.
[2] C. H. Zheng, D. S. Huang, L. Zhang, and X. Z. Kong, "Tumor clustering using nonnegative matrix factorization with gene selection," IEEE Trans. Inf. Technol. Biomed., vol. 13, Jul. 2009.
[3] J. P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov, "Metagenes and molecular pattern discovery using matrix factorization," Proc. Natl. Acad. Sci. U S A, vol. 101, Mar. 2004.
[4] Y. Gao and G.
Church, "Improving molecular cancer class discovery through sparse non-negative matrix factorization," Bioinformatics, vol. 21, Nov. 2005.
[5] C. H. Zheng, T. Y. Ng, L. Zhang, C. K. Shiu, and H. Q. Wang, "Tumor classification based on non-negative matrix factorization using gene expression data," IEEE Trans. NanoBiosci., vol. 10, Jun. 2011.
[6] P. A. Jaskowiak, R. J. Campello, and I. G. Costa, "Proximity measures for clustering gene expression microarray data: a validation methodology and a comparative analysis," IEEE/ACM Trans. Comput. Biol. Bioinform., vol. 10, Jul.-Aug. 2013.
[7] X. Kong, C. Zheng, Y. Wu, and L. Zhang, "Molecular cancer class discovery using non-negative matrix factorization with sparseness constraint," in Advanced Intelligent Computing Theories and Applications (ICIC), 2007.
[8] M. Lee, H. Shen, J. Z. Huang, and J. S. Marron, "Biclustering via sparse singular value decomposition," Biometrics, vol. 66, 2010.
