Health Assessment for Piston Pump Based on Laplacian Eigenmaps-Random Forests Method

A piston pump is one of the key components in a hydraulic system; its failure may severely degrade the reliability of the hydraulic system and cause great loss. Currently, how to effectively evaluate the health condition of a piston pump remains an open problem. In this paper, a novel health assessment method based on an integration of Laplacian eigenmaps and random forests is proposed, which takes full advantage of both the correlated-feature fusion ability of Laplacian eigenmaps and the optimized feature-selection ability of random forests. The proposed method (LE-RF) is applied to piston pumps for validation. The results indicate that the Laplacian eigenmaps-random forests (LE-RF) technique provides an effective tool for piston pump health condition assessment. Compared with other manifold learning methods such as LLE and ISOMAP, as well as a popular deep learning method, the deep belief network (DBN), the LE-RF method achieves better and more accurate assessment results.

geometric features of the nonlinear data, consequently gaining key information from a complex high-dimensional space. Since the first manifold learning algorithms were published in Science, several typical algorithms have been developed, e.g., the local linear embedding algorithm (LLE) [10], Laplacian eigenmaps (LE) [11], isometric mapping (ISOMAP) [12], and the locality preserving projections algorithm (LPP) [13]. Where manifold learning has been used for health assessment: Jiang et al. [14] proposed a supervised Laplacian eigenmaps algorithm and successfully applied it to gearbox fault diagnosis. Yu et al. [15] put forward an improved LPP algorithm and used it for fault detection and bearing health assessment. Li et al. [16] applied the local tangent space alignment algorithm (LTSA) to rotating machinery and successfully used it in bearing health condition assessment. Although manifold learning methods have been applied to some common mechanical parts, their application to piston pumps and other key hydraulic components has not been demonstrated.
Manifold learning techniques have proved effective for dimensionality reduction. However, after dimensionality reduction, not all features are effective for health state evaluation, so the feature set needs to be further refined to retain the degradation information, and a proper feature search and selection method is required. As a multiple-classifier ensemble method [17], the random forests algorithm, proposed by Breiman [18] and applied in many fields [19], especially machinery fault diagnosis [20][21][22], is taken into consideration in this study. Random forests build a large number of decision trees [23]; through bagging [24] and related resampling ideas [25], each tree is trained on its own random subspace of the original dataset. Both the variance and the bias of classification are considered in order to improve the generalization properties.
Considering the above, this paper combines manifold learning with random forests and proposes the Laplacian eigenmaps-random forests (LE-RF) method for the health assessment of piston pumps. The proposed fusion method aims to find the intrinsic features; it can better identify piston pump health conditions while reducing irrelevant information.
This paper is organized as follows. In Section II, the feature extraction method applied in this study is introduced. In Section III, the principle of the proposed LE-RF method is described theoretically. In Section IV, the experimental setup for the piston pump is given. In Section V, the

Original high-dimensional features formation
The extraction of features is key to piston pump health status evaluation. However, the vibration signals contain noise and redundant information. By means of feature extraction, the raw signals are converted into a high-dimensional feature space, and the characteristics of the vibration signals can then be analyzed to evaluate the health status.
The wavelet packet decomposition [26,27], with a pair of high-pass and low-pass filters, can extract both the high- and low-frequency components of signals. Compared with wavelet packet decomposition, plain wavelet decomposition [28] offers insufficient resolution in the high-frequency bands. The vibration signals of an actual piston pump are non-stationary. Therefore, wavelet packet decomposition is better suited to extracting the characteristics of the signal and obtaining latent information, since it analyzes all frequency bands of the signal.
Suppose that $\{u_n(t)\,|\,n\in\mathbb{Z}^+\}$ is the orthogonal wavelet packet with respect to the filter $\{h_k\}$, and let $d_j^n(k)$ denote the coefficients in the subspace, where $j$ is the scale index and $k$ is the translation index. The coefficients $d_{j-1}^{2n}$ and $d_{j-1}^{2n+1}$ at the coarser scale are

$d_{j-1}^{2n}(k)=\sum_m h_{m-2k}\,d_j^{n}(m), \qquad d_{j-1}^{2n+1}(k)=\sum_m g_{m-2k}\,d_j^{n}(m),$

where $\{h_k\}$ is the low-pass filter and $\{g_k\}$ is the high-pass filter. In this way $d_j^n$ is divided into the low-frequency part $d_{j-1}^{2n}$ and the high-frequency part $d_{j-1}^{2n+1}$, and the normalized decomposition coefficients of each layer can be obtained.
Since the wavelet packet decomposition forms orthogonal bases, it preserves energy:

$\|f\|^2=\sum_n\sum_k |d_j^n(k)|^2.$

After the raw signal is projected into the wavelet domain by the wavelet packet transform, the energy of the wavelet coefficients measures the content of each frequency span. An energy measure of the decomposed coefficients can therefore be defined as

$E_{j-1,n}=\sum_k |d_{j-1,n}(k)|^2,$

where $d_{j-1,n}$ ($n\in\mathbb{Z}^+$) stands for $d_{j-1}^{2n}$ or $d_{j-1}^{2n+1}$. An 8-level Daubechies-6 wavelet packet decomposition gives a full-band division into $2^8=256$ sub-bands. Figure 1 shows
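The energy-feature computation above can be sketched as follows. This is a minimal illustration, not the paper's implementation: for brevity it uses the 2-tap Haar filter pair instead of Daubechies 6 and a 3-level tree instead of 8 levels, and all function names are assumptions.

```python
import numpy as np

def haar_split(x):
    """One orthonormal Haar analysis step: low-pass and high-pass halves."""
    x = x[: (len(x) // 2) * 2]                 # truncate to even length
    low = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # approximation coefficients
    high = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # detail coefficients
    return low, high

def wavelet_packet_energies(x, level):
    """Full wavelet-packet tree: split every node at every level, then
    return the energy sum(|d|^2) of each of the 2**level terminal bands."""
    nodes = [np.asarray(x, dtype=float)]
    for _ in range(level):
        nodes = [band for n in nodes for band in haar_split(n)]
    return np.array([np.sum(n ** 2) for n in nodes])

rng = np.random.default_rng(0)
signal = rng.standard_normal(256)
energies = wavelet_packet_energies(signal, 3)  # 8 sub-band energies
# Orthonormal filters imply energy preservation across the tree
assert np.isclose(energies.sum(), np.sum(signal ** 2))
```

With orthonormal filters the sub-band energies sum exactly to the signal energy, which is the property the paper relies on when using the 256 eighth-level node energies as features.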

The LE-RF Algorithm
The flowchart of the health assessment process is shown in Figure 2. In the first stage, mixed feature parameters are extracted from the original signals in the time domain, frequency domain and time-frequency domain. Since the nature of different mechanical components varies, and so do the corresponding acquired signals, selecting characteristic parameters with higher resolution and regularity is key to condition assessment. Secondly, the LE-RF method is selected as the data fusion method to refine the features. Thirdly, the confusion matrix and the recognition accuracy are used to evaluate the health-state assessment, and a test set is added to verify the recognition of other pumps from the same batch. Figure 3 shows the architecture of the LE-RF algorithm. The manifold is approximated by an adjacency graph [29], whose weights are chosen properly via the Laplace-Beltrami operator [30]. The LE method reveals the intrinsic relationship within each cluster and divides the original data space into separated structures through correlated features. A feature vector $x_i \in \mathbb{R}^D$ is mapped to $y_i \in \mathbb{R}^{d'}$ by the LE method, and the local embedding information [11] is preserved. Next, the random forests algorithm [18] grows a certain number of trees that vote for the most probable class, which narrows down the candidate classes and benefits feature reduction.

The Laplacian eigenmaps
The Laplacian eigenmaps (LE) algorithm [11,31] is a geometry-based method for properly representing complicated data in a high-dimensional space. The algorithm exploits the relationship between the Laplace-Beltrami operator, the heat equation and the graph Laplacian matrix, which makes it computationally efficient on nonlinear data, locality-preserving, and naturally clustering. The aim of the algorithm is to keep neighboring data points close in the embedding.
Given points $x_1, x_2, \dots, x_n$ in $\mathbb{R}^D$, a weighted graph is set up with $n$ nodes, where each point is connected to its adjacent points by a set of edges. An embedding map is established by computing the eigenvectors of the graph Laplacian matrix. The steps of Laplacian eigenmaps are as follows: Step 1 (Constructing the adjacency graph): the k-nearest-neighbor method [32,33] is applied to the original dataset, and an edge between $x_i$ and $x_j$ is created if $x_j$ is among the $k$ nearest neighbors of $x_i$.
Step 2 (Choosing the edge weights): there are two different approaches for weighting the edges. The heat-kernel approach [11]: if node $x_i$ is connected to $x_j$, set $W_{ij} = e^{-\|x_i - x_j\|^2 / t}$. The simple approach: if nodes $x_i$ and $x_j$ are connected by an edge, $W_{ij} = 1$; otherwise, $W_{ij} = 0$.
Step 3 (Setting up the objective function): to keep connected data points as close as possible, the weighted graph is embedded into a lower-dimensional space. Let $Y = (y_1, y_2, \dots, y_n)$ be such a map. A reasonable criterion for choosing an appropriate map is to minimize the objective function $\sum_{i,j}\|y_i - y_j\|^2 W_{ij}$, which heavily penalizes neighboring points $x_i$ and $x_j$ being mapped far away from each other. The Laplacian matrix is postulated as $L = D - W$, where $D$ is a diagonal weight matrix whose elements are the row sums of $W$, i.e., $D_{ii} = \sum_j W_{ij}$.
It turns out that for any $y$, $\tfrac{1}{2}\sum_{i,j}(y_i - y_j)^2 W_{ij} = y^{T} L y$. Therefore, the minimization problem is now simplified to finding $\operatorname{argmin}_{y^{T} D y = 1}\, y^{T} L y$ subject to the additional constraint $y^{T} D \mathbf{1} = 0$.
The constraint $y^{T} D y = 1$ removes the arbitrary scaling factor; the matrix $D$ provides a natural measure on the vertices of the graph: the larger $D_{ii}$ is, the more important $y_i$ is. Let $\mathbf{1}$ be the constant function; mapping all vertices of $G$ onto the real number 1 yields a trivial solution, so the constraint $y^{T} D \mathbf{1} = 0$ is added to eliminate this situation.
Step 4 (Eigenmaps): compute the eigenvalues and eigenvectors of the generalized eigenvalue problem $L y = \lambda D y$, where $L = D - W$ is the Laplacian matrix, which is always a symmetric, positive semidefinite matrix, and $y$ is regarded as a function on the vertices of graph $G$.
The eigenvalues can be obtained through an eigenvalue decomposition algorithm, and the eigenvectors corresponding to the 2nd through the $(d+1)$-th smallest eigenvalues give the $d$-dimensional output feature vectors $y_i$. The LE algorithm turns the problem into an eigenvalue computation and needs no iteration, so it requires little computational time.
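The four steps above can be sketched compactly as follows. This is a minimal illustration on a toy two-cluster dataset, not the paper's code: the function name, the data, and the choice $t = 1$ are assumptions. The generalized problem $Ly = \lambda Dy$ is solved through the symmetric normalization $D^{-1/2}LD^{-1/2}$.

```python
import numpy as np

def laplacian_eigenmaps(X, n_neighbors=5, dim=2, t=1.0):
    """Minimal LE sketch: kNN graph, heat-kernel weights, L y = lambda D y."""
    n = X.shape[0]
    # Step 1: pairwise squared distances and k-nearest-neighbor edges
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(D2[i])[1 : n_neighbors + 1]   # skip the point itself
        W[i, idx] = np.exp(-D2[i, idx] / t)            # Step 2: heat kernel
    W = np.maximum(W, W.T)                             # symmetrize the graph
    # Step 3: graph Laplacian L = D - W
    Ddiag = W.sum(axis=1)
    L = np.diag(Ddiag) - W
    # Step 4: generalized eigenproblem via D^{-1/2} L D^{-1/2}
    Dm = 1.0 / np.sqrt(Ddiag)
    Lsym = Dm[:, None] * L * Dm[None, :]
    vals, vecs = np.linalg.eigh(Lsym)                  # ascending eigenvalues
    Y = (vecs * Dm[:, None])[:, 1 : dim + 1]           # drop trivial solution
    return Y

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 5)), rng.normal(3, 0.1, (20, 5))])
Y = laplacian_eigenmaps(X, n_neighbors=5, dim=2)       # 40 points -> 2-D
```

Dropping the first eigenvector discards the constant solution excluded by the constraint $y^T D \mathbf{1} = 0$.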

Eigenvectors normalization
The resulting wavelet coefficient energies are normalized to limit each data point to a certain range and avoid significant effects of extreme values, which is convenient for subsequent processing. The z-score normalization formula is $z = (x - \mu)/\sigma$, where $\mu$ is the mean of the training set and $\sigma$ is the standard deviation of the same training set. A fragment of the feature table reads:

Feature - Dimension
Absolute maximum - 1
Skewness - 1
Decomposition coefficient energy (all node energies of the eighth-level WPD) - 256
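As a minimal sketch of this normalization, the z-score statistics are fitted on the training set only and then applied to any dataset; the function names are illustrative assumptions.

```python
import numpy as np

def zscore_fit(train):
    """Estimate per-feature mean and standard deviation on the training set."""
    return train.mean(axis=0), train.std(axis=0)

def zscore_apply(X, mu, sigma):
    """Apply z = (x - mu) / sigma using the training-set statistics."""
    return (X - mu) / sigma

train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mu, sigma = zscore_fit(train)
Z = zscore_apply(train, mu, sigma)
# Each feature column now has zero mean and unit variance
```

Test data must reuse the training-set `mu` and `sigma` rather than their own statistics, so that test features land on the same scale the model was trained on.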

Random forests method
The random forests method [18] is a multiple classifier assembled from many tree-type classifiers.
When a new input vector is fed into the random subspace, it is examined by every tree in the forest, and each tree casts a vote for a class. The forest then selects the class with the most votes over all trees. This procedure is known as random forests.
The training data for each tree are drawn from the training dataset with replacement. Roughly one-third of the samples are left out by the algorithm; these left-out samples, called out-of-bag samples, serve as a test dataset that yields an unbiased estimate of the classification error [19]. Owing to these samples, there is no need for cross-validation or a separate test dataset to obtain an unbiased estimate. Both low bias and low correlation are necessary for accuracy: the low bias is monitored by the out-of-bag estimate, and randomization is applied to achieve low correlation between trees.
The steps for tree growing are as follows. We randomly draw N bootstrap training samples from the original dataset, and each tree is grown on one of them; about 2/3 of the original data are used for training the tree, and the remaining 1/3 are used to validate the classifier. At each node, $m$ predictors are randomly chosen from the $M$ input variables ($m \ll M$). While the tree grows, the value of $m$ remains constant. The initial reference value is $m = \lfloor \log_2 M + 1 \rfloor$ or $m = \sqrt{M}$, and the optimal $m$ can be found by adjusting around this initial value until the out-of-bag estimate achieves its minimum error.
These $m$ predictors are used to split the node; the single predictor giving the best split is selected at each node.
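The tuning of $m$ around its reference value can be sketched with a generic random forest implementation; here scikit-learn's `RandomForestClassifier` is used, whose `max_features` parameter plays the role of $m$, and the toy dataset is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 16))            # M = 16 input variables
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # toy binary labels

M = X.shape[1]
# Reference values m = floor(log2(M) + 1) and m = sqrt(M), plus neighbors
candidates = sorted({int(np.floor(np.log2(M) + 1)), int(np.sqrt(M)), 2, 8})
oob_error = {}
for m in candidates:
    rf = RandomForestClassifier(n_estimators=200, max_features=m,
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    oob_error[m] = 1.0 - rf.oob_score_        # out-of-bag error for this m
best_m = min(oob_error, key=oob_error.get)    # m with the minimum OOB error
```

Because the out-of-bag error is an unbiased estimate, this search needs no held-out validation set, as noted above.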
All trees are grown as large as they can extend; there is no limit on their growth. In the current study, estimating variable importance by classification accuracy, which has become common in many studies [34], is adopted. The algorithm computes the classification accuracy on the out-of-bag data and the mean decrease in accuracy. Let the bootstrap samples be $b = 1, \dots, B$, and let $D_j$ stand for the importance measure of variable $X_j$. The calculation algorithm is as follows: Step 1: set the initial value $b = 1$, then choose the out-of-bag data points $L_b^{oob}$.
Step 2: Use tree $T_b$ to classify $L_b^{oob}$, then count the number of correctly classified items, $R_b$.
Step 3: To calculate the importance measure of variable $X_j$, $j = 1, \dots, M$: randomly permute the values of $X_j$ in $L_b^{oob}$ to obtain the perturbed set $L_{b,j}^{oob}$; then use tree $T_b$ to classify $L_{b,j}^{oob}$ and count the number of correctly classified items, $R_{b,j}$.
Step 4: Repeat Steps 1-3 for $b = 2, \dots, B$.
Step 5: For $X_j$, the importance measure is calculated as $D_j = \frac{1}{B}\sum_{b=1}^{B}(R_b - R_{b,j})$. Step 6: The importance measure is normalized as a z-score, $z_j = D_j / (s_j / \sqrt{B})$, where $s_j$ is the standard deviation of the decreases in correct classifications. Step 7: Assuming the statistic follows a Gaussian distribution, $z_j$ is converted into a significance value.
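The permutation-importance procedure of the steps above can be sketched as follows. As a simplification, this example measures the accuracy drop on the training data rather than on the out-of-bag samples, and the toy dataset and function name are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Toy data: feature 0 determines the class, feature 1 is pure noise
X = rng.standard_normal((300, 2))
y = (X[:, 0] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def mean_decrease_in_accuracy(model, X, y, j, rng, repeats=10):
    """Accuracy drop after randomly permuting feature j (Steps 2-5)."""
    base = model.score(X, y)
    drops = []
    for _ in range(repeats):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])   # destroy feature j's information
        drops.append(base - model.score(Xp, y))
    return float(np.mean(drops))

importances = [mean_decrease_in_accuracy(rf, X, y, j, rng) for j in range(2)]
# The informative feature should show a far larger accuracy drop than the noise
```

Permuting an informative variable breaks its relationship to the labels, so the mean decrease in accuracy directly ranks feature relevance.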

Experimental System
The test bed is shown in Figure 4. The piston pumps are of the Kawasaki K3V type. Five piston pumps with typical performance in different health states were carefully selected. The pumps used for this test had previously been working under the same condition but with different total running times, and therefore show differences in many aspects, such as pressure, efficiency and so on. Table 2

Dimensional reduction results
The total number of samples is 500, with 100 for each condition. We selected 100 segments of the vibration signal from each piston pump while it was running steadily, and built a $500 \times 262$ sample matrix $X_{500\times 262}$. LE, LPP, LLE, ISOMAP, PCA and KPCA are used for the dimensionality reduction process; these methods transform the high-dimensional space into a low-dimensional one for feature extraction and pattern recognition analysis. We set the neighborhood size $k = 5$ and used a maximum-likelihood estimator [35] to calculate the embedding dimension, which is estimated as $d' = 23$. As shown in Figure 6(a), the states are divided into five categories with small within-class distances; the clustering is obvious, and the different classes are easy to distinguish. P4 and P5 overlap slightly, but this has no impact on the health status evaluation. P1, P2 and P3 have large between-class distances and relatively small within-class distances; their health statuses are clearly different. In Figure 6(b), LPP gathers the samples into five groups, but P1 and P2 merge together, and the boundary between P4 and P5 is not clear. In Figure 6(c), LLE divides the samples into five groups, but P3 and P5 overlap noticeably and the boundary between P1 and P2 is blurred. Each dimensionality reduction method (LE, LPP, LLE, ISOMAP, PCA and KPCA) is then applied before the classification stage, the accuracy is obtained through random forests and computed over 10 runs, and Figure 7 shows the detailed recognition results of the different manifold methods. After processing, the numerical averages are calculated; Table 3 shows the numerical average for each manifold method. The LE method has higher classification accuracy than the other methods, indicating that using LE to evaluate the piston pump operating state provides the best discrimination.
When the embedding dimension $d'$ lies in the interval (23, 262), the feature dimension is relatively high, and the accuracy rises as the embedding dimension decreases. The accuracy reaches its highest value at $d' = 23$; when $d'$ lies in [1, 23], the accuracy falls as the embedding dimension decreases further. These results demonstrate that 23 is the best embedding dimension: only at $d' = 23$ is the maximum classification accuracy obtained.

The Result
The comparison of the LE-RF, LE-KNN and LE-DBN methods is shown in Table 7. The out-of-bag classification error is shown in Figure 9, where the samples are those produced by the LE dimensionality reduction method. It can be observed that the classification error remains steady at a relatively low level once the number of trees exceeds a threshold.

The test set validation
A test pump (Pump 6) is added as the test set for the system. Pump 6 comes from the same batch as the pumps above; the only difference is its running time, which is about 300 h and falls within the range of P1.
The signals from P6 are processed by feature extraction and dimensionality reduction. The processed data form a $100 \times 23$ matrix and are classified by the trained random forest model, which returns a vector of predicted labels. We calculate the frequency of each label, and the label with the highest frequency of occurrence is returned as the state label $s$ ($s \in \{1,2,3,4,5\}$). The state of P6 is determined by $s$: in each run, P6 is predicted to be in condition $s$.
However, the randomness of the data initialization may influence the predicted label. For P6, 100 groups of processed data ($100 \times 23$ each) are collected, and the prediction process is repeated 100 times. The predicted labels are gathered into a set, in which the label P1 occurs with a frequency of 77%. The conclusion is therefore reached that P6 lies within the 0-500 h range (the first stage of degradation).
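The majority-vote decision described above can be sketched as follows; the vote counts are hypothetical, chosen only to mirror the 77% figure in the text.

```python
from collections import Counter

def assess_state(predicted_labels):
    """Majority vote over per-segment predictions: returns the winning
    health-state label and its empirical frequency in the vote set."""
    counts = Counter(predicted_labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(predicted_labels)

# Hypothetical run: 77 of 100 repetitions voted for state 1
preds = [1] * 77 + [2] * 15 + [3] * 8
label, freq = assess_state(preds)
# label == 1 with frequency 0.77
```

Reporting the winning label together with its vote frequency gives a confidence indication alongside the hard state decision.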

Conclusions
A method based on LE-RF is proposed to assess piston pump health. The results show that, compared with other manifold learning methods, the LE-RF method provides better clustering and fewer classification errors; it successfully reduces the high-dimensional features to a minimal set and improves the accuracy of health-state recognition. The method has also been verified on a test set and successfully used in degradation assessment. The proposed method plays an important role in piston pump health condition evaluation and is helpful for predicting potential risks in factory manufacturing.
In future research, our effort will be devoted to making the proposed method more reliable for practical use. To achieve this, a knowledge database covering different health conditions of mechanical components will be considered. Moreover, the database will be updated whenever the predicted health states are inconsistent with the actual ones, so that our health assessment algorithm can be applied to more mechanical equipment with better accuracy.