December 11, 2017, Monday





This documentation explains how to reproduce the performance evaluation analysis reported in our manuscript “Combined Analysis of Chromosomal Instabilities and Gene Expression for Colon Cancer progression Inference”.
Please note that this analysis follows four main processing steps (see the Material and Methods section). Here we used RapidMiner (RM) as the main tool for evaluating the disease progression inference. RM is available for download, is easy to install, and has both substantial commercial and community support.
Broadly speaking, RM is an environment for rapid prototyping of machine learning and knowledge discovery (KD) processes. It is currently used for classification, clustering, and data integration tasks. RM processes are modeled as nested chains of objects called operators. These operators implement the individual KD steps, such as data pre-processing, learning algorithms, and performance evaluation. The user works in a graphical interface where operators are dropped as nodes onto the working pane, and the workflow (WF) is specified by connecting the operator nodes. In other words, an RM workflow represents the conceptual sequence of operational steps used for a specific data mining experiment. In our analysis, both the inference and the performance evaluation are carried out by the WF shown in the following figure.

Download this workflow to start with the inference process. Then copy its code into the XML editor in RM so that the RM-WF can be imported and executed. Please note that to run this WF properly you must provide the data (i.e., the “data-types”) as reported in our manuscript. These data-types contain all the information concerning differentially expressed genes with CNAs. As an example, we provide a sample that contains the "Combined Data", i.e., the expression levels of selected up-regulated genes with amplifications and down-regulated genes with deletions.

Also note that RM uses a “repository” (see the RM documentation), which holds all data and processes. Although data can also be loaded into processes from outside the repository, using the repository offers a number of advantages from the start. To use our data, extract the data-types above and add the resulting directory to RM as a local repository. To do this, click the first button in the toolbar of the Repositories view and choose to create a new local repository. Then enter the name of the directory. All the data will now appear in the repository tree shown in this view.

Then press the "Play" icon in the icon bar at the top of the frame. The process should start, and after a short time the message viewer at the bottom of the frame reports that the process finished successfully. You can now inspect the inference performance by switching to the Results perspective (View ⇒ Perspectives ⇒ Results).

Basically, the WF implements standard Support Vector Machines (SVMs) to forecast stage progressions. The SVMs are used as “black box” inference processes, so that each input dataset is scored by the inference performance the algorithm achieves on it. The main components of this WF are the two operators described below.

Parameter Optimization Operator

Learning models often have many parameters, and it is not clear in advance which values are best for the learning task at hand. To obtain the best possible performance while treating all inputs homogeneously, we optimized the AUC index over a given space of feasible SVM learning parameters. This way, the best SVM learning parameters for each input are searched over the same space of values. The Parameter Optimization operator repeatedly runs its nested operators while changing their parameters, in order to optimize the performance of the learning scheme. In our case the nested operator is a cross-validation process, which in turn trains and tests the SVM algorithm. In other words, we used this technique to find the best parameter combination for the SVM learning process.
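The nesting described above (a parameter search on the outside, a cross-validation on the inside, and the AUC as the quantity to maximize) can be sketched in plain Python. This is only an illustration under several assumptions, not the RM workflow itself: the search is assumed to be an exhaustive grid, the toy nearest-neighbour scorer `knn_score` stands in for the SVM, the data are made up, and a single AUC is computed over the pooled held-out scores instead of averaging per-fold results. All function and parameter names are hypothetical.

```python
import random
from itertools import product


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_examples, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def knn_score(train_x, train_y, x, n_neighbors):
    """Toy stand-in for the SVM: fraction of positives among the nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:n_neighbors]
    return sum(train_y[i] for i in nearest) / len(nearest)


def cross_validated_auc(xs, ys, params, k=10, seed=0):
    """Score every example while it is held out, then compute one AUC over all scores."""
    scores = [0.0] * len(xs)
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        for i in test_idx:
            scores[i] = knn_score(train_x, train_y, xs[i], **params)
    return auc(scores, ys)


def grid_search(param_grid, evaluate):
    """Evaluate every parameter combination; keep the first one with the highest score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Made-up, well-separated 1-D data: ten negatives near 0, ten positives near 100.
xs = list(range(10)) + list(range(100, 110))
ys = [0] * 10 + [1] * 10
best_params, best_score = grid_search(
    {"n_neighbors": [1, 3, 18]},
    lambda params: cross_validated_auc(xs, ys, params),
)
print(best_params, best_score)
```

Because `grid_search` only sees a function that maps a parameter combination to a score, the same loop would apply unchanged if the toy scorer were replaced by a real SVM with its own parameter grid.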

Cross-Validation Operator

This operator encapsulates a 10-fold cross-validation error estimation. Cross-validation is a two-step process: in the first step, a classifier is built describing a predetermined set of data classes. In the second step, the model (a trained SVM) is used to classify new examples, and the generalization performance of the classifier is estimated on this new test set. The input data set S is split into subsets {S1, S2, ..., Sk}; in our case k = 10. The first inner operator (SVM) carries out the learning step described above. The SVM is applied 10 times, using at each iteration i the set Si as the test set and S\Si as the training set. The second inner operator (model applier) carries out the second step described above. The predictive accuracy (and the other performance measures) of the classifier is then estimated by the performance operator. In this analysis we used the following (expression level) datasets: