Improving Acute Leukemia Classification through Recursive Feature Elimination and Multilayer Perceptron Analysis of Gene Expression Data


Historical datasets on cancer offer invaluable resources for research, enabling scientists to explore the evolution of understanding in the field. These datasets, often spanning decades, encompass a wide array of cancer types, clinical outcomes, and molecular profiles. They capture the progress of technology and knowledge, allowing researchers to reinterpret and reanalyze the data using modern analytical techniques. One notable historical dataset is the work of Golub et al. (Simsek, Badem, & Okumus, 2021), which focused on gene expression in acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) patients. This dataset, generated through DNA microarrays, paved the way for molecular cancer classification. Its availability continues to attract attention, as researchers seek to apply cutting-edge methodologies to reexamine the data and extract novel insights that were perhaps not apparent at the time of its original publication. The Golub et al. dataset exemplifies how revisiting historical data with contemporary tools can lead to transformative discoveries, showcasing the timeless value of such resources in advancing cancer research.
As technology and analytical methods have evolved rapidly since that time, there is an unprecedented opportunity to reevaluate this historic dataset. New data science techniques, including Data Visualization, Correlation Analysis, and Differential Expression Analysis, stand poised to illuminate latent knowledge within historical cancer datasets, thereby offering vital contributions to contemporary cancer research and understanding (Taiwo Olaleye, 2021). Data Visualization, driven by advanced visualization tools, enables the exploration of complex gene expression profiles in an intuitive manner. This facilitates the identification of trends, outliers, and potential biomarkers that might hold relevance in today's cancer landscape. Correlation Analysis unveils intricate relationships between genes, shedding light on potential regulatory networks or co-expression patterns that could signify important biological pathways (Ahiara, Abioye, Chiagunye, & Olaleye, 2023). Machine learning for predictive analytics is likewise a data science tool of immense potential (Olaleye T. O., Arogundade, Misra, Abayomi-Alli, & Kose, 2023), which could help to predict cancer severity and to identify significant feature attributes for clinical tests (He, Chen, Bian, & Yang, 2023). Collectively, these techniques harness the wealth of historical data, revitalizing it as a valuable resource for contemporary researchers seeking actionable insights that can advance our understanding of cancer and drive targeted approaches to diagnosis, treatment, and patient care.
Situated at the intersection of past breakthroughs and contemporary innovation, our endeavor carries the potential to bridge gaps in knowledge that may have arisen due to technological limitations of the past. By applying state-of-the-art exploratory data analysis methods to this vintage dataset, we aspire to extract overlooked nuances and potentially identify gene signatures that could offer new avenues for investigation. Through this process, we aim not only to enrich the molecular understanding of AML and ALL but also to showcase the significance of revisiting historical datasets with modern tools. This translational research (Woolf, 2008) focuses on bridging the gap between scientific discoveries and their application in clinical settings, aiming to translate laboratory findings into tangible benefits for patients and healthcare practices. This approach involves translating the knowledge gained from historical data into practical applications that can impact current cancer realities. The challenge therefore lies in effectively harnessing these datasets to extract novel insights that can inform contemporary cancer understanding and practice. To address this, our research leverages advanced Data Visualization, Correlation Analysis, Machine Learning, and Feature Selection techniques to unveil latent patterns within historical cancer datasets. Our main objective is to unearth actionable insights that bridge the gap between past findings and present-day cancer challenges, ultimately contributing to more precise diagnostic approaches, targeted therapeutic strategies, and improved patient outcomes.
The rest of the article is organized as follows. Section 2 discusses the existing literature on the subject, while Section 3 introduces the conceptual framework of this study. Results are discussed in Section 4, and the study is concluded in Section 5.

Literature Review
Several studies have addressed diverse problem statements using gene expression datasets. Their findings offer actionable insights for cancer research efforts. Some of those studies are reviewed in this section.
The focus of the work of Simsek et al. (2021) is to enhance leukemia diagnosis through sub-type classification using machine learning techniques on gene expression data. With early diagnosis and precise treatment being crucial in leukemia management, the study aims to leverage the power of gene expression analysis to classify leukemia into Acute Lymphoblastic Leukemia (ALL) and Acute Myeloblastic Leukemia (AML) sub-types. Recognizing the significance of timely and accurate diagnosis, the study proposes the application of machine learning methods as a valuable tool for efficient sub-type identification. The utilization of K-nearest neighbor, Linear Discriminant, Support Vector Machine, and Ensemble classifiers underscores the comprehensive approach adopted for this purpose. The anticipated results entail a comparative presentation of the outcomes achieved through these machine learning techniques, providing insights into their effectiveness for accurate leukemia sub-type classification. Ultimately, the study strives to contribute to the advancement of leukemia diagnosis and treatment by harnessing gene expression data and modern machine learning methodologies.
In He et al. (2023), the primary objective of the study was to introduce and validate a novel multitask learning framework named Multi-Tissue Transcriptome Mapping (MTM), with the aim of predicting personalized tissue-specific gene expression profiles. Recognizing the invaluable insights that transcriptional profiles offer in both fundamental and translational research, the study addresses the challenge of limited transcriptome information for tissues requiring invasive biopsies.
A separate 2023 study presents a novel approach, Multi-Objective Evolutionary Algorithm with Decomposition and Harris Hawks Learning (MOEA/D-HHL), for medical machine learning, leveraging a fusion of evolutionary algorithms and Harris hawks learning to enhance the effectiveness of medical data analysis. The approach aims to address the growing interest in medical machine learning, amalgamating insights from computer science and medicine. The proposed MOEA/D-HHL demonstrates its capabilities through performance evaluations against established benchmarks (DTLZ1-DTLZ7). Subsequently, MOEA/D-HHL is employed to construct machine learning algorithms for medical cancer gene expression datasets, considering three key objectives: feature selection, classification accuracy, and correlation measures. The study then extends its application to real clinical datasets involving lupus nephritis and pulmonary hypertension. The proposed algorithm outperforms existing methods in both cases, as reflected by higher Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) values. The statistical analysis underscores the predictive capabilities of the algorithm across different metrics and accentuates the stability of MOEA/D-HHL as a promising framework for advancing medical machine learning. The findings highlight the potential of MOEA/D-HHL as a tool in the emerging field of medical machine learning, promising improved analysis and insights from complex medical datasets.
The paper of Singh et al. (2020) introduces MetaOmGraph (MOG) as a versatile workbench tailored for interactive exploratory data analysis of extensive expression datasets. The rising wealth of omics data in public domains presents a unique opportunity for uncovering latent insights; however, a substantial portion of these archived datasets goes untapped. MOG emerges as a solution to this challenge, offering a free, open-source, standalone software platform that empowers researchers, irrespective of coding skills, to visualize, assess, and dissect large datasets. The software's interactive interface enables researchers to evaluate data with metadata context, facilitating the identification of sample or gene groups based on factors like expression values, statistical associations, metadata terms, and ontology annotations. A range of interactive visualizations, including line charts, scatter plots, histograms, and volcano plots, enriches the experience, while statistical analyses such as co-expression, differential expression, and differential correlation provide deeper insights. MOG further facilitates data transfer to R for additional analyses. Its ability to handle big data efficiently, thanks to multithreading and indexing, enhances its utility. Researchers can initiate new projects from numerical data or delve into existing MOG projects, which retain exploration history and can be saved and shared. The paper demonstrates MOG through case studies using curated datasets from human cancer RNA-Seq and Arabidopsis thaliana, showcasing its usefulness in identifying potential biomarkers and drawing substantial insights from diverse omics datasets.
In the work of Buja et al. (2009), the article presents an innovative approach to enhance the effectiveness of exploratory data analysis (EDA) and model diagnostics through the integration of statistical inference. This framework proposes a shift towards incorporating an inferential protocol akin to confirmatory statistical testing, bridging the gap between visual analysis and traditional statistical hypothesis testing. In this paradigm, visual plots serve as the analog of test statistics, while human cognitive assessment assumes the role of statistical tests. The process involves measuring the statistical significance of "discoveries" made through EDA or model diagnostics by comparing the real dataset's plot with plots generated from simulated datasets. This comparison facilitates the assessment of the real data's uniqueness or structure against simulated data's randomness. The article introduces two protocols: one inspired by the "lineup" procedure utilized in legal contexts, and another inspired by the "Rorschach" inkblot test used in psychology. The former protocol, reminiscent of a police lineup, aids in determining whether visual observations in the real data stand out from random variability. The latter, inspired by psychological acclimatization, aids in preparing analysts to interpret variability before encountering the real data. These protocols have implications for exploratory data analysis, where reference datasets are simulated with a null assumption of no underlying structure, as well as for model diagnostics, where reference datasets are simulated based on the model under consideration. The proposed approach promises to enhance the rigor of exploratory data analysis and model diagnostics, potentially improving statistical thinking and practices in data analysis workflows and educational contexts.

Results and Discussion
The translational research approach of this study adopts a 5-phase conceptual framework to achieve its aim. The phases are discussed in this section.

Data Acquisition and Preprocessing
The first two stages entail data acquisition and preprocessing: the data is acquired from a public repository and preprocessed ahead of the subsequent data science phases. The dataset employed in this study originates from a proof-of-concept investigation conducted by Golub et al. in 1999 (Crawford, 2017). That study demonstrated the potential of classifying new cancer cases through gene expression profiling using DNA microarray technology. The method introduced a broad strategy for identifying novel cancer categories and accurately assigning tumors to established classes. The dataset was specifically employed for the classification of patients afflicted with acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). As of the time of data acquisition, the dataset had 135548 views, 15573 downloads, and a download-per-view ratio of 0.11. This indicates the continued use of the dataset in data science, biological, and health science studies. The dataset is made up of 7130 gene descriptions with 62 data instances.
For the purpose of this study, the dataset was preprocessed to make it compatible with data science programming tools. The major preprocessing task was the transposition of the data into a compatible Python data frame format, so that each row holds one data instance and each column one gene.
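The transposition step can be sketched with pandas. This is only an illustration under assumptions: the column name "Gene Description", the patient labels, the expression values, and the helper `to_samples_by_genes` are all hypothetical and are not taken from the study's code.

```python
import pandas as pd

# Hypothetical miniature of a microarray file: one row per gene, one column per patient.
raw = pd.DataFrame(
    {
        "Gene Description": ["gene_a", "gene_b", "gene_c"],
        "patient_1": [-214, -139, -76],
        "patient_2": [-153, -73, -49],
    }
)


def to_samples_by_genes(frame: pd.DataFrame, gene_column: str) -> pd.DataFrame:
    """Transpose so that each row is a patient and each column is a gene."""
    tidy = frame.set_index(gene_column).T
    tidy.index.name = "sample"
    tidy.columns.name = None
    return tidy


samples = to_samples_by_genes(raw, "Gene Description")
print(samples)
```

After the transposition, a class label (ALL or AML) can be attached as one additional column, which is the layout most Python modeling tools expect.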

Feature Selection
For this gene expression study, Recursive Feature Elimination (RFE), one of the most suitable feature selection algorithms in data science, is employed and implemented in Python. RFE is a widely used technique that recursively eliminates the least significant features from a dataset while building a predictive model (Kilincer, Ertam, Sengur, R. S., & Acharya, 2023). Given the complexity of gene expression data and the potential presence of noisy or irrelevant features, RFE is effective in identifying the most informative genes for distinguishing between different conditions or outcomes. In practical terms, analyzing all 7130 gene descriptions contained in the dataset would incur a high computational overhead and might not return a reliable outcome owing to redundancies, which feature selection algorithms are designed to eliminate. Hence, one prominent aim of this study is to implement a feature selection methodology that identifies the most significant feature attributes out of the entire 7130. When investigating the correlation between gene expression levels and cancer status in this study, RFE helps to identify the subset of genes that contributes most significantly to the observed correlations. The algorithm iteratively removes the least important genes, refining the set of features to a more meaningful subset. By doing so, RFE enhances the interpretability of the results and can reveal key genes associated with the attributes of interest. The RFE algorithm is presented below:
i. Dataset (X, y): The dataset containing gene expression levels (features) and the target variable.
ii. n_features_to_select: The desired number of features to retain.
iii. ML_Algorithm: The chosen machine learning algorithm for evaluation.
iv. Remaining_Features: List of features that have not been selected yet.
v. Selected_Features: List to store the selected features.
vi. Best_Feature: The feature with the highest score in each iteration.
vii. Best_Score: The highest score achieved with the current set of features.
viii. Features_to_Use: The set of features used for training the model in each iteration.
ix. Train_Model(): Function to train the machine learning algorithm.
x. Evaluate_Model_Performance(): Function to evaluate the model's performance.
This pseudocode outlines the step-by-step procedure of the RFE algorithm. It starts with all available features and iteratively selects the best feature to add to the selected features list, while considering the performance improvement achieved by each feature. The algorithm terminates when the desired number of features is reached.
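As a hedged illustration of feature selection on a wide expression matrix, the sketch below uses scikit-learn's `RFE` class. Note that this class works by backward elimination (it repeatedly fits the estimator and discards the lowest-ranked features), so it is not a line-by-line rendering of the variables listed above. The synthetic data, the logistic regression estimator, and the `step` value are assumptions chosen for the example; only the target of 20 retained features follows the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for an expression matrix (rows = patients, columns = genes).
X, y = make_classification(
    n_samples=60, n_features=200, n_informative=10, n_redundant=0, random_state=0
)

# Assumed base estimator: any model exposing coef_ or feature_importances_ works.
estimator = LogisticRegression(max_iter=1000)

# step=0.1 discards 10% of the original feature count per round until 20 remain.
selector = RFE(estimator, n_features_to_select=20, step=0.1)
selector.fit(X, y)

# Column indices of the retained features (ranking_ == 1 for each of them).
selected = np.flatnonzero(selector.support_)
print(selected)
```

The reduced matrix `X[:, selected]` is what would then be passed to the exploratory analysis and to the classifier.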

Exploratory Data Analysis
The feature selection phase is followed by exploratory data analysis (EDA) of the significant features returned by RFE. EDA is a data mining phase that obtains actionable insights from data to support informed decisions. It provides an in-depth view of the data that aids understanding of how the gene attributes contribute to a diagnosis of acute myeloid leukemia or acute lymphoblastic leukemia. EDA has proven to be an indispensable step prior to any machine learning-based predictive analysis (Olaleye, et al., 2023). The techniques involved include:

a. Summary Statistics
Summary statistics give insight into the central tendency and variability of the gene attributes. The mean, median, and standard deviation are computed for the gene expressions used to determine a patient's status as either acute myeloid leukemia or acute lymphoblastic leukemia. These results provide a quantitative understanding of the gene expressions and highlight variations among them in distinguishing the two leukemia types. The computations are:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

where $x_i$ are the numerical values of a gene expression and $n$ is the total number of instances;

$$\mathrm{median} = \begin{cases} x_{\left(\frac{n+1}{2}\right)}, & n \text{ odd} \\ \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right), & n \text{ even} \end{cases}$$

where $n$ is the number of instances and $x_{(k)}$ is the $k$-th smallest value;

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}$$

where $x_i$ is a gene value in the data distribution for each expression, $\bar{x}$ is the mean, and $n$ is the total number of instances.
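A minimal pandas sketch of these summary statistics follows. The two gene columns and their values are hypothetical. The standard deviation uses `ddof=0`, which divides by the total number of instances; this is an assumption about the intended form, since pandas defaults to the sample form (`ddof=1`). The minimum and maximum are included because they come from the same pass over the data.

```python
import pandas as pd

# Hypothetical expression values for two genes across five patients.
expr = pd.DataFrame(
    {"gene_a": [2.0, 4.0, 4.0, 5.0, 10.0], "gene_b": [-3.0, 0.0, 1.0, 2.0, 5.0]}
)

summary = pd.DataFrame(
    {
        "mean": expr.mean(),
        "median": expr.median(),
        # ddof=0 divides by the number of instances (pandas defaults to ddof=1).
        "std": expr.std(ddof=0),
        "min": expr.min(),
        "max": expr.max(),
    }
)
print(summary)
```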

b. Maximum and Minimum Expression Values
The maximum and minimum expression values of each gene attribute are identified. Analyzing these extreme values reveals how gene expression behaves at its highest and lowest levels. This information is crucial for understanding the behavioral trend of the genes as it relates to the two leukemia types.

c. Interquartile Range (IQR)
The interquartile range (IQR) is calculated to assess the spread and variability of the gene expressions in distinguishing the two leukemia types. The IQR is a robust measure of dispersion that spans the data points between the 25th and 75th percentiles of a distribution (Moustafa, et al., 2018). It is computed as

$$\mathrm{IQR} = Q_3 - Q_1$$

where $Q_1$, the lower quartile, is the most central gene expression value in the first half of the ordered dataset, and $Q_3$, the upper quartile, is the most central expression value in the second half of the ordered dataset. The calculation provides insight into the spread of the gene expressions and helps to detect likely outliers or unusual gene behavior.
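The IQR and its use for flagging outliers can be sketched as follows. The values are hypothetical, and the 1.5 × IQR fences are Tukey's common convention, which the text does not specify. NumPy's default percentile method interpolates linearly, so its quartiles can differ slightly from the median-of-halves definition on small samples.

```python
import numpy as np

# Hypothetical expression values for a single gene.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Tukey's 1.5 * IQR fences: an assumed convention for marking outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(q1, q3, iqr, outliers)
```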

d. Correlation Analysis
Correlation analysis is performed to understand the relationships and dependencies between gene expressions with respect to the two cancer conditions. The correlation matrix reveals the pairwise relationships between the gene expressions and their relationship with the cancer statuses. The correlation coefficient ranges from -1 to 1, with values closer to -1 implying a strong negative relationship, values closer to 1 indicating a strong positive relationship, and values close to 0 suggesting no significant linear relationship. With this analysis, interdependencies among the gene expressions are uncovered, which can inform their relationship with ALL or AML. For the purpose of this study, the correlation coefficients are communicated in a heat map, which also helps to uncover possible multicollinearity within the returned most significant data features. The correlation coefficient is calculated by dividing the covariance of two gene attributes by the product of their standard deviations:

$$r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y} \qquad (7)$$

where $X$ and $Y$ are the two gene attributes under analysis.
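The sketch below computes the Pearson coefficient, that is, the covariance normalized by the product of the two standard deviations, both by hand and with pandas. The gene names and values are hypothetical, and the helper `pearson` is introduced only for illustration.

```python
import pandas as pd

# Hypothetical expression values; gene_b rises with gene_a, gene_c falls with it.
expr = pd.DataFrame(
    {
        "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "gene_b": [3.0, 5.0, 7.0, 9.0, 11.0],
        "gene_c": [5.0, 4.0, 3.0, 2.0, 1.0],
    }
)


def pearson(x: pd.Series, y: pd.Series) -> float:
    """Covariance divided by the product of the two standard deviations."""
    cov = ((x - x.mean()) * (y - y.mean())).mean()
    return cov / (x.std(ddof=0) * y.std(ddof=0))


# Full pairwise matrix; this is the table a heat map would display.
corr = expr.corr(method="pearson")
print(corr.round(3))
```

The resulting matrix can be passed to any plotting routine that draws heat maps; pairs with coefficients near 1 or -1 are the candidates for multicollinearity.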

Multilayer Perceptron (MLP) Prediction of Cancer Status
Following the EDA, which provides actionable insights into the interrelatedness and spread of the most significant gene attributes, it is important to assess how well these gene expression variables predict the cancer status as either ALL or AML. The deep learner MLP is employed for this purpose. An MLP is a type of artificial neural network that consists of multiple layers of interconnected nodes. It typically comprises an input layer, one or more hidden layers, and an output layer. Each node in the network is connected to nodes in adjacent layers through weighted connections. The MLP employs nonlinear activation functions to introduce nonlinearity into the model, allowing it to capture complex patterns and relationships within the data. During training, the network adjusts the weights of these connections using optimization algorithms such as gradient descent to minimize the difference between predicted and actual outputs. MLPs are widely used for classification, regression, and pattern recognition because of their ability to learn from data and model intricate relationships. Each algorithm in the hierarchy of deep learning applies a nonlinear transformation to its input and uses what it learns to build a statistical model as output (Suganthi et al., 2022). This process iterates until a desirable level of predictive accuracy is attained. In this study, the MLP, a type of feedforward artificial neural network (ANN), is used.
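A minimal scikit-learn sketch of such a classifier is given below. It is an illustration under assumptions rather than the configuration used in this study: the data is synthetic, and the single hidden layer of 16 ReLU units, the standardization step, the 75/25 split, and the iteration limit are all choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 20 selected gene attributes and a binary ALL/AML label.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Assumed architecture: one hidden layer of 16 ReLU units (default Adam solver).
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(16,), activation="relu", max_iter=2000, random_state=0
    ),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Scaling the inputs before the network is a common precaution with expression data, whose raw values can span several orders of magnitude.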
Comprising at least three layers of nodes (the input layer, one or more hidden layers, and the output layer), the MLP applies a nonlinear activation function at each neural node. The architecture of the MLP is visualized in Figure 1. Learning adapts the connection weights after each data point is processed, based on the discrepancy between the generated output and the intended outcome. This process generalizes the least mean squares algorithm of the linear perceptron. The error of output node $j$ for the $n$-th data point is

$$e_j(n) = d_j(n) - y_j(n)$$

where $d_j(n)$ is the target value and $y_j(n)$ is the output produced by the perceptron. The node weights are then adjusted to minimize the aggregate output error

$$\mathcal{E}(n) = \frac{1}{2} \sum_j e_j^2(n).$$

With gradient descent, the correction applied to each weight is

$$\Delta w_{ji}(n) = -\eta \, \frac{\partial \mathcal{E}(n)}{\partial w_{ji}(n)}$$

where $w_{ji}$ is the weight of the connection from node $i$ to node $j$ and $\eta$ is the learning rate, which scales the corrections so that the perceptron converges towards the desired output.

Table 1 (excerpt): AFFX-BioB-5_at (endogenous control); Semaphorin E

For the measures of dispersion, Table 3 and Table 4 report the mean, standard deviation, minimum, and maximum gene expression values of the attributes. The mean values highlight the average expression levels across the different genes. The means vary among the genes, indicating potential differences in their biological roles or responses to experimental conditions. For instance, the gene "AFFX-HSAC07/X00351_3_st" has a mean expression of 668.61, whereas "AFFX-HUMGAPDH/M33197_5_st" has a mean expression of 497.13. The standard deviations show the extent of variability within each gene's expression levels. The large standard deviations suggest considerable differences in expression across conditions for each gene. For instance, the gene "AFFX-HSAC07/X00351_M_st" exhibits a standard deviation of 2360.24, indicating substantial variability in its expression levels. The minimum and maximum values reveal the range of expression values for each gene. Notably, some genes exhibit substantial ranges, indicating dynamic changes in expression under different conditions. For instance, the gene "AFFX-HUMGAPDH/M33197_M_st" shows a minimum value of -16131 and a maximum value of 59647. The gene "AFFX-BioB-3_at" has a mean expression of 698.21, whereas "AFFX-BioDn-5_at" has a mean expression of 564.72. The gene "AFFX-CreX-3_at" (in Table 4) has a standard deviation of 2579.99, indicating notable variation in its expression levels. As the table shows, some genes exhibit considerable differences between their minimum and maximum values, indicating diverse expression behaviors. For instance, the gene "AFFX-BioB-M_at" has a minimum value of -17930 and a maximum value of 29288. These statistical measures collectively offer a preliminary understanding of the characteristics of the gene expression patterns. The high variability observed could be due to various factors such as experimental noise, regulatory mechanisms, or responses to external stimuli.

The correlation coefficient analysis is presented for the 20 feature attributes in Figure 2. These correlation coefficients provide insights into how each variable is related to the others in the dataset. Strong positive correlations suggest that the variables tend to increase together, while strong negative correlations suggest that one variable tends to increase as the other decreases. Weak correlations indicate a lesser degree of linear relationship between variables. Note that correlation does not imply causation; it indicates only a statistical relationship.

i. Precision (0.998263889): Of all the instances predicted as either ALL or AML, approximately 99.83% are correct predictions. This is crucial in medical diagnostics because it ensures that when the model predicts a disease (ALL or AML), it is highly likely to be accurate.
ii. Accuracy (0.997916667): Accuracy represents the overall correctness of the model's predictions. An accuracy of approximately 99.79% indicates that the model correctly predicts the class (ALL or AML) for nearly all instances in the dataset. It is an excellent indicator of how well the model distinguishes between the two classes.
iii. F1-Score (0.998958333): The F1-Score is the harmonic mean of precision and recall. It balances the two and is especially useful when dealing with imbalanced datasets or when both false positives and false negatives need to be minimized. With an F1-Score of approximately 99.90%, the model strikes an excellent balance between precision and recall.
iv. Recall (0.997916667): Recall, also known as sensitivity or true positive rate, measures how many of the actual positive instances the model correctly predicted. With a recall of approximately 99.79%, the model captures nearly all instances of both ALL and AML cases.
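The four metrics can be computed with scikit-learn as sketched below. The labels are hypothetical, and the use of weighted averaging across the two classes is an assumption; the study's own averaging scheme is not stated in the text.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical labels: 0 = ALL, 1 = AML (one ALL case is misclassified as AML).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)

# average="weighted" weights each class's score by its number of true instances.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
print(accuracy, precision, recall, f1)
```

Under weighted averaging, recall always coincides with accuracy, which is consistent with the identical recall and accuracy values reported above.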
These performance metrics collectively indicate that the MLP trained on the RFE-selected features has achieved remarkable results in classifying the ALL and AML classes from gene expression data. Such high precision and recall values suggest that the model is both highly accurate and effective at identifying patients with ALL and AML based on gene expression patterns.
The genes most informative to the MLP model were presented earlier in Table 1. It can be inferred that the 20 features selected by RFE played a crucial role in achieving the model performance. These attributes represent specific gene expressions that are highly indicative of either ALL or AML, providing valuable insights into the molecular signatures associated with these diseases. The expression levels of these genes exhibit distinctive patterns between the two classes, and the model has learned to leverage these patterns for accurate classification. This information provides insights into the biological mechanisms underlying these diseases, potentially leading to better diagnostics and targeted treatment approaches. It is worth noting that the model's success in classifying instances results from the collective contribution of these attributes, emphasizing their importance in gene expression predictive analytics studies for medical applications.

Conclusion and Recommendations
In this study, we aimed to enhance the classification of acute leukemia subtypes, namely Acute Lymphoblastic Leukemia (ALL) and Acute Myeloid Leukemia (AML), using gene expression data analysis. We employed a systematic approach combining Recursive Feature Elimination (RFE) as a feature selection technique and Multilayer Perceptron (MLP) as the predictive modeling framework.
Our research focused on identifying the most influential genes for accurate subtype classification. Through rigorous experimentation, we achieved highly promising results. The combined RFE-MLP approach yielded precision, accuracy, F1-Score, and recall rates above 99.7%, signifying its effectiveness in leukemia subtype classification. Importantly, our study identified specific genes, including AFFX-BioB-5_at, AFFX-BioB-M_at, and GB DEF = GABAa receptor alpha-3 subunit, as some of the 20 key contributors to the model's predictive power. These genes may serve as potential biomarkers for leukemia diagnosis, offering valuable insights for future research and clinical applications.

From Figure 2 and Figure 3, it is discovered that AFFX-HUMGAPDH/M33197_5_st and AFFX-HUMGAPDH/M33197_M_st have a strong positive correlation of approximately 0.772. This suggests that these two variables tend to increase or decrease together. AFFX-HUMGAPDH/M33197_5_st and Osteomodulin have a weak negative correlation of approximately -0.290, indicating that as one variable increases, the other tends to decrease slightly. AFFX-HSAC07/X00351_M_st and GB DEF = GABAa receptor alpha-3 subunit have a weak positive correlation of approximately 0.293. mRNA and Semaphorin E have a very weak positive correlation of approximately 0.050, which suggests almost no linear relationship between these variables. GB DEF = GABAa receptor alpha-3 subunit and Osteomodulin have a positive correlation of approximately 0.166, suggesting a mild positive relationship between these variables.

Figure 3. The heat plot B of correlation coefficient

The result obtained after training the Multilayer Perceptron on the features selected by Recursive Feature Elimination to classify instances as ALL or AML shows impressive performance metrics, presented in Figure 4.

Figure 4. The performance metrics of the Multilayer Perceptron with RFE

i. Precision (0.998263889): Precision measures how many of the predicted positive instances are actually positive.

Table 1. The most significant gene expression attributes returned by the RFE

Table 2. Interquartile range of the most significant gene attributes

Table 3. Summary statistics for measures of dispersion A

Table 4. Summary statistics for measures of dispersion B