Development of machine learning algorithms for genomic profiling of leukemia subtypes

Synopsis

Leukemia can be classified by the type of white blood cell affected - lymphoid or myeloid cells. Lymphocytic leukemia (also known as lymphoid or lymphoblastic leukemia) develops in the white blood cells called lymphocytes in the bone marrow. Myeloid (also known as myelogenous) leukemia may start in white blood cells other than lymphocytes, as well as in red blood cells and platelets. In this project, the student will develop and implement machine learning algorithms on DNA microarray data to distinguish between patients with acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). The machine learning classifier will be trained on a training set and then evaluated on a separate test set to measure its performance.

Objectives

1. Conduct a literature review of microarray/genomic data processing and machine learning techniques
2. Develop and implement machine learning algorithms in Python
3. Train the classifier on a training set, then test and evaluate its performance on a test set
4. Develop a microarray/genomic data classification workflow for proteome profiling

Deliverables

1. Microarray/genomic data processing and machine learning algorithms
2. Microarray/genomic data classifier in Python

Important Notes

1. Include at least 50 healthy subjects and 50 cancer patients for each sub-diagnosis.
2. Provide a step-by-step guide to the development, training and testing of the algorithm.
3. Provide all datasets pre- and post-cleaning, if applicable.
4. Provide all datasets (training and test data).

Project Background & Guide

The goal of machine learning is to make machines act like humans. To put this in context for this project: the collected datasets contain thousands of gene expressions. Technically, someone could be hired to manually examine the data to determine specific mutations and diagnose an individual patient, but this requires many man-hours and is impractical and labour-intensive.
Using machine learning, however, we can automate this and hasten the process, which is of utmost importance in the medical field, where time is critical from diagnosis through treatment. There are various types of machine learning, such as supervised and unsupervised learning. In supervised machine learning, we provide the computer with a labelled set of data so that it learns to produce a determined output value. In unsupervised machine learning, we provide the computer with an unlabelled set of data, and the algorithm determines the output from patterns it finds in the inputs alone.

Computers do not understand human language; we cannot simply type a sentence into the program, the way we would write an e-mail to a colleague, to make the computer perform a certain task. We must therefore communicate through a language the machine understands. This is where interpreted languages such as Python come into play; their interpreters act like translators. Python has various built-in features that optimise its role as a translator. One of these is the "function", which machine learning uses to help the computer perform the specified tasks. A function maps a given set of inputs to an output: we supply a set of input values and "tell" the program that these inputs correspond to certain output values. From this information, an algorithm can be developed that produces the expected output whenever a certain input is presented.

For the program to develop an efficient algorithm that determines the desired output, a large number of "input-output" examples must be supplied. As mentioned earlier, AI mimics human behaviour. For humans to perform a task efficiently, we need pre-existing knowledge of the topic and the required training on it; only then can we perform the task well.
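The supervised "input-output" idea described above can be sketched in a few lines of Python. The example below uses scikit-learn's k-nearest-neighbours classifier; the two-gene expression values and the AML/ALL labels are invented toy numbers for illustration only, not real patient data:

```python
# Minimal supervised-learning sketch: labelled "input-output" pairs train a classifier.
# The gene-expression values below are invented toy numbers, not real data.
from sklearn.neighbors import KNeighborsClassifier

# Each row is a (hypothetical) patient's expression levels for two genes.
X_train = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y_train = ["AML", "AML", "ALL", "ALL"]          # labels supplied with the inputs

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)                     # learn the input-output mapping

print(model.predict([[0.85, 0.15]]))            # → ['AML']
```

Once fitted, the model produces an output label for any new input row, which is exactly the behaviour the paragraph above describes.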
Similarly, AI requires and goes through the same process before it can efficiently perform the assigned task. As mentioned earlier, labelled data presented to the program enables it to develop an algorithm that produces the desired outcome. To fully optimise this process, two datasets are of utmost importance: the "training data" and the "test data". Both are labelled. Training data is used to familiarise the program with the "input-output" relationship; test data is used to determine the accuracy of the outputs the trained program produces for inputs it has not seen.

Various tools are available for machine learning in Python. Two of the more popular are NumPy and Pandas: NumPy is used to work with lists and arrays, while Pandas is used to manipulate tabular data such as CSV files. In addition, libraries such as Matplotlib, scikit-learn, Seaborn and Bokeh are available for importing, visualising and manipulating data. These libraries already implement the standard algorithms for supervised and unsupervised machine learning, so a program can be adapted to the available datasets and the required output by importing these pre-built libraries. The steps to accomplish machine learning are illustrated in Figure 3-2.

Figure 3-2: Steps to Perform Machine Learning

Several large companies, such as Facebook and Google, use the above-mentioned framework in their workflows, as illustrated in Figure 3-3.

Figure 3-3: Facebook Data Mining (research.fb, 2019)

Having examined a basic summary and background of machine learning, it is important to obtain a working knowledge of how machine learning functions in practice. Since datasets are a prerequisite to performing machine learning, data collection itself is skipped here.
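The training/test segregation described above can be sketched with scikit-learn's `train_test_split`. The 100-sample dataset here is randomly generated purely for illustration; a real workflow would load the microarray expression values instead:

```python
# Sketch: separating labelled data into training and test sets.
# The 100-sample dataset here is randomly generated, purely for illustration.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))          # 100 patients x 20 gene-expression values
y = rng.integers(0, 2, size=100)        # 0 = ALL, 1 = AML (toy labels)

# Hold out 25% of the samples as the test set; stratify keeps class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)      # (75, 20) (25, 20)
```

Fixing `random_state` makes the split reproducible, which matters when the same training/test partition must be reported alongside the results.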
Having collected the dataset, performing due diligence to segregate it needs no further elaboration. Once the data collection and segregation portion of the project has been fulfilled, we can move on to writing a program and training it. In the remainder of this report, the process of applying machine learning is elaborated with reference to the steps mentioned in Figure 3-6.

Once the required datasets have been collected and segregated, they must be imported into Python before any further steps are performed. The Pandas library will be used for this. Having weighed the pros and cons (see Figure 3-4), Pandas is efficient because it enables more work to be done with less code. Although it has a steep learning curve, this drawback matters little here, as Pandas is only being used to import the datasets. Pandas also returns the datasets as DataFrames backed by arrays, which can be filtered and, as discussed earlier, manipulated with NumPy.

Figure 3-4: Pros & Cons of the Pandas Library (DataFlair Team, 2019)

Having successfully imported the collected dataset into Python (Jupyter Notebook), we can, if applicable, filter the data to concentrate on the gene expressions of interest. Functions such as regex matching and `sort_values` can be used in conjunction with if-else logic to clean the data and select the gene expressions of interest. An interactive plot based on patient demographics can also be produced. As this step is not essential to demonstrating an understanding of machine learning, it is omitted in this report and will be examined and utilised further, if required, in the final presentation of the program.

Upon completing all the aforementioned steps, it is paramount to ensure proper segregation of the obtained data.
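The import-and-filter step above can be sketched as follows. The CSV content, the gene names and the column names are hypothetical placeholders; in practice `pd.read_csv` would point at the actual microarray file:

```python
# Sketch of importing and filtering microarray data with Pandas.
# The CSV text, gene names and column names are hypothetical placeholders.
import io
import pandas as pd

csv_text = """gene,patient_1,patient_2
BRCA1,0.5,1.2
TP53,2.1,0.3
ACTB,1.0,1.1
"""
df = pd.read_csv(io.StringIO(csv_text))     # normally: pd.read_csv("expression.csv")

# Keep only genes of interest via a regex on the gene column,
# then sort by the expression level measured for the first patient.
genes = df[df["gene"].str.contains(r"^(?:BRCA|TP)", regex=True)]
genes = genes.sort_values("patient_1", ascending=False)

print(genes["gene"].tolist())               # → ['TP53', 'BRCA1']
```

The same `str.contains`/`sort_values` pattern scales to the thousands of gene-expression columns mentioned earlier, since the filtering is vectorised rather than row-by-row if-else checks.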
There is a need to ensure that the training set is substantially larger than the test set. Moving forward, the model must be trained and optimised to meet the required standards.
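The train-then-evaluate step described above can be sketched end to end as follows. The dataset is synthetic, and `LogisticRegression` is just one of several scikit-learn classifiers that could stand in for the final AML/ALL model:

```python
# Sketch of training a classifier and evaluating it on the held-out test set.
# All data here is synthetic; a real workflow would load microarray values instead.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labelling rule standing in for AML vs ALL

# Larger training portion (70%) than test portion (30%), as recommended above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

Because accuracy is computed only on samples the model never saw during fitting, it estimates how the classifier would perform on new patients, which is the purpose of the test set.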
