
10 Minutes to Build Your First Solution for the Azure Machine Learning Competition: Assessing Young Women's Health Risks

Introduction

The primary goal of this competition is to predict a female's health risk (such as HIV infection) from the demographic and medical questionnaire data the subject provides when she visits a clinic. Each subject in this study is assigned two categorical labels: (1) a top-level label indicating her membership in a segment, and (2) a lower-level label indicating her subgroup within that segment. The task of this competition is to build a predictive system such that, in production, once a female visitor to a clinic fills out the questionnaire, the model can automatically assign labels to her. Insights gleaned from these models have potential value to medical professionals who are charged with providing personalized educational strategies to the female visitors to the clinics so that their health risks can be reduced. This competition takes place on Microsoft's Azure platform using its built-in and customizable machine learning tools.

Azure Machine Learning (AML) is a cloud-based tool that enables data scientists and big data professionals to build and operationalize predictive analytics solutions with ease. The following tutorial is a possible starting point for those who are interested in participating in this competition and learning about the diverse capabilities of AML. You can use it as a baseline and start building upon it in the rich AML Studio environment by dragging and dropping from an extensive set of available algorithms or by adding your own custom R and Python scripts.

Overview of the Sample Solution in this Tutorial

This tutorial provides instructions on how you can build a qualified solution for this competition in just 10 minutes, based on a sample training experiment Microsoft has provided. With only a few clicks and drag-and-drop actions, your first entry for this competition can be submitted, and you will see your team on the leaderboard with a reasonable score. The sample experiment only serves as a tutorial on how to build solutions in AzureML; there is significant room to improve the solution and obtain better scores. Please also watch our 4-minute tutorial video in the Tutorial section of the Competition Detail Page.

Below are the graphs of the sample training experiment and of the predictive experiment built from it:

Training experiment: Young Women's Health Risk Assessment in Underdeveloped Regions
Predictive experiment: Young Women's Health Risk Assessment in Underdeveloped Regions [Predictive Exp.]

Requirements on the Web Service API Input and Output Schema

The input schema of the web service API HAS TO BE the same as the schema of the training data. The input data schema can be found in the data description section. The output schema of the web service API HAS TO BE as follows:

Column Index  Column Name    Data Type
1             patientID      Integer
2             Geo_Pred       Integer
3             Segment_Pred   Integer
4             Subgroup_Pred  Integer

Make sure that the column order, column names, and data types in your web service API output schema are exactly as above. Otherwise, you will not be able to submit, or you will not get the score you expect.
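If you want to sanity-check your output format before submitting, the following minimal R sketch builds a data frame with the required column names, order, and integer types. The values are hypothetical placeholders, not real predictions:

# Minimal sketch of the required web service output schema (hypothetical values).
# Column names, order, and integer types must match the table above exactly.
sample_output <- data.frame(
  patientID     = as.integer(c(101, 102)),  # hypothetical patient IDs
  Geo_Pred      = as.integer(c(9, 5)),      # hypothetical geo predictions
  Segment_Pred  = as.integer(c(1, 2)),      # hypothetical segment predictions
  Subgroup_Pred = as.integer(c(2, 1))       # hypothetical subgroup predictions
)
str(sample_output)  # verify column names, order, and types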
Five Steps to Build Your First Solution in 10 Minutes

Here are the five steps you can take to build your first machine learning solution for this competition in 10 minutes:

Step 1. Sign in to Azure Machine Learning
Step 2. Enter the Competition
Step 3. Run the Sample Training Experiment
Step 4. Build and Run a Predictive Experiment, Publish and Submit Web Service API for Evaluation
Step 5. View Your Ranking on the Public Leaderboard

For more details on how the sample training experiment is built, and on how you can build a new solution on AzureML for this competition, refer to the deep dive and instructions in the Appendix:

How the Sample Training Experiment Is Built: a Deep Dive
How to Create a New Experiment for the Competition

Step 1. Sign in to Azure Machine Learning

Step 1.1. Open the Azure Machine Learning web page in any browser, then click "Get started now".

Step 1.2. Sign in to AzureML Studio. You will be directed to the Microsoft sign-in page. If you already have a Microsoft account, an Office 365 account, or any valid Windows Azure Active Directory account, you can sign in directly. Otherwise, click the "Sign up now" link to sign up for a Microsoft account.

Note: After you make a successful submission to the competition, the public leaderboard will show the name associated with the Microsoft account you use to log in. If you prefer to remain anonymous, change the name associated with this account before you submit.

Step 2. Enter the Competition

Look for this competition in the AzureML Gallery. To enter the competition, click Gallery at the top of your Studio page and you will be directed to the Cortana Analytics Gallery. Click Competitions and you will find the competition Young Women's Health Risk Assessment in Underdeveloped Countries/Regions on the Machine Learning Competitions page.

Visit the information page of the competition. Click the Health Risk Assessment competition to see information about it such as the summary, description, data files, rules, prizes, leaderboard, etc.

Enter the competition. Click Enter competition to copy the sample training experiment to your AzureML workspace.

Step 3. Run the Sample Training Experiment

After the sample training experiment Young Women's Health Risk Assessment in Underdeveloped Regions is copied to your Azure ML workspace, copy and paste the Azure storage account key provided by the Competitions team into the Reader module so that it can read the training data from Azure blob storage. You can also name your experiment at this stage. Click the Run button at the bottom of the page and the sample experiment will start running; it may take around 1 minute to complete.

Step 4. Build and Run the Predictive Experiment, Publish, and Submit Web Service API for Evaluation

4.1. Create the predictive experiment automatically. After the sample training experiment completes successfully, click SET UP WEB SERVICE at the bottom of the page, then click Predictive Web Service. The program automatically generates a predictive experiment that uses the model trained in the sample training experiment to make predictions.

4.2. Slightly modify the predictive experiment. Because the training experiment aggregates the 3 label columns into one using R code, in the predictive experiment you need to decompose the predicted label back into 3 separate label columns before sending the prediction result to the web service output port. You can add an Apply SQL Transformation module to the automatically generated predictive experiment to build a web service API with the required output data schema. (There are certainly other ways to do this as well; one alternative in R is sketched after the checklist below.) Please follow the detailed steps below.

You can find modules on the left side of the Studio. To locate a module faster, type its name into the search field "Search experiment items", then drag and drop the module you need from the search results into your experiment.

Replace the SQL query script in the Apply SQL Transformation module with the following script:

select patientID,
  cast(substr("Scored Labels",1,1) as int) as Geo_Pred,
  cast(substr("Scored Labels",2,1) as int) as Segment_Pred,
  cast(substr("Scored Labels",3,1) as int) as Subgroup_Pred
from t1;

Checklist before you proceed:
□ The output portal of the Score Model module is connected to the first input portal of the Apply SQL Transformation module.
□ The Web Service Output is connected to the output portal of Apply SQL Transformation.
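If you prefer to keep everything in R, an Execute R Script module can perform the same decomposition instead of the SQL transformation. The sketch below is only one possible alternative, not the sample solution's approach; it assumes the module's first input is connected to the Score Model output and that the scored column is named "Scored Labels", as in the SQL above:

# Alternative to the SQL transformation: split the 3-digit combined label in R.
dataset1 <- maml.mapInputPort(1) # class: data.frame
scored <- as.character(dataset1[["Scored Labels"]])  # e.g. "912" -> geo 9, segment 1, subgroup 2
data.set <- data.frame(
  patientID     = dataset1$patientID,
  Geo_Pred      = as.integer(substr(scored, 1, 1)),
  Segment_Pred  = as.integer(substr(scored, 2, 2)),
  Subgroup_Pred = as.integer(substr(scored, 3, 3))
)
maml.mapOutputPort("data.set");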
4.3. Deploy the web service API. RUN the predictive experiment, and DEPLOY WEB SERVICE after it completes. Click the RUN button at the bottom of the page; the predictive experiment should complete in less than 1 minute. Then click DEPLOY WEB SERVICE to generate your web service API.

4.4. Submit your web service API for evaluation. After Deploy Web Service is done, click SUBMIT COMPETITION ENTRY at the bottom of the page to submit and have your web service API evaluated on the testing data. You have to agree to the terms during the submission process. You also have the opportunity to give your submission a customized name, which can be very helpful for remembering the features and/or models you used in that solution.

Step 5. View Your Ranking on the Public Leaderboard

After your web service API is successfully evaluated on the public testing data, you will see a green check mark in the bottom left corner of the page indicating that your solution has been successfully evaluated. Click VIEW COMPETITION SUBMISSION IN GALLERY and you will be redirected to the competition page in the Gallery, where you can see your submission history. It may take 1 or 2 minutes before your score on the public test data is returned from the evaluation process. After that, you will be able to see your current ranking on the public leaderboard.

Appendix

How the Sample Training Experiment Is Built: a Deep Dive

This sample training experiment illustrates some typical steps for creating an end-to-end pipeline for a machine learning task. It consists of the following steps:

Ingesting and visualizing the raw data
Combining the geo column and the two label columns into a new label
Excluding segment and subgroup from the data set
Excluding patientID from the feature list
Cleaning missing values from the data
Splitting the data into a training set and a validation set
Training a predictive model
Scoring and evaluating a trained model

Ingesting and visualizing the raw data

The organizers have provided the training data in tabular CSV format in the sample training experiment. When you copy the sample training experiment to your workspace, this data is also copied to your workspace. You can visualize the data by simply right-clicking the output portal of the training dataset and selecting Visualize. After the visualization window pops up, you can select any column; the statistics and the histogram of the selected variable will then be shown in the right panel.
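If you also want to inspect the data programmatically, for example in an Execute R Script module, a minimal sketch along these lines works; the column name age is only a hypothetical example, not necessarily a column in this data set:

dataset1 <- maml.mapInputPort(1) # class: data.frame
print(summary(dataset1))  # per-column statistics, similar to what Visualize shows
hist(dataset1$age)        # histogram of a hypothetical numeric column named "age"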
Combining the geo column and the two label columns into a new label

An Execute R Script module combines the geographical ID column and the two label columns in the training data into one single label column, so that later on you can build a single machine learning model for all regions, segments, and subgroups. (Incidentally, this is why you need to decompose this single column back into 3 columns when you construct your predictive experiment.) Here is the R script used in this module:

dataset1 <- maml.mapInputPort(1) # class: data.frame
combined_label <- 100*dataset1$geo + 10*dataset1$segment + dataset1$subgroup
data.set <- cbind(dataset1, combined_label)
maml.mapOutputPort("data.set");

Excluding segment and subgroup from the data set

Since a new label column was created in the previous step, the columns segment and subgroup are no longer needed and can be excluded. Use a Project Columns module to exclude them. The Launch column selector button on the right side of the screen gives you access to a dialog for selecting the columns you wish to include in or exclude from the subsequent pipeline.

Excluding patientID from the feature list

Since patientID is unique for each row in both the training and testing data, it should not be included in the feature set. Use a Metadata Editor module to clear it from the feature list. DO NOT use a Project Columns module to remove it here, because you still need it to correlate the prediction results back to the scoring dataset.

Cleaning missing data

Use the Clean Missing Data module to impute missing values. In the module's settings you can, for example, replace missing values with zeros; the sample experiment just uses the default settings to clean missing data. You are encouraged to explore other ways to impute the missing data (one simple alternative is sketched below)!
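For instance, if you experiment with an Execute R Script module for preprocessing, median imputation is one simple alternative to filling with a constant. This is a rough sketch under the assumption that missing values appear as NAs in numeric columns; it is not part of the sample solution:

# Sketch: replace NAs in numeric columns with the column median instead of a constant.
dataset1 <- maml.mapInputPort(1) # class: data.frame
for (col in names(dataset1)) {
  if (is.numeric(dataset1[[col]])) {
    med <- median(dataset1[[col]], na.rm = TRUE)
    dataset1[[col]][is.na(dataset1[[col]])] <- med
  }
}
data.set <- dataset1
maml.mapOutputPort("data.set");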
Splitting the data into a training set and a validation set

To train and validate a model, split the data into training and validation sets using the Split module. In this sample experiment, the data is split randomly into 75% and 25%: the 75% portion is output from the first output portal and used as the training data, and the remaining 25% is output from the second output portal and used as the validation data. The split is random and stratified on the label column combined_label.

Training a predictive model

Now train the model using the training data. First, add the Multiclass Logistic Regression module from the Machine Learning -> Initialize Model -> Classification menu to the experiment. This sample training experiment trains a multiclass logistic regression model in the interest of simplicity, but you can explore other available models or create your own R models through Create R Model by following an example in the Cortana Analytics Gallery.

The Train Model module, found in the Machine Learning -> Train menu, carries out the actual training of the logistic regression model. This module takes two inputs: the left input portal takes the model specification, and the right input portal takes the training data. The Train Model module must specify the label column; here, indicate that combined_label is the target column through the column selector dialog in the Properties box.

Scoring and evaluating a trained model

After the model is trained, you can use it to predict on the training and validation data sets. The Score Model module in the Machine Learning -> Score menu accomplishes this task. It takes a trained model in its left input port and the data set to be scored in its right input port. In the sample experiment, the logistic regression model is scored against the training set in the left Score Model module and against the validation set in the right one. If you visualize the output data set from the right Score Model module, you will see several scoring columns appended to the end of the original data set.

The scored data set can now be evaluated using the Evaluate Model module in the Machine Learning -> Evaluate menu. This module takes one or two scored data sets as input. Evaluation, in this case, displays several metrics related to the accuracy of the model's predictions, as well as a graphical representation of the confusion matrices associated with the prediction task, for both the training and the validation set. Comparing the prediction performance of the model on the training and validation data helps determine whether the model is over-fitting the training data: if the validation performance is much worse than the training performance, your model may be over-fitted to the training data.
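If you want to prototype these last three steps (splitting, training, and evaluating) in plain R, for example while developing a Create R Model variant, the following rough sketch mirrors the description above. It is not the sample solution: the nnet package, the 75/25 stratified split, and the data frame name dat are assumptions made for illustration.

# Rough local sketch: stratified 75/25 split, multinomial logistic regression,
# and a confusion matrix with accuracy on the validation set.
# Assumes a data frame `dat` with a label column `combined_label`, a patientID
# column, and feature columns that have already been cleaned.
library(nnet)

set.seed(123)
dat$combined_label <- as.factor(dat$combined_label)

# Stratified split: keep 75% of the rows of each label value for training.
train_idx <- unlist(lapply(
  split(seq_len(nrow(dat)), dat$combined_label),
  function(idx) sample(idx, size = floor(0.75 * length(idx)))
))
train <- dat[train_idx, ]
valid <- dat[-train_idx, ]

# Multiclass (multinomial) logistic regression, excluding patientID from the features.
model <- multinom(combined_label ~ . - patientID, data = train, maxit = 200, trace = FALSE)

# Score the validation set and inspect the confusion matrix and accuracy.
pred <- predict(model, newdata = valid)
conf_mat <- table(actual = valid$combined_label, predicted = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)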
How to Create a New Experiment for the Competition

If you want to create a new training experiment, click Save and then Save As to save the sample training experiment as a new one, and make further edits (feature engineering, trying different models, etc.) on it. DO NOT directly click NEW to create a new experiment, because experiments created that way will not be recognized as competition experiments, and web service APIs created from such training experiments will not have the button to submit an entry for evaluation.

After the new training experiment is created, you can follow Step 4 above (Build and Run the Predictive Experiment, Publish, and Submit Web Service API for Evaluation) to deploy and submit your new solution.

