Machine Learning Tutorial

This workflow will demonstrate how to run a logistic regression in Red Sqirl and create an evaluation method. You can use either the Spark Logistic Regression from the SparkML package or the Hama Logistic Regression from the Hama package.

You will need two files to complete this tutorial. You will find them in the tutorialdata directory (see the Pig Tutorial for how to transfer files).

Transfer these files onto the Hadoop file system into two new directories, “ml_tutorial_training_data.mrtxt” and “ml_tutorial_prediction_data.mrtxt”.

Goals:

  1. Build a Logistic Regression
  2. Build a reusable evaluation method

Build a Logistic Regression Model

This workflow will demonstrate the Hama/Spark Logistic Regression action, which runs a logistic regression over a data set. Its inputs are a training data set and a prediction data set, which means that two source actions are needed.

The operations for the Red Sqirl Spark, Pig and Hama packages can be different. In the following tutorial we have coloured the Spark-specific operations in orange and the Pig-specific ones in pink. Choose one or the other depending on the package you use.

The following will load training and prediction data sets.

  1. Create a new canvas by clicking the plus symbol on the canvas tabs bar
  2. Drag a Pig Text Source, double click on it, and name it “iris_train”.
  3. Select the “ml_tutorial_training_data.mrtxt” path.
  4. Copy and paste the header “ID STRING, SEPAL_LENGTH FLOAT, SEPAL_WIDTH FLOAT, PETAL_LENGTH FLOAT, PETAL_WIDTH FLOAT, SPECIES CATEGORY”.
  5. Drag a Pig Text Source, double click on it, and name it “iris_predict”.
  6. Select the “ml_tutorial_prediction_data.mrtxt” path.
  7. Copy and paste the header “ID STRING, SEPAL_LENGTH FLOAT, SEPAL_WIDTH FLOAT, PETAL_LENGTH FLOAT, PETAL_WIDTH FLOAT, SPECIES CATEGORY”.
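
For reference, each text source simply declares a path and a schema. Below is a minimal PySpark sketch of the equivalent load; this is not Red Sqirl's generated code, the comma delimiter is an assumption (adjust sep= to match the .mrtxt files), and the CATEGORY type is represented as a plain string.

    # Sketch: load the two text sources with the schema declared above.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ml_tutorial").getOrCreate()

    # Red Sqirl's CATEGORY type is kept as a string in this sketch.
    schema = ("ID STRING, SEPAL_LENGTH FLOAT, SEPAL_WIDTH FLOAT, "
              "PETAL_LENGTH FLOAT, PETAL_WIDTH FLOAT, SPECIES STRING")

    iris_train = spark.read.csv("ml_tutorial_training_data.mrtxt",
                                schema=schema, sep=",")
    iris_predict = spark.read.csv("ml_tutorial_prediction_data.mrtxt",
                                  schema=schema, sep=",")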

Finally we configure the model.

  1. Drag the logistic regression action to the canvas.
  2. Select the source “iris_train” and create a link to the new “hama LR”/“spark LR” action.
  3. A new window should appear asking you to select whether this input is a training or a prediction data set; select training and then click OK.
  4. Select the source “iris_predict” and create a link to the new “hama LR”/“spark LR” action.
  5. In this window select prediction and then click OK.
  6. Open the “hama LR”/“spark LR” action and name it “iris_model”.
  7. The first page lists three interactions: ID, Target and Target Value.
  8. In the ID interaction select “ID”.
  9. In the Target interaction select “SPECIES”.
  10. Finally, in the Target Value interaction, input Iris-setosa as the value (without quotes).
  11. Click next to see the model settings, such as the predictors and the parameters for running the model.
  12. For the purposes of this tutorial we will leave these interactions alone and just click next.
  13. Click OK.
  14. In the File top menu, save the workflow as “ml_tutorial”.
  15. Run the workflow.
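
For reference, the following is a hedged PySpark (spark.ml) sketch of what “iris_model” computes, continuing the sketch above: train a binary logistic regression where the positive class is SPECIES equal to the Target Value, then score the prediction set. The output column names label and score mirror the fields the action exposes later in this tutorial; everything else (using all four measurements as predictors, default parameters) is an assumption.

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.functions import vector_to_array  # Spark 3+
    from pyspark.sql import functions as F

    # Assumption: all four measurements are used as predictors.
    predictors = ["SEPAL_LENGTH", "SEPAL_WIDTH", "PETAL_LENGTH", "PETAL_WIDTH"]
    assembler = VectorAssembler(inputCols=predictors, outputCol="features")

    # Binary target: 1.0 when SPECIES matches the Target Value, else 0.0.
    train = assembler.transform(iris_train).withColumn(
        "target",
        F.when(F.col("SPECIES") == "Iris-setosa", 1.0).otherwise(0.0))

    model = LogisticRegression(featuresCol="features", labelCol="target").fit(train)

    # Score the prediction set; keep the ID (as "label") and the probability
    # of the positive class (as "score"), mirroring the action's output fields.
    iris_model = (model.transform(assembler.transform(iris_predict))
                  .select(F.col("ID").alias("label"),
                          vector_to_array(F.col("probability"))[1].alias("score")))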

Build an Evaluation Method

First we need to join the scores we obtained with the prediction data.

  1. Drag a Spark or Pig Join.
  2. Select the source “iris_predict” and create a link to the new join action.
  3. Select the LR model “iris_model” and create a link to the new join action.
  4. Open the new join action and name it “score_vs_value”.
  5. Click next on the first page.
  6. On the second page create two new fields. Two spellings of the value operation are given below: the IF(...) form follows the Spark syntax and the CASE WHEN ... == ... form the Pig one; use the one matching your package.
    Operation                                                          | Field Name | Type
    iris_model.score                                                   | score      | FLOAT
    IF(iris_predict.SPECIES = 'Iris-setosa',1,0)                       | value      | INT
    CASE WHEN iris_predict.SPECIES == 'Iris-setosa' THEN 1 ELSE 0 END  | value      | INT
  7. Click next.
  8. Join on ID and label. The table should look like:
    Relation     | Join Field
    iris_predict | iris_predict.ID
    iris_model   | iris_model.label
  9. Click OK.
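
Continuing the PySpark sketch, the join above corresponds to something like the following (the when/otherwise expression plays the role of both the IF and the CASE WHEN spellings):

    # Sketch of "score_vs_value": attach the true value to each score.
    score_vs_value = (iris_predict
        .join(iris_model, iris_predict.ID == iris_model.label)
        .select(iris_model.score,
                F.when(iris_predict.SPECIES == "Iris-setosa", 1)
                 .otherwise(0).alias("value")))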

Create Bins

We will now split our scores into 10 bins of equal size, ranked on the score. We expect to see a high accuracy on high scores.

  1. Drag a Spark or Pig Volume Binning.
  2. Select the “score_vs_value” action and create a link to the new binning action.
  3. Open it and name it “bin_score”.
  4. Choose “score” as the binning field.
  5. Type 10 for the number of bins.
  6. Click next.
  7. Click OK.
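
Volume binning assigns rows to bins of (roughly) equal row counts, ranked by the binning field. One way to express this in the PySpark sketch is a window ntile; the BIN_score field name matches the one the action produces, the rest is an assumption:

    from pyspark.sql import Window

    # Sketch of "bin_score": 10 equal-volume bins ranked on score.
    # ntile(10) over an ordering by score assigns each row a bin from 1 to 10.
    bin_score = score_vs_value.withColumn(
        "BIN_score", F.ntile(10).over(Window.orderBy("score")))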

Global Model Metrics

We will calculate some global metrics of the model, such as how many rows were scored and how many of them were actual positives to be predicted.

  1. Drag a Spark or Pig aggregator.
  2. Select the “score_vs_value” action and create a link to the new aggregator action.
  3. Open it and name it “score_glob_prop”.
  4. Click next.
  5. On the second page create two new fields:
    Operation  | Field Name    | Type
    SUM(1)     | TOTAL_SCORED  | FLOAT
    SUM(value) | TOTAL_PREDICT | FLOAT
  6. Click OK.
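
In the PySpark sketch this is a single aggregation over the whole data set:

    # Sketch of "score_glob_prop": one row with the global counts.
    score_glob_prop = score_vs_value.agg(
        F.sum(F.lit(1)).alias("TOTAL_SCORED"),   # SUM(1): number of scored rows
        F.sum("value").alias("TOTAL_PREDICT"))   # SUM(value): number of positives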

Bin Model Metrics

We need to calculate the same values per bin.

  1. Drag a Spark or Pig aggregator.
  2. Select the “bin_score” action and create a link to the new aggregator action.
  3. Open it and name it “score_bin_prop”.
  4. Select “BIN_score” on the first page and click next.
  5. On the second page, use the “Copy” generator and create two additional fields:
    Operation  | Field Name | Type
    SUM(1)     | SCORED     | FLOAT
    SUM(value) | PREDICT    | FLOAT
  6. Click OK.
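
The per-bin version is the same aggregation grouped by the bin field, continuing the sketch:

    # Sketch of "score_bin_prop": the same sums, grouped by bin.
    score_bin_prop = bin_score.groupBy("BIN_score").agg(
        F.sum(F.lit(1)).alias("SCORED"),
        F.sum("value").alias("PREDICT"))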

Evaluation

Create the evaluation end result.

  1. Drag a Spark or Pig Join.
  2. Select the “score_bin_prop” action and create a link to the new join action.
  3. Select the “score_glob_prop” action and create a link to the new join action.
  4. Open it and name it “evaluation”.
  5. Click next on the first page.
  6. On the second page create the following eight fields; you can use the copy generator to help you:
    Operation                                                     | Field Name    | Type
    score_bin_prop.BIN_score                                      | BIN           | INT
    score_bin_prop.SCORED                                         | SCORED        | FLOAT
    score_bin_prop.PREDICT                                        | PREDICT       | FLOAT
    score_bin_prop.PREDICT / score_bin_prop.SCORED                | PREDICT_RATE  | FLOAT
    (score_bin_prop.PREDICT * score_glob_prop.TOTAL_SCORED) / (score_bin_prop.SCORED * score_glob_prop.TOTAL_PREDICT) | LIFT | FLOAT
    score_glob_prop.TOTAL_SCORED                                  | TOTAL_SCORED  | FLOAT
    score_glob_prop.TOTAL_PREDICT                                 | TOTAL_PREDICT | FLOAT
    score_glob_prop.TOTAL_PREDICT / score_glob_prop.TOTAL_SCORED  | BACKGROUND    | FLOAT
  7. Click next.
  8. Join every row of the two relations together by filling out 1 in the Join Field column for both; joining on the constant 1 pairs each bin with the single global row.
  9. Click OK.
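
Note that LIFT is simply PREDICT_RATE divided by BACKGROUND: how much more often the positive value occurs in a bin than in the data set as a whole. Continuing the PySpark sketch, the join on the constant 1 becomes a cross join:

    # Sketch of "evaluation": pair every bin with the single global row,
    # then derive the rate, lift and background columns.
    evaluation = (score_bin_prop.crossJoin(score_glob_prop)
        .select(
            F.col("BIN_score").alias("BIN"),
            "SCORED", "PREDICT",
            (F.col("PREDICT") / F.col("SCORED")).alias("PREDICT_RATE"),
            ((F.col("PREDICT") * F.col("TOTAL_SCORED")) /
             (F.col("SCORED") * F.col("TOTAL_PREDICT"))).alias("LIFT"),
            "TOTAL_SCORED", "TOTAL_PREDICT",
            (F.col("TOTAL_PREDICT") / F.col("TOTAL_SCORED")).alias("BACKGROUND")))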

You can now run the workflow and see the result. The model should appear very accurate on this toy data.

Create a Super Action

What we will do now is add this evaluation method to your footer, so that you can reuse it.

  1. Select the actions “bin_score”, “score_bin_prop”, “score_glob_prop” and “evaluation” with the mouse while holding the CTRL key.
  2. Go to Edit > Aggregate.
  3. On the new page, change the name of the sub-workflow to “evaluation10bin”.
  4. Fill out the form as below.
    • In the list of inputs: “score_and_value”
    • In the list of outputs: “eval_10”
    • In the description: “Score Evaluation split in 10 bins. The input should be a dataset with a score (value between 0 and 1) and a value (0 or 1).”
  5. Click OK.
  6. Save the workflow.
  7. Note that if you click on the new action, a new page is displayed in the help tab.
  8. Go into the footer editor and create a new row with “+”. Name it “eval”.
  9. Click on “...” next to “eval”, choose “default” in the top drop-down menu and add “sa_evaluation10bin”.
  10. Click OK.
  11. Click OK.

Once you have finished with this tutorial, don't forget to clean the workflow before closing it.

Summary of workflow

In this workflow we have built a logistic regression model on the iris training data, scored a prediction data set, evaluated the scores by splitting them into 10 equal-volume bins and computing the lift per bin, and packaged the evaluation method as a reusable super action available from the footer.