Pig Tutorial

Goals:

  1. Create SSH connections to remote servers
  2. Copy files from remote servers to the Hadoop file system (HDFS)
  3. Change your footer menu
  4. Build workflows that perform queries over data sets
  5. Learn how to use some of the actions that are available in the Pig package

Pig Workflow

Copying a File from a Remote Server to HDFS

The purpose of this task is to demonstrate the “Remote File System” functionality.

It will show how to connect to a remote server and copy files from it to the Hadoop file system. A sketch of the equivalent Grunt shell commands follows the steps below.

  1. Click on Remote File System.
  2. Click the plus symbol to create a new SSH connection to a remote server.
    • Host : localhost
    • Port : leave the default
    • Password : enter your password
    • Save : leave unchecked
  3. Change the working directory to “tutorialdata” in the Red Sqirl home directory by editing the path field above the listed directories, then click the search symbol.
  4. From the remote file system tab, which should display the “/${redsqirl_home}/tutorialdata/” directory on localhost, drag the file “pig_tutorial_data.txt” to the Hadoop file system tab.
  5. Locate the “pig_tutorial_data.txt” file on the Hadoop file system tab (you might have to click the search symbol beside the current path).
  6. Click the plus symbol to create a new directory on the Hadoop file system and give it the name “pig_tutorial_data.mrtxt”.
  7. Click the checkbox beside the “pig_tutorial_data.txt” file and then click the move symbol under the current working directory.
  8. In the new window, click the radio button beside the “pig_tutorial_data.mrtxt” folder and click OK.
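
For reference, these steps do roughly what the following Pig Grunt shell (fs) commands do, assuming the file has already been fetched from the remote server to the local machine (the paths are illustrative):

    -- create the target HDFS directory, then copy the local file into it
    fs -mkdir pig_tutorial_data.mrtxt
    fs -copyFromLocal pig_tutorial_data.txt pig_tutorial_data.mrtxt/

The “Remote File System” feature spares you the intermediate local copy by moving the file from the SSH connection to HDFS directly.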

Change your footer

The action footer is the little frame on the bottom left of the FlowChart tab.

  1. Click the green information symbol.
  2. Once the configuration popup has appeared, click “+” in the left column.
  3. Type “extraPig” on the new empty line.
  4. Click the “...” symbol.
  5. On the new window, select redsqirl_pig in the drop-down menu.
  6. Click on the check box next to pig_audit, and click on the Select button.
  7. Click OK.
  8. Click OK.

You should see a new footer tab called “extraPig” with “Pig Audit” inside. To remove this new menu, do the following.

  1. Click the green information symbol.
  2. Click on the check box beside “extraPig”.
  3. Click on the Delete button (bin icon) in the table header.
  4. Click OK.

Create a Workflow

Set Up a Source Action

This task shows how the source action can be configured to select flat files and change properties such as the file's delimiter, header, and field types. A Pig Latin sketch of the equivalent load statement follows the steps.

  1. In the actions footer, drag a new source icon onto the canvas.
  2. Double click to open the source.
  3. Name the action “communication”.
  4. Comment the action “Configure tutorial data”.
  5. Click OK.
  6. Select “Hadoop Distributed File System” as the data type, then click next.
  7. Select “TEXT MAP-REDUCE DIRECTORY” as the data subtype and click next.
  8. On the data set screen, click on the path field or on the button.
  9. Click on the radio button beside “pig_tutorial_data.mrtxt” (if you cannot find it, refresh the view by clicking on the search button) and click OK.
  10. At this stage you will see the data correctly displayed on the screen; the fields are named “Field1 Long, Field2 Long...”.
  11. On the feature title line, click on the edit button.
  12. Once the menu appears, choose “Change Header”.
  13. Copy and paste “subscriber_number STRING, Friend STRING, offpeak_voice INT, offpeak_sms INT, offpeak_mms INT, peak_voice INT, peak_sms INT, peak_mms INT, sna_weight INT, subscriber_onnet INT, friend_onnet INT” into the value field.
  14. Click OK. You will have the confirmation that the Header is correct.
  15. Click OK to exit from the Configuration window.
  16. If you leave the mouse cursor on the source action, you will see some configuration details.
  17. Save the workflow by going into File > Save and name it “pig_tutorial”. By default it is saved in the redsqirl-save HDFS directory and the file will have the extension ‘.rs’. Click OK to save.
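
Conceptually, the source action configured above corresponds to a Pig Latin LOAD statement. A minimal sketch, assuming the directory created earlier and a tab delimiter (use whatever delimiter the action actually shows):

    -- sketch only; the script Red Sqirl generates may differ
    communication = LOAD 'pig_tutorial_data.mrtxt'
        USING PigStorage('\t')   -- delimiter is an assumption
        AS (subscriber_number:chararray, Friend:chararray,
            offpeak_voice:int, offpeak_sms:int, offpeak_mms:int,
            peak_voice:int, peak_sms:int, peak_mms:int,
            sna_weight:int, subscriber_onnet:int, friend_onnet:int);

Red Sqirl's STRING and INT types map to Pig's chararray and int.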

Set Up a Pig Aggregator Action

Pig Aggregator is an action that lets you apply aggregation methods (AVG, MAX, SUM, etc.) when selecting columns, as you would in an SQL statement. The action groups by the selected attributes, or over the whole data set if none is selected (the default). Pig Latin sketches of the two aggregators configured below follow their respective step lists.

  1. Drag a pig aggregator action to the canvas.
  2. Create a link between the source that was just configured and the new pig aggregator action by clicking between the image and the arc of the source action and then clicking on the pig aggregator image.
  3. Open the new pig aggregator, name the element “nl_sum” and click OK.
  4. Select “subscriber_number” and click "Select" in the Group by interaction.
  5. Click next.
  6. Select “copy” from the drop-down menu in the generator interaction and click OK.
  7. On the top of the table click the “+” symbol to add a new row to the table.
  8. Click on the pen in the Operation field of the new row, click the “SUM()” function, and add the parameters “offpeak_voice” and “peak_voice”. Between the parameters add a “+” symbol so that the operation reads “SUM(offpeak_voice + peak_voice)”.
  9. Click OK.
  10. In the Field Name column, type “total_voice” for the new column and change the type to DOUBLE.
  11. Click OK.
  12. In “Project” menu on the top, click “Save and Run” to run the workflow.
  13. To see the results, leave your mouse on the action “nl_sum”, hit Ctrl, then go to Options > Data Output.
  14. You can close it by hitting “Cancel” or “OK”.
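
In Pig Latin terms, “nl_sum” is roughly the following sketch (building on the “communication” relation from the load sketch above):

    -- compute per-row total voice, then group by subscriber and sum
    voice_rows = FOREACH communication GENERATE
        subscriber_number, offpeak_voice + peak_voice AS voice;
    by_sub = GROUP voice_rows BY subscriber_number;
    nl_sum = FOREACH by_sub GENERATE
        group AS subscriber_number,
        (double) SUM(voice_rows.voice) AS total_voice;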
Next, create a second aggregator that averages the total voice over the entire data set.

  1. Drop another pig aggregator onto the canvas.
  2. Make a link between the source and the new pig aggregator.
  3. Open it, give it the name “comm_groupbyall” and click OK.
  4. This time leave the group-by list alone so nothing is selected, and click next.
  5. Create a new row (+ button).
  6. In this new row copy and paste “AVG(offpeak_voice + peak_voice)”.
  7. Call the field “total_voice_avg” and select the DOUBLE type.
  8. Click OK.
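
Because nothing is selected in the group-by list, “comm_groupbyall” corresponds to Pig's GROUP ... ALL, which collapses the whole relation into a single group (again a sketch, not the generated script):

    -- average total voice over the entire data set
    voice_all = FOREACH communication GENERATE
        offpeak_voice + peak_voice AS voice;
    grouped_all = GROUP voice_all ALL;
    comm_groupbyall = FOREACH grouped_all GENERATE
        AVG(voice_all.voice) AS total_voice_avg;

AVG already returns a double, which matches the DOUBLE type selected above.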

Perform a Pig Join Action

To work with both data sets together, we need to perform a join on them. A Pig Latin sketch of the join follows the steps.

  1. Drop a pig join onto the canvas.
  2. Create a link from “comm_groupbyall” to the new pig join action.
  3. Create a link from “nl_sum” to the new pig join action.
  4. Double click the pig join and call it “nl_vs_total”.
  5. The first page lists the table aliases; click next.
  6. On the following page, make sure that “copy” is selected as the generator and click OK.
  7. Click next.
  8. This page has two interactions that specify the join type and the fields to join on. We use the default join type, “Join”, so this does not need to be changed.
  9. In the “Join Field” column, type “1” in both rows. Because “comm_groupbyall” contains a single row, joining both tables on the constant 1 attaches the overall average to every subscriber row.
  10. Click OK.
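
A minimal Pig Latin sketch of the same idea, continuing the aliases from the earlier sketches:

    -- key both relations on the constant 1 and join; the single
    -- average row gets paired with every subscriber row
    nl_keyed = FOREACH nl_sum GENERATE 1 AS k, subscriber_number, total_voice;
    avg_keyed = FOREACH comm_groupbyall GENERATE 1 AS k, total_voice_avg;
    nl_vs_total = JOIN nl_keyed BY k, avg_keyed BY k;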

Filter a Data Set

Now we want to add a condition to find which subscribers have higher total voice usage than the average of the entire data set. The easiest approach would be to add the condition to the join, but for demonstration purposes we will create a new filter action. A sketch of the equivalent FILTER statement follows the steps.

  1. Drop a new pig filter action onto the canvas.
  2. Create a link from the pig join action to the new pig filter action.
  3. Open it and change the element id to “high_voice”.
  4. Click on the pen in the Condition section and write “nl_sum_total_voice > comm_groupbyall_total_voice_avg”.
  5. Click OK and then finish the action by clicking OK.
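
The equivalent Pig Latin is a single FILTER over the join output (using the aliases from the join sketch; Red Sqirl itself exposes prefixed field names such as “nl_sum_total_voice”):

    -- keep subscribers whose total voice exceeds the overall average
    high_voice = FILTER nl_vs_total
        BY nl_keyed::total_voice > avg_keyed::total_voice_avg;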

We would now like to keep the “nl_vs_total” intermediate result before running the workflow.

  1. Select “nl_vs_total” and go into the top menu Edit > Output State > Buffered.

You can now “Save and Run” and see your results in the Data Output of “nl_vs_total” and “high_voice”.

  • To see the results, leave your mouse on the action “nl_vs_total” or “high_voice”, hit Ctrl, then go to Options > Data Output.
  • You can close it by hitting “Cancel” or “OK”.
  • Once you are happy with the result you can clean all the data generated by this workflow by clicking on “Select All” and then “Clean Actions” in the “Edit” top menu.

Summary of workflow

In this workflow we have:

  1. configured a source for an HDFS file
  2. selected the sum of two columns
  3. computed the average of a column
  4. computed the sum of a column when grouping by another
  5. joined the two tables
  6. filtered a table with a condition