Canvas

Workflow

The canvas is where workflows are built. Workflows in Red Sqirl are run as jobs by Oozie. A workflow can contain many linked actions that produce results. When an action is added to the canvas it is given an auto-generated name, which can be changed. The new action must be configured for the workflow to run correctly. A Source should be the first action added to a workflow: it identifies the type and location of the data. Once a source is in place, actions can be linked together. Whether two actions can be linked depends on their input and output types. For example, to link a Source to a Hive action, the source data type has to be Hive.

To get more details about a particular action, select it and press Ctrl. Useful information is displayed, such as the output fields, errors and comments.

Canvas tutorials can be found here

Action

In Red Sqirl an action is a procedure that controls a data set and/or runs a process on Hadoop. An action can take one or several input actions to form a complex workflow.

Naming Actions

When naming or renaming actions, remember that action names should not be used as field names when configuring the action. If an action's name is used as a field name, it could cause problems when running the workflow.

Understanding the Icons

Actions are represented on the canvas by icons. A standard defines the shape that each action's icon should have. Here is a list of these icons and the actions they represent:

Output Type

The output type defines what is needed to read a data set. It may be the technology used (Apache Hive or Hadoop Distributed File System), the path to the data set or format details.

Packages may include their own output types, but some exist by default.

Action's Input Class

An action can take several input classes. A class is a set of datasets; each dataset has to satisfy the class constraints on its type and sometimes on its fields. Each class is associated with a unique name within the action.

Saving State

An output is generated after each canvas action runs. Three categories of action exist.

Output Links

Some actions create more than one dataset, and others accept more than one type of dataset. In these cases you may be asked to choose a dataset when linking. A link name is always in the form output -> input, where output and input are the names of the action output and the action input respectively. When there is only one possible choice, the output or input name is omitted. For example, '-> input' means that the output action has only one output that fits the input constraint.
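
The naming rule above can be sketched in a few lines of Python. This is only an illustration: the function link_name and its parameters are hypothetical and are not part of Red Sqirl.

```python
def link_name(output_name, input_name, output_choices, input_choices):
    """Build a link label in the 'output -> input' form.

    output_choices and input_choices are the number of candidate outputs
    and inputs for the link (hypothetical parameters, for illustration).
    A name is omitted when there is only one possible choice on that side.
    """
    left = output_name if output_choices > 1 else ""
    right = input_name if input_choices > 1 else ""
    return f"{left} -> {right}".strip()
```

With a single candidate output and two candidate inputs, link_name("result", "input", 1, 2) gives '-> input', which matches the example in the text.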

Data Type

Every field is associated with a type in Red Sqirl. Types help clarify what the fields are and what can be done with them; for example, it would not make sense to add two dates together.

Grid

Grids are often used in Red Sqirl, and they sometimes come with generators. A generator adds pre-prepared rows to your grid. These rows can then stay as they are or be modified to fit your needs.

Variables

Variables can be defined in a workflow by double-clicking on the parameter object at the top left. Functions are available as well as static values. Variables can be used in almost all text fields, but not in field names or other workflow metadata parameters. A variable name should contain only letters, underscores and numbers. You can also call Oozie functions inline. Note that Red Sqirl will not try to validate variable values, so you may experience errors at run time.
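
Because variable values are not validated, it can help to check names before typing them in. A minimal sketch of the naming rule, assuming "letters" means ASCII letters (the helper is_valid_variable_name is hypothetical, not a Red Sqirl function):

```python
import re

# Letters, digits and underscores only, as stated above.
# Restricting letters to ASCII is an assumption of this sketch.
VARIABLE_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_variable_name(name):
    """Return True when the whole name uses only the allowed characters."""
    return VARIABLE_NAME.fullmatch(name) is not None
```

For instance, start_date_2 passes the check, while start-date does not because of the hyphen.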

Super Actions

Super Actions are actions composed of other actions, and users can create them. Super Actions are organised into models. To create a Super Action, the user must create a sub-workflow, which can be done by clicking in the menu and selecting new -> sub-workflow. Once the sub-workflow is created, the user can drag as many SuperAction Input actions as needed into it. From there the user can add actions to build the sub-workflow and perform the desired process. Once the user is happy with the process, they can add as many SuperAction Outputs as needed. If the user is happy with the workflow and wants to make use of the sub-workflow, they have a few options.

The actions of a sub-workflow are abstract compared to those of a workflow. No paths are needed when creating a Super Action through a sub-workflow; actions are just abstract entities in this case. As long as the inputs and outputs are given appropriate names, all actions are valid. When configuring the SuperAction Input of a sub-workflow, the user has to name the fields of the input. The only requirement for an output is giving it a name. When a sub-workflow is finished and installed, the Super Action will require the inputs and produce the outputs declared.

The Super Action also supports variables. When a sub-workflow defines variables, the user can override their default values when configuring the Super Action.

SuperAction Generation

A SuperAction can be generated from a workflow. To do so, select the actions on the workflow and click the aggregate button in the canvas control buttons above the canvas, or use edit -> aggregate. A pop-up is then shown to configure the SuperAction. The configuration asks for the names of the inputs and the names of the outputs.

Expand

A SuperAction can be expanded within a workflow. Expanding means copying the logic of the Super Action into the current workflow. To expand a SuperAction, open the options menu that appears when you hover over the Super Action.

Scheduling

Red Sqirl comes with scheduling functionality. A canvas can describe what Oozie calls a "bundle". A bundle is a group of scheduled jobs that together form a data pipeline. For example, say you summarize your data daily and want to create a weekly report out of the summary. The bundle will be composed of a daily process feeding a weekly process, with each running independently. Visually, a bundle always shows a colour on the background.

Each independent process is called a coordinator, and a colour is assigned on the screen to every coordinator defined. Specific variables can be set and used in the Red Sqirl actions. A coordinator starts and ends with either a synchronous dataset, whose path contains a time stamp, or an asynchronous dataset, whose path is static. For an asynchronous dataset, you can proceed as for a normal workflow and then set the running time inside the coordinator itself. For a synchronous dataset, you can either use a synchronous source or combine a synchronous sink with a synchronous sink filter. For more details about how to set up a scheduled job, please review the building a workflow section.

Path Type

A scheduled workflow uses different path types:

Every time you use a synchronous source, a materialized path is defined. By using a synchronous sink, a template path is defined; a synchronous sink filter is therefore required to generate a materialized path and reuse the dataset in another coordinator.

Not all datasets can be templatized. As of today, only HDFS and HCatalog are supported.
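
To illustrate the difference between a template path and a materialized path, here is a minimal Python sketch. The placeholder names follow the convention Oozie uses in dataset URI templates (${YEAR}, ${MONTH}, ${DAY}, ${HOUR}, ${MINUTE}); the function materialize and the example path are hypothetical, and how Red Sqirl exposes templates in its own screens is not shown here.

```python
from datetime import datetime


def materialize(template, nominal_time):
    """Replace time placeholders in a template path with concrete values."""
    values = {
        "${YEAR}": f"{nominal_time.year:04d}",
        "${MONTH}": f"{nominal_time.month:02d}",
        "${DAY}": f"{nominal_time.day:02d}",
        "${HOUR}": f"{nominal_time.hour:02d}",
        "${MINUTE}": f"{nominal_time.minute:02d}",
    }
    path = template
    for placeholder, value in values.items():
        path = path.replace(placeholder, value)
    return path


# Hypothetical daily summary location, for illustration only.
template = "/data/summary/${YEAR}/${MONTH}/${DAY}"
print(materialize(template, datetime(2016, 3, 7)))  # /data/summary/2016/03/07
```

The template describes every daily instance at once, whereas the materialized path points at one instance that another coordinator can actually read.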

Execution Plan

When you hit the run button, Red Sqirl creates an execution plan, generates the corresponding Oozie XML files and scripts, then kicks off the Oozie job. The procedure follows the rules given here.

Fork and Join

Red Sqirl executes as many tasks as possible in parallel, and trusts the YARN queue to do the correct queuing if necessary.
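
One way to picture this is to group the actions of a workflow into stages, where every action in a stage has all of its inputs ready and can therefore run in parallel with the others. The sketch below is an assumption about the general idea, not Red Sqirl's actual planner; parallel_stages and the action names are hypothetical.

```python
def parallel_stages(dependencies):
    """Group actions into stages; actions within a stage can run in parallel.

    dependencies maps each action name to the set of actions it reads from.
    Every action must appear as a key, including those with no inputs.
    """
    remaining = {action: set(inputs) for action, inputs in dependencies.items()}
    done = set()
    stages = []
    while remaining:
        # An action is ready once all of its inputs have been produced.
        ready = sorted(a for a, inputs in remaining.items() if inputs <= done)
        if not ready:
            raise ValueError("cycle or unknown input in workflow")
        stages.append(ready)
        done.update(ready)
        for action in ready:
            del remaining[action]
    return stages


workflow = {
    "source": set(),
    "select_a": {"source"},
    "select_b": {"source"},
    "join": {"select_a", "select_b"},
}
print(parallel_stages(workflow))
# [['source'], ['select_a', 'select_b'], ['join']]
```

Here the two selects form a fork after the source and are joined again by the last action.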

Optimisation

Red Sqirl will always attempt to optimise the process. This means that if you have 1-to-1 daisy-chained actions, and those actions are compatible, they will be merged.

For example, if your workflow is composed of a Pig Select action linked solely to another Pig Select, both will be merged at run time to speed up execution.

Forks, joins and saved transitional datasets (a transitional dataset that is buffered or recorded) take precedence over the optimisation.
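
The merging rule can be sketched as follows. This is an illustration under stated assumptions, not Red Sqirl's implementation: two actions are treated as compatible when they share the same hypothetical "kind" value, and "saved" marks an output that is buffered or recorded.

```python
def merge_chains(actions, links):
    """Group 1-to-1 daisy-chained compatible actions into merged steps.

    actions maps a name to {"kind": str, "saved": bool};
    links is a list of (source_action, destination_action) pairs.
    """
    out_count, in_count = {}, {}
    for src, dst in links:
        out_count[src] = out_count.get(src, 0) + 1
        in_count[dst] = in_count.get(dst, 0) + 1

    def mergeable(src, dst):
        return (
            out_count[src] == 1                 # no fork after src
            and in_count[dst] == 1              # no join before dst
            and actions[src]["kind"] == actions[dst]["kind"]
            and not actions[src]["saved"]       # saved output blocks merging
        )

    next_of = {src: dst for src, dst in links if mergeable(src, dst)}
    merged_targets = set(next_of.values())
    groups = []
    for name in actions:
        if name in merged_targets:
            continue  # already part of a chain started upstream
        chain = [name]
        while chain[-1] in next_of:
            chain.append(next_of[chain[-1]])
        groups.append(chain)
    return groups


actions = {
    "source": {"kind": "source", "saved": True},
    "select_1": {"kind": "pig", "saved": False},
    "select_2": {"kind": "pig", "saved": False},
}
links = [("source", "select_1"), ("select_1", "select_2")]
print(merge_chains(actions, links))
# [['source'], ['select_1', 'select_2']]
```

If the output of select_1 were marked as saved, the two selects would stay separate, which reflects the precedence rule above.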

Workflow vs Schedule

When running a workflow, Red Sqirl always looks at what data is available and runs the requested output from the closest point. In schedule mode, only the entire dataset can run, and it is not possible to select a given coordinator to run.
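
The "closest point" behaviour in workflow mode can be pictured as a walk upstream from the requested action that stops wherever data already exists. The sketch below is only an interpretation of that sentence; actions_to_run, its parameters and the action names are hypothetical.

```python
def actions_to_run(target, inputs_of, has_data):
    """Return the actions to execute, in order, to produce target's output.

    inputs_of maps an action to the list of actions it reads from;
    has_data is the set of actions whose output already exists.
    """
    to_run = []
    seen = set()

    def visit(action):
        if action in seen or action in has_data:
            return  # nothing to do: already planned or data is available
        seen.add(action)
        for upstream in inputs_of.get(action, ()):
            visit(upstream)
        to_run.append(action)  # appended after its inputs: execution order

    visit(target)
    return to_run


inputs_of = {
    "report": ["join"],
    "join": ["select_a", "select_b"],
    "select_a": ["source"],
    "select_b": ["source"],
}
print(actions_to_run("report", inputs_of, has_data={"source", "select_a"}))
# ['select_b', 'join', 'report']
```

In this sketch a target whose output already exists yields an empty plan; whether Red Sqirl reruns such an action is not covered by the text above.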

Action Context Menu

Create link

If you click on Create link, an arrow is generated from the selected object.

Rename object

If you click on Rename object, you will be shown a pop-up used to configure the name and the comments of the selected object.

Configure

If you click on Configure, you will be shown a pop-up used to configure the selected object.

Data output

If you click on Data output, you will be shown a pop-up used to configure the data output of the selected object.

Clean the Data

If you click on Clean data, the data output of this object is cleaned. Temporary and buffered data will be erased, but recorded data will not.

Oozie Action Logs

If you click on Oozie Action Logs, a new URL is opened showing the Oozie logs.

Edit Super Action

If you click on Edit Super Action, a new tab opens with the selected Super Action ready to edit. The user can only edit it if they have permission to do so.

Expand

If you click on Expand, the super action will be replaced with its content.

Refresh Super Action

If you right-click on a Super Action and click on Refresh Super Action, the canvas refreshes the Super Action and shows or removes errors.

Regular expressions

Regular expressions are used a lot in Red Sqirl, and users may come across them from time to time. A regular expression validates an entry that a user typed.

Construct Matches
 
Examples
^.*[A-Z].*$ Contains an upper case character
[a-z]([a-z0-9_]*) Starts with a lower case character, optionally followed by other lower case characters, underscores or numbers
^(#\d{1,3}|.)?$ Empty, '#' followed by 1 to 3 digits, or any single character
 
Characters
x The character x
\\ The backslash character
\t The tab character ('\u0009')
\n The newline (line feed) character ('\u000A')
\r The carriage-return character ('\u000D')
 
Character classes
[abc] a, b, or c (simple class)
[^abc] Any character except a, b, or c (negation)
[a-zA-Z] a through z or A through Z, inclusive (range)
[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] d, e, or f (intersection)
[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] a through z, and not m through p: [a-lq-z] (subtraction)
 
Predefined character classes
. Any character
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
 
Boundary matchers
^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\G The end of the previous match
\Z The end of the input but for the final terminator, if any
\z The end of the input
 
Greedy quantifiers
X? X, once or not at all
X* X, zero or more times
X+ X, one or more times
X{n} X, exactly n times
X{n,} X, at least n times
X{n,m} X, at least n but not more than m times
 
Reluctant quantifiers
X?? X, once or not at all
X*? X, zero or more times
X+? X, one or more times
X{n}? X, exactly n times
X{n,}? X, at least n times
X{n,m}? X, at least n but not more than m times
 
Possessive quantifiers
X?+ X, once or not at all
X*+ X, zero or more times
X++ X, one or more times
X{n}+ X, exactly n times
X{n,}+ X, at least n times
X{n,m}+ X, at least n but not more than m times
 
Logical operators
XY X followed by Y
X|Y Either X or Y
(X) X, as a capturing group
 
Quotation
\ Nothing, but quotes the following character
\Q Nothing, but quotes all characters until \E
\E Nothing, but ends quoting started by \Q
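
To try out the three example patterns from the table, a short Python sketch follows. The constructs in this table follow Java's regular expression syntax; the three examples only use constructs that Python's re module shares, but others in the table (such as character class intersection with &&) are not available in Python. The patterns are written as raw strings with \d for the digit class, and the helper validate is hypothetical. Requiring the whole entry to match is an assumption about how the validation is applied.

```python
import re

# The three example patterns, written as Python raw strings.
CONTAINS_UPPER = re.compile(r"^.*[A-Z].*$")
IDENTIFIER = re.compile(r"[a-z]([a-z0-9_]*)")
DIGITS_OR_CHAR = re.compile(r"^(#\d{1,3}|.)?$")


def validate(pattern, entry):
    """Return True when the whole entry matches the pattern."""
    return pattern.fullmatch(entry) is not None


print(validate(CONTAINS_UPPER, "redSqirl"))   # True
print(validate(IDENTIFIER, "1field"))         # False
print(validate(DIGITS_OR_CHAR, "#12"))        # True
```

The second call fails because the entry starts with a digit, while the pattern requires a lower case letter first.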
