4.1 What is Pig
Yahoo developed Pig and is one of the heaviest users of Hadoop, running 40 percent of all its Hadoop jobs with Pig. Twitter is another well-known user of Pig.
Pig is a high-level, MapReduce-based programming platform built on top of Hadoop. Its language is commonly known as Pig Latin. Pig Latin serves the same purpose as Java for implementing MapReduce; the difference is that it is a high-level language. Pig is a good starting point for beginners to write programs and familiarize themselves with the Hadoop ecosystem. Pig provides its own set of functions for programmers to use as a toolbox. On top of these, however, they can build their own UDFs (User Defined Functions). UDFs can be written in Java, Python, Ruby, and other languages, and then called directly from Pig Latin.
It is not an easy task to implement MapReduce programs in any language. Until now we have always thought in a procedural manner when accomplishing a task: we start a program by defining some variables, then use functions, arrays, loops, and conditional statements to write it. With Hadoop, however, we need to change our mindset, because the program flow must follow the MapReduce programming model: we need to think at the level of the mapper and reducer functions. Pig simplifies Hadoop programming by providing a high-level data-processing language while keeping Hadoop's scalability and reliability. Pig has two major components –
-A high-level data processing language known as Pig Latin
-A compiler that compiles and runs Pig Latin script
Pig Latin makes it far easier and more comfortable to build programs, thanks to its powerful tooling and internal architecture. In particular, its compiler applies optimizations automatically, which lowers the burden on the programmer of having to hand-tune programs.
Here are the major points that differentiate Pig from SQL:
-Pig makes use of lazy evaluation
-Pig is geared toward ETL (extract, transform, load) workloads
-Pig supports splitting a pipeline into sub-streams
-Execution plans can be declared explicitly
Pig Latin is procedural in nature and follows the pipeline paradigm, while SQL is declarative. In SQL, to get data from two or more different tables into a single result, the tables must be joined in one way or another, while in Pig you have several ways to combine data from two or more tables. Writing Pig Latin is very much like specifying a query execution plan; this makes it easy for the programmer to explicitly control the flow of the data-processing task. SQL handles trees naturally, but has no built-in mechanism for splitting a data-processing stream and applying different operators to each sub-stream. A Pig Latin script describes a directed acyclic graph rather than a linear pipeline. Pig Latin's ability to include user code at any point in the pipeline is also useful for pipeline development.
4.2 Pig Installation And Configuration
Pig installation is along the same lines as that of Hadoop. Download the latest release of Pig from the Apache website, copy it into a folder, and untar the file using the tar command; this takes care of the Pig installation itself. Then do the following settings to make it work –
- Open $HOME/.bashrc (for example with vi $HOME/.bashrc) to set the PIG_HOME environment variable
- export PIG_HOME=/<<path>>/<<name of the pig unzipped pig folder>>
- export PATH=$PATH:$PIG_HOME/bin
- Run bash to reload the shell settings
- $ cd /<<path>>/<<name of the pig unzipped pig folder>>
- Set the value of exectype to local in the file pig.properties, which is present in the conf directory
Start Pig using the command $ bin/pig -x local and you will see the Grunt shell prompt, which looks like grunt>
Run the help command to make sure Pig is working
grunt> help
Commands:
<pig latin statement>; - See the PigLatin manual for details: http://hadoop.apache.org/pig
File system commands:
fs <fs arguments> - Equivalent to the Hadoop dfs command: http://hadoop.apache.org/common/docs/current/hdfs_shell.html
Diagnostic commands:
describe <alias>[::<alias>] - Show the schema for the alias. Inner aliases can be described as A::B.
explain [-script <pigscript>] [-out <path>] [-brief] [-dot] [-param <param_name>=<param_value>]
[-param_file <file_name>] [<alias>] - Show the execution plan to compute the alias or for the entire script.
-script - Explain the entire script.
-out - Store the output into directory rather than print to stdout.
-brief - Don't expand nested plans (presenting a smaller graph for overview).
-dot - Generate the output in .dot format. Default is text format.
-param <param_name> - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
alias - Alias to explain.
dump <alias> - Compute the alias and write the results to stdout.
Utility Commands:
exec [-param <param_name>=<param_value>] [-param_file <file_name>] <script> -
Execute the script with access to grunt environment including aliases.
-param <param_name> - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
script - Script to be executed.
run [-param <param_name>=<param_value>] [-param_file <file_name>] <script> -
Execute the script with access to grunt environment.
-param <param_name> - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
script - Script to be executed.
sh <shell command> - Invoke a shell command.
kill <job_id> - Kill the hadoop job specified by the hadoop job id.
set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.
The following keys are supported:
default_parallel - Script-level reduce parallelism. Basic input size heuristics used by default.
debug - Set debug on or off. Default is off.
job.name - Single-quoted name for jobs. Default is PigLatin:<script name>
job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal
stream.skippath - String that contains the path. This is used by streaming.
any hadoop property.
help - Display this message.
quit - Quit the grunt shell.
4.3 Writing Simple to Complex Pig Programs
The language of Pig is also known as Pig Latin. First we will visit the data types:
Simple Data Type
Type | Description |
int | Signed 32-bit integer |
long | Signed 64-bit integer |
float | 32-bit floating point |
double | 64-bit floating point |
bytearray | Binary object |
chararray | String |
Complex Data Type
Type | Description |
Tuple | An ordered set of fields forming a row is known as a Tuple, represented by fields separated by commas inside parentheses (Apple, 15, 1000) |
Bag | A Bag is a collection of Tuples; it resembles a table in SQL. Tuples can be duplicated in a Bag, and two Tuples in a Bag may or may not have the same number of fields. A Bag is represented by curly braces containing tuples {(Apple, 15, 1000),(Banana, 20, 1500)} |
Map | A Map is a set of key/value pairs. In a Map the Key must be unique and of type chararray, while the Value can be of any type [Key#Value] |
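To make the three complex types concrete, here is a small Python sketch (Python standing in for Pig Latin here; all names are made up for illustration) that models a Tuple as a Python tuple, a Bag as a list of tuples, and a Map as a dict with string keys:

```python
# A Pig Tuple: an ordered row of fields.
row = ("Apple", 15, 1000)

# A Pig Bag: a collection of tuples. Duplicates are allowed,
# and tuples need not have the same number of fields.
bag = [("Apple", 15, 1000), ("Banana", 20, 1500),
       ("Banana", 20, 1500), ("Mango", 7)]

# A Pig Map: key/value pairs where the key must be a unique chararray.
fruit_map = {"Apple": 15, "Banana": 20}

print(len(bag))            # 4 tuples in the bag (duplicates counted)
print(fruit_map["Apple"])  # 15
```

Note how the bag happily holds two identical Banana tuples and a two-field Mango tuple alongside three-field tuples, which a rigid SQL table would not allow.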
Let us perform some programming operations. We have created a space-delimited file containing the columns Emp_Name, Emp_Id, and Emp_Age. The records in the file look like –
Employee Table
Emp_Name | Emp_Id | Emp_Age |
John | 10011 | 32 |
Johnathan | 10012 | 28 |
Sally | 10013 | 35 |
Salvia | 10014 | 26 |
Case Sensitivity in Pig
Here is the distinction between case-sensitive and case-insensitive instructions in Pig –
Case Sensitive –
- Name of Relations
- Name of Fields
- Name of Pig Latin Functions (PigStorage, COUNT)
Case Insensitive –
- Name of Parameters
- Pig Latin Keywords (LOAD, USING, AS, GROUP, BY, FOREACH, GENERATE, DUMP)
Rules for Identifiers
Identifiers include the following:
- Name of relations (aliases)
- Name of fields
- Name of variables
An identifier should start with a letter, which can be followed by any number of letters, digits, or underscores.
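The identifier rule above can be captured in a one-line regular expression. Here is a Python sketch (a stand-in for illustration, not part of Pig itself) that validates names against it:

```python
import re

# Pig identifier rule: a letter, followed by any number of
# letters, digits, or underscores.
IDENT = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

def is_valid_identifier(name):
    return bool(IDENT.match(name))

print(is_valid_identifier("Employee_2"))  # True
print(is_valid_identifier("2Employee"))   # False: starts with a digit
print(is_valid_identifier("Emp-Age"))     # False: hyphen not allowed
```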
LOAD
The first task is to read this file into Pig; for this purpose we use the LOAD command
grunt> Employee = LOAD '/SOFT/PigExercise/EmployeeData' USING PigStorage(' ') as (Emp_Name:chararray, Emp_Id:int, Emp_Age:int);
After executing this query, an Employee relation is created. Note that our input file EmployeeData did not have an extension such as .txt, hence we simply wrote the file name without an extension (Pig accepts it). We passed a single space as the parameter to the PigStorage() function because our input file is delimited by spaces.
PigStorage loads and stores data as structured text files. Note that the default field delimiter for PigStorage is the tab character ('\t'). We can specify other characters as field delimiters by enclosing them in single quotes.
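Conceptually, PigStorage(' ') splits each line of the file on the delimiter and casts the fields to the declared schema. A rough Python sketch of that per-record behavior (the function name is ours, for illustration only):

```python
# Rough stand-in for what PigStorage(' ') does with each record,
# given the schema (Emp_Name:chararray, Emp_Id:int, Emp_Age:int).
def parse_record(line, delimiter=" "):
    name, emp_id, age = line.rstrip("\n").split(delimiter)
    return (name, int(emp_id), int(age))  # cast to the declared types

print(parse_record("John 10011 32"))  # ('John', 10011, 32)
```

With a tab-delimited file you would simply pass delimiter="\t", mirroring PigStorage's default.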
DUMP
To check the relation's content in Pig, the DUMP command is used
grunt> DUMP Employee;
It is showing the output as
(John,10011,32)
(Johnathan,10012,28)
(Sally,10013,35)
(Salvia,10014,26)
This confirms that the Employee relation was created. We had 4 records in the file and all 4 are displayed here. Sometimes we might have thousands of records in the input file; after loading them into a relation, running DUMP to check whether the load succeeded would display all those thousands of records. To restrict the number of records displayed on the screen we can use the LIMIT command.
LIMIT
In the example above if we want to display only 3 records out of 4 then we can do it as
grunt> Lmt = LIMIT Employee 3;
grunt> DUMP Lmt;
It is showing the output as
(John,10011,32)
(Sally,10013,35)
(Johnathan,10012,28)
The LIMIT command limits the number of tuples in the output to a specified number. When it is applied right after a relation that was processed by an ORDER BY operator, LIMIT returns the first n tuples; otherwise there is no guarantee which tuples are returned. LIMIT performs no sorting of its own. Note that after executing LIMIT in our example above, the records for John, Johnathan, and Sally (the first 3 records) are not returned in order; instead the records appear in the order John, Sally, Johnathan.
STORE
The next step is to save the Employee relation; for this purpose we use the STORE command
grunt> STORE Employee INTO '/SOFT/PigOutput';
This statement stores the Employee relation in the directory /SOFT/PigOutput. Make sure the folder PigOutput is not already present before running the above statement; otherwise an error is prompted such as
2014-02-19 13:08:20,731 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2014-02-19 13:08:20,736 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6000:
<line 4, column 0> Output Location Validation Failed for: '/SOFT/PigOutput More info to follow:
Output directory /SOFT/PigOutput already exists
DESCRIBE
When you want to check the column names and their types, use the DESCRIBE command. Executing the statement for the Employee relation
grunt> DESCRIBE Employee;
Output received as
Employee: {Emp_Name: chararray,Emp_Id: int,Emp_Age: int}
Like DESCRIBE we can use EXPLAIN and ILLUSTRATE as diagnostic operators in Pig language.
EXPLAIN – It displays the execution plan used to compute a relation.
ILLUSTRATE – It displays step by step how data is transformed, starting with the LOAD command and continuing until the data reaches the final resulting relation. In order to keep the display and processing manageable, only a sample of the input data is used to simulate the execution.
FOREACH GENERATE
Suppose we want to select a few columns from an existing relation; we can use FOREACH and GENERATE to accomplish this. We already have an Employee relation, and to select only the Name and Age columns we can write something like
grunt> Emp_Subset = FOREACH Employee GENERATE Emp_Name, Emp_Age;
grunt> DESCRIBE Emp_Subset;
The output of the above statement shows only the Name and Age column information.
Emp_Subset: {Emp_Name: chararray,Emp_Age: int}
In the statement “FOREACH Employee GENERATE Emp_Name, Emp_Age“, Emp_Name and Emp_Age can be replaced by $0 and $2, hence the new statement can be “FOREACH Employee GENERATE $0, $2“, as Emp_Name and Emp_Age are the first and third columns in the original Employee relation. The rule is that column1 is represented by $0, column2 by $1, column3 by $2, column4 by $3, and so on.
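The positional rule is just zero-based indexing over the tuple's fields. A quick Python sketch (Python stands in for Pig Latin; the data is the Employee row from above):

```python
# One Employee tuple: (Emp_Name, Emp_Id, Emp_Age).
employee = ("John", 10011, 32)

# FOREACH Employee GENERATE $0, $2 corresponds to picking
# indexes 0 and 2: column1 is $0, column2 is $1, column3 is $2.
projected = (employee[0], employee[2])
print(projected)  # ('John', 32)
```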
AVG
The AVG function calculates the average of the numeric values in a single-column bag.
Suppose we have the following data in table FruitAvg, where column1 and column2 are FruitName and FruitCount respectively.
Apple 16
Banana 20
Orange 22
Grapes 12
Fruit Fruit
Apple 10
Banana 16
Orange 14
Grapes 12
Note that row 5 in the above table is 'Fruit Fruit' (here the FruitCount column was assigned the word ‘Fruit’ instead of a FruitCount value). We have taken this example intentionally, just to show the power of Pig: in the output this row is assigned null as its average, since Pig performs the calculation only over numeric data. Our task is to identify the average for each fruit in the table. First we group on the FruitName column in table FruitAvg and form a new relation FruitAvgGrp using the query
grunt> FruitAvgGrp = GROUP FruitAvg BY FruitName;
Now, to generate the average of column FruitCount, we use the following query, which applies the AVG function and stores the output in a new relation FruitAvgCalc.
grunt> FruitAvgCalc = FOREACH FruitAvgGrp GENERATE FruitAvg.FruitName AS FruitName, AVG(FruitAvg.FruitCount) AS FruitCount;
Dump FruitAvgCalc to check what data got stored in it
grunt> DUMP FruitAvgCalc;
({(Apple),(Apple)},13.0)
({(Fruit)},)
({(Banana),(Banana)},18.0)
({(Grapes),(Grapes)},12.0)
({(Orange),(Orange)},18.0)
Along with the average, the output also shows the bag of FruitName instances that contributed to each average; in our case each fruit appeared twice.
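The GROUP-then-AVG behavior above, including the null result for the 'Fruit Fruit' row, can be sketched in Python (a stand-in for the Pig execution, using the same sample data):

```python
from collections import defaultdict

# The FruitAvg rows; row 5's count is the word 'Fruit', not a number.
rows = [("Apple", 16), ("Banana", 20), ("Orange", 22), ("Grapes", 12),
        ("Fruit", "Fruit"), ("Apple", 10), ("Banana", 16),
        ("Orange", 14), ("Grapes", 12)]

# GROUP FruitAvg BY FruitName: non-numeric counts become null (None).
groups = defaultdict(list)
for name, count in rows:
    groups[name].append(count if isinstance(count, int) else None)

# AVG(FruitAvg.FruitCount): average only the non-null, numeric values.
averages = {}
for name, counts in groups.items():
    valid = [c for c in counts if c is not None]
    averages[name] = sum(valid) / len(valid) if valid else None

print(averages["Apple"])  # 13.0
print(averages["Fruit"])  # None – no numeric values to average
```

The results match the DUMP output above: 13.0 for Apple, 18.0 for Banana and Orange, 12.0 for Grapes, and an empty (null) average for the Fruit row.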
COUNT
COUNT calculates the number of tuples in a bag.
For this example we make a slight modification to the FruitAvg table, removing row 5 that contains the ‘Fruit Fruit’ data. We have already grouped on FruitName in FruitAvgGrp; now, to generate the count, we use the query
grunt> FruitAvgCalc = FOREACH FruitAvgGrp GENERATE FruitAvg.FruitName AS FruitName, COUNT(FruitAvg.FruitCount) AS FruitCount;
Dumping to check the output
({(Apple),(Apple)},2)
({(Banana),(Banana)},2)
({(Grapes),(Grapes)},2)
({(Orange),(Orange)},2)
Note that while counting tuples, the COUNT function ignores null values and counts only the non-null values.
MAX
Calculate the maximum value in a bag.
In the table FruitAvg we make some modifications so that it looks as follows
Apple 16
Banana 20
Orange 22
Grapes 12
Apple 10
Banana 16
Orange 14
Grapes 12
Grapes 32
Watermelon 8
We have already grouped on FruitName in FruitAvgGrp; now, to generate the maximum, we use the query
grunt> FruitAvgCalc = FOREACH FruitAvgGrp GENERATE FruitAvg.FruitName AS FruitName, MAX(FruitAvg.FruitCount) AS FruitCount;
Dumping to check the output
({(Apple),(Apple)},16)
({(Banana),(Banana)},20)
({(Grapes),(Grapes),(Grapes)},32)
({(Orange),(Orange)},22)
({(Watermelon)},8)
MIN
Calculate the minimum value in a bag.
Referencing the MAX example above, we replace MAX with MIN and see what output it generates
grunt> FruitAvgCalc = FOREACH FruitAvgGrp GENERATE FruitAvg.FruitName AS FruitName, MIN(FruitAvg.FruitCount) AS FruitCount;
Dumping to check the output
({(Apple),(Apple)},10)
({(Banana),(Banana)},16)
({(Grapes),(Grapes),(Grapes)},12)
({(Orange),(Orange)},14)
({(Watermelon)},8)
SUM
Calculate the sum of numeric values in a bag.
Referencing the MAX example above, we replace MAX with SUM and see what output it generates
grunt> FruitAvgCalc = FOREACH FruitAvgGrp GENERATE FruitAvg.FruitName AS FruitName, SUM(FruitAvg.FruitCount) AS FruitCount;
Dumping to check the output
({(Apple),(Apple)},26)
({(Banana),(Banana)},36)
({(Grapes),(Grapes),(Grapes)},56)
({(Orange),(Orange)},36)
({(Watermelon)},8)
CONCAT
Concatenate two expressions of type chararray or bytearray.
We have the following two columns, A and B, stored in a table InputSample
Apache | Pig |
Apache | Hive |
Apache | HBase |
Apache | Sqoop |
Apache | ZooKeeper |
The following query concatenates the two columns using the CONCAT function
grunt> CONCATAandB = FOREACH InputSample GENERATE A, B, CONCAT(A,B) AS C;
Let us see what we get if we DESCRIBE the relation CONCATAandB
grunt> DESCRIBE CONCATAandB;
CONCATAandB: {A: chararray,B: chararray,C: chararray}
A new column C was created with the same data type, chararray.
Dumping to check the output
grunt> DUMP CONCATAandB;
(Apache,Pig,ApachePig)
(Apache,Hive,ApacheHive)
(Apache,HBase,ApacheHBase)
(Apache,Sqoop,ApacheSqoop)
(Apache,ZooKeeper,ApacheZooKeeper)
Both the first and second columns got concatenated into the third column.
COUNT_STAR
The COUNT_STAR function is used to calculate the number of elements in a bag. The following is required for COUNT_STAR to work –
-A preceding GROUP ALL statement for global counts
-A GROUP BY statement for group counts
COUNT_STAR includes NULL values in its computation; this is unlike the COUNT function, which ignores NULL values.
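The difference between COUNT and COUNT_STAR comes down to how nulls are treated, which a short Python sketch makes plain (Python stands in for Pig; None plays the role of Pig's null):

```python
# A single-column bag with two null entries.
bag = [16, None, 10, None]

count = sum(1 for v in bag if v is not None)  # COUNT semantics: skip nulls
count_star = len(bag)                         # COUNT_STAR semantics: include nulls

print(count)       # 2
print(count_star)  # 4
```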
DIFF
The DIFF function compares two bags and returns the difference in tuples between them. We pass the bags as arguments and DIFF compares them: any tuples that are in one bag but not in the other are returned in a bag. When all the tuples of both bags match, an empty bag is returned.
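That symmetric-difference behavior can be sketched in Python (a stand-in for Pig's DIFF; the sample bags are made up for illustration):

```python
# DIFF(bag1, bag2): tuples present in one bag but not in the other.
def diff(bag1, bag2):
    only_in_1 = [t for t in bag1 if t not in bag2]
    only_in_2 = [t for t in bag2 if t not in bag1]
    return only_in_1 + only_in_2

bag1 = [("Apple", 16), ("Banana", 20)]
bag2 = [("Apple", 16), ("Orange", 22)]

print(diff(bag1, bag2))        # [('Banana', 20), ('Orange', 22)]
print(diff(bag1, list(bag1)))  # [] – identical bags give an empty bag
```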
SIZE
It calculates the size of the data it is given; note that for a bag it counts the number of tuples, while for a tuple it counts the number of fields. Referencing the table InputSample used in the CONCAT example above, we apply the SIZE function to column B. The query goes as follows
grunt> ExampleSIZE = FOREACH InputSample GENERATE B, SIZE(B) AS BSIZE;
Dumping to check the output
(Pig,3)
(Hive,4)
(HBase,5)
(Sqoop,5)
(ZooKeeper,9)
As is clear from the output, when working with a chararray the SIZE function returns the number of characters in the data. For the other data types it works as follows:
-For int it returns 1
-For long it returns 1
-For float it returns 1
-For double it returns 1
-We have already seen for chararray
-For bytearray it returns the number of bytes in the array
-For tuple it returns the number of fields in the tuple
-For bag it returns number of tuple in that bag
-For map it returns the number of key value pair in that map
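The per-type rules above can be condensed into one small Python function (a sketch of SIZE's semantics, not Pig code; Python's built-in types stand in for Pig's):

```python
# Sketch of SIZE semantics for the types listed above.
def pig_size(value):
    if isinstance(value, bool):
        raise TypeError("no boolean in this sketch")
    if isinstance(value, (int, float)):
        return 1                 # int/long/float/double -> always 1
    if isinstance(value, (str, bytes)):
        return len(value)        # chararray: characters; bytearray: bytes
    if isinstance(value, (tuple, list, dict)):
        return len(value)        # tuple: fields; bag: tuples; map: pairs
    raise TypeError("unsupported type")

print(pig_size(42))                 # 1
print(pig_size("ZooKeeper"))        # 9
print(pig_size(("Apache", "Pig")))  # 2 fields in the tuple
```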
TOKENIZE
TOKENIZE splits a string of type chararray into a bag of words, where each word is a tuple in the bag. Word separators can be a space, double quote ("), comma, parentheses, or asterisk (*).
We have a tuple with the data “My First Pig Program” in a table named String. To load this entire string into a single field we cannot use space as the separator, since the string itself contains embedded spaces, so we use the newline character '\n' as the separator. The query goes as follows
grunt> String = LOAD '/SOFT/PigExercise/String' USING PigStorage('\n') as (A:chararray);
Let us use TOKENIZE function to tokenize this string.
grunt> TOKEN = FOREACH String GENERATE A, TOKENIZE(A);
Dumping to check the output
(My First Pig Program,{(My),(First),(Pig),(Program)})
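A rough Python equivalent of TOKENIZE splits on the same separator characters (this regex-based sketch is ours, approximating the behavior described above):

```python
import re

# Split a chararray into words on TOKENIZE's separators:
# whitespace, double quote, comma, parentheses, asterisk.
def tokenize(s):
    return [word for word in re.split(r'[\s",()*]+', s) if word]

print(tokenize("My First Pig Program"))
# ['My', 'First', 'Pig', 'Program'] – one "tuple" per word in the bag
```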
IsEmpty
IsEmpty checks whether a bag or map is empty, that is, whether it has no data. This function is normally used to filter data.
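A typical use is filtering out rows whose bag came back empty, for example after a grouping step. A Python sketch of that filter (the Kiwi row is invented to give an empty bag):

```python
# Each row pairs a name with a bag of matching tuples; Kiwi's bag is empty.
rows = [("Apple", [("Apple", 16), ("Apple", 10)]),
        ("Kiwi", [])]

# FILTER rows BY NOT IsEmpty(bag), sketched in Python:
non_empty = [(name, bag) for name, bag in rows if len(bag) > 0]
print(non_empty)  # only the Apple row survives
```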
Math Functions (Alphabetically Sorted)
ABS | ABS function is used to get the absolute value of an expression: if the result of the expression is positive it is returned as is; if it is negative, its negation is returned. |
ACOS | ACOS is a mathematical trigonometric function that is used to return the arc cosine of an expression. |
ASIN | ASIN is a mathematical trigonometric function that is used to return the arc sine of an expression. |
ATAN | ATAN is a mathematical trigonometric function that is used to return the arc tangent of an expression. |
CBRT | CBRT function returns the cube root from an expression. |
CEIL | CEIL function is used to round an expression up to the nearest integer; CEIL will never lower the result. Let us look at several possibilities and outcomes after using CEIL – If a = 2.3 then CEIL(a) = 3 If a = 7.5 then CEIL(a) = 8 If a = 4.7 then CEIL(a) = 5 If a = 2.0 then CEIL(a) = 2 If a = -2.3 then CEIL(a) = -2 If a = -7.5 then CEIL(a) = -7 If a = -4.7 then CEIL(a) = -4 |
COS | COS function is used to return the trigonometric cosine from an expression. |
COSH | COSH function is used to return the hyperbolic cosine from an expression. |
EXP | EXP function is used to calculate the exponential value. EXP(x) will calculate the Euler’s number e raised to the power x. |
FLOOR | FLOOR function is used to round an expression down to the nearest integer; FLOOR will never raise the result. Let us look at several possibilities and outcomes after using FLOOR – If a = 2.3 then FLOOR(a) = 2 If a = 7.5 then FLOOR(a) = 7 If a = 4.7 then FLOOR(a) = 4 If a = 2.0 then FLOOR(a) = 2 If a = -2.3 then FLOOR(a) = -3 If a = -7.5 then FLOOR(a) = -8 If a = -4.7 then FLOOR(a) = -5 |
LOG | LOG function is used to return the natural logarithm from an expression. |
LOG10 | LOG10 function is used to return the base 10 logarithm from an expression. |
RANDOM | RANDOM function is used to return a pseudo random number of data type double that is greater than or equal to 0.0 and less than 1.0. It is basically used to generate the random number. |
ROUND | ROUND function is used to round an expression to the nearest integer. Let us look at several possibilities and outcomes after using ROUND – If a = 2.3 then ROUND(a) = 2 If a = 7.5 then ROUND(a) = 8 If a = 4.7 then ROUND(a) = 5 If a = 2.0 then ROUND(a) = 2 If a = -2.3 then ROUND(a) = -2 If a = -7.5 then ROUND(a) = -7 If a = -4.7 then ROUND(a) = -5 |
SIN | SIN function is used to return the sine from an expression. |
SINH | SINH function is used to return the hyperbolic sine from an expression. |
SQRT | SQRT function is used to calculate the positive square root from an expression. |
TAN | TAN function is used to return the trigonometric tangent from an angle. |
TANH | TANH function is used to return the hyperbolic tangent from an expression. |
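The CEIL, FLOOR, and ROUND sample values in the table above can be checked in Python. Note that the ROUND results listed match round-half-up behavior (as in Java's Math.round, i.e. floor(x + 0.5)), not Python's built-in round, which rounds halves to the nearest even number:

```python
import math

# Pig-style ROUND: round half up, as in Java's Math.round.
def pig_round(x):
    return math.floor(x + 0.5)

print(math.ceil(-7.5))   # -7: CEIL never lowers the result
print(math.floor(-7.5))  # -8: FLOOR never raises the result
print(pig_round(7.5))    # 8
print(pig_round(-7.5))   # -7 (Python's built-in round(-7.5) gives -8)
```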
Date Time Functions (Alphabetically Sorted)
AddDuration | AddDuration function is used to create a new datetime object by adding a duration to a given datetime object. |
CurrentTime | CurrentTime function is used to create a datetime object for the current time, with millisecond accuracy. |
DaysBetween | DaysBetween function is used to get the number of days between two datetime objects. |
GetDay | GetDay function is used to extract the day of the month from a datetime object. |
GetHour | GetHour function is used to extract the hour of the day from a datetime object. |
GetMilliSecond | GetMilliSecond function is used to retrieve the millisecond of the second from a datetime object. |
GetMinute | GetMinute function is used to retrieve the minute of the hour from a datetime object. |
GetMonth | GetMonth function is used to retrieve the month of the year from a datetime object. |
GetSecond | GetSecond function is used to retrieve the second of the minute from a datetime object. |
GetWeek | GetWeek function is used to retrieve the week of the year from a datetime object. |
GetYear | GetYear function is used to get the year from a datetime object. |
HoursBetween | HoursBetween function is used to retrieve the number of hours between two given datetime objects. |
MilliSecondsBetween | MilliSecondsBetween function is used to extract the number of milliseconds between two datetime objects. |
MinutesBetween | MinutesBetween function is used to extract the number of minutes between two datetime objects. |
MonthsBetween | MonthsBetween function is used to extract the number of months between two datetime objects. |
SecondsBetween | SecondsBetween function is used to retrieve the number of seconds between two datetime objects. |
SubtractDuration | SubtractDuration function is used to create a new datetime object by subtracting a duration from a datetime object. |
ToDate | ToDate function is used to create a DateTime object. This function uses the default time zone when no time zone is specified. |
ToMilliSeconds | ToMilliSeconds function converts a DateTime to the number of milliseconds since January 1, 1970 00:00:00.000 GMT. |
ToString | ToString function converts a DateTime to its string representation. |
ToUnixTime | ToUnixTime function is used to convert a DateTime to Unix time. |
WeeksBetween | WeeksBetween function is used to extract the number of weeks between two datetime objects. |
YearsBetween | YearsBetween function is used to extract the number of years between two datetime objects. |
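To give a feel for what a few of these functions compute, here is a Python sketch using the standard datetime module (Python stands in for Pig Latin; the sample date is arbitrary):

```python
from datetime import datetime, timedelta

dt = datetime(2014, 2, 19, 13, 8, 20)

# GetDay / GetHour / GetMonth style extraction:
print(dt.day)    # 19
print(dt.hour)   # 13
print(dt.month)  # 2

# AddDuration style: a new datetime from an existing one plus a duration.
later = dt + timedelta(days=2)

# DaysBetween style: whole days between two datetime objects.
print((later - dt).days)  # 2

# ToUnixTime style: seconds since January 1, 1970 00:00:00 GMT.
epoch = datetime(1970, 1, 1)
unix_seconds = int((dt - epoch).total_seconds())
```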