- By default, Pig works in mapreduce mode.
- Apache Pig is an abstraction over mapreduce.
- It is a tool used to analyze large data sets, representing them as data flows.
- It is used with Hadoop (it works at the top layer of Hadoop). We can perform all data
- manipulation in Hadoop using Apache Pig (an alternative to Java mapreduce).
- To write data analysis programs, Pig provides a high-level language - Pig Latin. In Pig, we
- write programs in the Pig Latin language.
- Why do we need Apache Pig?
- - it can perform mapreduce without writing Java code
- - it uses a multi-query approach: reduces the length of code
- - Pig Latin is an SQL-like language
- - it provides many built-in operators to support data operations like GROUP BY, JOIN, etc.
- - it provides nested data types like tuples, bags and maps
- Difference between Apache Pig and mapreduce:
- - Apache Pig is a data flow language, whereas mapreduce is a data processing paradigm
- - Apache Pig is a high-level language, whereas mapreduce is low level and rigid
- - the join operation is simple in Pig, whereas in mapreduce it is difficult to implement
- - Apache Pig can be used with a basic knowledge of queries, whereas (plain) mapreduce requires
- expertise in core Java
- - Apache Pig uses a multi-query approach, whereas (plain) mapreduce has no such facility
- - in Apache Pig there is no separate compilation step; each Pig operator is internally converted into
- a mapreduce job. In plain mapreduce we must write everything ourselves (map function, reduce function, etc.)
- In Pig, structured, semi-structured and unstructured data can be analyzed.
- In plain mapreduce, typically only flat files are used.
- Pig has features of a programming language as well as features of SQL.
- schema - the structure of the data: the field names and their data types
- Difference between Apache Pig and SQL:
- - Pig is a procedural language; SQL is declarative
- - in Pig a schema is optional; in SQL a schema is mandatory
- - the data model in Pig is nested relational (within one relation we can put another relation), whereas the data model in SQL is flat relational
- - Pig provides default query optimization, whereas in SQL we need to apply query optimization techniques ourselves
- Difference between Apache Pig and Hive:
- - Pig uses the Pig Latin language, whereas Hive uses the Hive Query Language (HiveQL)
- - Pig was created at Yahoo, whereas Hive was created at Facebook
- - Pig Latin is procedural, whereas HiveQL is declarative
- - Pig can handle all types of data (structured, semi-structured and unstructured), whereas Hive is mostly used
- for structured data (because it is schema dependent)
- Apache Pig:
- - started in 2006 at Yahoo (purpose: executing a mapreduce job on every data set)
- - in 2007 moved to the Apache Software Foundation
- - in 2008 first release (Apache Pig)
- - since 2010 a full, top-level Apache project
- In Apache Pig, a program can be executed in three ways (the grunt shell is involved in each):
- 1. interactive: grunt shell
- 2. batch: pig script (extension of the script file - .pig)
- 3. embedded: user defined functions (functions written in Python/Java and used through Pig)
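- A minimal batch-mode script might look like the sketch below (the HDFS path, file name and fields are assumptions, reused from later examples):

```pig
-- sample.pig: a sketch of a batch-mode Pig script (path/fields are assumptions)
A = LOAD '/user/raman/pig_directory/file.dat' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
B = FILTER A BY f1 > 1;   -- keep tuples whose first field exceeds 1
DUMP B;                   -- triggers execution and prints the result
```

- Run it with pig -x mapreduce sample.pig (or pig -x local sample.pig without a cluster).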
- Series of transformations / execution stages in Apache Pig:
- 1. Pig Latin Script (PLS)
- PLS -> Apache Pig [Parser -> Optimizer -> Compiler -> Execution Engine] -> MapReduce -> HDFS
- 2. Parser:
- - initially the PLS is handled by the Parser
- - it checks the syntax of the script
- - it generates a DAG (Directed Acyclic Graph), which represents the Pig Latin statements and operators
- - the output of the Parser is a DAG
- - in the DAG, logical operators are represented as nodes (vertices) and data flows are represented as edges
- 3. (after the Parser, control goes to the) Optimizer:
- - the Optimizer takes the DAG (logical plan) as input
- - the logical plan (DAG) is passed to the logical optimizer, which optimizes all commands
- - Pig provides default query optimization
- 4. Compiler:
- - the Compiler compiles the optimized logical plan into a series of mapreduce jobs
- 5. Execution Engine:
- - the Execution Engine interacts with Hadoop
- - finally the mapreduce jobs are submitted to Hadoop in sorted order, and these mapreduce jobs are executed
- on the hadoop cluster, producing the desired output.
- Pig Latin Data Model:
- Atom - a field is known as an Atom (the smallest unit of the Pig Latin Data Model)
- 001, Rajiv, 21, Hyderabad
-
- Tuple - an ordered collection of fields is a tuple (record)
- 001 Rajiv 21 Hyderabad
- Bag - a collection of tuples
- 001 Rajiv 21 Hyderabad
- 002 Omer 22 Kolkata
- 003 Rajesh 23 Delhi
-
- In MySQL terms: a Table is a Bag, a Record is a Tuple, and a Field is a Field (column name).
- Atom:
- - any single value in Pig Latin, irrespective of its data type, is known as an Atom
- - the value within a field is called an Atom; that value together with its data type forms a field
- - it is stored as a string and can be used as a string or as a number (examples: 'Delhi', '23')
- Tuple ():
- - an ordered set of fields (example: (Rajesh, 23))
- - tuples need not all contain the same set of fields
- - in a Tuple, fields can be of any type
- - a Tuple is denoted within parentheses
- Bag {}:
- - a Bag is a collection of Tuples
- - example: {(Rajesh, 23), (Omer, 22)}
- - a Bag can be nested, i.e. a Bag can be a field in a relation (an inner bag)
- - a Bag can be a member of another Bag
- - example: {(Rajesh, 30, {(999999999, rajesh@gmail.com)})}
- Map []:
- - a Map is a set of key-value pairs
- - the key needs to be of type chararray and should be unique
- - the value may be of any type
- - a Map is represented by [ ]
- - example: [name#Rajesh, age#23] ('name' is the key and 'Rajesh' is its value, and similarly for the rest)
- - in a Map, the key and the value are separated by the '#' symbol
- Relation:
- - a Relation is a Bag of Tuples
- - Relations in Pig Latin are unordered
- - we need to write a command (ORDER) to order one
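- The whole data model can appear in a single LOAD schema; the sketch below (field names and path are assumptions) declares an atom, a tuple, a bag and a map as fields:

```pig
-- sketch: declaring complex types in a LOAD schema (names/path are assumptions)
A = LOAD '/user/raman/pig_directory/people.dat' USING PigStorage('\t') AS (
      name:chararray,                          -- atom
      address:tuple(city:chararray, pin:int),  -- tuple field
      phones:bag{t:(number:chararray)},        -- bag of tuples
      props:map[]                              -- map (values of any type)
    );
DESCRIBE A;
```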
- Execution Mechanism: three modes
- 1. Interactive mode (Grunt shell - the shell of Pig):
- - get output using the Dump operator
- 2. Batch mode (script - *.pig)
- 3. Embedded mode (user defined functions)
- - Apache Pig provides the provision of defining our own functions (UDFs) in other
- languages such as Java, Python, etc.
- - we write a user defined function in another programming language and call it from Pig Latin
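- Registering and calling a UDF from the grunt shell might look like the sketch below (upper.py and to_upper are hypothetical names, not from these notes):

```pig
-- sketch of embedded mode; the UDF file upper.py (defining to_upper) is hypothetical
REGISTER 'upper.py' USING jython AS myudf;
A = LOAD '/user/raman/pig_directory/student_data.txt' USING PigStorage(',')
      AS (id:int, firstname:chararray);
B = FOREACH A GENERATE myudf.to_upper(firstname);
DUMP B;
```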
- Installation:
- # display version of Pig
- pig -version
- # execute Pig in local mode (independent of the Hadoop cluster)
- pig -x local
- pwd
- # to quit
- quit;
- # Batch mode locally
- pig -x local a.pig
- # Pig in mapreduce mode (linked to hadoop automatically)
- pig -x mapreduce
- pwd
- # Batch mode (mapreduce mode)
- pig -x mapreduce a.pig
- # two types of commands can be run in pig:
- - shell commands: linux commands
- - file system commands: hadoop cluster commands
- # shell commands
- sh ls
- # file system commands
- fs ls
- fs -ls /user/raman/
- # clear screen
- clear/ctrl+L
- # basic commands
- pig -x mapreduce
- fs -ls /user/raman/
- fs -cat /user/raman/file.txt
- the LOAD command is used to convert a file on the cluster into a relation
- grunt> a = LOAD 'hdfs://172.17.0.2:9000/user/raman/file.txt' USING PigStorage(' ');
- (' ') - space separator
- (':') - colon separator
- grunt> DUMP a; # the DUMP command executes the plan in mapreduce fashion and displays the content of file.txt
- ------------
- Assignment: (.csv file - execute all commands with the file)
- =====================
- $ ssh <masternode>
- $ jps
- $ start-yarn.sh
- $ jps
- ====================
- Start Apache Pig in two modes:
- 1. local mode
- 2. mapreduce mode
- For hadoop cluster, start Apache Pig in mapreduce mode.
- # Working with Apache Pig
- By default, all the log files of Pig are stored in the current working directory. Create a separate working
- directory named /home/raman/Pig/ for Apache Pig.
- Remove these log files from time to time in the directory /home/raman/Pig/ with the command rm *.log
- # start Apache Pig in mapreduce mode
- pig -x mapreduce
- (shell prompt of Apache Pig is grunt)
- (quit; to exit grunt shell in Apache Pig)
- # Modes of Apache Pig
- 1. Interactive Mode (grunt shell mode)
- 2. Batch Mode (pig script mode)
- 3. Embedded Mode (udf (user defined function) mode - defining functions in some programming language and calling that function from pig script)
- # Apache Pig in Batch mode
- pig -x mapreduce file.pig # execute the script file.pig in mapreduce fashion (the script can be on the local file system or on hdfs)
- # copy file from local to hdfs cluster
- hdfs dfs -put file.dat /user/raman/pig_directory/
- # start Pig in mapreduce mode
- pig -x mapreduce
- # LOAD - loading data from hdfs into a relation on the Apache Pig server
- # i.e. copying/converting a data file into pig format on the Apache Pig server by loading it from hdfs
- grunt> A = LOAD 'hdfs://172.17.0.2:9000/user/raman/file.dat' USING PigStorage(',') AS (f1:int, f2:int, f3:int); # creates a relation 'A'
- (in mapreduce mode, always give the address of the master node: 'hdfs://172.17.0.2:9000/user/raman/file.dat')
- (',') - field separator
- (f1:int, f2:int, f3:int) - the number of fields should match the number of fields in the data file
- # the DUMP command is used to display the output (DUMP is what triggers the mapreduce execution)
- grunt> DUMP A; # DUMP executes the plan behind the relation 'A'
- # remove/delete file from hdfs cluster
- hdfs dfs -rm /user/raman/pig_directory/file.dat
- # running bash shell commands from grunt shell
- grunt> sh ls
- # running hadoop cluster commands from grunt shell
- grunt> fs -ls
- # pig is case-sensitive
- # file.dat
- 1,2,3
- 4,5,6
- 7,5,6
- # GROUP in pig (GROUP in pig is similar to GROUP BY in MySQL)
- grunt> B = GROUP A BY f1; # groups the entire relation, keyed by field f1
- grunt> DUMP B; # each group key is paired with a bag of the matching tuples
- (1,{(1,2,3)}) # collections (inner bags) of tuples
- (4,{(4,5,6)})
- (7,{(7,5,6)})
- grunt> DUMP A;
- (1,2,3)
- (4,5,6)
- (7,5,6)
- # GENERATE is similar to SELECT in MySQL
- grunt> C = FOREACH B GENERATE COUNT($1); # take each tuple of B and count the elements in its bag ($1)
- grunt> DUMP C;
- (1)
- (1)
- (1)
- # for viewing error log file
- grunt> quit;
- $ vi -M pig_*.log
- # vi student_data.txt
- 001,Rajiv,Reddy,9848022337,Hyderabad
- 002,Siddarth,Battacharya,9848022338,Kolkata
- 003,Rajesh,Khanna,9848022339,Delhi
- 004,Preethi,Agarwal,9848022330,Pune
- 005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
- 006,Archana,Mishra,9848022335,Chennai
- 007,Ragunath,Chandran,9848022334,Chennai
- $ hdfs dfs -put student_data.txt /user/raman/pig_directory/
- # Excluding LOAD and STORE, all commands return a relation. LOAD and
- STORE copy and retrieve data; all other commands return a relation.
- $ pig -x mapreduce
- grunt> student = LOAD '/user/raman/pig_directory/student_data.txt' USING PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,contactno:chararray,location:chararray);
- grunt> STORE student INTO '/user/raman/pig_directory/outputfile.txt' USING PigStorage(','); # copy the relation from pig to the hdfs cluster
- # the LOAD command copies a file from the hdfs cluster into pig
- # the STORE command copies data from pig to the hdfs cluster
- Output(s):
- Successfully stored 6 records (5757980 bytes) in: "/user/raman/pig_directory/outputfile.txt"
- grunt> fs -ls /user/raman/pig_directory/
- grunt> fs -ls /user/raman/pig_directory/outputfile.txt
- grunt> fs -cat /user/raman/pig_directory/outputfile.txt/part-m-00000
- grunt> quit;
- # LOAD will simply load the data into specified relation in Apache Pig
- # STORE will simply store the data from Apache Pig to the HDFS cluster
- # Diagnostic operators in Apache Pig
- 1. DUMP
- 2. DESCRIBE
- 3. EXPLAIN
- 4. ILLUSTRATE
- $ pig -x mapreduce
- grunt> describe student; # shows the schema (data types) of the relation (similar to DESCRIBE in mysql)
- grunt> explain student; # more detail on the entire relation (the EXPLAIN operator displays the logical, physical and mapreduce execution plans of a relation)
- grunt> illustrate student; # ILLUSTRATE gives a step-by-step sample execution of the sequence of statements
- # it is good practice to put the commands in a script and execute the script in pig
- grunt> groupdata = GROUP student BY location;
- grunt> describe groupdata;
- grunt> dump groupdata;
- grunt> groupall = GROUP student ALL; # a single group keyed 'all' is created containing all the data (only one group), useful for counting all the records
- grunt> dump groupall;
- # COGROUP is similar to GROUP. The only difference is that GROUP operator works with one relation at a time, while COGROUP is used for two or more relations simultaneously
- grunt> COGROUP <relation1> BY <field_of_relation1>, <relation2> BY <field_of_relation2>
- # empdata.txt
- 001,Emply1,22,Newyork
- 002,Emply2,23,Tokyo
- 003,Emply3,23,Kolkata
- 004,Emply4,25,London
- 005,Emply5,23,Pune
- 006,Emply6,22,Chennai
- $ hdfs dfs -put empdata.txt /user/raman/pig_directory/
- $ pig -x mapreduce
- grunt> emply = LOAD '/user/raman/pig_directory/empdata.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,location:chararray);
- grunt> describe emply;
- grunt> cogrpdata = COGROUP student BY location, emply BY location;
- grunt> describe cogrpdata;
- grunt> dump cogrpdata;
- # the JOIN operator is used to combine records from two or more relations
- # while performing a JOIN operation, we declare one field (or a group of fields) from each relation as the key
- # when these keys match, the corresponding tuples are matched
- # types of JOIN:
- 1. self join
- 2. inner join / equi join
- 3. outer join (left, right, full)
- # syntax: JOIN <relation1> BY <column1>, <relation2> BY <column2>;
- grunt> stemp = JOIN student BY location, emply BY location; # inner join
- grunt> describe stemp;
- grunt> dump stemp;
- (7,Ragunath,Chandran,9848022334,Chennai,6,Emply6,22,Chennai)
- (6,Archana,Mishra,9848022335,Chennai,6,Emply6,22,Chennai)
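- Only the inner join is shown above; the other types from the list follow the same pattern. A sketch of a left outer join and a self join on the same relations:

```pig
-- sketch: a left outer join keeps every student, with nulls where no employee matches
stleft = JOIN student BY location LEFT OUTER, emply BY location;
DUMP stleft;
-- a self join needs the same data loaded under a second alias
student2 = LOAD '/user/raman/pig_directory/student_data.txt' USING PigStorage(',')
             AS (id:int,firstname:chararray,lastname:chararray,contactno:chararray,location:chararray);
sself = JOIN student BY location, student2 BY location;
```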
- # Apache Pig supports various data types, for example:
- int - integer
- chararray - string (array of characters)
- (other simple types include long, float, double, boolean and bytearray)
- ---------------------------------------------------------------------------------------------------------------------------
- Assignment: *.xl
- Apache Pig (Advance Commands)
- # Referencing fields
- In Apache Pig, fields are referenced using positional parameters - $0, $1,
- and so on.
- $0 - 1st field, $1 - 2nd field
- Fields in apache pig can also be accessed by name.
- # FOREACH
- Normally, FOREACH is used after grouping.
- Syntax:
- X = FOREACH student GENERATE name,$2; # GENERATE is like SELECT
- Y = FOREACH student GENERATE $2-$1; # difference of the 3rd and 2nd fields
- In Apache Pig, -- is used for comments.
- $ pig -x mapreduce
- grunt> sh cat file.txt;
- 1,3,5
- 1,2,3
- 3,1,2
- 3,1,4
- grunt> student = LOAD '/user/raman/pig_directory/studentdata.txt' USING PigStorage(',') as (id:int,name:chararray,lastname:chararray,contact:chararray,location:chararray);
- grunt> X = FOREACH student GENERATE name,$2;
- grunt> describe X;
- grunt> X = FOREACH student GENERATE id*2;
- grunt> describe X;
- grunt> X = FOREACH student GENERATE id*2 as myid; # as (alias)
- grunt> describe X;
- Apache Pig supports a ternary operator (bincond). (use it when scripting)
- grunt> Y = FOREACH student GENERATE (id == 2 ? 1 : 4); # when id equals 2, output 1, otherwise 4; ?: is the ternary operator (an alternative to if-else)
- Apache pig supports complex data types:
- 1. Tuple: an ordered set of fields (19,2)
- 2. Bag: a collection of Tuples {(19,2),(18,1)}
- 3. Map: a set of key-value pairs [open#apache], [name#Raman,location#London]
- The data type of a field can itself be a Tuple,
- Bag or Map. Apache Pig supports a nested architecture, e.g. a Tuple within a Tuple.
- $ hdfs dfs -put data.txt /user/raman/pig_directory/data.txt
- $ pig -x mapreduce
- grunt> sh cat data.txt # Tuple
- (3,8,9) (4,5,6)
- (1,4,7) (3,7,5)
- grunt>
- Create relation using complex data types.
- grunt> A = LOAD '/user/raman/pig_directory/data.txt' USING PigStorage(' ') AS (t1:tuple(t1a:int,t1b:int,t1c:int), t2:tuple(t2a:int,t2b:int,t2c:int));
- grunt> describe A;
- A: {t1: (t1a: int,t1b: int,t1c: int),t2: (t2a: int,t2b: int,t2c: int)}
- grunt> -- Two fields are in relation A.
- grunt> DUMP A;
- ((3,8,9),(4,5,6))
- ((1,4,7),(3,7,5))
- grunt> X = FOREACH A GENERATE t1.t1a,t2.$0;
- grunt> describe X;
- grunt> DUMP X;
- (3,4)
- (1,3)
- grunt> DUMP A;
- Concept of Outer Bag/Inner Bag in Apache Pig:
- grunt> -- Outer Bag
- grunt> -- Outer Bag is always unnested
- grunt> -- A (f1:int, f2:int, f3:int)
- grunt> -- (1,2,3)
- grunt> -- (4,2,1)
- grunt> -- (8,3,4)
- grunt> -- (4,3,3)
- grunt> -- X = GROUP A BY f1;
- grunt> -- Contents of X will be:
- grunt> -- (1,{(1,2,3)})
- grunt> -- (4,{(4,2,1),(4,3,3)})
- grunt> -- (8,{(8,3,4)})
- grunt> -- X is a relation, i.e. a bag of tuples. The inner bags are {(1,2,3)}, {(4,2,1),(4,3,3)} and {(8,3,4)}
- FILTER is the WHERE condition of SQL, with some differences:
- grunt> student = LOAD '/user/raman/pig_directory/studentdata.txt' USING PigStorage(',') as (id:int,name:chararray,lastname:chararray,contact:chararray,location:chararray);
- grunt> -- conditional operators: not equal to (!=), equal to (==), greater than (>), greater than equal to (>=), less than (<), less than equal to (<=)
- grunt> schennai = FILTER student BY location == 'Chennai'; [select * from student where location = 'Chennai']
- grunt> describe schennai;
- grunt> DUMP schennai;
- 006,Archana,Mishra,9848022335,Chennai
- 007,Ragunath,Chandran,9848022334,Chennai
- grunt>
- grunt> -- display only id and location for location == 'Chennai' (assignment)
- grunt>
- grunt> nchennai = FILTER student BY NOT location == 'Chennai'; [select * from student where location != 'Chennai']
- grunt> DUMP nchennai;
- 001,Rajiv,Reddy,9848022337,Hyderabad
- 002,Siddarth,Battacharya,9848022338,Kolkata
- 003,Rajesh,Khanna,9848022339,Delhi
- 004,Preethi,Agarwal,9848022330,Pune
- 005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
- grunt> schennai = FILTER student BY location matches 'Chennai'; [select * from student where location = 'Chennai']
- grunt> nchennai = FILTER student BY NOT location matches 'Chennai'; [select * from student where location != 'Chennai']
- grunt> FILTER student BY location matches '.*Pradesh.*'; (matches uses Java regular expressions, so the wildcard is .*)
- grunt> gst = GROUP student BY location;
- grunt> DUMP gst;
- grunt> gstdisplay = FOREACH gst GENERATE $1; -- second field
- grunt> DUMP gstdisplay;
- grunt> gstdisplay = FOREACH gst GENERATE group, COUNT($1); # word count/frequency (mapreduce) (GENERATE group means on which field you have to group)
- grunt> DUMP gstdisplay;
- grunt> -- select location count(*) as count from student group by location; (MySQL)
- grunt> -- grouping can be done on more than one field
- grunt> GROUP student BY (location, name)
- grunt> -- Apache Pig in-built functions: AVG, SUM, MAX, MIN
- grunt> -- NYSE (csv format)
- grunt> -- daily = LOAD 'NYSE_daily' as (exchange, stock, date, dividends)
- grunt> -- group by exchange and stock and find average value of dividends for each exchange and stock, store into output folder
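- A hedged sketch of the NYSE assignment above (the file path, delimiter, field types and output folder are assumptions):

```pig
-- sketch of the NYSE assignment (path, delimiter and names are assumptions)
daily = LOAD '/user/raman/pig_directory/NYSE_daily' USING PigStorage(',')
          AS (exchange:chararray, stock:chararray, date:chararray, dividends:float);
grpd  = GROUP daily BY (exchange, stock);            -- group on two fields
avgd  = FOREACH grpd GENERATE group, AVG(daily.dividends);
STORE avgd INTO '/user/raman/pig_directory/average_dividend';
```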
- grunt> -- order (sorting a relation)
- grunt> sorter = ORDER student BY location DESC; (default is ascending order)
- grunt> -- sorting can be done on more than one field
- grunt> sorter = ORDER student BY location, name DESC; (location in ascending order and name in descending order)
- grunt> -- DISTINCT (unique records - removes duplicates)
- grunt> uniq = DISTINCT student;
- grunt> -- select * from student limit 2;
- grunt> s2 = limit student 2; # display first two records
- grunt> dump s2;
- grunt> -- we can generate a random sample from an apache pig relation
- grunt> -- the sample size is given as a fraction (probability) of the records
- grunt> some = SAMPLE student 0.5; -- roughly 50%
- grunt> dump some;
- grunt> -- to increase the number of reducers
- grunt> -- PARALLEL 10; (number of reduce tasks)
- grunt> gst = GROUP student BY location PARALLEL 10; (faster execution via parallel computation)
- grunt> -- 10 reducers will be activated
- grunt> -- In Apache Pig, each and every command is executed in mapreduce mode only if using pig -x mapreduce
- grunt> -- parallel is used in pig -x mapreduce not in pig -x local
- grunt> -- FLATTEN - unnests a tuple/bag
- grunt> -- for a tuple, FLATTEN substitutes the fields of the tuple in place of the tuple
- grunt> -- suppose relation R contains:
- grunt> -- (a,b,c)
- grunt> -- (d,e,f)
- grunt> -- grouping on the first field, X = GROUP R BY $0, gives:
- grunt> -- (a,{(a,b,c)})
- grunt> -- (d,{(d,e,f)})
- grunt> -- FOREACH X GENERATE $1; gives the inner bag:
- grunt> -- ({(a,b,c)})
- grunt> -- FOREACH X GENERATE $0; gives the group key:
- grunt> -- (a)
- grunt> -- FOREACH X GENERATE group, flatten($1); # unnests the inner bag into top-level fields
- grunt> -- (a,a,b,c)
- grunt> -- CROSS R1, R2; # cross product of two relations
- grunt> -- 1st record of relation1 with all records of relation2, 2nd record of relation1 with all records of relation2, ...
- grunt> -- UNION R1, R2; # merge R1 and R2 into one relation (like rbind / combining rows)
- grunt> -- SPLIT student INTO S1 IF location == 'Chennai', S2 IF location != 'Chennai'; # the opposite of UNION (bifurcates a relation)
- grunt> -- tokenize # self reading
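- TOKENIZE (left above for self reading) pairs naturally with FLATTEN for the classic word count; a sketch, assuming a hypothetical text file lines.txt on HDFS:

```pig
-- sketch: word count with TOKENIZE + FLATTEN (the input file is an assumption)
lines  = LOAD '/user/raman/pig_directory/lines.txt' AS (line:chararray);
words  = FOREACH lines GENERATE flatten(TOKENIZE(line)) AS word; -- one word per tuple
grpw   = GROUP words BY word;
counts = FOREACH grpw GENERATE group, COUNT(words);              -- frequency per word
DUMP counts;
```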
- Assignment hint:
- data type date for date
- # Execution of pig scripts (batch mode)
- pig -x mapreduce file.pig