- By default, Pig works in mapreduce mode.
- Apache Pig is an abstraction over mapreduce.
- It is a tool used to analyze large data sets, representing them as data flows.
- It is used with Hadoop (it works at the top layer of Hadoop). We can perform all data
- manipulation in Hadoop using Apache Pig (an alternative to Java mapreduce).
- To write data analysis programs, Pig provides a high-level language - Pig Latin. In Pig, we
- write programs in the Pig Latin language.
- Why do we need Apache Pig?
- - it can perform mapreduce without writing Java code
- - it uses a multi-query approach: reduces the length of code
- - Pig Latin is an SQL-like language
- - it provides many built-in operators to support data operations like GROUP BY, JOIN, etc.
- - it provides nested data types like tuples, bags and maps
- Difference between Apache Pig and mapreduce:
- - Apache Pig is a data flow language, whereas mapreduce is a data processing paradigm
- - Apache Pig is a high-level language, whereas mapreduce is low level and rigid
- - the join operation is simple in Pig, whereas in mapreduce it is difficult to implement
- - Apache Pig can be used with a basic knowledge of queries, whereas (plain) mapreduce requires
- expertise in core Java
- - Apache Pig uses a multi-query approach, whereas (plain) mapreduce has no such facility
- - in Apache Pig there is no separate compilation step; each Pig operator is internally converted into
- a mapreduce job. In plain mapreduce we must write everything ourselves (map function, reduce function, etc.)
- In Pig, structured, semi-structured and unstructured data can be analyzed.
- In plain mapreduce, typically only flat files are used.
- Pig has features of a programming language as well as features of SQL.
- schema - the structure of the data: the field names and their data types
- Difference between Apache Pig and SQL:
- - Pig is a procedural language; SQL is declarative
- - in Pig a schema is optional; in SQL a schema is mandatory
- - the data model in Pig is nested relational (within one relation we can put another relation), whereas the data model in SQL is flat relational
- - Pig provides default query optimization, whereas in SQL we need to apply query optimization techniques ourselves
- Difference between Apache Pig and Hive:
- - Pig uses the Pig Latin language, whereas Hive uses the Hive Query Language (HiveQL)
- - Pig was created at Yahoo, whereas Hive was created at Facebook
- - Pig Latin is procedural, whereas HiveQL is declarative
- - Pig can handle all types of data (structured, semi-structured and unstructured), whereas Hive is mostly used
- for structured data (because it is schema dependent)
- Apache Pig:
- - started in 2006 at Yahoo (purpose: executing a mapreduce job on every data set)
- - in 2007 moved to the Apache Software Foundation
- - in 2008 first release (Apache Pig)
- - since 2010 a full, top-level Apache project
- In Apache Pig, a program can be executed in three ways (the grunt shell is involved in each):
- 1. interactive: grunt shell
- 2. batch: pig script (extension of the script file - .pig)
- 3. embedded: user defined functions (functions written in Python/Java and used through Pig)
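- A minimal batch-mode script might look like the sketch below (the HDFS path, file name and fields are assumptions, reused from later examples):

```pig
-- sample.pig: a sketch of a batch-mode Pig script (path/fields are assumptions)
A = LOAD '/user/raman/pig_directory/file.dat' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
B = FILTER A BY f1 > 1;   -- keep tuples whose first field exceeds 1
DUMP B;                   -- triggers execution and prints the result
```

- Run it with pig -x mapreduce sample.pig (or pig -x local sample.pig without a cluster).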
- Series of transformations / execution stages in Apache Pig:
- 1. Pig Latin Script (PLS)
- PLS -> Apache Pig [Parser -> Optimizer -> Compiler -> Execution Engine] -> MapReduce -> HDFS
- 2. Parser:
- - initially the PLS is handled by the Parser
- - it checks the syntax of the script
- - it generates a DAG (Directed Acyclic Graph), which represents the Pig Latin statements and operators
- - the output of the Parser is a DAG
- - in the DAG, logical operators are represented as nodes (vertices) and data flows are represented as edges
- 3. (after the Parser, control goes to the) Optimizer:
- - the Optimizer takes the DAG (logical plan) as input
- - the logical plan (DAG) is passed to the logical optimizer, which optimizes all commands
- - Pig provides default query optimization
- 4. Compiler:
- - the Compiler compiles the optimized logical plan into a series of mapreduce jobs
- 5. Execution Engine:
- - the Execution Engine interacts with Hadoop
- - finally the mapreduce jobs are submitted to Hadoop in sorted order, and these mapreduce jobs are executed
- on the hadoop cluster, producing the desired output.
- Pig Latin Data Model:
- Atom - a field is known as an Atom (the smallest unit of the Pig Latin Data Model)
- 001, Rajiv, 21, Hyderabad
-
- Tuple - an ordered collection of fields is a tuple (record)
- 001 Rajiv 21 Hyderabad
- Bag - a collection of tuples
- 001 Rajiv 21 Hyderabad
- 002 Omer 22 Kolkata
- 003 Rajesh 23 Delhi
-
- In MySQL terms: a Table is a Bag, a Record is a Tuple, and a Field is a Field (column name).
- Atom:
- - any single value in Pig Latin, irrespective of its data type, is known as an Atom
- - the value within a field is called an Atom; that value together with its data type forms a field
- - it is stored as a string and can be used as a string or as a number (examples: 'Delhi', '23')
- Tuple ():
- - an ordered set of fields (example: (Rajesh, 23))
- - tuples need not all contain the same set of fields
- - in a Tuple, fields can be of any type
- - a Tuple is denoted within parentheses
- Bag {}:
- - a Bag is a collection of Tuples
- - example: {(Rajesh, 23), (Omer, 22)}
- - a Bag can be nested, i.e. a Bag can be a field in a relation (an inner bag)
- - a Bag can be a member of another Bag
- - example: {(Rajesh, 30, {(999999999, rajesh@gmail.com)})}
- Map []:
- - a Map is a set of key-value pairs
- - the key needs to be of type chararray and should be unique
- - the value may be of any type
- - a Map is represented by [ ]
- - example: [name#Rajesh, age#23] ('name' is the key and 'Rajesh' is its value, and similarly for the rest)
- - in a Map, the key and the value are separated by the '#' symbol
- Relation:
- - a Relation is a Bag of Tuples
- - Relations in Pig Latin are unordered
- - we need to write a command (ORDER) to order one
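- The whole data model can appear in a single LOAD schema; the sketch below (field names and path are assumptions) declares an atom, a tuple, a bag and a map as fields:

```pig
-- sketch: declaring complex types in a LOAD schema (names/path are assumptions)
A = LOAD '/user/raman/pig_directory/people.dat' USING PigStorage('\t') AS (
      name:chararray,                          -- atom
      address:tuple(city:chararray, pin:int),  -- tuple field
      phones:bag{t:(number:chararray)},        -- bag of tuples
      props:map[]                              -- map (values of any type)
    );
DESCRIBE A;
```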
- Execution Mechanism: three modes
- 1. Interactive mode (Grunt shell - the shell of Pig):
- - get output using the Dump operator
- 2. Batch mode (script - *.pig)
- 3. Embedded mode (user defined functions)
- - Apache Pig provides the provision of defining our own functions (UDFs) in other
- languages such as Java, Python, etc.
- - we write a user defined function in another programming language and call it from Pig Latin
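- Registering and calling a UDF from the grunt shell might look like the sketch below (upper.py and to_upper are hypothetical names, not from these notes):

```pig
-- sketch of embedded mode; the UDF file upper.py (defining to_upper) is hypothetical
REGISTER 'upper.py' USING jython AS myudf;
A = LOAD '/user/raman/pig_directory/student_data.txt' USING PigStorage(',')
      AS (id:int, firstname:chararray);
B = FOREACH A GENERATE myudf.to_upper(firstname);
DUMP B;
```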
- Installation:
- # display version of Pig
- pig -version
- # execute Pig in local mode (independent of the Hadoop cluster)
- pig -x local
- pwd
- # to quit
- quit;
- # Batch mode locally
- pig -x local a.pig
- # Pig in mapreduce mode (linked to hadoop automatically)
- pig -x mapreduce
- pwd
- # Batch mode (mapreduce mode)
- pig -x mapreduce a.pig
- # two types of commands can be run in pig:
- - shell commands: linux commands
- - file system commands: hadoop cluster commands
- # shell commands
- sh ls
- # file system commands
- fs ls
- fs -ls /user/raman/
- # clear screen
- clear/ctrl+L
- # basic commands
- pig -x mapreduce
- fs -ls /user/raman/
- fs -cat /user/raman/file.txt
- the LOAD command is used to convert a file on the cluster into a relation
- grunt> a = LOAD 'hdfs://172.17.0.2:9000/user/raman/file.txt' USING PigStorage(' ');
- (' ') - space separator
- (':') - colon separator
- grunt> DUMP a; # the DUMP command executes the plan in mapreduce fashion and displays the content of file.txt
- ------------
- Assignment: (.csv file - execute all commands with the file)
- =====================
- $ ssh <masternode>
- $ jps
- $ start-yarn.sh
- $ jps
- ====================
- Start Apache Pig in two modes:
- 1. local mode
- 2. mapreduce mode
- For hadoop cluster, start Apache Pig in mapreduce mode.
- # Working with Apache Pig
- By default, all the log files of Pig are stored in the current working directory. Create a separate working
- directory named /home/raman/Pig/ for Apache Pig.
- Remove these log files from time to time in the directory /home/raman/Pig/ with the command rm *.log
- # start Apache Pig in mapreduce mode
- pig -x mapreduce
- (shell prompt of Apache Pig is grunt)
- (quit; to exit grunt shell in Apache Pig)
- # Modes of Apache Pig
- 1. Interactive Mode (grunt shell mode)
- 2. Batch Mode (pig script mode)
- 3. Embedded Mode (udf (user defined function) mode - defining functions in some programming language and calling that function from pig script)
- # Apache Pig in Batch mode
- pig -x mapreduce file.pig # execute the script file.pig in mapreduce fashion (the script can be on the local file system or on hdfs)
- # copy file from local to hdfs cluster
- hdfs dfs -put file.dat /user/raman/pig_directory/
- # start Pig in mapreduce mode
- pig -x mapreduce
- # LOAD - loading data from hdfs into a relation on the Apache Pig server
- # i.e. copying/converting a data file into pig format on the Apache Pig server by loading it from hdfs
- grunt> A = LOAD 'hdfs://172.17.0.2:9000/user/raman/file.dat' USING PigStorage(',') AS (f1:int, f2:int, f3:int); # creates a relation 'A'
- (in mapreduce mode, always give the address of the master node: 'hdfs://172.17.0.2:9000/user/raman/file.dat')
- (',') - field separator
- (f1:int, f2:int, f3:int) - the number of fields should match the number of fields in the data file
- # the DUMP command is used to display the output (DUMP is what triggers the mapreduce execution)
- grunt> DUMP A; # DUMP executes the plan behind the relation 'A'
- # remove/delete file from hdfs cluster
- hdfs dfs -rm /user/raman/pig_directory/file.dat
- # running bash shell commands from grunt shell
- grunt> sh ls
- # running hadoop cluster commands from grunt shell
- grunt> fs -ls
- # pig is case-sensitive
- # file.dat
- 1,2,3
- 4,5,6
- 7,5,6
- # GROUP in pig (GROUP in pig is similar to GROUP BY in MySQL)
- grunt> B = GROUP A BY f1; # groups the entire relation, keyed by field f1
- grunt> DUMP B; # each group key is paired with a bag of the matching tuples
- (1,{(1,2,3)}) # collections (inner bags) of tuples
- (4,{(4,5,6)})
- (7,{(7,5,6)})
- grunt> DUMP A;
- (1,2,3)
- (4,5,6)
- (7,5,6)
- # GENERATE is similar to SELECT in MySQL
- grunt> C = FOREACH B GENERATE COUNT($1); # take each tuple of B and count the elements in its bag ($1)
- grunt> DUMP C;
- (1)
- (1)
- (1)
- # for viewing error log file
- grunt> quit;
- $ vi -M pig_*.log
- # vi student_data.txt
- 001,Rajiv,Reddy,9848022337,Hyderabad
- 002,Siddarth,Battacharya,9848022338,Kolkata
- 003,Rajesh,Khanna,9848022339,Delhi
- 004,Preethi,Agarwal,9848022330,Pune
- 005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
- 006,Archana,Mishra,9848022335,Chennai
- 007,Ragunath,Chandran,9848022334,Chennai
- $ hdfs dfs -put student_data.txt /user/raman/pig_directory/
- # Excluding LOAD and STORE, all commands return a relation. LOAD and
- STORE copy and retrieve data; all other commands return a relation.
- $ pig -x mapreduce
- grunt> student = LOAD '/user/raman/pig_directory/student_data.txt' USING PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,contactno:chararray,location:chararray);
- grunt> STORE student INTO '/user/raman/pig_directory/outputfile.txt' USING PigStorage(','); # copy the relation from pig to the hdfs cluster
- # the LOAD command copies a file from the hdfs cluster into pig
- # the STORE command copies data from pig to the hdfs cluster
- Output(s):
- Successfully stored 6 records (5757980 bytes) in: "/user/raman/pig_directory/outputfile.txt"
- grunt> fs -ls /user/raman/pig_directory/
- grunt> fs -ls /user/raman/pig_directory/outputfile.txt
- grunt> fs -cat /user/raman/pig_directory/outputfile.txt/part-m-00000
- grunt> quit;
- # LOAD will simply load the data into specified relation in Apache Pig
- # STORE will simply store the data from Apache Pig to the HDFS cluster
- # Diagnostic operators in Apache Pig
- 1. DUMP
- 2. DESCRIBE
- 3. EXPLAIN
- 4. ILLUSTRATE
- $ pig -x mapreduce
- grunt> describe student; # shows the schema (data types) of the relation (similar to DESCRIBE in mysql)
- grunt> explain student; # more detail on the entire relation (the EXPLAIN operator displays the logical, physical and mapreduce execution plans of a relation)
- grunt> illustrate student; # ILLUSTRATE gives a step-by-step sample execution of the sequence of statements
- # it is good practice to put the commands in a script and execute the script in pig
- grunt> groupdata = GROUP student BY location;
- grunt> describe groupdata;
- grunt> dump groupdata;
- grunt> groupall = GROUP student ALL; # a single group keyed 'all' is created containing all the data (only one group), useful for counting all the records
- grunt> dump groupall;
- # COGROUP is similar to GROUP. The only difference is that GROUP operator works with one relation at a time, while COGROUP is used for two or more relations simultaneously
- grunt> COGROUP <relation1> BY <field_of_relation1>, <relation2> BY <field_of_relation2>
- # empdata.txt
- 001,Emply1,22,Newyork
- 002,Emply2,23,Tokyo
- 003,Emply3,23,Kolkata
- 004,Emply4,25,London
- 005,Emply5,23,Pune
- 006,Emply6,22,Chennai
- $ hdfs dfs -put empdata.txt /user/raman/pig_directory/
- $ pig -x mapreduce
- grunt> emply = LOAD '/user/raman/pig_directory/empdata.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,location:chararray);
- grunt> describe emply;
- grunt> cogrpdata = COGROUP student BY location, emply BY location;
- grunt> describe cogrpdata;
- grunt> dump cogrpdata;
- # the JOIN operator is used to combine records from two or more relations
- # while performing a JOIN operation, we declare one field (or a group of fields) from each relation as the key
- # when these keys match, the corresponding tuples are matched
- # types of JOIN:
- 1. self join
- 2. inner join / equi join
- 3. outer join (left, right, full)
- # syntax: JOIN <relation1> BY <column1>, <relation2> BY <column2>;
- grunt> stemp = JOIN student BY location, emply BY location; # inner join
- grunt> describe stemp;
- grunt> dump stemp;
- (7,Ragunath,Chandran,9848022334,Chennai,6,Emply6,22,Chennai)
- (6,Archana,Mishra,9848022335,Chennai,6,Emply6,22,Chennai)
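- Only the inner join is shown above; the other types from the list follow the same pattern. A sketch of a left outer join and a self join on the same relations:

```pig
-- sketch: a left outer join keeps every student, with nulls where no employee matches
stleft = JOIN student BY location LEFT OUTER, emply BY location;
DUMP stleft;
-- a self join needs the same data loaded under a second alias
student2 = LOAD '/user/raman/pig_directory/student_data.txt' USING PigStorage(',')
             AS (id:int,firstname:chararray,lastname:chararray,contactno:chararray,location:chararray);
sself = JOIN student BY location, student2 BY location;
```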
- # Apache Pig supports various data types, for example:
- int - integer
- chararray - string (array of characters)
- (other simple types include long, float, double, boolean and bytearray)
- ---------------------------------------------------------------------------------------------------------------------------
- Assignment: *.xl
- Apache Pig (Advance Commands)
- # Referencing fields
- In Apache Pig, fields are referenced using positional parameters - $0, $1,
- and so on.
- $0 - 1st field, $1 - 2nd field
- Fields in apache pig can also be accessed by name.
- # FOREACH
- Normally, FOREACH is used after grouping.
- Syntax:
- X = FOREACH student GENERATE name,$2; # GENERATE is like SELECT
- Y = FOREACH student GENERATE $2-$1; # difference of the 3rd and 2nd fields
- In Apache Pig, -- is used for comments.
- $ pig -x mapreduce
- grunt> sh cat file.txt;
- 1,3,5
- 1,2,3
- 3,1,2
- 3,1,4
- grunt> student = LOAD '/user/raman/pig_directory/studentdata.txt' USING PigStorage(',') as (id:int,name:chararray,lastname:chararray,contact:chararray,location:chararray);
- grunt> X = FOREACH student GENERATE name,$2;
- grunt> describe X;
- grunt> X = FOREACH student GENERATE id*2;
- grunt> describe X;
- grunt> X = FOREACH student GENERATE id*2 as myid; # as (alias)
- grunt> describe X;
- Apache Pig supports a ternary operator (bincond). (use it when scripting)
- grunt> Y = FOREACH student GENERATE (id == 2 ? 1 : 4); # when id equals 2, output 1, otherwise 4; ?: is the ternary operator (an alternative to if-else)
- Apache pig supports complex data types:
- 1. Tuple: an ordered set of fields (19,2)
- 2. Bag: a collection of Tuples {(19,2),(18,1)}
- 3. Map: a set of key-value pairs [open#apache], [name#Raman,location#London]
- The data type of a field can itself be a Tuple,
- Bag or Map. Apache Pig supports a nested architecture, e.g. a Tuple within a Tuple.
- $ hdfs dfs -put data.txt /user/raman/pig_directory/data.txt
- $ pig -x mapreduce
- grunt> sh cat data.txt # Tuple
- (3,8,9) (4,5,6)
- (1,4,7) (3,7,5)
- grunt>
- Create relation using complex data types.
- grunt> A = LOAD '/user/raman/pig_directory/data.txt' USING PigStorage(' ') AS (t1:tuple(t1a:int,t1b:int,t1c:int), t2:tuple(t2a:int,t2b:int,t2c:int));
- grunt> describe A;
- A: {t1: (t1a: int,t1b: int,t1c: int),t2: (t2a: int,t2b: int,t2c: int)}
- grunt> -- Two fields are in relation A.
- grunt> DUMP A;
- ((3,8,9),(4,5,6))
- ((1,4,7),(3,7,5))
- grunt> X = FOREACH A GENERATE t1.t1a,t2.$0;
- grunt> describe X;
- grunt> DUMP X;
- (3,4)
- (1,3)
- grunt> DUMP A;
- Concept of Outer Bag/Inner Bag in Apache Pig:
- grunt> -- Outer Bag
- grunt> -- Outer Bag is always unnested
- grunt> -- A (f1:int, f2:int, f3:int)
- grunt> -- (1,2,3)
- grunt> -- (4,2,1)
- grunt> -- (8,3,4)
- grunt> -- (4,3,3)
- grunt> -- X = GROUP A BY f1;
- grunt> -- Contents of X will be:
- grunt> -- (1,{(1,2,3)})
- grunt> -- (4,{(4,2,1),(4,3,3)})
- grunt> -- (8,{(8,3,4)})
- grunt> -- X is a relation, i.e. a bag of tuples. The inner bags are {(1,2,3)}, {(4,2,1),(4,3,3)} and {(8,3,4)}
- FILTER is the WHERE condition of SQL, with some differences:
- grunt> student = LOAD '/user/raman/pig_directory/studentdata.txt' USING PigStorage(',') as (id:int,name:chararray,lastname:chararray,contact:chararray,location:chararray);
- grunt> -- conditional operators: not equal to (!=), equal to (==), greater than (>), greater than equal to (>=), less than (<), less than equal to (<=)
- grunt> schennai = FILTER student BY location == 'Chennai'; [select * from student where location = 'Chennai']
- grunt> describe schennai;
- grunt> DUMP schennai;
- 006,Archana,Mishra,9848022335,Chennai
- 007,Ragunath,Chandran,9848022334,Chennai
- grunt>
- grunt> -- display only id and location for location == 'Chennai' (assignment)
- grunt>
- grunt> nchennai = FILTER student BY NOT location == 'Chennai'; [select * from student where location != 'Chennai']
- grunt> DUMP nchennai;
- 001,Rajiv,Reddy,9848022337,Hyderabad
- 002,Siddarth,Battacharya,9848022338,Kolkata
- 003,Rajesh,Khanna,9848022339,Delhi
- 004,Preethi,Agarwal,9848022330,Pune
- 005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
- grunt> schennai = FILTER student BY location matches 'Chennai'; [select * from student where location = 'Chennai']
- grunt> nchennai = FILTER student BY NOT location matches 'Chennai'; [select * from student where location != 'Chennai']
- grunt> FILTER student BY location matches '.*Pradesh.*'; (matches uses Java regular expressions, so the wildcard is .*)
- grunt> gst = GROUP student BY location;
- grunt> DUMP gst;
- grunt> gstdisplay = FOREACH gst GENERATE $1; -- second field
- grunt> DUMP gstdisplay;
- grunt> gstdisplay = FOREACH gst GENERATE group, COUNT($1); # word count/frequency (mapreduce) (GENERATE group means on which field you have to group)
- grunt> DUMP gstdisplay;
- grunt> -- select location count(*) as count from student group by location; (MySQL)
- grunt> -- grouping can be done on more than one field
- grunt> GROUP student BY (location, name)
- grunt> -- Apache Pig in-built functions: AVG, SUM, MAX, MIN
- grunt> -- NYSE (csv format)
- grunt> -- daily = LOAD 'NYSE_daily' as (exchange, stock, date, dividends)
- grunt> -- group by exchange and stock and find average value of dividends for each exchange and stock, store into output folder
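- A hedged sketch of the NYSE assignment above (the file path, delimiter, field types and output folder are assumptions):

```pig
-- sketch of the NYSE assignment (path, delimiter and names are assumptions)
daily = LOAD '/user/raman/pig_directory/NYSE_daily' USING PigStorage(',')
          AS (exchange:chararray, stock:chararray, date:chararray, dividends:float);
grpd  = GROUP daily BY (exchange, stock);            -- group on two fields
avgd  = FOREACH grpd GENERATE group, AVG(daily.dividends);
STORE avgd INTO '/user/raman/pig_directory/average_dividend';
```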
- grunt> -- order (sorting a relation)
- grunt> sorter = ORDER student BY location DESC; (default is ascending order)
- grunt> -- sorting can be done on more than one field
- grunt> sorter = ORDER student BY location, name DESC; (location in ascending order and name in descending order)
- grunt> -- DISTINCT (unique records - removes duplicates)
- grunt> uniq = DISTINCT student;
- grunt> -- select * from student limit 2;
- grunt> s2 = limit student 2; # display first two records
- grunt> dump s2;
- grunt> -- we can generate a random sample from an apache pig relation
- grunt> -- the sample size is given as a fraction (probability) of the records
- grunt> some = SAMPLE student 0.5; -- roughly 50%
- grunt> dump some;
- grunt> -- to increase the number of reducers
- grunt> -- PARALLEL 10; (number of reduce tasks)
- grunt> gst = GROUP student BY location PARALLEL 10; (faster execution via parallel computation)
- grunt> -- 10 reducers will be activated
- grunt> -- In Apache Pig, each and every command is executed in mapreduce mode only if using pig -x mapreduce
- grunt> -- parallel is used in pig -x mapreduce not in pig -x local
- grunt> -- FLATTEN - unnests a tuple/bag
- grunt> -- for a tuple, FLATTEN substitutes the fields of the tuple in place of the tuple
- grunt> -- suppose relation R contains:
- grunt> -- (a,b,c)
- grunt> -- (d,e,f)
- grunt> -- grouping on the first field, X = GROUP R BY $0, gives:
- grunt> -- (a,{(a,b,c)})
- grunt> -- (d,{(d,e,f)})
- grunt> -- FOREACH X GENERATE $1; gives the inner bag:
- grunt> -- ({(a,b,c)})
- grunt> -- FOREACH X GENERATE $0; gives the group key:
- grunt> -- (a)
- grunt> -- FOREACH X GENERATE group, flatten($1); # unnests the inner bag into top-level fields
- grunt> -- (a,a,b,c)
- grunt> -- CROSS R1, R2; # cross product of two relations
- grunt> -- 1st record of relation1 with all records of relation2, 2nd record of relation1 with all records of relation2, ...
- grunt> -- UNION R1, R2; # merge R1 and R2 into one relation (like rbind / combining rows)
- grunt> -- SPLIT student INTO S1 IF location == 'Chennai', S2 IF location != 'Chennai'; # the opposite of UNION (bifurcates a relation)
- grunt> -- tokenize # self reading
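- TOKENIZE (left above for self reading) pairs naturally with FLATTEN for the classic word count; a sketch, assuming a hypothetical text file lines.txt on HDFS:

```pig
-- sketch: word count with TOKENIZE + FLATTEN (the input file is an assumption)
lines  = LOAD '/user/raman/pig_directory/lines.txt' AS (line:chararray);
words  = FOREACH lines GENERATE flatten(TOKENIZE(line)) AS word; -- one word per tuple
grpw   = GROUP words BY word;
counts = FOREACH grpw GENERATE group, COUNT(words);              -- frequency per word
DUMP counts;
```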
- Assignment hint:
- data type date for date
- # Execution of pig scripts (batch mode)
- pig -x mapreduce file.pig