pigref.txt

By default Pig works in mapreduce mode.
Apache Pig is an abstraction over mapreduce.
It is a tool used to analyze large data sets by representing them as data flows.
It is used with Hadoop (it works at the top layer of Hadoop). We can perform all data
manipulation in Hadoop using Apache Pig (an alternative to Java mapreduce).
To write data analysis programs, Pig provides a high-level language - Pig Latin. In Pig, we
use the Pig Latin language.
Why do we need Apache Pig?
- can perform mapreduce
- uses a multi-query approach, which reduces the length of code
- Pig Latin is an SQL-like language
- provides many built-in operators to support data operations like group by, join, etc.
- provides nested data types like tuples, bags and maps
Difference between Apache Pig and mapreduce:
- Apache Pig is a data flow language, whereas mapreduce is a data processing paradigm
- Apache Pig is a high-level language, whereas mapreduce is low level and rigid
- the join operation is simple in Pig, whereas in mapreduce it is difficult to implement
- Apache Pig can be used with basic knowledge of queries, whereas (plain) mapreduce requires
expertise in core Java
- Apache Pig uses a multi-query approach, whereas (plain) mapreduce does not offer one
- in Apache Pig there is no need for compilation; every Pig operator is internally converted into a
mapreduce job, but in mapreduce we must write everything ourselves (map function, reduce function, etc.)
In Pig, structured, semi-structured and unstructured data can be analyzed.
In plain mapreduce, only flat files are used.
Pig has features of a programming language as well as features of SQL.
schema - the database in which you are creating a table
Difference between Apache Pig and SQL:
- Pig is a procedural language, SQL is declarative
- in Pig a schema is optional, in SQL a schema is mandatory
- the data model in Pig is nested relational (within one relation we can put another relation), whereas the data model in SQL is flat relational
- Pig provides default query optimization, whereas in SQL we need to apply query optimization techniques ourselves
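The procedural vs declarative difference can be sketched with the same query in both styles (a sketch; the file, relation and field names are hypothetical):

```pig
-- SQL (declarative, a single statement):
--   SELECT location, COUNT(*) FROM student GROUP BY location;
-- Pig Latin (procedural, one step at a time):
student = LOAD 'student_data.txt' USING PigStorage(',')
          AS (id:int, name:chararray, location:chararray);
grouped = GROUP student BY location;
counts  = FOREACH grouped GENERATE group, COUNT(student);
DUMP counts;
```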
Difference between Apache Pig and Hive:
- Pig uses the Pig Latin language, whereas Hive uses Hive Query Language (HiveQL)
- Pig was created at Yahoo, whereas Hive was created at Facebook
- Pig Latin is procedural, whereas HiveQL is declarative
- Pig can handle all types of data (structured, semi-structured and unstructured), whereas Hive is mostly used
for structured data (because it is schema dependent)
Apache Pig:
- started in 2006 at Yahoo (its purpose was to execute a mapreduce job on every data set)
- in 2007 taken over by Apache
- in 2008 first release (Apache Pig)
- since 2010 in full existence as a top-level Apache project
In Apache Pig a program can be executed in three ways (everywhere the grunt shell is used):
1. interactive: grunt shell
2. user defined functions (a function is written in Python/Java and invoked from Pig)
3. embedded - pig script (extension of a pig file - .pig)
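A minimal pig script for the embedded/batch style might look like this (a sketch; the file names are hypothetical):

```pig
-- a.pig: load a file, filter it, and dump the result
A = LOAD 'file.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
B = FILTER A BY f1 > 1;
DUMP B;
```

It can then be run with pig -x local a.pig or pig -x mapreduce a.pig.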
Series of transformations/executions in Apache Pig:
1. Pig Latin Script (PLS)
PLS -> Apache Pig [Parser -> Optimizer -> Compiler -> Execution Engine] -> MapReduce -> HDFS
2. Parser:
- Initially the PLS is handled by the Parser.
- It checks the syntax of the script.
- It generates a DAG (Directed Acyclic Graph), which represents the Pig Latin statements and operators.
- The output of the Parser is a DAG.
- In the DAG, logical operators are represented as nodes (vertices) and data flows are represented as edges.
3. (after the Parser, control goes to the) Optimizer:
- The Optimizer takes the DAG (logical plan) as input.
- The logical plan (DAG) is passed to the logical optimizer, which optimizes all commands.
- Pig provides default query optimization.
4. Compiler:
- The Compiler compiles the optimized logical plan into a series of mapreduce jobs.
5. Execution Engine:
- The Execution Engine interacts with Hadoop.
- Finally the mapreduce jobs are submitted to Hadoop in sorted order, and these mapreduce jobs are executed
on the hadoop cluster, producing the desired output.
Pig Latin Data Model:
Atom - a field is known as an Atom (the smallest unit of the Pig Latin Data Model)
001, Rajiv, 21, Hyderabad
Tuple - a collection of fields is a tuple (record)
001 Rajiv 21 Hyderabad
Bag - a collection of tuples
001 Rajiv 21 Hyderabad
002 Omer 22 Kolkata
003 Rajesh 23 Delhi
In MySQL terms: a Table is a Bag, a Record is a Tuple, a Field is a Field (column name).
Atom:
- any single value in Pig Latin, irrespective of its data type, is known as an Atom
- the value within a field is called an Atom; the value together with its data type makes up a field
- it is stored as a string (examples: 'Delhi', '23')
Tuple ():
- an ordered set of fields (example: (Rajesh, 23))
- the ordering applies to the fields within a tuple, not to the tuples of a relation
- in a Tuple, fields can be of any type
- a Tuple is denoted within parentheses
Bag {}:
- a Bag is a collection of Tuples
- example: {(Rajesh, 23), (Omer, 22)}
- a Bag can be nested, i.e. a Bag can be a field in a relation
- a Bag can be a member of another Bag
- example: {Rajesh, 30, {999999999, rajesh@gmail.com}}
Map []:
- a Map is a set of key-value pairs
- the key needs to be of type chararray and should be unique
- whereas the value can be of any type
- a Map is represented by [ ]
- example: [name#Rajesh, age#23] ('name' is the field name, 'Rajesh' is its value, and similarly for the rest)
- in a Map the field name and the value are separated by the '#' symbol
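A sketch of loading and looking up a map field (the file name and data are hypothetical):

```pig
-- mapdata.txt contains lines such as: [name#Rajesh,age#23]
M = LOAD 'mapdata.txt' AS (info:map[]);
-- values are looked up with the # operator
N = FOREACH M GENERATE info#'name', info#'age';
DUMP N;
```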
Relation:
- a Relation is a Bag of Tuples
- Relations in Pig Latin are unordered
- we need to write a command (ORDER) to order one
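For example, a relation can be given an explicit order (a sketch; the file and field names follow the student example used later):

```pig
student = LOAD 'student_data.txt' USING PigStorage(',')
          AS (id:int, name:chararray, location:chararray);
-- ORDER imposes a sort order on the otherwise unordered relation
sorted = ORDER student BY id DESC;
DUMP sorted;
```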
Execution Mechanism: three modes
1. Interactive mode (Grunt shell - the shell of Pig):
- get output using the Dump operator
2. Batch mode (script - *.pig)
3. Embedded mode (user defined functions)
- Apache Pig provides the provision of defining our own functions in other programming languages (R, Python, Java, ...)
- these user defined functions are then called from Pig Latin
Installation:
# display version of Pig
pig -version
# execute Pig in local mode (independent of the Hadoop cluster)
pig -x local
pwd
# to quit
quit;
# batch mode locally
pig -x local a.pig
# Pig in mapreduce mode (linked to hadoop automatically)
pig -x mapreduce
pwd
# batch mode (mapreduce mode)
pig -x mapreduce a.pig
# two types of commands can be run in pig:
- shell commands: linux commands
- file system commands: hadoop cluster commands
# shell commands
sh ls
# file system commands
fs -ls
fs -ls /user/raman/
# clear screen
clear / Ctrl+L
# basic commands
pig -x mapreduce
fs -ls /user/raman/
fs -cat /user/raman/file.txt
# the LOAD command is used to convert a file on the cluster into a relation
grunt> a = LOAD 'hdfs://172.17.0.2:9000/user/raman/file.txt' USING PigStorage(' ');
(' ') - space separator
(':') - colon separator
grunt> DUMP a; # the DUMP command executes the program in mapreduce fashion and displays the content of file.txt
------------
Assignment: (.csv file - execute all commands with the file)
=====================
$ ssh <masternode>
$ jps
$ start-yarn.sh
$ jps
====================
Start Apache Pig in two modes:
1. local mode
2. mapreduce mode
For a hadoop cluster, start Apache Pig in mapreduce mode.
# Working with Apache Pig
By default all the log files of Pig are stored in the current working directory. Create a separate working
directory named /home/raman/Pig/ for Apache Pig.
Remove these log files from time to time in the directory /home/raman/Pig/ with the command rm *.log
# start Apache Pig in mapreduce mode
pig -x mapreduce
(the shell prompt of Apache Pig is grunt)
(type quit; to exit the grunt shell in Apache Pig)
# Modes of Apache Pig
1. Interactive Mode (grunt shell mode)
2. Batch Mode (pig script mode)
3. Embedded Mode (udf (user defined function) mode - defining functions in some programming language and calling them from a pig script)
# Apache Pig in batch mode
pig -x mapreduce file.pig # execute the script file.pig in mapreduce fashion
# copy a file from local to the hdfs cluster
hdfs dfs -put file.dat /user/raman/pig_directory/
# start Pig in mapreduce mode
pig -x mapreduce
# LOAD - load a data file from hdfs into a relation (pig format) on the Apache Pig server
grunt> A = LOAD 'hdfs://172.17.0.2:9000/user/raman/file.dat' USING PigStorage(',') AS (f1:int, f2:int, f3:int); # creates a relation 'A'
(in mapreduce mode always give the address of the master node: 'hdfs://172.17.0.2:9000/user/raman/file.dat')
(',') - field separator
(f1:int, f2:int, f3:int) - the number of fields should match the number of fields in the data file
# the DUMP command is used to display the output of a relation
grunt> DUMP A; # DUMP triggers execution of the statements defining 'A' and prints the result
# remove/delete a file from the hdfs cluster
hdfs dfs -rm /user/raman/pig_directory/file.dat
# running bash shell commands from the grunt shell
grunt> sh ls
# running hadoop cluster commands from the grunt shell
grunt> fs -ls
# pig is case-sensitive for relation and field names (keywords such as LOAD/DUMP are case-insensitive)
# file.dat
1,2,3
4,5,6
7,5,6
# GROUP in pig (GROUP in pig is similar to GROUP BY in MySQL)
grunt> B = GROUP A BY f1; # groups the tuples of A by field f1
grunt> DUMP B; # each result tuple holds the group key and a bag of the matching tuples
(1,{(1,2,3)})
(4,{(4,5,6)})
(7,{(7,5,6)})
grunt> DUMP A;
(1,2,3)
(4,5,6)
(7,5,6)
# GENERATE is similar to SELECT in MySQL
grunt> C = FOREACH B GENERATE COUNT($1); # take each tuple of B and count the elements in its bag ($1 is the bag of grouped tuples)
grunt> DUMP C;
(1)
(1)
(1)
# for viewing the error log file
grunt> quit;
$ vi -M pig_*.log
# vi student_data.txt
001,Rajiv,Reddy,9848022337,Hyderabad
002,Siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
007,Ragunath,Chandran,9848022334,Chennai
$ hdfs dfs -put student_data.txt /user/raman/pig_directory/
# Excluding LOAD and STORE, all commands return a relation. LOAD and
STORE copy and retrieve data; all other commands return a relation.
$ pig -x mapreduce
grunt> student = LOAD '/user/raman/pig_directory/student_data.txt' USING PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,contactno:chararray,location:chararray);
grunt> STORE student INTO '/user/raman/pig_directory/outputfile.txt' USING PigStorage(','); # copy data from pig to the hdfs cluster
# the load command copies a file from the hdfs cluster into pig
# the store command copies data from pig to the hdfs cluster
Output(s):
Successfully stored 6 records (5757980 bytes) in: "/user/raman/pig_directory/outputfile.txt"
grunt> fs -ls /user/raman/pig_directory/
grunt> fs -ls /user/raman/pig_directory/outputfile.txt
grunt> fs -cat /user/raman/pig_directory/outputfile.txt/part-m-00000
grunt> quit;
# LOAD simply loads the data into the specified relation in Apache Pig
# STORE simply stores the data from Apache Pig to the HDFS cluster
# Diagnostic operators in Apache Pig
1. DUMP
2. DESCRIBE
3. EXPLAIN
4. ILLUSTRATE
$ pig -x mapreduce
grunt> describe student; # shows the schema/data types of the relation (similar to describe in mysql)
grunt> explain student; # the explain operator displays the logical, physical and mapreduce execution plans of a relation
grunt> illustrate student; # illustrate gives the step-by-step execution of a sequence of statements on a small sample of the data
# it is good practice to write the commands in a script and execute the script in pig
grunt> groupdata = GROUP student BY location;
grunt> describe groupdata;
grunt> dump groupdata;
grunt> groupall = GROUP student ALL; # a single group named 'all' is created containing all the data, useful for counting all the elements
grunt> dump groupall;
# COGROUP is similar to GROUP. The only difference is that the GROUP operator works with one relation at a time, while COGROUP is used on two or more relations simultaneously
grunt> COGROUP <relation1> BY <field_of_relation1>, <relation2> BY <field_of_relation2>;
# empdata.txt
001,Emply1,22,Newyork
002,Emply2,23,Tokyo
003,Emply3,23,Kolkata
004,Emply4,25,London
005,Emply5,23,Pune
006,Emply6,22,Chennai
$ hdfs dfs -put empdata.txt /user/raman/pig_directory/
$ pig -x mapreduce
grunt> emply = LOAD '/user/raman/pig_directory/empdata.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,location:chararray);
grunt> describe emply;
grunt> cogrpdata = COGROUP student BY location, emply BY location;
grunt> describe cogrpdata;
grunt> dump cogrpdata;
# the JOIN operator is used to combine records from two or more relations
# while performing a JOIN operation, we declare one field (or a group of fields) from each relation as keys
# when these keys match, the two particular tuples are matched and joined
# types of JOIN:
1. self join
2. inner join / equi join
3. outer join
# syntax: JOIN <relation1> BY <column1>, <relation2> BY <column2>;
grunt> stemp = JOIN student BY location, emply BY location; # inner join
grunt> describe stemp;
grunt> dump stemp;
(7,Ragunath,Chandran,9848022334,Chennai,6,Emply6,22,Chennai)
(6,Archana,Mishra,9848022335,Chennai,6,Emply6,22,Chennai)
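Self join is listed above but never shown; in Pig a relation must be loaded twice (under two names) to join it with itself - a sketch using the student data:

```pig
s1 = LOAD '/user/raman/pig_directory/student_data.txt' USING PigStorage(',')
     AS (id:int, firstname:chararray, lastname:chararray, contactno:chararray, location:chararray);
s2 = LOAD '/user/raman/pig_directory/student_data.txt' USING PigStorage(',')
     AS (id:int, firstname:chararray, lastname:chararray, contactno:chararray, location:chararray);
-- join the relation with itself on id
selfjoin = JOIN s1 BY id, s2 BY id;
DUMP selfjoin;
```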
# Apache Pig supports various data types, for example:
int - integer
chararray - string
(others include long, float, double, boolean, datetime and bytearray)
---------------------------------------------------------------------------------------------------------------------------
Assignment: *.xl
Apache Pig (Advanced Commands)
# Referencing fields
In Apache Pig fields are referenced using positional parameters - $0, $1,
and so on.
$0 - 1st field, $1 - 2nd field
Fields in Apache Pig can also be accessed by name.
# FOREACH
Normally, FOREACH is used after grouping.
Syntax:
X = FOREACH student GENERATE name,$2; # GENERATE means SELECT
Y = FOREACH student GENERATE $2-$1; # difference of field 3 and field 2
In Apache Pig, -- is used to comment.
$ pig -x mapreduce
grunt> sh cat file.txt
1,3,5
1,2,3
3,1,2
3,1,4
grunt> student = LOAD '/user/raman/pig_directory/studentdata.txt' USING PigStorage(',') as (id:int,name:chararray,lastname:chararray,contact:chararray,location:chararray);
grunt> X = FOREACH student GENERATE name,$2;
grunt> describe X;
grunt> X = FOREACH student GENERATE id*2;
grunt> describe X;
grunt> X = FOREACH student GENERATE id*2 AS myid; # AS gives an alias
grunt> describe X;
Apache Pig supports the ternary operator (use it when scripting). It can only appear inside an expression, e.g. in GENERATE:
grunt> Y = FOREACH student GENERATE (id == 2 ? 1 : 4); -- when id equals 2 output 1, otherwise output 4; ? : is the ternary operator (an alternative to if/else)
Apache Pig supports complex data types:
1. Tuple: an ordered set of fields (19,2)
2. Bag: a collection of Tuples {(19,2),(18,1)}
3. Map: a set of key-value pairs [open#apache], [name#Raman,location#London]
The data type of a field can itself be a Tuple,
Bag or Map. Apache Pig supports a nested architecture, e.g. a Tuple within a Tuple.
$ hdfs dfs -put data.txt /user/raman/pig_directory/data.txt
$ pig -x mapreduce
grunt> sh cat data.txt # each line contains two tuples
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
grunt>
Create a relation using complex data types.
grunt> A = LOAD '/user/raman/pig_directory/data.txt' USING PigStorage(' ') AS (t1:tuple(t1a:int,t1b:int,t1c:int), t2:tuple(t2a:int,t2b:int,t2c:int));
grunt> describe A;
A: {t1: (t1a: int,t1b: int,t1c: int),t2: (t2a: int,t2b: int,t2c: int)}
grunt> -- Two fields are in relation A.
grunt> DUMP A;
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
grunt> X = FOREACH A GENERATE t1.t1a,t2.$0;
grunt> describe X;
grunt> DUMP X;
(3,4)
(1,3)
grunt> DUMP A;
Concept of Outer Bag/Inner Bag in Apache Pig:
grunt> -- Outer Bag
grunt> -- the Outer Bag is always unnested
grunt> -- A (f1:int, f2:int, f3:int)
grunt> -- (1,2,3)
grunt> -- (4,2,1)
grunt> -- (8,3,4)
grunt> -- (4,3,3)
grunt> -- X = GROUP A BY f1;
grunt> -- Contents of X will be:
grunt> -- (1,{(1,2,3)})
grunt> -- (4,{(4,2,1),(4,3,3)})
grunt> -- (8,{(8,3,4)})
grunt> -- X is a relation, i.e. an outer bag of tuples. The inner bags are {(1,2,3)}, {(4,2,1),(4,3,3)} and {(8,3,4)}
Filter is the WHERE condition of SQL, with some differences:
grunt> student = LOAD '/user/raman/pig_directory/studentdata.txt' USING PigStorage(',') as (id:int,name:chararray,lastname:chararray,contact:chararray,location:chararray);
grunt> -- conditional operators: not equal to (!=), equal to (==), greater than (>), greater than or equal to (>=), less than (<), less than or equal to (<=)
grunt> schennai = FILTER student BY location == 'Chennai'; -- [select * from student where location = 'Chennai']
grunt> describe schennai;
grunt> DUMP schennai;
(6,Archana,Mishra,9848022335,Chennai)
(7,Ragunath,Chandran,9848022334,Chennai)
grunt>
grunt> -- display only id and location for location == 'Chennai' (assignment)
grunt>
grunt> nchennai = FILTER student BY NOT location == 'Chennai'; -- [select * from student where location != 'Chennai']
grunt> DUMP nchennai;
(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,Siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
grunt> schennai = FILTER student BY location matches 'Chennai'; -- [select * from student where location = 'Chennai']
grunt> nchennai = FILTER student BY NOT location matches 'Chennai'; -- [select * from student where location != 'Chennai']
grunt> FILTER student BY location matches '.*Pradesh.*'; -- matches takes a Java regular expression
grunt> gst = GROUP student BY location;
grunt> DUMP gst;
grunt> gstdisplay = FOREACH gst GENERATE $1; -- the second field (the bag of grouped tuples)
grunt> DUMP gstdisplay;
grunt> gstdisplay = FOREACH gst GENERATE group, COUNT($1); # word count/frequency (mapreduce); 'group' refers to the field the grouping was done on
grunt> DUMP gstdisplay;
grunt> -- select location, count(*) as count from student group by location; (MySQL)
grunt> -- grouping can be done on more than one field
grunt> GROUP student BY (location, name);
grunt> -- Apache Pig in-built functions: AVG, SUM, MAX, MIN
grunt> -- NYSE (csv format)
grunt> -- daily = LOAD 'NYSE_daily' as (exchange, stock, date, dividends)
grunt> -- group by exchange and stock and find the average value of dividends for each exchange and stock, store into an output folder
grunt> -- ORDER (sorting a relation)
grunt> sorter = ORDER student BY location DESC; -- (default is ascending order)
grunt> -- sorting can be done on more than one field
grunt> sorter = ORDER student BY location, name DESC; -- (location in ascending order and name in descending order)
grunt> -- DISTINCT (unique records - removes duplicates)
grunt> u = DISTINCT student;
grunt> -- select * from student limit 2;
grunt> s2 = LIMIT student 2; # display the first two records
grunt> dump s2;
grunt> -- we can generate a random sample from an apache pig relation
grunt> -- the sample size is a fraction between 0 and 1
grunt> some = SAMPLE student 0.5; -- roughly 50% of the records
grunt> dump some;
grunt> -- to increase the number of reducers
grunt> -- PARALLEL 10; (number of reduce tasks)
grunt> g = GROUP student BY location PARALLEL 10; -- (faster execution/parallel computation)
grunt> -- 10 reducers will be activated
grunt> -- In Apache Pig, each and every command is executed as a mapreduce job only when using pig -x mapreduce
grunt> -- PARALLEL is used in pig -x mapreduce, not in pig -x local
grunt> -- FLATTEN - un-nests a tuple/bag
grunt> -- for a tuple, flatten substitutes the fields of the tuple in place of the tuple
grunt> -- (a,b,c)
grunt> -- (d,e,f)
grunt> -- group on the first field:
grunt> -- (a,{(a,b,c)})
grunt> -- (d,{(d,e,f)})
grunt> -- FOREACH grouped GENERATE $1;
grunt> -- ({(a,b,c)})
grunt> -- FOREACH grouped GENERATE $0;
grunt> -- (a)
grunt> -- FOREACH grouped GENERATE $0, flatten($1); # un-nests the inner bag
grunt> -- (a,a,b,c)
grunt> -- CROSS R1, R2; # cross-product of two relations
grunt> -- 1st record of relation1 with all records of relation2, 2nd record of relation1 with all records of relation2, ...
grunt> -- UNION R1, R2; # rbind (merge R1 and R2 into one relation) (combine)
grunt> -- SPLIT student INTO S1 IF location == 'Chennai', S2 IF location != 'Chennai'; # the opposite of UNION (bifurcate)
grunt> -- TOKENIZE # self reading
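TOKENIZE is left for self-study above; combined with FLATTEN it gives the classic word-count pattern (a sketch; the file name is hypothetical):

```pig
lines = LOAD 'text.txt' AS (line:chararray);
-- TOKENIZE splits each line into a bag of words; FLATTEN un-nests that bag
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;
```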
Assignment hint:
use the datetime data type for date fields
# Execution of pig scripts (batch mode)
pig -x mapreduce file.pig