43 Apache Pig Interview Questions And Answers For Experienced

Coding Compiler shares a list of 43 Apache Pig interview questions for experienced candidates. These questions and answers have been asked in various interviews, and the list will help you prepare for your next Pig job interview. All the best, and happy learning.

Apache Pig Interview Questions

Here is the list of Apache Pig interview questions.

What is Apache Pig?
Why is Pig used in Hadoop?
What is the primary purpose of Pig in the Hadoop architecture?
What are the execution modes in Apache Pig?
What is Interactive Mode in Apache Pig?
What are Pig Scripts?
How do you write comments in Pig Scripts?
Do Pig Scripts support distributed file systems?
What is a Kerberos-secured cluster in Apache Pig?
How do you run Pig scripts on a Kerberos-secured cluster?
What are Pig Latin statements?
How do you organize the Pig Latin statements?
What are Pig properties?
How do you run the Pig scripts in local mode?
What is a bag in Apache Pig?

Apache Pig Interview Questions And Answers

1) What is Apache Pig?

A) Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig Latin is a SQL-like scripting language. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.

2) Why is Pig used in Hadoop?

A) Pig is a high-level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java.

3) What is the primary purpose of Pig in the Hadoop architecture?

A) Pig is used to analyze data in Hadoop. Its language, Pig Latin, is a high-level data processing language that provides a rich set of data types and operators for performing various operations on the data.

4) What are the execution modes in Apache Pig?

A) Pig has six execution modes, or exectypes:

Local Mode
Tez Local Mode
Spark Local Mode
Mapreduce Mode
Tez Mode
Spark Mode

1) Local Mode – To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).

2) Tez Local Mode – Tez local mode is similar to local mode, except that internally Pig invokes the Tez runtime engine. Specify Tez local mode using the -x flag (pig -x tez_local).

Note: Tez local mode is experimental. Some queries simply error out on larger data in local mode.

3) Spark Local Mode – Spark local mode is similar to local mode, except that internally Pig invokes the Spark runtime engine. Specify Spark local mode using the -x flag (pig -x spark_local).

Note: Spark local mode is experimental. Some queries simply error out on larger data in local mode.

4) MapReduce Mode – To run Pig in MapReduce mode, you need access to a Hadoop cluster and HDFS installation. MapReduce mode is the default mode; you can, but don’t need to, specify it using the -x flag (pig OR pig -x mapreduce).

5) Tez Mode – To run Pig in Tez mode, you need access to a Hadoop cluster and HDFS installation. Specify Tez mode using the -x flag (-x tez).

6) Spark Mode – To run Pig in Spark mode, you need access to a Spark, YARN or Mesos cluster and HDFS installation. Specify Spark mode using the -x flag (-x spark). In Spark execution mode, it is necessary to set env::SPARK_MASTER to an appropriate value (local for local mode, yarn-client for yarn-client mode, mesos://host:port for Spark on Mesos, or spark://host:port for a Spark cluster).
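The six exectypes and their -x values can be summarized in a small Python helper. This is only an illustrative sketch: the function name pig_command and the constant EXEC_TYPES are hypothetical, and the helper just builds the command line rather than running Pig.

```python
# Exectype names as listed above; each one is passed to Pig's -x flag.
EXEC_TYPES = ("local", "tez_local", "spark_local", "mapreduce", "tez", "spark")


def pig_command(script, exectype="mapreduce"):
    """Return the argument list for running `script` in the given mode."""
    if exectype not in EXEC_TYPES:
        raise ValueError(f"unknown exectype: {exectype!r}")
    return ["pig", "-x", exectype, script]


print(" ".join(pig_command("script1-local.pig", "tez_local")))
```

A list like this could be handed to subprocess.run on a machine where Pig is installed; the default argument mirrors the fact that MapReduce mode is Pig's default.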

5) What is Interactive Mode in Apache Pig?

A) You can run Pig in interactive mode using the Grunt shell. Invoke the Grunt shell using the “pig” command and then enter your Pig Latin statements and Pig commands interactively at the command line.

6) What are Pig Scripts?

A) Pig scripts are used to place Pig Latin statements and Pig commands in a single file. While not required, it is good practice to identify the file using the *.pig extension.

You can run Pig scripts from the command line and from the Grunt shell using the run and exec commands.

Pig scripts allow you to pass values to parameters using parameter substitution.

7) How do you write comments in Pig Scripts?

A) You can include comments in Pig scripts in the following formats:

For multi-line comments use /* ... */

For single-line comments use --

8) Do Pig Scripts support distributed file systems?

A) Yes, Pig supports running scripts (and JAR files) that are stored in HDFS, Amazon S3, and other distributed file systems.

9) What is a Kerberos-secured cluster in Apache Pig?

A) Kerberos is an authentication system that uses tickets with a limited validity time. A Kerberos-secured cluster is a Hadoop cluster that uses these tickets to authenticate its users.

10) How do you run Pig scripts on a Kerberos-secured cluster?

A) Because Kerberos tickets have a limited validity time, a Pig script running on a Kerberos-secured Hadoop cluster can run for at most the remaining validity time of those tickets. For really complex analytics this may become a problem, as the job may need to run longer than the ticket lifetime allows.

Pig Interview Questions

11) What are Pig Latin statements?

A) Pig Latin statements are the basic constructs used to process data using Pig. A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. Pig Latin statements may include expressions and schemas.

12) How do you organize the Pig Latin statements?

A) Pig Latin statements are generally organized as follows:

A LOAD statement to read data from the file system.
A series of “transformation” statements to process the data.
A DUMP statement to view results or a STORE statement to save the results.
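The load, transform, store shape described above can be pictured with a small Python analogy. This is not Pig code: the sample input, the function names and the age threshold are all made up for illustration, and each function only stands in for the corresponding kind of Pig Latin statement.

```python
import csv
import io

# Hypothetical tab-separated input standing in for a file on disk.
RAW_INPUT = "John\t25\nNathan\t30\nMary\t17\n"


def load(text):
    """LOAD step: parse tab-separated lines into (name, age) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [(name, int(age)) for name, age in reader]


def keep_adults(records):
    """Transformation step: keep records whose age is at least 18."""
    return [rec for rec in records if rec[1] >= 18]


def store(records):
    """STORE step: serialize the records back to tab-separated text."""
    return "".join(f"{name}\t{age}\n" for name, age in records)


result = store(keep_adults(load(RAW_INPUT)))
print(result, end="")
```

Each stage takes a collection of tuples and returns a new one, which mirrors how a Pig Latin statement takes a relation as input and produces another relation as output.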

13) What are Pig properties?

A) Pig supports a number of Java properties that you can use to customize Pig behaviour.

14) How do you run the Pig scripts in local mode?

A) To run the Pig scripts in local mode, do the following:

Move to the pigtmp directory.
Execute the following command (using either script1-local.pig or script2-local.pig):

$ pig -x local script1-local.pig

Or, if you are using Tez local mode:

$ pig -x tez_local script1-local.pig

Or, if you are using Spark local mode:

$ pig -x spark_local script1-local.pig

15) What is a bag in Apache Pig?

A) A bag is a collection of tuples. When working with data in Pig we often group the data by one or more fields; the tuples that fall into each group are collected into a bag.

16) What is the difference between Pig and Hive?

A) The major differences between Pig and Hive are:

1) Hive is used mainly by data analysts, whereas Pig is generally used by researchers and programmers.

2) Hive is used for completely structured data, whereas Pig is used for semi-structured data.

17) What are complex data types in Pig?

A) Pig has three complex types: maps, tuples and bags. These complex types can contain scalar types and other complex types.

18) What is a Map in Pig?

A) A map is a chararray-to-data-element mapping expressed as key-value pairs. The key must always be of type chararray and can be used as an index to access the associated value. The values in a map do not all have to be of the same type.

Map constants are defined by square brackets, with ‘#’ separating keys from values and ‘,’ separating key-value pairs.

['Name'#'John', 'Age'#22]

19) What is a Tuple in Pig?

A) A tuple is a fixed-length, ordered collection of Pig data elements. Tuples contain fields, which may be of different Pig types. A tuple is analogous to a row in SQL, with the fields as columns.

Since tuples are ordered, it is possible to reference a field by its position in the tuple. A tuple can, but is not required to, declare a schema that describes each field’s data type and provides a name for the field.

Tuple constants use parentheses to define the tuple and commas to separate the fields.

('John', 25)

20) What is a Bag in Pig?

A) A bag is an unordered collection of tuples. Since bags are unordered, we cannot reference a tuple in a bag by its position. Bags are also not required to declare a schema. When a bag does declare one, the schema describes all the tuples in the bag.

Bag constants are constructed using braces, with commas separating the tuples inside the bag.

{('John', 25), ('Nathan', 30)}

The above constructs a bag with two tuples.
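For readers who think in a general-purpose language, the three complex types can be loosely compared to Python values. This is an analogy only, not how Pig represents data internally, and the variable names are hypothetical.

```python
# Map: chararray keys, values of any type (compare ['Name'#'John', 'Age'#22]).
person_map = {"Name": "John", "Age": 22}

# Tuple: fixed-length and ordered, so fields can be read by position.
person_tuple = ("John", 25)

# Bag: a collection of tuples. A Python list keeps insertion order, but a
# Pig bag guarantees no order, so the code below never relies on positions.
people_bag = [("John", 25), ("Nathan", 30)]

name_from_map = person_map["Name"]                  # look up a value by key
age_from_tuple = person_tuple[1]                    # second field, by position
ages_in_bag = sorted(age for _, age in people_bag)  # order-independent view

print(name_from_map, age_from_tuple, ages_in_bag)
```

The point of the last line is that anything computed from a bag should not depend on the order in which its tuples happen to arrive.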

Pig Interview Questions And Answers

21) When to use Hadoop, HBase, Hive and Pig?

A) MapReduce is just a computing framework, and HBase is independent of it. That said, you can efficiently put data into or fetch data from HBase by writing MapReduce jobs. Alternatively, you can write sequential programs using other HBase APIs, such as the Java API, to put or fetch the data. But Hadoop, HBase and similar tools are meant for gigantic amounts of data, and ordinary sequential programs become highly inefficient at that scale.

Hadoop is basically two things: a distributed file system (HDFS) and a computation or processing framework (MapReduce). Like any other file system, HDFS provides storage, but in a fault-tolerant manner with high throughput and a lower risk of data loss (because of replication). Being a file system, however, HDFS lacks random read and write access. This is where HBase comes into the picture. It is a distributed, scalable big data store modelled after Google’s BigTable, and it stores data as key/value pairs.

Hive provides data warehousing facilities on top of an existing Hadoop cluster. It also provides an SQL interface, which makes your work easier if you come from an SQL background. You can create tables in Hive and store data there, and you can even map your existing HBase tables to Hive and operate on them.

Pig is basically a dataflow language that lets you process enormous amounts of data easily and quickly. Pig has two parts: the Pig interpreter and the language, Pig Latin. You write Pig scripts in Pig Latin and process them using the Pig interpreter. Pig makes life a lot easier, because writing MapReduce directly is not always easy and in some cases can become a real pain.

Both Hive and Pig queries get converted into MapReduce jobs under the hood.

22) What is the relationship between HDFS, HBase, Pig, Hive and Azkaban?

A) A Hadoop environment can contain all of these components (HDFS, HBase, Pig, Hive, Azkaban). A short description of each:

HDFS – HDFS is Hadoop’s Distributed File System. Intuitively, you can think of it as a filesystem that spans many servers.

HBase – A column-oriented database, in which data is stored by column for faster access. It uses HDFS as its storage.

Pig – A data flow language. Its community has provided built-in functions to load and process semi-structured data such as JSON and XML, along with structured data.

Hive – A query language for running queries over tables. HDFS data must first be mapped to a table before it can be queried.

Azkaban – If you have a pipeline of Hadoop jobs, Azkaban lets you schedule them to run at specific times and before or after some dependency.

At the highest level, you can think of HDFS as your filesystem and HBase as your datastore. Pig and Hive are your means of querying the datastore, and Azkaban is your way of scheduling jobs.

23) Explain BagToString in Pig?

A) BagToString creates a single string from the elements of a bag, similar to SQL’s GROUP_CONCAT function.

Syntax: BagToString(vals:bag [, delimiter:chararray])
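The behaviour can be pictured with a small Python sketch. It is an approximation under stated assumptions: the function name bag_to_string is hypothetical, only bags of single-field tuples are handled, and the underscore default delimiter reflects my reading of the Pig documentation, so check it against your Pig version.

```python
def bag_to_string(vals, delimiter="_"):
    """Join the single field of each tuple in the bag into one string.

    Assumption: every tuple in `vals` has exactly one field.
    """
    if vals is None:
        return None
    return delimiter.join(str(tup[0]) for tup in vals)


teams = [("BOS",), ("NYA",), ("BAL",)]
print(bag_to_string(teams))
print(bag_to_string(teams, ";"))
```

Because a bag is unordered, the order of the joined values is only predictable in Pig if the bag has been explicitly ordered first; the Python list here simply keeps insertion order.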

24) Explain BagToTuple?

A) BagToTuple creates a tuple from the elements of a bag. It removes only the first level of nesting; it does not recursively un-nest nested bags. Unlike FLATTEN, BagToTuple will not generate multiple output records per input record.

Syntax: BagToTuple(expression)
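A rough Python model of the "first level only" behaviour is shown below. The function name bag_to_tuple is hypothetical, and lists of tuples stand in for bags.

```python
def bag_to_tuple(bag):
    """Concatenate the fields of every tuple in the bag into one tuple.

    Only the first level of nesting is removed: a field that is itself
    a bag (a list here) is copied into the result unchanged.
    """
    return tuple(field for tup in bag for field in tup)


flat = bag_to_tuple([("a",), ("b",), ("c",)])
nested = bag_to_tuple([("a", [("x",), ("y",)]), ("b",)])
print(flat)
print(nested)
```

Whatever the size of the input bag, the function returns exactly one tuple, which is the contrast with FLATTEN drawn above.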

25) Explain Bloom?

A) Bloom filters are a common way to select a limited set of records before moving data for a join or other heavyweight operation.

Syntax: BuildBloom(String hashType, String mode, String vectorSize, String nbHash)

Bloom(String filename)
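The idea behind a Bloom filter can be sketched in a few lines of Python. This is a generic textbook Bloom filter, not Pig's BuildBloom implementation; the class name, the SHA-256 based hashing and the sample records are all assumptions made for illustration. The parameter names vector_size and nb_hash merely echo the vectorSize and nbHash arguments in the syntax above.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, vector_size=1024, nb_hash=3):
        self.vector_size = vector_size
        self.nb_hash = nb_hash
        self.bits = [False] * vector_size

    def _positions(self, key):
        # Derive nb_hash bit positions by salting the key with a seed.
        for seed in range(self.nb_hash):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.vector_size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


# Build the filter from the keys of the small side of a join...
bloom = BloomFilter()
for key in ["u1", "u7"]:
    bloom.add(key)

# ...then use it to discard most non-matching records on the large side.
large_side = [("u1", "click"), ("u2", "view"), ("u7", "click"), ("u9", "view")]
candidates = [rec for rec in large_side if bloom.might_contain(rec[0])]
print(candidates)
```

Records whose key was added always pass the filter. A record with an unseen key may occasionally pass too, which is why the real join still has to run on the reduced set.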

26) Explain COUNT_STAR?

A) Use the COUNT_STAR function to compute the number of elements in a bag. COUNT_STAR requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts.

Syntax: COUNT_STAR(expression)
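A short Python sketch can make the global and per-group counts concrete. It models bags as lists of tuples and nulls as None; the function names and sample records are hypothetical. The contrast with COUNT (which skips tuples whose first field is null, while COUNT_STAR counts them) is not stated above and reflects my understanding of the Pig documentation.

```python
from collections import defaultdict


def count_star(bag):
    """Count every tuple in the bag, including those with null fields."""
    return len(bag)


def count(bag):
    """Count only tuples whose first field is not null."""
    return sum(1 for tup in bag if len(tup) > 0 and tup[0] is not None)


records = [("alice", 20), (None, 22), ("bob", 20)]

# GROUP ALL: every record lands in one bag, giving a global count.
global_star = count_star(records)
global_count = count(records)

# GROUP BY age: one bag per distinct age, giving per-group counts.
by_age = defaultdict(list)
for rec in records:
    by_age[rec[1]].append(rec)
group_counts = {age: count_star(bag) for age, bag in by_age.items()}

print(global_star, global_count, group_counts)
```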

27) Explain PluckTuple?

A) PluckTuple allows the user to specify a string prefix, and then filter for the columns in a relation that begin with that prefix or match that regex pattern.

Syntax: DEFINE pluck PluckTuple(expression1)

DEFINE pluck PluckTuple(expression1,expression3)

28) Explain TOKENIZE?

A) Use the TOKENIZE function to split a string of words (all words in a single tuple) into a bag of words (each word in a single tuple).

Syntax: TOKENIZE(expression [, 'field_delimiter'])
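The effect of TOKENIZE can be approximated in Python as follows. This is a sketch under assumptions: the function name tokenize is hypothetical, the default separator set (space, double quote, comma, parentheses, star) follows my reading of the Pig documentation, and treating every character of a custom delimiter as a separator is likewise an assumption to verify against your Pig version.

```python
import re

# Assumed default separators: space, double quote, comma, parentheses, star.
DEFAULT_DELIMITERS = ' ",()*'


def tokenize(text, field_delimiter=None):
    """Split a string into a bag (list) of single-field tuples, one per word."""
    if text is None:
        return None
    delimiters = DEFAULT_DELIMITERS if field_delimiter is None else field_delimiter
    # Build a character class so that any one of the characters separates words.
    pattern = "[" + re.escape(delimiters) + "]+"
    return [(word,) for word in re.split(pattern, text) if word]


print(tokenize("Here is the first string."))
print(tokenize("a||b||c", "||"))
```

Each word ends up in its own one-field tuple, which matches the "each word in a single tuple" description above.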

29) How do you handle compression in Pig?

A) Support for compression is determined by the load/store function. PigStorage and TextLoader support gzip and bzip compression for both read (load) and write (store). BinStorage does not support compression.

Interview Questions On Apache Pig

30) What is the use of BinStorage?

A) Pig uses BinStorage to load and store the temporary data that is generated between multiple MapReduce jobs.

BinStorage works with data that is represented on disk in machine-readable format. BinStorage does NOT support compression.
BinStorage supports multiple locations (files, directories, globs) as input.

31) Explain PigDump function?

A) PigDump stores data as tuples in human-readable UTF-8 format.

Syntax: PigDump()

32) Explain JsonLoader, JsonStorage functions in Pig?

A) Use JsonLoader to load JSON data.

Use JsonStorage to store JSON data.

Syntax: JsonLoader( ['schema'] )

JsonStorage( )

33) Explain PigStorage function?

A) PigStorage is the default function used by Pig to load/store the data. PigStorage supports structured text files (in human-readable UTF-8 format) in compressed or uncompressed form. All Pig data types (both simple and complex) can be read/written using this function. The input data to the load can be a file, a directory or a glob.

Syntax: PigStorage( [field_delimiter] , ['options'] )

34) Explain TextLoader function?

A) TextLoader works with unstructured data in UTF-8 format. Each resulting tuple contains a single field with one line of input text. TextLoader supports compression, although that support is currently limited. TextLoader cannot be used to store data.

Syntax: TextLoader()

35) Explain HBaseStorage function?

A) HBaseStorage stores and loads data from HBase. The function takes two arguments. The first argument is a space separated list of columns. The second optional argument is a space separated list of options.

Syntax: HBaseStorage('columns', ['options'])

36) Explain AvroStorage function?

A) AvroStorage stores and loads data from Avro files. Often, you can load and store data using AvroStorage without knowing much about the Avro serialization format. AvroStorage will attempt to automatically translate a Pig schema and Pig data to Avro data, or Avro data to Pig data.

Syntax: AvroStorage(['schema|record name'], ['options'])

37) Explain TOTUPLE function?

A) Use the TOTUPLE function to convert one or more expressions to a tuple.

Syntax: TOTUPLE(expression [, expression …])

38) Explain TOBAG function?

A) Use the TOBAG function to convert one or more expressions to individual tuples which are then placed in a bag.

Syntax: TOBAG(expression [, expression …])

39) Explain TOMAP function?

A) Use the TOMAP function to convert pairs of expressions into a map.

Syntax: TOMAP(key-expression, value-expression [, key-expression, value-expression …])
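The three conversion functions can be compared side by side with a small Python analogy. The function names are hypothetical, and tuples, lists and dicts stand in for Pig tuples, bags and maps.

```python
def to_tuple(*expressions):
    """TOTUPLE analogy: all expressions become fields of one tuple."""
    return tuple(expressions)


def to_bag(*expressions):
    """TOBAG analogy: each expression becomes its own tuple inside a bag."""
    return [(expr,) for expr in expressions]


def to_map(*expressions):
    """TOMAP analogy: consecutive key, value pairs become map entries."""
    if len(expressions) % 2 != 0:
        raise ValueError("to_map needs an even number of arguments")
    return {expressions[i]: expressions[i + 1] for i in range(0, len(expressions), 2)}


print(to_tuple("John", 25))
print(to_bag("John", 25))
print(to_map("Name", "John", "Age", 25))
```

The same two inputs give one two-field tuple with to_tuple but two one-field tuples with to_bag, which is the main difference between TOTUPLE and TOBAG.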

40) Explain Hive UDF function?

A) Pig can invoke all types of Hive UDF, including UDF, GenericUDF, UDAF, GenericUDAF and GenericUDTF. Depending on the Hive UDF you want to use, you need to declare it in Pig with HiveUDF (handles UDF and GenericUDF), HiveUDAF (handles UDAF and GenericUDAF), or HiveUDTF (handles GenericUDTF).

Syntax: HiveUDF, HiveUDAF, and HiveUDTF share the same syntax.

HiveUDF(name[, constant parameters])

Interview Questions On Pig

41) What is Flatten in Pig?

A) The FLATTEN operator looks like a UDF syntactically, but it is actually an operator that changes the structure of tuples and bags in a way that a UDF cannot. Flatten un-nests tuples as well as bags.
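The two un-nesting behaviours can be modelled in Python as shown below. This is an analogy rather than Pig code: records are Python tuples, a bag is a list of tuples, and both function names are hypothetical.

```python
def flatten_tuple_field(record, index):
    """Un-nest a tuple-valued field in place: (a, (b, c)) -> (a, b, c)."""
    return record[:index] + tuple(record[index]) + record[index + 1:]


def flatten_bag_field(record, index):
    """Un-nest a bag-valued field: one output record per tuple in the bag."""
    return [record[:index] + tuple(tup) + record[index + 1:] for tup in record[index]]


one_record = flatten_tuple_field(("a", ("b", "c")), 1)
many_records = flatten_bag_field(("a", [("b", "c"), ("d", "e")]), 1)
print(one_record)
print(many_records)
```

Flattening a tuple keeps the record count the same and only widens the record, while flattening a bag can turn one input record into several output records, which is the structural change an ordinary UDF cannot make.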

42) How can you debug a pig script?

A) There are several methods for debugging a Pig script. A simple one is to execute the script one relation at a time and verify each result. The following commands are useful for debugging a Pig script.

DUMP – Use the DUMP operator to run (execute) Pig Latin statements and display the results to your screen.

ILLUSTRATE – Use the ILLUSTRATE operator to review how data is transformed through a sequence of Pig Latin statements. ILLUSTRATE allows you to test your programs on small datasets and get faster turnaround times.

EXPLAIN – Use the EXPLAIN operator to review the logical, physical, and MapReduce execution plans that are used to compute the specified relation.

DESCRIBE – Use the DESCRIBE operator to view the schema of a relation. You can view outer relations as well as relations defined in a nested FOREACH statement.

43) What is the use of ILLUSTRATE in Pig?

A) Use the ILLUSTRATE operator to review how data is transformed through a sequence of Pig Latin statements. ILLUSTRATE allows you to test your programs on small datasets and get faster turnaround times.
