Hadoop interview questions and answers
1.What Is the Hadoop Framework?
Hadoop is an open source framework written in Java. This framework is used to write software applications that process vast amounts of data.The base programming model of Hadoop is based on Google’s MapReduce.The framework works in-parallel on large clusters that may have thousands of computers in a single node, and multiple nodes per cluster. It also processes data in a reliable and fault-tolerant manner.
Hadoop is a platform that offers both distributed storage and computational capabilities. It was first conceived to fix a scalability issue that existed in Nutch, an open source crawler and search engine. At the time, Google had published papers that described its novel distributed Google File System (GFS), and Map-Reduce, a computational framework for parallel processing. The successful implementation of these papers’ concepts in Nutch resulted into being split into two separate projects, the second of which became Hadoop.
2.what is Components of Hadoop?
Storage unit– HDFS (NameNode, DataNode)
Processing framework– YARN (ResourceManager, NodeManager)
3. What are HDFS and YARN?
HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data as blocks in a distributed environment. It follows master and slave topology.
YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop, which manages resources and provides an execution environment to the processes.
4.What are the components of HDFS?
NameNode: NameNode is the master node in the distributed environment and it maintains the metadata information for the blocks of data stored in HDFS like block location, replication factors etc.
DataNode: DataNodes are the slave nodes, which are responsible for storing data in the HDFS. NameNode manages all the DataNodes.
5.What are the components of YARN?
ResourceManager: It receives the processing requests, and then passes the parts of requests to corresponding Node Managers accordingly, where the actual processing takes place. It allocates resources to applications based on the needs.
NodeManager: NodeManager is installed on every DataNode and it is responsible for execution of the task on every single DataNode.
6.What is MapReduce?
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. MapReduce jobs usually split the input data-set into independent chunks, while a Map task will process these chunks in a completely parallel manner on different nodes. The job of the framework is to sort the outputs of the maps. The reducer produces the final result with the help of the output from the previous step.
7.What are the modes in which Hadoop run?
Apache Hadoop runs in three modes:
Local (Standalone) Mode – Hadoop by default run in a single-node, non-distributed mode, as a single Java process. Local mode uses the local file system for input and output operation. It is also used for debugging purpose, and it does not support the use of HDFS. Further, in this mode, there is no custom configuration required for configuration files.
Pseudo-Distributed Mode – Just like the Standalone mode, Hadoop also runs on a single-node in a Pseudo-distributed mode. The difference is that each daemon runs in a separate Java process in this Mode. In Pseudo-distributed mode, we need configuration for all the four files mentioned above. In this case, all daemons are running on one node and thus, both Master and Slave node are the same.
Fully-Distributed Mode – In this mode, all daemons execute in separate nodes forming a multi-node cluster. Thus, it allows separate nodes for Master and Slave.
8.What are the features of Standalone (local) mode?
By default, Hadoop run in a single-node, non-distributed mode, as a single Java process. Local mode uses the local file system for input and output operation. One can also use it for debugging purpose. It does not support the use of HDFS. Standalone mode is suitable only for running programs during development for testing. Further, in this mode, there is no custom configuration required for configuration files. Configuration files are:
9.Explain the major difference between HDFS block and InputSplit.
In simple terms, block is the physical representation of data while split is the logical representation of data present in the block. Split acts a s an intermediary between block and mapper.
Suppose we have two blocks:
Block 1: ii nntteell
Block 2: Ii ppaatt
Now, considering the map, it will read first block from ii till ll, but does not know how to process the second block at the same time. Here comes Split into play, which will form a logical group of Block1 and Block 2 as a single block.
It then forms key-value pair using inputformat and records reader and sends map for further processing With inputsplit, if you have limited resources, you can increase the split size to limit the number of maps. For instance, if there are 10 blocks of 640MB (64MB each) and there are limited resources, you can assign ‘split size’ as 128MB. This will form a logical group of 128MB, with only 5 maps executing at a time.
However, if the ‘split size’ property is set to false, whole file will form one inputsplit and is processed by single map, consuming more time when the file is bigger.
10.What Are Edge Or Gateway Nodes?
Edge nodes are the interface between the Hadoop cluster and the outside network. For this reason, they’re sometimes referred to as gateway nodes. Most commonly, edge nodes are used to run client applications and cluster administration tools. Typically edge-nodes are kept separate from the nodes that contain Hadoop services such as HDFS, MapReduce, etc, mainly to keep computing resources separate. Edge nodes running within the cluster allow for centralized management of all the Hadoop configuration entries on the cluster nodes which helps to reduce the amount of administration needed to update the config files.
The fact is, given the limited security within Hadoop itself, even if your Hadoop cluster operates in a local- or wide-area network behind an enterprise firewall, you may want to consider a cluster-specific firewall to more fully protect non-public data that may reside in the cluster. In this deployment model, think of the Hadoop cluster as an island within your IT infrastructure — for every bridge to that island you should consider an edge node for security.
11. What is distributed cache?
Distributed Cache, in Hadoop, is a service by MapReduce framework to cache files when needed. Learn more in this MapReduce Tutorial now. Once a file is cached for a specific job, hadoop will make it available on each data node both in system and in memory, where map and reduce tasks are executing.Later, you can easily access and read the cache file and populate any collection (like array, hashmap) in your code.
12.What are the Benefits of using distributed cache?
It distributes simple, read only text/data files and/or complex types like jars, archives and others. These archives are then un-archived at the slave node.
Distributed cache tracks the modification timestamps of cache files, which notifies that the files should not be modified until a job is executing currently.
13.What are the features of Pseudo mode?
Just like the Standalone mode, Hadoop can also run on a single-node in this mode. The difference is that each Hadoop daemon runs in a separate Java process in this Mode. In Pseudo-distributed mode, we need configuration for all the four files mentioned above. In this case, all daemons are running on one node and thus, both Master and Slave node are the same.
The pseudo mode is suitable for both for development and in the testing environment. In the Pseudo mode, all the daemons run on the same machine.
14.What are the features of Fully-Distributed mode?
In this mode, all daemons execute in separate nodes forming a multi-node cluster. Thus, we allow separate nodes for Master and Slave.
We use this mode in the production environment, where ‘n’ number of machines forming a cluster. Hadoop daemons run on a cluster of machines. There is one host onto which NameNode is running and the other hosts on which DataNodes are running. Therefore, NodeManager installs on every DataNode. And it is also responsible for the execution of the task on every single DataNode.
The ResourceManager manages all these NodeManager. ResourceManager receives the processing requests. After that, it passes the parts of the request to corresponding NodeManager accordingly.
15.What are configuration files in Hadoop?
Core-site.xml – It contain configuration setting for Hadoop core such as I/O settings that are common to HDFS & MapReduce. It use Hostname and port .The most commonly used port is 9000.
hdfs-site.xml – This file contains the configuration setting for HDFS daemons. hdfs-site.xml also specify default block replication and permission checking on HDFS.
mapred-site.xml – In this file, we specify a framework name for MapReduce. we can specify by setting the mapreduce.framework.name.
yarn-site.xml – This file provide configuration setting for NodeManager and ResourceManager.
16.What are the limitations of Hadoop?
Various limitations of Hadoop are:
Issue with small files – Hadoop is not suited for small files. Small files are the major problems in HDFS. A small file is significantly smaller than the HDFS block size (default 128MB). If you are storing these large number of small files, HDFS can’t handle these lots of files. As HDFS works with a small number of large files for storing data sets rather than larger number of small files. If one use the huge number of small files, then this will overload the namenode. Since namenode stores the namespace of HDFS.
HAR files, Sequence files, and Hbase overcome small files issues.
Processing Speed – With parallel and distributed algorithm, MapReduce process large data sets. MapReduce performs the task: Map and Reduce. MapReduce requires a lot of time to perform these tasks thereby increasing latency. As data is distributed and processed over the cluster in MapReduce. So, it will increase the time and reduces processing speed.
Support only Batch Processing – Hadoop supports only batch processing. It does not process streamed data and hence, overall performance is slower. MapReduce framework does not leverage the memory of the cluster to the maximum.
Iterative Processing – Hadoop is not efficient for iterative processing. As hadoop does not support cyclic data flow. That is the chain of stages in which the input to the next stage is the output from the previous stage.
Vulnerable by nature – Hadoop is entirely written in Java, a language most widely used. Hence java been most heavily exploited by cyber-criminal. Therefore it implicates in numerous security breaches.
Security- Hadoop can be challenging in managing the complex application. Hadoop is missing encryption at storage and network levels, which is a major point of concern. Hadoop supports Kerberos authentication, which is hard to manage.
17.Explain the difference between NameNode, Checkpoint NameNode and BackupNode.
NameNode is the core of HDFS that manages the metadata – the information of what file maps to what block locations and what blocks are stored on what datanode. In simple terms, it’s the data about the data being stored. NameNode supports a directory tree-like structure consisting of all the files present in HDFS on a Hadoop cluster. It uses following files for namespace:
fsimage file- It keeps track of the latest checkpoint of the namespace.
edits file-It is a log of changes that have been made to the namespace since checkpoint.
Checkpoint NameNode has the same directory structure as NameNode, and creates checkpoints for namespace at regular intervals by downloading the fsimage and edits file and margining them within the local directory. The new image after merging is then uploaded to NameNode.
There is a similar node like Checkpoint, commonly known as Secondary Node, but it does not support the ‘upload to NameNode’ functionality.
Backup Node provides similar functionality as Checkpoint, enforcing synchronization with NameNode. It maintains an up-to-date in-memory copy of file system namespace and doesn’t require getting hold of changes after regular intervals. The backup node needs to save the current state in-memory to an image file to create a new checkpoint.
18.What are the most common Input Formats in Hadoop?
There are three most common input formats in Hadoop:
Text Input Format: It is default input format in Hadoop.
Key Value Input Format: used for plain text files where the files are broken into lines
Sequence File Input Format: used for reading files in sequence
19.Define DataNode and how does NameNode tackle DataNode failures?
DataNode stores data in HDFS; it is a node where actual data resides in the file system. Each datanode sends a heartbeat message to notify that it is alive. If the namenode does noit receive a message from datanode for 10 minutes, it considers it to be dead or out of place, and starts replication of blocks that were hosted on that data node such that they are hosted on some other data node.A BlockReport contains list of all blocks on a DataNode. Now, the system starts to replicate what were stored in dead DataNode.
The NameNode manages the replication of data blocksfrom one DataNode to other. In this process, the replication data transfers directly between DataNode such that the data never passes the NameNode.
20.What are the core methods of a Reducer?
The three core methods of a Reducer are:
setup(): this method is used for configuring various parameters like input data size, distributed cache.
public void setup (context)
reduce(): heart of the reducer always called once per key with the associated reduced task
public void reduce(Key, Value, context)
cleanup(): this method is called to clean temporary files, only once at the end of the task
public void cleanup (context)
21.What is SequenceFile in Hadoop?
Extensively used in MapReduce I/O formats, SequenceFile is a flat file containing binary key/value pairs. The map outputs are stored as SequenceFile internally. It provides Reader, Writer and Sorter classes. The three SequenceFile formats are:
Uncompressed key/value records.
Record compressed key/value records – only ‘values’ are compressed here.
Block compressed key/value records – both keys and values are collected in ‘blocks’ separately and compressed. The size of the ‘block’ is configurable.
22.What is Job Tracker role in Hadoop?
Job Tracker’s primary function is resource management (managing the task trackers), tracking resource availability and task life cycle management (tracking the taks progress and fault tolerance).
It is a process that runs on a separate node, not on a DataNode often.
Job Tracker communicates with the NameNode to identify data location.
Finds the best Task Tracker Nodes to execute tasks on given nodes.
Monitors individual Task Trackers and submits the overall job back to the client.
It tracks the execution of MapReduce workloads local to the slave node.
23.What is the use of RecordReader in Hadoop?
Since Hadoop splits data into various blocks, RecordReader is used to read the slit data into single record. For instance, if our input data is split like:
Row1: Welcome to
It will be read as “Welcome to Intellipaat” using RecordReader.
24.What is Speculative Execution in Hadoop?
One limitation of Hadoop is that by distributing the tasks on several nodes, there are chances that few slow nodes limit the rest of the program. Tehre are various reasons for the tasks to be slow, which are sometimes not easy to detect. Instead of identifying and fixing the slow-running tasks, Hadoop tries to detect when the task runs slower than expected and then launches other equivalent task as backup. This backup mechanism in Hadoop is Speculative Execution.
25.What happens if you try to run a Hadoop job with an output directory that is already present?
It will throw an exception saying that the output file directory already exists.
26.What are active and passive “NameNodes”?
In HA (High Availability) architecture, we have two NameNodes – Active “NameNode” and Passive “NameNode”.
Active “NameNode” is the “NameNode” which works and runs in the cluster.
Passive “NameNode” is a standby “NameNode”, which has similar data as active “NameNode”.
When the active “NameNode” fails, the passive “NameNode” replaces the active “NameNode” in the cluster. Hence, the cluster is never without a “NameNode” and so it never fails.
27.Why does one remove or add nodes in a Hadoop cluster frequently?
One of the most attractive features of the Hadoop framework is its utilization of commodity hardware. However, this leads to frequent “DataNode” crashes in a Hadoop cluster. Another striking feature of Hadoop Framework is the ease of scale in accordance with the rapid growth in data volume. Because of these two reasons, one of the most common task of a Hadoop administrator is to commission (Add) and decommission (Remove) “Data Nodes” in a Hadoop Cluster.
28.What happens when two clients try to access the same file in the HDFS?
HDFS supports exclusive writes only.
When the first client contacts the “NameNode” to open the file for writing, the “NameNode” grants a lease to the client to create this file. When the second client tries to open the same file for writing, the “NameNode” will notice that the lease for the file is already granted to another client, and will reject the open request for the second client.
29.What is a checkpoint?
In brief, “Checkpointing” is a process that takes an FsImage, edit log and compacts them into a new FsImage. Thus, instead of replaying an edit log, the NameNode can load the final in-memory state directly from the FsImage. This is a far more efficient operation and reduces NameNode startup time. Checkpointing is performed by Secondary NameNode.
30.How is HDFS fault tolerant?
When data is stored over HDFS, NameNode replicates the data to several DataNode. The default replication factor is 3. You can change the configuration factor as per your need. If a DataNode goes down, the NameNode will automatically copy the data to another node from the replicas and make the data available. This provides fault tolerance in HDFS.
31.Can NameNode and DataNode be a commodity hardware?
The smart answer to this question would be, DataNodes are commodity hardware like personal computers and laptops as it stores data and are required in a large number. But from your experience you can tell that, NameNode is the master node and it stores metadata about all the blocks stored in HDFS. It requires high memory (RAM) space, so NameNode needs to be a high-end machine with good memory space.
32.Why do we use HDFS for applications having large data sets and not when there are a lot of small files?
HDFS is more suitable for large amounts of data sets in a single file as compared to small amount of data spread across multiple files. As you know, the NameNode stores the metadata information regarding file system in the RAM. Therefore, the amount of memory produces a limit to the number of files in my HDFS file system. In other words, too much of files will lead to generation of too much meta data. And, storing these meta data in the RAM will become a challenge. As a thumb rule, metadata for a file, block or directory takes 150 bytes.
33.How do you define “block” in HDFS? What is the default block size in Hadoop 1 and in Hadoop 2? Can it be changed?
Blocks are the nothing but the smallest continuous location on your hard drive where data is stored. HDFS stores each as blocks, and distribute it across the Hadoop cluster. Files in HDFS are broken down into block-sized chunks, which are stored as independent units.
Hadoop 1 default block size: 64 MB
Hadoop 2 default block size: 128 MB
Yes, blocks can be configured. The dfs.block.size parameter can be used in the hdfs-site.xml file to set the size of a block in a Hadoop environment.
34.What does ‘jps’ command do?
The ‘jps’ command helps us to check if the Hadoop daemons are running or not. It shows all the Hadoop daemons i.e namenode, datanode, resourcemanager, nodemanager etc. that are running on the machine.
35.How do you define “Rack Awareness” in Hadoop?
Rack Awareness is the algorithm in which the “NameNode” decides how blocks and their replicas are placed, based on rack definitions to minimize network traffic between “DataNodes” within the same rack. Let’s say we consider replication factor 3 (default), the policy is that “for every block of data, two copies will exist in one rack, third copy in a different rack”. This rule is known as the “Replica Placement Policy”.
36.What is “speculative execution” in Hadoop?
If a node appears to be executing a task slower, the master node can redundantly execute another instance of the same task on another node. Then, the task which finishes first will be accepted and the other one is killed. This process is called “speculative execution”.
37.How can I restart “NameNode” or all the daemons in Hadoop?
This question can have two answers, we will discuss both the answers. We can restart NameNode by following methods:
You can stop the NameNode individually using. /sbin /hadoop-daemon.sh stop namenode command and then start the NameNode using. /sbin/hadoop-daemon.sh start namenode command.
To stop and start all the daemons, use. /sbin/stop-all.sh and then use ./sbin/start-all.sh command which will stop all the daemons first and then start all the daemons.
38.What is the difference between an “HDFS Block” and an “Input Split”?
The “HDFS Block” is the physical division of the data while “Input Split” is the logical division of the data. HDFS divides data in blocks for storing the blocks together, whereas for processing, MapReduce divides the data into the input split and assign it to mapper function.
39.What are the main configuration parameters in a “MapReduce” program?
The main configuration parameters which users need to specify in “MapReduce” framework are:
Job’s input locations in the distributed file system
Job’s output location in the distributed file system
Input format of data
Output format of data
Class containing the map function
Class containing the reduce function
JAR file containing the mapper, reducer and driver classes
40.State the reason why we can’t perform “aggregation” (addition) in mapper? Why do we need the “reducer” for this?
This answer includes many points, so we will go through them sequentially.
We cannot perform “aggregation” (addition) in mapper because sorting does not occur in the “mapper” function. Sorting occurs only on the reducer side and without sorting aggregation cannot be done.
During “aggregation”, we need output of all the mapper functions which may not be possible to collect in the map phase as mappers may be running on different machine where the data blocks are stored.
And lastly, if we try to aggregate data at mapper, it requires communication between all mapper functions which may be running on different machines. So, it will consume high network bandwidth and can cause network bottlenecking.
41.What is the purpose of “RecordReader” in Hadoop?
The “InputSplit” defines a slice of work, but does not describe how to access it. The “RecordReader” class loads the data from its source and converts it into (key, value) pairs suitable for reading by the “Mapper” task. The “RecordReader” instance is defined by the “Input Format”.
42.How do “reducers” communicate with each other?
This is a tricky question. The “MapReduce” programming model does not allow “reducers” to communicate with each other. “Reducers” run in isolation.
43.What is a “Combiner”?
A “Combiner” is a mini “reducer” that performs the local “reduce” task. It receives the input from the “mapper” on a particular “node” and sends the output to the “reducer”. “Combiners” help in enhancing the efficiency of “MapReduce” by reducing the quantum of data that is required to be sent to the “reducers”.
44.What do you know about “SequenceFileInputFormat”?
“SequenceFileInputFormat” is an input format for reading within sequence files. It is a specific compressed binary file format which is optimized for passing the data between the outputs of one “MapReduce” job to the input of some other “MapReduce” job.
Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data that is passing from one MapReduce job to another.
45.What is the problem with small files in Hadoop?
Hadoop is not suited for small data. Hadoop HDFS lacks the ability to support the random reading of small files. Small file in HDFS is smaller than the HDFS block size (default 128 MB). If we are storing these huge numbers of small files, HDFS can’t handle these lots of files. HDFS works with the small number of large files for storing large datasets. It is not suitable for a large number of small files. A large number of many small files overload NameNode since it stores the namespace of HDFS.
46.How is security achieved in Hadoop?
Apache Hadoop achieves security by using Kerberos.
At a high level, there are three steps that a client must take to access a service when using Kerberos. Thus, each of which involves a message exchange with a server.
Authentication – The client authenticates itself to the authentication server. Then, receives a timestamped Ticket-Granting Ticket (TGT).
Authorization – The client uses the TGT to request a service ticket from the Ticket Granting Server.
Service Request – The client uses the service ticket to authenticate itself to the server.
47.How can you transfer data from Hive to HDFS?
By writing the query:
hive> insert overwrite directory ‘/’ select * from emp;
You can write your query for the data you want to import from Hive to HDFS. The output you receive will be stored in part files in the specified HDFS path.
48.How to compress mapper output but not the reducer output?
To achieve this compression, you should set:
49.How is indexing done in HDFS?
Hadoop has a unique way of indexing. Once Hadoop framework store the data as per the block size. HDFS will keep on storing the last part of the data which will say where the next part of the data will be. In fact, this is the base of HDFS.
50.Explain Hadoop Archives?
Apache Hadoop HDFS stores and processes large (terabytes) data sets. However, storing a large number of small files in HDFS is inefficient, since each file is stored in a block, and block metadata is held in memory by the namenode.
Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is inefficient data access pattern.
Hadoop Archive (HAR) basically deals with small files issue. HAR pack a number of small files into a large file, so, one can access the original files in parallel transparently (without expanding the files) and efficiently.
Hadoop Archives are special format archives. It maps to a file system directory. Hadoop Archive always has a *.har extension. In particular, Hadoop MapReduce uses Hadoop Archives as an Input.