Implementing Security in Hadoop Cluster


When we talk about security in Hadoop, we need to explore all aspects of cluster networking and understand how the nodes and clients communicate with each other.

Let’s list the possible communication paths in a simple cluster.

  • Master – Slave communication => Namenode – Datanode / Jobtracker – Tasktracker communication
  • Slave – Slave communication => a Datanode replicates blocks to other Datanodes
  • Client – Master communication => client to Namenode/Jobtracker
  • Client – Slave communication => the client copies data to Datanodes as directed by the Namenode
  • Daemon – Daemon communication => Tasktracker – Datanode / Jobtracker – Namenode communication

SSH

SSH is a cryptographic network protocol for secure data communication between two networked computers. This is how the various nodes in a Hadoop cluster connect over a secure channel and communicate with each other. Note that the daemons themselves do not rely on SSH to communicate; it is the machines in the cluster that do.

The Hadoop control scripts (but not the daemons) rely on SSH to perform cluster-wide operations. A control script lets you run commands on remote worker nodes from the master node over SSH. By design, the control scripts found in the bin directory (e.g., start-all.sh, stop-all.sh, start-dfs.sh, start-mapred.sh) reach the worker nodes only via SSH.

For example, when a user runs start-all.sh on the master node, the script starts the Namenode on the local machine on which the script is run, and starts the Tasktracker and Datanode daemons on each machine listed in the slaves file.
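As a sketch, the worker machines are listed one hostname per line in the conf/slaves file on the master node (the hostnames below are made-up examples):

```
datanode1.example.com
datanode2.example.com
datanode3.example.com
```

start-all.sh reads this file and connects to each listed host over SSH in turn to start the worker daemons.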

Configuration Steps:

SSH needs to be set up to allow password-less login for the hadoop user from the machines in the cluster. The simplest way is to generate a public/private key pair and place it in an NFS location shared across the cluster.

Generate an RSA key pair in the hadoop user's account:

$ ssh-keygen -t rsa -P "***"                       # replace *** with your own passphrase

The public key is written to ~/.ssh/id_rsa.pub and needs to be appended to authorized_keys:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Note: SSH configuration is an essential part of Hadoop installation and configuration.
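The key-generation step can be sketched end to end as below. The throwaway directory is only to make the commands safe to replay; in a real setup the files live under ~/.ssh of the hadoop user, and an empty passphrase (an assumption here) is what allows fully unattended logins:

```shell
# Create the key pair in a temporary directory (real setup: ~/.ssh).
KEYDIR=$(mktemp -d)
ssh-keygen -q -t rsa -N '' -f "$KEYDIR/id_rsa"      # -N '' = empty passphrase

# Authorize the public key, then lock down permissions as sshd requires.
cat "$KEYDIR/id_rsa.pub" >> "$KEYDIR/authorized_keys"
chmod 600 "$KEYDIR/authorized_keys"
```

With the same layout under ~/.ssh (and the directory itself at mode 700), `ssh <worker-host>` should then log in without a password prompt.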

The discussion above shows how a user performs cluster-wide operations via the control scripts.

Authentication & Authorization in Hadoop Cluster

Authentication establishes who a user is before they perform an operation on the cluster. Authorization then controls what that user may do: for example, a file may be readable only by a particular group of users, so anyone outside that group is not authorized to read it.

HDFS file permissions provide a mechanism for authorization, controlling what a user can do to a particular file. On their own, however, they are not enough, since a malicious user can gain network access to the cluster via spoofing and impersonate another user.

Also, a cluster may host a mix of datasets with different security levels.

Therefore, there should be a foolproof security mechanism that meets regulatory requirements for data protection and provides secure authentication.

Kerberos

Kerberos is a mature protocol for authenticating users, but it does not manage permissions. It is Hadoop's job to determine whether an authenticated user has permission to perform a given action.

Let’s have a very high-level overview of how Kerberos works.

The heart of Kerberos is the KDC (Key Distribution Center), which has two parts:

  1. AS (Authentication Server) – the client authenticates itself to the AS and receives a timestamped Ticket-Granting Ticket (TGT).
  2. TGS (Ticket-Granting Server) – the client uses the TGT to request a service ticket from the TGS.

The client then uses this service ticket to authenticate with the server providing the service (in Hadoop, the Namenode or Jobtracker).

Note: Step 1 is the only one performed explicitly by the user; the client carries out the remaining steps on the user's behalf.

Hadoop Implementation of Kerberos

  1. Edit core-site.xml to enable Kerberos authentication and service-level authorization:
<configuration>
   <property>
      <name>hadoop.security.authentication</name>
      <value>kerberos</value>
   </property>
   <property>
      <name>hadoop.security.authorization</name>
      <value>true</value>
   </property>
</configuration>

    2.  Configure ACLs (Access Control Lists) in the hadoop-policy.xml config file to control service-level authorization. For example, you can restrict which users are authorized to submit jobs.

The format of an ACL is a comma-separated list of usernames, followed by whitespace, followed by a comma-separated list of group names.

user1, user2   group1, group2

The above ACL authorizes any user named user1 or user2, and any user in group group1 or group2.
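As a sketch, restricting job submission in hadoop-policy.xml might look like the following (the usernames and groups are placeholders; the property shown is the one Hadoop uses for the job submission protocol):

```xml
<property>
   <name>security.job.submission.protocol.acl</name>
   <value>user1,user2 group1,group2</value>
</property>
```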

   3.  Before performing any operation, say copying a file to HDFS, the user first needs to obtain a Kerberos ticket by authenticating to the KDC, using kinit:

$ kinit                                    (authenticate to the KDC and obtain a ticket)
$ hadoop fs -put employeeList.txt .        (copy data to HDFS)

Note: once obtained, a Kerberos ticket is valid for 10 hours by default. All subsequent operations by the user can therefore be performed for 10 hours without requesting another ticket.

On a large cluster there are many client-server interactions: multiple calls to the Namenode and Datanodes, and many other communications. Authenticating each call with a Kerberos ticket could overload the KDC on a busy cluster.

Hadoop uses delegation tokens to allow later authenticated access without having to contact the KDC again. A delegation token is created by the Namenode. On the first RPC call to the Namenode, the client has no delegation token, so it uses Kerberos to authenticate, and as part of the response it receives a delegation token from the Namenode. In subsequent calls, it presents the delegation token, which the Namenode can verify.
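The idea can be illustrated with a toy model (this is not Hadoop's actual token format; the field names and secret below are made up): a delegation token is essentially a token identifier plus an HMAC of that identifier, computed with a master secret only the Namenode knows. The Namenode can then verify a presented token by recomputing the HMAC, with no round trip to the KDC:

```shell
# Toy model of a delegation token (illustrative only, not Hadoop's real format).
MASTER_KEY="namenode-master-key"                             # secret known only to the Namenode
TOKEN_ID="owner=alice,renewer=jobtracker,maxDate=1700000000" # hypothetical identifier fields

# Issue: the Namenode signs the identifier with its master key.
SIG=$(printf '%s' "$TOKEN_ID" | openssl dgst -sha1 -hmac "$MASTER_KEY" | awk '{print $NF}')
TOKEN="$TOKEN_ID:$SIG"
echo "token = $TOKEN"

# Verify: on a later RPC, the Namenode recomputes the HMAC and compares.
CHECK=$(printf '%s' "$TOKEN_ID" | openssl dgst -sha1 -hmac "$MASTER_KEY" | awk '{print $NF}')
[ "$SIG" = "$CHECK" ] && echo "token verified"
```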

For any operation on an HDFS block, the client uses a special kind of delegation token called a block access token, which the Namenode passes to the client in response to a metadata request.

Delegation tokens are also used by the Jobtracker and Tasktrackers to access HDFS during the course of a job. When the job has finished, the delegation tokens are invalidated.

                              *** *** ***
