Thursday, January 31, 2013

HBase Installation : Fully Distributed Mode

HBase can be installed in 3 modes:
  1. Standalone Mode
  2. Pseudo-Distributed Mode
  3. Fully-Distributed Mode
The objective of this post is to provide a tried-and-true procedure for installing HBase in Fully-Distributed Mode. Before we dive into the nuts and bolts of the installation, here are a few HBase preliminaries worth covering up front.

1. When talking about installing HBase in fully distributed mode we'll be addressing the following:
  • HDFS: A running instance of HDFS is required for deploying HBase in distributed mode.
  • HBase Master: HBase cluster has a master-slave architecture where the HBase Master is responsible for monitoring all the slaves i.e. Region Servers.
  • Region Servers: These are the slave nodes responsible for storing and managing regions.
  • Zookeeper Cluster: A distributed Apache HBase installation depends on a running ZooKeeper cluster. All participating nodes and clients need to be able to access the running ZooKeeper ensemble.
2. The coin of setting up a Fully-Distributed HBase Cluster has two sides to it:
  • When the ZooKeeper cluster is managed internally by HBase
  • When the ZooKeeper cluster is managed externally
3. HBase is particular about the DNS entries of its cluster nodes. Therefore, to avoid name-resolution discrepancies, we will assign hostnames to the cluster nodes and use them throughout the installation.

Deploying a Fully-Distributed HBase Cluster

Assumptions
For clarity and ease of expression, I'll assume we are setting up a cluster of 3 nodes with the IP addresses
10.10.10.1
10.10.10.2
10.10.10.3
where 10.10.10.1 will be the master and 10.10.10.2 and 10.10.10.3 will be the slaves/region servers.
Also, we'll assume that we have a running instance of HDFS, whose NameNode daemon is running on
10.10.10.4

Case I- When HBase manages the Zookeeper ensemble
HBase by default manages a ZooKeeper "cluster" for you. It will start and stop the ZooKeeper ensemble as part of the HBase start/stop process.

Step 1: Assign hostnames to all the nodes of the cluster.
10.10.10.1 master
10.10.10.2 regionserver1
10.10.10.3 regionserver2
10.10.10.4 namenode

Now, on each of these nodes, append the required hostname entries to the /etc/hosts file.
On the Namenode(10.10.10.4) add:
10.10.10.4 namenode                                                                        

On the Master Node(10.10.10.1) add:
10.10.10.1 master
10.10.10.4 namenode
10.10.10.2 regionserver1
10.10.10.3 regionserver2                                                                   

On the Region Server 1(10.10.10.2) add:
10.10.10.1 master
10.10.10.2 regionserver1
10.10.10.4 namenode

And on the Region Server 2(10.10.10.3) add:
10.10.10.1 master
10.10.10.3 regionserver2
10.10.10.4 namenode

Note that the region servers also carry the namenode entry because they read and write data directly on HDFS at hbase.rootdir.

Step 2: Download a stable HBase release from
http://apache.techartifact.com/mirror/hbase/
and untar it at a suitable location on all the HBase cluster nodes (10.10.10.1, 10.10.10.2, 10.10.10.3).
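For example, the following commands should do the job on each node (the exact tarball name and the /usr/local target are only placeholders; pick whichever stable version the mirror currently offers and any directory you prefer):
wget http://apache.techartifact.com/mirror/hbase/hbase-0.94.4/hbase-0.94.4.tar.gz    # example version
sudo tar -zxvf hbase-0.94.4.tar.gz -C /usr/local/
export HBASE_HOME=/usr/local/hbase-0.94.4
The HBASE_HOME variable is used in the start/stop commands later in this post.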

Step 3: Edit the $HBASE_HOME/conf/hbase-env.sh file on all the HBase cluster nodes (10.10.10.1, 10.10.10.2, 10.10.10.3) to set JAVA_HOME
(e.g. /usr/lib/jvm/java-6-openjdk/) and to set HBASE_MANAGES_ZK to true, indicating that HBase should manage the ZooKeeper ensemble internally.
export JAVA_HOME=your_java_home
export HBASE_MANAGES_ZK=true
                                            

Step 4: Edit $HBASE_HOME/conf/hbase-site.xml on all the HBase cluster nodes so that after editing it looks like:
<configuration>
<property>
    <name>hbase.master</name>
    <value>10.10.10.1:60000</value>
</property>
<property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode:9000/hbase</value>
</property>
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>
</configuration>
                                                                              

Here the property 'hbase.master' specifies the host and port that the HBase master (10.10.10.1) runs at. Next is the 'hbase.rootdir' property, which is a directory shared by the region servers. Its value has to be an HDFS location, e.g. hdfs://namenode:9000/hbase. Since we assigned 10.10.10.4 the hostname 'namenode' and its NameNode port is assumed to be 9000, the rootdir becomes hdfs://namenode:9000/hbase. If your NameNode is running on some other port, replace 9000 with that port number.

Step 5: Edit the $HBASE_HOME/conf/regionservers file on all the HBase cluster nodes, adding the hostnames of all the region server nodes, e.g.
regionserver1
regionserver2                                                                                     


This completes the installation of an HBase cluster with an internally managed ZooKeeper ensemble.

Case II- When the Zookeeper ensemble is managed externally
We can manage the ZooKeeper ensemble independent of HBase and just point HBase at the cluster it should use.

Step 1: Assign hostnames to all the nodes of the HBase and ZooKeeper clusters, assuming a two-node ZooKeeper ensemble on 10.10.10.5 and 10.10.10.6. (In practice ZooKeeper ensembles usually have an odd number of members, since a two-node ensemble cannot survive the loss of either node.)
10.10.10.1 master
10.10.10.2 regionserver1
10.10.10.3 regionserver2
10.10.10.4 namenode
10.10.10.5 zkserver1
10.10.10.6 zkserver2                                                                      
  

Now, on each of these nodes, append the required hostname entries to the /etc/hosts file.
On the Namenode(10.10.10.4) add:
10.10.10.4 namenode                                                                        

On the Master Node(10.10.10.1) add:
10.10.10.1 master
10.10.10.4 namenode
10.10.10.2 regionserver1
10.10.10.3 regionserver2
10.10.10.5 zkserver1
10.10.10.6 zkserver2                                                                         

On the Region Server 1(10.10.10.2) add:
10.10.10.1 master
10.10.10.2 regionserver1
10.10.10.4 namenode
10.10.10.5 zkserver1
10.10.10.6 zkserver2

And on the Region Server 2(10.10.10.3) add:
10.10.10.1 master
10.10.10.3 regionserver2
10.10.10.4 namenode
10.10.10.5 zkserver1
10.10.10.6 zkserver2

Step 2: Download a stable HBase release from
http://apache.techartifact.com/mirror/hbase/
and untar it at a suitable location on all the HBase cluster nodes (10.10.10.1, 10.10.10.2, 10.10.10.3), exactly as in Case I.

Step 3: Edit the $HBASE_HOME/conf/hbase-env.sh file on all the HBase cluster nodes (10.10.10.1, 10.10.10.2, 10.10.10.3) to set JAVA_HOME
(e.g. /usr/lib/jvm/java-6-openjdk/) and to set HBASE_MANAGES_ZK to false, indicating that the ZooKeeper ensemble will be managed externally.
export JAVA_HOME=your_java_home
export HBASE_MANAGES_ZK=false                                            

Step 4: Edit $HBASE_HOME/conf/hbase-site.xml on all the HBase cluster nodes so that after editing it looks like:
<configuration>
<property>
    <name>hbase.master</name>
    <value>10.10.10.1:60000</value>
</property>
<property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode:9000/hbase</value>
</property>
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>
<property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
</property>
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>zkserver1,zkserver2</value>
</property>
</configuration>                                                                               

Here the property 'hbase.master' specifies the host and port that the HBase master (10.10.10.1) runs at.
Next is the 'hbase.rootdir' property, which is a directory shared by the region servers. Its value has to be an HDFS location, e.g. hdfs://namenode:9000/hbase. Since we assigned 10.10.10.4 the hostname 'namenode' and its NameNode port is assumed to be 9000, the rootdir becomes hdfs://namenode:9000/hbase. If your NameNode is running on some other port, replace 9000 with that port number.
The property 'hbase.zookeeper.property.clientPort' mirrors the clientPort setting in ZooKeeper's config file zoo.cfg; it is the port on which clients connect.
And lastly, the property 'hbase.zookeeper.quorum' is a comma-separated list of the servers in the ZooKeeper quorum.
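For reference, a minimal zoo.cfg on the external ZooKeeper nodes (zkserver1 and zkserver2) might look like the sketch below; the dataDir is just an example path, and the server list must match the hostnames and clientPort used above:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper        # example path; each node also needs a 'myid' file (1 or 2) in this directory
clientPort=2181
server.1=zkserver1:2888:3888
server.2=zkserver2:2888:3888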

Step 5: Edit the $HBASE_HOME/conf/regionservers file on all the HBase cluster nodes, adding the hostnames of all the region server nodes, e.g.
regionserver1
regionserver2                                                                                     

This brings us to the end of the installation process for an HBase cluster with an externally managed ZooKeeper ensemble.

Start your HBase Cluster
Having followed the steps above, it is now time to start the deployed cluster. If you have an externally managed ZooKeeper cluster, make sure to start it before you proceed further.
On the master node (10.10.10.1), cd to the HBase installation directory and run the following command:
$HBASE_HOME/bin/start-hbase.sh                                                 
This starts the master and the region servers on their respective nodes of the cluster.
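To sanity-check the deployment, you can run jps on each node and open the HBase shell on the master. A rough sketch (the exact daemon list depends on your setup; HQuorumPeer appears only when HBase manages ZooKeeper itself):
jps                                  # on the master: should list HMaster (plus HQuorumPeer in Case I)
jps                                  # on each region server: should list HRegionServer
$HBASE_HOME/bin/hbase shell          # quick smoke test from the master
hbase> status
hbase> exit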

Stop your HBase Cluster
To stop a running cluster, cd to the HBase installation directory on the master node and run
$HBASE_HOME/bin/stop-hbase.sh                                                 

Hadoop-HBase Version Compatibility
As per my assessment, Hadoop versions 1.0.3 and 1.0.4 work fine with all HBase releases in the 0.94.x and 0.92.x series, but not with the 0.90.x series and older releases, due to version incompatibility.

Tuesday, January 29, 2013

Setup a MongoDB Sharded Cluster


"Sharding = Shared + Nothing"
Database Sharding is a type of database partitioning that separates very large databases into smaller, faster, more easily managed horizontal partitions called data shards.

Sharding in MongoDB

"MongoDB’s sharding system allows users to partition a collection within a database to distribute the collection’s documents across a number of mongod instances or shards. Sharding increases write capacity, provides the ability to support larger working sets, and raises the limits of total data size beyond the physical resources of a single node." as per the MongoDB Manual.

So when the amount of data that needs to be stored exceeds the storage limit of a single physical resource, a sharded cluster needs to be deployed. Every horizontal partition called "shard" stores a part of the entire dataset. Sharding enables distribution of data, but to have this distributed data replicated as well, each shard should be a Replica Set. For more details on Replica Set, refer to my previous post.

[Figure: A Sharded Cluster in MongoDB]

The figure above depicts a Sharded Cluster in which each shard is a Replica Set, ensuring that each portion of the data lying on different physical devices has a copy available in case of failures. As depicted, the documents of a collection (such as Document 1) are distributed across the shards.

Deploying a Sharded Cluster

A Sharded Cluster consists of
  • 1 config server (3 for a production cluster)
  • 1 mongos connected to the config server(s) (1 or more for a production cluster)
  • 1 shard, i.e. a replica set or a standalone mongod (2+ for a production cluster)
For testing purposes you can deploy all three daemons on the same machine, keeping in mind that the ports they run on have to be different; the port layout used in this walkthrough is summarised just below. The following steps show how to assign a different port to each daemon process. Here we assume that you are using a single node (e.g. 10.10.10.10) to run all the daemons; in the case of multiple machines, use their respective IP addresses.
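For reference, the walkthrough uses the following layout on 10.10.10.10 (the port numbers are arbitrary choices; 27017 simply happens to be the mongod default):
Config Server (mongod --configsvr) : 10.10.10.10:27018
MongoS router                      : 10.10.10.10:27019
Shard (standalone mongod)          : 10.10.10.10:27017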

Step 1: Config Server Setup
Follow Steps 1 and 2 as given in the MongoDB Standalone Setup post.
The config server requires a config db location, which is specified in the startup command using the --dbpath option. Before starting it, we need to create this directory and change its permissions and ownership to the user running MongoDB. Use the following commands to set up the config db path of your choice; here we take "/data/configdb":
sudo mkdir -p /data/configdb
sudo chmod -R 777 /data/configdb
sudo chown -R userName:groupName /data/configdb
Start the Config Server by running the following command on 10.10.10.10
mongod --configsvr --dbpath /data/configdb --port 27018 
This would start the config server on the machine at port 27018.

Step 2: MongoS Instance Setup
Here we only need to follow Steps 1 and 2 as given in the MongoDB Standalone Setup post; no other mongos-specific configuration is needed. To start the MongoS server, run
mongos --configdb 10.10.10.10:27018 --port 27019
Here, the --configdb option lists the ip:port pairs of the config servers that the mongos is supposed to connect to. Note that mongos expects exactly one or three config servers, so with multiple config servers the command would look like (the second and third addresses here are just examples):
mongos --configdb 10.10.10.10:27018,11.11.11.11:27018,12.12.12.12:27018 --port 27019

Step 3: Setting up the Shards
As mentioned above, each shard is a Replica Set and every member of a Replica Set is a MongoD instance. To set up a shard you need to create a Replica Set first; follow my previous post to do so.
For testing purposes, we can also set up a standalone MongoD server and add it to the cluster as a shard. Follow the steps for setting up a standalone MongoD instance as given in this post.
Having started the MongoD daemon using
/bin/mongod --config /pathToFile/mongod.conf   
on 10.10.10.10, all that remains is to add this node to the cluster as a shard.

Step 4: Adding Shards to the Cluster
For adding shards, you need a MongoS terminal, which can be opened with the following command:
mongo mongos_instance_ip:mongos_port            
where the "mongos_instance_ip" is the ip address of the machine on which MongoS daemon is running at port "mongos_port".
In our example the command would be
mongo 10.10.10.10:27019                                   
which would open a mongos terminal.
In the terminal, run commands like
sh.addShard("shard_ip:mongod_port");
to add as many shards as you want.
For example, since here we have a standalone mongod instance as a shard, we will use
sh.addShard("10.10.10.10:27017");                     
You can check the successful execution of the above command using 
sh.status()                                                          
in the mongos terminal.
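Note that adding shards alone does not distribute any data: sharding also has to be enabled for a database, and a shard key chosen for each collection. A minimal sketch from the mongos terminal, assuming a hypothetical database "testdb" with a collection "logs" sharded on _id:
sh.enableSharding("testdb");
sh.shardCollection("testdb.logs", { "_id": 1 });
sh.status();    // the database and collection should now be listed as sharded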

Note: After version 2.0.3, when adding a Replica Set as a shard to the cluster, it is not mandatory to add all nodes in the replica set using sh.addShard(). Adding one of the nodes of the Replica Set would enable mongos to discover all other members and add them to the shard automatically.
For eg. if you have a replica set named "replSet01" which has 2 members 10.10.10.1 and 10.10.10.2, to add this replica set as a shard to the above cluster you would use
sh.addShard("replSet01/10.10.10.1:27017");                            
and this would automatically add both 10.10.10.1 and 10.10.10.2 as a member of the shard being added.

Before version 2.0.3, it was required to specify all the replica set members when adding them to the sharded cluster, like
sh.addShard("replSet01/10.10.10.1:27017,10.10.10.2:27017");

Hope this post was a helping hand to you in testing out Sharding in MongoDB.
All the Best.

Monday, January 28, 2013

Setup a MongoDB Replica Set


Database replication is a service for shipping changes to your database off to a copy housed on another server, potentially even in another data center. It ensures redundancy, backup, and automatic failover.
Replication occurs through groups of servers known as replica sets, each of which stores the same dataset as the other members of the replica set.

Replication in MongoDB

In MongoDB, replication is implemented in two forms:
  1. Replica Sets: Group of replication nodes containing one primary node and multiple secondary nodes. Primary accepts write operations, and secondaries are the replicating members. MongoDB’s replica sets provide automated failover. If a primary fails, the remaining members will automatically try to elect a new primary.
  2. Master-Slave Architecture: When the number of replica nodes needs to be more than 12, the Master-Slave architecture comes in. It is similar to Replica Sets except that it does not provide automated failover.
The figure below illustrates the concept of Replica Sets in MongoDB, where one node is the primary and all the rest are secondaries. All of them store the same documents, i.e. Doc 1 and Doc 2.

 
[Figure: Replication in MongoDB]

Deploying a Replica Set

With the basics of Replica Sets in MongoDB covered, try deploying one using the steps below:

Step 1: Follow Steps 1, 2 and 3 for setting up MongoDB on all the nodes of the Replica Set, as given in my previous post.

Step 2: Create a conf file (for example "mongod.conf") at a location owned by and accessible to the user, with the following content. You can configure the "port", and you name your replica set by setting the "replSet" field; this name must be the same on all nodes of the replica set.

dbpath=/var/lib/mongodb
logpath=/var/log/mongodb/mongodb.log
logappend=true
nojournal = true
port = 27017
bind_ip = machine_ip (for eg. 100.11.11.11)
fork = true
replSet = rsName                                                                              


Step 3: Start the MongoD daemon on each node of the Replica Set using:

/bin/mongod --config /pathToFile/mongod.conf                                

Step 4: Open a mongo terminal on the node which you want to be the primary node using: 

bin/mongo --host ip_of_the_machine                                                
To make this node the primary, run the following command in the opened mongo terminal:

rs.initiate()                                                                                         

Step 5: To add the other nodes, run the add commands in the opened mongo terminal on the primary node:
rs.add("ip_of_secondary_node_1")
rs.add("ip_of_secondary_node_2")
and so on.                                                                                          
Check the resulting replica set configuration using rs.conf() (rs.status() additionally shows the current state of each member):

rs.conf()                                                                                            

Add secondaries to the Replica Set using a JavaScript file

After initializing the primary using "rs.initiate()", add the secondaries using:

mongo primary_ip:mongod_port /path_to_js_file                             

where the .js file contains 

rs.add('ip_of_sec1:port_of_mongod');
rs.add('ip_of_sec2:port_of_mongod');
.
.
.
rs.add('ip_of_secN:port_of_mongod');                                              

To check, you can open a mongo terminal on the primary node and run the "rs.conf()" command to verify the configuration after running the JavaScript file.

Hope that the above steps helped you create a Replica Set successfully. My next post would be on "Creating a MongoDB Sharded Cluster".



Friday, January 25, 2013

MongoDB Installation: Standalone Mode


"Mongo" is a short for "Humongous" meaning "Huge".
By the name it goes that, MongoDB is a database phenomenal at dealing with "huge" piles of data.

As per the MongoDB Documentation,
"MongoDB is a scalable, high-performance, open source NoSQL database."

Installing MongoDB in standalone mode is straightforward; on Linux it can also be done with the apt-get command. This post is for you if you wish to know which configuration files MongoDB uses and if you intend to go further than a standalone installation and a casual try-out of MongoDB.

Tested On
I have tested the following steps on Ubuntu 10.04 and MongoDB 2.2.2. 

Steps of Installation

Step 1: Every MongoDB release comes in two flavors:
  • 32-bit
  • 64-bit
The 32-bit builds run on both 32- and 64-bit machines, while the 64-bit builds run only on 64-bit machines. However, as per the MongoDB manual,
"32-bit builds are limited to around 2GB of data. In general you should use the 64 bit builds. The 32 bit binaries are ok for replica set arbiters and mongos but not for production mongod's."

So it is good to figure out the machine architecture before deployment.
Run the following command to identify yours:
uname -m                                                                                         

It returns "i686" for 32-bit machines and "x86_64" for 64-bit machines.

Step 2: Based on the Step 1 result, download the appropriate setup from
http://www.mongodb.org/downloads
Untar the downloaded setup using 

tar  -zxvf  location_of_setup                                                             

Step 3: MongoDB requires 2 directories for its functioning:
1. A directory for storing the logs
2. A directory for storing the data, i.e. the dbpath (by default /data/db)
You can use any suitable location on your machine to create these directories. For example, here we'll be taking
Log Directory: /var/log/mongodb
DBPath: /var/lib/mongodb

Create the above 2 directories as below:
sudo mkdir /var/lib/mongodb
sudo mkdir /var/log/mongodb
                                                        

Change their permissions as below:
sudo chmod -R 777 /var/lib/mongodb
sudo chmod -R 777 /var/log/mongodb
                                             

Change their ownership to the user (root or non-root) as below:
sudo chown -R userName:groupName /var/lib/mongodb
sudo chown -R userName:groupName /var/log/mongodb
               

Also change the ownership of the directory where the MongoDB setup was untarred, using similar commands as above.

Step 4: Create a conf file (for example "mongod.conf") at a location owned by and accessible to the user:
dbpath=/var/lib/mongodb
logpath=/var/log/mongodb/mongodb.log                    
logappend=true
nojournal=true                                                                                 

Step 5: Start the MongoD daemon using:
/bin/mongod --config /path_to_conf_file/mongod.conf                  
  
Step 6: Verify whether the MongoD daemon is running or not using:
sudo netstat -nlp | grep mongod                                                    

Step 7: Start using MongoDB by opening the mongo shell using:
/bin/mongo                                                                                  
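As a quick smoke test (the database and collection names below are just examples), you can insert a document from the shell and read it back:
use testdb
db.sample.insert({ "msg": "hello mongo" })
db.sample.find()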

This completes the installation process of MongoDB in standalone mode. In my next post, I'll cover the steps to set up a MongoDB Replica Set.

Saturday, January 19, 2013

Quick Start Guide for Github

According to Wikipedia, "Git is a distributed revision control and source code management (SCM) system with an emphasis on speed", and GitHub is a web-based hosting service for Git repositories.
To understand Git better, a comparison with traditional Version Control Systems is a good starting point, since most of us have used some VCS or the other. Here goes the comparison:
  •  Git thinks of its data more like a set of snapshots of a mini filesystem and stores every version as a reference to the snapshot of the filesystem at that point, unlike VCSs which think of the information they keep as a set of files.
  • Git stores the entire history of the project on your local disk, due to which most operations in Git only need local files and resources to operate unlike a VCS where most operations have that network latency overhead.
  • Git provides data integrity through its check-summing mechanism.
  • Almost every action in Git is undoable. As in any VCS, you can lose or mess up changes you haven’t committed yet; but after you commit a snapshot into Git, it is very difficult to lose it.

The Three States of Git

Git has three main states that your files can reside in:
  1. Committed: means that the data is safely stored in your local database.
  2. Modified:  means that you have changed the file but have not committed it to your database yet.
  3. Staged: means that you have marked a modified file in its current version to go into your next commit snapshot.

Setup and Start Using Github

After this walk through the basics, let's get to the very aim of this post, i.e. setting up and starting to use GitHub.

  • Firstly install git on your machine using synaptic or the command 
            sudo apt-get install git
  • Next, create an account on github.com, which requires an email id, a username and a password.
  • Once the account is set up, you need to set up the SSH keys required for authenticating your account/local repo to GitHub. Without this, if an operation like
            git push -u origin master
       is attempted, you'll get errors like:
            “Permission denied (publickey).
            fatal: The remote end hung up unexpectedly ”
       So proceed to set up the SSH keys.

  • Steps for setting up SSH Keys
1. Check for SSH keys: 
$ cd ~/.ssh
This checks whether there is a directory named ".ssh" in your home directory. If yes, proceed with step 2, else jump to step 3.

2. Backup and remove existing SSH keys :
i. $ ls (Lists all the subdirectories in the .ssh folder)
config  id_rsa  id_rsa.pub  known_hosts
or
authorized_keys  id_dsa  id_dsa.pub  known_hosts
ii. $ mkdir key_backup (makes a subdirectory called "key_backup" in the current directory)
iii. $ cp id_rsa* key_backup (Copies the id_rsa and id_rsa.pub files into key_backup)   
iv. $ rm id_rsa*
3. Generate a new SSH key.
        $ ssh-keygen -t rsa -C "your_email@youremail.com"
       (this email id is the one you must have used when creating  the github account)
Output would be like:
      Generating public/private rsa key pair.
      Enter file in which to save the key (/home/abc/.ssh/id_rsa): <press enter>
      Enter passphrase (empty for no passphrase): ******
      Enter same passphrase again:  ******
      Your identification has been saved in /home/abc/.ssh/id_rsa.
      Your public key has been saved in /home/abc/.ssh/id_rsa.pub.
      The key fingerprint is:
      .....

  •  Add your SSH key to GitHub.
    Open the SSH Keys section of your GitHub account settings and click on the "Add SSH Key" button. Copy the content of the /home/abc/.ssh/id_rsa.pub file (for example, print it with cat ~/.ssh/id_rsa.pub) and paste it there. Then hit the Add button.


  • Test everything out.
       Try the command below :                  
            $ ssh -T git@github.com
   If an error like :
           “Agent admitted failure to sign using the key.
           Permission denied (publickey).”
   occurs, then try:
           $ ssh-add ~/.ssh/id_rsa
   Output would be:
           Enter passphrase for /home/abc/.ssh/id_rsa: ******
           Identity added: /home/abc/.ssh/id_rsa (/home/abc/.ssh/id_rsa)
    Now try the same command :
          $ ssh -T git@github.com
          Hi Jayati! You've successfully authenticated, but GitHub does not provide shell access.


Commit your code to a remote repository   


To add code to some remote repo on GitHub (a consolidated example session follows this list):
  • Clone the repository to which you need to add a module/subfolder. 
            “git clone git@github.com:sample_repository.git ” 
      This creates a folder sample_repository in the current working directory.
  • Copy the folder to be added into this cloned folder and place it the way you wish to add it to the repository. cd to the cloned copy of the repository and run the following command to add the folder to the local repository
             “git add sample_sub_folder/*” 
                      or 
             “git add sample_sub_folder/pom.xml” 
    if only one of the components(for eg. pom.xml) of the folder is to be added.
  • Next commit the added folder using
             “git commit -m "your message"”
                      or
             “git commit -a”  (this opens an editor where you can write the commit message)
  • Run the pull command:
             “git pull”
  • Run the push command:
             “git push”

  • To revert a commit: 
             “git revert HEAD ” 
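
Putting the steps together, a sample session might look like the sketch below; the repository URL, username and folder name are placeholders to be replaced with your own:
git clone git@github.com:your_username/sample_repository.git
cd sample_repository
cp -r /path/to/sample_sub_folder .          # place the new module inside the clone
git add sample_sub_folder/*
git commit -m "Add sample_sub_folder"
git pull
git push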
  
That's it. Enjoy Githubing ...