Monday, April 27, 2015

Common Functions in R


Google "What's R", and you'll see there are many ways in which R has been defined. As per my understanding, firstly it's a programming language. Secondly, it's solely meant for statistical computing. It's not a generic programming language like Java. Now the question is "what all comes under statistical computing and graphics?" Wiki explains statistical computing as the interface between the mathematical science of statistics and computer science. And statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. So basically R is a programming language dedicated for being used to perform the study of the collection, analysis, interpretation, presentation, and organization of data. And now you know why is it so much in talks? Because in this Big Data Yug(meaning era), data is everything for everyone.

Hoping that you have R installed on your Linux/Windows machine, I hereby intend to showcase few common operations that one might almost always need to use. I have also written about how to perform all these operations using Hive in one of my previous posts. But hey, by the way if you don't have R installed, follow this link(http://cran.r-project.org/doc/manuals/r-devel/R-admin.html) to get it done.

1. Sorting in R
2. Searching in R
3. Joins in R
4. Sampling in R
5. Calculating median in R
6. Calculating mean in R
7. Finding elements from one vector which don't exist in vector in R

Open a new terminal on your machine and type 'R' to open an R shell. All the operations that follow will need the R shell.

Sorting a Vector in R


The sort() function in R takes the vector or data frame as inuput and sorts it.
#Create a vector
>a <- c(2,6,8,1,3,7,3,60,32)
#Display the vector
> a
[1]  2  6  8  1  3  7  3 60 32
#Sort the vector
> sort(a)
[1]  1  2  3  3  6  7  8 32 60                                                                 


Searching an element in Vector in R


grep() function needs to parameters, first parameter is the element to be searched and second is the vector name.
#Search for 60 in the vector 'a'                                                          
>grep(60, a)
[1] 8


Joining Files in R


Joining in R can be done on files. First the CSV files are read into variables and then 'merge' function joins the two files, only condition being that they should possess at least one column in common whose values can be used to join the files.
#Exit the R shell and in your terminal create two csv file as shown below:

vi ~/my_join_table.csv

#Paste the following content
id,age,phone
1,18,1111111    
2,19,2222222
3,17,3333333
6,23,4444444
5,20,5555555

vi ~/my_table.csv

#Paste the following content
id,name,address 
1,Ram,add_1     
2,Shyam,add_2
3,Sita,add_3
4,Ali,add_4
5,John,add_5

#Reopen the R shell and load the two csv files to two variables. Replace "user_name" with your username.
mydata1 = read.csv("/home/user_name/my_join_table.csv", header=T)
mydata2 = read.csv("/home/user_name/my_table.csv", header=T)

#Merge the two files
myfulldata = merge(mydata1, mydata2)

#Display the data in the merged file
myfulldata
  id age   phone  name  address
1  1  18 1111111   Ram add_1
2  2  19 2222222 Shyam add_2
3  3  17 3333333  Sita add_3
4  5  20 5555555  John  add_5

Sampling in R


The 'sample' function has the following syntax:

sample(x, size, replace = FALSE, prob = NULL)
OR
sample.int(n, size = n, replace = FALSE, prob = NULL)

Arguments:

x: Either a vector of one or more elements from which to choose, or a positive integer. 
n: a positive number, the number of items to choose from. 
size: a non-negative integer giving the number of items to choose.
replace: Should sampling be with replacement?
prob: A vector of probability weights for obtaining the elements of the vector being sampled.
> x <- c(4, 7, 2, 4, 9, 10, 55, 77, 1)

> sample(x, 5, replace = FALSE, prob = NULL)                             
[1] 7 2 4 4 1


Calculating median in R


The 'median method returns median of the elements in the vector
> median(x)
[1] 7                                                                                                   

Calculating mean in R


Similarly, 'mean' method is used to calculate the mean of all the elements of a vector. 
> mean(x)                                                                                          
[1] 18.77778

Find what's in one vector and not in another using R


This operation is like an A-B operation and can be accomplished using setdiff() function which takes two parameters
x <- c(1,2,3,4)
y <- c(2,3,4)
> setdiff(x, y)                                                                                     

Hope that helps!

Tuesday, April 21, 2015

Feature comparison of Machine Learning Libraries

Machine learning is a subfield of computer science stemming from research into artificial intelligence. It is a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model from example inputs and using that to make predictions or decisions, rather than following strictly static program instructions.

Every machine learning algorithm constitutes two phases:

1. Training Phase: When the algorithm learns from the input data and creates a model for reference.
2. Testing Phase: When the algorithm predicts the results based on it’s learnings stored in the model.

Machine learning is categorized into:

1. Supervised Learning: In supervised learning, the model defines the effect one set of observations, called inputs, has on another set of observations, called outputs.
2. Unsupervised Learning: In unsupervised learning, all the observations are assumed to be caused by latent variables, that is, the observations are assumed to be at the end of the causal chain.

There is wide range of machine learning libraries that provide implementations of various classes of algorithms. In my coming posts, we shall be evaluating the following open-source machine learning APIs on performance, scalability, range of algorithms provided and extensibility.

1. H2O
2. SparkMLlib
3. Sparkling Water
4. Weka


In the following posts, we shall be installing each of the above libraries and run one implementation of an algorithm available in all. This would give us an insight into the ease of use/execution, performance of the algorithm and accuracy of the algorithm.

Commonly Used Hive Queries

Hello Readers, 

I feel extremely fortunate to be hearing questions from many of you on my personal email id about why haven't I blogged since long. So, I am returning to my blog after about two years, with a determination to addup a lot more content (which has piled up in these two years) that might be of any help to anyone working on Big Data. Thanks for the motivation friends. :) So here I go.

To start with, I chose to write the most commonly used Hive queries. You can pick on this blog any time you forget the syntax of any of them ;)

Just like in SQL, there is a common set of Hive queries that tend to be used the most. These include queries for operations like:

1. Sorting
2. Searching
3. Joins
4. Sampling
5. Calculating median
6. Calculating mean
7. Finding records from one table which don't exist in another

Assuming you have a running Hadoop cluster and Hive on top of it already setup, the following sections of the tutorial will help you understand how to go about the following operations in Hive.

Data Set Preparation

Let's assume you have the following tables and data in Hive:

'my_table'

id,name,address 
1,Ram,add_1                                                                                                                                    
2,Shyam,add_2
3,Sita,add_3
4,Ali,add_4
5,John,add_5

and another table 'my_join_table':

id,age,phone
1,18,1111111                                                                                                                                     
2,19,2222222
3,17,3333333
6,23,4444444
5,20,5555555

Creating tables in Hive

To create tables like above in Hive, follow the steps below:

1. Create two files 'my_table.csv' and 'my_join_table.csv' and copy the respective comma separated content shown above in each file.

2. Place these files in HDFS

hdfs dfs -mkdir hive_tutorial_my_table                                                                                            
hdfs dfs -put my_table.csv hive_tutorial_my_table
hdfs dfs -mkdir hive_tutorial_my_join_table
hdfs dfs -put my_join_table.csv hive_tutorial_my_join_table

Yes, we have intentionally placed the two files in two different tables, since to load the file into a hive table using Create table statement, you need to specify a folder location. File paths do not work. And by default Hive adds the data of all the files in that folder to the Hive table. Strange right?

3. Create it!

Remember to replace the 'your_hdfs_user_name' with a valid name else you'll get a depressing error statement. ;)

create external table my_table (id int, name string, address string) row format delimited fields terminated by ',' lines terminated by '\n' location '/user/your_hdfs_user_name/hive_tutorial_my_table' tblproperties ("skip.header.line.count"="1");

create external table my_join_table (id int, age int, phone string) row format delimited fields terminated by ',' lines terminated by '\n' location '/user/your_hdfs_user_name/hive_tutorial_my_join_table' tblproperties ("skip.header.line.count"="1");

Cool. So now you have all that's required to try the queries.

Running the Hive Queries

Sorting in Hive


Sorting by Integer: By default, Hive sorts an integer column in the Ascending order.

Query: select id, age, phone from my_join_table order by age;                                                    

Sorting by String: And in the lexicographical order for String columns.

Query: select id, name, address from my_table order by name;                                                    

If you have a very large data set, the ORDER BY clause might take a long long time because it pushes all data through just one reducer which is unacceptable for large datasets. To optimize, use CLUSTER BY. CLUSTER BY ensures each of N reducers gets non-overlapping ranges, then sorts by those ranges at the reducers. You can go for this if you are okay with joining the multiple output files yourself.

Query: select id, name, address from my_table cluster by name;                                                      

Searching in Hive


Searching can be performed using the 'where' clause.

Query: select id, age, phone from my_join_table where age>18;                                                   

Joins in Hive


Inner Join: The INNER JOIN keyword selects all rows from both tables as long as there is a match between the columns in both tables.

Query: select a.id, a.name, a.address, b.age, b.phone from my_table a join my_join_table b on a.id=b.id;

Left Join: The Left Outer Join keyword returns each row that satisfies the join of the first table with the second table. It also returns any rows from the first table that had no matching rows in the second table. The non-matching rows in the second table are returned as null values.

Query: select a.id, a.name, a.address, b.age, b.phone from my_table a left join my_join_table b on a.id=b.id;

Right Join: The RIGHT JOIN keyword returns all rows from the second table, with the matching rows in the first table. The result is NULL in the left side when there is no match.

Query: select a.id, a.name, a.address, b.age, b.phone from my_table a right join my_join_table b on a.id=b.id;

Full Outer Join: The FULL OUTER JOIN keyword returns all rows from the first table and from the second table. The FULL OUTER JOIN keyword combines the result of both LEFT and RIGHT joins.

Query: select a.id, a.name, a.address, b.age, b.phone from my_table a full outer join my_join_table b on a.id=b.id;

Sampling in Hive


Hive supports sampling by different parameters as follows:

Sampling by percentage:

Query: select id, name, address from my_table TABLESAMPLE(0.1 PERCENT);                          

Sampling by size:

Query: select id, name, address from my_table TABLESAMPLE(10M) s;                                       

Sampling by number of records:

Query: select id, name, address from my_table TABLESAMPLE(2 ROWS) s;                               

Sampling by Bucketing:

Query: create table my_join_table_bucketed(id INT, age INT, phone STRING) comment 'A bucketed copy of my_join_table' clustered by(age) into 4 buckets;
Query: insert into table my_join_table_bucketed select id, age, phone from my_join_table;
Query: SELECT * FROM my_join_table_bucketed TABLESAMPLE(BUCKET 3 OUT OF 4 ON rand()) s;
Query: SELECT * FROM my_join_table_bucketed TABLESAMPLE(BUCKET 1 OUT OF 4 ON age);

Calculating median


The function percentile with the following syntax can be used to calculate median in hive. It returns the exact pth percentile of a column in the group (does not work with floating point types). To calculate the median, p should be equal to 0.5. Otherwise it can be any value between 0 and 1.

percentile(BIGINT col, p)                                                                                                                  

We shall use this to calculate the median of the column 'age' in 'my_join_table'

Query: select percentile(cast(a.age as bigint), 0.5) AS age_median from my_join_table a;            

Calculating mean


Calculating mean of the values in a column is pretty straight forward with the help of the function 'avg(col)'

Query: select avg(a.age) AS age_mean from my_join_table a;                                                      

Finding records from one table which don't exist in another


Now, this is the trickiest one. We intend to find those records of table A which do not exist in table B. Sort of A-B. For this, the schema of the tables need not always be the same. The concept is very simple, we would join the two tables and apply a check statement to include only those records where the join column of the first table is NOT NULL. This way we get those records which are in the first table and not in the second.

Add a row to my_table with id not in table my_join_table.

Query: insert into my_table values (6, 'Jayati', 'add_6');

Query: select a.id, a.name, a.address, b.age, b.phone from my_table a join my_join_table b on a.id=b.id where (a.id IS NOT NULL);

Hope this tutorial gave a good start with Hive!