Installing Hadoop and Using Hive to Query Data
Apache Hadoop is a framework that lets us process large sets of data across clusters of computers in a distributed fashion. It is designed to scale from a single server to many machines, with each machine providing local computation power. This article is aimed at students who are learning Hadoop and want to install it on their Windows machine and start working with it in a simple, easy way.
The following tools need to be downloaded:
1-VirtualBox
2-Cloudera HDP Hortonworks Sandbox
After downloading the tools mentioned above, install VirtualBox and open the HDP Hortonworks Sandbox with it. Mounting the image takes a little time, and 16 GB of system RAM is recommended. The HDP Hortonworks Sandbox is essentially a preinstalled Hadoop environment that also bundles a number of associated technologies. It will appear like the following image.
Now power up the Sandbox. While it boots, you can download a dataset from here to work with later; it is a movie-ratings dataset that we will use later in this article. Check VirtualBox, and if all is well, a CentOS instance will be running there, just as shown in the following image.
Now open any browser and go to 127.0.0.1:8888 to access the dashboard, called Ambari. Use maria_dev as both the username and the password, and you will see all the options of the Ambari dashboard.
Now extract the zip file you downloaded earlier to get hands-on with Hive. It contains a u.data file holding a movie ID, user ID, rating, and timestamp for a large number of ratings. Another useful file is u.item, which contains the movies' names, their release dates, and some other data. Now select Hive View from the menu option at the top right, as shown in the following image.
Once the Hive view opens, click the Upload Table section and upload the file. In the file-type settings, select "Tab" as the field delimiter. Select the ratings data file (u.data) from where you downloaded it, then name the new table and its columns for easier querying later. Finally, click Upload Table to let Hadoop load it into the cluster. Similarly, upload the u.item file with "Pipe Character" as the field delimiter and name it accordingly.
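Behind the scenes, the upload wizard generates Hive table definitions like the ones sketched below. If you prefer the query editor, you could create equivalent tables by hand; note that the table names (ratings, names) and column names here are only suggestions for this walkthrough, not names fixed by the dataset:

```sql
-- Sketch of a table matching the tab-delimited u.data file
CREATE TABLE IF NOT EXISTS ratings (
    user_id     INT,
    movie_id    INT,
    rating      INT,
    rating_time BIGINT      -- epoch-seconds timestamp
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'   -- u.data is tab-separated
STORED AS TEXTFILE;

-- u.item is pipe-separated; only the leading columns are mapped here
CREATE TABLE IF NOT EXISTS names (
    movie_id     INT,
    title        STRING,
    release_date STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
```

Defining the delimiter in the table itself is what lets Hive read the raw text files in place, which is why picking "Tab" versus "Pipe Character" in the wizard matters.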
Now go to the Query section and write the following query:

SELECT movie_id, COUNT(movie_id) AS ratingCount
FROM ratings
GROUP BY movie_id
ORDER BY ratingCount DESC;
Congratulations! You have written your first query on a Hadoop cluster using Hive. Now you can try different kinds of queries, such as fetching the movie names using the other table you uploaded earlier.
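For example, pulling in the movie names is a simple join between the two tables. This sketch assumes the tables were named ratings and names, with the columns named as in the earlier steps:

```sql
-- Top 10 most-rated movies with their titles
-- (table and column names are this article's examples, not fixed by Hive)
SELECT n.title, COUNT(r.movie_id) AS ratingCount
FROM ratings r
JOIN names n ON r.movie_id = n.movie_id
GROUP BY n.title
ORDER BY ratingCount DESC
LIMIT 10;
```

The GROUP BY moves from movie_id to title after the join, since the title is what we want to display for each group.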
Hope you enjoyed it.