Of course, anyone can implement a Hadoop cluster setup, but the real question is how effectively we can use the Hadoop framework for our requirements. The Map/Reduce software framework is developed in Java: typically the mapper reads files from HDFS and the reducer writes its output back to the Hadoop file system. Java's New I/O (NIO) package plays a significant role in this framework because, in parallel programming, every thread accesses shared data, and NIO provides facilities such as file locks for coordinating that access.
- Even though the Map/Reduce engine core is developed in Java, we can write the mapper and reducer not only in Java but also in other languages such as Scala, Python, Jython and Ruby (for example, via Hadoop Streaming). Each has its own strengths and weaknesses.
- The problem lies with small files. HDFS blocks are 64 MB each by default, and every file, directory and block is represented as an object in the NameNode's memory, each occupying roughly 150 bytes. With 10 million small files, each holding a block, that is about 20 million objects, which obviously occupies roughly 3 GB of NameNode memory.
- Also, reading through many small files requires a lot of file-position seeks.
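The memory figure above can be checked with some quick arithmetic. A minimal sketch, assuming one inode object plus one block object per small file (the class and method names here are illustrative, not from any Hadoop API):

```java
public class NameNodeMemoryEstimate {
    // Assumption: each small file costs one file (inode) object plus
    // one block object in the NameNode's heap, ~150 bytes apiece.
    static long estimateBytes(long files, long objectsPerFile, long bytesPerObject) {
        return files * objectsPerFile * bytesPerObject;
    }

    public static void main(String[] args) {
        long totalBytes = estimateBytes(10_000_000L, 2, 150);
        // 3,000,000,000 bytes, i.e. roughly 3 GB of NameNode heap
        System.out.printf("~%.2f GB of NameNode heap%n",
                totalBytes / (1024.0 * 1024 * 1024));
    }
}
```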
Our Case study
We are cloud guys! We have the capability as cloud-based solution providers. One of our tools, CloudBuddy Enterprise (which currently supports Amazon S3), emits a log entry for every user transaction, such as an upload to or download from Amazon S3. The CloudBuddy Enterprise server uploads these logs to S3 on a periodic basis. On the Hadoop cluster machines, a Java scheduler initiates the download trigger, fetches the log files from S3, and stores them in HDFS. The Map/Reduce engine fetches the input logs from the Hadoop file system and processes them. The final output is stored in HBase, the Hadoop database; each time, the result in HBase is updated with the newly processed data from the Map/Reduce engine.
For any Map/Reduce task, we need a data set as input; here it is the transactional logs. The mapper, which has been written in Java, maps each transaction (upload/download) against the user ID.
The log structure is comma-separated values, and the format is:
First Column : User-ID
Second Column : [[UPLOAD – 1], [DOWNLOAD – 2], [DELETE – 3]]
Third Column : Bytes
Fourth Column : Time Stamp
Sample Logs :
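A few hypothetical lines in this format (the user IDs, byte counts and timestamp layout are invented for illustration; the original sample is not reproduced here):

```
user123,1,52428800,2011-06-01 10:15:32
user123,2,1048576,2011-06-01 10:20:11
user456,3,2097152,2011-06-01 11:02:45
```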
For every user, the application allocates a disk size. To compute over the logs using the Map/Reduce framework, the input data must already be stored in the Hadoop file system. The Map/Reduce framework splits the records based on the user ID and combines the bytes by transaction type. The Mapper class uses a pattern to split the columns by the delimiter.
By overriding the map method, N maps are created, one per record in the input file. The mapper provides output as follows.
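As a rough sketch of the map step (plain Java rather than the actual Hadoop Mapper subclass, so the class name and the composite key layout are assumptions), each log line is split on the comma delimiter and emitted as a (User-ID : transaction type) key with the byte count as the value:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;

public class TransactionMapSketch {
    // Parses one CSV log line -- user-id,type,bytes,timestamp --
    // and emits a (user-id:type, bytes) pair, much as a Hadoop map()
    // would via context.write(). Names here are illustrative.
    static Entry<String, Long> map(String logLine) {
        String[] cols = logLine.split(",");           // split on the delimiter
        String userId = cols[0].trim();
        String txType = cols[1].trim();               // 1=upload, 2=download, 3=delete
        long bytes = Long.parseLong(cols[2].trim());  // cols[3] (timestamp) unused here
        return new SimpleEntry<>(userId + ":" + txType, bytes);
    }

    public static void main(String[] args) {
        Entry<String, Long> out = map("user123,1,52428800,2011-06-01 10:15:32");
        System.out.println(out); // prints user123:1=52428800
    }
}
```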
Now, the combiner adds up the similar transactions for each User-ID.
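The combining step can be sketched in plain Java (a stand-in for the Hadoop combiner, which runs over the mapper's local output; the names are illustrative): byte counts that share the same (user-id:type) key are summed together.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombineSketch {
    // Sums byte counts that share the same (user-id:type) key,
    // as a combiner would do over a mapper's local output.
    static Map<String, Long> combine(List<Map.Entry<String, Long>> mapped) {
        Map<String, Long> sums = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : mapped) {
            sums.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return sums;
    }
}
```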
Eventually, the reducer calculates the total transactions for every user:
<User-ID, (Upload+Download)-(Delete)> —> <User-ID, Total_Transactions>
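The final per-user reduction is simple arithmetic: uploads and downloads add to the user's consumed bytes, while deletes subtract from them. A minimal plain-Java stand-in for the reducer (the name is illustrative):

```java
public class ReduceSketch {
    // Per-user total: (Upload + Download) - Delete, taking the
    // per-type byte sums produced by the combining step as input.
    static long totalTransactions(long uploadBytes, long downloadBytes, long deleteBytes) {
        return (uploadBytes + downloadBytes) - deleteBytes;
    }
}
```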
The reducer also updates the output in the HBase database. The CloudBuddy Enterprise server uses a web service to fetch the total transactions for every user from HBase and reports them to the CloudBuddy Enterprise user.