Large data set analyzing/processing on cluster machines

MapReduce framework is widely using for analyzing the large set of data. One important type of data analysis is done with MapReduce is log processing. MapReduce is basically runs on any distributed file storage system such as Hadoop File System (HDFS) from Apache Hadoop. This MapReduce Framework which runs on each HDFS Node and do the parallel aggregation of the log data then produce the result.

A Hadoop job is to analyze the log data from HDFS. The basic procedures to be followed to achieve the objective are:

  • Collect the necessary log files from the cluster machines, which are producing the actual logs.
  • Batch the logs into a file which is atleast equivalent to 64MB, which is the default HDFS block level storage.
  • Compress the batched log files.
  • Send the batched log files to HDFS.
  • Run the Hadoop jobs to analyze the batch log files using the MapReduce framework.
  • Store the results into HBase for future purpose

The challenge involved in developing such system is, ‘How are we going to collect the log data from the cluster machines?’

The generic way to overcome the above stated challenge is to write the logs directly into a message queue from all the cluster machines. The utility which is running on top of each HDFS node will read the messages from the queue and convert into a compressed batch log files. But this is not an optimal way of processing the log files.

Recently, Facebook open sourced their internal logging system called Scribe. Scribe is a server for aggregating streaming of log data. It is designed to scale to a large number of nodes and also fault tolerant. Facebook describes their usage of Scribe by saying, “[Scribe]¬†runs on thousands of machines and reliably delivers tens of billions of messages a day.”

The basic components for this Scribe Logging system are:

  • Thrift Client Interface
  • Distribution System

The Scribe logging system consists of two entries, a category and a message. A category is a high description and a message consists of the details about the log data.

The types of stores currently available are:

  • file – writes to a file, either local or nfs.
  • network – sends messages to another scribe server.
  • buffer – contains a primary and a secondary store. Messages are sent to the primary store if possible, and otherwise the secondary. When the primary store becomes available the messages are read from the secondary store and sent to the primary.
  • bucket – contains a large number of other stores, and decides which messages to send to which stores based on a hash.
  • null – discards all messages.
  • thriftfile – similar to a file store but writes messages into a Thrift TFileTransport file.
  • multi – a store that forwards messages to multiple stores.

Though Scribe serves to stream the logs to the HDFS nodes directly, Hadoop and HDFS definitely cannot solve any real-time problems. Hadoop jobs needs a start up cost of at least a few seconds. HDFS reads and writes have inadequate latency for anything real-time. Auditing/ analyzing MBs of data in very fast manner, good have this system in place.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s