MapleJuice

MapleJuice is a batch processing system that works like MapReduce.

Overview

For detailed design of MapleJuice’s components, see:

You can also check out our report.

Requirements

The code was developed and tested with Go v1.13. The only third-party library used is emirpasic/gods v1.12.0, for its treemap.

Usage

Run

go run main.go <port_number>

This writes logs to a log file (vm.log). To print logs to the screen, use

go run main.go <port_number> --log2screen

Port numbers 1234, 1235, and 1236 are already occupied and cannot be used.

A list of potential introducer addresses must be provided in the file introducer.config.

We also implemented a distributed log querier for debugging. You can check it out here.

Once the program is running, you can enter commands in the terminal.

Group Membership

An introducer needs to be started first by entering

introducer
join

Other nodes can join subsequently by entering

join

After joining the group, the following commands are available:

leave: leave the group

ml: print current node's membership list

id: print current node's id

The membership list is a list of member IDs. Each member ID is the concatenation of the member’s IP address, port number, and join time, in string format.
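As an illustration of how such an ID could be built, here is a minimal sketch. The function name makeMemberID, the ":" separator, and the Unix-nanosecond encoding of the join time are all assumptions made for this example; the actual format used by the code may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// makeMemberID concatenates IP address, port number and join time into one string.
// Hypothetical helper: the ":" separator and the Unix-nanosecond join time
// are assumptions, not the project's confirmed format.
func makeMemberID(ip string, port int, joinUnixNano int64) string {
	return ip + ":" + strconv.Itoa(port) + ":" + strconv.FormatInt(joinUnixNano, 10)
}

func main() {
	// Including the join time makes the ID differ each time the same
	// address and port rejoin the group.
	fmt.Println(makeMemberID("10.0.0.1", 8000, time.Now().UnixNano()))
}
```

Because the join time is part of the ID, a node that leaves and rejoins on the same address and port gets a new ID.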

Distributed File System

A node joins the DFS service as soon as it joins the group. The following commands are supported:

put <localfilename> <sdfsfilename>: copy localfilename from the local file system to sdfsfilename on SDFS

get <sdfsfilename> <localfilename>: copy sdfsfilename from SDFS to localfilename on the local file system

delete <sdfsfilename>: delete sdfsfilename from SDFS

ls <sdfsfilename>: list the nodes that store a replica of sdfsfilename

store: list the SDFS files stored on the current node

MapleJuice

To start the MapleJuice service, you first need to let nodes join the group, and then select a node as the master node by entering

master

and others as the worker node by entering

worker

A MapleJuice job is invoked with two commands, maple and juice. Overall, a job takes a corpus of SDFS files as input and outputs a single SDFS file. Only one job is processed at a time, but additional jobs can be submitted and are queued in the meantime. Two example applications (wordcount, URL access percentage) can be found here.

Maple

maple <maple_exe> <num_maples>
<sdfs_intermediate_filename_prefix> <sdfs_src_files>

The first parameter, maple_exe, is a user-specified executable that takes one file as input and outputs a series of (key, value) pairs. It is given by its SDFS file name. The second parameter, num_maples, specifies the number of Maple tasks. The last series of parameters (sdfs_src_files) specifies the location of the input files.
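To make the role of maple_exe concrete, here is a minimal sketch of the core of a wordcount-style Maple step. It is not the project's example application: the function name wordPairs and the tab-separated "key<TAB>value" line format are assumptions, and reading the input file is left out so the sketch stays short.

```go
package main

import (
	"fmt"
	"strings"
)

// wordPairs splits text on whitespace and returns one "word<TAB>1" line
// per word occurrence. The tab-separated line format is an assumption.
func wordPairs(text string) []string {
	var out []string
	for _, w := range strings.Fields(text) {
		out = append(out, w+"\t1")
	}
	return out
}

func main() {
	// A real maple_exe would read its input file instead of a fixed string.
	for _, line := range wordPairs("the quick fox jumps over the lazy dog") {
		fmt.Println(line)
	}
}
```

Each occurrence of a word produces its own pair; combining pairs that share a key is left to the Juice phase.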

The output of the Maple phase (not task) is a series of SDFS files, one per key. That is, for a key K, all (K, any_value) pairs output by any Maple task must be appended to the file sdfs_intermediate_filename_prefix_K. After the Juice phase is done, you will have the option to delete these intermediate files.
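The per-key grouping described above can be sketched as follows. The names pair, intermediateFileName, and groupByFile are hypothetical, and an in-memory map stands in for appending to SDFS files; only the prefix_K naming rule comes from the description above.

```go
package main

import "fmt"

// pair is one (key, value) output of a Maple task.
type pair struct{ key, value string }

// intermediateFileName builds the name of the file that collects key K:
// sdfs_intermediate_filename_prefix_K.
func intermediateFileName(prefix, key string) string {
	return prefix + "_" + key
}

// groupByFile collects the values of all pairs under the intermediate file
// name of their key. The map is a stand-in for appending to SDFS files.
func groupByFile(prefix string, pairs []pair) map[string][]string {
	files := make(map[string][]string)
	for _, p := range pairs {
		name := intermediateFileName(prefix, p.key)
		files[name] = append(files[name], p.value)
	}
	return files
}

func main() {
	pairs := []pair{{"apple", "1"}, {"pear", "1"}, {"apple", "1"}}
	for name, values := range groupByFile("wc", pairs) {
		fmt.Println(name, values)
	}
}
```

This sketch does not handle keys containing characters that are invalid in file names; a real implementation would need to decide how to encode them.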

Juice

juice <juice_exe> <num_juices>
<sdfs_intermediate_filename_prefix> <sdfs_dest_filename>
<delete_input={0,1}> [partitioner={range, hash}](optional, default=range)

The first parameter, juice_exe, is a user-specified executable that takes multiple (key, value) input lines, processes the lines that share the same key together as a group (just like Reduce), and outputs (key, value) pairs. It is given by its SDFS file name. The second parameter, num_juices, specifies the number of Juice tasks.
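As a rough illustration of what a juice_exe does, the sketch below sums counts per key, as a wordcount reducer would. The function name sumByKey and the tab-separated "key<TAB>value" line format are assumptions; how the engine actually hands input lines to the executable is not covered here.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// sumByKey groups "key<TAB>count" lines by key, adds up the counts, and
// returns one "key<TAB>total" line per key, sorted by key.
// Lines that do not match the assumed format are skipped.
func sumByKey(lines []string) []string {
	totals := make(map[string]int)
	for _, line := range lines {
		parts := strings.SplitN(line, "\t", 2)
		if len(parts) != 2 {
			continue
		}
		n, err := strconv.Atoi(strings.TrimSpace(parts[1]))
		if err != nil {
			continue
		}
		totals[parts[0]] += n
	}
	keys := make([]string, 0, len(totals))
	for k := range totals {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	out := make([]string, 0, len(keys))
	for _, k := range keys {
		out = append(out, k+"\t"+strconv.Itoa(totals[k]))
	}
	return out
}

func main() {
	// Fixed sample input; a real juice_exe would receive its lines from the engine.
	for _, line := range sumByKey([]string{"the\t1", "fox\t1", "the\t1"}) {
		fmt.Println(line)
	}
}
```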

Each Juice task is responsible for a portion of the keys: each key is allotted to exactly one Juice task by the master server. The Juice task fetches the relevant SDFS files (sdfs_intermediate_filename_prefix_K for each of its keys K), processes the input lines in them, and appends all its output to sdfs_dest_filename, sorted by key.
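The two partitioner options could work along the following lines. This is a sketch under stated assumptions, not the master's actual code: rangePartition and hashPartition are hypothetical names, range partitioning is assumed to mean contiguous chunks of the sorted key list, hash partitioning is assumed to use an FNV-1a hash modulo the number of tasks, and n is assumed to be positive.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// rangePartition sorts the keys and splits them into n contiguous chunks,
// so task i holds keys that sort before those of task i+1.
func rangePartition(keys []string, n int) [][]string {
	sorted := append([]string(nil), keys...) // copy so the caller's slice is untouched
	sort.Strings(sorted)
	parts := make([][]string, n)
	for i, k := range sorted {
		p := i * n / len(sorted) // index of the chunk that position i falls into
		parts[p] = append(parts[p], k)
	}
	return parts
}

// hashPartition assigns each key to task hash(key) mod n.
// The choice of FNV-1a is an assumption for this sketch.
func hashPartition(keys []string, n int) [][]string {
	parts := make([][]string, n)
	for _, k := range keys {
		h := fnv.New32a()
		h.Write([]byte(k))
		p := int(h.Sum32() % uint32(n))
		parts[p] = append(parts[p], k)
	}
	return parts
}

func main() {
	keys := []string{"delta", "alpha", "charlie", "bravo"}
	fmt.Println(rangePartition(keys, 2))
	fmt.Println(hashPartition(keys, 2))
}
```

In both schemes every key lands in exactly one task. Range partitioning keeps each task's keys contiguous in sort order, while hash partitioning spreads keys without regard to order.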

When the parameter delete_input is set to 1, the MapleJuice engine deletes the input files automatically after the Juice phase is done. If delete_input is set to 0, the Juice input files are left untouched.

Developers