Commit 827cc6c
Initial commit
syagev committed Aug 16, 2013

Showing 10 changed files with 482 additions and 0 deletions.
78 changes: 78 additions & 0 deletions README
###############################################################################
## WeizGrid
## By Stav Yagev, 2013 (Please contact me if you find bugs!)
##
##
## A simple framework for using the SGE cluster at Weizmann's CS department for parallel work.
##
## FEATURES:
## - Helps you split your code into pieces that can be run in parallel on the cluster.
## - Lets you aggregate results efficiently.
## - Write your code ONCE! Run the same code on your PC for debugging and on the
##   cluster for production.
## - Recover from errors on the cluster - splitting means that if 1 iteration out of
##   1000 fails, you still have the other 999 in hand!
##
## Use case example:
## You have a job that processes 1000 images, running the same algorithm on each
## image, outputting something per image, and finally aggregating the results from
## all images. Until now you ran this as a single job that took 1000 minutes (let's
## assume it takes 1 minute to process each image). With WeizGrid you can exploit the
## fact that the job is parallelizable - instead of 1 job, WeizGrid helps you easily
## split it into 80 parallel jobs, so you get all 1000 images in 1000/80 = 12.5
## minutes (!!). A minimal sketch of such a script appears right after this header.
##
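For orientation, here is a minimal MATLAB sketch of how such an image job could be
wired up with WeizGrid. It is only an illustration: processOneImage and the
sImageFile field are placeholder names you would define yourself; the WGexec and
WGgetResults calls follow the documentation in WGexec.m and WGgetResults.m.

    % one parameter struct per image (1000 iterations in total)
    for k = 1:1000
        subParams(k).sImageFile = sprintf('img_%04d.png', k);  % placeholder field
    end

    % split the work into 80 parallel cluster jobs and wait for them to finish
    job = WGexec('Name','Images1000', 'nparallels',80, ...
        'WorkFunc','processOneImage', 'SubParams',subParams, ...
        'WaitTillFinished',true);

    % collect the per-image results (and how many iterations were lost, if any)
    [allRes, bOK, nLost] = WGgetResults(job);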


Auto Installation:
==================
1. Copy all the UNIX scripts to ~/WeizGrid
2. Copy the rest of the files where you want
3. In a UNIX terminal, type:
chmod +x ~/WeizGrid/wgsetup
4. Then type:
~/WeizGrid/wgsetup

Usage:
======
1. Create files in the spirit of 'sample.m' and 'calcPrimes.m'
2. Upload them to UNIX
3. To start a job, under UNIX, from the directory of YOUR project, run:
~/WeizGrid/qsubWG [name] [your-m-file] [queue]

Example: ~/WeizGrid/qsubWG ParallelCracker sample all.q

4. Enjoy!

*** Check out sample.m and calcPrimes.m for a simple usage example! ***





Manual Installation (if auto doesn't work...):
=============================================
1. Copy all the UNIX scripts to ~/WeizGrid.
2. Copy the rest of the files where you want.
3. Make sure all bash scripts have execute permission.
4. Create the following directory structure (1 folder for each queue you intend to use):
~/.matlab/cluster_jobs/all.q
~/.matlab/cluster_jobs/test.q
...
NOTE: The ~/.matlab/ directory usually already exists but is hidden, so make sure
you are viewing hidden files and folders.
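
For example, assuming you intend to use the all.q and test.q queues, the whole
structure can be created with a single command:

mkdir -p ~/.matlab/cluster_jobs/all.q ~/.matlab/cluster_jobs/test.q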




Tips & Troubleshooting:
=======================
- If your algorithm uses random numbers, do not set the RngShuffle option to false
  when invoking WGexec - otherwise every "parallel" piece of work will generate the
  same random series.
- If a bug hits the aggregation part of your script, you might think all your work
  is lost. It is not - see the documentation of WGgetResults (its recovery mode) for
  how to collect the results again.
- There isn't verbose error checking, so if something doesn't work, check the output
  files in your UNIX folder for hints... If still unsuccessful, feel free to contact
  me for help.
138 changes: 138 additions & 0 deletions WGexec.m
function [WGjob] = WGexec( varargin )
%WGexec Executes a job in a parallel fashion using WeizGrid
% This is the core function of WeizGrid: after you've set up an array of
% parameters for your iterations, call this function and WG will spawn
% jobs to parallelize your work. Specify the following mandatory
% name/value pairs:
%
% Name - The name for the whole show. This will be used for file and job
% naming on the cluster. IT MUST NOT CONTAIN SPACES OR SPECIAL CHARACTERS.
%
% nparallels - Into how many cluster-jobs do you want to split the work.
% If for example you have 1000 iterations and you use 5, each sub-job
% spawned will process 200 iterations. Note that the job quota on
% Weizmann on the day of publishing this code was 80.
%
% WorkFunc - The name of your function that does the work. It must have
% the following signature:
% function [WGres,bSuccess] = doOneIteration(WGglobalParam, WGsubParam, k)
%
% The WG engine will inject into WGglobalParam the global parameters
% you set when calling this function, in WGsubParam you will have the
% parameters for a particular iteration, and in k the number of the
% iteration. The function is expected to return some result which can
% later be collected with WGgetResults, and a success boolean value.
% This success value can be later used when aggregating results for
% the purpose of filtering out "bad" results.
%
% SubParams - This should be an array of structures. Its length
% determines the total number of iterations. The k'th entry will be
% made available to the work function that processes the k'th
% iteration.
%
% GlobalParams (Default: Empty matrix) - This can be any MATLAB entity which
% will be made available to the work function for all iterations.
%
% LocalDebug (Default: false) - When set to true the WG engine will simply
% execute your iterations locally (without splitting into sub-jobs). This
% useful feature allows you to use THE EXACT SAME CODE when debugging on
% your PC and when running on the cluster.
%
% WaitTillFinished (Default:true) - If set to true, the function will return
% only after all sub-jobs have completed. Use this when aggregating results.
% IF YOU ATTEMPT TO CALL WGgetResults AFTER A CALL TO WGexec WITHOUT THIS
% OPTION YOU MUST ENSURE BY YOURSELF ALL SUB-JOBS ON THE CLUSTER FINISHED
% OTHERWISE BEHAVIOUR IS UNPREDICTABLE!
%
% RngShuffle (Default:true) - Because your job is split, every parallel piece
% of work would, by MATLAB's default behaviour, generate the same series of
% random numbers. To remedy this, leave this option set to true so that
% rng('shuffle') is invoked on every piece of parallel work.
%
% Return value: The function returns a structure identifying the job
% which can be used when calling WGgetResults.
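%
%   Example (an illustrative sketch only: doOneIteration stands for your own work
%   function, subParams for a struct array you prepared, and bOnPC for a debug
%   flag you set yourself):
%       job = WGexec('Name','MyJob', 'nparallels',80, ...
%           'WorkFunc','doOneIteration', 'SubParams',subParams, ...
%           'GlobalParams',myGlobals, 'LocalDebug',bOnPC, 'WaitTillFinished',true);
%       [res, bOK, nLost] = WGgetResults(job);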
%
%
% Written by Stav Yagev, 2013

global WGq;
WGjob.q = WGq;
WGjob.bLocalDebug = false;
bWait = true;
WGglobalParams = []; %#ok<NASGU>
bRngShuffle = true;

iMandatory = 0;
for i = 1 : 2 : length(varargin)
name = varargin{i};
value = varargin{i+1};
switch name
case 'nparallels'
WGjob.nparallels = value;
iMandatory = iMandatory + 1;
case 'Name'
WGjob.sName = value;
iMandatory = iMandatory + 1;
case 'WorkFunc'
sWorkFunc = value;
iMandatory = iMandatory + 1;
case 'LocalDebug'
WGjob.bLocalDebug = value;
case 'SubParams'
WGallSubParams = value;
WGjob.k = length(WGallSubParams);
iMandatory = iMandatory + 1;
case 'GlobalParams'
WGglobalParams = value; %#ok<NASGU>
case 'WaitTillFinished'
bWait = value;
case 'RngShuffle'
bRngShuffle = value;
otherwise
error(['Unknown option "' name '"']);
end
end
if (iMandatory < 4)
error('One of the 4 mandatory property/value pairs is missing');
end


%check if doing work locally or not
if (WGjob.bLocalDebug)
if (bRngShuffle)
rng('shuffle');
end

WGjob.WGres{WGjob.k} = [];
WGjob.bSuccess(WGjob.k) = false;
for k=1:WGjob.k
%run the work function directly; in local-debug mode nothing is split, so the
%full WGallSubParams array is indexed here
[WGjob.WGres{k},WGjob.bSuccess(k)] = ...
feval(sWorkFunc, WGglobalParams, WGallSubParams(k), k);
end

else

for i=1:WGjob.nparallels
%calculate the correct k-range
krng = [(i-1)*ceil(WGjob.k/WGjob.nparallels)+1 ...
min(WGjob.k, i*ceil(WGjob.k/WGjob.nparallels))];

%save the simulation data to a unique file
WGsubParams = WGallSubParams(krng(1):krng(2)); %#ok<NASGU>
save(sprintf('~/.matlab/cluster_jobs/%s/%s_%di',WGq,WGjob.sName,i), ...
'WGsubParams', 'WGglobalParams', 'krng', 'bRngShuffle');

%submit a job for this sub simulation to SGE
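%the resulting command has the form:
%  qsub -cwd -q <queue> -V -N <Name>_<i> ~/WeizGrid/dowork <queue> <Name> <i> <WorkFunc>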
system(sprintf('qsub -cwd -q %s -V -N %s_%d ~/WeizGrid/dowork %s %s %d %s', ...
WGq, WGjob.sName, i, WGq, WGjob.sName, i, sWorkFunc));
end

if (bWait)
%use an empty job to wait for all sub-simulations to finish
pause(10);
system(sprintf('qsub -cwd -q %s -V -sync yes -hold_jid "%s_*" -N %s_w ~/WeizGrid/empty', ...
WGq, WGjob.sName, WGjob.sName));
end
end

end
95 changes: 95 additions & 0 deletions WGgetResults.m
function [WGtotalRes, bTotalSuccess, nLost] = WGgetResults( WGjob, varargin )
%WGgetResults Returns the aggregated results from a WG job.
% Call this function after a call to WGexec has finished and you want to
% do post-processing with the results. This function has 2 modes. In the
% regular case (1 parameter only):
%
% WGjob - The identifier returned from WGexec.
%
% Return values:
% WGtotalRes - will be a cell array of size Kx1 (where K is
% the total number of iterations), and will contain the results of each
% iteration as returned by your work function.
%
% bTotalSuccess - a boolean vector of size Kx1 where each entry
% corresponds to the success code returned by your work function.
%
% nLost - the count of iterations that were lost to unknown errors.
% Suppose, for example, you had a 1000-iteration job split into 5, so that
% each parallel worker handled 200 iterations, and one of these sub-jobs
% failed due to some bug or unknown error. nLost will then be 200, so you
% know your data comes only from the 800 iterations that actually finished.
%
%
% Recovery mode
% --------------------------
% If there was a bug during post-processing, you may not have saved the
% results in a meaningful way. To avoid having to re-run the whole work,
% copy the files from the following UNIX directory to somewhere on your PC
% ~/.matlab/cluster_jobs/yourq.q/
% You can then invoke WGgetResults in the following manner to get the
% results again:
%
% WGjob - A structure constructed by YOU the following way:
% WGjob.k = should be the original total number of iterations
% WGjob.nparallels = the original number of parallels the work was split into
% WGjob.sName = the exact name used for the original job
%
% Also specify the option 'LocalFolder' with the path to where you put
% the files on your PC.
%
% Return values: exactly the same as in the regular case.
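%
%   Recovery example (an illustrative sketch; the numbers and the local path are
%   hypothetical and should match your original job):
%       WGjob.k = 1000;
%       WGjob.nparallels = 5;
%       WGjob.sName = 'MyJob';
%       [res, bOK, nLost] = WGgetResults(WGjob, 'LocalFolder', 'C:\recovered');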
%
%
% Written by Stav Yagev, 2013


if (length(varargin) >= 2 && strcmp(varargin{1},'LocalFolder') ...
&& ~isempty(varargin{2}))
bLocalFolder = true;
sLocalFolder = varargin{2};
else
bLocalFolder = false;
end

if (~isfield(WGjob,'bLocalDebug'))
WGjob.bLocalDebug = false;
end
if (bLocalFolder || ~WGjob.bLocalDebug)
nLost = 0;
WGtotalRes{WGjob.k} = [];
bTotalSuccess(WGjob.k) = false;

%collect results
for i=1:WGjob.nparallels
if (~bLocalFolder)
fname = sprintf('~/.matlab/cluster_jobs/%s/%s_%do.mat', ...
WGjob.q,WGjob.sName,i);
else
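%build the path to the local copy ('\\' in the format string produces a
%Windows-style '\' separator, assuming the files were copied to a PC)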
fname = sprintf([sLocalFolder '\\%s_%do.mat'],WGjob.sName,i);
end

if (exist(fname, 'file'))
%load the sub-simulation job data
load(fname);
WGtotalRes(krng(1):krng(2)) = WGres;
bTotalSuccess(krng(1):krng(2)) = bSuccess;

else
%calculate the correct k-range
krng = [(i-1)*ceil(WGjob.k/WGjob.nparallels)+1 ...
min(WGjob.k, i*ceil(WGjob.k/WGjob.nparallels))];
bTotalSuccess(krng(1):krng(2)) = false;

nLost = nLost + krng(2)-krng(1)+1;
end
end

else
WGtotalRes = WGjob.WGres;
bTotalSuccess = WGjob.bSuccess;

end

end
33 changes: 33 additions & 0 deletions calcPrimes.m
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% calcPrimes.m - Sample usage of WeizGrid
%
% Together with 'sample.m' this is a very simple example of
% using WeizGrid.
%
% In this example our main 'work' function takes 2 integers and finds the
% common prime factors between them. It also accepts a 'global' parameter that
% determines whether to include the number 1 as a common factor.
%
% By Stav Yagev, 2013
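%
% A hypothetical sketch of how a driver script (such as sample.m) might build the
% parameters this function expects - the field names X, Y and bCount1AsAFactor are
% taken from the code below, the values are made up - before passing them to WGexec
% as 'SubParams' and 'GlobalParams':
%   for k = 1:100
%       subParams(k).X = randi(10000);
%       subParams(k).Y = randi(10000);
%   end
%   globalParams.bCount1AsAFactor = true;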


function [WGres,bSuccess] = calcPrimes(WGglobalParam, WGsubParam, k)

%TODO: replace this with your own implementation code
fprintf('Processing iteration #%d\n',k);

f1 = factor(WGsubParam.X);
f2 = factor(WGsubParam.Y);
WGres = intersect(f1,f2);

if (WGglobalParam.bCount1AsAFactor)
WGres = [1 WGres];
end

%report success/failure
%this mechanism allows you to easily filter out iterations that are
%"wrong" in the aggregation stage. For the sake of the example, lets
%say we don't want to include pairs of numbers that are co-prime
bSuccess = ~isempty(WGres);

end
6 changes: 6 additions & 0 deletions dowork
#!/bin/bash
#$ -S /bin/bash

#Written by Stav Yagev, 2013
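#Arguments (as passed by WGexec): $1 = queue, $2 = job name, $3 = sub-job index,
#$4 = work function name.
#The one-liner below loads the sub-job's input file ($2_$3i.mat), runs the work
#function over its k-range, and saves the results to $2_$3o.mat in the same folder.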

matlab2012a -nodisplay -r "load('~/.matlab/cluster_jobs/$1/$2_$3i'); if (bRngShuffle) rng('shuffle'); end; WGres = cell(krng(2)-krng(1)+1,1); bSuccess = zeros(krng(2)-krng(1)+1,1); for k=krng(1):krng(2) [WGres{k-krng(1)+1},bSuccess(k-krng(1)+1)] = $4(WGglobalParams,WGsubParams(k-krng(1)+1),k); end; save('~/.matlab/cluster_jobs/$1/$2_$3o','WGres','krng','bSuccess'); quit;"
2 changes: 2 additions & 0 deletions empty
#!/bin/bash
#$ -S /bin/bash
20 changes: 20 additions & 0 deletions qsubWG
#!/bin/bash
#$ -S /bin/bash

#Usage: qsubWG [name] [m-file] [queue]
#Example: qsubWG ParallelCracker sample all.q

#Written by Stav Yagev, 2013



#this cleans up temp files from previous jobs (all of this is optional)
rm -f ~/.matlab/cluster_jobs/$1/*
rm -f *.e*
rm -f *.o*


#this invokes the main job; if you decide to do this yourself,
#make sure you include the -V and -cwd options when you qsub

qsub -cwd -q $3 -V -N $1 ~/WeizGrid/weizgrid $2 $3