Question

I have access to a 128-core cluster on which I would like to run a parallelised job. The cluster uses Sun Grid Engine, and my program is written to run using Parallel Python, numpy and scipy on Python 2.5.8. Running the job on a single node (4 cores) yields a ~3.5x improvement over a single core. I would now like to take this to the next level and split the job across ~4 nodes. My qsub script looks something like this:

#!/bin/bash
# The name of the job, can be whatever makes sense to you
#$ -N jobname

# The job should be placed into the queue all.q.
#$ -q all.q

# Redirect output stream to this file.
#$ -o jobname_output.dat

# Redirect error stream to this file.
#$ -e jobname_error.dat

# The batch system should use the current directory as the working directory.
# Both output files will be placed in the current
# directory. The batch system expects to find the executable in this directory.
#$ -cwd

# Request the Bourne shell as the shell for the job.
#$ -S /bin/sh

# print date and time
date

# spython is the server's version of Python 2.5. Using python instead of spython causes the program to run in Python 2.3
spython programname.py

# print date and time again
date

Does anyone have any idea of how to do this?

Answer 1

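Yes, you need to request 16 slots from a parallel environment with the Grid Engine option -pe, either in your script like this:

# Use 16 processors. Replace <pe_name> with a parallel environment
# configured on your cluster (qconf -spl lists them).
#$ -pe <pe_name> 16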

or on the command line when you submit the script. Or, for more permanent arrangements, use an .sge_request file.

On all the GE installations I've ever used this will give you 16 processors (or processor cores these days) on as few nodes as necessary, so if your nodes have 4 cores you'll get 4 nodes, if they have 8 you'll get 2, and so on. To place the job on, say, 2 cores on each of 8 nodes (which you might want to do if you need a lot of memory for each process) is a little more complicated, and you should consult your support team.
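To make the node arithmetic concrete, and to show how the Python program might find out which nodes it was given, here is a rough sketch rather than a tested recipe. It assumes the job runs under a parallel environment, in which case Grid Engine sets the PE_HOSTFILE environment variable to a file with one line per host (host name, slot count, queue, processor range). It also assumes a Parallel Python ppserver.py is already listening on every host, on port 60000 (Parallel Python's default as far as I know). The function names are made up for illustration.

```python
import os


def min_nodes(slots, cores_per_node):
    """Smallest number of nodes that can hold `slots` processes (ceiling division)."""
    return -(-slots // cores_per_node)


def read_pe_hosts(path):
    """Return a list of (hostname, slots) pairs read from a PE hostfile."""
    hosts = []
    fh = open(path)  # try/finally instead of `with`, so this also runs on Python 2.5
    try:
        for line in fh:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            hosts.append((fields[0], int(fields[1])))
    finally:
        fh.close()
    return hosts


def ppserver_addresses(hosts, port=60000):
    """Build "host:port" strings; the port is an assumption, adjust to your setup."""
    return tuple("%s:%d" % (name, port) for name, _slots in hosts)


if __name__ == "__main__":
    hostfile = os.environ.get("PE_HOSTFILE")  # only set inside a parallel job
    if hostfile:
        servers = ppserver_addresses(read_pe_hosts(hostfile))
        print(servers)
        # Hypothetical next step inside programname.py:
        #   import pp
        #   job_server = pp.Server(ppservers=servers)
```

With 16 slots, min_nodes(16, 4) gives 4 and min_nodes(16, 8) gives 2, matching the figures above. How the slots are actually spread over hosts depends on how the parallel environment is configured, so reading the hostfile is safer than assuming a layout.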
