SGE - QSUB fails to submit jobs in -sync mode

I have a perl script that prepares files for input to a binary program and submits the execution of the binary program to the SGE queueing system version 6.2u2.

The jobs are submitted with the -sync y option so that the parent perl script can monitor the status of the submitted jobs with the waitpid function.

This is also very useful because sending a SIGTERM to the parent perl script propagates this signal to each of the children, who then forward this signal onto qsub, thus gracefully terminating all associated submitted jobs.

Thus, it is fairly crucial that I be able to submit jobs with this -sync y option.
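For readers who want to try the pattern, here is a minimal shell sketch of the setup described above (the original script is Perl and is not shown, so this is only an approximation). It starts one blocking submit command per job, forwards SIGTERM to the children, and waits for each of them. SUBMIT_CMD and run_all are hypothetical names introduced for illustration; SUBMIT_CMD defaults to qsub -sync y and can be overridden to test without a cluster.

```shell
#!/bin/sh
# Hypothetical knob: the blocking submit command (assumed default below).
SUBMIT_CMD=${SUBMIT_CMD:-"qsub -sync y"}

# run_all JOB...: start one submit command per job in the background,
# forward SIGTERM to all of them, and return the number of failures.
run_all() {
    pids=""
    # On SIGTERM, pass the signal on to every child started so far.
    trap 'kill -TERM $pids 2>/dev/null' TERM
    for job in "$@"; do
        # Unquoted on purpose so the command and its options split into words.
        $SUBMIT_CMD "$job" &
        pids="$pids $!"
    done
    failed=0
    for pid in $pids; do
        # Shell counterpart of waitpid: block until this child exits.
        wait "$pid" || failed=$((failed + 1))
    done
    trap - TERM
    return "$failed"
}
```

Calling run_all job1.sh job2.sh (placeholder script names) would block until both submissions finish; a non-zero return value means at least one of them exited with an error.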

Unfortunately, I keep getting the following error:

Unable to initialize environment because of error: range_list containes no elements

Notice the misspelling of "containes". That is NOT a typo on my part. It just shows you how poorly maintained this area of the code/error message must be.

The attempted submissions that produce this error fail to even generate the STDOUT and STDERR files *.e{JOBID} and *.o{JOBID}. The submission just completely fails.

Searching Google for this error message only turns up unresolved posts on obscure message boards.

This error does not even occur reliably. I can rerun my script and the same jobs will not necessarily even generate the error. It also seems not to matter from which node I attempt to submit jobs.

My hope is that someone here can figure this out.

Answers to any of these questions would thus solve my problem:

  1. Does this error persist in more recent versions of SGE?
  2. Can I alter my command line options for qsub to avoid this?
  3. What the hell is this error message talking about?
Best answer

Our site hit this issue in SGE 6.2u5. I've posted some questions on the mailing list, but there was no solution. Until now.

It turns out that the error message is bogus. I discovered this by reading through the change logs on the Univa github "open-core" repo. I later saw the issue mentioned in the Son Of Gridengine v8.0.0c Release Notes.

Here are the related commits in the github repo:

What the error message should say is that you've hit the limit on the number of qsub -sync y jobs in the system. This parameter is known as MAX_DYN_EC. The default in our version was 99, and the changes above increase that default to 1000.

The definition of MAX_DYN_EC (from the sge_conf(5) man page) is:

Sets the max number of dynamic event clients (as used by qsub -sync y and by Grid Engine DRMAA API library sessions). The default is set to 99. The number of dynamic event clients should not be bigger than half of the number of file descriptors the system has. The number of file descriptors are shared among the connections to all exec hosts, all event clients, and file handles that the qmaster needs.

You can check how many dynamic event clients you are using with the following command:

$ qconf -secl | grep qsub | wc -l

We have added MAX_DYN_EC=1000 to qmaster_params via qconf -mconf. I've tested submitting hundreds of qsub -sync y jobs and we no longer hit the range_list error. Prior to the MAX_DYN_EC change, doing so would reliably trigger the error.
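If raising the limit is not an option, a script could check the client count before each submission. The sketch below is one possible way to do that, not something we run in production. It wraps the counting pipeline shown earlier; count_qsub_clients and has_free_slot are hypothetical helper names, and the limit argument is whatever MAX_DYN_EC is set to at your site.

```shell
#!/bin/sh
# Count the dynamic event clients that belong to qsub. Reads the output
# of `qconf -secl` on stdin, so it also works on saved output.
count_qsub_clients() {
    # grep -c prints 0 but exits non-zero when nothing matches.
    grep -c qsub || true
}

# has_free_slot LIMIT: succeed if the number of qsub event clients read
# from stdin is still below LIMIT.
has_free_slot() {
    limit=$1
    used=$(count_qsub_clients)
    [ "$used" -lt "$limit" ]
}
```

For example, qconf -secl | has_free_slot 99 succeeds only while fewer than 99 qsub clients are registered, so a wrapper could sleep and retry until it does. The check is inherently racy when several users submit at once, so it reduces the chance of the error rather than ruling it out.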

Other answers

I found a solution to this problem - or at the very least a workaround.

My goal was to get individual instances of qsub to remain in the foreground as the job that it submitted was still in the queue or running. This was achieved with the -sync option but resulted in the horribly unpredictable bug that I describe in my question.

The solution to this problem was to use the qrsh command with the -now n option. This causes the job to behave similarly to qsub -sync y, in that my script can implicitly monitor whether a submitted job is running by using waitpid on the qrsh instance.

The only caveat to this solution is that the queue you are operating on must not make any distinction between interactive nodes (offered by qrsh) and non-interactive nodes (accessible by qsub). Should a distinction exist (likely there are fewer interactive nodes than non-interactive) then this workaround may not help.
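In shell terms, the workaround amounts to roughly the following sketch (my actual script is Perl, so treat this as an illustration only). RUN_CMD and run_and_wait are hypothetical names; RUN_CMD defaults to qrsh -now n and can be pointed at any other command to try the pattern without a cluster.

```shell
#!/bin/sh
# Hypothetical knob: the blocking command used to run a job.
RUN_CMD=${RUN_CMD:-"qrsh -now n"}

# run_and_wait COMMAND [ARGS...]: start the job in the background, then
# block on its PID, which is the shell counterpart of waitpid.
run_and_wait() {
    # Unquoted on purpose so the command and its options split into words.
    $RUN_CMD "$@" &
    pid=$!
    # The return status of the function is whatever wait reports.
    wait "$pid"
}
```

The function returns the status that wait reports for the background process, so the caller can tell when the job is no longer running.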

However, as I have found nothing even close to a solution to the qsub -sync problem that is anywhere near as functional as this, let this post go out across the interwebs to any wayward soul caught in a similar situation.




