<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
On 11/04/2012 11:03 PM, Seyyed Mohtadin Hashemi wrote:
<blockquote
cite="mid:CAKJMjaLbzFhEE9AynnUuKONPti94Z-p68h_Nqa_kRTSktmfZhg@mail.gmail.com"
type="cite">Hello, <br>
<br>
I have a very peculiar problem: I have a micro cluster with three
nodes (18 cores total); the nodes are clones of each other and
connected to a frontend via Ethernet. I am using Debian squeeze as
the OS for all nodes. <br>
<br>
I have compiled a GROMACS v.4.5.5 environment with FFTW v.3.3.1
and OpenMPI v.1.4.2 (OpenMPI version that is standard for Debian).
On the nodes, individually, I can do simulations of any size and
complexity; however, as soon as I want to do a parallel job the
whole thing crashes. <br>
Since my own simulations may not be ideal for parallel runs, I
have used the gmxbench d.dppc files; this is the result I get: <br>
<br>
For a simple parallel job I use: path/mpirun -hostfile
path/machinefile -np XX path/mdrun_mpi -p tcp -s path/topol.tpr -o
path/output.trr <br>
For -np XX less than or equal to 10 it works; however, as soon as I
use 11 or more the whole thing crashes and I get: <br>
[host:xxxx] Signal: Bus error (7) <br>
<div>[host:xxxx] Signal code: Non-existant physical address (2) <br>
</div>
<div>[host:xxxx] Lots of lines with libmpi.so.0</div>
<br>
I have tried using different versions of OpenMPI, v.1.4.5 and all
the way to beta v.1.5.5, they all behave exactly the same. This is
making no sense. <br>
</blockquote>
<br>
Sounds like an MPI configuration problem. I'd get a simple test
program running on all 18 cores before worrying about anything else.<br>
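<br>
As a rough sketch of such a test, the script below launches the plain
<code>hostname</code> command through <code>mpirun</code> with an increasing
number of processes and reports the first count that fails. The function names
and the <code>machinefile</code> path are placeholders, not anything from your
setup, and because <code>hostname</code> is not an MPI program this only
exercises the launcher and the host configuration. A real MPI "hello world"
would also be needed to test communication between ranks.<br>

```python
import subprocess


def build_mpirun_cmd(num_procs, hostfile, program=("hostname",)):
    """Build an Open MPI launch command as an argument list.

    `hostfile` and `program` are placeholders; substitute your own paths.
    """
    return ["mpirun", "-np", str(num_procs), "--hostfile", hostfile, *program]


def first_failing_count(counts, hostfile, run=subprocess.run):
    """Try each process count in order and return the first one that fails.

    Returns None if every launch exits with status 0. The `run` argument
    defaults to subprocess.run and can be replaced for dry runs.
    """
    for num_procs in counts:
        result = run(
            build_mpirun_cmd(num_procs, hostfile),
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            return num_procs
    return None
```

Calling <code>first_failing_count(range(1, 19), "machinefile")</code> on the
frontend would show whether the failure at 11 processes also occurs without
GROMACS involved.<br>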
<br>
<blockquote
cite="mid:CAKJMjaLbzFhEE9AynnUuKONPti94Z-p68h_Nqa_kRTSktmfZhg@mail.gmail.com"
type="cite"><br>
<div>When I use threads over the OpenMPI interface I can get all
cores engaged, and the simulation runs for 5-7 minutes before
it gives the error “Cannot rename checkpoint file; maybe you are
out of quota?” even though I have more than 500 GB free on each
node.<br>
</div>
</blockquote>
<br>
Sounds like a filesystem availability problem. Checkpoint files are
written in the working directory, so what matters is whether that
directory's filesystem is available and writable from the node doing
the writing, not how much local disk space each node has.<br>
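<br>
One way to check this, sketched here under the assumption that Python is
available on the nodes, is to create and rename a small file in the working
directory from each node. The helper name <code>check_rename_in_dir</code> is
made up for illustration; it only mimics the create-then-rename pattern and is
not what mdrun does internally.<br>

```python
import os
import shutil
import tempfile


def check_rename_in_dir(directory="."):
    """Create a small file in `directory`, rename it, then clean up.

    Returns (ok, free_bytes, error_message). free_bytes is the free space
    on the filesystem holding `directory`, or None if it could not be read.
    """
    free_bytes = None
    created = []  # paths to remove afterwards, whatever happens
    try:
        free_bytes = shutil.disk_usage(directory).free
        fd, src = tempfile.mkstemp(prefix="cpt_test_", dir=directory)
        created.append(src)
        with os.fdopen(fd, "w") as handle:
            handle.write("checkpoint rename test\n")
        dst = src + ".renamed"
        os.rename(src, dst)
        created.append(dst)
        return True, free_bytes, ""
    except OSError as exc:
        return False, free_bytes, str(exc)
    finally:
        for path in created:
            if os.path.exists(path):
                os.remove(path)
```

Running this in the directory where mdrun is started, on each node in turn,
should show whether the rename fails there and with which system error.<br>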
<br>
Mark<br>
<br>
<blockquote
cite="mid:CAKJMjaLbzFhEE9AynnUuKONPti94Z-p68h_Nqa_kRTSktmfZhg@mail.gmail.com"
type="cite">
<div>
</div>
<div><br>
</div>
<div>I hope somebody can help me figure out what is wrong and
maybe a possible solution.</div>
<div><br>
</div>
<div>regards,</div>
<div>Mohtadin</div>
<br>
<br>
</blockquote>
<br>
</body>
</html>