<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    On 11/04/2012 11:03 PM, Seyyed Mohtadin Hashemi wrote:

    <blockquote

cite="mid:CAKJMjaLbzFhEE9AynnUuKONPti94Z-p68h_Nqa_kRTSktmfZhg@mail.gmail.com"

      type="cite">Hello, <br>

      <br>

      I have a very peculiar problem: I have a micro cluster with three

      nodes (18 cores total); the nodes are clones of each other and

      connected to a frontend via Ethernet. I am using Debian squeeze as

      the OS for all nodes. <br>

      <br>

      I have compiled a GROMACS v.4.5.5 environment with FFTW v.3.3.1

      and OpenMPI v.1.4.2 (OpenMPI version that is standard for Debian).

      On the nodes, individually, I can do simulations of any size and

      complexity; however, as soon as I want to do a parallel job the

      whole thing crashes. <br>

      Since my own simulations may not be ideal for parallel runs,

      I have used the gmxbench d.dppc files; this is the result I get: <br>

      <br>

      For a simple parallel job I use: path/mpirun -hostfile

      path/machinefile -np XX path/mdrun_mpi -p tcp -s path/topol.tpr -o

      path/output.trr&nbsp;<br>

      For -np XX of 10 or fewer it works; however, as soon as I

      use 11 or more the whole thing crashes and I get: <br>

      [host:xxxx] Signal: Bus error (7) <br>

      <div>[host:xxxx] Signal code: Non-existant physical address (2) <br>

      </div>

      <div>[host:xxxx] Lots of lines with libmpi.so.0</div>

      <br>

      I have tried different versions of OpenMPI, from v.1.4.5 all

      the way to beta v.1.5.5; they all behave exactly the same. This

      makes no sense. <br>

    </blockquote>

    <br>

    Sounds like an MPI configuration problem. I'd get a test program

    running on 18 cores before worrying about anything else.<br>
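    <br>
    As a rough sketch of such a pre-flight check (not something shipped with GROMACS or Open MPI), the Python below counts the slots declared in a hostfile and builds an mpirun command that only launches hostname on each rank. It assumes Open MPI's "host slots=N" hostfile syntax and treats a host without a slots entry as one slot; the node names, the 6-slots-per-node layout and the helper names are hypothetical.<br>
    <pre>
```python
def count_slots(hostfile_text):
    """Sum the slots declared in an Open MPI style hostfile."""
    total = 0
    for raw in hostfile_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        slots = 1  # assumption: a host with no slots= entry counts once
        for field in line.split()[1:]:
            key, _, value = field.partition("=")
            if key == "slots":
                slots = int(value)
        total += slots
    return total


def smoke_test_command(hostfile_path, nprocs):
    """Build an mpirun call that starts a trivial program on every rank."""
    return ["mpirun", "--hostfile", hostfile_path,
            "-np", str(nprocs), "hostname"]


if __name__ == "__main__":
    # Hypothetical hostfile: three cloned nodes with 6 cores each.
    example = "node1 slots=6\nnode2 slots=6\nnode3 slots=6\n"
    print(count_slots(example))
    print(" ".join(smoke_test_command("machinefile", 18)))
```
    </pre>
    If the summed slots fall short of the -np value, or the hostname run itself fails at 11 or more ranks, that would point to the MPI setup rather than to mdrun_mpi.<br>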

    <br>

    <blockquote

cite="mid:CAKJMjaLbzFhEE9AynnUuKONPti94Z-p68h_Nqa_kRTSktmfZhg@mail.gmail.com"

      type="cite"><br>

      <div>When I use threads over the OpenMPI interface I can get all

        cores engaged, and the simulation runs for 5-7 minutes before

        failing with the error &#8220;Cannot rename checkpoint file; maybe you are

        out of quota?&#8221; even though I have more than 500 GB free on each

        node.<br>

      </div>

    </blockquote>

    <br>

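    Sounds like a filesystem availability problem. Checkpoint files are

    written in the working directory, so what matters is whether the

    filesystem holding that directory stays available and writable during

    the run; the free local disk space is not strictly relevant.<br>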
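    <br>
    One way to probe this, as a minimal sketch: the Python below creates a small file in a directory, renames it and cleans up, mimicking the write-then-rename step that the error message suggests. It is not GROMACS code; the function name and the file prefix are made up for illustration.<br>
    <pre>
```python
import os
import tempfile


def can_rename_in(directory):
    """Create a small file in directory, rename it, and report success."""
    try:
        fd, src = tempfile.mkstemp(prefix="cpt_probe_", dir=directory)
    except OSError:
        return False  # could not even create a file here
    dst = src + ".renamed"
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write("probe")
        os.rename(src, dst)
        return True
    except OSError:
        return False
    finally:
        # Remove whichever of the two probe files is still present.
        for path in (src, dst):
            if os.path.exists(path):
                os.remove(path)


if __name__ == "__main__":
    # Run this from the directory where mdrun writes its checkpoints.
    print(can_rename_in(os.getcwd()))
```
    </pre>
    Running it from the job's working directory on each node should show whether renames succeed there, independently of how much disk space is free.<br>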

    <br>

    Mark<br>

    <br>

    <blockquote

cite="mid:CAKJMjaLbzFhEE9AynnUuKONPti94Z-p68h_Nqa_kRTSktmfZhg@mail.gmail.com"

      type="cite">

      <div>

      </div>

      <div><br>

      </div>

      <div>I hope somebody can help me figure out what is wrong and

        maybe suggest a possible solution.</div>

      <div><br>

      </div>

      <div>regards,</div>

      <div>Mohtadin</div>

      <br>


      <br>

    </blockquote>

    <br>

  </body>

</html>