Unable to schedule task with mem_per_rank #3353

Description

@mgoliyad

Unable to schedule the following Task on anvil:

    self.tmgr.submit_tasks(rp.TaskDescription({
        'uid': ru.generate_id(f'{self.name}.1.%(item_counter)06d',
                              ru.ID_CUSTOM, ns=self.tmgr.session.uid),
        'name': 'T1.initial.mpnn.run',
        'executable': 'python',
        'ranks': 1,
        'mem_per_rank': 1,
        'arguments': [f'{self.base_path}/mpnn_wrapper.py',
                      f'-pdb={self.input_path}/',
                      f'-out={self.output_path_mpnn}/job_{self.passes}/',
                      f'-mpnn={self.protein_path}/',
                      f'-seqs={self.num_seqs}',
                      '-is_monomer=0',
                      '-chains=A'],
        'pre_exec': TASK_PRE_EXEC
    }))

    1747918605.983 : agent_scheduling.0000 : 2335682 : 23134291756800 : ERROR : scheduling failed for p2.1.000000
    Traceback (most recent call last):
      File "/anvil/scratch/x-mgoliyad1/impress/test.impress/lib/python3.9/site-packages/radical/pilot/agent/scheduler/base.py", line 993, in _schedule_incoming
        if self._try_allocation(task):
      File "/anvil/scratch/x-mgoliyad1/impress/test.impress/lib/python3.9/site-packages/radical/pilot/agent/scheduler/base.py", line 1144, in _try_allocation
        slots, partition = self.schedule_task(task)
      File "/anvil/scratch/x-mgoliyad1/impress/test.impress/lib/python3.9/site-packages/radical/pilot/agent/scheduler/continuous.py", line 347, in schedule_task
        assert mem_per_slot <= mem_per_node,
    AssertionError: too much mem per proc 1
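
For anyone triaging this: the snippet below is a minimal standalone sketch of the comparison shown in the traceback, not RP's actual scheduler code. The function name check_mem and the node-memory values are hypothetical. It only illustrates that a request of 1 can fail the check if the per-node memory seen by the scheduler is smaller than 1 (for example 0). Whether that is what happens on anvil is an assumption that the session logs would have to confirm.

```python
def check_mem(mem_per_slot, mem_per_node):
    """Hypothetical stand-in for the check in the traceback:
    the memory requested per rank must fit on a single node."""
    assert mem_per_slot <= mem_per_node, \
        f'too much mem per proc {mem_per_slot}'
    return True


# Illustrative values only; the real per-node memory is not shown in the issue.
check_mem(1, 256 * 1024)          # passes: the request fits on the node

try:
    check_mem(1, 0)               # assumed case: node memory reported as 0
except AssertionError as exc:
    print(f'AssertionError: {exc}')
```

If the assumption holds, the value to inspect would be whatever per-node memory the agent scheduler derives for the pilot, rather than the task description itself.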

The session path is

/anvil/projects/x-dmr140125/radical.pilot.sandbox/rp.session.g001.anvil.rcac.purdue.edu.x-mgoliyad1.020230.0005/pilot.0000

When encountering an issue during the execution of a RADICAL-Pilot (RP) application, please check whether the source of the error is in the application code or in the code executed by the compute units (i.e., executable). If you suspect that RP is the source of the error, please open a ticket at https://github.com/radical-cybertools/radical.pilot/issues, following these steps:

  1. Enable verbose messages: Run your application script again with the RADICAL_VERBOSE=DEBUG and RADICAL_PILOT_VERBOSE=DEBUG environment variables set. By default, RP writes debug messages to standard error, but you may want to redirect them to a single file. For example, with bash: RADICAL_VERBOSE=DEBUG RADICAL_PILOT_VERBOSE=DEBUG python example.py &> debug.out.

  2. Client and remote logs in RP: RP creates multiple log files in a client-side sandbox and a server-side sandbox. The client-side sandbox is created in the working directory on the client machine (where you launched your application script); the server-side sandbox is created on the remote machine (HPC) in a predefined location. You can collect all the logs by running the following command on the client machine: radical-pilot-fetch-logfiles <session id>. To determine the session id, look in the debug logs or for a folder created in the directory from which you launched the application script on the client machine. That folder's name has the format rp.session.*. You can find the latest folder with ls -ltr (the most recent is listed last). The radical-pilot-fetch-logfiles command collects all the log files into that rp.session.* folder. Please tar and (b/g)zip that folder and attach it to the GitHub ticket.

  3. Provide information about the error: After fetching all the log files, go into the rp.session.* folder and run the command grep -rl ERROR . (the trailing dot is the current directory). Please include the output of that command in the ticket.
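
Steps 2 and 3 can also be scripted. The following is a rough Python sketch, using only the standard library, that lists the files containing ERROR (similar in spirit to grep -rl ERROR .) and packs the session folder into a .tar.gz. The helper names files_with_errors and archive_session are made up for this example and are not part of RP.

```python
import tarfile
from pathlib import Path


def files_with_errors(session_dir):
    """Return the sorted paths of all files under session_dir
    that contain the byte string ERROR on some line."""
    hits = []
    for path in Path(session_dir).rglob('*'):
        if not path.is_file():
            continue
        try:
            with path.open('rb') as fin:
                if any(b'ERROR' in line for line in fin):
                    hits.append(str(path))
        except OSError:
            # unreadable file: skip it rather than abort the scan
            continue
    return sorted(hits)


def archive_session(session_dir):
    """Create <session_dir>.tar.gz next to the session folder
    and return the path of the archive."""
    session_dir = Path(session_dir)
    archive = session_dir.with_name(session_dir.name + '.tar.gz')
    with tarfile.open(archive, 'w:gz') as tar:
        tar.add(session_dir, arcname=session_dir.name)
    return archive


# Example usage (the folder name is a placeholder):
#   for name in files_with_errors('rp.session.example'):
#       print(name)
#   print(archive_session('rp.session.example'))
```

The archive is written beside the session folder rather than inside it, so a later scan of the folder does not pick up the tarball.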
