CLI created jobs are not picked up by the balsam launcher/scheduler #387

Open
fbhuiyan2 opened this issue Dec 25, 2024 · 0 comments

Jobs created through the CLI do not get picked up by the Balsam launcher/scheduler, as shown below.

/lus/eagle/projects/CSTEELML/fbhuiyan/Projects/1_CSTEEL/test $ balsam site ls
ID      Name                         Path                                       Active
672     polaris_tutorial             ...hop/workflows/balsam/polaris_tutorial   No  
673     mace_dft_test                ...rust-2/1_ff_train/2-df-sp/basalm_jobs   No  
674     sophia_tutorial              ...shop/workflows/balsam/sophia_tutorial   No  
675     mace_dft_test_sophia         ...st-2/1_ff_train/2-df-sp/balsam_sophia   No  
682     csteel-sophia                ...uiyan/Projects/1_CSTEEL/balsam-sophia   Yes 
683     test_polaris                 ...EELML/fbhuiyan/Projects/1_CSTEEL/test   Yes 
(baseplus) /lus/eagle/projects/CSTEELML/fbhuiyan/Projects/1_CSTEEL/test $ balsam app ls
   ID                 Name   Site                
 2714       Lammps_general   test_polaris        
(baseplus) /lus/eagle/projects/CSTEELML/fbhuiyan/Projects/1_CSTEEL/test $ balsam job create --site test -w nacl-melt --app Lammps_general -p lmp_path=$EXE -p input_file=in.lammps -p ngpus=4 -tag job=lammps --num-nodes 1 --ranks-per-node 4 --gpus-per-rank 4 --node-packing-count 2 --wall-time-min 10 --threads-per-rank 16
workdir: nacl-melt
tags:
    job: lammps
serialized_parameters: gASVgAAAAAAAAAB9lCiMCGxtcF9wYXRolIxIL2x1cy9lYWdsZS9wcm9qZWN0cy9DU1RFRUxNTC9mYmh1aXlhbi9sYW1tcHMvbWFjZV9idWlsZC9sYW1tcHMvYnVpbGQvbG1wlIwKaW5wdXRfZmlsZZSMCWluLmxhbW1wc5SMBW5ncHVzlIwBNJR1Lg==
data: {}
return_code: null
num_nodes: 1
ranks_per_node: 4
threads_per_rank: 16
threads_per_core: 1
launch_params: {}
gpus_per_rank: 4.0
node_packing_count: 2
wall_time_min: 10
app_id: 2714
site_name: null
parameters:
    lmp_path: /lus/eagle/projects/CSTEELML/fbhuiyan/lammps/mace_build/lammps/build/lmp
    input_file: in.lammps
    ngpus: '4'
parent_ids: []
transfers: {}

Do you want to create this Job? [y/N]: y
Added Job id=40674082
(baseplus) /lus/eagle/projects/CSTEELML/fbhuiyan/Projects/1_CSTEEL/test $ balsam job ls
ID         Site           App              Workdir     State          Tags               
40674082   test_polaris   Lammps_general   nacl-melt   PREPROCESSED   {'job': 'lammps'} 
(baseplus) /lus/eagle/projects/CSTEELML/fbhuiyan/Projects/1_CSTEEL/test $ balsam queue submit -q 'debug' -A CSTEELML -n 1 -t 30 -j mpi
num_nodes: 1
wall_time_min: 30
job_mode: mpi
optional_params: {}
filter_tags: {}
partitions: []
id: 30045
site_id: 683
scheduler_id: null
project: CSTEELML
queue: debug
state: pending_submission
status_info: {}
start_time: null
end_time: null

polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov: 
                                                                 Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
3120371.polaris-pbs* fbhuiyan debug    qlaunch30*    --    1  64    --  00:30 R   -- 

Here is the MPI log file output:

"""
2024-12-25 07:37:04.931 | 3619804 | INFO | balsam:114] Configured logging on x3005c0s1b1n0

2024-12-25 07:37:04.932 | 3619804 | INFO | balsam.platform.compute_node.alcf_polaris_node:51] x3005c0s1b1n0.hsn.cm.polaris.alcf.anl.gov detected GPU IDs: [3, 2, 1, 0]

2024-12-25 07:37:06.206 | 3619804 | INFO | balsam.site.launcher.mpi_mode:120] Job Acquisition: 1 empty nodes; 1.0 aggregate free nodes; requested up to 990 jobs [node packing allowed: True]; Acquired 1 jobs.

2024-12-25 07:37:06.208 | 3619804 | WARNING | balsam.site.launcher.mpi_mode:158] Insufficient resources to place Job 40674082 nacl-melt. Stashing for later launch.

2024-12-25 07:37:07.211 | 3619804 | WARNING | balsam.site.launcher.mpi_mode:158] Insufficient resources to place Job 40674082 nacl-melt. Stashing for later launch.

2024-12-25 07:38:07.208 | 3619804 | WARNING | balsam.site.launcher.mpi_mode:158] Insufficient resources to place Job 40674082 nacl-melt. Stashing for later launch.

2024-12-25 07:38:07.208 | 3619804 | INFO | balsam.site.launcher.mpi_mode:74] Exceeded 60 sec TTL: shutting down because nothing to do

2024-12-25 07:38:07.208 | 3619804 | INFO | balsam.site.launcher.mpi_mode:227] Launcher starting shutdown sequence

2024-12-25 07:38:07.208 | 3619804 | INFO | balsam.site.job_source:212] Signal: JobSource cancelling tick thread and deleting API Session

2024-12-25 07:38:12.437 | 3619804 | INFO | balsam.site.job_source:214] JobSource exit graceful

2024-12-25 07:38:12.437 | 3619804 | INFO | balsam.site.launcher.mpi_mode:229] Timing out active runs

2024-12-25 07:38:13.231 | 3619808 | INFO | balsam.site.status_updater:50] Signal: break out of StatusUpdater main loop

2024-12-25 07:38:13.232 | 3619808 | INFO | balsam.site.status_updater:52] StatusUpdater thread finished.
"""

However, a job created through the Python API in the same site and for the same app runs just fine. Is this a bug?
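
One thing that may be worth checking, offered as an assumption and not a confirmed diagnosis: the CLI job above asks for 4 ranks per node with 4 GPUs per rank, while the launcher log reports only four GPU IDs on the node. The sketch below just does that arithmetic with the values printed above. The helper name `gpus_needed_per_node` is hypothetical, and the formula (ranks per node times GPUs per rank) is an assumption about how the request is counted, not something taken from the Balsam source.

```python
# Values copied from the `balsam job create` output and the launcher log above.
ranks_per_node = 4
gpus_per_rank = 4.0
detected_gpu_ids = [3, 2, 1, 0]  # from "detected GPU IDs: [3, 2, 1, 0]"


def gpus_needed_per_node(ranks: int, gpus_each: float) -> float:
    """Hypothetical helper: GPUs one node must supply for this job.

    Assumption: the per-node GPU request is ranks_per_node * gpus_per_rank.
    """
    return ranks * gpus_each


needed = gpus_needed_per_node(ranks_per_node, gpus_per_rank)
available = len(detected_gpu_ids)
fits = needed <= available

print(f"GPUs needed per node: {needed}, GPUs detected: {available}, fits: {fits}")
```

If that accounting is right, the request would exceed what one node offers, which would be consistent with the repeated "Insufficient resources to place Job" warnings. Comparing `gpus_per_rank` on the job created through the Python API would show whether the two jobs actually request the same resources.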
