Commit 880a3db

Merge pull request #3 from hschellman/gh-pages
new example of running a batch job with local code
2 parents 72316e4 + 115b391 commit 880a3db

21 files changed

Lines changed: 1584 additions & 49 deletions

CITATION

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,3 @@
11
Please cite as:
22

3-
Dune Collaboration: "DUNE Computing Tutorial" Version 2024.01
3+
Dune Collaboration: "DUNE Computing Tutorial" Version 2025.01

_episodes/02-submit-jobs-w-justin.md

Lines changed: 9 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1,27 +1,27 @@
11
---
2-
title: Submit grid jobs with JustIn
2+
title: New justIN Job Submission System
33
teaching: 20
44
exercises: 0
55
questions:
6-
- How to submit realistic grid jobs with JustIn
6+
- How to submit realistic grid jobs with justIN
77
objectives:
8-
- Demonstrate use of [justIn](https://dunejustin.fnal.gov) for job submission with more complicated setups.
8+
- Demonstrate use of [justIN](https://dunejustin.fnal.gov) for job submission with more complicated setups.
99
keypoints:
1010
- Always, always, always prestage input datasets. No exceptions.
1111
---
1212

13-
# PLEASE USE THE NEW [justIn](https://dunejustin.fnal.gov) SYSTEM INSTEAD OF POMS
13+
# PLEASE USE THE NEW [justIN](https://dunejustin.fnal.gov) SYSTEM INSTEAD OF POMS
1414

15-
__A simple [justIn](https://dunejustin.fnal.gov) Tutorial is currently in docdb at: [JustIn Tutorial](https://docs.dunescience.org/cgi-bin/sso/RetrieveFile?docid=30145)__
15+
__A simple [justIN](https://dunejustin.fnal.gov) Tutorial is currently in docdb at: [justIN Tutorial](https://docs.dunescience.org/cgi-bin/sso/RetrieveFile?docid=30145)__
1616

1717
A more detailed tutorial is available at:
18-
[JustIn Docs](https://dunejustin.fnal.gov/docs/)
18+
[justIN Docs](https://dunejustin.fnal.gov/docs/)
1919

20-
The [justIn](https://dunejustin.fnal.gov) system is described in detail at:
20+
The [justIN](https://dunejustin.fnal.gov) system is described in detail at:
2121

22-
__[JustIn Home](https://dunejustin.fnal.gov/dashboard/)__
22+
__[justIN Home](https://dunejustin.fnal.gov/dashboard/)__
2323

24-
__[JustIn Docs](https://dunejustin.fnal.gov/docs/)__
24+
__[justIN Docs](https://dunejustin.fnal.gov/docs/)__
2525

2626

2727
> ## Note More documentation coming soon

_episodes/07-grid-job-submission.md

Lines changed: 21 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
---
2-
title: Jobsub Grid Job Submission and Common Errors - still 2024 version
2+
title: Jobsub Grid Job Submission and Common Errors (SPECIAL PURPOSE)
33
teaching: 65
44
exercises: 0
55
questions:
@@ -68,8 +68,8 @@ The past few months have seen significant changes in how DUNE (as well as other
6868
First, log in to a `dunegpvm` machine. Then you will need to set up the job submission tools (`jobsub`). If you set up `dunesw` it will be included, but if not, you need to do
6969

7070
~~~
71-
mkdir -p /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_sep2025 # if you have not done this before
72-
mkdir -p /pnfs/dune/scratch/users/${USER}/sep2025tutorial
71+
mkdir -p /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_jan2026 # if you have not done this before
72+
mkdir -p /pnfs/dune/scratch/users/${USER}/jan2026tutorial
7373
~~~
7474
{: .language-bash}
7575

@@ -190,16 +190,16 @@ You will have to change the last line with your own submit file instead of the p
190190
First, we should make a tarball. Here is what we can do (assuming you are starting from /exp/dune/app/users/username/):
191191

192192
```bash
193-
cp /exp/dune/app/users/kherner/setupsep2025tutorial-grid.sh /exp/dune/app/users/${USER}/
194-
cp /exp/dune/app/users/kherner/sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid /exp/dune/app/users/${USER}/sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid
193+
cp /exp/dune/app/users/kherner/setupjan2026tutorial-grid.sh /exp/dune/app/users/${USER}/
194+
cp /exp/dune/app/users/kherner/jan2026tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid /exp/dune/app/users/${USER}/jan2026tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid
195195
```
196196

197197
Before we continue, let's examine these files a bit. We will source the first one in our job script, and it will set up the environment for us.
198198

199199
~~~
200200
#!/bin/bash
201201
202-
DIRECTORY=sep2025tutorial
202+
DIRECTORY=jan2026tutorial
203203
# we cannot rely on "whoami" in a grid job. We have no idea what the local username will be.
204204
# Use the GRID_USER environment variable instead (set automatically by jobsub).
205205
USERNAME=${GRID_USER}
@@ -217,40 +217,38 @@ mrbslp
217217

218218

219219
Now let's look at the difference between the setup-grid script and the plain setup script.
220-
Assuming you are currently in the /exp/dune/app/users/username directory:
220+
Assuming you are currently in the `/exp/dune/app/users/$USER` directory:
221221

222222
```bash
223-
diff sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid
223+
diff jan2026tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup jan2026tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid
224224
```
225225

226226
~~~
227-
< setenv MRB_TOP "/exp/dune/app/users/<username>/sep2025tutorial"
228-
< setenv MRB_TOP_BUILD "/exp/dune/app/users/<username>/sep2025tutorial"
229-
< setenv MRB_SOURCE "/exp/dune/app/users/<username>/sep2025tutorial/srcs"
230-
< setenv MRB_INSTALL "/exp/dune/app/users/<username>/sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof"
227+
< setenv MRB_TOP "/exp/dune/app/users/<username>/jan2026tutorial"
228+
< setenv MRB_TOP_BUILD "/exp/dune/app/users/<username>/jan2026tutorial"
229+
< setenv MRB_SOURCE "/exp/dune/app/users/<username>/jan2026tutorial/srcs"
230+
< setenv MRB_INSTALL "/exp/dune/app/users/<username>/jan2026tutorial/localProducts_larsoft_v09_72_01_e20_prof"
231231
---
232-
> setenv MRB_TOP "${INPUT_TAR_DIR_LOCAL}/sep2025tutorial"
233-
> setenv MRB_TOP_BUILD "${INPUT_TAR_DIR_LOCAL}/sep2025tutorial"
234-
> setenv MRB_SOURCE "${INPUT_TAR_DIR_LOCAL}/sep2025tutorial/srcs"
235-
> setenv MRB_INSTALL "${INPUT_TAR_DIR_LOCAL}/sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof"
232+
> setenv MRB_TOP "${INPUT_TAR_DIR_LOCAL}/jan2026tutorial"
233+
> setenv MRB_TOP_BUILD "${INPUT_TAR_DIR_LOCAL}/jan2026tutorial"
234+
> setenv MRB_SOURCE "${INPUT_TAR_DIR_LOCAL}/jan2026tutorial/srcs"
235+
> setenv MRB_INSTALL "${INPUT_TAR_DIR_LOCAL}/jan2026tutorial/localProducts_larsoft_v09_72_01_e20_prof"
236236
~~~
237237

238238
As you can see, we have switched from the hard-coded directories to directories defined by environment variables; the `INPUT_TAR_DIR_LOCAL` variable will be set for us (see below).
239-
Now, let's actually create our tar file. Again assuming you are in `/exp/dune/app/users/kherner/sep2025tutorial/`:
239+
Now, let's actually create our tar file. Again assuming you are in `/exp/dune/app/users/${USER}/`, the directory that contains `jan2026tutorial/`:
240240
```bash
241-
tar --exclude '.git' -czf sep2025tutorial.tar.gz sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof sep2025tutorial/work setupsep2025tutorial-grid.sh
241+
tar --exclude '.git' -czf jan2026tutorial.tar.gz jan2026tutorial/localProducts_larsoft_${DUNESW_VERSION}_${DUNESW_QUALIFIER} jan2026tutorial/work setupjan2026tutorial-grid.sh
242242
```
243243
Note how we have excluded the contents of ".git" directories in the various packages, since we don't need any of that in our jobs. It turns out that the .git directory can sometimes account for a substantial fraction of a package's size on disk!
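One quick way to confirm that the exclusion worked is to list the archive and count any `.git` entries. This is only a sketch: the helper name `count_git_entries` is hypothetical and not part of any DUNE or jobsub tool.

```bash
#!/bin/bash
# count_git_entries (hypothetical helper): print how many members of a
# gzipped tarball are a ".git" directory or live inside one.
count_git_entries() {
    local tarball="$1"
    # List member names only; grep -c prints the number of matching lines.
    # grep exits non-zero when that number is 0, so mask it with "|| true".
    tar -tzf "$tarball" | grep -cE '(^|/)\.git(/|$)' || true
}
```

For a tarball made with `--exclude '.git'` as above, `count_git_entries jan2026tutorial.tar.gz` should print `0`. Files such as `.gitignore` are deliberately not counted.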
244244

245245
Then submit another job (in the following we keep the same submit file as above):
246246

247-
```bash
248-
jobsub_submit -G dune --mail_always -N 1 --memory=2500MB --disk=2GB --expected-lifetime=3h --cpu=1 --tar_file_name=dropbox:///exp/dune/app/users/<username>/sep2025tutorial.tar.gz --singularity-image /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest --append_condor_requirements='(TARGET.HAS_Singularity==true&&TARGET.HAS_CVMFS_dune_opensciencegrid_org==true&&TARGET.HAS_CVMFS_larsoft_opensciencegrid_org==true&&TARGET.CVMFS_dune_opensciencegrid_org_REVISION>=1105&&TARGET.HAS_CVMFS_fifeuser1_opensciencegrid_org==true&&TARGET.HAS_CVMFS_fifeuser2_opensciencegrid_org==true&&TARGET.HAS_CVMFS_fifeuser3_opensciencegrid_org==true&&TARGET.HAS_CVMFS_fifeuser4_opensciencegrid_org==true)' -e GFAL_PLUGIN_DIR=/usr/lib64/gfal2-plugins -e GFAL_CONFIG_DIR=/etc/gfal2.d file:///exp/dune/app/users/kherner/run_sep2025tutorial.sh
249-
```
247+
250248

251249
You'll see this is very similar to the previous case, but there are some new options:
252250

253-
* `--tar_file_name=dropbox://` automatically **copies and untars** the given tarball into a directory on the worker node, accessed via the INPUT_TAR_DIR_LOCAL environment variable in the job. The value of INPUT_TAR_DIR_LOCAL is by default $CONDOR_DIR_INPUT/name_of_tar_file_without_extension, so if you have a tar file named e.g. sep2025tutorial.tar.gz, it would be $CONDOR_DIR_INPUT/sep2025tutorial.
251+
* `--tar_file_name=dropbox://` automatically **copies and untars** the given tarball into a directory on the worker node, accessed via the INPUT_TAR_DIR_LOCAL environment variable in the job. The value of INPUT_TAR_DIR_LOCAL is by default $CONDOR_DIR_INPUT/name_of_tar_file_without_extension, so if you have a tar file named e.g. jan2026tutorial.tar.gz, it would be $CONDOR_DIR_INPUT/jan2026tutorial.
254252
* Notice that the `--append_condor_requirements` line is longer now, because we also check for the fifeuser[1-4].opensciencegrid.org CVMFS repositories.
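The default naming rule for `INPUT_TAR_DIR_LOCAL` can be written out as a small shell function. This only illustrates the rule stated above; jobsub sets the variable itself, and the helper name `expected_tar_dir` is hypothetical. Handling a plain `.tar` suffix is an assumption, since the text gives only a `.tar.gz` example.

```bash
#!/bin/bash
# expected_tar_dir (hypothetical helper): given the job's input directory
# and a tarball path, print the directory described by the default rule
# "$CONDOR_DIR_INPUT/name_of_tar_file_without_extension".
expected_tar_dir() {
    local input_dir="$1" tarball="$2"
    local name
    name=$(basename "$tarball")
    # Strip the ".tar.gz" suffix; a plain ".tar" is handled as an assumption.
    name=${name%.tar.gz}
    name=${name%.tar}
    printf '%s/%s\n' "$input_dir" "$name"
}
```

Inside a job, `expected_tar_dir "$CONDOR_DIR_INPUT" jan2026tutorial.tar.gz` would print the path ending in `/jan2026tutorial`, which can be compared with `$INPUT_TAR_DIR_LOCAL` when debugging.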
255253

256254
The submission output will look something like this:
@@ -265,7 +263,7 @@ Could not locate uploaded file on RCDS. Will retry in 30 seconds.
265263
Could not locate uploaded file on RCDS. Will retry in 30 seconds.
266264
Found uploaded file on RCDS.
267265
Transferring files to web sandbox...
268-
Copying file:///nashome/k/kherner/.cache/jobsub_lite/js_2023_05_24_224713_9669e535-daf9-496f-8332-c6ec8a4238d9/run_sep2025tutorial.sh [DONE] after 0s
266+
Copying file:///nashome/k/kherner/.cache/jobsub_lite/js_2023_05_24_224713_9669e535-daf9-496f-8332-c6ec8a4238d9/run_jan2026tutorial.sh [DONE] after 0s
269267
Copying file:///nashome/k/kherner/.cache/jobsub_lite/js_2023_05_24_224713_9669e535-daf9-496f-8332-c6ec8a4238d9/simple.cmd [DONE] after 0s
270268
Copying file:///nashome/k/kherner/.cache/jobsub_lite/js_2023_05_24_224713_9669e535-daf9-496f-8332-c6ec8a4238d9/simple.sh [DONE] after 0s
271269
Submitting job(s).
@@ -566,8 +564,6 @@ Some more background material on these topics (including some examples of why ce
566564

567565
[Wiki page listing differences between jobsub_lite and legacy jobsub](https://fifewiki.fnal.gov/wiki/Differences_between_jobsub_lite_and_legacy_jobsub_client/server)
568566

569-
[DUNE Computing Tutorial:Advanced topics and best practices](DUNE_computing_tutorial_advanced_topics_20210129)
570-
571567
[2021 Intensity Frontier Summer School](https://indico.fnal.gov/event/49414)
572568

573569
[The Glidein-based Workflow Management System]( https://glideinwms.fnal.gov/doc.prd/index.html )
Lines changed: 98 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,98 @@
1+
---
2+
title: justIN Grid Job Submission (UNDER CONSTRUCTION)
3+
teaching: 65
4+
exercises: 0
5+
questions:
6+
- How to submit grid jobs?
7+
objectives:
8+
- Submit a basic batch job and understand what's happening behind the scenes
9+
- Monitor the job and look at its outputs
10+
- Review best practices for submitting jobs (including what NOT to do)
11+
keypoints:
12+
- When in doubt, ask! Understand that policies and procedures that seem annoying, overly complicated, or unnecessary (especially when compared to running an interactive test) are there to ensure efficient operation and scalability. They are also often the result of someone breaking something in the past, or of simpler approaches not scaling well.
13+
- Send test jobs after creating new workflows or making changes to existing ones. If things don't work, don't blindly resubmit and expect things to magically work the next time.
14+
- Only copy what you need in input tar files. In particular, avoid copying log files, .git directories, temporary files, etc. from interactive areas.
15+
- Take care to follow best practices when setting up input and output file locations.
16+
- Always, always, always prestage input datasets. No exceptions.
17+
---
18+
19+
<!-- > ## Note:
20+
> This section describes basic job submission. Large scale submission of jobs to read DUNE data files are described in the [next section]({{ site.baseurl }}/08-submit-jobs-w-justin/index.html). -->
21+
<!--
22+
#### Session Video
23+
24+
This session will be captured on video and placed here after the workshop for asynchronous study.
25+
<!-- The session was video captured for your asynchronous review. -->
26+
The video from the two day version of this training in May 2022 is provided [here](https://www.youtube.com/embed/QuDxkhq64Og) as a reference. -->
27+
28+
<!--
29+
<center>
30+
<iframe width="560" height="315" src="https://www.youtube.com/embed/QuDxkhq64Og" title="DUNE Computing Tutorial May 2022 Grid Job Submission and Common Errors" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
31+
</center>
32+
-->
33+
34+
35+
36+
37+
38+
Once you have practiced the basic justIN commands, please look at the instructions below for running your own code:
39+
40+
41+
42+
## First learn the basics of justIN: submit a job
43+
44+
Go to [The justIN Tutorial](https://dunejustin.fnal.gov/docs/tutorials.dune.md)
45+
46+
and work up to ["run some hello world jobs"](https://dunejustin.fnal.gov/docs/tutorials.dune.md#run-some-hello-world-jobs)
47+
48+
> ## Quiz
49+
>
50+
> 1. What is your workflow ID?
51+
>
52+
{: .solution}
53+
54+
Then work through
55+
56+
- [View your workflow on the justIN web dashboard](https://dunejustin.fnal.gov/docs/tutorials.dune.md#view-your-workflow-on-the-justin-web-dashboard)
57+
- [Jobs with inputs and outputs](https://dunejustin.fnal.gov/docs/tutorials.dune.md#jobs-with-inputs-and-outputs)
58+
- [Fetching files from Rucio managed storage](https://dunejustin.fnal.gov/docs/tutorials.dune.md#fetching-files-from-rucio-managed-storage)
59+
- (skip for now) Jobs using GPUs
60+
- [Jobs writing to scratch](https://dunejustin.fnal.gov/docs/tutorials.dune.md#jobs-writing-to-scratch)
61+
62+
63+
64+
65+
66+
## Submit a job using the tarball containing custom code
67+
68+
69+
70+
First off, a very important point: for running analysis jobs, **you may not actually need to pass an input tarball**, especially if you are just using code from the base release and you don't actually modify any of it. In that case, it is much more efficient to use everything from the release and refrain from using a tarball.
71+
All you need to do is set up any required software from CVMFS (e.g. dunetpc and/or protoduneana), and you are ready to go.
72+
If you're just modifying a fcl file, for example, but no code, it's actually more efficient to copy just the fcl(s) you're changing to the scratch directory within the job, and edit them as part of your job script (copies of a fcl file in the current working directory have priority over others by default).
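The fcl-only approach can be sketched in a job script as follows. This is a minimal illustration under stated assumptions: the helper name `stage_fcl` is hypothetical, it assumes the software setup has already defined `FHICL_FILE_PATH`, and it relies on later definitions in a fcl file overriding earlier ones.

```bash
#!/bin/bash
# stage_fcl (hypothetical helper): copy the first copy of a fcl file found
# on FHICL_FILE_PATH into the current directory, then append override lines.
# Usage: stage_fcl <file.fcl> [override line ...]
stage_fcl() {
    local fcl_name="$1"; shift
    local dir
    local -a dirs
    # Split the colon-separated search path into an array.
    IFS=':' read -r -a dirs <<< "${FHICL_FILE_PATH:-}"
    for dir in "${dirs[@]}"; do
        if [ -f "$dir/$fcl_name" ]; then
            # Skip the copy if the match is already the local file.
            if [ ! "$dir/$fcl_name" -ef "./$fcl_name" ]; then
                cp "$dir/$fcl_name" "./$fcl_name" || return 1
            fi
            # Overrides go at the end so that they take precedence.
            printf '%s\n' "$@" >> "./$fcl_name"
            return 0
        fi
    done
    echo "stage_fcl: $fcl_name not found on FHICL_FILE_PATH" >&2
    return 1
}
```

A call such as `stage_fcl my_job.fcl 'services.TFileService.fileName: "my_hist.root"'` (file name and value are illustrative) leaves an edited copy in the job's working directory, where it takes priority over the release copy.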
73+
74+
Sometimes, though, we need to run some custom code that isn't in a release.
75+
We need a way to efficiently get code into jobs without overwhelming our data transfer systems.
76+
We have to make a few minor changes to the scripts you made in the previous tutorial section, generate a tarball, and invoke the proper jobsub options to get that into your job.
77+
There are many ways of doing this, but by far the best is to use the Rapid Code Distribution Service (RCDS), as shown in our example.
78+
79+
80+
### Temporary short version of an example for custom code.
81+
82+
We're working on a long version of this but please look at these [instructions for running a justIN workflow using your own code]({{ site.baseurl }}/short_submission) for now.
83+
84+
### Cool justIN feature
85+
86+
justIN has a very useful interactive test command.
87+
88+
Here is a test from the short submission example.
89+
90+
~~~
91+
{% include test_workflow.sh %}
92+
~~~
93+
94+
It reads in a tarball from an area `$DUNEDATA` and writes output to a temporary area on your interactive machine. It works very well at emulating a grid job.
95+
96+
## Did your job work?
97+
98+
If not, please ask in the #computing-questions channel on Slack.
