Commit fe232d4

Merge branch 'release/0.11.2'
2 parents f1b0126 + 55f2958 commit fe232d4

9 files changed

Lines changed: 43 additions & 154 deletions

.gitattributes

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+* text=auto

README.rst

Lines changed: 1 addition & 102 deletions
@@ -57,105 +57,4 @@ License
 Changelog
 ~~~~~~~~~
 
-- v0.11.1
-
-   + Made web front-end for job monitoring a separate script, ``gridmap_web``,
-     since it can be used to talk to any ``JobMonitor`` instance. (Fixes #14)
-   + Fixed crash if a stalled job comes back from the dead (#15).
-   + Fixed crash if job's hostname is somehow not in white list and the job
-     needs to be resubmitted (#16).
-   + Fixed crash from trying to set ``matplotlib`` back-end multiple times.
-   + Cleaned up some imports and removed some unused variables.
-
-- v0.11.0
-
-   + Vastly more reliable job completion information thanks to switch back to
-     using 0MQ for communication with worker nodes. No more unpickling
-     exceptions because the SGE DRMAA implementation frequently liked to say
-     jobs were finished when they were not.
-   + Add back web monitor to report basic job status.
-   + Switch to using custom fork of drmaa-python until
-     drmaa-python/drmaa-python#4, which fixes Python 3 compatibility issues,
-     gets merged.
-   + Now creates temporary directory for storing log files if it doesn't
-     exist.
-   + Travis-CI SGE installation has been streamlined.
-   + Switch to using sphinx and readthedocs for documentation.
-   + Added detection of stalled jobs. GridMap will also automatically restart
-     any jobs that appear stuck (up to 3 times by default), and email you a
-     report describing their CPU and memory usage over time.
-
-- v0.10.3
-
-   + Fix issue where ``clean_path`` wasn't being called on the working
-     directory, which was causing ETS-specific issues.
-   + Add a couple workarounds for issues with setting environment variables in
-     Python 3.
-   + Made examples into unit tests and added first attempt at getting Travis
-     setup with SGE.
-
-- v0.10.2
-
-   + Working directory is now correctly set for each job.
-   + Simplified handling of environment variables. Should now all be passed on
-     properly.
-
-- v0.10.1
-
-   + Can now import ``JobException`` directly from ``gridmap`` package instead
-     of having to import from ``gridmap.job``.
-
-- v0.10.0
-
-   + Now raise a ``JobException`` instead of an ``Exception`` when one of the
-     jobs has crashed.
-   + Fixed potential pip installation issue from importing package for version
-     number.
-
-- v0.9.9
-
-   + Changed way job results are retrieved to be a bit more efficient in cases
-     of errors.
-   + All job metadata is now retrieved before job output is, which should
-     hopefully alleviate issues where we can't get the metadata because its been
-     flushed too quickly by the grid engine.
-
-- v0.9.8
-
-   + Fixed a bug where only the first error was still showing because of an
-     extra exception caused by job_output being undefined.
-   + Fixed unhandled Exception with error code 24 (since somehow that is not an
-     InvalidJobException, but just an Exception in drmaa-python).
-
-- v0.9.7
-
-   + No longer dies with InvalidJobException when failing to retrieve job
-     metadata from DRMAA service.
-   + Now print all exceptions encountered for jobs submitted instead of just
-     exiting after first one.
-   + Die via exception instead of sys.exit when there were problems with some of
-     the submitted jobs.
-
-- v0.9.6
-
-   + Fixed bug where jobs were being aborted before they ran.
-
-- v0.9.5
-
-   + Fixed bug where ``GRID_MAP_USE_MEM_FREE`` would only be interpretted as true if
-     spelled 'True'.
-   + Added documentation describing how to override constants.
-
-- v0.9.4
-
-   + Added support for overriding the default queue and other constants via
-     environment variables. For example, to change the default queue, just set
-     the environment variable ``GRID_MAP_DEFAULT_QUEUE``.
-   + Substantially more information is given about crashing jobs when we fail
-     to unpickle the results from the Redis database.
-
-- v0.9.3
-
-   + Fixed serious bug where gridmap could not be imported in some instances.
-   + Refactored things a bit so there is no longer one large module with all of
-     the code in it. (Doesn't change package interface)
+See `GitHub releases <https://github.com/EducationalTestingService/gridmap/releases>`__.

gridmap/conf.py

Lines changed: 1 addition & 1 deletion
@@ -75,7 +75,7 @@
 try:
     import drmaa
     DRMAA_PRESENT = True
-except ImportError:
+except (ImportError, RuntimeError):
     logger = logging.getLogger(__name__)
     logger.warning('Could not import drmaa. Only local multiprocessing ' +
                    'supported.')
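
The widened `except` clause implies that importing `drmaa` can fail with a `RuntimeError` as well as an `ImportError`, presumably when the Python package is installed but its native library cannot be loaded (the commit itself does not state the reason). A minimal sketch of the same guard as a reusable helper follows; `optional_import` is a hypothetical name for illustration and is not part of gridmap.

```python
import importlib
import logging


def optional_import(module_name):
    """Return (module, True) if the import succeeds, else (None, False).

    Hypothetical helper illustrating the guard used in gridmap/conf.py.
    """
    try:
        return importlib.import_module(module_name), True
    except (ImportError, RuntimeError):
        # RuntimeError covers packages whose Python code is importable but
        # which raise while locating a native library at import time.
        logging.getLogger(__name__).warning(
            'Could not import %s. Falling back to reduced functionality.',
            module_name)
        return None, False


# Mirrors the DRMAA_PRESENT flag from the diff.
drmaa, DRMAA_PRESENT = optional_import('drmaa')
```

Callers can then branch on `DRMAA_PRESENT` instead of repeating the `try`/`except` at every use site.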

gridmap/job.py

Lines changed: 14 additions & 5 deletions
@@ -52,6 +52,7 @@
 from multiprocessing import Pool
 from socket import gethostname
 
+import psutil
 import zmq
 
 from gridmap.conf import (CHECK_FREQUENCY, CREATE_PLOTS, DEFAULT_QUEUE,
@@ -77,6 +78,12 @@
     import matplotlib.pyplot as plt
 
 
+# Set of "not running" job statuses
+SLEEP_STATUSES = {psutil.STATUS_SLEEPING, psutil.STATUS_DEAD,
+                  psutil.STATUS_IDLE, psutil.STATUS_STOPPED,
+                  psutil.STATUS_ZOMBIE}
+
+
 class JobException(Exception):
     '''
     New exception type for when one of the jobs crashed.
@@ -394,8 +401,8 @@ def check_if_alive(self):
                         logger.error("job died for unknown reason")
                         job.cause_of_death = "unknown"
                     elif (len(job.track_cpu) > MAX_IDLE_HEARTBEATS and
-                          all(cpu_load <= IDLE_THRESHOLD and state == 'S'
-                              for cpu_load, state in
+                          all((cpu_load <= IDLE_THRESHOLD and
+                               state in SLEEP_STATUSES) for cpu_load, state in
                               job.track_cpu[-MAX_IDLE_HEARTBEATS:])):
                         logger.error('Job stalled for unknown reason.')
                         job.cause_of_death = 'stalled'
@@ -417,6 +424,8 @@ def check_if_alive(self):
                 # try to resubmit
                 old_id = job.jobid
                 handle_resubmit(self.session_id, job, temp_dir=self.temp_dir)
+                logging.info('Resubmitted job %s; it now has ID %s', old_id,
+                             job.jobid)
                 if job.jobid is None:
                     logger.error("giving up on job")
                     job.ret = "job dead"
@@ -573,9 +582,9 @@ def handle_resubmit(session_id, job, temp_dir='/scratch/'):
         job.num_resubmits += 1
         job.cause_of_death = ""
 
-        return _resubmit(session_id, job, temp_dir)
+        _resubmit(session_id, job, temp_dir)
     else:
-        return None
+        job.jobid = None
 
 
 def _execute(job):
@@ -701,7 +710,7 @@ def _append_job_to_session(session, job, temp_dir='/scratch/', quiet=True):
 
     if not quiet:
         print('Your job {} has been submitted with id {}'.format(job.name,
-                                                                 jobid),
+                                                                 jobid),
               file=sys.stderr)
 
     session.deleteJobTemplate(jt)
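
The stall check in `check_if_alive` now treats a job as stalled when each of the most recent heartbeats reports low CPU load and a "not running" state, instead of comparing against the single `ps` state letter `'S'`. A minimal standalone sketch of that predicate is below. `is_stalled` is a hypothetical name, the thresholds are passed in rather than read from `gridmap.conf`, and the status strings are assumed stand-ins for the `psutil.STATUS_*` constants so that the example runs without psutil.

```python
# Assumed stand-ins for psutil.STATUS_SLEEPING, STATUS_DEAD, STATUS_IDLE,
# STATUS_STOPPED and STATUS_ZOMBIE.
SLEEP_STATUSES = {'sleeping', 'dead', 'idle', 'stopped', 'zombie'}


def is_stalled(track_cpu, idle_threshold, max_idle_heartbeats,
               sleep_statuses=SLEEP_STATUSES):
    """Decide whether a job looks stalled.

    track_cpu is a list of (cpu_load, state) tuples, one per heartbeat,
    oldest first.
    """
    if len(track_cpu) <= max_idle_heartbeats:
        return False  # not enough heartbeat history yet
    recent = track_cpu[-max_idle_heartbeats:]
    return all(cpu_load <= idle_threshold and state in sleep_statuses
               for cpu_load, state in recent)


history = [(90.0, 'running'), (0.0, 'zombie'), (0.0, 'idle'), (0.5, 'stopped')]
print(is_stalled(history, idle_threshold=1.0, max_idle_heartbeats=3))
```

Using a set of states rather than one literal means a process that is stopped or a zombie is caught as well as one that is merely sleeping.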

gridmap/runner.py

Lines changed: 5 additions & 39 deletions
@@ -42,6 +42,7 @@
 from io import open
 from subprocess import check_output
 
+from psutil import Process
 import zmq
 
 from gridmap.conf import HEARTBEAT_FREQUENCY
@@ -97,31 +98,6 @@ def _heart_beat(job_id, address, parent_pid=-1, log_file="", wait_sec=45):
         time.sleep(wait_sec)
 
 
-def _VmB(VmKey, pid):
-    """
-    get various mem usage properties of process with id pid in MB
-    """
-
-    _proc_status = '/proc/%d/status' % pid
-
-    _scale = {'kB': 1.0/1024.0, 'mB': 1.0,
-              'KB': 1.0/1024.0, 'MB': 1.0}
-
-    # get pseudo file /proc/<pid>/status
-    try:
-        with open(_proc_status) as t:
-            v = t.read()
-    except:
-        return 0.0  # non-Linux?
-    # get VmKey line e.g. 'VmRSS: 9999 kB\n ...'
-    i = v.index(VmKey)
-    v = v[i:].split(None, 3)  # whitespace
-    if len(v) < 3:
-        return 0.0  # invalid format?
-    # convert Vm value to bytes
-    return float(v[1]) * _scale[v[2]]
-
-
 def get_memory_usage(pid):
     """
     :param pid: Process ID for job whose memory usage we'd like to check.
@@ -130,7 +106,8 @@ def get_memory_usage(pid):
     :returns: Memory usage of process in Mb.
     """
 
-    return _VmB('VmSize:', pid)
+    p = Process(pid)
+    return p.get_memory_usage()[0] / (1024.0 ** 2.0)
 
 
 def get_cpu_load(pid):
@@ -143,19 +120,8 @@ def get_cpu_load(pid):
     :rtype: (float, str)
     """
 
-
-    command = ["ps", "h", "-o", "pcpu,state", "-p", "%d" % (pid)]
-
-    try:
-        cpu_load, state = check_output(command).strip().split()
-        cpu_load = float(cpu_load)
-    except:
-        logger = logging.getLogger(__name__)
-        logger.warning('Getting CPU info failed.', exc_info=True)
-        cpu_load = float('NaN')
-        state = '?'
-
-    return cpu_load, state
+    p = Process(pid)
+    return p.get_cpu_percent(), p.status
 
 
 def get_job_status(parent_pid):
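
The psutil calls added here follow an older `get_*` naming style, with `status` read as an attribute. From psutil 2.0 onward the corresponding `Process` members are the methods `memory_info()`, `cpu_percent()`, and `status()`, so the names in this hunk may not resolve against a current install. A minimal sketch against the newer names is below. The helper names are hypothetical, and the functions accept any object with the same interface so the example runs without psutil. Note that the first field of the memory-info tuple is resident memory (`rss`), whereas the removed `_VmB('VmSize:', pid)` call read virtual size.

```python
def memory_usage_mb(process):
    """Resident memory of a psutil.Process-like object, in MiB."""
    return process.memory_info().rss / (1024.0 ** 2)


def cpu_load_and_state(process):
    """Return (cpu_percent, status) for a psutil.Process-like object.

    With psutil, the first cpu_percent() call on a new Process object
    returns 0.0 because there is no earlier sample to compare against.
    """
    return process.cpu_percent(), process.status()


# Usage with psutil installed (not executed here):
#   import os
#   import psutil
#   p = psutil.Process(os.getpid())
#   print(memory_usage_mb(p), cpu_load_and_state(p))
```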

gridmap/version.py

Lines changed: 1 addition & 1 deletion
@@ -28,5 +28,5 @@
 :organization: ETS
 '''
 
-__version__ = '0.11.1'
+__version__ = '0.11.2'
 VERSION = tuple(int(x) for x in __version__.split('.'))
Lines changed: 9 additions & 3 deletions
@@ -124,11 +124,17 @@ def job_to_html(job):
     return body_text.encode()
 
 
-def _main():
+def main(argv=None):
     """
     Parse the command line inputs and start web monitor.
+
+    :param argv: List of arguments, as if specified on the command-line.
+                 If None, ``sys.argv`` is used instead.
+    :type argv: list of str
     """
     # Get command line arguments
+    if argv is None:
+        argv = sys.argv
     parser = argparse.ArgumentParser(description="Provides a web interface to \
                                                   0MQ job monitor.")
     parser.add_argument('module_dir',
@@ -138,7 +144,7 @@ def _main():
     parser.add_argument('-p', '--port',
                         help='Port for server to listen on.', type=int,
                         default=8076)
-    args = parser.parse_args()
+    args = parser.parse_args(argv)
 
     # Make warnings from built-in warnings module get formatted more nicely
     logging.captureWarnings(True)
@@ -161,4 +167,4 @@ def _main():
 
 
 if __name__ == "__main__":
-    _main()
+    main()
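
`ArgumentParser.parse_args(args)` treats every element of `args` as an argument, and when `args` is `None` it falls back to `sys.argv[1:]`. Passing the full `sys.argv`, as the new `main` does when `argv` is `None`, would therefore hand the program name to the parser as the first positional (`module_dir`). A minimal sketch of the same testable-entry-point pattern with the program name sliced off is below. The parser is trimmed to the two arguments visible in this diff, and returning the parsed namespace is an assumption made so the example can be exercised.

```python
import argparse
import sys


def main(argv=None):
    """Parse arguments; argv excludes the program name, like sys.argv[1:]."""
    if argv is None:
        argv = sys.argv[1:]
    parser = argparse.ArgumentParser(
        description='Provides a web interface to 0MQ job monitor.')
    parser.add_argument('module_dir')
    parser.add_argument('-p', '--port', type=int, default=8076,
                        help='Port for server to listen on.')
    return parser.parse_args(argv)


args = main(['/path/to/modules', '--port', '9000'])
print(args.module_dir, args.port)
```

A `console_scripts` entry point such as `gridmap_web = gridmap.web:main` calls `main()` with no arguments, so the `None` default is the path taken in normal use, while tests can pass an explicit list.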

requirements.txt

Lines changed: 2 additions & 1 deletion
@@ -1,2 +1,3 @@
-git+git://github.com/dan-blanchard/drmaa-python#egg=drmaa
+drmaa
+psutil
 pyzmq

setup.py

Lines changed: 9 additions & 2 deletions
@@ -25,11 +25,18 @@
 exec(compile(open('gridmap/version.py').read(), 'gridmap/version.py', 'exec'))
 # (we use the above instead of execfile for Python 3.x compatibility)
 
+
 def readme():
     with open('README.rst') as f:
         return f.read()
 
 
+def requirements():
+    with open('requirements.txt') as f:
+        reqs = f.read().splitlines()
+    return reqs
+
+
 setup(name='gridmap',
       version=__version__,
       description=('Easily map Python functions onto a cluster using a ' +
@@ -41,8 +48,8 @@ def readme():
       author_email='dblanchard@ets.org',
       license='GPL',
       packages=['gridmap'],
-      install_requires=['drmaa', 'pyzmq'],
-      scripts=['scripts/gridmap_web'],
+      install_requires=requirements(),
+      entry_points={'console_scripts': ['gridmap_web = gridmap.web:main']},
       classifiers=['Intended Audience :: Science/Research',
                    'Intended Audience :: Developers',
                    'License :: OSI Approved :: GNU General Public License v3 (GPLv3)',
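
Feeding `requirements.txt` straight into `install_requires` works while the file holds only bare package names, as it does after this commit. Blank lines, comments, or pip options such as `-r other.txt` would be passed through unchanged by `splitlines()`. A slightly more defensive reader might look like the sketch below; `parse_requirements` is a hypothetical name and is not part of the commit.

```python
def parse_requirements(text):
    """Return requirement specifiers from requirements.txt-style text."""
    reqs = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and pip-only options such as "-r" or "-e".
        if not line or line.startswith('#') or line.startswith('-'):
            continue
        reqs.append(line)
    return reqs


print(parse_requirements("drmaa\npsutil\n\n# speedups\npyzmq\n"))
```

In `setup.py` this would be called on the contents of `requirements.txt` in place of the plain `splitlines()` call.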
