PySDK Version
Describe the bug
It used to be possible to create a pipeline step with `FrameworkProcessor`, providing a `source_dir` (an archive in S3 containing the entrypoint Python file as well as other dependencies/code) and `code` (the name of the entrypoint Python file). The `source_dir` S3 URI would then get mapped into `/opt/ml/processing/input/code/`, and a `runproc.sh` script would be created and uploaded to serve as the `ProcessingJob`'s entrypoint (install requirements if present, then run the Python file inside `/opt/ml/processing/input/code/`).
This is no longer possible. It should be possible to use an S3 URI for `source_dir`, but it isn't: the code in `_package_code` only makes sense when `source_dir` is a local directory that is then packaged and uploaded. We need the previous behavior to remain available as well in order to fully migrate from v2 to v3.
To reproduce
Define a `FrameworkProcessor` and pass an S3 URI to `source_dir`:
```python
script_evaluation = FrameworkProcessor(
    image_uri=self.image_uris["train"],
    command=["python3"],
    instance_type=self.instances["processing_type"],
    instance_count=self.instances["processing_count"],
    base_job_name=base_job_name,
    output_kms_key=self.aws_params["kms_key_hub"],
    volume_kms_key=self.aws_params["kms_key"],
    network_config=self.network_config,
    env={
        "RANDOM_STATE": self.pipeline_params["RandomState"].to_string(),
        **self.default_env_vars,
    },
    role=self.aws_params["exec_role"],
    sagemaker_session=self.pipeline_session,
    tags=self.tags,
)
step_evaluation_args = script_evaluation.run(
    code="evaluate.py",  # <--- the Python entrypoint; evaluate.py imports from other Python files within source_dir
    source_dir=self.s3_uri_sourcedir,  # <--- an S3 URI, which causes the problem
    # add inputs and outputs as needed
    arguments=None,
)
```
Expected behavior
The same behavior as in v2: `source_dir` should be mapped into `/opt/ml/processing/input/code/`, and a `runproc.sh` entrypoint should be created and uploaded that runs the Python file defined in `code` (which is part of `source_dir`). `source_dir` should not be looked up locally, packaged, and uploaded if it is already an S3 URI.
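A minimal sketch of the kind of guard that would restore this behavior (the function names here are hypothetical illustrations, not the SDK's actual internals):

```python
from urllib.parse import urlparse


def is_s3_uri(path: str) -> bool:
    """True for S3 URIs like s3://bucket/prefix/sourcedir.tar.gz."""
    return urlparse(path).scheme == "s3"


def resolve_source_dir(source_dir: str) -> str:
    # Hypothetical dispatch: if source_dir already lives in S3, use it
    # as-is (the v2 behavior) and let it be mapped into
    # /opt/ml/processing/input/code/; only package and upload when it
    # is a local directory.
    if is_s3_uri(source_dir):
        return source_dir
    raise NotImplementedError("package the local dir and upload it, as v3 does today")
```

Whether the check lives in `_package_code` or earlier in `run()` is an implementation detail; the point is that an `s3://` value should bypass the local packaging path entirely.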
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 3.7.1
Additional context
This is a roadblock for my company for the migration from v2 to v3.