FrameworkProcessor source_dir wrong behavior #5735

@d-vesely

Description

PySDK Version

  • PySDK V2 (2.x)
  • PySDK V3 (3.x)

Describe the bug
It used to be possible to create a pipeline step with FrameworkProcessor by providing a source_dir (an archive in S3 containing the entrypoint Python file as well as other dependencies/code) and code (the name of the entrypoint Python file). The source_dir S3 URI would then get mapped into /opt/ml/processing/input/code/, and a runproc.sh script would be created and uploaded to serve as the ProcessingJob's entrypoint (install requirements if present, then run the Python file inside /opt/ml/processing/input/code/).
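
For reference, the v2-generated runproc.sh behaved roughly like the following sketch. This is an illustrative paraphrase, not the SDK's actual template; the helper name and paths are only for demonstration:

```python
def build_runproc_sh(entrypoint: str) -> str:
    """Sketch of a v2-style runproc.sh: install requirements if present,
    then execute the entrypoint from the unpacked code directory.
    Illustrative paraphrase only, not the SDK's real template.
    """
    return "\n".join([
        "#!/bin/bash",
        # source_dir is unpacked here by the processing job input mapping.
        "cd /opt/ml/processing/input/code/",
        # Install dependencies only if a requirements.txt shipped in source_dir.
        "[ -f requirements.txt ] && pip install -r requirements.txt",
        f"python3 {entrypoint}",
    ])

print(build_runproc_sh("evaluate.py"))
```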

This is no longer possible. It should be possible to pass an S3 URI as source_dir, but it isn't: the code in _package_code only makes sense when source_dir is a local directory that is then packaged and uploaded. We need the previous behavior to remain available in order to fully migrate from v2 to v3.

To reproduce
Define a FrameworkProcessor and pass an S3 URI to source_dir:

script_evaluation = FrameworkProcessor(
    image_uri=self.image_uris["train"],
    command=["python3"],
    instance_type=self.instances["processing_type"],
    instance_count=self.instances["processing_count"],
    base_job_name=base_job_name,
    output_kms_key=self.aws_params["kms_key_hub"],
    volume_kms_key=self.aws_params["kms_key"],
    network_config=self.network_config,
    env={
        "RANDOM_STATE": self.pipeline_params["RandomState"].to_string(),
        **self.default_env_vars,
    },
    role=self.aws_params["exec_role"],
    sagemaker_session=self.pipeline_session,
    tags=self.tags,
)

step_evaluation_args = script_evaluation.run(
    code="evaluate.py",  # <--- the Python entrypoint; evaluate.py imports from other Python files within source_dir
    source_dir=self.s3_uri_sourcedir,  # <--- an S3 URI, which causes the problem
    # add inputs and outputs as needed
    arguments=None,
)

Expected behavior
The same behavior as in v2: source_dir should be mapped into /opt/ml/processing/input/code/, and a runproc.sh entrypoint should be created and uploaded that runs the Python file given in code (which is part of source_dir). source_dir should not be looked up locally, packaged, and uploaded when it is already an S3 URI.
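
One way _package_code could restore this is to branch on whether source_dir is already an S3 URI. A minimal sketch of that check (the helper name and return shape are hypothetical; the real fix belongs inside the SDK):

```python
def resolve_source_dir(source_dir: str) -> tuple[str, bool]:
    """Hypothetical helper: return the code-archive location and whether a
    local packaging/upload step is still required.

    If source_dir is already an S3 URI, it should be used as-is (v2 behavior);
    otherwise it is a local path that must be packaged and uploaded first.
    """
    if source_dir.lower().startswith("s3://"):
        # Already in S3: map it directly into /opt/ml/processing/input/code/.
        return source_dir, False
    # Local directory: package and upload before the job can use it.
    return source_dir, True

print(resolve_source_dir("s3://my-bucket/code/sourcedir.tar.gz"))
print(resolve_source_dir("./src"))
```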

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 3.7.1

Additional context
This is a roadblock for my company for the migration from v2 to v3.
