Skip to content

fix(train): correct Networking field names in ModelTrainer intelligent defaults#5866

Open
mandipat wants to merge 1 commit into
aws:masterfrom
mandipat:fix/networking-default-enable-network-isolation
Open

fix(train): correct Networking field names in ModelTrainer intelligent defaults#5866
mandipat wants to merge 1 commit into
aws:masterfrom
mandipat:fix/networking-default-enable-network-isolation

Conversation

@mandipat
Copy link
Copy Markdown

Summary

Fixes two bugs in _populate_intelligent_defaults_from_training_job_space() in model_trainer.py that prevent ModelTrainer from working when sagemaker_config has a VpcConfig set. This makes SageMaker distribution 4.0 unusable out of the box.

Bug 1 — Wrong field name passed to Networking constructor

default_enable_network_isolation does not exist as a field in sagemaker-core 2.x. The correct field name is enable_network_isolation. This caused a Pydantic ValidationError on every model_trainer.train() call when VPC config was present in sagemaker_config.

# BEFORE (broken)
self.networking = Networking(
    default_enable_network_isolation=default_enable_network_isolation,  # field does not exist
    ...
)

# AFTER (fixed)
self.networking = Networking(
    enable_network_isolation=default_enable_network_isolation,
    ...
)

Bug 2 — security_group_ids silently assigned wrong value

When updating an existing Networking object, security_group_ids was mistakenly assigned the value from TRAINING_JOB_SUBNETS_PATH instead of TRAINING_JOB_SECURITY_GROUP_IDS_PATH.

# BEFORE (broken)
if self.networking.security_group_ids is None:
    self.networking.subnets = self.config_mgr.resolve_value_from_config(  # wrong attribute + wrong path
        config_path=TRAINING_JOB_SUBNETS_PATH
    )

# AFTER (fixed)
if self.networking.security_group_ids is None:
    self.networking.security_group_ids = self.config_mgr.resolve_value_from_config(
        config_path=TRAINING_JOB_SECURITY_GROUP_IDS_PATH
    )

Testing

Reproduced and verified the fix using a clean venv with sagemaker-core==2.10.0 and sagemaker-train==1.10.0 (exact versions shipped in SageMaker distribution 4.0).

from sagemaker.core.training.configs import Networking

# Bug 1 confirmed
try:
    Networking(default_enable_network_isolation=None)  # ValidationError
except Exception as e:
    print(f"BUG: {e}")

# Fix confirmed
Networking(enable_network_isolation=False)  # Works

Closes #5766

…t defaults

Two bugs in _populate_intelligent_defaults_from_training_job_space():

1. Networking constructor was passed `default_enable_network_isolation`
   which does not exist in sagemaker-core 2.x. The correct field name
   is `enable_network_isolation`. This caused a Pydantic ValidationError
   when sagemaker_config had a VpcConfig value set, making ModelTrainer
   unusable in SageMaker distribution 4.0 out of the box.

2. When updating an existing Networking object, `security_group_ids` was
   mistakenly assigned the subnets value from TRAINING_JOB_SUBNETS_PATH
   instead of the correct value from TRAINING_JOB_SECURITY_GROUP_IDS_PATH.

Fixes aws#5766
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

1 participant