
Error for Training job huggingface-pytorch-training-2025-07-04-16-29-08-379... NotImplementedError: Unsloth: meta-llama/Llama-3.1-8B is not supported in your current Unsloth version! #57

@mariotalavera

Description



When running `poetry poe run-training-pipeline` (SageMaker training), I hit the error below. I have tried changing the Llama version, with no luck.

I am currently following https://mlabonne.github.io/blog/posts/2024-07-29_Finetune_Llama31.html. Hoping this saves someone some time.
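For anyone hitting the same wall: the training container installs `unsloth==2024.9.post2` from the job's `requirements.txt` (visible in the pip log below). One thing worth trying, assuming a newer Unsloth release adds support for this model (the target version here is my guess, not verified; check Unsloth's release notes), is loosening that pin before relaunching the job:

```shell
# Sketch only: loosen the unsloth pin in the training requirements.txt.
# A stand-in file is created here for illustration; in the repo you would
# edit the real requirements.txt shipped with the SageMaker source dir.
printf 'unsloth==2024.9.post2\n' > requirements.txt
# The replacement version is an assumption -- consult Unsloth's changelog first.
sed -i 's/^unsloth==.*/unsloth>=2024.10.0/' requirements.txt
cat requirements.txt
```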

Error

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2025-07-04-16-29-08-379: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise NotImplementedError(
 NotImplementedError: Unsloth: meta-llama/Llama-3.1-8B is not supported in your current Unsloth version!
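Since the error is raised by Unsloth itself, it can help to confirm which Unsloth build the container actually ends up with before re-running. A minimal helper (the function name is my own, not from the repo) that could be dropped into the entry script:

```python
import importlib.metadata


def installed_version(package: str) -> str:
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return "not installed"


# Inside the SageMaker container this should report 2024.9.post2,
# matching the pin installed from requirements.txt in the pip log below.
print("unsloth:", installed_version("unsloth"))
```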

Log

Poe => poetry run python -m tools.run --no-cache --run-training
PyTorch version 2.2.2 available.
Chromedriver is already installed.
USER_AGENT environment variable not set, consider setting it to identify your requests.
Found credentials in shared credentials file: ~/.aws/credentials
sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/mariotalavera/Library/Application Support/sagemaker/config.yaml
Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
Initiating a new run for the pipeline: training.
Not including stack component settings with key orchestrator.sagemaker.
Caching is disabled by default for training.
Using user: default
Using stack: default
  artifact_store: default
  orchestrator: default
Dashboard URL for Pipeline Run: http://127.0.0.1:8237/runs/e7eea189-aa1a-4167-9c29-9df86461ae80
Step train has started.
2025-07-04 12:29:08.079 | INFO     | llm_engineering.model.finetuning.sagemaker:run_finetuning_on_sagemaker:36 - Current Hugging Face user: mariotalavera
Found credentials in shared credentials file: ~/.aws/credentials
Found credentials in shared credentials file: ~/.aws/credentials
image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
Creating training-job with name: huggingface-pytorch-training-2025-07-04-16-29-08-379
2025-07-04 16:29:14 Starting - Starting the training job...
2025-07-04 16:29:28 Starting - Preparing the instances for training...
2025-07-04 16:30:06 Downloading - Downloading the training image..................
2025-07-04 16:33:23 Training - Training image download completed. Training in progress....bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
/opt/conda/lib/python3.10/site-packages/paramiko/pkey.py:100: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "cipher": algorithms.TripleDES,
/opt/conda/lib/python3.10/site-packages/paramiko/transport.py:259: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "class": algorithms.TripleDES,
2025-07-04 16:33:40,237 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2025-07-04 16:33:40,254 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-07-04 16:33:40,264 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2025-07-04 16:33:40,265 sagemaker_pytorch_container.training INFO     Invoking user training script.
2025-07-04 16:33:41,669 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt
Collecting accelerate==0.33.0 (from -r requirements.txt (line 1))
Downloading accelerate-0.33.0-py3-none-any.whl.metadata (18 kB)
Collecting torch==2.4.0 (from -r requirements.txt (line 2))
Downloading torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting transformers==4.43.3 (from -r requirements.txt (line 3))
Downloading transformers-4.43.3-py3-none-any.whl.metadata (43 kB)
Collecting datasets==2.20.0 (from -r requirements.txt (line 4))
Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting peft==0.12.0 (from -r requirements.txt (line 5))
Downloading peft-0.12.0-py3-none-any.whl.metadata (13 kB)
Collecting trl==0.9.6 (from -r requirements.txt (line 6))
Downloading trl-0.9.6-py3-none-any.whl.metadata (12 kB)
Collecting bitsandbytes==0.43.3 (from -r requirements.txt (line 7))
Downloading bitsandbytes-0.43.3-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting comet-ml==3.44.3 (from -r requirements.txt (line 8))
Downloading comet_ml-3.44.3-py3-none-any.whl.metadata (3.9 kB)
Requirement already satisfied: flash-attn==2.3.6 in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 9)) (2.3.6)
Collecting unsloth==2024.9.post2 (from -r requirements.txt (line 10))
Downloading unsloth-2024.9.post2-py3-none-any.whl.metadata (55 kB)
Requirement already satisfied: numpy<2.0.0,>=1.17 in /opt/conda/lib/python3.10/site-packages (from accelerate==0.33.0->-r requirements.txt (line 1)) (1.26.4)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from accelerate==0.33.0->-r requirements.txt (line 1)) (23.1)
Requirement already satisfied: psutil in /opt/conda/lib/python3.10/site-packages (from accelerate==0.33.0->-r requirements.txt (line 1)) (6.0.0)
Requirement already satisfied: pyyaml in /opt/conda/lib/python3.10/site-packages (from accelerate==0.33.0->-r requirements.txt (line 1)) (6.0.2)
Requirement already satisfied: huggingface-hub>=0.21.0 in /opt/conda/lib/python3.10/site-packages (from accelerate==0.33.0->-r requirements.txt (line 1)) (0.25.2)
Requirement already satisfied: safetensors>=0.3.1 in /opt/conda/lib/python3.10/site-packages (from accelerate==0.33.0->-r requirements.txt (line 1)) (0.4.5)
Requirement already satisfied: filelock in /opt/conda/lib/python3.10/site-packages (from torch==2.4.0->-r requirements.txt (line 2)) (3.15.4)
Requirement already satisfied: typing-extensions>=4.8.0 in /opt/conda/lib/python3.10/site-packages (from torch==2.4.0->-r requirements.txt (line 2)) (4.12.2)
Requirement already satisfied: sympy in /opt/conda/lib/python3.10/site-packages (from torch==2.4.0->-r requirements.txt (line 2)) (1.13.0)
Requirement already satisfied: networkx in /opt/conda/lib/python3.10/site-packages (from torch==2.4.0->-r requirements.txt (line 2)) (3.3)
Requirement already satisfied: jinja2 in /opt/conda/lib/python3.10/site-packages (from torch==2.4.0->-r requirements.txt (line 2)) (3.1.4)
Requirement already satisfied: fsspec in /opt/conda/lib/python3.10/site-packages (from torch==2.4.0->-r requirements.txt (line 2)) (2024.2.0)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.2.106 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cusolver-cu12==11.4.5.107 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cusparse-cu12==12.1.0.106 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-nccl-cu12==2.20.5 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl.metadata (1.8 kB)
Collecting nvidia-nvtx-cu12==12.1.105 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.7 kB)
Collecting triton==3.0.0 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.3 kB)
Requirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.10/site-packages (from transformers==4.43.3->-r requirements.txt (line 3)) (2024.9.11)
Requirement already satisfied: requests in /opt/conda/lib/python3.10/site-packages (from transformers==4.43.3->-r requirements.txt (line 3)) (2.32.3)
Collecting tokenizers<0.20,>=0.19 (from transformers==4.43.3->-r requirements.txt (line 3))
Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.10/site-packages (from transformers==4.43.3->-r requirements.txt (line 3)) (4.66.5)
Requirement already satisfied: pyarrow>=15.0.0 in /opt/conda/lib/python3.10/site-packages (from datasets==2.20.0->-r requirements.txt (line 4)) (17.0.0)
Requirement already satisfied: pyarrow-hotfix in /opt/conda/lib/python3.10/site-packages (from datasets==2.20.0->-r requirements.txt (line 4)) (0.6)
Requirement already satisfied: dill<0.3.9,>=0.3.0 in /opt/conda/lib/python3.10/site-packages (from datasets==2.20.0->-r requirements.txt (line 4)) (0.3.8)
Requirement already satisfied: pandas in /opt/conda/lib/python3.10/site-packages (from datasets==2.20.0->-r requirements.txt (line 4)) (2.2.2)
Requirement already satisfied: xxhash in /opt/conda/lib/python3.10/site-packages (from datasets==2.20.0->-r requirements.txt (line 4)) (3.5.0)
Requirement already satisfied: multiprocess in /opt/conda/lib/python3.10/site-packages (from datasets==2.20.0->-r requirements.txt (line 4)) (0.70.16)
Requirement already satisfied: aiohttp in /opt/conda/lib/python3.10/site-packages (from datasets==2.20.0->-r requirements.txt (line 4)) (3.10.10)
Requirement already satisfied: tyro>=0.5.11 in /opt/conda/lib/python3.10/site-packages (from trl==0.9.6->-r requirements.txt (line 6)) (0.8.14)
Collecting everett<3.2.0,>=1.0.1 (from everett[ini]<3.2.0,>=1.0.1->comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading everett-3.1.0-py2.py3-none-any.whl.metadata (17 kB)
Requirement already satisfied: jsonschema!=3.1.0,>=2.6.0 in /opt/conda/lib/python3.10/site-packages (from comet-ml==3.44.3->-r requirements.txt (line 8)) (4.23.0)
Collecting python-box<7.0.0 (from comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading python_box-6.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.8 kB)
Collecting requests-toolbelt>=0.8.0 (from comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading requests_toolbelt-1.0.0-py2.py3-none-any.whl.metadata (14 kB)
Collecting semantic-version>=2.8.0 (from comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading semantic_version-2.10.0-py2.py3-none-any.whl.metadata (9.7 kB)
Collecting sentry-sdk>=1.1.0 (from comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading sentry_sdk-2.32.0-py2.py3-none-any.whl.metadata (10 kB)
Collecting simplejson (from comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading simplejson-3.20.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.2 kB)
Requirement already satisfied: urllib3>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from comet-ml==3.44.3->-r requirements.txt (line 8)) (1.26.19)
Requirement already satisfied: wrapt>=1.11.2 in /opt/conda/lib/python3.10/site-packages (from comet-ml==3.44.3->-r requirements.txt (line 8)) (1.16.0)
Collecting wurlitzer>=1.0.2 (from comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading wurlitzer-3.1.1-py3-none-any.whl.metadata (2.5 kB)
Collecting dulwich!=0.20.33,>=0.20.6 (from comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading dulwich-0.23.1-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Requirement already satisfied: rich>=13.3.2 in /opt/conda/lib/python3.10/site-packages (from comet-ml==3.44.3->-r requirements.txt (line 8)) (13.7.1)
Requirement already satisfied: einops in /opt/conda/lib/python3.10/site-packages (from flash-attn==2.3.6->-r requirements.txt (line 9)) (0.8.0)
Requirement already satisfied: ninja in /opt/conda/lib/python3.10/site-packages (from flash-attn==2.3.6->-r requirements.txt (line 9)) (1.11.1)
Collecting xformers>=0.0.27.post2 (from unsloth==2024.9.post2->-r requirements.txt (line 10))
Downloading xformers-0.0.31-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Requirement already satisfied: sentencepiece>=0.2.0 in /opt/conda/lib/python3.10/site-packages (from unsloth==2024.9.post2->-r requirements.txt (line 10)) (0.2.0)
Requirement already satisfied: wheel>=0.42.0 in /opt/conda/lib/python3.10/site-packages (from unsloth==2024.9.post2->-r requirements.txt (line 10)) (0.44.0)
Requirement already satisfied: protobuf<4.0.0 in /opt/conda/lib/python3.10/site-packages (from unsloth==2024.9.post2->-r requirements.txt (line 10)) (3.20.3)
Collecting hf-transfer (from unsloth==2024.9.post2->-r requirements.txt (line 10))
Downloading hf_transfer-0.1.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-nvjitlink-cu12 (from nvidia-cusolver-cu12==11.4.5.107->torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)
Collecting configobj (from everett[ini]<3.2.0,>=1.0.1->comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading configobj-5.0.9-py2.py3-none-any.whl.metadata (3.2 kB)
Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets==2.20.0->-r requirements.txt (line 4)) (2.4.3)
Requirement already satisfied: aiosignal>=1.1.2 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets==2.20.0->-r requirements.txt (line 4)) (1.3.1)
Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets==2.20.0->-r requirements.txt (line 4)) (23.2.0)
Requirement already satisfied: frozenlist>=1.1.1 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets==2.20.0->-r requirements.txt (line 4)) (1.5.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets==2.20.0->-r requirements.txt (line 4)) (6.1.0)
Requirement already satisfied: yarl<2.0,>=1.12.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets==2.20.0->-r requirements.txt (line 4)) (1.16.0)
Requirement already satisfied: async-timeout<5.0,>=4.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets==2.20.0->-r requirements.txt (line 4)) (4.0.3)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /opt/conda/lib/python3.10/site-packages (from jsonschema!=3.1.0,>=2.6.0->comet-ml==3.44.3->-r requirements.txt (line 8)) (2023.12.1)
Requirement already satisfied: referencing>=0.28.4 in /opt/conda/lib/python3.10/site-packages (from jsonschema!=3.1.0,>=2.6.0->comet-ml==3.44.3->-r requirements.txt (line 8)) (0.35.1)
Requirement already satisfied: rpds-py>=0.7.1 in /opt/conda/lib/python3.10/site-packages (from jsonschema!=3.1.0,>=2.6.0->comet-ml==3.44.3->-r requirements.txt (line 8)) (0.20.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests->transformers==4.43.3->-r requirements.txt (line 3)) (3.2.0)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests->transformers==4.43.3->-r requirements.txt (line 3)) (3.7)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests->transformers==4.43.3->-r requirements.txt (line 3)) (2024.7.4)
Requirement already satisfied: markdown-it-py>=2.2.0 in /opt/conda/lib/python3.10/site-packages (from rich>=13.3.2->comet-ml==3.44.3->-r requirements.txt (line 8)) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /opt/conda/lib/python3.10/site-packages (from rich>=13.3.2->comet-ml==3.44.3->-r requirements.txt (line 8)) (2.18.0)
Requirement already satisfied: docstring-parser>=0.16 in /opt/conda/lib/python3.10/site-packages (from tyro>=0.5.11->trl==0.9.6->-r requirements.txt (line 6)) (0.16)
Requirement already satisfied: shtab>=1.5.6 in /opt/conda/lib/python3.10/site-packages (from tyro>=0.5.11->trl==0.9.6->-r requirements.txt (line 6)) (1.7.1)
INFO: pip is looking at multiple versions of xformers to determine which version is compatible with other requirements. This could take a while.
Collecting xformers>=0.0.27.post2 (from unsloth==2024.9.post2->-r requirements.txt (line 10))
Downloading xformers-0.0.30-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.29.post3-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.29.post2-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.29.post1-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.29-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.28.post3-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.28.post2-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
INFO: pip is still looking at multiple versions of xformers to determine which version is compatible with other requirements. This could take a while.
Downloading xformers-0.0.28.post1-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.28-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.27.post2-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.0 kB)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/lib/python3.10/site-packages (from jinja2->torch==2.4.0->-r requirements.txt (line 2)) (2.1.5)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets==2.20.0->-r requirements.txt (line 4)) (2.9.0)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets==2.20.0->-r requirements.txt (line 4)) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets==2.20.0->-r requirements.txt (line 4)) (2024.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/lib/python3.10/site-packages (from sympy->torch==2.4.0->-r requirements.txt (line 2)) (1.3.0)
Requirement already satisfied: mdurl~=0.1 in /opt/conda/lib/python3.10/site-packages (from markdown-it-py>=2.2.0->rich>=13.3.2->comet-ml==3.44.3->-r requirements.txt (line 8)) (0.1.2)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas->datasets==2.20.0->-r requirements.txt (line 4)) (1.16.0)
Requirement already satisfied: propcache>=0.2.0 in /opt/conda/lib/python3.10/site-packages (from yarl<2.0,>=1.12.0->aiohttp->datasets==2.20.0->-r requirements.txt (line 4)) (0.2.0)
Downloading accelerate-0.33.0-py3-none-any.whl (315 kB)
Downloading torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl (797.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 797.2/797.2 MB 38.5 MB/s eta 0:00:00
Downloading transformers-4.43.3-py3-none-any.whl (9.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.4/9.4 MB 128.6 MB/s eta 0:00:00
Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 547.8/547.8 kB 39.3 MB/s eta 0:00:00
Downloading peft-0.12.0-py3-none-any.whl (296 kB)
Downloading trl-0.9.6-py3-none-any.whl (245 kB)
Downloading bitsandbytes-0.43.3-py3-none-manylinux_2_24_x86_64.whl (137.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 137.5/137.5 MB 173.4 MB/s eta 0:00:00
Downloading comet_ml-3.44.3-py3-none-any.whl (682 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 682.3/682.3 kB 63.1 MB/s eta 0:00:00
Downloading unsloth-2024.9.post2-py3-none-any.whl (155 kB)
Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 62.7 MB/s eta 0:00:00
Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 95.3 MB/s eta 0:00:00
Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 95.0 MB/s eta 0:00:00
Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 kB 63.7 MB/s eta 0:00:00
Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 31.9 MB/s eta 0:00:00
Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 76.2 MB/s eta 0:00:00
Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 80.3 MB/s eta 0:00:00
Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 84.4 MB/s eta 0:00:00
Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 83.8 MB/s eta 0:00:00
Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 94.0 MB/s eta 0:00:00
Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
Downloading triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 91.6 MB/s eta 0:00:00
Downloading dulwich-0.23.1-cp310-cp310-manylinux_2_28_x86_64.whl (1.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 82.0 MB/s eta 0:00:00
Downloading everett-3.1.0-py2.py3-none-any.whl (35 kB)
Downloading python_box-6.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3 MB 97.1 MB/s eta 0:00:00
Downloading requests_toolbelt-1.0.0-py2.py3-none-any.whl (54 kB)
Downloading semantic_version-2.10.0-py2.py3-none-any.whl (15 kB)
Downloading sentry_sdk-2.32.0-py2.py3-none-any.whl (356 kB)
Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 96.9 MB/s eta 0:00:00
Downloading wurlitzer-3.1.1-py3-none-any.whl (8.6 kB)
Downloading xformers-0.0.27.post2-cp310-cp310-manylinux2014_x86_64.whl (20.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.8/20.8 MB 82.2 MB/s eta 0:00:00
Downloading hf_transfer-0.1.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 77.4 MB/s eta 0:00:00
Downloading simplejson-3.20.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (138 kB)
Downloading configobj-5.0.9-py2.py3-none-any.whl (35 kB)
Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 84.1 MB/s eta 0:00:00
Installing collected packages: everett, wurlitzer, triton, simplejson, sentry-sdk, semantic-version, python-box, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, hf-transfer, dulwich, configobj, requests-toolbelt, nvidia-cusparse-cu12, nvidia-cudnn-cu12, tokenizers, nvidia-cusolver-cu12, transformers, torch, comet-ml, xformers, datasets, bitsandbytes, accelerate, trl, peft, unsloth
Attempting uninstall: triton
Found existing installation: triton 2.1.0
Uninstalling triton-2.1.0:
Successfully uninstalled triton-2.1.0
Attempting uninstall: tokenizers
Found existing installation: tokenizers 0.15.2
Uninstalling tokenizers-0.15.2:
Successfully uninstalled tokenizers-0.15.2
Attempting uninstall: transformers
Found existing installation: transformers 4.36.0
Uninstalling transformers-4.36.0:
Successfully uninstalled transformers-4.36.0
Attempting uninstall: torch
Found existing installation: torch 2.1.0
Uninstalling torch-2.1.0:
Successfully uninstalled torch-2.1.0
Attempting uninstall: datasets
Found existing installation: datasets 2.18.0
Uninstalling datasets-2.18.0:
Successfully uninstalled datasets-2.18.0
Attempting uninstall: bitsandbytes
Found existing installation: bitsandbytes 0.44.1
Uninstalling bitsandbytes-0.44.1:
Successfully uninstalled bitsandbytes-0.44.1
Attempting uninstall: accelerate
Found existing installation: accelerate 0.26.0
Uninstalling accelerate-0.26.0:
Successfully uninstalled accelerate-0.26.0
Attempting uninstall: trl
Found existing installation: trl 0.7.4
Uninstalling trl-0.7.4:
Successfully uninstalled trl-0.7.4
Attempting uninstall: peft
Found existing installation: peft 0.7.1
Uninstalling peft-0.7.1:
Successfully uninstalled peft-0.7.1
Successfully installed accelerate-0.33.0 bitsandbytes-0.43.3 comet-ml-3.44.3 configobj-5.0.9 datasets-2.20.0 dulwich-0.23.1 everett-3.1.0 hf-transfer-0.1.9 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 peft-0.12.0 python-box-6.1.0 requests-toolbelt-1.0.0 semantic-version-2.10.0 sentry-sdk-2.32.0 simplejson-3.20.1 tokenizers-0.19.1 torch-2.4.0 transformers-4.43.3 triton-3.0.0 trl-0.9.6 unsloth-2024.9.post2 wurlitzer-3.1.1 xformers-0.0.27.post2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
[notice] A new release of pip is available: 24.2 -> 25.1.1
[notice] To update, run: pip install --upgrade pip
2025-07-04 16:35:21,811 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
2025-07-04 16:35:21,811 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 0 from exiting process.
2025-07-04 16:35:21,892 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-07-04 16:35:21,926 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-07-04 16:35:21,954 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2025-07-04 16:35:21,966 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {},
    "current_host": "algo-1",
    "current_instance_group": "homogeneousCluster",
    "current_instance_group_hosts": [
        "algo-1"
    ],
    "current_instance_type": "ml.g5.2xlarge",
    "distribution_hosts": [],
    "distribution_instance_groups": [],
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "dataset_huggingface_workspace": "mariotalavera",
        "finetuning_type": "sft",
        "is_dummy": true,
        "learning_rate": 0.0003,
        "model_output_huggingface_workspace": "mariotalavera",
        "num_train_epochs": 3,
        "per_device_train_batch_size": 2
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {},
    "input_dir": "/opt/ml/input",
    "instance_groups": [
        "homogeneousCluster"
    ],
    "instance_groups_dict": {
        "homogeneousCluster": {
            "instance_group_name": "homogeneousCluster",
            "instance_type": "ml.g5.2xlarge",
            "hosts": [
                "algo-1"
            ]
        }
    },
    "is_hetero": false,
    "is_master": true,
    "is_modelparallel_enabled": null,
    "is_smddpmprun_installed": false,
    "is_smddprun_installed": true,
    "job_name": "huggingface-pytorch-training-2025-07-04-16-29-08-379",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-east-2-641217670402/huggingface-pytorch-training-2025-07-04-16-29-08-379/source/sourcedir.tar.gz",
    "module_name": "finetune",
    "network_interface_name": "eth0",
    "num_cpus": 8,
    "num_gpus": 1,
    "num_neurons": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.g5.2xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.g5.2xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0",
        "topology": null
    },
    "user_entry_point": "finetune.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"dataset_huggingface_workspace":"mariotalavera","finetuning_type":"sft","is_dummy":true,"learning_rate":0.0003,"model_output_huggingface_workspace":"mariotalavera","num_train_epochs":3,"per_device_train_batch_size":2}
SM_USER_ENTRY_POINT=finetune.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g5.2xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.2xlarge"}],"network_interface_name":"eth0","topology":null}
SM_INPUT_DATA_CONFIG={}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[]
SM_CURRENT_HOST=algo-1
SM_CURRENT_INSTANCE_TYPE=ml.g5.2xlarge
SM_CURRENT_INSTANCE_GROUP=homogeneousCluster
SM_CURRENT_INSTANCE_GROUP_HOSTS=["algo-1"]
SM_INSTANCE_GROUPS=["homogeneousCluster"]
SM_INSTANCE_GROUPS_DICT={"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.2xlarge"}}
SM_DISTRIBUTION_INSTANCE_GROUPS=[]
SM_IS_HETERO=false
SM_MODULE_NAME=finetune
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=1
SM_NUM_NEURONS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-east-2-641217670402/huggingface-pytorch-training-2025-07-04-16-29-08-379/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1","current_instance_group":"homogeneousCluster","current_instance_group_hosts":["algo-1"],"current_instance_type":"ml.g5.2xlarge","distribution_hosts":[],"distribution_instance_groups":[],"framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"dataset_huggingface_workspace":"mariotalavera","finetuning_type":"sft","is_dummy":true,"learning_rate":0.0003,"model_output_huggingface_workspace":"mariotalavera","num_train_epochs":3,"per_device_train_batch_size":2},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","instance_groups":["homogeneousCluster"],"instance_groups_dict":{"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.2xlarge"}},"is_hetero":false,"is_master":true,"is_modelparallel_enabled":null,"is_smddpmprun_installed":false,"is_smddprun_installed":true,"job_name":"huggingface-pytorch-training-2025-07-04-16-29-08-379","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-2-641217670402/huggingface-pytorch-training-2025-07-04-16-29-08-379/source/sourcedir.tar.gz","module_name":"finetune","network_interface_name":"eth0","num_cpus":8,"num_gpus":1,"num_neurons":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g5.2xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.2xlarge"}],"network_interface_name":"eth0","topology":null},"user_entry_point":"finetune.py"}
SM_USER_ARGS=["--dataset_huggingface_workspace","mariotalavera","--finetuning_type","sft","--is_dummy","True","--learning_rate","0.0003","--model_output_huggingface_workspace","mariotalavera","--num_train_epochs","3","--per_device_train_batch_size","2"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_HP_DATASET_HUGGINGFACE_WORKSPACE=mariotalavera
SM_HP_FINETUNING_TYPE=sft
SM_HP_IS_DUMMY=true
SM_HP_LEARNING_RATE=0.0003
SM_HP_MODEL_OUTPUT_HUGGINGFACE_WORKSPACE=mariotalavera
SM_HP_NUM_TRAIN_EPOCHS=3
SM_HP_PER_DEVICE_TRAIN_BATCH_SIZE=2
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python310.zip:/opt/conda/lib/python3.10:/opt/conda/lib/python3.10/lib-dynload:/opt/conda/lib/python3.10/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.10 finetune.py --dataset_huggingface_workspace mariotalavera --finetuning_type sft --is_dummy True --learning_rate 0.0003 --model_output_huggingface_workspace mariotalavera --num_train_epochs 3 --per_device_train_batch_size 2
2025-07-04 16:35:21,967 sagemaker-training-toolkit INFO     Exceptions not imported for SageMaker Debugger as it is not installed.
2025-07-04 16:35:21,968 sagemaker-training-toolkit INFO     Exceptions not imported for SageMaker TF as Tensorflow is not installed.
Unsloth: Your Flash Attention 2 installation seems to be broken?
A possible explanation is you have a new CUDA version which isn't
yet compatible with FA2? Please file a ticket to Unsloth or FA2.
We shall now use Xformers instead, which does not have any performance hits!
We found this negligible impact by benchmarking on 1x A100.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Num training epochs: '3'
Per device train batch size: '2'
Learning rate: 0.0003
Datasets will be loaded from Hugging Face workspace: 'mariotalavera'
Models will be saved to Hugging Face workspace: 'mariotalavera'
Training in dummy mode? 'True'
Finetuning type: 'sft'
Output data dir: '/opt/ml/output/data'
Model dir: '/opt/ml/model'
Number of GPUs: '1'
Starting SFT training...
Training from base model 'meta-llama/Llama-3.1-8B'
Traceback (most recent call last):
  File "/opt/ml/code/finetune.py", line 283, in <module>
    model, tokenizer = finetune(
  File "/opt/ml/code/finetune.py", line 80, in finetune
    model, tokenizer = load_model(
  File "/opt/ml/code/finetune.py", line 39, in load_model
    model, tokenizer = FastLanguageModel.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/unsloth/models/loader.py", line 160, in from_pretrained
    model_name = get_model_name(model_name, load_in_4bit)
  File "/opt/conda/lib/python3.10/site-packages/unsloth/models/loader.py", line 129, in get_model_name
    raise NotImplementedError(
NotImplementedError: Unsloth: meta-llama/Llama-3.1-8B is not supported in your current Unsloth version! Please update Unsloth via:
pip uninstall unsloth -y
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
2025-07-04 16:35:32,159 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
2025-07-04 16:35:32,159 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 1 from exiting process.
2025-07-04 16:35:32,160 sagemaker-training-toolkit ERROR    Reporting training FAILURE
2025-07-04 16:35:32,160 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise NotImplementedError(
 NotImplementedError: Unsloth: meta-llama/Llama-3.1-8B is not supported in your current Unsloth version! Please update Unsloth via
 
 pip uninstall unsloth -y
 pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git""
Command "/opt/conda/bin/python3.10 finetune.py --dataset_huggingface_workspace mariotalavera --finetuning_type sft --is_dummy True --learning_rate 0.0003 --model_output_huggingface_workspace mariotalavera --num_train_epochs 3 --per_device_train_batch_size 2"
2025-07-04 16:35:32,160 sagemaker-training-toolkit ERROR    Encountered exit_code 1

2025-07-04 16:35:52 Uploading - Uploading generated training model
2025-07-04 16:35:52 Failed - Training job failed
Failed to run step train after 1 retries. Exiting.
Error for Training job huggingface-pytorch-training-2025-07-04-16-29-08-379: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise NotImplementedError(
 NotImplementedError: Unsloth: meta-llama/Llama-3.1-8B is not supported in your current Unsloth version! Please update Unsloth via
 
 pip uninstall unsloth -y
 pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git""
Command "/opt/conda/bin/python3.10 finetune.py --dataset_huggingface_workspace mariotalavera --finetuning_type sft --is_dummy True --learning_rate 0.0003 --model_output_huggingface_workspace mariotalavera --num_train_epochs 3 --per_device_train_batch_size 2", exit code: 1. Check troubleshooting guide for common errors: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-python-sdk-troubleshooting.html
Traceback (most recent call last):
  File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/zenml/orchestrators/step_launcher.py", line 255, in launch
    self._run_step(
  File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/zenml/orchestrators/step_launcher.py", line 377, in _run_step
    self._run_step_without_step_operator(
  File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/zenml/orchestrators/step_launcher.py", line 462, in _run_step_without_step_operator
    runner.run(
  File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/zenml/orchestrators/step_runner.py", line 187, in run
    return_values = step_instance.call_entrypoint(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/zenml/steps/base_step.py", line 554, in call_entrypoint
    return self.entrypoint(**validated_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mariotalavera/Documents/workspace/LLM-Engineers-Handbook/steps/training/train.py", line 15, in train
    run_finetuning_on_sagemaker(
  File "/Users/mariotalavera/Documents/workspace/LLM-Engineers-Handbook/llm_engineering/model/finetuning/sagemaker.py", line 69, in run_finetuning_on_sagemaker
    huggingface_estimator.fit()
  File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/sagemaker/workflow/pipeline_context.py", line 346, in wrapper
    return run_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/sagemaker/estimator.py", line 1376, in fit
    self.latest_training_job.wait(logs=logs)
  File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/sagemaker/estimator.py", line 2750, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/sagemaker/session.py", line 5945, in logs_for_job
    _logs_for_job(self, job_name, wait, poll, log_type, timeout)
  File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/sagemaker/session.py", line 8547, in _logs_for_job
    _check_job_status(job_name, description, "TrainingJobStatus")
  File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/sagemaker/session.py", line 8611, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2025-07-04-16-29-08-379: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise NotImplementedError(
 NotImplementedError: Unsloth: meta-llama/Llama-3.1-8B is not supported in your current Unsloth version! Please update Unsloth via
 
 pip uninstall unsloth -y
 pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git""
Command "/opt/conda/bin/python3.10 finetune.py --dataset_huggingface_workspace mariotalavera --finetuning_type sft --is_dummy True --learning_rate 0.0003 --model_output_huggingface_workspace mariotalavera --num_train_epochs 3 --per_device_train_batch_size 2", exit code: 1. Check troubleshooting guide for common errors: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-python-sdk-troubleshooting.html
Pipeline run training_run_2025_07_04_12_29_07 failed.
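
In case it saves someone time: the Unsloth pinned inside the SageMaker training container predates Llama 3.1, so `FastLanguageModel.from_pretrained` rejects the model id before training even starts. One possible workaround (a sketch, not verified against this repo: it assumes the training dependencies are installed from a `requirements.txt` next to `finetune.py` in the `source_dir` that SageMaker uploads) is to override the pin with the newer build the error message itself points to:

```
# requirements.txt shipped alongside finetune.py (assumed location)
# Pull Unsloth from git so its model registry includes Llama 3.1,
# mirroring the upgrade command from the error message above.
unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git
```

The container installs this on startup, which should replace the outdated Unsloth release for that training job; whether the rest of the pinned stack stays compatible is untested.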

