UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2025-07-04-16-29-08-379: Failed. Reason: AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "raise NotImplementedError( NotImplementedError: Unsloth: meta-llama/Llama-3.1-8B is not supported in your current Unsloth version!
On attempting to run `poetry poe run-training-pipeline` (SageMaker, training), I run into the following error. I have tried changing the Llama version, with no luck.
For now I am following https://mlabonne.github.io/blog/posts/2024-07-29_Finetune_Llama31.html instead. Hoping this saves someone some time.
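For anyone hitting the same NotImplementedError: the install log further down shows the training job pinning `unsloth==2024.9.post2`. One workaround that may be worth trying, assuming that pinned release is simply too old to recognize the Llama 3.1 checkpoints, is to relax or bump that pin in the requirements.txt bundled with the training source and re-run the pipeline. A sketch only, not a verified fix (whether an unpinned Unsloth resolves the remaining dependency pins cleanly is an assumption):

```text
# requirements.txt shipped with the SageMaker training source (sketch)
# was: unsloth==2024.9.post2
unsloth
```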
Error
UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2025-07-04-16-29-08-379: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise NotImplementedError(
NotImplementedError: Unsloth: meta-llama/Llama-3.1-8B is not supported in your current Unsloth version!
Log
Poe => poetry run python -m tools.run --no-cache --run-training
PyTorch version 2.2.2 available.
Chromedriver is already installed.
USER_AGENT environment variable not set, consider setting it to identify your requests.
Found credentials in shared credentials file: ~/.aws/credentials
sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/mariotalavera/Library/Application Support/sagemaker/config.yaml
Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
Initiating a new run for the pipeline: training.
Not including stack component settings with key orchestrator.sagemaker.
Caching is disabled by default for training.
Using user: default
Using stack: default
  artifact_store: default
  orchestrator: default
Dashboard URL for Pipeline Run: http://127.0.0.1:8237/runs/e7eea189-aa1a-4167-9c29-9df86461ae80
Step train has started.
2025-07-04 12:29:08.079 | INFO | llm_engineering.model.finetuning.sagemaker:run_finetuning_on_sagemaker:36 - Current Hugging Face user: mariotalavera
Found credentials in shared credentials file: ~/.aws/credentials
Found credentials in shared credentials file: ~/.aws/credentials
image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
Creating training-job with name: huggingface-pytorch-training-2025-07-04-16-29-08-379
2025-07-04 16:29:14 Starting - Starting the training job...
2025-07-04 16:29:28 Starting - Preparing the instances for training...
2025-07-04 16:30:06 Downloading - Downloading the training image..................
2025-07-04 16:33:23 Training - Training image download completed. Training in progress....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
/opt/conda/lib/python3.10/site-packages/paramiko/pkey.py:100: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
"cipher": algorithms.TripleDES,
/opt/conda/lib/python3.10/site-packages/paramiko/transport.py:259: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
"class": algorithms.TripleDES,
2025-07-04 16:33:40,237 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
2025-07-04 16:33:40,254 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2025-07-04 16:33:40,264 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2025-07-04 16:33:40,265 sagemaker_pytorch_container.training INFO Invoking user training script.
2025-07-04 16:33:41,669 sagemaker-training-toolkit INFO Installing dependencies from requirements.txt
Collecting accelerate==0.33.0 (from -r requirements.txt (line 1))
Downloading accelerate-0.33.0-py3-none-any.whl.metadata (18 kB)
Collecting torch==2.4.0 (from -r requirements.txt (line 2))
Downloading torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting transformers==4.43.3 (from -r requirements.txt (line 3))
Downloading transformers-4.43.3-py3-none-any.whl.metadata (43 kB)
Collecting datasets==2.20.0 (from -r requirements.txt (line 4))
Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting peft==0.12.0 (from -r requirements.txt (line 5))
Downloading peft-0.12.0-py3-none-any.whl.metadata (13 kB)
Collecting trl==0.9.6 (from -r requirements.txt (line 6))
Downloading trl-0.9.6-py3-none-any.whl.metadata (12 kB)
Collecting bitsandbytes==0.43.3 (from -r requirements.txt (line 7))
Downloading bitsandbytes-0.43.3-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting comet-ml==3.44.3 (from -r requirements.txt (line 8))
Downloading comet_ml-3.44.3-py3-none-any.whl.metadata (3.9 kB)
Requirement already satisfied: flash-attn==2.3.6 in /opt/conda/lib/python3.10/site-packages (from -r requirements.txt (line 9)) (2.3.6)
Collecting unsloth==2024.9.post2 (from -r requirements.txt (line 10))
Downloading unsloth-2024.9.post2-py3-none-any.whl.metadata (55 kB)
Requirement already satisfied: numpy<2.0.0,>=1.17 in /opt/conda/lib/python3.10/site-packages (from accelerate==0.33.0->-r requirements.txt (line 1)) (1.26.4)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from accelerate==0.33.0->-r requirements.txt (line 1)) (23.1)
Requirement already satisfied: psutil in /opt/conda/lib/python3.10/site-packages (from accelerate==0.33.0->-r requirements.txt (line 1)) (6.0.0)
Requirement already satisfied: pyyaml in /opt/conda/lib/python3.10/site-packages (from accelerate==0.33.0->-r requirements.txt (line 1)) (6.0.2)
Requirement already satisfied: huggingface-hub>=0.21.0 in /opt/conda/lib/python3.10/site-packages (from accelerate==0.33.0->-r requirements.txt (line 1)) (0.25.2)
Requirement already satisfied: safetensors>=0.3.1 in /opt/conda/lib/python3.10/site-packages (from accelerate==0.33.0->-r requirements.txt (line 1)) (0.4.5)
Requirement already satisfied: filelock in /opt/conda/lib/python3.10/site-packages (from torch==2.4.0->-r requirements.txt (line 2)) (3.15.4)
Requirement already satisfied: typing-extensions>=4.8.0 in /opt/conda/lib/python3.10/site-packages (from torch==2.4.0->-r requirements.txt (line 2)) (4.12.2)
Requirement already satisfied: sympy in /opt/conda/lib/python3.10/site-packages (from torch==2.4.0->-r requirements.txt (line 2)) (1.13.0)
Requirement already satisfied: networkx in /opt/conda/lib/python3.10/site-packages (from torch==2.4.0->-r requirements.txt (line 2)) (3.3)
Requirement already satisfied: jinja2 in /opt/conda/lib/python3.10/site-packages (from torch==2.4.0->-r requirements.txt (line 2)) (3.1.4)
Requirement already satisfied: fsspec in /opt/conda/lib/python3.10/site-packages (from torch==2.4.0->-r requirements.txt (line 2)) (2024.2.0)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.2.106 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cusolver-cu12==11.4.5.107 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cusparse-cu12==12.1.0.106 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-nccl-cu12==2.20.5 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl.metadata (1.8 kB)
Collecting nvidia-nvtx-cu12==12.1.105 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.7 kB)
Collecting triton==3.0.0 (from torch==2.4.0->-r requirements.txt (line 2))
Downloading triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.3 kB)
Requirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.10/site-packages (from transformers==4.43.3->-r requirements.txt (line 3)) (2024.9.11)
Requirement already satisfied: requests in /opt/conda/lib/python3.10/site-packages (from transformers==4.43.3->-r requirements.txt (line 3)) (2.32.3)
Collecting tokenizers<0.20,>=0.19 (from transformers==4.43.3->-r requirements.txt (line 3))
Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.10/site-packages (from transformers==4.43.3->-r requirements.txt (line 3)) (4.66.5)
Requirement already satisfied: pyarrow>=15.0.0 in /opt/conda/lib/python3.10/site-packages (from datasets==2.20.0->-r requirements.txt (line 4)) (17.0.0)
Requirement already satisfied: pyarrow-hotfix in /opt/conda/lib/python3.10/site-packages (from datasets==2.20.0->-r requirements.txt (line 4)) (0.6)
Requirement already satisfied: dill<0.3.9,>=0.3.0 in /opt/conda/lib/python3.10/site-packages (from datasets==2.20.0->-r requirements.txt (line 4)) (0.3.8)
Requirement already satisfied: pandas in /opt/conda/lib/python3.10/site-packages (from datasets==2.20.0->-r requirements.txt (line 4)) (2.2.2)
Requirement already satisfied: xxhash in /opt/conda/lib/python3.10/site-packages (from datasets==2.20.0->-r requirements.txt (line 4)) (3.5.0)
Requirement already satisfied: multiprocess in /opt/conda/lib/python3.10/site-packages (from datasets==2.20.0->-r requirements.txt (line 4)) (0.70.16)
Requirement already satisfied: aiohttp in /opt/conda/lib/python3.10/site-packages (from datasets==2.20.0->-r requirements.txt (line 4)) (3.10.10)
Requirement already satisfied: tyro>=0.5.11 in /opt/conda/lib/python3.10/site-packages (from trl==0.9.6->-r requirements.txt (line 6)) (0.8.14)
Collecting everett<3.2.0,>=1.0.1 (from everett[ini]<3.2.0,>=1.0.1->comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading everett-3.1.0-py2.py3-none-any.whl.metadata (17 kB)
Requirement already satisfied: jsonschema!=3.1.0,>=2.6.0 in /opt/conda/lib/python3.10/site-packages (from comet-ml==3.44.3->-r requirements.txt (line 8)) (4.23.0)
Collecting python-box<7.0.0 (from comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading python_box-6.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.8 kB)
Collecting requests-toolbelt>=0.8.0 (from comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading requests_toolbelt-1.0.0-py2.py3-none-any.whl.metadata (14 kB)
Collecting semantic-version>=2.8.0 (from comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading semantic_version-2.10.0-py2.py3-none-any.whl.metadata (9.7 kB)
Collecting sentry-sdk>=1.1.0 (from comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading sentry_sdk-2.32.0-py2.py3-none-any.whl.metadata (10 kB)
Collecting simplejson (from comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading simplejson-3.20.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.2 kB)
Requirement already satisfied: urllib3>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from comet-ml==3.44.3->-r requirements.txt (line 8)) (1.26.19)
Requirement already satisfied: wrapt>=1.11.2 in /opt/conda/lib/python3.10/site-packages (from comet-ml==3.44.3->-r requirements.txt (line 8)) (1.16.0)
Collecting wurlitzer>=1.0.2 (from comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading wurlitzer-3.1.1-py3-none-any.whl.metadata (2.5 kB)
Collecting dulwich!=0.20.33,>=0.20.6 (from comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading dulwich-0.23.1-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Requirement already satisfied: rich>=13.3.2 in /opt/conda/lib/python3.10/site-packages (from comet-ml==3.44.3->-r requirements.txt (line 8)) (13.7.1)
Requirement already satisfied: einops in /opt/conda/lib/python3.10/site-packages (from flash-attn==2.3.6->-r requirements.txt (line 9)) (0.8.0)
Requirement already satisfied: ninja in /opt/conda/lib/python3.10/site-packages (from flash-attn==2.3.6->-r requirements.txt (line 9)) (1.11.1)
Collecting xformers>=0.0.27.post2 (from unsloth==2024.9.post2->-r requirements.txt (line 10))
Downloading xformers-0.0.31-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Requirement already satisfied: sentencepiece>=0.2.0 in /opt/conda/lib/python3.10/site-packages (from unsloth==2024.9.post2->-r requirements.txt (line 10)) (0.2.0)
Requirement already satisfied: wheel>=0.42.0 in /opt/conda/lib/python3.10/site-packages (from unsloth==2024.9.post2->-r requirements.txt (line 10)) (0.44.0)
Requirement already satisfied: protobuf<4.0.0 in /opt/conda/lib/python3.10/site-packages (from unsloth==2024.9.post2->-r requirements.txt (line 10)) (3.20.3)
Collecting hf-transfer (from unsloth==2024.9.post2->-r requirements.txt (line 10))
Downloading hf_transfer-0.1.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-nvjitlink-cu12 (from nvidia-cusolver-cu12==11.4.5.107->torch==2.4.0->-r requirements.txt (line 2))
Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)
Collecting configobj (from everett[ini]<3.2.0,>=1.0.1->comet-ml==3.44.3->-r requirements.txt (line 8))
Downloading configobj-5.0.9-py2.py3-none-any.whl.metadata (3.2 kB)
Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets==2.20.0->-r requirements.txt (line 4)) (2.4.3)
Requirement already satisfied: aiosignal>=1.1.2 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets==2.20.0->-r requirements.txt (line 4)) (1.3.1)
Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets==2.20.0->-r requirements.txt (line 4)) (23.2.0)
Requirement already satisfied: frozenlist>=1.1.1 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets==2.20.0->-r requirements.txt (line 4)) (1.5.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets==2.20.0->-r requirements.txt (line 4)) (6.1.0)
Requirement already satisfied: yarl<2.0,>=1.12.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets==2.20.0->-r requirements.txt (line 4)) (1.16.0)
Requirement already satisfied: async-timeout<5.0,>=4.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp->datasets==2.20.0->-r requirements.txt (line 4)) (4.0.3)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /opt/conda/lib/python3.10/site-packages (from jsonschema!=3.1.0,>=2.6.0->comet-ml==3.44.3->-r requirements.txt (line 8)) (2023.12.1)
Requirement already satisfied: referencing>=0.28.4 in /opt/conda/lib/python3.10/site-packages (from jsonschema!=3.1.0,>=2.6.0->comet-ml==3.44.3->-r requirements.txt (line 8)) (0.35.1)
Requirement already satisfied: rpds-py>=0.7.1 in /opt/conda/lib/python3.10/site-packages (from jsonschema!=3.1.0,>=2.6.0->comet-ml==3.44.3->-r requirements.txt (line 8)) (0.20.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests->transformers==4.43.3->-r requirements.txt (line 3)) (3.2.0)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests->transformers==4.43.3->-r requirements.txt (line 3)) (3.7)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests->transformers==4.43.3->-r requirements.txt (line 3)) (2024.7.4)
Requirement already satisfied: markdown-it-py>=2.2.0 in /opt/conda/lib/python3.10/site-packages (from rich>=13.3.2->comet-ml==3.44.3->-r requirements.txt (line 8)) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /opt/conda/lib/python3.10/site-packages (from rich>=13.3.2->comet-ml==3.44.3->-r requirements.txt (line 8)) (2.18.0)
Requirement already satisfied: docstring-parser>=0.16 in /opt/conda/lib/python3.10/site-packages (from tyro>=0.5.11->trl==0.9.6->-r requirements.txt (line 6)) (0.16)
Requirement already satisfied: shtab>=1.5.6 in /opt/conda/lib/python3.10/site-packages (from tyro>=0.5.11->trl==0.9.6->-r requirements.txt (line 6)) (1.7.1)
INFO: pip is looking at multiple versions of xformers to determine which version is compatible with other requirements. This could take a while.
Collecting xformers>=0.0.27.post2 (from unsloth==2024.9.post2->-r requirements.txt (line 10))
Downloading xformers-0.0.30-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.29.post3-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.29.post2-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.29.post1-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.29-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.28.post3-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.28.post2-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
INFO: pip is still looking at multiple versions of xformers to determine which version is compatible with other requirements. This could take a while.
Downloading xformers-0.0.28.post1-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.28-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Downloading xformers-0.0.27.post2-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.0 kB)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/lib/python3.10/site-packages (from jinja2->torch==2.4.0->-r requirements.txt (line 2)) (2.1.5)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets==2.20.0->-r requirements.txt (line 4)) (2.9.0)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets==2.20.0->-r requirements.txt (line 4)) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /opt/conda/lib/python3.10/site-packages (from pandas->datasets==2.20.0->-r requirements.txt (line 4)) (2024.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/lib/python3.10/site-packages (from sympy->torch==2.4.0->-r requirements.txt (line 2)) (1.3.0)
Requirement already satisfied: mdurl~=0.1 in /opt/conda/lib/python3.10/site-packages (from markdown-it-py>=2.2.0->rich>=13.3.2->comet-ml==3.44.3->-r requirements.txt (line 8)) (0.1.2)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas->datasets==2.20.0->-r requirements.txt (line 4)) (1.16.0)
Requirement already satisfied: propcache>=0.2.0 in /opt/conda/lib/python3.10/site-packages (from yarl<2.0,>=1.12.0->aiohttp->datasets==2.20.0->-r requirements.txt (line 4)) (0.2.0)
Downloading accelerate-0.33.0-py3-none-any.whl (315 kB)
Downloading torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl (797.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 797.2/797.2 MB 38.5 MB/s eta 0:00:00
Downloading transformers-4.43.3-py3-none-any.whl (9.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.4/9.4 MB 128.6 MB/s eta 0:00:00
Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 547.8/547.8 kB 39.3 MB/s eta 0:00:00
Downloading peft-0.12.0-py3-none-any.whl (296 kB)
Downloading trl-0.9.6-py3-none-any.whl (245 kB)
Downloading bitsandbytes-0.43.3-py3-none-manylinux_2_24_x86_64.whl (137.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 137.5/137.5 MB 173.4 MB/s eta 0:00:00
Downloading comet_ml-3.44.3-py3-none-any.whl (682 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 682.3/682.3 kB 63.1 MB/s eta 0:00:00
Downloading unsloth-2024.9.post2-py3-none-any.whl (155 kB)
Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 62.7 MB/s eta 0:00:00
Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 95.3 MB/s eta 0:00:00
Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 95.0 MB/s eta 0:00:00
Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 kB 63.7 MB/s eta 0:00:00
Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 31.9 MB/s eta 0:00:00
Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 76.2 MB/s eta 0:00:00
Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 80.3 MB/s eta 0:00:00
Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 84.4 MB/s eta 0:00:00
Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 83.8 MB/s eta 0:00:00
Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 94.0 MB/s eta 0:00:00
Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
Downloading triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 91.6 MB/s eta 0:00:00
Downloading dulwich-0.23.1-cp310-cp310-manylinux_2_28_x86_64.whl (1.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 82.0 MB/s eta 0:00:00
Downloading everett-3.1.0-py2.py3-none-any.whl (35 kB)
Downloading python_box-6.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3 MB 97.1 MB/s eta 0:00:00
Downloading requests_toolbelt-1.0.0-py2.py3-none-any.whl (54 kB)
Downloading semantic_version-2.10.0-py2.py3-none-any.whl (15 kB)
Downloading sentry_sdk-2.32.0-py2.py3-none-any.whl (356 kB)
Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 96.9 MB/s eta 0:00:00
Downloading wurlitzer-3.1.1-py3-none-any.whl (8.6 kB)
Downloading xformers-0.0.27.post2-cp310-cp310-manylinux2014_x86_64.whl (20.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.8/20.8 MB 82.2 MB/s eta 0:00:00
Downloading hf_transfer-0.1.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 77.4 MB/s eta 0:00:00
Downloading simplejson-3.20.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (138 kB)
Downloading configobj-5.0.9-py2.py3-none-any.whl (35 kB)
Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 84.1 MB/s eta 0:00:00
Installing collected packages: everett, wurlitzer, triton, simplejson, sentry-sdk, semantic-version, python-box, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, hf-transfer, dulwich, configobj, requests-toolbelt, nvidia-cusparse-cu12, nvidia-cudnn-cu12, tokenizers, nvidia-cusolver-cu12, transformers, torch, comet-ml, xformers, datasets, bitsandbytes, accelerate, trl, peft, unsloth
Attempting uninstall: triton
Found existing installation: triton 2.1.0
Uninstalling triton-2.1.0:
Successfully uninstalled triton-2.1.0
Attempting uninstall: tokenizers
Found existing installation: tokenizers 0.15.2
Uninstalling tokenizers-0.15.2:
Successfully uninstalled tokenizers-0.15.2
Attempting uninstall: transformers
Found existing installation: transformers 4.36.0
Uninstalling transformers-4.36.0:
Successfully uninstalled transformers-4.36.0
Attempting uninstall: torch
Found existing installation: torch 2.1.0
Uninstalling torch-2.1.0:
Successfully uninstalled torch-2.1.0
Attempting uninstall: datasets
Found existing installation: datasets 2.18.0
Uninstalling datasets-2.18.0:
Successfully uninstalled datasets-2.18.0
Attempting uninstall: bitsandbytes
Found existing installation: bitsandbytes 0.44.1
Uninstalling bitsandbytes-0.44.1:
Successfully uninstalled bitsandbytes-0.44.1
Attempting uninstall: accelerate
Found existing installation: accelerate 0.26.0
Uninstalling accelerate-0.26.0:
Successfully uninstalled accelerate-0.26.0
Attempting uninstall: trl
Found existing installation: trl 0.7.4
Uninstalling trl-0.7.4:
Successfully uninstalled trl-0.7.4
Attempting uninstall: peft
Found existing installation: peft 0.7.1
Uninstalling peft-0.7.1:
Successfully uninstalled peft-0.7.1
Successfully installed accelerate-0.33.0 bitsandbytes-0.43.3 comet-ml-3.44.3 configobj-5.0.9 datasets-2.20.0 dulwich-0.23.1 everett-3.1.0 hf-transfer-0.1.9 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 peft-0.12.0 python-box-6.1.0 requests-toolbelt-1.0.0 semantic-version-2.10.0 sentry-sdk-2.32.0 simplejson-3.20.1 tokenizers-0.19.1 torch-2.4.0 transformers-4.43.3 triton-3.0.0 trl-0.9.6 unsloth-2024.9.post2 wurlitzer-3.1.1 xformers-0.0.27.post2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
[notice] A new release of pip is available: 24.2 -> 25.1.1
[notice] To update, run: pip install --upgrade pip
2025-07-04 16:35:21,811 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2025-07-04 16:35:21,811 sagemaker-training-toolkit INFO Done waiting for a return code. Received 0 from exiting process.
2025-07-04 16:35:21,892 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2025-07-04 16:35:21,926 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2025-07-04 16:35:21,954 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2025-07-04 16:35:21,966 sagemaker-training-toolkit INFO Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {},
    "current_host": "algo-1",
    "current_instance_group": "homogeneousCluster",
    "current_instance_group_hosts": [
        "algo-1"
    ],
    "current_instance_type": "ml.g5.2xlarge",
    "distribution_hosts": [],
    "distribution_instance_groups": [],
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "dataset_huggingface_workspace": "mariotalavera",
        "finetuning_type": "sft",
        "is_dummy": true,
        "learning_rate": 0.0003,
        "model_output_huggingface_workspace": "mariotalavera",
        "num_train_epochs": 3,
        "per_device_train_batch_size": 2
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {},
    "input_dir": "/opt/ml/input",
    "instance_groups": [
        "homogeneousCluster"
    ],
    "instance_groups_dict": {
        "homogeneousCluster": {
            "instance_group_name": "homogeneousCluster",
            "instance_type": "ml.g5.2xlarge",
            "hosts": [
                "algo-1"
            ]
        }
    },
    "is_hetero": false,
    "is_master": true,
    "is_modelparallel_enabled": null,
    "is_smddpmprun_installed": false,
    "is_smddprun_installed": true,
    "job_name": "huggingface-pytorch-training-2025-07-04-16-29-08-379",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-east-2-641217670402/huggingface-pytorch-training-2025-07-04-16-29-08-379/source/sourcedir.tar.gz",
    "module_name": "finetune",
    "network_interface_name": "eth0",
    "num_cpus": 8,
    "num_gpus": 1,
    "num_neurons": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.g5.2xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.g5.2xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0",
        "topology": null
    },
    "user_entry_point": "finetune.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"dataset_huggingface_workspace":"mariotalavera","finetuning_type":"sft","is_dummy":true,"learning_rate":0.0003,"model_output_huggingface_workspace":"mariotalavera","num_train_epochs":3,"per_device_train_batch_size":2}
SM_USER_ENTRY_POINT=finetune.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g5.2xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.2xlarge"}],"network_interface_name":"eth0","topology":null}
SM_INPUT_DATA_CONFIG={}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[]
SM_CURRENT_HOST=algo-1
SM_CURRENT_INSTANCE_TYPE=ml.g5.2xlarge
SM_CURRENT_INSTANCE_GROUP=homogeneousCluster
SM_CURRENT_INSTANCE_GROUP_HOSTS=["algo-1"]
SM_INSTANCE_GROUPS=["homogeneousCluster"]
SM_INSTANCE_GROUPS_DICT={"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.2xlarge"}}
SM_DISTRIBUTION_INSTANCE_GROUPS=[]
SM_IS_HETERO=false
SM_MODULE_NAME=finetune
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=1
SM_NUM_NEURONS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-east-2-641217670402/huggingface-pytorch-training-2025-07-04-16-29-08-379/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1","current_instance_group":"homogeneousCluster","current_instance_group_hosts":["algo-1"],"current_instance_type":"ml.g5.2xlarge","distribution_hosts":[],"distribution_instance_groups":[],"framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"dataset_huggingface_workspace":"mariotalavera","finetuning_type":"sft","is_dummy":true,"learning_rate":0.0003,"model_output_huggingface_workspace":"mariotalavera","num_train_epochs":3,"per_device_train_batch_size":2},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","instance_groups":["homogeneousCluster"],"instance_groups_dict":{"homogeneousCluster":{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.2xlarge"}},"is_hetero":false,"is_master":true,"is_modelparallel_enabled":null,"is_smddpmprun_installed":false,"is_smddprun_installed":true,"job_name":"huggingface-pytorch-training-2025-07-04-16-29-08-379","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-2-641217670402/huggingface-pytorch-training-2025-07-04-16-29-08-379/source/sourcedir.tar.gz","module_name":"finetune","network_interface_name":"eth0","num_cpus":8,"num_gpus":1,"num_neurons":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g5.2xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g5.2xlarge"}],"network_interface_name":"eth0","topology":null},"user_entry_point":"finetune.py"}
SM_USER_ARGS=["--dataset_huggingface_workspace","mariotalavera","--finetuning_type","sft","--is_dummy","True","--learning_rate","0.0003","--model_output_huggingface_workspace","mariotalavera","--num_train_epochs","3","--per_device_train_batch_size","2"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_HP_DATASET_HUGGINGFACE_WORKSPACE=mariotalavera
SM_HP_FINETUNING_TYPE=sft
SM_HP_IS_DUMMY=true
SM_HP_LEARNING_RATE=0.0003
SM_HP_MODEL_OUTPUT_HUGGINGFACE_WORKSPACE=mariotalavera
SM_HP_NUM_TRAIN_EPOCHS=3
SM_HP_PER_DEVICE_TRAIN_BATCH_SIZE=2
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python310.zip:/opt/conda/lib/python3.10:/opt/conda/lib/python3.10/lib-dynload:/opt/conda/lib/python3.10/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.10 finetune.py --dataset_huggingface_workspace mariotalavera --finetuning_type sft --is_dummy True --learning_rate 0.0003 --model_output_huggingface_workspace mariotalavera --num_train_epochs 3 --per_device_train_batch_size 2
2025-07-04 16:35:21,967 sagemaker-training-toolkit INFO Exceptions not imported for SageMaker Debugger as it is not installed.
2025-07-04 16:35:21,968 sagemaker-training-toolkit INFO Exceptions not imported for SageMaker TF as Tensorflow is not installed.
Unsloth: Your Flash Attention 2 installation seems to be broken?
A possible explanation is you have a new CUDA version which isn't
yet compatible with FA2? Please file a ticket to Unsloth or FA2.
We shall now use Xformers instead, which does not have any performance hits!
We found this negligible impact by benchmarking on 1x A100.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Num training epochs: '3'
Per device train batch size: '2'
Learning rate: 0.0003
Datasets will be loaded from Hugging Face workspace: 'mariotalavera'
Models will be saved to Hugging Face workspace: 'mariotalavera'
Training in dummy mode? 'True'
Finetuning type: 'sft'
Output data dir: '/opt/ml/output/data'
Model dir: '/opt/ml/model'
Number of GPUs: '1'
Starting SFT training...
Training from base model 'meta-llama/Llama-3.1-8B'
Traceback (most recent call last):
File "/opt/ml/code/finetune.py", line 283, in <module>
model, tokenizer = finetune(
File "/opt/ml/code/finetune.py", line 80, in finetune
model, tokenizer = load_model(
File "/opt/ml/code/finetune.py", line 39, in load_model
model, tokenizer = FastLanguageModel.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/unsloth/models/loader.py", line 160, in from_pretrained
model_name = get_model_name(model_name, load_in_4bit)
File "/opt/conda/lib/python3.10/site-packages/unsloth/models/loader.py", line 129, in get_model_name
raise NotImplementedError(
NotImplementedError: Unsloth: meta-llama/Llama-3.1-8B is not supported in your current Unsloth version! Please update Unsloth via:
pip uninstall unsloth -y
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
2025-07-04 16:35:32,159 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2025-07-04 16:35:32,159 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
2025-07-04 16:35:32,160 sagemaker-training-toolkit ERROR Reporting training FAILURE
2025-07-04 16:35:32,160 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise NotImplementedError(
NotImplementedError: Unsloth: meta-llama/Llama-3.1-8B is not supported in your current Unsloth version! Please update Unsloth via
pip uninstall unsloth -y
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git""
Command "/opt/conda/bin/python3.10 finetune.py --dataset_huggingface_workspace mariotalavera --finetuning_type sft --is_dummy True --learning_rate 0.0003 --model_output_huggingface_workspace mariotalavera --num_train_epochs 3 --per_device_train_batch_size 2"
2025-07-04 16:35:32,160 sagemaker-training-toolkit ERROR Encountered exit_code 1
2025-07-04 16:35:52 Uploading - Uploading generated training model
2025-07-04 16:35:52 Failed - Training job failed
Failed to run step train after 1 retries. Exiting.
Error for Training job huggingface-pytorch-training-2025-07-04-16-29-08-379: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise NotImplementedError(
NotImplementedError: Unsloth: meta-llama/Llama-3.1-8B is not supported in your current Unsloth version! Please update Unsloth via
pip uninstall unsloth -y
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git""
Command "/opt/conda/bin/python3.10 finetune.py --dataset_huggingface_workspace mariotalavera --finetuning_type sft --is_dummy True --learning_rate 0.0003 --model_output_huggingface_workspace mariotalavera --num_train_epochs 3 --per_device_train_batch_size 2", exit code: 1. Check troubleshooting guide for common errors: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-python-sdk-troubleshooting.html
Traceback (most recent call last):
File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/zenml/orchestrators/step_launcher.py", line 255, in launch
self._run_step(
File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/zenml/orchestrators/step_launcher.py", line 377, in _run_step
self._run_step_without_step_operator(
File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/zenml/orchestrators/step_launcher.py", line 462, in _run_step_without_step_operator
runner.run(
File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/zenml/orchestrators/step_runner.py", line 187, in run
return_values = step_instance.call_entrypoint(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/zenml/steps/base_step.py", line 554, in call_entrypoint
return self.entrypoint(**validated_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mariotalavera/Documents/workspace/LLM-Engineers-Handbook/steps/training/train.py", line 15, in train
run_finetuning_on_sagemaker(
File "/Users/mariotalavera/Documents/workspace/LLM-Engineers-Handbook/llm_engineering/model/finetuning/sagemaker.py", line 69, in run_finetuning_on_sagemaker
huggingface_estimator.fit()
File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/sagemaker/workflow/pipeline_context.py", line 346, in wrapper
return run_func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/sagemaker/estimator.py", line 1376, in fit
self.latest_training_job.wait(logs=logs)
File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/sagemaker/estimator.py", line 2750, in wait
self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/sagemaker/session.py", line 5945, in logs_for_job
_logs_for_job(self, job_name, wait, poll, log_type, timeout)
File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/sagemaker/session.py", line 8547, in _logs_for_job
_check_job_status(job_name, description, "TrainingJobStatus")
File "/Users/mariotalavera/Library/Caches/pypoetry/virtualenvs/llm-engineering-CRnbCVQ9-py3.11/lib/python3.11/site-packages/sagemaker/session.py", line 8611, in _check_job_status
raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2025-07-04-16-29-08-379: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise NotImplementedError(
NotImplementedError: Unsloth: meta-llama/Llama-3.1-8B is not supported in your current Unsloth version! Please update Unsloth via
pip uninstall unsloth -y
pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git""
Command "/opt/conda/bin/python3.10 finetune.py --dataset_huggingface_workspace mariotalavera --finetuning_type sft --is_dummy True --learning_rate 0.0003 --model_output_huggingface_workspace mariotalavera --num_train_epochs 3 --per_device_train_batch_size 2", exit code: 1. Check troubleshooting guide for common errors: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-python-sdk-troubleshooting.html
Pipeline run training_run_2025_07_04_12_29_07 failed.
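For context on why the job dies in `unsloth/models/loader.py`: Unsloth keeps a hard-coded mapping of supported base-model names, and a release that predates Llama 3.1 simply has no entry for `meta-llama/Llama-3.1-8B`, so `get_model_name` raises before any training starts. Here is a minimal illustrative sketch of that lookup pattern — the table entries and function body below are assumptions for illustration, not Unsloth's actual code:

```python
# Illustrative sketch of the model-name gate in unsloth/models/loader.py.
# The mapping contents here are hypothetical; the point is that an
# outdated install lacks entries for newer models and raises.
SUPPORTED_MODELS = {
    # hypothetical entry, for illustration only
    "meta-llama/Meta-Llama-3-8B": "unsloth/llama-3-8b-bnb-4bit",
}

def get_model_name(model_name: str, load_in_4bit: bool = True) -> str:
    """Mimic the check that produced the training-job error above."""
    if model_name not in SUPPORTED_MODELS:
        raise NotImplementedError(
            f"Unsloth: {model_name} is not supported in your current Unsloth version!"
        )
    # A known model may be swapped for a pre-quantized variant when 4-bit is on.
    return SUPPORTED_MODELS[model_name] if load_in_4bit else model_name

# An install that predates Llama 3.1 fails exactly like the job above:
try:
    get_model_name("meta-llama/Llama-3.1-8B")
except NotImplementedError as e:
    print(e)
```

The upgrade commands printed in the error message refresh that table — but note the exception is raised inside the SageMaker container (`/opt/conda/lib/python3.10/site-packages/unsloth/...`), so upgrading Unsloth in the local Poetry environment will not help; the newer Unsloth has to end up in the dependencies baked into the training image/sourcedir that the SageMaker job runs.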