XGBoost Dask #11782
Is XGBoost with Dask truly scalable in practice, especially with respect to dataset size relative to RAM capacity? I'm testing distributed training with xgboost.dask and noticed that during DaskDMatrix initialization, the entire dataset appears to be loaded into memory across the workers. This seems to limit scalability: even with a Dask cluster, the dataset cannot be much larger than total cluster RAM. My understanding was that Dask+XGBoost would allow "out-of-core" or streaming-like training, but it looks like the data still needs to be materialized in memory first. So, a few sub-questions:
Replies: 1 comment
Yes, it needs to load the dataset into main memory; the scaling is horizontal only (using more workers). External memory does support distributed training, just not through Dask. Please find the demo in the documentation.