Huggingface parallel training for solving the CUDA out of memory issue
Documenting a workable solution for the annoying CUDA Out of Memory (OOM) error
Background
I have been running conditional text generation experiments on various seq2seq models. The script ran fine on the other models, but with the T5-large model I kept getting the CUDA OOM error.
T5-large is not that big, and I am running the code on a server with four NVIDIA GP102 GPUs, each with 12 GB of video memory, so it looked like it should fit. However, even with the per-device batch size set to 1, I still got the CUDA OOM error.
Environment
My Hugging Face transformers version is 4.20.1, and my code follows the standard pattern: a preprocess_function, dataset.map, and a trainer (a rough sketch is given below).
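The following is only a minimal sketch of that kind of pipeline, not my exact code; the dataset files, column names ("document" and "summary"), sequence lengths, and hyperparameters are all illustrative assumptions:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical dataset with "document" and "summary" columns.
raw_datasets = load_dataset(
    "json", data_files={"train": "train.json", "validation": "dev.json"}
)

def preprocess_function(examples):
    # Tokenize inputs and targets; the max lengths are illustrative.
    model_inputs = tokenizer(examples["document"], max_length=512, truncation=True)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw_datasets.map(preprocess_function, batched=True)

training_args = Seq2SeqTrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,  # already down to 1 and still hitting OOM
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```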
Some attempts
I followed the usual advice and added --fp16 and --sharded_ddp, but neither of them worked.
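For reference, those command-line flags map onto fields of the training arguments. A rough Python equivalent of what I was passing (the exact values are assumptions, and sharded_ddp additionally requires the fairscale package):

```python
from transformers import Seq2SeqTrainingArguments

# Equivalent of passing --fp16 and --sharded_ddp simple on the command line
# (transformers 4.20).
training_args = Seq2SeqTrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    fp16=True,              # mixed-precision training
    sharded_ddp="simple",   # fairscale ShardedDDP: shards optimizer state across workers
)
```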
Solution
The fix is the model parallelism built into the T5 implementation: parallelize(device_map=None). Here is the relevant part of the Hugging Face docs:
Parameters
- device_map (Dict[int, list], optional, defaults to None) — A dictionary that maps attention modules to devices. Note that the embedding module and LMHead are always automatically mapped to the first device (for esoteric reasons). That means that the first device should have fewer attention modules mapped to it than other devices. For reference, the t5 models have the following number of attention modules:
  - t5-small: 6
  - t5-base: 12
  - t5-large: 24
  - t5-3b: 24
  - t5-11b: 24
Uses a device map to distribute attention modules of the model across several devices. If no device map is given, it will evenly distribute blocks across all devices.
Example:
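The docs' example is essentially the following (import added so it runs standalone). The device map spreads t5-3b's 24 attention modules over 4 GPUs, giving device 0 fewer blocks because the embeddings and LM head are placed there:

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-3b")

# 24 attention modules spread over 4 GPUs; device 0 gets fewer
# because the embedding module and LM head always live on it.
device_map = {
    0: [0, 1, 2],
    1: [3, 4, 5, 6, 7, 8, 9],
    2: [10, 11, 12, 13, 14, 15, 16],
    3: [17, 18, 19, 20, 21, 22, 23],
}
model.parallelize(device_map)
```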
I just copied the code, replaced “t5-3b” with “t5-large”, and it worked!
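Concretely, since t5-large also has 24 attention modules, the same 4-GPU device map applies unchanged; the only change to the training script is to parallelize the model before building the trainer and to run it as a single process (no torchrun or other distributed launch). A sketch, reusing tokenizer, training_args, and tokenized from the earlier snippet:

```python
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-large")

# Same map as in the docs: t5-large also has 24 attention modules.
device_map = {
    0: [0, 1, 2],
    1: [3, 4, 5, 6, 7, 8, 9],
    2: [10, 11, 12, 13, 14, 15, 16],
    3: [17, 18, 19, 20, 21, 22, 23],
}
model.parallelize(device_map)

# tokenizer, training_args, and tokenized are defined in the earlier sketch;
# the Trainer detects the parallelized model and skips moving it to a single device.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```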