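Today I was writing an image classification training program with Python and PyTorch. After finally hammering it all out on the keyboard, I ran it... and it promptly failed with RuntimeError: CUDA out of memory, much to my disappointment. The full output was: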
/usr/bin/python3.5 /home/xxx/train.py
Step 1: prepare train/test dataset
There are 121 classes
Step 1 has been completed ---------7.801877
Step 2: Begin to train the model
num_ftrs=2048 num_classes=121
Epoch [0/29]
----------
Traceback (most recent call last):
  File "/home/xxx/train.py", line 121, in <module>
    outputs=model(images)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torchvision/models/resnet.py", line 204, in forward
    x = self.layer4(x)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torchvision/models/resnet.py", line 99, in forward
    out = self.bn1(out)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/batchnorm.py", line 81, in forward
    exponential_average_factor, self.eps)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/functional.py", line 1670, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 154.00 MiB (GPU 0; 23.65 GiB total capacity; 22.54 GiB already allocated; 18.00 MiB free; 257.96 MiB cached)

Process finished with exit code 1

The network was the resnext101_32x8d model that ships with torchvision, with batch_size=100. Leaving the rest of the code unchanged, I changed batch_size to 50 and ran watch -n 0.1 nvidia-smi on the command line to open a monitoring window, which showed the following:

[Screenshot: nvidia-smi monitoring window]
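The screenshot shows that although the machine has four GPUs, only one of them is in use, and its memory usage is already past the halfway mark. So batch_size=100 most likely overflowed the GPU memory.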
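The figures in the error message above point the same way. As a rough check, the sketch below parses that message with plain Python and compares the requested allocation with what was free. The helper names (`to_mib`, `parse_oom_message`) are made up for illustration, and the regular expression assumes the message layout shown in this traceback, which may differ in other PyTorch versions.

```python
import re

# Error text copied from the traceback above.
message = (
    "CUDA out of memory. Tried to allocate 154.00 MiB "
    "(GPU 0; 23.65 GiB total capacity; 22.54 GiB already allocated; "
    "18.00 MiB free; 257.96 MiB cached)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(value, unit):
    """Convert a (number, unit) pair such as ("23.65", "GiB") to MiB."""
    return float(value) * UNIT_TO_MIB[unit]


def parse_oom_message(text):
    """Return the sizes quoted in the OOM message, all expressed in MiB.

    Hypothetical helper: it only understands the wording seen above.
    """
    pattern = r"([\d.]+) (MiB|GiB) (total capacity|already allocated|free|cached)"
    sizes = {label: to_mib(v, u) for v, u, label in re.findall(pattern, text)}
    requested = re.search(r"Tried to allocate ([\d.]+) (MiB|GiB)", text)
    sizes["requested"] = to_mib(*requested.groups())
    return sizes


sizes = parse_oom_message(message)
shortfall = sizes["requested"] - sizes["free"]
print("requested: %.2f MiB, free: %.2f MiB" % (sizes["requested"], sizes["free"]))
print("shortfall: %.2f MiB" % shortfall)
```

The forward pass asked for 154 MiB while only 18 MiB was free, so it came up 136 MiB short on a card that already had 22.54 GiB of its 23.65 GiB allocated.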
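Simply reducing batch_size is enough to get rid of this out-of-memory error. It is not the ideal fix, though, and the four GPUs are still not being put to good use. A later post will cover multi-GPU training, so stay tuned.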
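For what it's worth, the manual step of dropping batch_size from 100 to 50 can also be automated. The sketch below is not what the training script in this post does. It is a minimal illustration that keeps halving the batch size until one step runs without the RuntimeError shown earlier. `find_fitting_batch_size`, `is_cuda_oom`, and `fake_step` are hypothetical names, and the threshold of 60 inside `fake_step` is invented purely so the example runs without a GPU.

```python
def is_cuda_oom(error):
    """True if a RuntimeError looks like the CUDA OOM error shown earlier."""
    return "out of memory" in str(error).lower()


def find_fitting_batch_size(try_step, start, minimum=1):
    """Halve the batch size until try_step(batch_size) stops raising OOM.

    try_step is a hypothetical callable that runs one training step at the
    given batch size and raises RuntimeError if the step fails.
    """
    batch_size = start
    while batch_size >= minimum:
        try:
            try_step(batch_size)
            return batch_size
        except RuntimeError as error:
            if not is_cuda_oom(error):
                raise  # unrelated failure: do not hide it
            batch_size //= 2
    raise RuntimeError("no batch size >= %d fits in memory" % minimum)


# Stand-in for a real training step, used only to exercise the helper.
# Assumption: it pretends anything above 60 samples per batch exhausts the GPU.
def fake_step(batch_size):
    if batch_size > 60:
        raise RuntimeError("CUDA out of memory. Tried to allocate 154.00 MiB")


chosen = find_fitting_batch_size(fake_step, start=100)
print(chosen)  # 50
```

In a real script, `try_step` would presumably load one batch of that size and call `model(images)` as in the traceback. Calling `torch.cuda.empty_cache()` between attempts may also help release cached blocks, though that was not tested here.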