Problems compiling DCNv2 when deploying yolact in Docker

    Tech, 2022-07-11

    When yolact is deployed in Docker, DCNv2 has to be compiled separately:

        cd external/DCNv2
        python setup.py build develop

    However, compiling DCNv2 depends on a GPU, and the build kept failing.

     

    Failure 1: using the python:3.6 image

        FROM python:3.6
        ...
        WORKDIR ***/external/DCNv2
        RUN python setup.py build develop
        ...

    The build fails with an error, and compiling by hand inside a container started with docker run fails with the same error:

        No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
        Traceback (most recent call last):
          File "setup.py", line 64, in <module>
            ext_modules=get_extensions(),
          File "setup.py", line 41, in get_extensions
            raise NotImplementedError('Cuda is not availabel')
        NotImplementedError: Cuda is not availabel

    Cause: the python:3.6 image does not have CUDA installed.

     

    Failure 2: switching to the pytorch/pytorch:1.2-cuda10.0-cudnn7-runtime image

        FROM pytorch/pytorch:1.2-cuda10.0-cudnn7-runtime
        ...
        WORKDIR ***/external/DCNv2
        RUN python setup.py build develop
        ...

    Whether compiling from the Dockerfile or by hand inside a container started with docker run, the same error appears:

        No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
        Traceback (most recent call last):
          File "setup.py", line 64, in <module>
            ext_modules=get_extensions(),
          File "setup.py", line 41, in get_extensions
            raise NotImplementedError('Cuda is not availabel')
        NotImplementedError: Cuda is not availabel

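    Cause: torch.cuda.is_available() returns True, but after from torch.utils.cpp_extension import CUDA_HOME, the value of CUDA_HOME is None. Checking /usr/local confirms there is no CUDA directory there, which suggests the -runtime image does not include the CUDA toolkit (nvcc and headers) needed to compile extensions.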
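
    The check described above can be reproduced with a short script. This is only a sketch: the helper name explain_cuda_state is hypothetical, its messages are my own wording, and the torch import is guarded so the script also runs where PyTorch is not installed.

```python
import shutil


def explain_cuda_state(cuda_available, cuda_home):
    """Summarise whether a CUDA extension such as DCNv2 is likely to compile.

    cuda_available: result of torch.cuda.is_available()
    cuda_home: value of torch.utils.cpp_extension.CUDA_HOME (None if no toolkit was found)
    """
    if cuda_home is None:
        return "no CUDA toolkit found (CUDA_HOME is None); use a -devel image"
    if not cuda_available:
        return "toolkit found at %s but no GPU is visible" % cuda_home
    return "toolkit found at %s and a GPU is visible" % cuda_home


if __name__ == "__main__":
    try:
        import torch
        from torch.utils.cpp_extension import CUDA_HOME
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print(explain_cuda_state(torch.cuda.is_available(), CUDA_HOME))
        # nvcc is the CUDA compiler; None here means it is not on PATH.
        print("nvcc on PATH:", shutil.which("nvcc"))
```

    Running it once inside the image and once inside a container started with --gpus shows which of the two conditions is missing.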

     

    Failure 3: switching to the pytorch/pytorch:1.2-cuda10.0-cudnn7-devel image

        FROM pytorch/pytorch:1.2-cuda10.0-cudnn7-devel
        ...
        WORKDIR ***/external/DCNv2
        RUN python setup.py build develop
        ...

    Along the way, apt-get update failed once: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/Packages.gz  Hash Sum mismatch

    Fix:

        ...
        # Update source
        RUN sed -i s:/archive.ubuntu.com:/mirrors.tuna.tsinghua.edu.cn/ubuntu:g /etc/apt/sources.list
        RUN cat /etc/apt/sources.list
        RUN apt-get clean
        RUN apt-get -y update --fix-missing --allow-unauthenticated
        ...

    With that fixed, docker build ran, but the compile step still failed (infuriating):

        No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
        Traceback (most recent call last):
          File "setup.py", line 64, in <module>
            ext_modules=get_extensions(),
          File "setup.py", line 41, in get_extensions
            raise NotImplementedError('Cuda is not availabel')
        NotImplementedError: Cuda is not availabel

    But after entering the container with docker run --gpus all -it ... /bin/bash, the build unexpectedly succeeded:

        running build
        running build_ext
        running develop
        running egg_info
        writing DCNv2.egg-info/PKG-INFO
        writing dependency_links to DCNv2.egg-info/dependency_links.txt
        writing top-level names to DCNv2.egg-info/top_level.txt
        reading manifest file 'DCNv2.egg-info/SOURCES.txt'
        writing manifest file 'DCNv2.egg-info/SOURCES.txt'
        running build_ext
        copying build/lib.linux-x86_64-3.6/_ext.cpython-36m-x86_64-linux-gnu.so ->
        Creating /opt/conda/lib/python3.6/site-packages/DCNv2.egg-link (link to .)
        Adding DCNv2 0.1 to easy-install.pth file
        Installed ***/external/DCNv2
        Processing dependencies for DCNv2==0.1
        Finished processing dependencies for DCNv2==0.1

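    Cause: when compiling inside a container started with docker run, the --gpus option has already given the container access to the GPU, so the build can use it and succeeds. When docker build executes the Dockerfile, no GPU is assigned to the build, so the GPU is still unavailable at that stage.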
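
    One way to see this difference directly is to look for GPU device nodes from inside the container. The sketch below assumes that GPUs passed in with --gpus appear as /dev/nvidia0, /dev/nvidia1, and so on, which is the usual layout on Linux but is not something the setup described here verified; the function name visible_gpu_devices is hypothetical.

```python
import glob
import os


def visible_gpu_devices(dev_dir="/dev"):
    """Return the names of NVIDIA GPU device nodes (nvidia0, nvidia1, ...) under dev_dir."""
    # The [0-9] keeps control nodes such as nvidiactl or nvidia-uvm out of the result.
    pattern = os.path.join(dev_dir, "nvidia[0-9]*")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


if __name__ == "__main__":
    devices = visible_gpu_devices()
    if devices:
        print("GPU device nodes visible:", ", ".join(devices))
    else:
        print("No GPU device nodes visible in this environment")
```

    Run from a RUN step in the Dockerfile it would be expected to report no devices, while the same script inside docker run --gpus all should list them.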

     

    Final solution: do not compile in the Dockerfile during docker build; run the compilation from the ENTRYPOINT instead:

        FROM pytorch/pytorch:1.2-cuda10.0-cudnn7-devel
        ...
        ENTRYPOINT ["sh", "run.sh"]

    Compile DCNv2 in run.sh:

        cd external/DCNv2
        python setup.py build develop
        cd ../..
        python ***.py
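
    Because the entrypoint runs on every container start, the compile step above is repeated each time. A possible refinement, not part of the original setup, is to skip the build when a compiled extension is already present. The sketch below assumes that setup.py build develop leaves an _ext*.so file in the DCNv2 source directory, as the build log earlier suggests; the helper name ensure_dcnv2_built is hypothetical.

```python
import glob
import os
import subprocess
import sys


def ensure_dcnv2_built(dcn_dir, run=subprocess.check_call):
    """Build DCNv2 in dcn_dir unless a compiled _ext*.so is already there.

    Returns True if a build was started, False if it was skipped.
    """
    # Assumption: `setup.py build develop` copies _ext*.so into the source directory.
    if glob.glob(os.path.join(dcn_dir, "_ext*.so")):
        return False
    # Same command as in run.sh, executed with the current interpreter.
    run([sys.executable, "setup.py", "build", "develop"], cwd=dcn_dir)
    return True
```

    Calling ensure_dcnv2_built(os.path.join("external", "DCNv2")) at the top of the Python entry script would replace the two build lines in run.sh. The skip only helps when the same container is restarted or the source directory is a mounted volume; a fresh container created from the image starts without the compiled file and builds again.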

     
