Using ElasticDL, an Elastic Distributed Deep Learning System, in Practice

    Technology  2025-07-11

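    Preface

    ElasticDL is a deep learning system built on TensorFlow 2.0 that supports elastic scheduling. It can loosely be thought of as an upgraded take on Kubeflow. Just as notable, ElasticDL was developed in China and is written in Python. Because ElasticDL calls the Kubernetes API to start and stop processes, Kubernetes must be installed first. For well-known network reasons, installing Kubernetes on a local machine often fails while pulling images, so a cloud host in Alibaba Cloud's Hong Kong region or in an overseas region is recommended. A pay-as-you-go host with 4 cores and 16 GB of memory, stopped after each session so that billing stops as well, is the cheapest way to try out and learn AI tooling.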

    Installing Python 3

    ElasticDL requires Python >= 3.6. Ubuntu 18.04 ships with Python 3.6, which meets the requirement. Ubuntu 16.04 ships with Python 3.5 and must be upgraded to Python 3.6.
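The version requirement above can be checked before going any further. The following is a minimal sketch, not part of ElasticDL: `version_ge` is a hypothetical helper name, and the comparison assumes GNU `sort -V`, which Ubuntu provides.

```shell
# Hypothetical helper: succeed if version $1 >= version $2.
# Assumes GNU coreutils `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Compare the local interpreter against the 3.6 minimum, if python3 exists.
if command -v python3 >/dev/null 2>&1; then
    py_ver="$(python3 -c 'import sys; print("%d.%d.%d" % sys.version_info[:3])')"
    if version_ge "$py_ver" 3.6; then
        echo "Python $py_ver meets the >= 3.6 requirement"
    else
        echo "Python $py_ver is too old; upgrade to 3.6 or later"
    fi
else
    echo "python3 not found on PATH"
fi
```

Version sort is used rather than plain string comparison so that a release such as 3.10 is correctly treated as newer than 3.6.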

    Installing Docker

    $ sudo apt-get update
    # Install the prerequisite packages
    $ sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
    # Add Docker's official GPG key
    $ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    # Verify that you now have the key with the expected fingerprint
    $ sudo apt-key fingerprint 0EBFCD88
    # Set up the stable repository
    $ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

    Installing Docker Engine-Community

    # Refresh the package index
    $ sudo apt-get update
    # Install the latest docker-ce
    $ sudo apt-get install docker-ce
    # Enable the service at boot and start it now
    $ sudo systemctl enable docker
    $ sudo systemctl start docker

    Installing kubectl

    $ curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl \
        && chmod +x kubectl \
        && sudo mv kubectl /usr/local/bin

    Installing minikube

    $ curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 \
        && chmod +x minikube \
        && sudo mv minikube /usr/local/bin/

    Once it is installed, verify the version:

    $ minikube version
    minikube version: v1.11.0
    commit: 57e2f55f47effe9ce396cea42a1e0eb4f611ebbd
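With Docker, kubectl and minikube installed, it can help to confirm that all three are on the PATH before creating the cluster. This is a small sketch under that assumption; `missing_tools` is a hypothetical helper name and is not part of any of these tools.

```shell
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the three tools installed in the steps above.
missing="$(missing_tools docker kubectl minikube)"
if [ -z "$missing" ]; then
    echo "docker, kubectl and minikube are all installed"
else
    # Left unquoted on purpose so the names print on one line.
    echo "Still missing:" $missing
fi
```

The check only confirms that the binaries can be found; it does not confirm that the Docker daemon is running.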

    Installing the ElasticDL client and downloading the source code

    $ pip install elasticdl_client
    $ git clone https://github.com/sql-machine-learning/elasticdl.git

    Creating the Kubernetes cluster

    $ sudo mkdir /data
    $ minikube start --vm-driver=none --cpus 2 --memory 6144 --disk-size=50gb --mount=true --mount-string="/data:/data"
    $ cd elasticdl
    $ kubectl apply -f elasticdl/manifests/elasticdl-rbac.yaml
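Right after `minikube start`, the API server and the node may need a moment before they report ready. The sketch below shows one way to wait for that. `retry` is a hypothetical helper, and the attempt count, delay and timeout are arbitrary values chosen for illustration.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts, and succeed as soon as the command succeeds.
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    n=1
    while ! "$@"; do
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}

# Wait for every node to reach the Ready condition, then list the nodes.
if command -v kubectl >/dev/null 2>&1; then
    retry 10 6 kubectl wait --for=condition=Ready node --all --timeout=10s \
        && kubectl get nodes
fi
```

The outer retry covers the case where the API server is not reachable yet, in which `kubectl wait` fails immediately instead of waiting.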

    Building the Docker image for distributed training

    $ cd model_zoo
    $ elasticdl zoo init
    $ elasticdl zoo build --image=elasticdl:mnist .

    Preparing the MNIST data

    $ docker pull elasticdl/elasticdl:dev
    $ cd ..
    $ docker run --rm -it \
        -v $HOME/.keras/datasets:/root/.keras/datasets \
        -v $PWD:/work \
        -w /work elasticdl/elasticdl:dev \
        bash -c "scripts/gen_dataset.sh data"
    $ sudo cp -r data/* /data
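The training step reads from `/data/mnist/train` and `/data/mnist/test`, so it is worth confirming that the copy produced both directories. A minimal sketch follows; `dir_nonempty` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if $1 is a directory with at least one entry.
dir_nonempty() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# The two paths are the ones passed to the training command as
# --training_data and --validation_data.
for d in /data/mnist/train /data/mnist/test; do
    if dir_nonempty "$d"; then
        echo "$d: ok"
    else
        echo "$d: missing or empty"
    fi
done
```

If either directory is reported as missing or empty, rerunning the dataset generation and the copy is likely cheaper than debugging a failed training job.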

    Starting the training

    $ elasticdl train \
        --image_name=elasticdl:mnist \
        --model_zoo=model_zoo \
        --model_def=mnist_functional_api.mnist_functional_api.custom_model \
        --training_data=/data/mnist/train \
        --validation_data=/data/mnist/test \
        --num_epochs=2 \
        --master_resource_request="cpu=0.2,memory=1024Mi" \
        --master_resource_limit="cpu=1,memory=2048Mi" \
        --worker_resource_request="cpu=0.4,memory=1024Mi" \
        --worker_resource_limit="cpu=1,memory=2048Mi" \
        --ps_resource_request="cpu=0.2,memory=1024Mi" \
        --ps_resource_limit="cpu=1,memory=2048Mi" \
        --minibatch_size=64 \
        --num_minibatches_per_task=2 \
        --num_ps_pods=1 \
        --num_workers=1 \
        --evaluation_steps=50 \
        --grads_to_wait=1 \
        --job_name=test-mnist \
        --log_level=INFO \
        --image_pull_policy=Never \
        --volume="host_path=/data,mount_path=/data" \
        --distribution_strategy=ParameterServerStrategy

    # Check the job status and logs
    $ kubectl get pods
    NAME                            READY   STATUS    RESTARTS   AGE
    elasticdl-test-mnist-master     1/1     Running   0          33s
    elasticdl-test-mnist-ps-0       1/1     Running   0          30s
    elasticdl-test-mnist-worker-0   1/1     Running   0          30s

    $ kubectl logs elasticdl-test-mnist-worker-0 | grep "Loss"
    [2020-04-14 02:46:28,535] [INFO] [worker.py:879:_process_minibatch] Loss is 3.07190203666687
    [2020-04-14 02:46:28,920] [INFO] [worker.py:879:_process_minibatch] Loss is 9.413976669311523
    [2020-04-14 02:46:29,120] [INFO] [worker.py:879:_process_minibatch] Loss is 3.9641590118408203
    [2020-04-14 02:46:29,344] [INFO] [worker.py:879:_process_minibatch] Loss is 15.329755783081055
    [2020-04-14 02:46:29,551] [INFO] [worker.py:879:_process_minibatch] Loss is 3.8414430618286133
    [2020-04-14 02:46:29,817] [INFO] [worker.py:879:_process_minibatch] Loss is 2.7703640460968018
    [2020-04-14 02:46:30,041] [INFO] [worker.py:879:_process_minibatch] Loss is 6.920175075531006
    [2020-04-14 02:46:30,242] [INFO] [worker.py:879:_process_minibatch] Loss is 4.37514925003051

    $ kubectl logs elasticdl-test-mnist-master | grep "Evaluation"
    [2020-04-14 02:46:21,836] [INFO] [master.py:192:prepare] Evaluation service started
    [2020-04-14 02:46:40,750] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=50]: {'accuracy': 0.21933334}
    [2020-04-14 02:46:53,827] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=100]: {'accuracy': 0.5173333}
    [2020-04-14 02:47:07,529] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=150]: {'accuracy': 0.6253333}
    [2020-04-14 02:47:23,251] [INFO] [evaluation_service.py:214:complete_task] Evaluation metrics[v=200]: {'accuracy': 0.752}
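The pod listing and the evaluation log shown above are easy to condense with standard text tools. The following is a sketch that assumes the log format stays exactly as printed in this article; `pods_not_ready` and `eval_accuracy` are hypothetical helper names, not ElasticDL commands.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print the
# names of pods whose STATUS column is neither Running nor Completed.
pods_not_ready() {
    awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Hypothetical helper: read master log lines on stdin and print
# "<model version> <accuracy>" for each evaluation report.
# Assumes the "Evaluation metrics[v=N]: {'accuracy': X}" format shown above.
eval_accuracy() {
    sed -n "s/.*Evaluation metrics\[v=\([0-9]*\)\]: {'accuracy': \([0-9.]*\)}.*/\1 \2/p"
}

# Only talk to the cluster when kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
    kubectl get pods | pods_not_ready
    kubectl logs elasticdl-test-mnist-master | eval_accuracy
fi
```

Applied to the master log above, `eval_accuracy` yields the pairs `50 0.21933334`, `100 0.5173333`, `150 0.6253333` and `200 0.752`, which makes the rise in accuracy over model versions easier to read than the raw log.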