3.7. Others (Tuning Resource Allocation, Increasing Batch Size, etc.)

Note: this chapter applies only to PaddlePaddle static-graph distributed training.

3.7.1. Principle

PaddlePaddle uses a thread-pool model to schedule and execute operators. Before a GPU computation is launched, CPU assistance is usually required, for example to schedule operator execution. However, if an operator's own computation time is very short, the thread-pool model introduces extra scheduling overhead. Practical experience suggests that for CPU tasks, setting the thread count to num_threads = 2 * dev_count gives good performance, while for GPU tasks num_threads = 4 * dev_count works well. Note that a larger thread pool is not always better.
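As a quick illustration of the rule of thumb above, the thread counts can be derived from the device count. This is a minimal sketch, assuming a CUDA build of PaddlePaddle where paddle.device.cuda.device_count() is available:

import paddle

# Rule-of-thumb thread-pool sizes from 3.7.1 (guidelines, not hard requirements).
dev_count = paddle.device.cuda.device_count()
num_threads_for_cpu_tasks = 2 * dev_count
num_threads_for_gpu_tasks = 4 * dev_count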

3.7.2. Practice

Users only need to set the corresponding option on DistributedStrategy() to configure the thread count.

import paddle
import paddle.distributed.fleet as fleet

strategy = fleet.DistributedStrategy()

# Configure the executor's thread-pool size (see 3.7.1 for the rules of thumb).
exe_strategy = paddle.static.ExecutionStrategy()
exe_strategy.num_threads = 3
strategy.execution_strategy = exe_strategy

The complete example is available at: example/resnet/train_fleet_static_others.py
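For context, the sketch below shows one way the strategy is typically wired into a Fleet static-graph training program. The single fc layer, data shapes, and optimizer choice are illustrative assumptions standing in for the ResNet model in the example file:

import paddle
import paddle.distributed.fleet as fleet

paddle.enable_static()
fleet.init(is_collective=True)

# Toy network standing in for the real model (assumption, for illustration only).
image = paddle.static.data(name='image', shape=[None, 32], dtype='float32')
label = paddle.static.data(name='label', shape=[None, 1], dtype='int64')
out = paddle.static.nn.fc(image, size=10)
loss = paddle.nn.functional.cross_entropy(out, label)

# Attach the execution strategy, then hand everything to the distributed optimizer.
strategy = fleet.DistributedStrategy()
exe_strategy = paddle.static.ExecutionStrategy()
exe_strategy.num_threads = 3
strategy.execution_strategy = exe_strategy

optimizer = paddle.optimizer.Momentum(learning_rate=0.01, momentum=0.9)
optimizer = fleet.distributed_optimizer(optimizer, strategy=strategy)
optimizer.minimize(loss)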

To run the job on 8 GPUs, simply execute the following command:

python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train_fleet_static_others.py

You will see log output similar to the following:

-----------  Configuration Arguments -----------
gpus: None
heter_worker_num: None
heter_workers:
http_port: None
ips: 127.0.0.1
log_dir: log
...
------------------------------------------------
...
INFO 2021-01-19 14:50:52,903 launch_utils.py:472] Local start 8 processes. First process distributed environment info (Only For Debug):
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:20485               |
    |                     PADDLE_TRAINERS_NUM                        8                      |
    |                PADDLE_TRAINER_ENDPOINTS  ... 0.1:23281,127.0.0.1:41983,127.0.0.1:17503|
    |                     FLAGS_selected_gpus                        0                      |
    |                       PADDLE_TRAINER_ID                        0                      |
    +=======================================================================================+
...
W0119 14:51:04.500844 77798 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 9.2
W0119 14:51:04.506238 77798 device_context.cc:372] device: 0, cuDNN Version: 7.4.
W0119 14:51:12.378418 77798 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 161. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 5.
[Epoch 0, batch 0] loss: 0.11252, acc1: 0.03125, acc5: 0.06250
[Epoch 0, batch 5] loss: 0.11252, acc1: 0.03125, acc5: 0.06250
[Epoch 0, batch 10] loss: 0.11252, acc1: 0.03125, acc5: 0.06250
[Epoch 0, batch 15] loss: 0.11252, acc1: 0.03125, acc5: 0.06250

Note that the output above may differ across PaddlePaddle versions.