Multi-Node Applications en cours d'exécution

Utilisation de l'EFA sur le DLAMI

La section suivante décrit comment utiliser EFA pour exécuter des applications à nœuds multiples sur le AWS Apprentissage profond (deep learning) AMIs.

Exécution Multi-Node d'applications avec EFA

Pour exécuter une application sur un cluster de nœuds, la configuration suivante est requise

Activer SSH sans mot de passe

Sélectionnez un nœud de votre cluster comme nœud principal. Les autres nœuds sont appelés nœuds de membre.

Sur le nœud principal, générez la paire de clés RSA.
```
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
```
Modifiez les autorisations de la clé privée sur le nœud principal.
```
chmod 600 ~/.ssh/id_rsa
```
Copiez la clé ~/.ssh/id_rsa.pub publique et ajoutez-la à l'~/.ssh/authorized_keysun des nœuds membres du cluster.
Vous devriez maintenant pouvoir vous connecter directement aux nœuds de membre du nœud principal en utilisant l'adresse IP privée.
```
ssh <member private ip>
```
Désactivez le HostKeyChecking mode strict et activez le transfert d'agent sur le nœud principal en ajoutant ce qui suit au fichier ~/. ssh/config fichier sur le nœud leader :
```
Host *
    ForwardAgent yes
Host *
    StrictHostKeyChecking no
```
Sur les instances Amazon Linux 2, exécutez la commande suivante sur le nœud principal pour fournir les autorisations correctes au fichier de configuration :
```
chmod 600 ~/.ssh/config
```

Créer un fichier hosts.

Sur le nœud principal, créez un fichier hosts pour identifier les nœuds du cluster. Le fichier hosts doit avoir une entrée pour chaque nœud du cluster. Créez un fichier ~/hosts et ajoutez chaque nœud en utilisant l'adresse IP privée comme suit :


localhost slots=8
<private ip of node 1> slots=8
<private ip of node 2> slots=8

Tests NCCL

Note

Ces tests ont été exécutés à l'aide de la version 1.38.0 d'EFA et du plugin OFI NCCL 1.13.2.

Vous trouverez ci-dessous un sous-ensemble de tests NCCL fournis par Nvidia pour tester à la fois les fonctionnalités et les performances sur plusieurs nœuds de calcul

Instances prises en charge : P3dn, P4, P5, P5e, P5en

Multi-node Test de performance NCCL sur P4d.24xlarge

Pour vérifier les performances NCCL avec EFA, exécutez le test de performance NCCL standard disponible sur le Repo officiel. NCCL-Tests Le DLAMI est livré avec ce test déjà conçu pour CUDA. XX.X Vous pouvez également exécuter votre propre script avec EFA.

Lors de la construction de votre propre script, suivez les instructions suivantes :

Utilisez le chemin complet vers mpirun comme indiqué dans l'exemple lors de l'exécution d'applications NCCL avec EFA.
Modifiez les paramètres np et N en fonction du nombre d'instances et de GPU de votre cluster.
Ajoutez l'indicateur NCCL_DEBUG=INFO et assurez-vous que les journaux indiquent l'utilisation de l'EFA sous la forme « Le fournisseur sélectionné est EFA ».
Définissez l'emplacement du journal d'entraînement à analyser pour validation
```
TRAINING_LOG="testEFA_$(date +"%N").log"
```

Utilisez la commande watch nvidia-smi sur n'importe quel nœud de membre pour surveiller l'utilisation des GPU. Les watch nvidia-smi commandes suivantes concernent une version générique de CUDA xx.x et dépendent du système d'exploitation de votre instance. Vous pouvez exécuter les commandes pour n'importe quelle version de CUDA disponible dans votre instance Amazon EC2 en remplaçant la version CUDA dans le script.

Amazon Linux 2, Amazon Linux 2023 :


 $ /opt/amazon/openmpi/bin/mpirun -n 16 -N 8 \
-x NCCL_DEBUG=INFO --mca pml ^cm \
-x LD_LIBRARY_PATH=/usr/local/cuda-xx.x/efa/lib:/usr/local/cuda-xx.x/lib:/usr/local/cuda-xx.x/lib64:/usr/local/cuda-xx.x:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:$LD_LIBRARY_PATH \
--hostfile hosts --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
/usr/local/cuda-xx.x/efa/test-cuda-xx.x/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100 | tee ${TRAINING_LOG}

Ubuntu 20.04, Ubuntu 20.04 :


$ /opt/amazon/openmpi/bin/mpirun -n 16 -N 8 \
-x NCCL_DEBUG=INFO --mca pml ^cm \
-x LD_LIBRARY_PATH=/usr/local/cuda-xx.x/efa/lib:/usr/local/cuda-xx.x/lib:/usr/local/cuda-xx.x/lib64:/usr/local/cuda-xx.x:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:$LD_LIBRARY_PATH \
--hostfile hosts --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
/usr/local/cuda-xx.x/efa/test-cuda-xx.x/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100 | tee ${TRAINING_LOG}

Le résultat doit être similaire à ce qui suit :


# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 100 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  33378 on ip-172-31-42-25 device  0 [0x10] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid  33379 on ip-172-31-42-25 device  1 [0x10] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid  33380 on ip-172-31-42-25 device  2 [0x20] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid  33381 on ip-172-31-42-25 device  3 [0x20] NVIDIA A100-SXM4-40GB
#  Rank  4 Group  0 Pid  33382 on ip-172-31-42-25 device  4 [0x90] NVIDIA A100-SXM4-40GB
#  Rank  5 Group  0 Pid  33383 on ip-172-31-42-25 device  5 [0x90] NVIDIA A100-SXM4-40GB
#  Rank  6 Group  0 Pid  33384 on ip-172-31-42-25 device  6 [0xa0] NVIDIA A100-SXM4-40GB
#  Rank  7 Group  0 Pid  33385 on ip-172-31-42-25 device  7 [0xa0] NVIDIA A100-SXM4-40GB
#  Rank  8 Group  0 Pid  30378 on ip-172-31-43-8 device  0 [0x10] NVIDIA A100-SXM4-40GB
#  Rank  9 Group  0 Pid  30379 on ip-172-31-43-8 device  1 [0x10] NVIDIA A100-SXM4-40GB
#  Rank 10 Group  0 Pid  30380 on ip-172-31-43-8 device  2 [0x20] NVIDIA A100-SXM4-40GB
#  Rank 11 Group  0 Pid  30381 on ip-172-31-43-8 device  3 [0x20] NVIDIA A100-SXM4-40GB
#  Rank 12 Group  0 Pid  30382 on ip-172-31-43-8 device  4 [0x90] NVIDIA A100-SXM4-40GB
#  Rank 13 Group  0 Pid  30383 on ip-172-31-43-8 device  5 [0x90] NVIDIA A100-SXM4-40GB
#  Rank 14 Group  0 Pid  30384 on ip-172-31-43-8 device  6 [0xa0] NVIDIA A100-SXM4-40GB
#  Rank 15 Group  0 Pid  30385 on ip-172-31-43-8 device  7 [0xa0] NVIDIA A100-SXM4-40GB
ip-172-31-42-25:33385:33385 [7] NCCL INFO cudaDriverVersion 12060
ip-172-31-43-8:30383:30383 [5] NCCL INFO Bootstrap : Using ens32:172.31.43.8
ip-172-31-43-8:30383:30383 [5] NCCL INFO NCCL version 2.23.4+cuda12.5
...
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.13.2-aws
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Using Libfabric version 1.22
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Using CUDA driver version 12060 with runtime 12050
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Configuring AWS-specific options
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Setting provider_filter to efa
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/amazon/ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
...
-----------------------------some output truncated-----------------------------------
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    180.3    0.00    0.00      0    179.3    0.00    0.00      0
          16             4     float     sum      -1    178.1    0.00    0.00      0    177.6    0.00    0.00      0
          32             8     float     sum      -1    178.5    0.00    0.00      0    177.9    0.00    0.00      0
          64            16     float     sum      -1    178.8    0.00    0.00      0    178.7    0.00    0.00      0
         128            32     float     sum      -1    178.2    0.00    0.00      0    177.8    0.00    0.00      0
         256            64     float     sum      -1    178.6    0.00    0.00      0    178.8    0.00    0.00      0
         512           128     float     sum      -1    177.2    0.00    0.01      0    177.1    0.00    0.01      0
        1024           256     float     sum      -1    179.2    0.01    0.01      0    179.3    0.01    0.01      0
        2048           512     float     sum      -1    181.3    0.01    0.02      0    181.2    0.01    0.02      0
        4096          1024     float     sum      -1    184.2    0.02    0.04      0    183.9    0.02    0.04      0
        8192          2048     float     sum      -1    191.2    0.04    0.08      0    190.6    0.04    0.08      0
       16384          4096     float     sum      -1    202.5    0.08    0.15      0    202.3    0.08    0.15      0
       32768          8192     float     sum      -1    233.0    0.14    0.26      0    232.1    0.14    0.26      0
       65536         16384     float     sum      -1    238.6    0.27    0.51      0    235.1    0.28    0.52      0
      131072         32768     float     sum      -1    237.2    0.55    1.04      0    236.8    0.55    1.04      0
      262144         65536     float     sum      -1    248.3    1.06    1.98      0    247.0    1.06    1.99      0
      524288        131072     float     sum      -1    309.2    1.70    3.18      0    307.7    1.70    3.20      0
     1048576        262144     float     sum      -1    408.7    2.57    4.81      0    404.3    2.59    4.86      0
     2097152        524288     float     sum      -1    613.5    3.42    6.41      0    607.9    3.45    6.47      0
     4194304       1048576     float     sum      -1    924.5    4.54    8.51      0    914.8    4.58    8.60      0
     8388608       2097152     float     sum      -1   1059.5    7.92   14.85      0   1054.3    7.96   14.92      0
    16777216       4194304     float     sum      -1   1269.9   13.21   24.77      0   1272.0   13.19   24.73      0
    33554432       8388608     float     sum      -1   1642.7   20.43   38.30      0   1636.7   20.50   38.44      0
    67108864      16777216     float     sum      -1   2446.7   27.43   51.43      0   2445.8   27.44   51.45      0
   134217728      33554432     float     sum      -1   4143.6   32.39   60.73      0   4142.4   32.40   60.75      0
   268435456      67108864     float     sum      -1   7351.9   36.51   68.46      0   7346.7   36.54   68.51      0
   536870912     134217728     float     sum      -1    13717   39.14   73.39      0    13703   39.18   73.46      0
  1073741824     268435456     float     sum      -1    26416   40.65   76.21      0    26420   40.64   76.20      0
...
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 15.5514

Pour vérifier que les tests EFA ont renvoyé un résultat valide, veuillez utiliser les tests suivants pour confirmer :

Obtenez le type d'instance à l'aide des métadonnées d'instance EC2 :


TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_TYPE=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/meta-data/instance-type)

Exécutez le Tests de performance

Définissez les paramètres suivants


CUDA_VERSION
CUDA_RUNTIME_VERSION
NCCL_VERSION

Validez les résultats comme indiqué :


RETURN_VAL=`echo $?`
if [ ${RETURN_VAL} -eq 0 ]; then

    # [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.13.2-aws
    # [0] NCCL INFO NET/OFI Using CUDA driver version 12060 with runtime 12010

    # cudaDriverVersion 12060  --> This is max supported cuda version by nvidia driver
    # NCCL version 2.23.4+cuda12.5 --> This is NCCL version compiled with cuda version

    # Validation of logs
    grep "NET/OFI Configuring AWS-specific options" ${TRAINING_LOG} || { echo "AWS-specific options text not found"; exit 1; } 
    grep "busbw" ${TRAINING_LOG} || { echo "busbw text not found"; exit 1; } 
    grep "Avg bus bandwidth " ${TRAINING_LOG} || { echo "Avg bus bandwidth text not found"; exit 1; } 
    grep "NCCL version $NCCL_VERSION" ${TRAINING_LOG} || { echo "Text not found: NCCL version $NCCL_VERSION"; exit 1; }
    if [[ ${INSTANCE_TYPE} == "p4d.24xlarge" ]]; then
        grep "NET/Libfabric/0/GDRDMA" ${TRAINING_LOG} || { echo "Text not found: NET/Libfabric/0/GDRDMA"; exit 1; }  
        grep "NET/OFI Selected Provider is efa (found 4 nics)" ${TRAINING_LOG} || { echo "Selected Provider is efa text not found"; exit 1; }   
    elif [[ ${INSTANCE_TYPE} == "p4de.24xlarge" ]]; then
        grep "NET/Libfabric/0/GDRDMA" ${TRAINING_LOG} || { echo "Avg bus bandwidth text not found"; exit 1; }
        grep "NET/OFI Selected Provider is efa (found 4 nics)" ${TRAINING_LOG} || { echo "Avg bus bandwidth text not found"; exit 1; }
    elif [[ ${INSTANCE_TYPE} == "p5.48xlarge" ]]; then
        grep "NET/Libfabric/0/GDRDMA" ${TRAINING_LOG} || { echo "Avg bus bandwidth text not found"; exit 1; }
        grep "NET/OFI Selected Provider is efa (found 32 nics)" ${TRAINING_LOG} || { echo "Avg bus bandwidth text not found"; exit 1; } 
    elif [[ ${INSTANCE_TYPE} == "p5e.48xlarge" ]]; then
        grep "NET/Libfabric/0/GDRDMA" ${TRAINING_LOG} || { echo "Avg bus bandwidth text not found"; exit 1; }
        grep "NET/OFI Selected Provider is efa (found 32 nics)" ${TRAINING_LOG} || { echo "Avg bus bandwidth text not found"; exit 1; }
    elif [[ ${INSTANCE_TYPE} == "p5en.48xlarge" ]]; then
        grep "NET/Libfabric/0/GDRDMA" ${TRAINING_LOG} || { echo "Avg bus bandwidth text not found"; exit 1; }
        grep "NET/OFI Selected Provider is efa (found 16 nics)" ${TRAINING_LOG} || { echo "Avg bus bandwidth text not found"; exit 1; }
    elif [[ ${INSTANCE_TYPE} == "p3dn.24xlarge" ]]; then
        grep "NET/OFI Selected Provider is efa (found 4 nics)" ${TRAINING_LOG} || { echo "Selected Provider is efa text not found"; exit 1; }  
    fi
    echo "***************************** check_efa_nccl_all_reduce passed for cuda version ${CUDA_VERSION} *****************************"
else
    echo "***************************** check_efa_nccl_all_reduce failed for cuda version ${CUDA_VERSION} *****************************"
fi

Pour accéder aux données de référence, nous pouvons analyser la dernière ligne du résultat du tableau issu du test all_reduce à nœuds multiples :


benchmark=$(sudo cat ${TRAINING_LOG} | grep '1073741824' | tail -n1 | awk -F " " '{{print $12}}' | sed 's/ //' | sed  's/  5e-07//')
if [[ -z "${benchmark}" ]]; then
  echo "benchmark variable is empty"
  exit 1
fi

echo "Benchmark throughput: ${benchmark}"

Avertissement JavaScript est désactivé ou n'est pas disponible dans votre navigateur.

Pour que vous puissiez utiliser la documentation AWS, Javascript doit être activé. Vous trouverez des instructions sur les pages d'aide de votre navigateur.

Conventions de rédaction

Lancement d'une AWS Apprentissage profond (deep learning) AMIs instance

Optimisation et surveillance des GPU