Multiple CPU producers with few GPUs do not utilize 100% of the GPUs (PyTorch)

I am trying to implement parallel self-play data generation for a board game, using multiple CPUs to run self-play concurrently. In the parent process I create 4 NN models for 30 CPUs (3 self-play models, each serving 10 CPUs, plus 1 model for training), each on a different GPU. The model is a 20-block ResNet-like architecture with batchnorm. Pseudocode:



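# One network per GPU: nnet is trained on GPU 0; nnet1-3 serve self-play on GPUs 1-3.
nnet = NN(gpu_num=0)
nnet1 = NN(gpu_num=1)
nnet2 = NN(gpu_num=2)
nnet3 = NN(gpu_num=3)

for i in range(num_iteration):
    # Copy the latest trained weights into the self-play networks.
    nnet1.load_state_dict(nnet.state_dict())
    nnet2.load_state_dict(nnet.state_dict())
    nnet3.load_state_dict(nnet.state_dict())
    samples = parallel_self_play()
    nnet.train(samples)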


parallel_self_play() is implemented as follows:



pool = mp.Pool(processes=num_cpu)  # 30 worker processes
results = []
for i in range(self.args.numEps):
    # Assign episodes to the three self-play networks in round-robin order.
    if i % 3 == 0:
        net = self.nnet1
    elif i % 3 == 1:
        net = self.nnet2
    else:
        net = self.nnet3

    results.append(pool.apply_async(AsyncSelfPlay, args=(net,)))
# get results from results array then return it
return results
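A minimal, self-contained sketch of the round-robin dispatch described above. It is not my real code: `fake_self_play` is a hypothetical stand-in for `AsyncSelfPlay`, the networks are represented by plain names, and `ThreadPool` is used instead of `mp.Pool` only so that the snippet runs anywhere (both expose the same `apply_async` interface).

```python
from multiprocessing.pool import ThreadPool


def fake_self_play(net_name, episode):
    """Hypothetical stand-in for AsyncSelfPlay: report which net played which episode."""
    return (episode, net_name)


def parallel_self_play(net_names, num_eps, num_workers=4):
    """Dispatch episodes round-robin over the networks and gather the results."""
    with ThreadPool(processes=num_workers) as pool:
        pending = []  # created once, before the loop, so no handle is lost
        for i in range(num_eps):
            net_name = net_names[i % len(net_names)]
            # args must be a tuple; a one-element tuple needs a trailing comma: (x,)
            pending.append(pool.apply_async(fake_self_play, args=(net_name, i)))
        # .get() blocks until the task finishes and re-raises any worker exception
        return [handle.get() for handle in pending]


samples = parallel_self_play(["nnet1", "nnet2", "nnet3"], num_eps=6)
```

Because the handles are collected in submission order, the returned list is ordered by episode index regardless of which worker finishes first.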


My code works fine, with almost 100% GPU utilization throughout the first self-play (less than 10 minutes per iteration). But after the first training iteration, once I load the new weights into nnet1-3, GPU utilization never reaches 80% again (roughly 30 minutes to 1 hour per iteration). I noticed a few things while experimenting with my code:



  1. The model includes batchnorm layers. Switching the model to train() mode, training, and then switching back to eval() causes self-play (which uses the model's forward pass) to not use the GPU at all.


  2. If I do not switch from eval() to train() (that is, I train in eval mode), GPU utilization is lower (30-50%) but does not drop to zero.


  3. If the self-play models do not load the weights from the main model, self-play still utilizes 100% of the GPU, so my guess is that something happens during training that changes some state in the model.


This also happens with a setup of only 8 CPUs and 1 GPU, where the model is trained on the fly (no intermediate model).
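To make the "some state changed" guess easier to check, here is a small diagnostic sketch. It assumes a tiny hypothetical stand-in network (`make_net`) rather than my real 20-block model and runs on CPU. It only verifies three things after the train → `load_state_dict` → `eval()` sequence: that every submodule is in eval mode, that the batchnorm buffers were copied, and that the parameters stayed on their original device. It does not by itself explain the slowdown.

```python
import torch
import torch.nn as nn


def make_net():
    # Tiny hypothetical stand-in for the ResNet-like model with batchnorm.
    return nn.Sequential(nn.Conv2d(1, 4, 3, padding=1), nn.BatchNorm2d(4), nn.ReLU())


trainer, player = make_net(), make_net()

# Training-mode forward passes update the batchnorm running statistics.
trainer.train()
for _ in range(3):
    trainer(torch.randn(8, 1, 5, 5))

# state_dict() holds parameters and buffers
# (running_mean, running_var, num_batches_tracked).
player.load_state_dict(trainer.state_dict())
player.eval()

# eval() is recursive, so every submodule should report training == False.
modes = [m.training for m in player.modules()]
# The batchnorm buffers of both networks should now be identical.
buffers_match = all(
    torch.equal(a, b) for a, b in zip(trainer.buffers(), player.buffers())
)
# load_state_dict copies values in place, so the device should be unchanged.
devices = {p.device.type for p in player.parameters()}
print(modes, buffers_match, devices)
```

If all three checks pass on the real models as well, the mode flags and the copied state are probably not the problem, and the cause is more likely in how the models are handed to the worker processes.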



Can someone point out what I should fix in my code, or how I should train my model?










      python parallel-processing deep-learning pytorch reinforcement-learning






asked Mar 8 at 8:31 by 51616, edited Mar 8 at 8:42





















