Multiple CPU producers with few GPUs do not utilize 100% of the GPUs (PyTorch)

I am trying to implement board-game self-play data generation in parallel, using multiple CPUs to run self-play concurrently. In the parent process I create 4 NN models for 30 CPUs (1 model per 10 CPUs for self-play, plus 1 model to train), each on a different GPU. (The model is a 20-block ResNet-like architecture with batchnorm.) Pseudocode follows:

nnet = NN(gpu_num=0)   # main model, the one that is trained
nnet1 = NN(gpu_num=1)  # self-play models, one per GPU
nnet2 = NN(gpu_num=2)
nnet3 = NN(gpu_num=3)

for i in range(num_iteration):
    # Copy the latest trained weights into the self-play models
    nnet1.load_state_dict(nnet.state_dict())
    nnet2.load_state_dict(nnet.state_dict())
    nnet3.load_state_dict(nnet.state_dict())
    samples = parallel_self_play()
    nnet.train(samples)
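The three load_state_dict calls in the loop above could be folded into one helper. This is only a sketch: `sync_workers` is a hypothetical name, and small `nn.Linear` layers on the CPU stand in for the real NN class, which is assumed to expose `state_dict()` and `load_state_dict()` the way `torch.nn.Module` does.

```python
import torch
import torch.nn as nn


def sync_workers(main_model, worker_models):
    """Copy the main model's parameters and buffers into every worker model."""
    state = main_model.state_dict()
    for worker in worker_models:
        # load_state_dict copies values into the worker's existing tensors
        worker.load_state_dict(state)


torch.manual_seed(0)
main_model = nn.Linear(4, 2)                   # stand-in for nnet
workers = [nn.Linear(4, 2) for _ in range(3)]  # stand-ins for nnet1..nnet3

sync_workers(main_model, workers)

all_synced = all(
    torch.equal(w.weight, main_model.weight) and torch.equal(w.bias, main_model.bias)
    for w in workers
)
print(all_synced)
```

Because the values are copied into tensors the worker already owns, each worker should stay on the device it was created on.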

parallel_self_play() is implemented as follows:

pool = mp.Pool(processes=num_cpu)  # 30 worker processes
results = []
for i in range(self.args.numEps):
    # Assign episodes to the three self-play models in round-robin order
    if i % 3 == 0:
        net = self.nnet1
    elif i % 3 == 1:
        net = self.nnet2
    else:
        net = self.nnet3

    results.append(pool.apply_async(AsyncSelfPlay, args=(net,)))
# get results from results array then return it
return [r.get() for r in results]
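The if/elif chain above is a round-robin assignment of episodes to models, which can also be written as a modulo index into a list. A minimal sketch follows, where `pick_model` is a hypothetical helper and plain strings stand in for the model objects:

```python
def pick_model(episode_index, models):
    """Return the model for this episode, cycling through `models` in order."""
    return models[episode_index % len(models)]


# Stand-in labels; in the post these would be self.nnet1, self.nnet2, self.nnet3.
models = ["nnet1", "nnet2", "nnet3"]

assignment = [pick_model(i, models) for i in range(6)]
print(assignment)
```

Written this way, the dispatch does not need to change if the number of self-play models changes.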

My code works fine, with almost 100% GPU utilization throughout the first round of self-play (less than 10 minutes per iteration). But after the first training step, once I load the new weights into nnet1-3, GPU utilization never reaches 80% again (~30 minutes to 1 hour per iteration). I noticed a few things while experimenting with my code:

The model includes batchnorm layers. Switching the model to train() mode, training, and then switching back to eval() causes self-play (which uses the model's forward pass) to not use the GPU at all.

If I do not switch from eval() to train() (that is, I train in eval mode), GPU utilization during self-play is lower (30-50%) but not entirely gone.

If the self-play models do not load the weights from the main model, self-play still utilizes 100% of the GPU, so my guess is that something happens during training that changes some state in the model.

This also happens with only 8 CPUs and 1 GPU, training the model on the fly (no intermediate model).
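One way to inspect what state training changes and what load_state_dict carries over is to look at a batchnorm layer directly. The sketch below runs on the CPU with a tiny conv + batchnorm stack as a stand-in for the real network (`make_block` is a hypothetical helper). It is a diagnostic aid under those assumptions, not an explanation of the slowdown.

```python
import torch
import torch.nn as nn


def make_block():
    # Tiny stand-in for one conv + batchnorm pair of the real network
    return nn.Sequential(nn.Conv2d(1, 2, 3, padding=1), nn.BatchNorm2d(2))


torch.manual_seed(0)
main = make_block()
worker = make_block()

# A forward pass in train() mode updates the batchnorm running statistics.
main.train()
main(torch.randn(4, 1, 5, 5))
main.eval()

# The worker stays in eval mode; load_state_dict does not change the mode flag.
worker.eval()
worker.load_state_dict(main.state_dict())

bn_main, bn_worker = main[1], worker[1]
same_stats = torch.equal(bn_main.running_mean, bn_worker.running_mean)
print(worker.training, same_stats, int(bn_worker.num_batches_tracked))
```

The batchnorm buffers (running_mean, running_var, num_batches_tracked) are part of the state dict, so they travel with the weights, while the train/eval flag is a property of each module and has to be set separately on each copy.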

Can someone point out what I should fix in my code, or how I should train my model?

edited Mar 8 at 8:42

asked Mar 8 at 8:31


python parallel-processing deep-learning pytorch reinforcement-learning
