
Is multi-node SageMaker training batched per-node or shared?



I am using TensorFlow, and I notice that individual steps are slower with multiple nodes than with a single node, so I am unsure what constitutes a step when training across multiple nodes on SageMaker.

If my batch size is 10 and I have 5 training nodes, is a "step" 2 examples from each node or 10 from each node?

What if I have a batch size of 1 and 5 nodes?

Note: a 'node' here is an individual training instance; the count comes from train_instance_count=5.
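
For context, a minimal sketch of how such a job is launched with the SageMaker Python SDK (the v1-style API that train_instance_count belongs to); the entry-point script, IAM role, instance type, and S3 path are placeholders, not values from the question:

    from sagemaker.tensorflow import TensorFlow

    # Five identical training instances; each one runs the same entry-point script.
    estimator = TensorFlow(entry_point='train.py',            # placeholder script
                           role='arn:aws:iam::123456789012:role/SageMakerRole',  # placeholder role
                           train_instance_count=5,
                           train_instance_type='ml.p3.2xlarge',
                           framework_version='1.12')

    estimator.fit('s3://my-bucket/training-data')             # placeholder S3 path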










Tags: amazon-web-services, tensorflow, machine-learning, amazon-sagemaker






asked Mar 8 at 20:14 by King Dedede






















1 Answer
































Please look at this notebook for an example of distributed training with TF: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_distributed_mnist/tensorflow_distributed_mnist.ipynb

"Each instance will predict a batch of the dataset, calculate the loss and run the optimizer. One entire loop of this process is called a training step.

A global step is a global variable shared between the instances. It's necessary for distributed training, so the optimizer keeps track of the number of training steps across runs:

    train_op = optimizer.minimize(loss, tf.train.get_or_create_global_step())

That is the only required change for distributed training!"
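
To place that line in context, here is a hedged sketch of where it sits inside a TF 1.x Estimator model_fn; the one-layer network is purely illustrative, only the global-step usage matters:

    import tensorflow as tf

    def model_fn(features, labels, mode):
        # Illustrative placeholder network: one dense layer over a feature named 'x'.
        logits = tf.layers.dense(features['x'], units=10)
        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

        optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
        # Passing the shared global step means every worker's update increments
        # the same counter, so the step count is global across all instances.
        train_op = optimizer.minimize(loss, tf.train.get_or_create_global_step())

        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

Read alongside the quoted notebook ("each instance will predict a batch of the dataset"), this suggests each of the 5 instances processes its own full batch per step, so a per-node batch size of 10 means 10 examples on each worker per step, not 2.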






























answered Mar 11 at 15:54 by Julien Simon




























