


Masked language model processing, deeper explanation


























I am looking at the BERT model in detail (you can find the description here), and I am having trouble understanding clearly why, for the masked language model, the selected word is kept unchanged or replaced with a random word 20% of the time instead of always using the [MASK] token.

We are trying to train the bidirectional technique, and the paper explains that "the [MASK] token is never seen during fine-tuning", but to me those are two different steps: first we do the bidirectional pre-training, and afterwards the downstream task.

Can someone explain where my understanding goes wrong?
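For context, the masking procedure described in the BERT paper (of the ~15% of tokens selected for prediction, 80% become [MASK], 10% become a random word, and 10% are left unchanged) can be sketched roughly like this. This is only a minimal illustration of the idea, not the authors' code; the whitespace tokens and toy vocabulary are placeholders:

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """BERT-style masking sketch: select ~15% of positions; of those,
    80% become [MASK], 10% become a random vocabulary word, and
    10% are left unchanged. Returns the corrupted sequence plus the
    positions and original words the model must predict."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}  # position -> original token (loss is computed only here)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: replace with a random word
            # else: 10% keep the original token unchanged
    return corrupted, targets

# Example usage with a toy vocabulary
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, targets = mask_tokens(tokens, vocab, seed=0)
print(corrupted, targets)
```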










Tags: nlp, stanford-nlp






      asked Mar 8 at 15:09









Jonor

1 Answer




















If you don't use random replacement during training, your network won't learn to extract useful features from the non-masked tokens.

In other words, if you only use masking and only try to predict those positions, it is a waste of resources for your network to extract good features for the non-masked tokens (remember that your network is only as good as your task, and it will try to find the easiest way to solve it).

answered Mar 10 at 20:56 by Separius























• Thanks, but I don't see the point of leaving the sentence unchanged in that case, rather than just using random words sometimes.
  – Jonor, Mar 11 at 10:17











• If you always replaced the words, then your network would always try to guess something other than the word provided (remember that the loss is calculated only for some of the words), and that would be bad and destructive.
  – Separius, Mar 11 at 10:41
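To make the point in the comment above concrete (the loss is computed only for the words selected for prediction, not for every position), here is a rough sketch using PyTorch's `ignore_index` convention for `cross_entropy`. The shapes, label values, and the `-100` sentinel are illustrative assumptions, not taken from the BERT code:

```python
import torch
import torch.nn.functional as F

# Toy shapes: 1 sentence, 6 token positions, a vocabulary of 7 words.
logits = torch.randn(1, 6, 7)                        # stand-in for the MLM head output
labels = torch.full((1, 6), -100, dtype=torch.long)  # -100 = "ignore this position"
labels[0, 2] = 4   # only position 2 was selected for prediction...
labels[0, 5] = 1   # ...and position 5; the original token ids are the targets

# cross_entropy skips positions equal to ignore_index, so the model is graded
# only on the words it was asked to predict, never directly on the untouched ones.
loss = F.cross_entropy(logits.view(-1, 7), labels.view(-1), ignore_index=-100)
print(loss)
```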










