Masked language model processing, deeper explanation
I'm looking at the BERT model in detail (you can find the description here), and I'm having trouble understanding clearly why, for the masked language model, the selected word is kept or replaced with a random word 20% of the time instead of always being replaced with the [MASK] token.
We train with the bidirectional technique, and the article explains that the "[MASK] token is never seen during fine-tuning", but to me these are two different steps: first we train the bidirectional model, and afterwards we fine-tune on the downstream task.
Can someone explain where my understanding goes wrong?
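For reference, the 15% selection with the 80/10/10 split described in the paper can be sketched roughly as follows (a minimal illustration in plain Python; the function name and toy vocabulary are made up for the example, and this is not the actual BERT preprocessing code):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, vocab=("the", "cat", "dog", "sat", "mat")):
    """BERT-style corruption of a token list.

    Roughly 15% of positions are selected for prediction. For each selected
    position: 80% of the time the token becomes [MASK], 10% of the time it is
    replaced by a random vocabulary token, and 10% of the time it is left as-is.
    Returns the corrupted tokens plus (position, original token) pairs, which
    are the only positions the loss is computed on.
    """
    corrupted = list(tokens)
    targets = []
    for i, tok in enumerate(tokens):
        if random.random() >= mask_prob:
            continue  # not selected: no prediction, no loss at this position
        targets.append((i, tok))
        r = random.random()
        if r < 0.8:
            corrupted[i] = MASK_TOKEN            # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.choice(vocab)  # 10%: replace with a random token
        # else (remaining 10%): keep the original token unchanged
    return corrupted, targets

print(mask_tokens("my dog is hairy".split()))
```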
nlp stanford-nlp
asked Mar 8 at 15:09 by Jonor
1 Answer
If you don't use random replacement during training, your network won't learn to extract useful features from the non-masked tokens. In other words, if you only masked tokens and tried to predict them, it would be a waste of resources for the network to extract good features for the non-masked tokens (remember that your network is only as good as your task, and it will try to find the easiest way to solve that task).
answered Mar 10 at 20:56 by Separius
Thanks, but I don't see the point of leaving the sentence unchanged in that case, rather than just using random words sometimes. – Jonor Mar 11 at 10:17
If you always replaced the words, then your network would always try to guess something other than the word provided (remember that the loss is calculated only for some of the words), and that would be bad and destructive. – Separius Mar 11 at 10:41
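To make the "loss only for some words" point concrete, here is a small PyTorch sketch of how the masked-LM loss ignores the unselected positions (the tensor shapes and random logits are made up for illustration; this is not the BERT reference implementation):

```python
import torch
import torch.nn.functional as F

# Toy example: 1 sentence, 6 positions, vocabulary of 10 tokens.
# In BERT these logits would come from the transformer encoder.
logits = torch.randn(1, 6, 10)

# Label -100 marks positions that were NOT selected for prediction;
# cross_entropy skips them via ignore_index, so the model only gets
# gradient from the ~15% of positions that were masked, randomized,
# or deliberately kept unchanged.
labels = torch.tensor([[-100, 7, -100, -100, 3, -100]])

loss = F.cross_entropy(logits.view(-1, 10), labels.view(-1), ignore_index=-100)
print(loss)
```

Because the kept-unchanged tokens are still scored, the model cannot assume that the input token at a predicted position is always wrong, which is the degenerate behavior the "keep the original word" case is there to prevent.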