


How to sed replace UTF-8 characters with HTML entities?


















I'm running Cygwin under Windows 10.

I have a dictionary file (1-dictionary.txt) that looks like this:



labelling	labeling
flavour	flavor
colour	color
organisations	organizations
végétales	v&#x00E9;g&#x00E9;tales
contrôlée	contr&#x00F4;l&#x00E9;e
"	&#x0022

The separators between the columns are TABs (\t).



The dictionary file is encoded as UTF-8.



I want to replace the words and symbols in the first column with the words and HTML entities in the second column.

My source file (2-source.txt) contains the target UTF-8 and ASCII symbols. The source file is also encoded as UTF-8.



Sample text looks like this:



Cultivar was coined by Bailey and it is generally regarded as a portmanteau of "cultivated" and "variety" ... The International Union for the Protection of New Varieties of Plants (UPOV - French: Union internationale pour la protection des obtentions végétales) offers legal protection of plant cultivars ...Terroir is the basis of the French wine appellation d'origine contrôlée (AOC) system


I run the following sed one-liner in a shell script (./3-script.sh):



sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' 1-dictionary.txt) 2-source.txt > 3-translation.txt



The substitution of English (en-GB) words with American (en-US) words in 3-translation.txt is successful.



However the substitution of ASCII symbols, such as the quote symbol, and UTF-8 words produces this result:



vvégétales#x00E9;gvégétales#x00E9;tales)
contrcontrôlée#x00F4;lcontrôlée#x00E9;e (AOC)


If I use only the specific symbol (not the full word), I get results like this:



vé#x00E9;gé#x00E9;tales
"#x0022cultivated"#x0022
contrô#x00F4;lé#x00E9;e


The ASCII quote symbol has &#x0022 appended to it; it is not replaced.

Similarly, each UTF-8 symbol has its HTML entity appended to it instead of being replaced by the HTML entity.



The expected output would look like this:



v&#x00E9;g&#x00E9;tales
&#x0022cultivated&#x0022
contr&#x00F4;l&#x00E9;e


How can I modify the sed script so that the target ASCII and UTF-8 symbols are replaced with their HTML entity equivalents as defined in the dictionary file?





























Possible duplicate of Unexpected substitution for & with sed
– tripleee
Mar 8 at 17:47

Possible duplicate of stackoverflow.com/questions/407523/…
– tripleee
Mar 8 at 17:48

I tried it: replacing every & with \& in your 1-dictionary.txt will solve your problem. Try it and see if it works.
– Tiw
Mar 8 at 18:09

























asked Mar 8 at 17:32 by Jay Gray (edited Mar 8 at 17:53)



















1 Answer














I tried it: replacing every & with \& in your 1-dictionary.txt will solve your problem.
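To see why, recall that an unescaped & in the replacement part of an s command stands for the whole matched text. The following is a minimal sketch using one entry from the question's dictionary; the variable names are only for illustration.

```shell
# An unescaped & in the replacement expands to the matched text,
# so the rest of the entity ends up appended to the match.
broken=$(printf '%s\n' 'végétales' | sed 's/é/&#x00E9;/g')

# Written as \&, it inserts a literal ampersand instead.
fixed=$(printf '%s\n' 'végétales' | sed 's/é/\&#x00E9;/g')

printf '%s\n' "$broken"   # vé#x00E9;gé#x00E9;tales
printf '%s\n' "$fixed"    # v&#x00E9;g&#x00E9;tales
```

The first result matches the output reported in the question, which suggests this is the cause.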



sed's substitute command uses a regex as the "from" part, so when you use it like that, watch for regex special characters and add a \ in front of them so that they are escaped.
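As a rough sketch of this point, the helper below backslash-escapes the characters that are special in a basic regular expression, plus the / delimiter, so that a dictionary key is matched literally. The name escape_pattern and the sample strings are hypothetical, and the character list assumes sed is run without -E.

```shell
# Hypothetical helper: make a string safe to use as a literal
# basic-regex pattern between / delimiters.
# The inner sed uses | as its delimiter so that / can sit in the bracket list.
escape_pattern() {
    printf '%s\n' "$1" | sed 's|[][\.*^$/]|\\&|g'
}

pattern=$(escape_pattern 'a.b*c/d')
printf '%s\n' "$pattern"   # a\.b\*c\/d

# The escaped pattern now matches only the literal text.
result=$(printf '%s\n' 'a.b*c/d axb*c/d' | sed "s/$pattern/X/g")
printf '%s\n' "$result"    # X axb*c/d
```

Without the escaping, the . would match any character and the / would end the pattern early.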



The "to" part has special characters too, mainly \ and &; add an extra \ in front of them so that they are escaped as well.
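Putting both points together, here is one way the script generation from the question could be written so that both columns are escaped before they are turned into s commands. It is a sketch under assumptions: the temporary files stand in for 1-dictionary.txt and 2-source.txt, the columns are separated by tabs, and the sample data is abbreviated from the question.

```shell
tab=$(printf '\t')
dict=$(mktemp)      # stand-in for 1-dictionary.txt
src=$(mktemp)       # stand-in for 2-source.txt
script=$(mktemp)    # generated sed script

printf 'colour\tcolor\n' > "$dict"
printf 'végétales\tv&#x00E9;g&#x00E9;tales\n' >> "$dict"
printf 'colour des obtentions végétales\n' > "$src"

# Build one "s/from/to/g" command per dictionary line.
while IFS="$tab" read -r from to; do
    # Escape regex special characters and the / delimiter in the pattern.
    from=$(printf '%s\n' "$from" | sed 's|[][\.*^$/]|\\&|g')
    # Escape \, & and / in the replacement.
    to=$(printf '%s\n' "$to" | sed 's|[\&/]|\\&|g')
    printf 's/%s/%s/g\n' "$from" "$to"
done < "$dict" > "$script"

translation=$(sed -f "$script" "$src")
printf '%s\n' "$translation"   # color des obtentions v&#x00E9;g&#x00E9;tales
rm -f "$dict" "$src" "$script"
```

Escaping in the generator rather than in the dictionary keeps 1-dictionary.txt readable, at the cost of a slightly longer script. Entries that contain newlines are not handled here.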



The points above are described in GNU sed's documentation; for other sed versions, you can also check man sed.














answered Mar 8 at 18:59 by Tiw (edited Mar 8 at 19:07)




























