Unicode - Extract chars in a String contains Tamil letters in Java

I am working on Unicode support in our system, so I want to split a string that contains Tamil letters into its individual letters. I don't know how to handle non-English strings in Java.

String word = "தமிழ்";
String[] chars = word.split("");

Actual output:

[த, ம, ி, ழ, ்]

Expected output:

[த, மி, ழ்]










java unicode-string

asked Mar 7 at 1:21 by Arunachalam Sibisakkaravarthi (edited Mar 7 at 1:29)






















1 Answer

The String assigned to "word" in fact consists of 5 Unicode characters (code points). The 3rd character, TAMIL VOWEL SIGN I (U+0BBF), combines with the preceding one, TAMIL LETTER MA (U+0BAE), to make one displayed symbol.
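
To see the five code points for yourself, a minimal sketch along these lines may help. The class and method names (CodePointDump, describe) are made up for illustration, and the word is written with \u escapes so the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointDump {
    /** Returns one "U+XXXX NAME" entry per code point in the input. */
    static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        s.codePoints().forEach(cp ->
            out.add(String.format("U+%04X %s", cp, Character.getName(cp))));
        return out;
    }

    public static void main(String[] args) {
        // The word from the question, written as escapes: தமிழ்
        String word = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD";
        describe(word).forEach(System.out::println);
    }
}
```

This should list five entries, with U+0BBF in third place and U+0BCD last, which is why split("") returns five elements.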



Since split("") breaks the word into individual characters, 5 elements is what you'll get. There is no single Unicode character for (for example) the middle symbol displayed in your original string.



Because of combining characters, the number of symbols displayed on screen is not necessarily the same as the number of Unicode characters. In general, programs that render Unicode strings have to be aware of combining characters.



See this document for issues of Tamil in Unicode. Figure 12-21 discusses the vowel sign i, which is the middle character of the 5.



It's not clear what your purpose is in splitting the string. If you really want "apparent symbols" (I'm making up this term; Unicode calls them grapheme clusters), then you'll presumably need to scan the resulting characters and attach each combining character to the character before it.
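
The scan described above could be sketched roughly as follows. This is a simplified illustration, not a full grapheme-cluster implementation: it treats every code point in general category Mn, Mc or Me as belonging to the preceding base character, and it ignores things like joiners and multi-consonant conjuncts. The names CombiningMarkGrouper, isCombiningMark and group are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CombiningMarkGrouper {
    /** True for general categories Mn, Mc and Me (combining marks). */
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** Groups each base code point with the combining marks that follow it. */
    static List<String> group(String s) {
        List<String> groups = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // A non-mark starts a new group, so flush the previous one first.
            if (!isCombiningMark(cp) && current.length() > 0) {
                groups.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            groups.add(current.toString());
        }
        return groups;
    }

    public static void main(String[] args) {
        // The word from the question, written as escapes: தமிழ்
        String word = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD";
        System.out.println(group(word)); // three groups: த, மி, ழ்
    }
}
```

For the word in the question this yields the three expected groups, because U+0BBF is a spacing combining mark (Mc) and U+0BCD is a non-spacing mark (Mn).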



This document describes one approach, based on BreakIterator, that seems like it would work for you, though the page says there are better facilities in releases after JDK 8, which I did not take the time to look for. Still, it may illuminate what's going on a little more thoroughly.
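
A BreakIterator-based split might look something like the sketch below. It uses java.text.BreakIterator.getCharacterInstance() from the standard library; the wrapper class and method names (BreakIteratorSplit, split) are made up here. Exactly where the boundaries fall for Tamil text may depend on the JDK version, so check the result on the runtime you actually deploy to.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class BreakIteratorSplit {
    /** Splits the text at the character boundaries reported by BreakIterator. */
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        // Walk from boundary to boundary until the iterator reports DONE.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        // The word from the question, written as escapes: தமிழ்
        String word = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD";
        System.out.println(split(word));
    }
}
```

Whatever boundaries the runtime picks, the pieces always concatenate back to the original string, which makes this easy to sanity-check.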































• Yes, I understand the problem that it shows 5 chars. I used a regex to parse/split/process the string which contains Tamil letters.

  – Arunachalam Sibisakkaravarthi
  Mar 8 at 2:38

• Wrong tool for the job, then? split deals in Unicode characters, not what I clumsily called "apparent symbols". You could possibly write a more complicated regex that handles units of one character followed by some number of combining characters, but that is then no longer a plain use of split, and the BreakIterator solution looks more attractive to me.

  – another-dave
  Mar 8 at 13:08

• You could try splitting on the regex (?U)(?!\p{Mc}), but I'm far from confident about that, particularly the Mc part. You might also have to set the locale first.

  – another-dave
  Mar 9 at 17:24
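
As a hedged variation on the regex idea in the last comment: the sketch below splits at every position that is not followed by a combining mark. It swaps \p{Mc} for \p{M} (all mark categories), since the final sign in the word, U+0BCD, is category Mn rather than Mc. It assumes Java 8 or later, where split does not produce a leading empty string for a zero-width match at the start, and text within the Basic Multilingual Plane. The names RegexMarkSplit and split are illustrative only.

```java
import java.util.Arrays;
import java.util.List;

class RegexMarkSplit {
    /** Splits before every character that is not a combining mark. */
    static List<String> split(String text) {
        // (?!\p{M}) is a zero-width match wherever the next char is not a mark.
        return Arrays.asList(text.split("(?!\\p{M})"));
    }

    public static void main(String[] args) {
        // The word from the question, written as escapes: தமிழ்
        String word = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD";
        System.out.println(split(word)); // three pieces: த, மி, ழ்
    }
}
```

This is compact, but like any mark-based grouping it is only an approximation of what readers perceive as letters.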










Answer by another-dave: answered Mar 7 at 12:39, edited Mar 7 at 22:27











