Unicode - Extract chars in a String contains Tamil letters in Java
I am working on Unicode support in our system, and I need to split a string that contains Tamil letters into its characters. I don't know how to handle non-English strings in Java.
String word = "தமிழ்";
String[] chars = word.split("");
Actual output:
[த, ம, ி, ழ, ்]
Expected output:
[த, மி, ழ்]
java unicode-string
asked Mar 7 at 1:21 by Arunachalam Sibisakkaravarthi (edited Mar 7 at 1:29)
1 Answer
The String assigned to "word" in fact contains 5 Unicode characters (code points). The 3rd character, for example, is U+0BBF (TAMIL VOWEL SIGN I), which combines with the preceding one, U+0BAE (TAMIL LETTER MA), to make one displayed symbol.
Since you're splitting the word into characters, 5 characters is what you get. There is no single character that corresponds to, for example, the middle symbol displayed in your original string.
Because of combining characters, the number of symbols displayed on screen is not necessarily the same as the number of Unicode characters. In general, programs that render Unicode strings have to be aware of combining characters.
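To see this for yourself, you can dump the code points of the string. The following is only a minimal sketch; the class name CodePointDump and the helper describe are made up for illustration, and the word is written with Unicode escapes so the example does not depend on the source file's encoding.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointDump {
    // "தமிழ்" spelled out as escapes: TA, MA, VOWEL SIGN I, LLLA, VIRAMA.
    static final String WORD = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD";

    /** Returns one descriptive line per code point, flagging combining marks. */
    static List<String> describe(String s) {
        List<String> lines = new ArrayList<>();
        s.codePoints().forEach(cp -> {
            int type = Character.getType(cp);
            // General categories Mn, Mc and Me are the combining marks.
            boolean isMark = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            lines.add(String.format("U+%04X %s%s", cp, Character.getName(cp),
                    isMark ? " (combining mark)" : ""));
        });
        return lines;
    }

    public static void main(String[] args) {
        describe(WORD).forEach(System.out::println);
    }
}
```

Running it lists five code points, with the third (U+0BBF) and the fifth (U+0BCD) flagged as combining marks. Those are exactly the two pieces that split("") separated from their base letters.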
See this document for issues of Tamil in Unicode. Figure 12-21 discusses the i vowel sign, which is the middle character of the 5.
It's not clear what your purpose is in splitting the string. If you really want "apparent symbols" (I'm making up this term; the Unicode term is grapheme clusters), you'll presumably need to scan the resulting characters for combining characters.
This document describes one approach that seems like it would work for you, though the page says there are better facilities in releases after JDK 8, which I did not take the time to look for. Still, it may illuminate what's going on a little more thoroughly.
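Assuming the approach referred to is java.text.BreakIterator (the comments on this answer suggest so), a rough sketch could look like the following. The class name BreakIteratorSplit and the helper split are hypothetical. How marks get grouped depends on the boundary rules shipped with your JDK, so check the result for Tamil on the release you actually run.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class BreakIteratorSplit {
    /** Splits text at the character boundaries reported by BreakIterator. */
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        // Walk boundary to boundary; DONE signals that the end was reached.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        // "தமிழ்" as escapes, to avoid source-encoding problems.
        System.out.println(split("\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"));
    }
}
```

The pieces always concatenate back to the original string; what may vary between JDK releases is whether a vowel sign or virama stays attached to its base letter.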
answered Mar 7 at 12:39 by another-dave (edited Mar 7 at 22:27)
Yes, I understand the problem that it shows 5 chars. I used a regex to parse/split/process the string which contains Tamil letters.
– Arunachalam Sibisakkaravarthi
Mar 8 at 2:38
Wrong tool for the job, then? split deals in Unicode characters, not what I clumsily called "apparent symbols". It's possible you could write a more complicated regex that handles units of a character followed by some number of combining characters, but that is then not a use of split, and the BreakIterator solution looks more attractive to me.
– another-dave
Mar 8 at 13:08
You could try splitting on the regex (?U)(?!\p{Mc}) but I'm far from confident about that, particularly the Mc part. You might also have to set the locale first.
– another-dave
Mar 9 at 17:24
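For what it's worth, the regex idea from the comments can be sketched as below. This is a variation rather than the exact pattern suggested: it uses the whole mark category \p{M} instead of only Mc, because the final sign in the example word (U+0BCD) is a non-spacing mark (Mn) and a test for Mc alone would still split it off. The class name MarkAwareSplit and its two helpers are made up for illustration, and the split variant assumes Java 8 or later, where a zero-width match at the start of the input no longer produces a leading empty string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkAwareSplit {
    // One non-mark character followed by any number of combining marks (Mn, Mc, Me).
    private static final Pattern BASE_PLUS_MARKS = Pattern.compile("\\P{M}\\p{M}*");

    /** Collects each base character together with the marks that follow it. */
    static List<String> byMatching(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = BASE_PLUS_MARKS.matcher(text);
        // Note: marks at the very start of the text, with no base, are skipped.
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    /** Splits at every position that is not directly followed by a combining mark. */
    static List<String> bySplitting(String text) {
        return Arrays.asList(text.split("(?!\\p{M})"));
    }

    public static void main(String[] args) {
        String word = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"; // "தமிழ்"
        System.out.println(byMatching(word));
        System.out.println(bySplitting(word));
    }
}
```

For the word in the question, both helpers return three pieces: த, மி and ழ். This is a simplification of real grapheme cluster rules, so treat it as a starting point and test it against the other scripts and sequences you need to support.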