Counting Characters
Let’s start with a simple problem: “How many characters are in a string?” Our first implementation:
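A minimal sketch of such a first attempt, assuming it simply calls the built-in `len`. The name `countBytes` is illustrative and not taken from the original listing:

```go
package main

import "fmt"

// countBytes is a hypothetical helper: the naive "character count".
// For a Go string, len reports the number of bytes.
func countBytes(s string) int {
	return len(s)
}

func main() {
	fmt.Println(countBytes("hello")) // 5
}
```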
Yay, all done, right?
Anyone who has worked with Unicode strings knows that this returns the number of bytes, not the number of characters.
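To make the difference visible, here is a small sketch comparing an ASCII string with a non-ASCII one. Go strings hold UTF-8 here; the helper name `byteLen` is an assumption for illustration:

```go
package main

import "fmt"

// byteLen is a hypothetical helper wrapping the built-in len.
func byteLen(s string) int {
	return len(s)
}

func main() {
	fmt.Println(byteLen("hello")) // 5: one byte per ASCII letter
	fmt.Println(byteLen("你好"))    // 6: each of these two characters takes three bytes in UTF-8
}
```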
Let’s try again:
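A sketch of the second attempt, assuming it counts runes (Unicode code points) with `utf8.RuneCountInString` from the standard library. The name `countRunes` is illustrative:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countRunes is a hypothetical helper: it counts code points, not bytes.
func countRunes(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	fmt.Println(countRunes("hello")) // 5
	fmt.Println(countRunes("你好"))    // 2
}
```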
Huh, is it now correct?
Not quite.
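A sketch of where rune counting goes wrong. It assumes the word is stored in decomposed form, so each breve is written out as the escape `\u0306` (COMBINING BREVE) to make the separate code point explicit:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countRunes is a hypothetical helper that counts code points.
func countRunes(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	// "hĕllŏ" with each breve as a separate combining rune after e and o.
	s := "he\u0306llo\u0306"
	fmt.Println(countRunes(s)) // 7
}
```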
So, still not right. It looks like 5 characters, but to delete that string you would need to press backspace 7 times. The catch is that a single character can be composed of multiple code points. See the little u shape on top of the e and the o? That is a diacritic. It is represented by a separate rune, which is then combined with the preceding letter.
There are examples of the reverse as well. For example, the ligature “ﬃ” takes a single backspace to delete, but looks like three characters smushed together. There are many other examples, including ㈎ and ẛ̣.
The “number of characters” depends on the context. But what do you actually want to know? The “correct” answer is out of the scope of this article, but it is rarely the number of runes. Here’s a comparison of different ways of counting characters:
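One possible comparison, using only the standard library. The type and function names are assumptions for illustration, and the last measure (runes that are not combining marks) is only a rough stand-in for grapheme clusters, not the segmentation algorithm from UAX #29:

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf16"
	"unicode/utf8"
)

// counts holds several notions of "length" for one string (hypothetical type).
type counts struct {
	Bytes, Runes, UTF16, NonMarks int
}

// count computes each measure for s. NonMarks skips combining marks,
// which approximates user-perceived characters only in simple cases.
func count(s string) counts {
	c := counts{
		Bytes: len(s),                          // UTF-8 bytes
		Runes: utf8.RuneCountInString(s),       // code points
		UTF16: len(utf16.Encode([]rune(s))),    // UTF-16 code units
	}
	for _, r := range s {
		if !unicode.IsMark(r) {
			c.NonMarks++
		}
	}
	return c
}

func main() {
	samples := []string{"hello", "he\u0306llo\u0306", "\ufb03", "你好"}
	for _, s := range samples {
		fmt.Printf("%q: %+v\n", s, count(s))
	}
}
```

For the decomposed word this gives 9 bytes, 7 runes, 7 UTF-16 units and 5 non-mark runes. For the ligature U+FB03 every measure except bytes says 1, even though it reads as three letters.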
I’m aware that some of these give the same answer, but usually you want to do something else with the string, not just count the characters.
PS: this is the wrong way to write a word reversing function:
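A sketch of the kind of reversal meant here, assuming it was the common rune-swapping version. The name `reverseRunes` is illustrative:

```go
package main

import "fmt"

// reverseRunes is a hypothetical helper that reverses a string rune by rune.
// It detaches combining marks from their base letters, so it is wrong for
// any text where one character spans several code points.
func reverseRunes(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

func main() {
	fmt.Println(reverseRunes("你好")) // 好你

	// Decomposed "hĕllŏ": after reversal each breve follows the wrong letter,
	// and the first one has no base letter at all.
	fmt.Printf("%+q\n", reverseRunes("he\u0306llo\u0306")) // "\u0306oll\u0306eh"
}
```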
As an exercise, try implementing a string reverse that does the right thing with “hĕllŏ”, “ﬃ”, “㈎” and “你好”.
Read more at:
- https://blog.golang.org/strings
- https://blog.golang.org/normalization
- http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
- https://unicode.org/reports/tr15/
- https://developer.apple.com/library/content/documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html
- https://mathias.gaunard.com/unicode/doc/html/unicode/introduction_to_unicode.html