Egon Elbre

Counting Characters

Let’s start with a simple problem: “How many characters are in a string?” Our first implementation:

func CharacterCount1(s string) int {
    return len(s)
}

fmt.Println(CharacterCount1("hello"))
// Output: 5

Yay, all done, right?

Anyone who has encountered Unicode strings knows that this returns the number of bytes, not the number of characters.

fmt.Println(CharacterCount1("你好"))  
// Output: 6

Let’s try again:

func CharacterCount2(s string) int {
    return len([]rune(s))
}

fmt.Println(CharacterCount2("你好"))
// Output: 2

Huh, is it now correct?

Not quite.

fmt.Println(CharacterCount2("hĕllŏ"))  
// Output: 7

So, still not right. It looks like 5 characters; however, to delete that string you would need to press backspace 7 times. The catch is that a character can be composed of multiple runes. Do you see the little u shape on top of the e and the o? That is a diacritic (a breve). It is represented by a separate rune that combines with the preceding letter.
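To see the separate runes, one option is to print each code point and flag the combining marks. This is a minimal sketch using only the standard library; the helper name `describeRunes` is made up for illustration, and the string is written with explicit escapes on the assumption that the breve is U+0306 (COMBINING BREVE).

```go
package main

import (
	"fmt"
	"unicode"
)

// describeRunes (hypothetical helper) returns one line per rune in s,
// labeling nonspacing combining marks (Unicode category Mn).
func describeRunes(s string) []string {
	var out []string
	for _, r := range s {
		kind := "base"
		if unicode.Is(unicode.Mn, r) {
			kind = "combining mark"
		}
		out = append(out, fmt.Sprintf("%U %s", r, kind))
	}
	return out
}

func main() {
	// "hĕllŏ" spelled out: each breve is its own rune, U+0306.
	for _, line := range describeRunes("he\u0306llo\u0306") {
		fmt.Println(line)
	}
}
```

This prints seven lines, two of which are `U+0306 combining mark`, which is where the count of 7 comes from.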

There are examples of the reverse as well. For example, the ligature “ﬃ” requires a single backspace to delete, but looks like three characters smushed together. There are many other examples, including ㈎ and ẛ̣.

The “number of characters” depends on the context. But what do you actually want to know? The “correct” answer is outside the scope of this article, but it is rarely the number of runes. Here’s a comparison of different ways of counting characters:

Output:
   bytes   runes   NFC     NFD     NFKC    NFKD    Regex   Graph.. Text
   5       5       5       5       5       5       5       5       "hello"
   6       2       2       2       2       2       2       2       "你好"
   9       7       5       5       5       5       5       5       "hĕllŏ"
   12      8       4       4       4       4       4       4       "l̲i̲n̲e̲"
   3       1       1       1       2       2       1       1       "ﬁ"
   3       1       1       1       3       3       1       1       "ﬃ"
   3       1       1       1       3       3       1       1       "㈎"
   5       2       1       1       1       1       1       1       "ẛ̣"

I’m aware that some of these give the same answer, but usually you want to do something else with the string, not just count the characters.
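As a rough illustration of how three of those columns can differ, here is a sketch that uses only the standard library. It is not the program that produced the table. The `counts` helper is hypothetical, and its third value only approximates graphemes by skipping combining marks, which is an assumption that falls well short of full Unicode text segmentation.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// counts (hypothetical helper) returns the byte count, the rune count, and
// an approximate grapheme count for s. The approximation treats every rune
// that is not a combining mark as the start of a new character.
func counts(s string) (bytes, runes, approx int) {
	bytes = len(s)
	runes = utf8.RuneCountInString(s)
	for _, r := range s {
		if !unicode.IsMark(r) {
			approx++
		}
	}
	return bytes, runes, approx
}

func main() {
	samples := []string{
		"hello",
		"你好",
		"he\u0306llo\u0306", // "hĕllŏ" with combining breves
		"\uFB03",            // the "ffi" ligature as a single rune
	}
	fmt.Println("bytes\trunes\tapprox")
	for _, s := range samples {
		b, r, a := counts(s)
		fmt.Printf("%d\t%d\t%d\t%q\n", b, r, a, s)
	}
}
```

For “hĕllŏ” this gives 9 bytes, 7 runes and 5 approximate graphemes, in line with the bytes, runes and Graph.. columns above.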

PS: this is the wrong way to write a string-reversing function:

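package main

import "fmt"

// Reverse swaps runes end to end, so combining marks end up attached
// to whichever rune now precedes them.
func Reverse(s string) string {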
	chars := []rune(s)
	for i, j := 0, len(chars)-1; i < j; i, j = i+1, j-1 {
		chars[i], chars[j] = chars[j], chars[i]
	}
	return string(chars)
}

func main() {
	fmt.Println(Reverse("hĕllŏx"))
	// Output: x̆oll̆eh
}

As an exercise, try implementing a string reverse that does the right thing with “hĕllŏ”, “ffi”, “㈎” and “你好”.
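One possible starting point for the exercise is to group each base rune with the combining marks that follow it, and then reverse the groups instead of the individual runes. The sketch below uses only the standard library. `ReverseClusters` is a made-up name, and treating "base rune plus trailing marks" as one character is a simplifying assumption, not full grapheme segmentation. It leaves single-rune ligatures such as “ﬃ” untouched, which may or may not be what you want.

```go
package main

import (
	"fmt"
	"unicode"
)

// ReverseClusters (hypothetical name) reverses s while keeping every
// combining mark behind the base rune it originally followed.
func ReverseClusters(s string) string {
	var clusters [][]rune
	for _, r := range s {
		// Attach a mark to the previous cluster, if there is one.
		if unicode.IsMark(r) && len(clusters) > 0 {
			last := len(clusters) - 1
			clusters[last] = append(clusters[last], r)
			continue
		}
		clusters = append(clusters, []rune{r})
	}

	// Emit the clusters in reverse order, each one left intact.
	out := make([]rune, 0, len(s))
	for i := len(clusters) - 1; i >= 0; i-- {
		out = append(out, clusters[i]...)
	}
	return string(out)
}

func main() {
	// "hĕllŏx" written with explicit combining breves (U+0306).
	fmt.Println(ReverseClusters("he\u0306llo\u0306x"))
	fmt.Println(ReverseClusters("你好"))
}
```

With this grouping the breves stay on the e and the o after reversal, rather than migrating to the x and the l.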

Read more at: