Introduction

String encoding is an important matter in Ruby, however most of the blog posts that I came acoross (some of which are linked at the end of this post) tend to look at a ‘user-level’ point of view of the subject and do not explore Ruby internals with respect to string encoding. In this blog post I will try to shed some light on the topic and talk about the important APIs and terminologies that one should be aware of when interfacing with Ruby strings internally.

Code points and character sets

Character sets and code points are abstractions that sit between bytes and encodings. A character set defines a group of characters, their order, and it assigns each an identifier. The identifier is known as a “code point”. It allows for character interaction without having to understand the underlying byte structure of a character.

So basically code point is group of bytes that make a character. It can be thought of as the ‘visual’ size of the string. The size method on a string actually returns the number of code points in the string.

Unicode characters in regular Ruby strings

Using the \u escape sequence, we can specify the value of an 8-bit hexadecimal string in Ruby.

Usual string encodings

The default string encoding in Ruby is UTF-8.

Byte strings

Byte strings can be said to be just a sequence of bytes. They are not necessarily human-readable (in a way that makes sense). The closest brother of byte strings in Ruby can be Python byte strings.

These strings do not implicitly carry an encoding any must be ‘coded’ into a particular encoding before being used. They’re primary use case is for storing data to disk in machine readable form. The size of a byte string is exactly the same as the number of characters in the string.

Since UTF-8 is the default string encoding, you need to force Ruby to convert a string into a byte string (a.k.a US-ASCII) string using the force_encoding method. For example:

2.4.1 :026 > a = "ありが"
# => "ありが" 
2.4.1 :027 > a.bytes
# => [227, 129, 130, 227, 130, 138, 227, 129, 140] 
2.4.1 :028 > a.force_encoding "US-ASCII"
# => "\xE3\x81\x82\xE3\x82\x8A\xE3\x81\x8C" 
2.4.1 :029 > a.bytes
# => [227, 129, 130, 227, 130, 138, 227, 129, 140] 

Unfortunately there is no direct way of specifying byte strings in Ruby like the b'' short-hand syntax in Python.

Useful APIs

The RSTRING_LEN() macro returns the string data in bytes as variable of size_t type.

The encoding of strings is stored in the rb_encoding data type.

The rb_str_new() function that is used for creating strings from char* arrays returns Ruby strings are encoded as US-ASCII.

rb_enc_get_index(VALUE obj) gives an integer value for the particular encoding. The file encindex.h defines several constants that are associate a single int with the encoding of a string. These macros can be combined with rb_enc_get_index to easily compare the encoding of a Ruby string. However, this file is not accesssible for C extension writers since it is not present under the include/ruby directory.

Since I cannot yet find a fast and simple way of checking the encoding via C API calls, I’m resorting to rather ugly and slow Ruby method calls. Here’s the functions:

Other posts and links

  • Andre Arko’s blogpost : https://andre.arko.net/2013/12/01/strings-in-ruby-are-utf-8-now/
  • The string type is broken: https://mortoray.com/2013/11/27/the-string-type-is-broken/
  • String encodings book : https://aaronlasseigne.com/books/mastering-ruby/strings-and-encodings/
  • Ruby encoding wikibook: https://en.wikibooks.org/wiki/Ruby_Programming/Encoding
  • Post with some internals of bytes: https://www.justinweiss.com/articles/3-steps-to-fix-encoding-problems-in-ruby/
  • Helpful blog on some internals: https://blog.codeship.com/how-ruby-string-encoding-benefits-developers/
  • Post from Yehuda Katz: https://yehudakatz.com/2010/05/17/encodings-unabridged/
  • https://blog.daftcode.pl/fixing-unicode-for-ruby-developers-60d7f6377388