String encoding is an important matter in Ruby. However, most of the blog posts that I came across (some of which are linked at the end of this post) take a ‘user-level’ view of the subject and do not explore Ruby internals with respect to string encoding. In this blog post I will try to shed some light on the topic and talk about the important APIs and terminology that one should be aware of when interfacing with Ruby strings internally.
Code points and character sets
Character sets and code points are abstractions that sit between bytes and encodings. A character set defines a group of characters, their order, and it assigns each an identifier. The identifier is known as a “code point”. It allows for character interaction without having to understand the underlying byte structure of a character.
So a code point is a number that identifies a character; it is not the group of bytes that encodes that character. The number of code points can be thought of as the ‘visual’ size of the string. The size method on a string returns the number of code points in the string, not the number of bytes.
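To make the difference between code points and bytes concrete, here is a small sketch. It builds the three-character Japanese string ありが from \u escapes (so it does not depend on the source file encoding) and compares the counts. The variable names are only for illustration.

```ruby
# Three hiragana characters, written as escapes so the source stays ASCII.
s = "\u3042\u308A\u304C"

char_count  = s.size        # number of code points (characters)
byte_count  = s.bytesize    # number of bytes in the UTF-8 representation
code_points = s.codepoints  # the integer identifier of each character

puts char_count             # 3
puts byte_count             # 9
puts code_points.inspect    # [12354, 12426, 12364]
```

Each of these characters takes three bytes in UTF-8, which is why the byte count is three times the character count.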
Unicode characters in regular Ruby strings
Using the \u escape sequence, we can specify a character by its code point, written as four hexadecimal digits (or as one to six hexadecimal digits inside \u{...}).
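A minimal sketch of both forms of the escape follows. The two characters chosen (U+3042 and U+1F600) are arbitrary examples.

```ruby
hiragana_a = "\u3042"      # exactly four hex digits
emoji      = "\u{1F600}"   # braces allow one to six hex digits

puts hiragana_a.ord.to_s(16)   # 3042
puts emoji.ord.to_s(16)        # 1f600
puts emoji.bytesize            # 4 (code points above U+FFFF need four UTF-8 bytes)

# A literal containing a \u escape is always tagged as UTF-8.
puts hiragana_a.encoding == Encoding::UTF_8   # true
```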
Usual string encodings
The default string encoding in Ruby is UTF-8.
Byte strings are just sequences of bytes. They are not necessarily human-readable (in a way that makes sense). The closest relative of Ruby byte strings is the Python bytes type.
These strings do not implicitly carry a character encoding and must be ‘coded’ into a particular encoding before being used as text. Their primary use case is storing data to disk in machine-readable form. The size of a byte string is exactly the number of bytes in the string.
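Since UTF-8 is the default string encoding, you need to force Ruby to treat a string as a byte string by using the force_encoding method. This only relabels the string; the underlying bytes stay the same. The example below relabels the string as US-ASCII, although Ruby's dedicated binary encoding is ASCII-8BIT (also available as BINARY). For example: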
2.4.1 :026 > a = "ありが" # => "ありが"
2.4.1 :027 > a.bytes # => [227, 129, 130, 227, 130, 138, 227, 129, 140]
2.4.1 :028 > a.force_encoding "US-ASCII" # => "\xE3\x81\x82\xE3\x82\x8A\xE3\x81\x8C"
2.4.1 :029 > a.bytes # => [227, 129, 130, 227, 130, 138, 227, 129, 140]
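The session above can be restated as a small script that also contrasts force_encoding with encode. This is a sketch: the variable names are illustrative, and the string is written with \u escapes so that it does not depend on the source file encoding.

```ruby
s = "\u3042\u308A\u304C"
bytes_before = s.bytes

# force_encoding only changes the encoding label; dup keeps the original intact.
relabelled = s.dup.force_encoding("US-ASCII")

puts relabelled.bytes == bytes_before   # true
puts relabelled.valid_encoding?         # false (bytes above 0x7F are not ASCII)
puts relabelled.size                    # 9 (every byte now counts as one character)

# encode, by contrast, tries to transcode and fails for these characters.
transcoding_error = nil
begin
  s.encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  transcoding_error = e.class
end
puts transcoding_error                  # Encoding::UndefinedConversionError
```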
Unfortunately there is no direct literal syntax for byte strings in Ruby like the b'...' short-hand syntax in Python.
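Although there is no literal syntax, a byte string can still be obtained through method calls. The sketch below shows two options that exist in core Ruby, String#b and Array#pack; the variable names are illustrative.

```ruby
# String#b returns a copy of the string tagged with the binary encoding.
raw = "\xE3\x81\x82".b
puts raw.encoding == Encoding::BINARY   # true
puts raw.size == raw.bytesize           # true (one "character" per byte)

# Array#pack with "C*" builds a binary string from integer byte values.
packed = [0xE3, 0x81, 0x82].pack("C*")
puts packed.encoding == Encoding::BINARY   # true
puts packed == raw                         # true

# Relabelling the same bytes as UTF-8 turns them back into one character.
text = raw.dup.force_encoding("UTF-8")
puts text.size                             # 1
```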
The RSTRING_LEN() macro returns the length of the string data in bytes.
The encoding of strings is stored in the
rb_encoding data type.
The rb_str_new() function, which is used for creating strings from char* arrays, returns a string tagged with the binary ASCII-8BIT encoding.
Ruby strings are encoded as
rb_enc_get_index(VALUE obj) gives an integer value for the particular encoding. The file encindex.h defines several constants that associate a single int with the encoding of a string. These macros can be combined with rb_enc_get_index to easily compare the encoding of a Ruby string. However, this file is not accessible to C extension writers since it is not shipped with Ruby's public headers.
Since I have not yet found a fast and simple way of checking the encoding via C API calls, I’m resorting to rather ugly and slow Ruby method calls. Here are the functions:
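At the Ruby level, the comparison that such method calls perform can be sketched as follows. This is only an illustration of the idea, not the functions referred to above; the helper names utf8_string? and binary_string? are hypothetical.

```ruby
# Hypothetical helpers: compare a string's Encoding object against a constant.
def utf8_string?(str)
  str.encoding == Encoding::UTF_8
end

def binary_string?(str)
  str.encoding == Encoding::BINARY   # BINARY is an alias of ASCII-8BIT
end

text = "\u3042"
raw  = text.b

puts utf8_string?(text)    # true
puts binary_string?(text)  # false
puts binary_string?(raw)   # true
```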
Other posts and links
- Andre Arko’s blog post: https://andre.arko.net/2013/12/01/strings-in-ruby-are-utf-8-now/
- The string type is broken: https://mortoray.com/2013/11/27/the-string-type-is-broken/
- String encodings book: https://aaronlasseigne.com/books/mastering-ruby/strings-and-encodings/
- Ruby encoding wikibook: https://en.wikibooks.org/wiki/Ruby_Programming/Encoding
- Post with some internals of bytes: https://www.justinweiss.com/articles/3-steps-to-fix-encoding-problems-in-ruby/
- Helpful blog on some internals: https://blog.codeship.com/how-ruby-string-encoding-benefits-developers/
- Post from Yehuda Katz: https://yehudakatz.com/2010/05/17/encodings-unabridged/