mtext - Unicode string implementation - Dprogramming.com

[archived content]

mtext 2.0

The mtext module contains the mstring Unicode string structure, which stores strings in a format that makes it easy to index, slice, and access UTF-32 (dchar) code points, while also trying to be space efficient.

The documentation is available online.

Download Software

Download mtext 2.0.

import mtext;

void main()
{
   mstring hi = "hello"c;   // the c postfix marks a UTF-8 (char[]) literal
   hi ~= ", world"c;        // append; hi now holds "hello, world"
}

FAQ

Q. What's new in mtext 2.0?

A. The mtext module, previously known as dstring, brings the following changes in version 2.0:

support for both Phobos and Tango;
allocators, which allow more control over memory usage, especially when the garbage collector is not desired;
updates that make the code work better with modern D compilers, such as the use of opAssign;
and name changes that work better with D's standard string and dstring types.

Q. Why another string implementation?

A. This string implementation uses a struct that stores Unicode strings in a format that makes it easy to index, slice, and access UTF-32 (dchar) code points, while also trying to be space efficient. Slicing and indexing always access an entire UTF-32 code point directly, even if UTF-32 is not the internal representation of the string, and they do so without scanning from the beginning of the string or using lookup tables.
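
To make the contrast concrete, here is a minimal sketch that uses only Phobos (std.utf), not mtext. It shows that finding the n-th code point of a plain UTF-8 string means scanning from the start, while an array with one element per code point can simply be indexed. The helper names nthCodePointUtf8 and nthCodePointUtf32 are made up for this illustration.

```d
import std.utf : decode, toUTFindex;

/// n-th code point of a UTF-8 string: its byte offset must be found by scanning.
dchar nthCodePointUtf8(string s, size_t n)
{
    size_t i = toUTFindex(s, n); // walks the string from the beginning
    return decode(s, i);         // decodes the sequence that starts at byte i
}

/// n-th code point of a UTF-32 string: plain array indexing.
dchar nthCodePointUtf32(dstring s, size_t n)
{
    return s[n];
}

void main()
{
    // 'a', U+00E9 (2 bytes in UTF-8), U+65E5 (3 bytes in UTF-8), 'z'
    assert(nthCodePointUtf8("a\u00E9\u65E5z", 2) == '\u65E5');
    assert(nthCodePointUtf32("a\u00E9\u65E5z"d, 2) == '\u65E5');
}
```

mstring aims to give the second kind of access without always paying four bytes per character.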

Q. How does it work?

A. The internal char type (char, wchar or dchar) used to store your string is generally the smallest one that can hold every code point in the string in a single element.
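
The selection rule can be sketched as follows. This is not code from mtext.d: the function name neededWidth is hypothetical, and the thresholds are assumptions (ASCII below 0x80 fits a char, anything below 0x10000 fits a wchar, everything else needs a dchar).

```d
/// Width in bytes of the narrowest char type that can hold every code point
/// of s in one element: 1 (char), 2 (wchar) or 4 (dchar).
/// The thresholds are assumptions, not taken from mtext.d.
size_t neededWidth(const(dchar)[] s)
{
    size_t width = 1;
    foreach (dchar c; s)
    {
        if (c >= 0x10000)
            return 4;      // outside the Basic Multilingual Plane: only dchar fits
        if (c >= 0x80)
            width = 2;     // non-ASCII but inside the BMP: wchar fits
    }
    return width;
}

void main()
{
    assert(neededWidth("hello"d) == 1);
    assert(neededWidth("h\u00E9llo"d) == 2);
    assert(neededWidth("hi \U0001F600"d) == 4);
}
```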

Q. How does it know which char type to use?

A. Upon appending to a string, it determines whether the new characters will fit in the char type chosen for the existing string; if not, the string is reallocated using the char type needed for the new characters. This reallocation is generally infrequent. Applications that primarily use ASCII (regular English with no accents) will rarely trigger it, yet they are still able to support all characters.
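
A rough sketch of this append-and-widen behavior is shown below. It is an illustration only: WideningText, widthOf and widenTo are hypothetical names, the width thresholds are assumed, and the sketch keeps three separate arrays for clarity, whereas mstring packs everything into a single length and pointer.

```d
/// Bytes per element needed to hold c in one element (assumed thresholds).
int widthOf(dchar c)
{
    if (c < 0x80) return 1;
    if (c < 0x10000) return 2;
    return 4;
}

struct WideningText
{
    int width = 1;   // bytes per element of the active array: 1, 2 or 4
    char[]  c;       // used while width == 1
    wchar[] w;       // used while width == 2
    dchar[] d;       // used while width == 4

    /// Number of code points stored.
    size_t length() const
    {
        if (width == 1) return c.length;
        if (width == 2) return w.length;
        return d.length;
    }

    /// Code point at index i, with no scanning or decoding.
    dchar opIndex(size_t i) const
    {
        if (width == 1) return c[i];
        if (width == 2) return w[i];
        return d[i];
    }

    /// Appends code points, widening the storage first if one does not fit.
    void append(const(dchar)[] more)
    {
        foreach (dchar ch; more)
        {
            if (widthOf(ch) > width)
                widenTo(widthOf(ch));
            if (width == 1) c ~= cast(char) ch;
            else if (width == 2) w ~= cast(wchar) ch;
            else d ~= ch;
        }
    }

    /// Reallocates into a wider array, copying element by element.
    private void widenTo(int newWidth)
    {
        const size_t n = length();
        dchar[] all = new dchar[n];
        foreach (i; 0 .. n) all[i] = this[i];
        c = null; w = null; d = null;
        if (newWidth == 2)
        {
            w = new wchar[n];
            foreach (i, ch; all) w[i] = cast(wchar) ch;
        }
        else
        {
            d = all;
        }
        width = newWidth;
    }
}

void main()
{
    WideningText t;
    t.append("hello"d);
    assert(t.width == 1);                   // pure ASCII stays in char
    t.append("\u00E9"d);
    assert(t.width == 2 && t.length == 6);  // widened once to wchar
    t.append("\U0001F600"d);
    assert(t.width == 4 && t.length == 7);  // widened again to dchar
    assert(t[0] == 'h' && t[5] == '\u00E9' && t[6] == '\U0001F600');
}
```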

Q. Why change the char type like this?

A. There are three main reasons for this.

It makes slicing and indexing very easy and efficient. Indexing or slicing can no longer produce an invalid UTF sequence.
Some languages fit very well in UTF-8, while others take even more space in UTF-8 than they would in UTF-16; with this string implementation, this space problem is minimized.
This string implementation makes it easy to fully support every character supported by the Unicode standard. In many cases, people who are not aware of how UTF works will not support it properly and will break support for languages they did not anticipate their program would be used with.
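
The first two points can be illustrated with plain D strings and Phobos alone (no mtext involved). In this sketch, isValidUtf8 is a hypothetical helper written for the example; std.utf.validate and UTFException are real Phobos names.

```d
import std.utf : UTFException, validate;

/// True if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    string accented = "h\u00E9llo";          // U+00E9 takes two bytes in UTF-8
    assert(accented.length == 6);            // length counts bytes, not characters
    assert(!isValidUtf8(accented[0 .. 2]));  // the slice ends in the middle of U+00E9

    dstring wide = "h\u00E9llo"d;            // one element per code point
    assert(wide[0 .. 2] == "h\u00E9"d);      // slicing cannot split a code point

    // Space: three CJK characters need 9 bytes in UTF-8 but only 6 in UTF-16.
    assert("\u65E5\u672C\u8A9E".length * char.sizeof == 9);
    assert("\u65E5\u672C\u8A9E"w.length * wchar.sizeof == 6);
}
```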

Q. How much space does the struct take up and what is stored there?

A. The string struct is guaranteed always to be the same size as char[]. The code includes static assert(string.sizeof == (char[]).sizeof); to enforce this. This way it is just as fast to pass around as char[], and because it is a value type it does not need to be allocated on the heap. It holds the length and pointer, much like char[], except that 2 bits of the length are reserved for keeping track of which char type is being used. The length property does not include these reserved bits, but reserving them does slightly reduce the maximum number of characters allowed in one string; string.MAX_LENGTH provides that maximum. This limit generally should not be a concern: it is still 1073741824 characters (2^30), and in practice it is only one bit less than you would normally use, since you typically do not want the highest bit set or signed/unsigned issues can arise.
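
One way such a layout could look is sketched below. PackedSlice and all of its members are hypothetical, and placing the tag in the two highest bits of the length word is an assumption; mtext.d may arrange the bits differently.

```d
struct PackedSlice
{
    enum size_t tagBits    = 2;
    enum size_t tagShift   = size_t.sizeof * 8 - tagBits;
    enum size_t lengthMask = (cast(size_t) 1 << tagShift) - 1;
    enum size_t maxLength  = lengthMask;   // largest length that leaves the tag bits free

    private size_t bits;   // length in the low bits, char-type tag in the top two
    private void*  ptr;    // start of the character data

    /// Number of elements, with the tag bits masked off.
    size_t length() const { return bits & lengthMask; }

    /// Which char type is stored, e.g. 0 = char, 1 = wchar, 2 = dchar.
    uint tag() const { return cast(uint)(bits >> tagShift); }

    /// Stores a length and a tag in the same word.
    void set(size_t len, uint newTag)
    {
        assert(len <= maxLength && newTag < 4);
        bits = len | (cast(size_t) newTag << tagShift);
    }
}

// Same size as a built-in slice: one length word plus one pointer.
static assert(PackedSlice.sizeof == (char[]).sizeof);

void main()
{
    PackedSlice p;
    p.set(5, 2);
    assert(p.length == 5 && p.tag == 2);
}
```

Because the tag lives inside the length word, the struct needs no extra field, which is what keeps it the same size as char[].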

Q. Why does the compiler complain when I use string literals?

A. This is because if a function is overloaded to take char[], wchar[] and dchar[] and a string literal is used, the compiler is not sure which function you want called. The solution is to use the c postfix, such as "foo"c.
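
The postfixes can be seen at work in a small standalone sketch. The overloaded function which is hypothetical and stands in for any function that accepts all three array types. How a literal without a postfix is resolved depends on the compiler version, so the example only uses explicit postfixes.

```d
// One overload per character type, as described above.
string which(const(char)[] s)  { return "char"; }
string which(const(wchar)[] s) { return "wchar"; }
string which(const(dchar)[] s) { return "dchar"; }

void main()
{
    assert(which("foo"c) == "char");    // c postfix: UTF-8 literal
    assert(which("foo"w) == "wchar");   // w postfix: UTF-16 literal
    assert(which("foo"d) == "dchar");   // d postfix: UTF-32 literal
}
```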

Q. What about diacritics (accents) and half marks?

A. These can still be an issue: a single character displayed on the screen may take up more than one code point, even when using UTF-32 (dchars), if it has a diacritical mark or half mark. The mtext module includes functions to detect these marks: isDiacritic(), isHalfMark(), and isCombiningCharacter(), which checks for them all.
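
The problem itself can be shown without mtext. The sketch below uses Phobos (std.uni) as a stand-in, since the exact signatures of the mtext functions are not given here; displayedCharacters is a hypothetical helper name.

```d
import std.range : walkLength;
import std.uni : byGrapheme, combiningClass;

/// Number of user-perceived characters (grapheme clusters) in s.
size_t displayedCharacters(dstring s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    dstring s = "e\u0301"d;               // 'e' followed by U+0301 COMBINING ACUTE ACCENT
    assert(s.length == 2);                // two code points, even in UTF-32
    assert(displayedCharacters(s) == 1);  // but a single character on screen
    assert(combiningClass(s[1]) != 0);    // the accent is a combining mark
    assert(combiningClass(s[0]) == 0);    // the base letter is not
}
```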

Q. How well does it perform?

A. It is not certain how well it performs. This code was just written to see how well it would work and perform. Benchmarks would be appreciated.

There is a chance that it performs slightly slower than how strings are currently handled in D, but ease-of-use is a factor.

There is also a chance that it performs faster overall than how strings are currently handled in D, because it is no longer necessary to convert to and from different UTF types, which is generally how char-by-char inspection, slicing and string building are currently done in D.

Q. Can I use this string in my closed-source or commercial application? What is the license?

A. Yes, you may freely use mtext.d. The license is zlib/libpng. See mtext.d for the copyright and license.