String Container Class and Reference Counted Objects


Introduction

This web page publishes the C++ source code for a reference counted String container. This String container is derived from a reference counted array template. This array template implements copy-on-write semantics. The array template is supported by a few classes and templates which are useful for creating reference counted objects (see, SubString.C for an example of a reference counted class which is not an array).

The closely related SubString class is also published here. The SubString class supports String section operations. Like the String class, the SubString class is a reference counted object.

This source base also includes a class called DoubleVec which demonstrates how a specific array class can be derived from the array template.

Regression tests are included for the String, SubString and DoubleVec classes.

Strings and the String Container

If computer applications are divided into broad categories, one of the largest categories involves text processing. This includes document creation and processing, document retrieval and software translation (e.g., compiling C++ into machine code). Character streams are frequently divided into strings. This makes string containers some of the most frequently used classes in object oriented languages like C++ and Java.

The String container published here allows character strings to be treated as "first class" objects that support many of the basic operators (for example, +, +=).

A few examples:


  // String construction
  String a("the quick brown fox");
  String b = " jumps over the lazy";
  
  // String concatenation
  String c = a + b; // c = "the quick brown fox jumps over the lazy"

  // c = "the quick brown fox jumps over the lazy golden retriever"
  c += " golden retriever";

  c(16, 3) = "coy";     // replace "fox" with "coy"
  c(19, 0) = "ote";     // insert "ote" at character index 19

This string container uses reference counted strings and "copy-on-write" for efficiency. In the code below, for example, the assignment of "a" to "b" increments the reference count of "a" and points "b" at the shared data. The assignment of 'z' to b[3] results in a unique copy being made for "b", since b has been modified.


   String a, b;

   a = "1234";
   b = a;
   b[3] = 'z';

Use and Copyright


Click on the icon for a definition of "open source"

This source code may be used without restriction, in any way you choose, without a fee, as long as you maintain the following quote in the source code:

This software was written and is copyrighted by Ian Kaplan, Bear Products International, www.bearcave.com, 2003

The definition of open source used here means that you can use this software in proprietary software. Using my String container does not mean that your software must be open source.

The source code is provided "as is" without any warranty as to its usefulness or functionality. The source code comes with a fairly extensive regression test suite (e.g., StringTest.C, SubStrTest.C and DblVecTest.C). I recommend running compiling and running these tests. I also recommend running these tests with a memory verification tool like like Purify or Valgrind to help assure that there are no memory errors.

The presence of these regression tests in no way changes the "use at your own risk" caveat above.

Why No Restrictions?

After witnessing the Red Hat initial public stock offering (IPO), which made Red Hat CEO Bob Young and the other Red Hat executives vastly wealthy I rethought some of my ideas about publishing open source software. Those who got rich from the Red Hat IPO did nothing to create Linux and the GNU software that made Linux possible. They simply marketed GNU/Linux, taking advantage of the work of others, while giving them crumbs in return.

After the Red Hat IPO I started publishing my software with more restrictive use conditions. However, I have not applied these conidtions to the String container and related software published here. There are several reasons for this:

  1. There are lots of string container classes around, including the string class in the Standard Template Library. So this software is not that unique.

  2. Open source software works when it is software exchanged between engineers, not between engineers and end users. By modifying and improving open source software engineers "give back" to the open source community. Marketeers are not going to become rich off the efforts of software that is exchanged between engineers.

    The String container class is only useful to software engineers (it is not end user software, like GNU/Linux). I hope that any use the String container finds will result in improvements and bug fixes. Since I use this the String container in my own software, this will pay back my effort in developing the String container.

  3. The String container makes some software easier to create, but it will always be a small part of the software that uses it. So the software author must expend their effort, rather than simply reselling my work.

Software Download

Version 3.0, September 2003

This is the third version of the String container class published here. The String and SubString classes were largely rewritten for this version. They now use a common set of classes and templates to support reference counted objects. The String class is derived from a char instance of a reference counted array (RCArray).

The tar program is part of UNIX and Linux. It should also be available on Apple's UNIX derived operating systems (e.g., OS X). If you do not have a version of the "tar" program for Windows NT/200/XP/WhatEverMicrosoftCallsIt you can download a copy here.

To unpack the "tar" archive, copy the tar file to the directory where you the string sub-directory to be created. The file name should end in ".tar" (e.g., string.tar). Change the name if necessary (e.g., from string_tar to string.tar). Use the command

  tar xvf string.tar

to unpack the tar file. You should have the following files:

string/DblVecTest.C
string/DoubleVec.h
string/doxygenTemplate
string/GrowableArray.h
string/MainPage
string/RCArray.h
string/RCBase.h
string/RCObject.h
string/RCPtr.h
string/String.C
string/String.h
string/StringTest.C
string/SubString.C
string/SubString.h
string/SubStrTest.C

For notes on compiling this source using the Microsoft C++ compiler, see the comment in String.C

Doxygen Generated Documentation for the String container

The String container class is extensively commented using Doxygen formatted comments. The Doxygen generated documentation for the String container can be found here. I have also documented the class "by hand" below.

Why Did I Develop the String Container?

I have been using the Rogue Wave Tools.h++ classes and templates for a number of years. I wanted to have a class like Rogue Wave's RWCString for the software that I develop for Bear Products International. But I did not want to pay Rogue Wave's license fees (which have become expensive). Nor do I want to force Bear Products International licensees to pay Rogue Wave fees.

Many implementations of the C++ Standard Template Library (STL) do not provide reference counted string objects like RWCString, so STL was not an alternative (however, I'm told the the Dinkumware STL library does provide a reference counted string class).

So I decided to write my one String container, influenced in part by the publicly available Rogue Wave documentation. I expected that it would take me no more than a few evenings or a couple of weekends. As happens so often with software engineers, I underestimated the effort. In fact, to say that I underestimated the effort is a gross underestimation. The String container is now on its third version, over a span of over two years. My original estimate of a few days work has been vastly exceeded.

This software has been time consuming to develop because the details of the String container have been more complicated that I expected. Library code like the String container is also more time consuming to develop because it serves as the foundation for other software. Library code must be as reliable as possible and I have spent a fair amount of time developing regression tests.

Changes for Version 3.0

Version 3.0 is close to a complete rewrite of the String class and of the supporting SubString class. The String class is now derived from an instance of a generic reference counted array (RCArray). This reference counted array in turn uses a common set of classes and templates that support reference counted classes.

A reference counted object should only make copies when a shared object is modified. Much to my surprise, version 2.0 of the String container made copies on read. At the time I did not fully understand how C++ handled the operator [] function. This issue is discussed in a note I posted to comp.lang.c++.

To avoid copy-on-read the reference counted array template (RCArray) returns a proxy class as a result of the operator [] function. This proxy is based on Stroustrup's Cref class (see 11.12 of The C++ Programming Language, Third Edition).

Testing and Verification

One of the best choices I made in developing this software was to include a fairly extensive set of regression tests along with this code. The C++ code that implements the reference counted String container unavoidably uses some C++ features that, at least historically, have been inconsistently implemented by various C++ compilers. The regression tests allow some degree of verification relative to the local compiler. While I have done my best to assure the correctness of the source code, you should run the regression test and verify the software using a memory usage tool like Valgrind or Purify.

The String container code (and related code, like DoubleVec) has been compiled with the following compilers on the following platforms:

I have also attempted to check for memory leaks using Valgrind on Linux (I don't currently have access to Purify on Solaris as I have had in the past).

In terms of lines of code, the regression tests are larger than the source that implements the String, SubString, RCArray and other supporting classes and templates. In developing these regression tests I have tried to be fairly methodical. Developing these tests forced me to think about how difficult it is to assure absolute correctness, even in a relatively small software base. I have no doublt that these regression tests are incomplete.

Is anyone using this software?

After publishing Version 3.0 of the String container in September 2003, I have not heard that anyone is using it. Perhaps the reason is that this is yet another String container. If you are using the String container, I'd appreciate a brief e-mail (you don't have to say any more than "I'm using it"). Just click on the icon below:

Documentation for the String Container

Extensive Doxygen formatted documentation is provided in the source code. The documentation below is intended to highlight a few features of the String container.

Null Termination

The character data in the String container is stored along with a null character terminator ('\0'). In most cases String operations will preserve the null terminator. However, it can be overwritten using the [] operator (something that is not recommended).

Copy on Write

When one String object is assigned to another, only a reference is copied and the reference count for the shared data is incremented. A unique copy of a shared string is only made when the string is changed. Copy-on-write takes place as a result of the following:

String and SubString Operators

operator =
operator +=
operator +
operator []
operator () SubString operator
Using a String container with stream output (<<)
operator const char *
Relational Operators
String &resize( size_t new_size );
size_t getRefCnt();
size_t strlen();

Operator =

The String assignment operator can be used to assign a C-string, a String object, or a SubString to a String object. For example:

  String a;
  String b;
  String c;

  a = "this is string data";  // assign a C-string to a String
  b = a;         // assign String a to String b, incrementing the reference count.
  c = b(8, 11);  // assign a SubString, "string data", to c

The SubString assignment operator can be used to assign a C-string, String object or SubString object to a SubString. For example:

  String a("abcdefgh");
  String b("wxyz");

  a(4, 4) = "1234";  // change "abcdefgh" to "abcd1234"
  a(0,4) = b;        // change "abcd1234" to "wxyz1234"
  a(0,4) = a(4,4);   // change "wxyz1234" to "12341234"

Operator +=

The += concatenates a C-string, String object or character to the String object on the right hand side. For example:


  String weather("'Twas brillig, ");
  String nextLine("Did gyre and gimble in the wabe");

  weather += "and the slithy toves"; // concatenate a C-string
  weather += '\n';                   // concatenate a character
  weather += nextLine;               // concatenate a String object

Operator +

The + operator concatenates a String object, a C-string, a String or a character to create a new String. The String operand in a + expression may be either the right or left operand. For example:

  String a("All mimsy were the borogoves");
  String b;
  String c("\nBeware the Jabberwock, my son");
  String d;
  String poem;

  b = a + "\nAnd the mome raths outgrabe";  // concatenate String and a C-string
  d = b + c;      // concatenate a String and a String
  poem = d + '!'; // concatenate a String and a character

  // note that the String may also be the right operand
  poem = "Lewis Carroll" + d;  // concatenate a C-string and a String
  poem = '\n' + a;             // concatenate a character and a String

Operator []

The index operator allows a single character in a String to be indexed. The first index is 0. The index operation may be used on either the right or left hand size of an assignment. For example, after the for loop below is executed, both String objects will contain a unique copy of "xyz".

   String a("123"), b("xyz");

   for (size_t i = 0; i < a.strlen(); i++)
      a[i] = b[i];

Note that there is always a null character beyond the end of the character data. This null can be overwritten, which may cause problems if the String is assigned to a const char * pointer. For example:


    a[ a.strlen() ] = 'X';  // overwrite the null terminator

In general overwriting the null terminator is not a good idea.

If an index beyond the end of the string (e.g., characters plus the null terminator) will throw an exception.

Operator ()(size_t index, size_t len) SubString operator

The substring operator is given two arguments: the starting index, and the length of the SubString. It may be used on either the right or left hand side of the assignment. For example:

  String a("Gore is the school Principle");
  String b("Bush is a vacuous frat boy");
  String c("Gore wins!");

  c(0, 4) = b(0, 4);

  b(10, 7) = "fatuous";

  if (b(10,7) == "idiotic")
    cout << "must be a Bush";

OK, you've probably figured out that I'm a bit unhappy about the way Bush "won" the election through the graces of five conservative Supreme Court Justices, while still losing to Gore by over half a million votes.

Note that SubStrings supports general insertion. The inserted Strings do not have to be the same size as the SubString. For example:

   String a = "inserthere";
   a(6, 0) = " key ";  // a now contains "insert key here"

   String b = "123 456 678";
   b(4, 3) = "abcdefgh";  // b now contains "123 abcdefgh 678"

   String c = "this is a really long winded explaination";
   c(10, 18) = "terse";  // c now contains "this is a terse explaination"

operator const char *

The const char * operator allows a String to be assigned to a const char * pointer. For example:

  String a("This is it!");
  const char *pCStr;
  
  pCStr = a;
  printf("%s\n", pCStr ); // this will print "This is it!"

The const char * operator will allow String expressions to be assigned to a const char * pointer. However, this is not a good idea since the pointer will point to memory that has been deallocated (see C++: the good and the bad, above). For example, the code below will result in a reference to deallocated memory:

  String a("better red");
  String b(" than ned");
  const char *pCStr;

  pCStr = a + b;  // pCStr points to deallocated memory

Code like this could produce strange errors, especially in a multi-threaded environment.

Using a String container with stream output (<<)

To use the an instance of the String container with in C++ stream output you will need to cast the String to const char *. For example:

  String a("PDQ ");
  String b("Bach");

  cout << (const char *)a << endl;
  cout << (const char *)(a + b) << endl;

Relational Operators

The String container supports all of the relational operators ( ==, !=, <=, >=, <, >). Either the right or left hand operands may be a String object or a C-string. For example:

   String a, b;

   [...]
   if (a <= b)
      if (b > "1234")
         if ("5678" != a)
   [...]

String &resize( size_t new_size );

Resize will set the strlen() of the string to new_size. If new_size is larger than the current String, it will be padded with spaces. If new_size is smaller, it will truncate the String data. The resize() function preserves the null terminator, unless the string is resized to zero, in which case the String object will become empty.

size_t getRefCnt();

The getRefCnt() function returns the reference count for an instance of the String container. If the String is empty, the reference count will be one (however, the String will not contain any characters). The reference count is incremented each time the String object is assigned to another String object. The reference count is decremented each time a String reference copy is modified. For example:


  String original("The one, the only");
  String a, b, c;

  // Before assignment, getRefCnt() == 1 
  a = original;    // reference count is now 2
  b = a;           // reference count is now 3
  c = b;           // reference count is now 4
  a += " (NOT!)";  // a is now unique, reference count is now 3

In general the user of the String container will not find much use for the getRefCnt() function. It is useful mainly for verifying the correct function of the String container (e.g., it is used extensively in the StringTest.C, SubStrTest.C and DblVecTest.C regression tests).

size_t strlen();

The strlen() function returns the number of characters contained in the String object. This is one less than the internal length, which includes the null terminator.

C++: the good and the bad

Writing string classes has great educational value...
Bjarne Stroustrup (in The C++ Programming Language, third Edition, Chapter 20).

Writing the String container was indeed an educational experience. The String container is a good example of why I both love and hate C++. The String container allows character strings to be treated like "first class" objects in C++. Several of the basic operators are defined for the String type. The support provided by C++ for powerful (and complex) features like templates and operator overloading allowed the String container to be developed. In contrast, the String container that is part of the Java programming language requires support from the Java compiler (e.g., the Java String container is part of the language, not simply a library object).

The price exacted for the power and flexibility of the C++ language is complexity. I have tried to comment the String container and related code heavily, but it remains complex and difficult to understand. Reference counted objects like String, SubString or classes instantiated from the RCArray template share a common data object. The management of the shared data object is provided by a "smart pointer" (based on a smart pointer described in Scott Meyers book More Effective C++). To understand this code requires an indepth understanding of how C++ uses temporaries, constructors and destructors. Rather than concentrating on the algorithm being implemented, the developer must think hard about the C++ language and its implementation. In this case, the language simply gets in the way. The need for such a complicated container would be eliminated by a language/compiler that supported garbage collected memory (as Java does).

Some examples are probably worth while, both to illuminate the String container and to illustrate some of the complexities in its implementation.

Reference counted strings allow data to be shared between two String containers. When one String object is assigned to another, the reference count will be incremented and both objects will share the same data. The container supports "copy-on-write", so a change to one String object will not effect other objects that reference the shared data. Each time a class destructor is called the reference count will be decremented. This allows String objects to recover their own memory and reduces the effort that the programmer must put into memory management.

In general the String container just works and the programmer does not have to think about what the compiler does. For example, in the code below, the compiler generates a hidden temporary for the result of the String concatenation.

  String a("1234"), b("5678");
  String c = a + b;

The hidden temporary is shown below:

  String a("1234"), b("5678");
  tmp = a + b;
  String c = tmp;

The lifetime of the temporary spans the assignment statement. That is, the temporary is constructed before the assignment statement and its destructor is called after the assignment statement, but before the following statement.

The lifetime of the compiler generated temporaries cannot be entirely ignored. The String container implements a const char * operator, which allows String objects to be assigned to a const char * variable. The const char * variable points into the internal memory of the String. In some cases this memory may have been deallocated. For example, in the code below the overloaded + operator is used to concatenate two strings ("1234" and "5678"). The resulting String temporary (containing "12345678") is assigned to a const char * pointer.

  String a("1234"), b("5678");
  const char *pChar;

  pChar = a + b;

The problem here is that the destructor for the String temporary will be called after the assignment. When the reference count is decremented to zero the object will be deallocated. To paraphrase Stroustrup:

A temporary object of class String is created to hold a + b. Next, a pointer to a C-style string is extracted from the object (via the overloaded operator const char *()). Then - at the end of the expression - the temporary object is deleted. Now, where was the C-style string allocated? Probably as part of the temporary object holding a + b, and that storage is not guaranteed to exist after the temporary is destroyed. Consequently, pChar points to deallocated storage. If pChar is referenced it might work as expected, but that would be sheer luck.

Dreaming of Something Better

The complexity of overloaded operators, their associated compiler created temporaries, constructors and destructors are only a fraction of the C++ language. The complexity of programming in C++ is compounded by the need to explicitly manage memory. Assuring that there are no memory errors is difficult simply by inspecting the code. In most cases, a tool like Valgrind or Purify is needed to gain some confidence that there are no memory errors.

The C++ programming language holds out the promise of allowing object oriented software to be created. But along with this promise is the considerable complexity of the language. One of the reasons that Java has become popular is that it preserves many of the core features of C++ in a smaller, easier to understand language. Java also supports garbage collection avoiding the burden of explicitly managed memory.

Sadly Java has its own problems. The language has not been released to an impartial standards organization and Sun has insisted that Java must be an interpreted language. As a result, Java is slow. Perhaps Microsoft's C# will provide a better answer. Or, perhaps, as a friend of mine says: life is hard (but compared to what) - Shimon Peres. We'll just have to live with problems that could have been avoided.

Web References

Many of these references discuss, in one way or another, the complexity of C++.

Ian Kaplan, January, 2001
Revised: February 2004


back to Software Source