mirror of
https://github.com/autc04/Retro68.git
synced 2024-12-12 11:29:30 +00:00
484 lines
20 KiB
XML
484 lines
20 KiB
XML
<chapter xmlns="http://docbook.org/ns/docbook" version="5.0"
|
|
xml:id="std.strings" xreflabel="Strings">
|
|
<?dbhtml filename="strings.html"?>
|
|
|
|
<info><title>
|
|
Strings
|
|
<indexterm><primary>Strings</primary></indexterm>
|
|
</title>
|
|
<keywordset>
|
|
<keyword>ISO C++</keyword>
|
|
<keyword>library</keyword>
|
|
</keywordset>
|
|
</info>
|
|
|
|
<!-- Sect1 01 : Character Traits -->
|
|
|
|
<!-- Sect1 02 : String Classes -->
|
|
<section xml:id="std.strings.string" xreflabel="string"><info><title>String Classes</title></info>
|
|
|
|
|
|
<section xml:id="strings.string.simple" xreflabel="Simple Transformations"><info><title>Simple Transformations</title></info>
|
|
|
|
<para>
|
|
Here are Standard, simple, and portable ways to perform common
|
|
transformations on a <code>string</code> instance, such as
|
|
"convert to all upper case." The word transformations
|
|
is especially apt, because the standard template function
|
|
<code>transform<></code> is used.
|
|
</para>
|
|
<para>
|
|
This code will go through some iterations. Here's a simple
|
|
version:
|
|
</para>
|
|
<programlisting>
|
|
#include <string>
|
|
#include <algorithm>
|
|
#include <cctype> // old <ctype.h>
|
|
|
|
struct ToLower
|
|
{
|
|
char operator() (char c) const { return std::tolower(c); }
|
|
};
|
|
|
|
struct ToUpper
|
|
{
|
|
char operator() (char c) const { return std::toupper(c); }
|
|
};
|
|
|
|
int main()
|
|
{
|
|
std::string s ("Some Kind Of Initial Input Goes Here");
|
|
|
|
// Change everything into upper case
|
|
std::transform (s.begin(), s.end(), s.begin(), ToUpper());
|
|
|
|
// Change everything into lower case
|
|
std::transform (s.begin(), s.end(), s.begin(), ToLower());
|
|
|
|
// Change everything back into upper case, but store the
|
|
// result in a different string
|
|
std::string capital_s;
|
|
capital_s.resize(s.size());
|
|
std::transform (s.begin(), s.end(), capital_s.begin(), ToUpper());
|
|
}
|
|
</programlisting>
|
|
<para>
|
|
<emphasis>Note</emphasis> that these calls all
|
|
involve the global C locale through the use of the C functions
|
|
<code>toupper/tolower</code>. This is absolutely guaranteed to work --
|
|
but <emphasis>only</emphasis> if the string contains <emphasis>only</emphasis> characters
|
|
from the basic source character set, and there are <emphasis>only</emphasis>
|
|
96 of those. Which means that not even all English text can be
|
|
represented (certain British spellings, proper names, and so forth).
|
|
So, if all your input forevermore consists of only those 96
|
|
characters (hahahahahaha), then you're done.
|
|
</para>
|
|
<para><emphasis>Note</emphasis> that the
|
|
<code>ToUpper</code> and <code>ToLower</code> function objects
|
|
are needed because <code>toupper</code> and <code>tolower</code>
|
|
are overloaded names (declared in <code><cctype></code> and
|
|
<code><locale></code>) so the template-arguments for
|
|
<code>transform<></code> cannot be deduced, as explained in
|
|
<link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://gcc.gnu.org/ml/libstdc++/2002-11/msg00180.html">this
|
|
message</link>.
|
|
<!-- section 14.8.2.4 clause 16 in ISO 14882:1998 -->
|
|
At minimum, you can write short wrappers like
|
|
</para>
|
|
<programlisting>
|
|
char toLower (char c)
|
|
{
|
|
// std::tolower(c) is undefined if c < 0 so cast to unsigned char.
|
|
return std::tolower((unsigned char)c);
|
|
} </programlisting>
|
|
<para>(Thanks to James Kanze for assistance and suggestions on all of this.)
|
|
</para>
|
|
<para>Another common operation is trimming off excess whitespace. Much
|
|
like transformations, this task is trivial with the use of string's
|
|
<code>find</code> family. These examples are broken into multiple
|
|
statements for readability:
|
|
</para>
|
|
<programlisting>
|
|
std::string str (" \t blah blah blah \n ");
|
|
|
|
// trim leading whitespace
|
|
string::size_type notwhite = str.find_first_not_of(" \t\n");
|
|
str.erase(0,notwhite);
|
|
|
|
// trim trailing whitespace
|
|
notwhite = str.find_last_not_of(" \t\n");
|
|
str.erase(notwhite+1); </programlisting>
|
|
<para>Obviously, the calls to <code>find</code> could be inserted directly
|
|
into the calls to <code>erase</code>, in case your compiler does not
|
|
optimize named temporaries out of existence.
|
|
</para>
|
|
|
|
</section>
|
|
<section xml:id="strings.string.case" xreflabel="Case Sensitivity"><info><title>Case Sensitivity</title></info>
|
|
|
|
<para>
|
|
</para>
|
|
|
|
<para>The well-known-and-if-it-isn't-well-known-it-ought-to-be
|
|
<link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.gotw.ca/gotw/">Guru of the Week</link>
|
|
discussions held on Usenet covered this topic in January of 1998.
|
|
Briefly, the challenge was, <quote>write a 'ci_string' class which
|
|
is identical to the standard 'string' class, but is
|
|
case-insensitive in the same way as the (common but nonstandard)
|
|
C function stricmp()</quote>.
|
|
</para>
|
|
<programlisting>
|
|
ci_string s( "AbCdE" );
|
|
|
|
// case insensitive
|
|
assert( s == "abcde" );
|
|
assert( s == "ABCDE" );
|
|
|
|
// still case-preserving, of course
|
|
assert( strcmp( s.c_str(), "AbCdE" ) == 0 );
|
|
assert( strcmp( s.c_str(), "abcde" ) != 0 ); </programlisting>
|
|
|
|
<para>The solution is surprisingly easy. The original answer was
|
|
posted on Usenet, and a revised version appears in Herb Sutter's
|
|
book <emphasis>Exceptional C++</emphasis> and on his website as <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.gotw.ca/gotw/029.htm">GotW 29</link>.
|
|
</para>
|
|
<para>See? Told you it was easy!</para>
|
|
<para>
|
|
<emphasis>Added June 2000:</emphasis> The May 2000 issue of C++
|
|
Report contains a fascinating <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://lafstern.org/matt/col2_new.pdf"> article</link> by
|
|
Matt Austern (yes, <emphasis>the</emphasis> Matt Austern) on why
|
|
case-insensitive comparisons are not as easy as they seem, and
|
|
why creating a class is the <emphasis>wrong</emphasis> way to go
|
|
about it in production code. (The GotW answer mentions one of
|
|
the principle difficulties; his article mentions more.)
|
|
</para>
|
|
<para>Basically, this is "easy" only if you ignore some things,
|
|
things which may be too important to your program to ignore. (I chose
|
|
to ignore them when originally writing this entry, and am surprised
|
|
that nobody ever called me on it...) The GotW question and answer
|
|
remain useful instructional tools, however.
|
|
</para>
|
|
<para><emphasis>Added September 2000:</emphasis> James Kanze provided a link to a
|
|
<link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.unicode.org/reports/tr21/tr21-5.html">Unicode
|
|
Technical Report discussing case handling</link>, which provides some
|
|
very good information.
|
|
</para>
|
|
|
|
</section>
|
|
<section xml:id="strings.string.character_types" xreflabel="Arbitrary Characters"><info><title>Arbitrary Character Types</title></info>
|
|
|
|
<para>
|
|
</para>
|
|
|
|
<para>The <code>std::basic_string</code> is tantalizingly general, in that
|
|
it is parameterized on the type of the characters which it holds.
|
|
In theory, you could whip up a Unicode character class and instantiate
|
|
<code>std::basic_string<my_unicode_char></code>, or assuming
|
|
that integers are wider than characters on your platform, maybe just
|
|
declare variables of type <code>std::basic_string<int></code>.
|
|
</para>
|
|
<para>That's the theory. Remember however that basic_string has additional
|
|
type parameters, which take default arguments based on the character
|
|
type (called <code>CharT</code> here):
|
|
</para>
|
|
<programlisting>
|
|
template <typename CharT,
|
|
typename Traits = char_traits<CharT>,
|
|
typename Alloc = allocator<CharT> >
|
|
class basic_string { .... };</programlisting>
|
|
<para>Now, <code>allocator<CharT></code> will probably Do The Right
|
|
Thing by default, unless you need to implement your own allocator
|
|
for your characters.
|
|
</para>
|
|
<para>But <code>char_traits</code> takes more work. The char_traits
|
|
template is <emphasis>declared</emphasis> but not <emphasis>defined</emphasis>.
|
|
That means there is only
|
|
</para>
|
|
<programlisting>
|
|
template <typename CharT>
|
|
struct char_traits
|
|
{
|
|
static void foo (type1 x, type2 y);
|
|
...
|
|
};</programlisting>
|
|
<para>and functions such as char_traits<CharT>::foo() are not
|
|
actually defined anywhere for the general case. The C++ standard
|
|
permits this, because writing such a definition to fit all possible
|
|
CharT's cannot be done.
|
|
</para>
|
|
<para>The C++ standard also requires that char_traits be specialized for
|
|
instantiations of <code>char</code> and <code>wchar_t</code>, and it
|
|
is these template specializations that permit entities like
|
|
<code>basic_string<char,char_traits<char>></code> to work.
|
|
</para>
|
|
<para>If you want to use character types other than char and wchar_t,
|
|
such as <code>unsigned char</code> and <code>int</code>, you will
|
|
need suitable specializations for them. For a time, in earlier
|
|
versions of GCC, there was a mostly-correct implementation that
|
|
let programmers be lazy but it broke under many situations, so it
|
|
was removed. GCC 3.4 introduced a new implementation that mostly
|
|
works and can be specialized even for <code>int</code> and other
|
|
built-in types.
|
|
</para>
|
|
<para>If you want to use your own special character class, then you have
|
|
<link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://gcc.gnu.org/ml/libstdc++/2002-08/msg00163.html">a lot
|
|
of work to do</link>, especially if you with to use i18n features
|
|
(facets require traits information but don't have a traits argument).
|
|
</para>
|
|
<para>Another example of how to specialize char_traits was given <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://gcc.gnu.org/ml/libstdc++/2002-08/msg00260.html">on the
|
|
mailing list</link> and at a later date was put into the file <code>
|
|
include/ext/pod_char_traits.h</code>. We agree
|
|
that the way it's used with basic_string (scroll down to main())
|
|
doesn't look nice, but that's because <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://gcc.gnu.org/ml/libstdc++/2002-08/msg00236.html">the
|
|
nice-looking first attempt</link> turned out to <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://gcc.gnu.org/ml/libstdc++/2002-08/msg00242.html">not
|
|
be conforming C++</link>, due to the rule that CharT must be a POD.
|
|
(See how tricky this is?)
|
|
</para>
|
|
|
|
</section>
|
|
|
|
<section xml:id="strings.string.token" xreflabel="Tokenizing"><info><title>Tokenizing</title></info>
|
|
|
|
<para>
|
|
</para>
|
|
<para>The Standard C (and C++) function <code>strtok()</code> leaves a lot to
|
|
be desired in terms of user-friendliness. It's unintuitive, it
|
|
destroys the character string on which it operates, and it requires
|
|
you to handle all the memory problems. But it does let the client
|
|
code decide what to use to break the string into pieces; it allows
|
|
you to choose the "whitespace," so to speak.
|
|
</para>
|
|
<para>A C++ implementation lets us keep the good things and fix those
|
|
annoyances. The implementation here is more intuitive (you only
|
|
call it once, not in a loop with varying argument), it does not
|
|
affect the original string at all, and all the memory allocation
|
|
is handled for you.
|
|
</para>
|
|
<para>It's called stringtok, and it's a template function. Sources are
|
|
as below, in a less-portable form than it could be, to keep this
|
|
example simple (for example, see the comments on what kind of
|
|
string it will accept).
|
|
</para>
|
|
|
|
<programlisting>
|
|
#include <string>
|
|
template <typename Container>
|
|
void
|
|
stringtok(Container &container, string const &in,
|
|
const char * const delimiters = " \t\n")
|
|
{
|
|
const string::size_type len = in.length();
|
|
string::size_type i = 0;
|
|
|
|
while (i < len)
|
|
{
|
|
// Eat leading whitespace
|
|
i = in.find_first_not_of(delimiters, i);
|
|
if (i == string::npos)
|
|
return; // Nothing left but white space
|
|
|
|
// Find the end of the token
|
|
string::size_type j = in.find_first_of(delimiters, i);
|
|
|
|
// Push token
|
|
if (j == string::npos)
|
|
{
|
|
container.push_back(in.substr(i));
|
|
return;
|
|
}
|
|
else
|
|
container.push_back(in.substr(i, j-i));
|
|
|
|
// Set up for next loop
|
|
i = j + 1;
|
|
}
|
|
}
|
|
</programlisting>
|
|
|
|
|
|
<para>
|
|
The author uses a more general (but less readable) form of it for
|
|
parsing command strings and the like. If you compiled and ran this
|
|
code using it:
|
|
</para>
|
|
|
|
|
|
<programlisting>
|
|
std::list<string> ls;
|
|
stringtok (ls, " this \t is\t\n a test ");
|
|
for (std::list<string>const_iterator i = ls.begin();
|
|
i != ls.end(); ++i)
|
|
{
|
|
std::cerr << ':' << (*i) << ":\n";
|
|
} </programlisting>
|
|
<para>You would see this as output:
|
|
</para>
|
|
<programlisting>
|
|
:this:
|
|
:is:
|
|
:a:
|
|
:test: </programlisting>
|
|
<para>with all the whitespace removed. The original <code>s</code> is still
|
|
available for use, <code>ls</code> will clean up after itself, and
|
|
<code>ls.size()</code> will return how many tokens there were.
|
|
</para>
|
|
<para>As always, there is a price paid here, in that stringtok is not
|
|
as fast as strtok. The other benefits usually outweigh that, however.
|
|
</para>
|
|
|
|
<para><emphasis>Added February 2001:</emphasis> Mark Wilden pointed out that the
|
|
standard <code>std::getline()</code> function can be used with standard
|
|
<code>istringstreams</code> to perform
|
|
tokenizing as well. Build an istringstream from the input text,
|
|
and then use std::getline with varying delimiters (the three-argument
|
|
signature) to extract tokens into a string.
|
|
</para>
|
|
|
|
|
|
</section>
|
|
<section xml:id="strings.string.shrink" xreflabel="Shrink to Fit"><info><title>Shrink to Fit</title></info>
|
|
|
|
<para>
|
|
</para>
|
|
<para>From GCC 3.4 calling <code>s.reserve(res)</code> on a
|
|
<code>string s</code> with <code>res < s.capacity()</code> will
|
|
reduce the string's capacity to <code>std::max(s.size(), res)</code>.
|
|
</para>
|
|
<para>This behaviour is suggested, but not required by the standard. Prior
|
|
to GCC 3.4 the following alternative can be used instead
|
|
</para>
|
|
<programlisting>
|
|
std::string(str.data(), str.size()).swap(str);
|
|
</programlisting>
|
|
<para>This is similar to the idiom for reducing
|
|
a <code>vector</code>'s memory usage
|
|
(see <link linkend="faq.size_equals_capacity">this FAQ
|
|
entry</link>) but the regular copy constructor cannot be used
|
|
because libstdc++'s <code>string</code> is Copy-On-Write in GCC 3.
|
|
</para>
|
|
<para>In <link linkend="status.iso.2011">C++11</link> mode you can call
|
|
<code>s.shrink_to_fit()</code> to achieve the same effect as
|
|
<code>s.reserve(s.size())</code>.
|
|
</para>
|
|
|
|
|
|
</section>
|
|
|
|
<section xml:id="strings.string.Cstring" xreflabel="CString (MFC)"><info><title>CString (MFC)</title></info>
|
|
|
|
<para>
|
|
</para>
|
|
|
|
<para>A common lament seen in various newsgroups deals with the Standard
|
|
string class as opposed to the Microsoft Foundation Class called
|
|
CString. Often programmers realize that a standard portable
|
|
answer is better than a proprietary nonportable one, but in porting
|
|
their application from a Win32 platform, they discover that they
|
|
are relying on special functions offered by the CString class.
|
|
</para>
|
|
<para>Things are not as bad as they seem. In
|
|
<link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://gcc.gnu.org/ml/gcc/1999-04n/msg00236.html">this
|
|
message</link>, Joe Buck points out a few very important things:
|
|
</para>
|
|
<itemizedlist>
|
|
<listitem><para>The Standard <code>string</code> supports all the operations
|
|
that CString does, with three exceptions.
|
|
</para></listitem>
|
|
<listitem><para>Two of those exceptions (whitespace trimming and case
|
|
conversion) are trivial to implement. In fact, we do so
|
|
on this page.
|
|
</para></listitem>
|
|
<listitem><para>The third is <code>CString::Format</code>, which allows formatting
|
|
in the style of <code>sprintf</code>. This deserves some mention:
|
|
</para></listitem>
|
|
</itemizedlist>
|
|
<para>
|
|
The old libg++ library had a function called form(), which did much
|
|
the same thing. But for a Standard solution, you should use the
|
|
stringstream classes. These are the bridge between the iostream
|
|
hierarchy and the string class, and they operate with regular
|
|
streams seamlessly because they inherit from the iostream
|
|
hierarchy. An quick example:
|
|
</para>
|
|
<programlisting>
|
|
#include <iostream>
|
|
#include <string>
|
|
#include <sstream>
|
|
|
|
string f (string& incoming) // incoming is "foo N"
|
|
{
|
|
istringstream incoming_stream(incoming);
|
|
string the_word;
|
|
int the_number;
|
|
|
|
incoming_stream >> the_word // extract "foo"
|
|
>> the_number; // extract N
|
|
|
|
ostringstream output_stream;
|
|
output_stream << "The word was " << the_word
|
|
<< " and 3*N was " << (3*the_number);
|
|
|
|
return output_stream.str();
|
|
} </programlisting>
|
|
<para>A serious problem with CString is a design bug in its memory
|
|
allocation. Specifically, quoting from that same message:
|
|
</para>
|
|
<programlisting>
|
|
CString suffers from a common programming error that results in
|
|
poor performance. Consider the following code:
|
|
|
|
CString n_copies_of (const CString& foo, unsigned n)
|
|
{
|
|
CString tmp;
|
|
for (unsigned i = 0; i < n; i++)
|
|
tmp += foo;
|
|
return tmp;
|
|
}
|
|
|
|
This function is O(n^2), not O(n). The reason is that each +=
|
|
causes a reallocation and copy of the existing string. Microsoft
|
|
applications are full of this kind of thing (quadratic performance
|
|
on tasks that can be done in linear time) -- on the other hand,
|
|
we should be thankful, as it's created such a big market for high-end
|
|
ix86 hardware. :-)
|
|
|
|
If you replace CString with string in the above function, the
|
|
performance is O(n).
|
|
</programlisting>
|
|
<para>Joe Buck also pointed out some other things to keep in mind when
|
|
comparing CString and the Standard string class:
|
|
</para>
|
|
<itemizedlist>
|
|
<listitem><para>CString permits access to its internal representation; coders
|
|
who exploited that may have problems moving to <code>string</code>.
|
|
</para></listitem>
|
|
<listitem><para>Microsoft ships the source to CString (in the files
|
|
MFC\SRC\Str{core,ex}.cpp), so you could fix the allocation
|
|
bug and rebuild your MFC libraries.
|
|
<emphasis><emphasis>Note:</emphasis> It looks like the CString shipped
|
|
with VC++6.0 has fixed this, although it may in fact have been
|
|
one of the VC++ SPs that did it.</emphasis>
|
|
</para></listitem>
|
|
<listitem><para><code>string</code> operations like this have O(n) complexity
|
|
<emphasis>if the implementors do it correctly</emphasis>. The libstdc++
|
|
implementors did it correctly. Other vendors might not.
|
|
</para></listitem>
|
|
<listitem><para>While parts of the SGI STL are used in libstdc++, their
|
|
string class is not. The SGI <code>string</code> is essentially
|
|
<code>vector<char></code> and does not do any reference
|
|
counting like libstdc++'s does. (It is O(n), though.)
|
|
So if you're thinking about SGI's string or rope classes,
|
|
you're now looking at four possibilities: CString, the
|
|
libstdc++ string, the SGI string, and the SGI rope, and this
|
|
is all before any allocator or traits customizations! (More
|
|
choices than you can shake a stick at -- want fries with that?)
|
|
</para></listitem>
|
|
</itemizedlist>
|
|
|
|
</section>
|
|
</section>
|
|
|
|
<!-- Sect1 03 : Interacting with C -->
|
|
|
|
</chapter>
|