Outpost Universe Forums

Off Topic => Computers & Programming General => Topic started by: Vagabond on June 13, 2018, 02:29:46 AM

Title: C++ std::string subtle difference from char*
Post by: Vagabond on June 13, 2018, 02:29:46 AM: I've been using std::string and char* as we convert the legacy archive code in OP2Utility to C++11 standards. I've had some problems dealing with strings as we read or write the volume and CLM files.

I stumbled across this article which sort of turned on a lightbulb about some differences between std::string and char*.
https://akrzemi1.wordpress.com/2014/03/20/strings-length/

In particular, std::string does not determine its size from a null terminator at the end; it stores the length separately. It actually allows null characters in the middle of the string. You have to account for this when inserting data into the string or transferring the data out of the string. Previously, I was treating std::string as a char* with built-in memory management and stored size information. Not considering that the null terminator isn't what defines the length sort of messed up my thought process.

For example, when storing a fixed length string in std::string, trailing \0s will be counted in the length. However, if that string is converted to a char*, then none of the \0s will be counted in the length anymore, since functions like strlen stop at the first \0. This can cause weird bugs when testing for equality since the sizes are different.
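To make the mismatch concrete, here is a minimal sketch. The helper names (fromFixedField, trimAtNull) and the 8 byte field are made up for illustration and are not taken from OP2Utility.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: copy a fixed-width field into a std::string,
// keeping every byte, including any trailing '\0' padding.
std::string fromFixedField(const char* field, std::size_t width) {
    return std::string(field, width);
}

// Hypothetical helper: drop everything from the first '\0' onward.
// If there is no '\0', find returns npos and the whole string is kept.
std::string trimAtNull(const std::string& s) {
    return s.substr(0, s.find('\0'));
}

void fixedFieldDemo() {
    // An 8 byte field holding "map" padded with '\0'.
    const char field[8] = {'m', 'a', 'p', '\0', '\0', '\0', '\0', '\0'};

    std::string raw = fromFixedField(field, sizeof(field));
    assert(raw.size() == 8);                 // padding counts toward size()
    assert(std::strlen(raw.c_str()) == 3);   // strlen stops at the first '\0'
    assert(raw != "map");                    // sizes differ, so not equal
    assert(trimAtNull(raw) == "map");        // equal once the padding is removed
}
```

Trimming at the first \0 right after reading is one way to avoid the comparison bug; whether that is the right place to do it depends on how the fields get written back out.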

So I get the feeling we should probably clean up and standardize the std::string code for reading/writing the archives and map files in OP2Utility at some point. Unfortunately, there seem to be lots of places in C++ where I don't understand the subtleties of the language. Taking time to fully learn them all would probably mean the project would stall out and never be finished. :| I never felt that way about C#. I think coding properly in C++ definitely takes longer and takes more discipline than in C#. I can see why someone might want to avoid C++ and use a newer language like Java or C# to reduce development time.
Title: Re: C++ std::string subtle difference from char*
Post by: leeor_net on June 13, 2018, 09:03:15 PM: Subtle? They're entirely different beasts.

char* isn't a string, it's a pointer to a char. This allows you to store a sequence of char values that can be interpreted by humans as a string but the concept of a 'string' is entirely a human concept and C has no understanding of it. This is why there are functions to work with 'strings' that are flaky and extremely error prone. The 'null terminator' is a sentinel value that has become standard practice but ultimately is an arbitrary sentinel value. You're still dealing with raw memory and pointers and any 'string manipulations' really just translate to pointer arithmetic. It sucks. Hardcore.

std::string, on the other hand, is a class designed specifically to model what we humans call a 'string'. This is why they can contain null terminators within them. Remember, the 'null terminator' is just an arbitrary sentinel value. std::string manages memory and offers operators that make it easier to work with 'string' type data (though it's hardly perfect... this is why Boost has such things as the lexical_cast and an entire string manipulation library).

But yeah, thinking of them as the same thing with one having some memory management is certainly going to trip you up. One is a pointer to memory, the other is an object that owns a sequence of bytes and keeps track of its length. Most definitely not the same thing, though they achieve similar results.

All that stated, this is why I suggested you choose one or the other with a strong suggestion of switching to std::string. It's far less prone to error than using raw char* pointers and you can get a char* pointer easily out of it if/when needed.
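As a rough sketch of what getting a char* back out looks like in C++11: the two "legacy" functions below are stand-ins invented for this example, not real OP2Utility or library calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in for a legacy function that expects a null-terminated string.
std::size_t legacyLength(const char* text) {
    return std::strlen(text);
}

// Stand-in for a legacy function that fills a caller-supplied buffer.
void legacyFill(char* buffer, std::size_t size, char value) {
    std::memset(buffer, value, size);
}

void interopDemo() {
    // Read-only access: c_str() returns a null-terminated const char*.
    std::string name = "volume";
    assert(legacyLength(name.c_str()) == 6);

    // Writable access: size the string first, then pass &buffer[0].
    // The storage is contiguous since C++11; the string must not be empty.
    std::string buffer(4, '\0');
    legacyFill(&buffer[0], buffer.size(), 'x');
    assert(buffer == "xxxx");
}
```

The pointer is only valid until the string is modified or destroyed, so it should be handed to the C-style function and then forgotten rather than stored.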
Title: Re: C++ std::string subtle difference from char*
Post by: Hooman on June 19, 2018, 10:36:09 PM: Interesting article, particularly the part at the end about the new user defined literals.

The semantics between the two are indeed different. I think std::string uses a much more sane approach. Most other languages will store string size along with a buffer pointer. It's far safer as it lets you bounds check various operations, and also allows for a number of speed improvements too. The additional cost in memory is hardly a concern these days. As far as I know, only C/C++ stored just the buffer pointer and required scanning for a sentinel value to determine string length. And it's suffered so many bugs and security flaws because of it. Not to mention frequent slow scanning to find string length, and excessive copying since you can't properly slice when a sentinel value is needed to determine an end point.

It is subtly deceptive how std::string has a constructor to initialize it from a null terminated string. I can see that easily causing people to conflate the two. Additionally, since C++11, the buffer returned by c_str() and data() is guaranteed to be null terminated. For compatibility with old code, std::string will allocate at least 1 extra byte to ensure there is always a null just past the end of the string, though it doesn't otherwise consider null special, and can have embedded nulls which do not terminate the string.
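A small sketch of the two constructors side by side, assuming a C++11 or later compiler. The literal and the function name are just for illustration.

```cpp
#include <cassert>
#include <string>

void constructorDemo() {
    // Five characters ('a', 'b', '\0', 'c', 'd') plus the implicit terminator.
    const char raw[] = "ab\0cd";

    // The const char* constructor scans for the first '\0' and stops there.
    std::string truncated(raw);
    assert(truncated.size() == 2);

    // The pointer plus length constructor copies exactly 5 bytes,
    // embedded '\0' included.
    std::string full(raw, 5);
    assert(full.size() == 5);
    assert(full[2] == '\0');

    // Since C++11 there is still a '\0' just past the end.
    assert(full[full.size()] == '\0');
    assert(full.c_str()[5] == '\0');
}
```

So when the data may contain embedded nulls, or is a fixed-width field, the length has to be passed explicitly; the single-argument constructor silently drops everything after the first \0.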

C++ has a lot of power that can make it truly wonderful at times, but it's also damn awful in terms of legacy junk and the number of subtleties that can bite you in the butt. The newer standards go a long way towards alleviating some of the pain points, though it's not surprising when people say that's to help legacy projects, and new projects should just be written in another language.

On the plus side, you stand to learn a lot about programming and low level details of how a computer works by getting good at C++. ;)