Author Topic: A `const LPWSTR` is just a `const wchar_t*`, right? (Read 9960 times)

Hooman · « **on:** September 24, 2019, 06:26:10 AM »

I was going over some code that used the Windows style defines for pointers to wide strings.

So a `const LPWSTR` is just a `const wchar_t*`, right? Right?

Wrong!

Yet another dark corner of the C++ language.

The following test written with Google Test passes:

Code: [Select]

#include <gtest/gtest.h>
#include <type_traits>


TEST(WideStrings, WideStringTypes)
{
	using LPWSTR = wchar_t*;

	EXPECT_FALSE((std::is_same<const LPWSTR, const wchar_t*>::value));
	EXPECT_FALSE((std::is_same<const LPWSTR, wchar_t const*>::value));
	EXPECT_TRUE((std::is_same<const LPWSTR, wchar_t* const>::value));

	EXPECT_FALSE((std::is_same<LPWSTR const, const wchar_t*>::value));
	EXPECT_FALSE((std::is_same<LPWSTR const, wchar_t const*>::value));
	EXPECT_TRUE((std::is_same<LPWSTR const, wchar_t* const>::value));

	EXPECT_TRUE((std::is_same<LPWSTR const, const LPWSTR>::value));
}

Code: [Select]

[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from WideStrings
[ RUN      ] WideStrings.WideStringTypes
[       OK ] WideStrings.WideStringTypes (0 ms)
[----------] 1 test from WideStrings (39 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (101 ms total)
[  PASSED  ] 1 test.

To my shock and horror, a `const LPWSTR` is actually the same thing as an `LPWSTR const`. The `const` applies to the alias as a whole, and the alias is a pointer type. That is, the pointer is const, not the `wchar_t` it points to.

If you really want a `const wchar_t*`, you need to use the `LPCWSTR` alias.
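Here is a minimal sketch of what that means in practice. The two aliases are local stand-ins for the Windows typedefs (declared here so the snippet compiles without `<windows.h>`), and `setFirst` and `readFirst` are hypothetical names made up for illustration.

```cpp
#include <type_traits>

// Local stand-ins for the Windows aliases, so no <windows.h> is needed.
using LPWSTR = wchar_t*;
using LPCWSTR = const wchar_t*;

// const applies to the alias as a whole, i.e. to the pointer itself.
static_assert(std::is_same<const LPWSTR, wchar_t* const>::value, "const pointer, mutable characters");
static_assert(std::is_same<LPCWSTR, const wchar_t*>::value, "mutable pointer, const characters");

// Hypothetical helper: the parameter looks read-only, but the characters are writable.
void setFirst(const LPWSTR text, wchar_t ch) {
	text[0] = ch;      // Compiles: the pointed-to data is not const
	// text = nullptr; // Error: the pointer itself is const
}

// Hypothetical helper: LPCWSTR really does protect the characters.
wchar_t readFirst(LPCWSTR text) {
	// text[0] = L'x'; // Error: the pointed-to data is const
	return text[0];
}
```

So a function taking `const LPWSTR` makes no promise to leave the caller's string alone.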

lordpalandus · « **Reply #1 on:** September 24, 2019, 04:07:11 PM »

I do not understand the issue here. Why would you want to create a pointer to a string in the first place?

And if it was a constant, you wouldn't be able to modify it at runtime, so I'm confused as to why someone would use this.

Aren't there easier ways to handle strings in C++, like in scripting languages? In Python, a string is just one long "list" filled with characters, symbols and empty spaces. And then you can use operators to add stuff to it, such as concatenation of two strings.

leeor_net · « **Reply #2 on:** September 24, 2019, 06:12:51 PM »

It's beyond your current ability to understand. That's not to call you stupid or ignorant, just that it would be difficult to explain it in terms you could understand until you have a grasp of strings in C which don't work like they do in other languages. This is a relatively good article that goes over it if you're interested in learning about it: https://www.dyclassroom.com/c/c-pointers-and-strings

There are better ways to handle strings in C++, like using the std::string class, but strings in C and C++ are an exercise in self-flagellation. Basically, it's extremely painful and it's by design (unfortunately).

Anyway, that's not at all surprising. It's why I always tried to avoid the Windows API: too many defines that don't translate well and are holdovers from way back in the very early '90s. Yuck.

lordpalandus · « **Reply #3 on:** September 24, 2019, 07:33:40 PM »

Yeah I kinda figured. Thanks for the link.

Uhh, why would someone purposefully design strings to be painful for any language? Any GUI is going to need strings, so... yeah.

TechCor · « **Reply #4 on:** September 25, 2019, 03:11:43 AM »

"Back in the day", when C++ was a wee language (called C), people had to work at a lower level with computers. Strings back then were abstract concepts. You created them by allocating a series of bytes and assigning it to a pointer. You'd pass this memory address of your string to wherever you needed it.

Unfortunately, that's still just a block of memory like any other. It did not give you any special capabilities for modifying it. To determine length, it was popular to "zero-terminate" the string. Then a function like strlen() would iterate over it and count the bytes until it hit the zero. "Zero termination" is just a convention, and isn't anything special.
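As a rough sketch of the idea (not the actual library implementation, and `countUntilZero` is a made-up name), counting up to the zero byte looks something like this:

```cpp
#include <cstddef>

// Hypothetical re-implementation of the idea behind strlen().
// Walks the bytes one at a time until it finds the zero terminator.
std::size_t countUntilZero(const char* text) {
	std::size_t length = 0;
	while (text[length] != '\0') {
		++length;
	}
	return length;
}
```

Note that the cost grows with the length of the string, and that nothing stops the loop if the zero byte is missing.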

Of course, people started making string libraries to handle common operations, but computers for the longest while were slow. I mean REALLY slow. A few extra operations could make all the difference. You often didn't want the overhead of a library. Not to mention that 24 kilobyte library could make you go over your memory limit!

You also had the issue of standardization.

Modern languages have the luxury of being designed in a time where we can waste some performance and memory in exchange for faster development times and less required technical knowledge. Don't forget that C++ was designed at a time when Object-Oriented Programming was a new and revolutionary philosophy. Modern languages are standing on the shoulders of giants.

Hooman · « **Reply #5 on:** September 25, 2019, 04:13:27 AM »

Way back in the 1970s, C was invented, on what is now primitive computer hardware with many limitations, both in speed and in memory. Sometimes people did strange things in an attempt to optimize for them. C-strings are a horrible abomination that came from that, and are probably the worst programming decision that we are still paying for today.

If you're using C++, you have the option to use std::string, which provides many (though not typically all) of the conveniences afforded by other modern programming languages. If you're using just plain C though, then you're pretty much stuck with C-strings.
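For comparison with the Python question above, a small sketch of what std::string gives you. The function name `makeGreeting` is just an example, not from any real codebase.

```cpp
#include <string>

// Hypothetical helper: builds a greeting without any manual buffer management.
std::string makeGreeting(const std::string& name) {
	std::string greeting = "Hello, ";
	greeting += name; // Concatenation grows the buffer automatically
	greeting += '!';
	return greeting;  // The length is stored, so size() is constant time
}
```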

The Windows API is built for C. Just plain C, not C++. One of the primary reasons is that C has a well-defined ABI (Application Binary Interface) on each platform, while C++ does not. That means C compilers from different vendors can all build to the same ABI, and their code will interoperate after it is compiled. This isn't the case for C++, particularly not on Windows. C++ code compiled with one compiler is unlikely to be linkable or usable by C++ code compiled by a different compiler. As the Windows API needed to provide the base level interface for all programs running on the operating system, it was written for C. This is of course fine for C++ compilers, which contain most aspects of C as a sublanguage. You just can't use any of the additional C++ features at the core OS interface level.

A similar decision was made with the op2ext module loader interface. The module loader is written with C++, but the interface to other modules uses a limited C only subset of the language. This was intentional to allow for modules written with other compilers, and probably even other languages.
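To give a feel for what a C-only interface looks like from the C++ side, here is a minimal sketch. The function `CopyModuleName` is entirely hypothetical; it is not taken from op2ext or the Windows API. The point is only that the signature uses plain C types and `extern "C"` turns off C++ name mangling.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical exported function, not part of op2ext or the Windows API.
// extern "C" disables C++ name mangling, and the signature uses only C types.
extern "C" std::size_t CopyModuleName(char* buffer, std::size_t bufferSize) {
	const char name[] = "ExampleModule";
	const std::size_t needed = sizeof(name); // Includes the null terminator
	if (buffer != nullptr && bufferSize >= needed) {
		std::memcpy(buffer, name, needed);
	}
	return needed; // Caller can use this to size a buffer and retry
}
```

The implementation is free to use C++ internally. Only the boundary has to stay within the C subset.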

Anyway, ranting about the finer points of ABIs aside, the design of C-strings is part of the reason why there is a funny pointer interface to strings.

In C, a C-string is an array of char, with a bad decision at the end. Err, I mean, 0 null terminator byte at the end.

In C, an array decays to a pointer. If you try to pass an array to a function, it decays to a pointer to the first element of the array. This means you only need to copy a small pointer onto the stack during the call sequence, rather than an entire array, so it's much more efficient for anything but the smallest of strings.
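A quick sketch of the decay, using made-up function names. The callee only ever sees a pointer, so the size of the original array is lost at the call boundary.

```cpp
#include <cstddef>

// Hypothetical callee: all it receives is a pointer to the first element.
std::size_t sizeSeenByCallee(const char* text) {
	return sizeof(text); // Size of a pointer, not of the caller's array
}

// Hypothetical caller: owns the array, so sizeof reports the full buffer.
bool arrayDecaysToPointer() {
	char text[64] = "hello";
	return sizeof(text) == 64 && sizeSeenByCallee(text) == sizeof(char*);
}
```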

As a consequence of passing a pointer, rather than copying the array, it means the function gets access to the original data, rather than a copy. It might be that you don't want the function to modify the original array. Maybe the function should only be allowed to read the data. This is where const comes in. Data that is declared const is checked by the compiler so writes to it are disallowed. A function can take a pointer to const data. This is a contract in that the function is saying it won't modify any of the data passed to it through the pointer.

The calling function might store the data as a mutable array, but only provide a pointer to const when passing the data to other functions. This means the data can change, but only certain functions are allowed to change it.

If you try to pass const data to a function that accepts a pointer to non-const data, it is a compile error. The type checking system disallows this. If this wasn't disallowed, the called function might happily write all over the data that wasn't supposed to change. This means for a function to accept const data, it must declare the parameter to accept const data.

Conversion in the other direction is automatic. If you have non-const data, you can pass it to a function that accepts const data. The caller doesn't care. Write to it or not, it's allowed. Though once the function is declared to accept const data, the compiler does ensure the function lives up to that promise of not writing to the data.

A consequence of all this is that if you write a function that takes data in through a pointer, and it only ever uses that data in a read-only manner, it should declare the parameter as a pointer to const data. That way it can accept data regardless of its constness.

Examples:

Code: [Select]

struct Data {
  int field;
};

// Function accepting non-const data
void f1(Data* data) {
  int local = data->field; // Read allowed
  data->field = 0; // Write allowed
}

// Function accepting const data
void f2(const Data* data) {
  int local = data->field; // Read allowed
  //data->field = 0; // Error, data is const
}

void f3() {
  Data data = { 1 }; // Create some data
  f1(&data); // Allowed (data is allowed to be changed)
  f2(&data); // Allowed (data can not be changed)
}

void f4() {
  const Data data = { 1 }; // Create some const data
  //f1(&data); // Error, this would give a function write access to const data
  f2(&data); // Allowed (data can not be changed)
}

void f5() {
  Data data = { 1 }; // Create some data

  Data* dataPtr = &data; // This pointer allows a writable view of the data
  const Data* dataReadPtr = &data; // This pointer allows a read-only view of the data

  f1(dataPtr); // Allowed (data is allowed to be changed)
  f2(dataPtr); // Allowed (data can not be changed)

  //f1(dataReadPtr); // Error, this would give a function write access to const data
  f2(dataReadPtr); // Allowed (data can not be changed)
}

TechCor · « **Reply #6 on:** September 25, 2019, 05:20:43 AM »

Quote from: Hooman on September 25, 2019, 04:13:27 AM

In C, a C-string is an array of char, with a bad decision at the end. Err, I mean, 0 null terminator byte at the end.

Hey man, you couldn't lead with the length because you didn't know what size to use. I mean, if your string is only 32 characters, using an int would waste a whole 3 bytes!

I think saying "we are still paying for [it] today" is a bit dramatic. Times were different, needs were different. We've moved on. Newer languages have replaced C++. The "Windows API" you are describing was called Win32 when I was in college. Win32!! We're on 64-bit machines these days. Win32 was forever ago. We passed through MFC, Visual C++ (remember that auto-generated nightmare?), WPF and WinForms. The only people who care about this stuff now are working with legacy code.... oh wait.

I was sitting here with XCode open and immediately noticed this line:

Code: [Select]

PushNotificationWasClicked(const char* body)

Look at that C-String goodness.

Honestly, I've used C-string so many times that I don't even think about it. The basic manipulation of a C-string is:

strlen
strcmp
strcpy
strcat

It's pretty simple until you throw in Windows' god-awful typedefs for everything.
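For anyone following along who hasn't used them, a small sketch of those four functions working together. `joinWords` is a made-up helper, and the size check is there because strcpy and strcat do no bounds checking of their own.

```cpp
#include <cstring>

// Hypothetical helper: joins two words with a space into a caller-provided buffer.
// Returns false if the buffer is too small for the result plus terminator.
bool joinWords(char* buffer, std::size_t bufferSize, const char* first, const char* second) {
	const std::size_t needed = std::strlen(first) + 1 + std::strlen(second) + 1;
	if (needed > bufferSize) {
		return false;
	}
	std::strcpy(buffer, first);  // Copy the first word, including its terminator
	std::strcat(buffer, " ");    // Append by scanning for the terminator each time
	std::strcat(buffer, second);
	return true;
}
```

Comparing the result is then a matter of `std::strcmp(buffer, "expected") == 0`.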

Yeah, I'm killing time waiting for a build. So what?

lordpalandus · « **Reply #7 on:** September 25, 2019, 11:23:57 AM »

Thank you both for the in-depth replies. Now that I actually understood.

So, I'm assuming that with C-Strings you cannot move the null terminator or insert around it or delete the null terminator, add to it and then add a new null terminator to it?

Vagabond · « **Reply #8 on:** September 25, 2019, 08:52:48 PM »

@lordpalandus,

Actually, you can move the null terminator earlier in the string, as long as you do not try to insert it past the end of the buffer length (A buffer being an array of chars). You can try to make the buffer large enough to accommodate the longest string you think will be stored and reject strings longer than the buffer or resize the buffer as needed. Part of the reason C# style strings or C++ std::string are so nice is because you don't have to try to guess how large to make the buffer, place null terminators, or resize the buffer manually when it is too small.
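A minimal sketch of moving the terminator earlier, with a hypothetical helper name. It only ever shortens the string, so it can't run past the end of the buffer.

```cpp
#include <cstring>

// Hypothetical helper: shortens a C-string in place by writing an earlier terminator.
// Does nothing if newLength is not shorter than the current length.
void truncateInPlace(char* text, std::size_t newLength) {
	if (newLength < std::strlen(text)) {
		text[newLength] = '\0';
	}
}
```

The bytes after the new terminator are still sitting in the buffer. They are just no longer considered part of the string.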

@Hooman and TechCor,

I've only done a little Win32 programming (all Outpost 2 associated). The most difficult part for me was dealing with all the type definitions for strings, figuring out how they relate and how to transfer them back and forth. Very difficult for someone jumping in without having closely studied Win32 programming I think.

Hooman · « **Reply #9 on:** September 28, 2019, 02:23:49 PM »

Quote from: TechCor on September 25, 2019, 05:20:43 AM

Yeah, I'm killing time waiting for a build. So what?

Lol!

Windows typedefs do tend to be pretty awful. Though I suppose they do somewhat insulate you from changes to the API. They might not preserve binary compatibility, but they can sometimes preserve source code compatibility.

Quote

So, I'm assuming that with C-strings you cannot move the null terminator or insert around it or delete the null terminator, add to it and then add a new null terminator to it?

Brett is quite right with his response.

The problem with C-strings is they do not store their length. There is a conceptual length, which you can find by scanning for the null byte. This is a linear operation, so the longer the string gets, the slower it becomes to determine the conceptual length. Perhaps more importantly, there is also a maximum length based on the buffer size, which depends on how memory for the C-string was reserved or allocated. That maximum length isn't stored in any known field, and there is no way to scan for it, so there is no built-in way for the C-string to know what it is. This can be a real problem for memory handling. Reading past the end of the buffer results in undefined behaviour. It would be accessing the memory for other data structures, or possibly uninitialized memory. Writing past the end of the buffer is even worse, as it corrupts what it writes over, and there's no telling what that will be.

You can overwrite the null terminator byte, and write a new one somewhere else, but without knowing the buffer size, there's no guarantee it's still within the bounds of the buffer. This may be fine for string shortening, but isn't so great for string lengthening. If you want to concatenate strings, you often have to allocate a new buffer for the result, since it's often not clear how much space is available in the input buffers. What's worse, is to allocate enough space for the result, you need to know the lengths of the inputs, and since those values aren't stored, you don't have constant time access to the information. Instead you have to scan the input strings to find their length, which as stated before, is a linear operation, where the time required grows with the length of the string.
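A sketch of what that typically ends up looking like. The name `concatenate` is made up, and the choice of malloc for the new buffer is an assumption for the example.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns a newly malloc'ed string that the caller must free().
// Returns nullptr if the allocation fails.
char* concatenate(const char* first, const char* second) {
	const std::size_t firstLength = std::strlen(first);   // Linear scan
	const std::size_t secondLength = std::strlen(second); // Another linear scan
	char* result = static_cast<char*>(std::malloc(firstLength + secondLength + 1));
	if (result == nullptr) {
		return nullptr;
	}
	std::memcpy(result, first, firstLength);
	std::memcpy(result + firstLength, second, secondLength + 1); // Copies the terminator too
	return result;
}
```

Both inputs get scanned once just to learn their lengths, before a single byte has been copied.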

To circumvent some of those problems, many APIs pass pairs of parameters around. They'll require both a C-string pointer and a length. This is somewhat of a pain though, as what is one conceptual thing, a string, now needs to be passed around as two independent parameter values. This opens up the possibility of bugs from badly matched pointer/length pairs, and makes APIs depend on the order of the two fields. Some might pass pointer first, followed by length; others might pass length first, followed by the pointer. The order may change depending on the function, or on the API, so it may not be consistent across the codebase of one program.
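A small sketch of the pointer-plus-length style, using a hypothetical function in the "pointer first" order.

```cpp
#include <cstddef>

// Hypothetical function in the "pointer first, then length" style.
// It never looks for a null terminator; it trusts the length it is given.
std::size_t countSpaces(const char* text, std::size_t length) {
	std::size_t count = 0;
	for (std::size_t i = 0; i < length; ++i) {
		if (text[i] == ' ') {
			++count;
		}
	}
	return count;
}
```

Passing a shorter length works fine and lets the function look at just part of a buffer. Passing a length larger than the buffer reads out of bounds, and nothing in the signature can catch that.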

Passing pairs of values around still doesn't solve all the problems though. You can read a string, and you can write a string, maybe even change its length within the limits of the buffer, but you can't change the buffer size. There is no associated allocator, and no concept of re-allocation built into C-strings. Like the buffer length, you have to provide all that externally. Effectively you end up allocating a new string, copying old data into it, freeing the old buffer, and replacing the original value with the new one. That's assuming a dynamically allocated buffer stored on the heap. Sometimes C-strings point to other memory sources, such as global or local variables. You can point a pointer at any location, but you can't free memory from any location. Allocating and freeing only work with heap memory. It's an error to try to free global or local memory, and if you try, it results in undefined behaviour. So although you can point the pointer at any of those locations, if you don't know what type of location it's pointing to, there's no way to know if the memory should be freed, or simply forgotten about.
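A sketch of growing a heap-allocated C-string, under the assumption that the buffer came from malloc or realloc. `appendToHeapString` is a made-up name, and the ownership rule lives only in the comment, which is exactly the problem.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: appends suffix to a heap-allocated C-string.
// text must be nullptr or come from malloc/realloc. Passing a string literal
// or a local array here is undefined behaviour, and the type system can't tell.
// Returns the (possibly moved) buffer, or nullptr on failure, in which case
// the original buffer is left untouched.
char* appendToHeapString(char* text, const char* suffix) {
	const std::size_t oldLength = (text != nullptr) ? std::strlen(text) : 0;
	const std::size_t suffixLength = std::strlen(suffix);
	char* grown = static_cast<char*>(std::realloc(text, oldLength + suffixLength + 1));
	if (grown == nullptr) {
		return nullptr;
	}
	std::memcpy(grown + oldLength, suffix, suffixLength + 1); // Copies the terminator too
	return grown;
}
```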

Using a null terminator byte as a convention is also really bad for string slicing. Suppose you want to scan a string and return a portion of it. For very large and slow to copy strings, it would be nice to extract a portion of the string without having to copy it. It's easy enough to set a pointer to the new start location. However, if the convention is to null terminate strings, and you need to pass the string on to another API that expects this convention, then there's a big downside. You can null terminate the shorter substring by writing a null byte into the original buffer at the new end point. That would work, but it modifies the original buffer. It might be that you need to re-use the buffer contents many times, and so you don't want to trash the data by writing null bytes into the middle of it. In that case, you're forced to copy the substring into a new buffer so that it can be null terminated without it affecting the original data. As text file parsing often requires such an operation, using a null terminator convention can greatly increase the memory requirements and processing time required to process text data.

So yeah, there are a few issues with C-strings. Thankfully C++ has started offering alternatives.
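One such alternative, assuming a C++17 compiler, is std::string_view, which is a pointer and length bundled into one value. Here is a minimal sketch of slicing without copying. The helper names and the `key=value` line format are made up for the example.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: returns the text before the first '=' without copying it.
// If there is no '=', the whole line is returned.
std::string_view keyOf(std::string_view line) {
	return line.substr(0, line.find('='));
}

// Hypothetical helper: makes a null-terminated copy, for APIs that insist on one.
std::string keyOfAsCopy(std::string_view line) {
	return std::string(keyOf(line));
}
```

The view from `keyOf` points into the original buffer and leaves it unmodified. It is not null terminated though, so a copy is still needed before handing it to an API that expects a C-string.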
