I am trying to decode a UTF-32 encoded string, stored as a uint[], into a string.

For this I want to use the System.Text.Encoding.UTF32.GetString method. However, the overloads of this method take either a Byte[] or a Byte*. I don't mind using unsafe in this particular case (however coming from C, I am not aware of the consequences of doing so).

I figured out a way to do this by copying the entire array to another array of type byte[], either directly or indirectly. In this question I would only like to know whether it is possible to convert a uint[] to a byte[] (or to a byte*, if converting to byte[] isn't possible) without copying.

2 Answers

If possible, you should absolutely use the span approach from canton7's answer to this same question.

This answer is just to complete some information from the question, specifically:

I don't mind using unsafe in this particular case (however coming from C, I am not aware of the consequences of doing so)

This information is mainly relevant when:

  • spans aren't available in your target platform (or can't be used with the API you're consuming)
  • you're talking to unmanaged libraries via P/Invoke

Just to add some terminology: byte* is an unmanaged pointer ("unsafe"), and ref byte is a managed pointer ("safe", relatively speaking). Spans are basically "managed pointers, plus a length"; so: a Span<byte> (or ReadOnlySpan<byte>) is broadly comparable (in unmanaged land) to a pair of byte* and int (length). The huge difference is that the garbage collector understands managed pointers (including tracking what they touch, and fixing them if it decides to move memory around), and does not even look at unmanaged pointers (they are just opaque native integers) - so: keeping everything managed avoids a lot of problems.

As a consequence, if you obtain the unmanaged pointer to data, it is your job to make sure that the data can't possibly get moved while you're using it, which would make your pointer invalid. To do this, you usually use the fixed keyword, i.e. (in an unsafe block):

uint[] data = ...
fixed (uint* u32ptr = data)
{
    // we can coerce pointers:
    byte* bptr = (byte*)u32ptr;

    // the byte* can now be passed, along with the byte count
    // (data.Length * sizeof(uint)), to e.g. the Encoding methods
    return Encoding.UTF32.GetString(bptr, data.Length * sizeof(uint));
}
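
To make the snippet above easier to drop into a project, here is a minimal sketch of it as a complete method. The class and method names (Utf32Decoder, Decode) are made up for illustration, and it assumes the uint values are code points in the machine's native byte order and that the project is compiled with unsafe code enabled (AllowUnsafeBlocks). It also guards the empty-array case, because fixed yields a null pointer for an empty array and the pointer overload of GetString rejects null.

```csharp
using System;
using System.Text;

// Hypothetical helper; the names are illustrative, not from any library.
public static class Utf32Decoder
{
    // Assumption: the uint values are code points in native byte order,
    // so the encoding's endianness is chosen to match the machine.
    private static readonly UTF32Encoding NativeUtf32 =
        new UTF32Encoding(bigEndian: !BitConverter.IsLittleEndian, byteOrderMark: false);

    public static unsafe string Decode(uint[] data)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));

        // fixed gives a null pointer for an empty array, and
        // GetString(byte*, int) throws on null, so return early.
        if (data.Length == 0) return string.Empty;

        fixed (uint* u32ptr = data)
        {
            // The array is pinned only while we are inside this block;
            // the pointer must not be stored anywhere that outlives it.
            return NativeUtf32.GetString((byte*)u32ptr, checked(data.Length * sizeof(uint)));
        }
    }
}
```

A call such as Utf32Decoder.Decode(new uint[] { 0x48, 0x69 }) should then give back the two-character string "Hi" without any intermediate byte[] copy.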

The important thing is not to store such a pointer (u32ptr or bptr here) in a way that lets it escape the fixed block, for example by storing it in a field. As long as you don't do that, you're fine: the garbage collector knows how to interpret fixed, and knows that it can't move data while the thread is inside that region; fixed is a very low-cost way of temporarily pinning data. There is a second way, used for longer-running requirements: a GCHandle with a type of GCHandleType.Pinned. This is usually needed because you're handing managed memory to an unmanaged API that then holds the pointer for longer than the P/Invoke call(s) you might make inside a fixed region. It should usually be avoided, though.
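
For completeness, a rough sketch of the GCHandle route might look like the following. It is shown on the same decoding task purely for comparison (fixed is the better fit here, since nothing holds the pointer for long); the class name PinnedDecodeExample is hypothetical, native byte order is assumed, and unsafe code must be enabled. The key point is that the handle has to be freed explicitly, otherwise the array stays pinned.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical example class; names are illustrative only.
public static class PinnedDecodeExample
{
    public static unsafe string Decode(uint[] data)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));
        if (data.Length == 0) return string.Empty;

        // Pin the array so the GC cannot move it until the handle is freed.
        GCHandle handle = GCHandle.Alloc(data, GCHandleType.Pinned);
        try
        {
            byte* bptr = (byte*)handle.AddrOfPinnedObject().ToPointer();

            // Assumption: code points are stored in native byte order.
            var encoding = new UTF32Encoding(!BitConverter.IsLittleEndian, false);
            return encoding.GetString(bptr, data.Length * sizeof(uint));
        }
        finally
        {
            // Unlike fixed, nothing unpins automatically; always free the handle.
            handle.Free();
        }
    }
}
```

In a real long-lived scenario the handle would typically be kept alongside whatever unmanaged resource uses the pointer and freed when that resource is released, rather than in the same method.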

If you're using .NET Core 2.1+ / .NET Standard 2.1+, you can use MemoryMarshal.AsBytes to get a Span<byte> over your array, and then pass it to the Encoding.GetString overload which takes a ReadOnlySpan<byte>.

Span<byte> bytes = MemoryMarshal.AsBytes(uintArray.AsSpan());
string str = Encoding.UTF32.GetString(bytes);

As I'm sure you're aware, make sure that the endianness of your machine matches the endianness of the UTF32Encoding instance you're using! For example:

var encoding = new UTF32Encoding(bigEndian: !BitConverter.IsLittleEndian, byteOrderMark: false);
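
Putting the two snippets together, a minimal self-contained sketch could look like this. The class and method names (SpanDecodeExample, Decode) are hypothetical, and it assumes the array holds code points in the machine's native byte order on a runtime where the span overload of GetString is available.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical example class; names are illustrative only.
public static class SpanDecodeExample
{
    public static string Decode(uint[] codePoints)
    {
        // Reinterpret the uint[] as bytes without copying.
        ReadOnlySpan<byte> bytes = MemoryMarshal.AsBytes(codePoints.AsSpan());

        // Match the encoding's byte order to the machine's.
        var encoding = new UTF32Encoding(
            bigEndian: !BitConverter.IsLittleEndian,
            byteOrderMark: false);

        return encoding.GetString(bytes);
    }
}
```

No unsafe code or pinning is involved here, and an empty array simply produces an empty string.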

5 Comments

Span<T> is only available in .NET Framework >= 5.0, which I don't have and I don't want to push on users, which I'm expecting to have .NET Framework >= 4.0
Span<T> is also available in .NET Core 2.1+ / .NET Standard 2.1+ -- still beyond the reach of .NET Framework, but still 3 years old! (Note that it's ".NET 5" -- ".NET Framework" is the legacy runtime, and stopped at version 4.8)
@canton7 Span<T> might be available, but Encoding lacks the useful methods, meaning: you end up having to use fixed on the span. NetFX does not implement netstandard2.1 and never will - only netstandard2.0, and even that is a huge hack. I totally agree that this is the right approach whenever available, though.
@MarcGravell Encoding.GetString(ROS) was introduced in .NET Core 2.1 as well. I never claimed any of this worked in .NET Framework, only in .NET Core 2.1+ / .NET Standard 2.1+
@canton7 yep, totally understood - I'm just commenting on the OP's premise of "I don't want to push on users, which I'm expecting to have .NET Framework >= 4.0" and your comments on .NET Core 2.1 / .NET Standard 2.1; I also totally believe that we should ditch netfx when possible: blog.marcgravell.com/2020/01/…