Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data corruption in callback to SML from C with c and llvm codegen #600

Open
pclayton opened this issue Jan 4, 2025 · 5 comments
Open

Data corruption in callback to SML from C with c and llvm codegen #600

pclayton opened this issue Jan 4, 2025 · 5 comments

Comments

@pclayton
Copy link

pclayton commented Jan 4, 2025

I am testing on macOS arm64 for the first time. I have an example that works fine on Linux x64_64 plaforms and on (the very old) OS X 10.10 (Yosemite) x86_64 but crashes on macOS 12.5 (Monterey) arm64.

I am using MLton provided by MacPorts. The version is called '20240519' which corresponds to commit 475cf2b and I still see the issue with a local build of the latest version, 20241230.

The example is a GTK application (using Giraffe Library) so the SML application's main function calls g_application_run (which is imported as a reentrant function) and most SML code is run in callbacks from C. Using lldb, I see crashes in various places but I will describe a scenario that is repeatable.

The SML application has a top-level declaration

val surface : Cairo.Surface.t option ref = ref NONE

where Cairo.Surface.t is an abstract type that is internally a finalizable pointer, MLton.Pointer.t Finalizable.t. surface is set to some value SOME ... in an SML callback and read in subsequent SML callbacks. In some subsequent SML callbacks, I can see that surface has the value that was set but at some point an SML callback occurs and the value of surface is corrupted, causing the program to crash. The SML code of this callback contains

  case !surface of 
    NONE         => ()
  | SOME surface =>
      let
        val () = print "D1\n"
        
        (* Paint to the surface, where we store our state *)
        val cr = Cairo.Context.create surface
        ...

When running under lldb the crash appears to occur when looking up the value for the argument of cairo_create (imported as Cairo.Context.create) due to an invalid address:

D1
Process 5598 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xb)
    frame #0: 0x0000000100021404 example-4-mlton`Chunk_9 + 1108
example-4-mlton`Chunk_9:
->  0x100021404 <+1108>: ldr    x0, [x8]
    0x100021408 <+1112>: mov    w8, #0x123c
    0x10002140c <+1116>: str    x8, [x26, #0x30]
    0x100021410 <+1120>: bl     0x1000675cc               ; symbol stub for: cairo_create
Target 0: (example-4-mlton) stopped.
(lldb) register read -format h x8
      x8 = 0x000000000000000b

The C function cairo_create is declared in cairo.h as

cairo_public cairo_t *
cairo_create (cairo_surface_t *target);

The pointer argument of cairo_create goes in x0 in arm64 and should be 0x101F4A1B0 but the pointer to where this value is stored has become corrupted and is 0xB. Here are the previous few instructions, in case that is useful:

(lldb) disassemble -s 0x1000213F0 -e 0x100021414
example-4-mlton`Chunk_9:
    0x1000213f0 <+1088>: b.hs   0x100021400               ; <+1104>
    0x1000213f4 <+1092>: mov    w8, #0x661
    0x1000213f8 <+1096>: b      0x100023690               ; <+9952>
    0x1000213fc <+1100>: sub    x26, x26, #0x38
    0x100021400 <+1104>: ldr    x8, [x26]
->  0x100021404 <+1108>: ldr    x0, [x8]
    0x100021408 <+1112>: mov    w8, #0x123c
    0x10002140c <+1116>: str    x8, [x26, #0x30]
    0x100021410 <+1120>: bl     0x1000675cc               ; symbol stub for: cairo_create
(lldb) register read -format h x26 x8
     x26 = 0x0000000101c72f20
      x8 = 0x000000000000000b

The corresponding C code generated by MLton is:

L_13437:
        T(Q, 0) = O(CPointer, S(Objptr, 0), 0);
        S(Word64, 48) = /* L_13436 */ 4668;
        StackTop = CPointer_add (StackTop, (Word64)(0x38ull));
        CReturnQ = cairo_create (T(Q, 0));
        goto L_13436;

(This must be as it is the only reference to cairo_create in the generated code.)

I am fairly sure that no finalization or garbage collection is occuring in the above scenario because the finalizers log to stdout and there is no such logging output.

Are there any MLton flags that I could use to get further information about what could be going wrong?

Also, in one version of this application, I had the error:

Thread.atomicSwitch didn't set r.
Process 5328 exited with status = 1 (0x00000001)

Is this something that an SML application could cause?

@MatthewFluet
Copy link
Member

Interesting.

My initial question would be whether or not the data corruption appears on amd64 platforms using the C codegen (-codegen c). That would make it a bit easier to investigate.

One place that you could get data corruption with callbacks is if there is an SML object pointer retained by some C code when an SML GC occurs (which would invalidate that object pointer). Currently, the runtime system does not have a mechanism for registering roots.

The Thread.atomicSwtich didn't set r. error could be a manifestation of the same issue, since it occurs when a shared reference is not properly updated when switching between SML threads: https://github.com/MLton/mlton/blob/master/basis-library/mlton/thread.sml#L118. But, in general, the invariants related to MLton.Thread operations should be maintained by the implementation and should not be able to be violated by SML user code.

@pclayton
Copy link
Author

pclayton commented Jan 6, 2025

My initial question would be whether or not the data corruption appears on amd64 platforms using the C codegen (-codegen c).

Ah, yes - it does! I hadn't considered the different backends until now. On macOS arm64, the issue occurs with -codegen c and -codegen llvm (there is no native codegen available). On Linux x86_64, the issue occurs with -codegen c but not with the native codegen.

That would make it a bit easier to investigate.

Certainly - I no longer need to use my wife's Mac!

One place that you could get data corruption with callbacks is if there is an SML object pointer retained by some C code when an SML GC occurs (which would invalidate that object pointer). Currently, the runtime system does not have a mechanism for registering roots.

I considered this and will keep looking out for this but I cannot see where this could be happening at the moment. C functions should be passed objects on the C heap only, created by C code. Also, if this was occurring, would it not affect native codegen also? Furthermore, the address that appears to be corrupted is 0x7ffff7faafc8 in the GDB log below but logging output from Giraffe Library (showing addresses of C objects) shows adresses like 0x77E000. I don't think the C code could corrupt addresses in the MLton stack but I will still look out for this.

Below is the equivalent debugging transcript to the LLDB one above but in GDB on Linux x86_64. In deriving the pointer argument of cairo_create, the program attempts to read address 0x1. At address 0x7ffff7faafc8, which appears to be StackTop, there should be a pointer which, when dereferenced, gives a pointer to a cairo_surface_t object. However, the pointer at StackTop is 0x1, and dereferencing that fails:

D1

Thread 1 "example-4-mlton" received signal SIGSEGV, Segmentation fault.
0x000000000045d589 in Chunk_12 (GCState=0x5463e0 <gcState.0>, StackTop=0x7ffff7faafc8, Frontier=0x7ffff7fb2878, nextBlock=2094) at example-4-mlton.0.c:3097
3097		T(Q, 0) = O(CPointer, S(Objptr, 0), 0);
(gdb) bt
#0  0x000000000045d589 in Chunk_12
    (GCState=0x5463e0 <gcState.0>, StackTop=0x7ffff7faafc8, Frontier=0x7ffff7fb2878, nextBlock=2094)
    at example-4-mlton.0.c:3097
#1  0x0000000000424424 in MLton_trampoline (mayReturnToC=true, nextBlock=<optimized out>, s=0x5463e0 <gcState.0>)
    at /opt/mlton/20241230/lib/mlton/include/c-main.h:35
#2  MLton_callFromC (localOpArgsResPtr=localOpArgsResPtr@entry=0x7fffffffd450) at example-4-mlton.2.c:18006
#3  0x00000000004246e3 in giraffe_closure_dispatch
    (x0=<optimized out>, x1=<optimized out>, x2=<optimized out>, x3=<optimized out>, x4=<optimized out>, x5=<optimized out>) at example-4-mlton.2.c:18025
#4  0x00007ffff6f9cdb0 in g_closure_invoke () at /lib64/libgobject-2.0.so.0
...
(gdb) disassemble /s 0x000000000045d586,0x000000000045d599
Dump of assembler code from 0x45d586 to 0x45d599:
example-4-mlton.0.c:
3097		T(Q, 0) = O(CPointer, S(Objptr, 0), 0);
   0x000000000045d586 <Chunk_12+1094>:	mov    (%rbx),%rax
=> 0x000000000045d589 <Chunk_12+1097>:	mov    (%rax),%rdi

3098		S(Word64, 48) = /* L_13391 */ 4707;
   0x000000000045d58c <Chunk_12+1100>:	movq   $0x1263,0x30(%rbx)

3099		StackTop = CPointer_add (StackTop, (Word64)(0x38ull));
3100		CReturnQ = cairo_create (T(Q, 0));
   0x000000000045d594 <Chunk_12+1108>:	call   0x4227f0 <cairo_create@plt>
End of assembler dump.
(gdb) info register rbx
rbx            0x7ffff7faafc8      140737353789384
(gdb) x/xg 0x7ffff7faafc8
0x7ffff7faafc8:	0x0000000000000001
(gdb) info register rax
rax            0x1                 1

I have attached the generated C files in example-4-mlton.tar.gz. I have also attached the minimal Giraffe library dependencies (Linux x64_64) in giraffe-lib-mlton.tar.gz. When these files are extracted in the same place, I can build the application with the command:

gcc example-4-mlton.*.c -o example-4-mlton -ggdb -I/opt/mlton/20241230/lib/mlton/targets/self/include -I/opt/mlton/20241230/lib/mlton/include -lm -lgmp -L/opt/mlton/20241230/lib/mlton/targets/self -lmlton -lgdtoa -Lgiraffe/lib/mlton -lgiraffe-glib-2.0 -lgiraffe-gobject-2.0 -lgiraffe-gio-2.0 -lgiraffe-cairo-1.0 -lgiraffe-gdk-3.0 -lgiraffe-gtk-3.0 `pkg-config --libs glib-2.0 gobject-2.0 gio-2.0 gio-unix-2.0 cairo-gobject gdk-3.0 gtk+-3.0`

replacing /opt/mlton/20241230 with the MLton home directory. (Some GTK dev package may be needed - see section Libraries here for a package manager command.) To debug, I run

GIRAFFE_DEBUG=log-all gdb ./example-4-mlton

To produce the same error in Chunk_12, try clicking in the bottom right of the white area. This may require several attempts. Clicking elsewhere, or just moving the cursor around will invariably generate some sort of failure.

The original SML program is attached in example-4.sml.txt. I will try to create a smaller example.

The Thread.atomicSwtich didn't set r. error could be a manifestation of the same issue, since it occurs when a shared reference is not properly updated when switching between SML threads: https://github.com/MLton/mlton/blob/master/basis-library/mlton/thread.sml#L118. But, in general, the invariants related to MLton.Thread operations should be maintained by the implementation and should not be able to be violated by SML user code.

Ok.

@pclayton pclayton changed the title Data corruption in callback to SML from C on arm64 Data corruption in callback to SML from C with c and llvm codegen Jan 6, 2025
@MatthewFluet
Copy link
Member

I was able to reproduce the segmentation fault on amd64-linux with example-4-mlton.tar.gz and giraffe-lib-mlton.tar.gz, although I could not build from source using giraffe-1.0.0-alpha.12.tar.gz.

I changed the compilation command from -lmlton to -lmlton-dbg to link with the debugging runtime system, which has many asserts. With that, rather than getting a segmentation fault, the program terminates with:

example-4-mlton: gc/stack.c:54: getStackTop: Assertion `isAligned ((size_t)res, s->alignment)' failed.

That corresponds to https://github.com/MLton/mlton/blob/on-20241230-release/runtime/gc/stack.c#L49-L56, which suggests that the used field of a stack doesn't have a valid value.

@pclayton
Copy link
Author

pclayton commented Jan 9, 2025

I was able to reproduce the segmentation fault on amd64-linux with example-4-mlton.tar.gz and giraffe-lib-mlton.tar.gz, although I could not build from source using giraffe-1.0.0-alpha.12.tar.gz.

Sorry - the previously attached example-4.sml.txt was for the latest version of Giraffe Library in Git, soon to be 1.0.0-alpha.13, which has some breaking changes (in the types of signal handler functions, which are needed to avoid reference cycles). For giraffe-1.0.0-alpha.12.tar.gz, this example is available in the Examples Guide. You should just need to change the MLton build command in Makefile to specify -codegen c. I checked with 1.0.0-alpha.12 and similar failures still occur.

Alternatively, the previously attached example-4.sml.txt could be used with the latest Git version. Once code is generated as described in the README file, the release directory can be used like a source release. To save you the trouble, I have attached the built source release as giraffe-g1422aac0.tar.gz. You can still use the Example Guide files to build with if you update example-4.sml.

I changed the compilation command from -lmlton to -lmlton-dbg to link with the debugging runtime system, which has many asserts. With that, rather than getting a segmentation fault, the program terminates with:

example-4-mlton: gc/stack.c:54: getStackTop: Assertion `isAligned ((size_t)res, s->alignment)' failed.

That corresponds to https://github.com/MLton/mlton/blob/on-20241230-release/runtime/gc/stack.c#L49-L56, which suggests that the used field of a stack doesn't have a valid value.

I also see the assertion failure:

example-4-mlton: gc/object.c:59: splitHeader: Assertion `1 == (header & GC_VALID_HEADER_MASK)' failed.

I presume these assertions are enabled by -debug true. I didn't know about that option - it's not mentioned in the compile-time options.

For the 1.0.0-alpha.12 version of this example, I sometimes see various other assertion failures with MLton 20241230 but these could be due to missing 'reentrant' keywords now added in Git. Sometimes there is no assertion failure, just a seg. fault.

With 20180207, I see the following when the application starts (no window appears):

example-4-mlton: gc/frame.c:25: getFrameLayoutFromFrameIndex: Assertion `findex < s->frameLayoutsLength' failed.

This occurs with 1.0.0-alpha.12 and the latest Git version.

With MLton 20130715, I have found that there is no issue using the C codegen. In 20130715, all imported functions are reentrant so this makes me wonder whether I have missed 'reentrant' off an _import somewhere. On the other hand, the native codegen works fine, so perhaps 'reentrant' is not the issue.

@MatthewFluet
Copy link
Member

I changed the compilation command from -lmlton to -lmlton-dbg to link with the debugging runtime system, which has many asserts. With that, rather than getting a segmentation fault, the program terminates with:

example-4-mlton: gc/stack.c:54: getStackTop: Assertion `isAligned ((size_t)res, s->alignment)' failed.

That corresponds to https://github.com/MLton/mlton/blob/on-20241230-release/runtime/gc/stack.c#L49-L56, which suggests that the used field of a stack doesn't have a valid value.

I also see the assertion failure:

example-4-mlton: gc/object.c:59: splitHeader: Assertion `1 == (header & GC_VALID_HEADER_MASK)' failed.

This is suggesting some kind of heap corruption, where a supposed object point isn't actually pointing at a valid object (because there is no valid object header in the expected place). This is consistent with the assertion error that I was seeing. Adding in some prints of the stack just before making the assertion, I see:

splitHeader (00000001)  objectTypeIndex = 0  tag = STACK  hasIdentity = FALSE  bytesNonObjptrs = 0  numObjptrs = 0
foreachObjptrInObject (0x00007f8ecfd8bcc0)  header = 00000001  tag = STACK  bytesNonObjptrs = 0  numObjptrs = 0
stack @ 0x00007f8ecfd8bcc0
		reserved = 140251349170712
		used = 107
example-4-mlton: gc/stack.c:57: getStackTop: Assertion `isAligned ((size_t)res, s->alignment)' failed.

Tha's clearly a "bogus" stack --- no way that we have a 140TB stack with only 107 bytes used!! So, we're actually trying to read an object, but getting a somewhat invalid header --- one that has a 1 in the low bit (so passes the gc/object.c:59: splitHeader) and has 0 for the type index bits (so gets interpreted as the object header for a STACK), but clearly this is not a stack.

Turning on more debugging output (requires recompiling the runtime system with DEBUG and some of the DEBUG_* enums set to TRUE), I see the following:

Checking nursery.
foreachObjptrInRange  front = 0x00007f8ecfd8bcb8  *back = 0x00007f8ecfd93e20
  front = 0x00007f8ecfd8bcb8  *back = 0x00007f8ecfd93e20
0x00007f8ecfd8bcc0 = advanceToObjectData (0x00007f8ecfd8bcb8)
splitHeader (00000001)  objectTypeIndex = 0  tag = STACK  hasIdentity = FALSE  bytesNonObjptrs = 0  numObjptrs = 0
foreachObjptrInObject (0x00007f8ecfd8bcc0)  header = 00000001  tag = STACK  bytesNonObjptrs = 0  numObjptrs = 0
stack @ 0x00007f8ecfd8bcc0
		reserved = 140251349170712
		used = 107
example-4-mlton: gc/stack.c:57: getStackTop: Assertion `isAligned ((size_t)res, s->alignment)' failed.

What's interesting here is that checking the old generation succeeded without issue, but this is the first object in the nursery that is tripping things.

I presume these assertions are enabled by -debug true. I didn't know about that option - it's not mentioned in the compile-time options.

Yes, there are -debug true (compile the C or assembly files with -g and link to the libmlton-dbg library) and -debug-runtime true (just link to the libmlton-dbg library) compile options; they aren't documented because they don't really provide a useful SML debugging experience; they are mainly for MLton developers.

For the 1.0.0-alpha.12 version of this example, I sometimes see various other assertion failures with MLton 20241230 but these could be due to missing 'reentrant' keywords now added in Git. Sometimes there is no assertion failure, just a seg. fault.

With 20180207, I see the following when the application starts (no window appears):

example-4-mlton: gc/frame.c:25: getFrameLayoutFromFrameIndex: Assertion `findex < s->frameLayoutsLength' failed.

This occurs with 1.0.0-alpha.12 and the latest Git version.

That could still arise from the same corruption --- a bogus object pointer that gets interpretted as a STACK, where the bogus data has an aligned used value, but can't map a (probably the first) return address to GC frame data.

With MLton 20130715, I have found that there is no issue using the C codegen. In 20130715, all imported functions are reentrant so this makes me wonder whether I have missed 'reentrant' off an _import somewhere. On the other hand, the native codegen works fine, so perhaps 'reentrant' is not the issue.

Any _import that might call an _export-ed function should be marked as reentrant. Missing one could certainly leave the ML thread/stack that call's the _import-ed function in a bad state, such that when the _export-ed function is called and triggers a GC, the saved thread can't be properly handled. (Though, I might expect that to crash during the GC proper; the failing assertion during Checking nursery suggests that some previous GCs succeeded, but left the heap in a bad state. Though, one possibility is that returning from the _import-ed (but not reentrant) function resumes execution with "old" values for the ML thread/stack, which, after a GC that moves the thread/stack, would be pointing to some arbitrary piece of the heap.)

As to why the native codegen works, one possibility is the fact that the native codegen "knows" that the StackTop and Frontier locals "belong" to the GC state. So, when it needs to free up a register, it will write those back to the GC state. But, the C and LLVM codegen's don't "know" that the GC state is a "safe" place to write those locals; instead, they will be spilled to the C stack. So, with the native codgen, those values might happen to be in the right place during a GC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants