Arc Forum | arc2c: a proposed threading model

Arc Forum

5 points by almkglor 6419 days ago | 12 comments

I'm proposing a "green threads" model for threading on the arc2c output.

arc2c converts all functions into "case" blocks for a big switch statement. When a function needs to call another function, it loads a "pc" variable with the number for that function's case statement and jumps to the top. It's like a trampoline.

Since all "jumps" actually go to the top of the switch structure, it may be possible to change execution between threads during the trampoline. Basically, each thread contains its own stack and pc counters, as well as its state (running or dead).

We maintain a circular list of threads.

When the trampoline is entered we commit the sp, stack, pc, and state into the thread's structure. Then we get the 'cdr of the circular list of threads, check if that thread is dead, and while the cdr thread dead, get its cdr and replace the current cdr with that, until we reach the current thread (and if it's dead, exit), or we reach a live thread.

If the thread is live we copy its sp, stack, pc, and state, and proceed to the switch structure.

When building a new thread, we construct a stack with the provided function's closure and a %halt continuation (which are always the first parameters to all functions in the arc2c output), make a new sp that points just beyond the parameters, set the pc to the provided function, and set the state to alive, then insert this to the circular list of threads.

We could also make this part optional, so that non-threaded applications will not have the overhead of the thread-switching.

The advantage is that we can get threads even without OS support (say on an embedded system without a threaded RTOS...), and thus be very portable. Also, this makes atomic operations trivial: we can wrap atomic operations in a pair of (%start-atomic) and (%stop-atomic) primitives, which simply set a C-variable, atomic; when the trampoline detects that the C-variable atomic is set it simply skips over threading unless the state is dead (although there is a potential problem with this if an atomic operation, say, exits via a continuation). Finally, the Boehm GC currently being used doesn't seem to handle "real" threads very well, so our emulation of multiple threads would work.

The disadvantages are that I/O operations will block all threads, unless we complicate our threading model, wherein we have a "blocked" state, and use only asynchronous I/O (including the current async I/O with the blocked state, so that the trampoline can poll the I/O operation and unblock the thread if the I/O has completed). Also, we cannot expect the OS to distribute execution to other processors/cores (obviously because we're not asking for it).

3 points by stefano 6419 days ago | link

This strategy seems viable, but it seems slow, because there are a lot of checks at every thread switching. If we could manage to transform every Arc function into a C function (actually an array containing the C function address and the closed vars) we could then rely on the pthread library for multi-threading, or the equivalent threading library available on the host system.

-----

3 points by almkglor 6419 days ago | link

The problem with transforming each Arc function to a C function is the stack space consumed at each call. Sure, gcc does tail call optimization, but not all compilers are gcc.

Chicken fixes this by keeping track of stack and GC'ing the stack when it's full; I'm not sure about Bigloo but I hear it's got a pretty good Scheme function == C function equivalency, so it might be useful to see how they fixed the tail call optimization problem there.

That said, the current execute() function accepts a pc argument, which specifies which Arc function it begins with. It may be possible to pass the pc to go to together with a new stack, but I don't know pthreads.

-----

2 points by stefano 6419 days ago | link

I think you could trust every decent C compiler to do tail recursion optimization, but if you want full compatibility then the mapping Arc fun -> C fun doesn't work. pthread requires you to pass a pointer to a C function i.e. an adress where to jump. We could wrap every thread within a C function and leave all the other functions as they are currently implemented. The C function would just call execute() with the right parameters. If execute() doesn't use global vars, there will not be race condition problems.

-----

2 points by almkglor 6419 days ago | link

Well, I've started to import bits of execute from globals to locals. However I do have access to a few bits of global variables, specifically the quoted-constants array (those created by 'foo and '(a b c d), etc.); this table is initialized at startup (before the first call to execute). I would suppose this read-only array would be okay to access?

As for wrapping them in C functions: the problem is that the most basic Arc threading function isn't 'thread, it's 'new-thread, which accepts a function as input. 'thread is defined as:

  (mac thread body
    `(new-thread (fn () ,@body)))

In theory, new-thread could be called with any function:

  (new-thread
   (if
    (something)
      (fn () (ha-ha-ha))
    (something-else)
      (fn () (he-he-he))
      (fn () (je-je-je))))

Sure, it won't happen most of the time, since most people will sensibly use the simpler 'thread, but exploratory, exploratory...

It would be possible to implement if pthreads or whatnot can pass even just a single pointer to the newly-threaded function, but if it can't (why not?) then our alternative is to create a bunch of C functions for each Arc function, which just calls the execute() function with the correct number.

---------

Edit: okay, I did a little eensy-weensy bit of research on pthreads, and it seems that pthreads can pass a pointer to the called C function.

So I suppose we can modify execute's signature:

  obj execute(long pc, obj stack[], obj *sp);
  int main(void){
    /*init...*/
    obj *stack = GC_MALLOC(sizeof(obj)*MAX_STACK);
    obj *sp = stack;
    execute(0, stack, sp);
  }

Then the NEWTHREAD() primitive would be:

  typedef struct{
    long pc;
    obj *stack;
    obj *sp;
  } newthreadpasser;
  #define T_THREAD 9999 //some random number
  typedef struct{
    long type; /*T_THREAD*/
    pthread_t val;
  } thread;
  #define NEWTHREAD() \
   {obj *newstack = GC_MALLOC(sizeof(obj) * MAX_STACK);\
    obj *newsp = &newstack[2]; \
    newstack[0] = POP(); \
    newstack[1] = THREAD_EXIT_CONTINUATION;\
    newthreadpasser * np = GC_MALLOC(sizeof(newthreadpasser));\
    np->pc = ((obj*)newstack[0])[0];\
    np->stack = newstack;\
    np->sp = newsp;\
    thread * threadobj = GC_MALLOC(sizeof(thread));\
    threadobj->type = T_THREAD;\
    pthread_create(&(threadobj->val), NULL, threadcall, (void*) np)\
    PUSH((obj)threadobj);}

Finally, threadcall:

  void *threadcall(void * args){
    newthreadpasser * np = (newthreadpasser *) args;
    execute(np->pc, np->stack, np->sp);
  }

Hmm.

-----

2 points by stefano 6418 days ago | link

This could work. As for global variables, if they are read-only, then there won't be any problem. What happens if you load in parallel two different modules (with their constants)? Maybe loading a file should be made atomic, or at least a part of it, such as constant values initialization.

-----

1 point by almkglor 6418 days ago | link

Well, a "load" at top level simply inserts the code for the loaded file directly into the source for now. Any other load statement is not supported.

-----

1 point by kranerup 6416 days ago | link

Using pthreads might have some benefits but speed is not a reason for choosing pthreads. On Linux systems that maps pthreads onto kernel threads I've seen a two order of magnitude slower execution compared to an user space thread library. In general I'd say that if I/O is important then using pthreads would make sense but if speed is important then green threads is preferable.

-----

2 points by stefano 6416 days ago | link

I thought pthreads were fast, but anyway on computers with more than one core/processor, which are very common, green threads don't distribute the threads between the cores, because this is the OS's duty, and it doesn't know anything about green threads.

-----

3 points by almkglor 6416 days ago | link

Hmm. After a bit of research, it seems that some VM's handle this by using pthreads a few times (presumably one per core), then having each pthread spawn several green threads.

-----

2 points by stefano 6416 days ago | link

This seems a good solution if we can arrange to spawn a new pthread also for blocking I/O, maybe by catching I/O functions and putting into a pthread the I/O operation caller. This hybrid would make the compiler code not very clean, because it should handle both green threads and pthreads.

-----

1 point by sacado 6419 days ago | link

I think that's an excellent idea too. Boehm doesn't seem to be very good with threads. And green threads make things like the use of extended threading à la Erlang much easier. IO would probably be a problem, but that's something that can be defered.

-----

3 points by almkglor 6419 days ago | link

Yes, true. It would also be a pretty simple solution IMO, although potentially fraught with some dangers, especially when checking dead threads.

In any case I'm currently thinking of adding a library, where if a global variable is not read (only assigned to), it is simply removed.

Basically: extract the set of global variables that are read, and the set that are written to. If both sets are not equal, eliminate all set operations on members of the written set that are not members of the read set (replace them with their (car ast!subx)). Then eliminate any hanging constants: naked fn definitions and other constants that are in sequences not in the tail position. Repeat until set of global variables that are written is a full subset of the set of global variables that are read, and if they're not equal, raise an unbound variable error (i.e. a global is read but never assigned to, which as you mentioned would segfault. Not perfect of course, because reading the global before it is assigned to will still crash, but better than nothing).

-----