2007-Oct-07
-----------
So far we've been very antagonistic towards any "clever" requirements
in a runtime: if the semantic model doesn't *spell out* the
optimization strategy, chances are no runtime will ever manage to be
smart enough to "discover" one.

This position has failed to address one important issue, and I need to
nail it down before going on: how do you efficiently schedule and
dispatch message-receives? If we can't do that efficiently, we're
sunk. Efficient here means: constant space per proc, constant
dispatching time.
So let's sketch the situation. Note this is not necessarily related to
handling dynamic alts and binding-channels-to-sets; it might be nice
if it handles those, but it's not essential. This is for the case of a
fixed set of ports and a fixed set of dispatching alts visible in the
proc. The only variable quantity is the number of queued callers.
So, naively -- worst case -- we have 1 run queue including the
sleepers, we pull procs out in random order, each sleeper points to
the target it's trying to enter, and the first one to enter a proc
wins. That works, but it effectively busy-waits the sleepers whenever
anyone is making progress; you only idle when *everyone* is blocked.
The crucial problem is devising the constant-size data structure that
maps N callers into queues in a variable partition of between 0 .. K
queues, such that you can do the following in O(1):

  - add a member to, or remove a member from, a queue
  - figure out if there's anyone in a queue
  - pick a representative from that queue at random
First, note the following facts:

  - doubling/halving vectors waste no more than 75% of their space
  - 2-space copying collectors waste no more than 50% of their space

So we build a 2+10N-word structure given N processes:

  - 2 4N-word vectors representing 2 spaces for doubling/halving
    queue sub-vectors, densely packed, copied back and forth between
    the spaces as they need reallocation.
  - 2 control words representing "end of current space" in each space.
  - a 2N-word vector of 2-word control records that refer into the
    2 spaces.
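
A minimal sketch of the bookkeeping this buys, in Rust. It substitutes
one plain vector per queue for the two packed copying spaces (so it
shows the O(1) operations, not the space bound), and the names
(Scheduler, Slot, add, remove, pick) plus the use of the rand crate
are all illustrative:

    use rand::Rng; // any O(1) PRNG will do; the note suggests ISAAC

    /// One waiter's control record: which queue it sits in, and where.
    #[derive(Clone, Copy)]
    struct Slot { queue: usize, index: usize }

    struct Scheduler {
        queues: Vec<Vec<usize>>, // densely packed waiter IDs, per queue
        slots: Vec<Slot>,        // per-waiter control records
    }

    impl Scheduler {
        fn new(n_waiters: usize, n_queues: usize) -> Scheduler {
            Scheduler {
                queues: vec![Vec::new(); n_queues],
                slots: vec![Slot { queue: 0, index: 0 }; n_waiters],
            }
        }

        /// O(1) add: append to the dense sub-vector, record the position.
        fn add(&mut self, w: usize, q: usize) {
            self.slots[w] = Slot { queue: q, index: self.queues[q].len() };
            self.queues[q].push(w);
        }

        /// O(1) remove: swap-remove, then fix the moved waiter's record.
        fn remove(&mut self, w: usize) {
            let Slot { queue, index } = self.slots[w];
            self.queues[queue].swap_remove(index);
            if let Some(&moved) = self.queues[queue].get(index) {
                self.slots[moved].index = index;
            }
        }

        /// O(1) emptiness check.
        fn has_waiters(&self, q: usize) -> bool {
            !self.queues[q].is_empty()
        }

        /// O(1) random representative from a nonempty queue.
        fn pick(&self, q: usize, rng: &mut impl Rng) -> usize {
            self.queues[q][rng.gen_range(0..self.queues[q].len())]
        }
    }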
So a million processes require <= 10 million words, or ~40MB, of
scheduling queues on a 32-bit machine. Not great, but not bad.
Bounded, at least. And really, no process is going to eat less than 10
words of stack and control state anyway; we just don't want to wind up
charging more for the queues than we do for the process abstraction in
a neutral-ish, low-stack form (say an auto proc with small-ish ports).

The proc can get by with 1 word in its header for every queue, and can
combine all the auto ports into a single queue. So for the abundant
million-copy auto-proc sort of procs: 1 word, period, pointing to the
queue ID in the entry scheduler. Yay.
This gives us O(1) randomized scheduling on every queue, assuming we
use an O(1) PRNG (I like ISAAC). Then when you execute an alt, it
picks 2 words from the PRNG, uses the first to pick a queue and the
second to pick a caller within the queue to wake (or should it pick 1
word mod the total number of waiters, then distribute to the queue
that held that waiter? slightly slower to dispatch if you have a large
number of ports, but it gives a different model of fairness...).
Either way: simple, fast, easy to guess what will happen, and it
smooths out load lumps at every step of the system.
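
The two policies differ only in how the random words are spent. A
sketch of both, over the illustrative dense queues from above (a real
version of the first policy would keep a dense list of nonempty
queues so the choice stays O(1); the filter here is just for brevity):

    use rand::Rng;

    /// Per-queue fairness: one word picks a nonempty queue uniformly,
    /// a second picks a waiter within it.
    fn pick_per_queue(queues: &[Vec<usize>], rng: &mut impl Rng)
        -> Option<usize>
    {
        let nonempty: Vec<&Vec<usize>> =
            queues.iter().filter(|q| !q.is_empty()).collect();
        if nonempty.is_empty() { return None; }
        let q = nonempty[rng.gen_range(0..nonempty.len())];
        Some(q[rng.gen_range(0..q.len())])
    }

    /// Per-waiter fairness: one word mod the total number of waiters,
    /// then walk the queues to find who held that slot (the O(K) walk
    /// is the "slightly slower to dispatch" cost).
    fn pick_per_waiter(queues: &[Vec<usize>], rng: &mut impl Rng)
        -> Option<usize>
    {
        let total: usize = queues.iter().map(|q| q.len()).sum();
        if total == 0 { return None; }
        let mut i = rng.gen_range(0..total);
        for q in queues {
            if i < q.len() { return Some(q[i]); }
            i -= q.len();
        }
        unreachable!()
    }

With a million waiters on port A and one on port B, the first policy
wakes B's waiter with probability ~1/2, the second with probability
~1/1,000,001 -- which is exactly the question below.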
That last question is curious. Suppose you have a proc with a million
processes waiting on port A, and 1 process waiting on port B. Should
the B process get a ~1-in-a-million chance to run, or a ~1/2 chance?
Which is easier to imagine the designer wanting more often? Which is
easier for the designer to work around if they mean the other thing?
I think the designer is more likely to want a guarantee of the latter
(the 1/2 chance), because they have more to count on that way; they
can always combine event streams dynamically if they want the other
sort of behavior. It's easy enough to have a proc hold an internal
work buffer that it mixes its own trouble into and processes as it
wishes. Also, it doesn't hurt that the 1/2 case scales better :)
Also note that if you have dynamic ports, you may have a larger-size
scheduling pool.

Woo! I think this issue is (finally) solved. I chewed on it for 24+
hours solid. Bleah.
2007-Oct-10
-----------
Multithreading.

Hmm. I am generally not interested in multithreading: it's not worth
the pain. There are reasons you might want it, but they're few.

But let's just jot down "the simplest possible MT solution" and see if
it works. First, note that modern systems guarantee that word-aligned
writes are atomic, and they all have an atomic compare-and-swap (CAS)
instruction. So you are never going to have to race over refcount
operations or twiddling owner IDs or whatever. So:
  - Every directly limited type has a single owner, which implies a
    single reference. That much is obvious.

  - If I have a single reference and limitation infects, then the path
    from the thread register that owns me, down to me, is made of
    single-referenced values.

  - So: directly or indirectly limited slots make up a true tree.
    Always. There is always a true tree of process ownership, for
    example. Moving process references around only happens between
    portions of a tree-shaped data structure. If a thread can see node
    N in that tree, it is the only thread that can see the subtree
    under N (sketched just below).
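
    A minimal sketch of that invariant in Rust, whose owning pointers
    happen to model it directly (Proc and detach are made-up names):

        /// A single-owner tree of processes: each child has exactly
        /// one owning reference, so the structure is a true tree.
        struct Proc {
            id: u32,
            children: Vec<Box<Proc>>, // owning edges
        }

        /// Detach the i-th subtree, transferring exclusive ownership
        /// to the caller: whoever holds the returned Box is the only
        /// one who can see (or ship to another thread) the subtree.
        fn detach(parent: &mut Proc, i: usize) -> Box<Proc> {
            parent.children.swap_remove(i)
        }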
  - The only stuff that might be "visible" to multiple threads
    simultaneously is the daggy "value" stuff. That's all refcounted
    along pointer edges, and we CoW any "value" into our own thread's
    tree any time we're performing an update: we don't expect other
    threads to "see" changes we make. We always edit "our own copy".
    There is no semantic "shared visibility mutation" in this
    language; there never has been.

  - So all we need to reason about in MT terms is the CoW system, the
    message queue system, and interactions with the outside world.
    Which, unfortunately, we *always* have to reason about quite
    carefully anyway.
  - CoW is CAS-based. The path from your thread registers/stack to a
    heap memory write is *always* the lval-lookup LHS of an assignment
    statement. So to perform such a write you go step by step,
    ensuring refcount=1 as you go. If you see a >1 edge you must
    pause, make a shallow copy, rewrite "your" rc=1 base value that's
    holding the shared edge to instead hold your copy, and carry on.
    When you get to the leaf, you write.

    When you are forming a new reference to a shared value, you must
    increment the shared value's refcount safely. You do this with a
    CAS.

    When you are dropping a refcount because you're freeing a value,
    you must get the refcount to 0, then free. If you CAS the count
    1->0 successfully, you're the one who gets to free the memory. If
    you're unsuccessful, someone else beat you to it (or re-referenced
    the value) and you're off the hook. (A sketch of this protocol
    follows this item.)

    Aliases permit concurrent reads, but reading through them should
    always use a memory-barrier pointer-read sort of thing to ensure a
    coherent view. Or maybe it's that writes to heap pages always need
    to run a memory-barrier op on the containing object? I can never
    remember...
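
    A sketch of the refcount half of that protocol, using Rust
    atomics (Shared, add_ref and release are illustrative names, and
    the memory orderings are one plausible choice, not something
    pinned down here):

        use std::sync::atomic::{AtomicUsize, Ordering};

        /// Header of a shared, refcounted value.
        struct Shared {
            rc: AtomicUsize,
            // ... payload would follow ...
        }

        /// Forming a new reference: bump the count atomically. The
        /// caller must already hold a reference, so rc >= 1 here.
        fn add_ref(v: &Shared) {
            v.rc.fetch_add(1, Ordering::Relaxed);
        }

        /// Dropping a reference: whoever moves the count 1 -> 0 via a
        /// successful CAS is the one thread allowed to free the
        /// memory; a failed CAS means another thread changed the
        /// count, so retry from the freshly observed value.
        fn release(v: &Shared) -> bool /* true: caller frees */ {
            let mut seen = v.rc.load(Ordering::Acquire);
            loop {
                match v.rc.compare_exchange(
                    seen, seen - 1, Ordering::AcqRel, Ordering::Acquire)
                {
                    Ok(_) => return seen == 1,
                    Err(now) => seen = now,
                }
            }
        }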
  - The message queue system is a little hairier, but not hugely. You
    have a single Big Scheduler Of Processes and a bunch -- say 64 or
    512 -- of threads. You can activate only as many procs as you have
    threads at any moment, and you may occasionally have one proc wait
    on another due to a message send or an attempted proc lifecycle
    event; in those cases the proc becomes blocked, the thread sticks
    it back in the run queue, and it's "dormant data" for other
    threads to diddle. You only possibly get stuck when you have an
    imperative like "I'm dying" and you are trying to clean up a proc
    you own. You can post the proc-death event, but it might take a
    moment for the proc to "really" die. So you have a sort of
    grim-reaper pseudo-proc that you have to message synchronously to
    confirm the death of another proc.

    ... OR ... you never have 1 thread running a proc while another
    thread runs a parent proc. In order to activate a proc you have to
    block -- all the way up to the root of the proc tree -- its
    parents. So they're only likely to give you a little moment to run
    in, not a lot. That is probably a better rule. Explicit priority
    delegation: no inversion possible! The thread scheduler has to
    perform the topological allocation (sketched just below).
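
    A sketch of that activation rule, assuming the scheduler keeps one
    state word per proc and a parent link (the representation and all
    names are made up): a proc may run only if no ancestor is running,
    and activating it blocks the whole chain up to the root.

        #[derive(Clone, Copy, PartialEq)]
        enum State { Idle, Running, BlockedForChild }

        struct ProcTree {
            parent: Vec<Option<usize>>, // parent[i]; None at the root
            state: Vec<State>,
        }

        impl ProcTree {
            /// Refuse activation if any ancestor is running; otherwise
            /// block the ancestor chain and run the proc. Priority is
            /// delegated explicitly, so it can't invert.
            fn try_activate(&mut self, p: usize) -> bool {
                let mut a = self.parent[p];
                while let Some(i) = a {
                    if self.state[i] == State::Running { return false; }
                    a = self.parent[i];
                }
                let mut a = self.parent[p];
                while let Some(i) = a {
                    self.state[i] = State::BlockedForChild;
                    a = self.parent[i];
                }
                self.state[p] = State::Running;
                true
            }
        }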
  - The other hairy part is that the message-send queue, the proc run
    queue, and the idle-thread queue probably all have to be fancy
    MT-safe lockless queues, because the threads are all going to be
    independently and concurrently hammering on those things. But such
    structures exist (one is sketched below).
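
    For instance, a sketch of the proc run queue on top of an
    off-the-shelf lock-free MPMC queue (this assumes the
    crossbeam-queue crate, version 0.3, where SegQueue::pop returns an
    Option; ProcId and the worker-loop shape are made up):

        use std::sync::Arc;
        use std::thread;
        use crossbeam_queue::SegQueue; // lock-free MPMC queue

        type ProcId = usize;

        fn main() {
            // Shared run queue: workers push blocked procs back and
            // pop the next runnable one, with no lock held anywhere.
            let run_q: Arc<SegQueue<ProcId>> = Arc::new(SegQueue::new());
            for p in 0..1000 {
                run_q.push(p);
            }

            let workers: Vec<_> = (0..64).map(|_| {
                let q = Arc::clone(&run_q);
                thread::spawn(move || {
                    // Pop until the queue drains; a real scheduler
                    // would park idle threads instead of returning.
                    while let Some(_proc_id) = q.pop() {
                        // ... run the proc for a slice ...
                    }
                })
            }).collect();

            for w in workers {
                w.join().unwrap();
            }
        }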
---
Native pointers are limited, non-transmittable. Native functions are
statically imported and statically bound, single-entry, CDECL. A crate
definition can specify the transitive closure of acceptable native
dependencies, or * if you don't care.