Parallelism abstraction ideas
Where parallel evaluation could happen
The obvious place to introduce parallelism into the decision tree algorithm is where the algorithm branches off into subtrees, which may in turn branch off into further subtrees.
Profiling under Racket has shown that the exhaustive search for the best split of a subtree's dataset is what takes the most time. This is another place where parallelism can happen: one could split the search into multiple parts and start multiple units of evaluation running in parallel, each finding the best split for one part of the data.
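As a rough illustration of this second idea (a sketch with stand-in names and a fake cost function, not the project's code), the candidate splits could be divided into parts, each part searched in its own thread via Guile's n-par-map, and the partial winners combined at the end:

    (use-modules (ice-9 threads)   ; n-par-map
                 (srfi srfi-1))    ; reduce, filter, iota

    ;; Stand-in cost function; the real algorithm would evaluate the
    ;; quality of a candidate split here (e.g. via gini-index).
    (define (split-cost candidate)
      (expt (- candidate 42) 2))

    ;; Sequential exhaustive search over one part of the candidates.
    (define (best-split candidates)
      (reduce (lambda (a b) (if (< (split-cost a) (split-cost b)) a b))
              #f
              candidates))

    ;; Divide the candidates into 4 parts, search each part in its own
    ;; thread, then pick the best of the 4 partial winners.
    (define all-candidates (iota 1000))
    (define parts
      (map (lambda (k)
             (filter (lambda (c) (= (modulo c 4) k)) all-candidates))
           (iota 4)))
    (display (best-split (n-par-map 4 best-split parts)))
    (newline)   ; => 42

With a real cost function this keeps the per-thread work coarse-grained, instead of spawning one unit of evaluation per candidate split.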
Parallelism constructs
- Different Scheme dialects might support different parallelism constructs or possibly none at all. So the abstraction should account for the fact that for some Scheme dialect there might not be a simple way to evaluate expressions in parallel.
- It might not always be desirable to run the maximum possible number of tasks or jobs in parallel. The decision of how many jobs should run in parallel should be up to the user of the algorithm. It should be configurable, probably via a keyword argument for the fit procedure (see the sketch below).
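For example (a sketch only; the real fit procedure and its arguments are not spelled out here), the job count could default to 1 and switch to a parallel code path only when the caller asks for it:

    (use-modules (ice-9 threads))   ; n-par-map

    (define (expensive-work x)      ; stand-in for building one subtree
      (* x x))

    ;; Hypothetical fit signature: the degree of parallelism is a keyword
    ;; argument with a sequential default.
    (define* (fit dataset #:key (n-jobs 1))
      (if (= n-jobs 1)
          (map expensive-work dataset)
          (n-par-map n-jobs expensive-work dataset)))

    (display (fit '(1 2 3 4)))             (newline)  ; sequential
    (display (fit '(1 2 3 4) #:n-jobs 2))  (newline)  ; two threads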
About various ways of parallelizing the algorithm
- fibers supports a #:parallelism keyword, so the number of fibers running in parallel can be set for run-fibers, which creates a scheduler that might be used later, possibly avoiding having to keep track of how many threads are being used.
- I need to check whether it is possible to introduce more running fibers than a top-level scheduler specifies by spawning new fibers from within fibers.
- with parallel forms, there are n-par-map and n-par-for-each, where one can specify the maximum number of threads running in parallel.
- However, when processing one subtree, one does not know whether the processing of the other subtree has finished and thus cannot know how many other threads are currently in use. Calling one of these procedures again might run more threads than were specified at an upper or previous level in the tree. The same issue might occur with fibers.
- See /reference-manual/guile/Parallel-Forms.html
- with futures, one cannot specify the maximum number of futures to run.
- In order to control how many threads are evaluating code at the same time, one would have to build some kind of construct around them that keeps track of how many futures are running or could be running.
- One could do something ugly and create a globally defined active-thread counter, which requires locking, to keep track of the number of threads or futures running in parallel (see the sketch after this list).
- As the number of threads running at the same time is limited to a maximum given by (current-processor-count), maybe the number can be reduced from outside the program by using taskset (the fibers manual mentions it: https://github.com/wingo/fibers/wiki/Manual#Parallelism). Perhaps setaffinity can also be used to specify on which cores threads must run.
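A rough sketch of the counter idea mentioned above (names are made up, not the project's code): a mutex-protected counter decides whether a piece of work may be handed to a future or has to run sequentially in the current thread:

    (use-modules (ice-9 threads)    ; make-mutex, with-mutex
                 (ice-9 futures))   ; future, touch

    (define max-jobs 4)
    (define active-jobs 0)
    (define jobs-mutex (make-mutex))

    (define (try-reserve-job!)
      (with-mutex jobs-mutex
        (if (< active-jobs max-jobs)
            (begin (set! active-jobs (+ active-jobs 1)) #t)
            #f)))

    (define (release-job!)
      (with-mutex jobs-mutex
        (set! active-jobs (- active-jobs 1))))

    ;; Returns a thunk that yields the result, either by touching a
    ;; future or from an immediate sequential evaluation.
    (define (maybe-parallel thunk)
      (if (try-reserve-job!)
          (let ((f (future (let ((result (thunk)))
                             (release-job!)
                             result))))
            (lambda () (touch f)))
          (let ((result (thunk)))
            (lambda () result))))

    ;; Usage: two "subtree" computations that may run in parallel.
    (define left  (maybe-parallel (lambda () (apply + (iota 1000)))))
    (define right (maybe-parallel (lambda () (apply + (iota 2000)))))
    (display (list (left) (right))) (newline)

This caps the number of futures created by this code path, but it cannot see threads created anywhere else in the program.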
Idea about using fibers recursively
- If at tree depth 0, use run-fibers with #:parallelism.
- If not at tree depth 0, that is at any deeper level, use run-fibers with #:parallelism 1.
- Problem: might not spawn enough fibers then.
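A minimal sketch of how the two subtrees could be handed to fibers (requires the guile-fibers library; heavy-work is a stand-in, not the project's code):

    (use-modules (fibers) (fibers channels))

    (define (heavy-work n)          ; stand-in for processing one subtree
      (let loop ((i 0) (acc 0))
        (if (= i n) acc (loop (+ i 1) (+ acc i)))))

    (run-fibers
     (lambda ()
       (let ((ch (make-channel)))
         ;; left subtree in its own fiber, allowed to run on any scheduler
         (spawn-fiber (lambda () (put-message ch (heavy-work 1000000)))
                      #:parallel? #t)
         ;; right subtree in the current fiber
         (let* ((right (heavy-work 2000000))
                (left  (get-message ch)))
           (display (list left right)) (newline))))
     #:parallelism 2)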
Procedures in the parallelized code path
Looking at all (recursively) participating procedures in the code that runs in parallel should give a clear picture of whether the fault for bugs happening in parallel execution can lie in any of them.
select-min-cost-split
- can raise a condition in case of wrong arguments (empty list of splits)
get-best-split-for-column
make-split
- is a constructor
- only ever creates new struct instances
dataset-get-col
dataset-map
- uses only map internally; is just part of an abstraction layer over lists
data-point-get-col
- is just an alias for vector-ref
dataset-column-empty?
- is an alias for null?; is part of an abstraction layer
dataset-column-first
- is an alias for car; is part of an abstraction layer
split-data
- makes use of receive, inside which the bug happened where < got an argument of the wrong type (a rough reconstruction of this pattern is sketched after this list)
dataset-partition
- is a wrapper for partition (or actually an alias)
- is part of the abstraction layer over lists
partition
- is a library function from srfi-1
data-point-get-col
- alias for vector-ref
- part of abstraction layer over vectors for data points
<
- standard mathematical function
list
- is the list constructor
- standard function
split-quality-proc
- is in this case actually gini-index, given as an argument
gini-index
+
- standard math function
apply
- standard higher-order function
map
- standard higher-order function
calc-proportion
dataset-empty?
- abstraction layer over lists
- alias for null?
dataset-length
- abstraction layer over lists
- alias for length
count
- SRFI-1 higher-order function
data-point-get-col
- alias for vector-ref
- part of abstraction layer over vectors for data points
/
- standard math function
*
- standard math function
-
- standard math function
iter-col-vals
- is a named let, defined in get-best-split-for-column itself
dataset-column-rest
- is an alias for cdr
- is part of an abstraction layer over lists
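To make the split-data item above more concrete, here is a rough reconstruction of that pattern (plain lists and vectors with made-up data; the project's real procedures may differ): the data points are partitioned on whether the value in the split column is < the split value, and receive binds the two halves:

    (use-modules (srfi srfi-1)    ; partition
                 (srfi srfi-8))   ; receive

    (define (data-point-get-col point col)   ; alias for vector-ref
      (vector-ref point col))

    (define (split-data dataset col value)
      (receive (smaller rest)
          (partition (lambda (point)
                       (< (data-point-get-col point col) value))
                     dataset)
        (list smaller rest)))

    ;; Made-up example dataset: three data points as vectors.
    (define example-dataset (list #(1 10) #(5 20) #(3 30)))
    (display (split-data example-dataset 0 4)) (newline)
    ;; => ((#(1 10) #(3 30)) (#(5 20)))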
Thoughts
- Perhaps some function from SRFI 1 is not suitable for running in parallel?
- Perhaps the implementation of the parallel forms is not correct?
- Perhaps the underlying threading code for parallel forms is not correct?
- Perhaps recursively splitting into parallel executions for branches of the tree is not a use case foreseen for parallel forms?
- Perhaps a language primitive is not usable in parallel code? (unlikely)
- Maybe the receive form is buggy in parallel execution.
- This is my main suspect.
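One way to probe that suspicion (a sketch, not taken from the project): call a receive-based partition from many threads at once and check that the results stay consistent with the sequential result:

    (use-modules (ice-9 threads)   ; n-par-map
                 (srfi srfi-1)     ; partition, every
                 (srfi srfi-8))    ; receive

    (define (partition-evens lst)
      (receive (evens odds)
          (partition even? lst)
        (cons evens odds)))

    (define input (iota 100))
    (define expected (partition-evens input))

    ;; Run the same receive-based computation 1000 times across 8 threads.
    (define results
      (n-par-map 8 (lambda (_) (partition-evens input)) (iota 1000)))

    (if (every (lambda (r) (equal? r expected)) results)
        (display "receive behaved consistently under n-par-map\n")
        (display "inconsistent results - possible parallelism bug\n"))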