
Parallelism abstraction ideas

Where parallel evaluation could happen

The obvious place to introduce parallelism into the decision tree algorithm is where the algorithm branches off into subtrees, which may in turn branch off into further subtrees themselves.
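This branch-level idea can be sketched with Guile's parallel forms; build-subtree and make-node are hypothetical names standing in for the real procedures:

```scheme
;; Sketch of branch-level parallelism using Guile's letpar form.
;; build-subtree and make-node are assumed names, not the real API.
(use-modules (ice-9 threads))

(define (grow-branches left-data right-data)
  ;; Evaluate both subtree constructions in parallel, then combine them.
  (letpar ((left  (build-subtree left-data))
           (right (build-subtree right-data)))
    (make-node left right)))
```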

Profiling under Racket has shown that the exhaustive search for the best split of a subtree's dataset is what takes the most time. This is another place where parallelism could happen: one could split the search into multiple parts and start multiple units of evaluation running in parallel, each finding the best split for one part of the data.
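A minimal sketch of this split-search idea, reusing procedure names that appear later in these notes (their exact signatures are assumptions):

```scheme
;; Sketch: search for the best split of each column chunk in parallel,
;; then pick the cheapest result sequentially. Signatures are assumed.
(use-modules (ice-9 threads))

(define (parallel-best-split data column-indices n-jobs)
  (select-min-cost-split
   ;; n-par-map caps the number of threads at n-jobs.
   (n-par-map n-jobs
              (lambda (col)
                (get-best-split-for-column data col))
              column-indices)))
```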

Parallelism constructs

  • Different Scheme dialects might support different parallelism constructs, or possibly none at all. The abstraction should therefore account for the fact that in some Scheme dialects there might be no simple way to evaluate expressions in parallel.
  • It might not always be desirable to run the maximum possible number of tasks or jobs in parallel. The decision of how many jobs should run in parallel should be up to the user of the algorithm. It should be configurable, probably via a keyword argument of the fit procedure.
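    A sketch of such a keyword argument, assuming fit's other arguments and the build-tree helper; the default falls back to the available core count:

```scheme
;; Sketch: user-configurable job count via a keyword argument.
;; build-tree is a hypothetical internal procedure.
(define* (fit dataset #:key (n-jobs (current-processor-count)))
  ;; n-jobs would be threaded through to the parallel constructs.
  (build-tree dataset n-jobs))
```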

About various ways of parallelizing the algorithm

  • fibers supports a #:parallelism keyword for run-fibers, so the number of fibers running in parallel can be set when the scheduler is created. That scheduler might be reused later, possibly avoiding the need to keep track of how many threads are in use.
  • I need to check whether it is possible to end up with more running fibers than a top-level scheduler specifies by spawning new fibers from within fibers.
  • With parallel forms, there are n-par-map and n-par-for-each, which let one specify the maximum number of threads running in parallel.
  • However, when processing one subtree, one does not know whether the processing of the other subtree has finished, and thus cannot know how many other threads are currently in use. Calling one of these procedures again might run more threads than were specified at an upper or previous level of the tree. The same issue might occur with fibers.
  • See /reference-manual/guile/Parallel-Forms.html
  • With futures, one cannot specify a maximum number of futures to run.
  • In order to control how many threads evaluate code at the same time, one would have to build some kind of construct around them that keeps track of how many futures are running or could be running.
  • One could do something ugly and create a globally defined active-thread counter, which requires locking, to keep track of the number of threads or futures running in parallel.
  • As the number of threads running at the same time is limited to a maximum given by (current-processor-count), perhaps the number can be reduced from outside the program using taskset (the fibers manual mentions it: https://github.com/wingo/fibers/wiki/Manual#Parallelism). Perhaps setaffinity can also be used to specify which cores threads must run on.
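The "ugly" global counter idea above could look like this; all names are illustrative, and the counter is guarded by a mutex and condition variable from (ice-9 threads):

```scheme
;; Sketch: a mutex-protected counter that gates how many workers run
;; at once, regardless of where in the tree they are spawned.
(use-modules (ice-9 threads))

(define max-jobs (current-processor-count))
(define active-jobs 0)
(define jobs-mutex (make-mutex))
(define jobs-condvar (make-condition-variable))

(define (with-job-slot thunk)
  ;; Block until a slot is free, run thunk, then release the slot.
  (with-mutex jobs-mutex
    (while (>= active-jobs max-jobs)
      (wait-condition-variable jobs-condvar jobs-mutex))
    (set! active-jobs (+ active-jobs 1)))
  (dynamic-wind
    (lambda () #f)
    thunk
    (lambda ()
      (with-mutex jobs-mutex
        (set! active-jobs (- active-jobs 1))
        (signal-condition-variable jobs-condvar)))))
```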

Idea about using fibers recursively

  • At tree depth 0, use run-fibers with #:parallelism set to the desired number of threads.
  • At deeper tree levels, use run-fibers with #:parallelism 1.
  • Problem: this might not spawn enough fibers.
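An alternative sketch of the recursive idea: create one parallel scheduler at the top level and have recursive calls merely spawn fibers into it, so no nested schedulers are needed. build-subtree is an assumed name for the recursive procedure:

```scheme
(use-modules (fibers) (fibers channels))

;; Top-level entry: one scheduler, parallelism fixed once.
(define (fit-tree data)
  (run-fibers
   (lambda () (build-subtree data))
   #:parallelism (current-processor-count)))

;; Run two thunks as fibers on the current scheduler and collect both
;; results over a channel, preserving which result came from which thunk.
(define (fiber-par-two thunk-a thunk-b)
  (let ((ch (make-channel)))
    (spawn-fiber (lambda () (put-message ch (cons 'a (thunk-a)))))
    (spawn-fiber (lambda () (put-message ch (cons 'b (thunk-b)))))
    (let ((r1 (get-message ch))
          (r2 (get-message ch)))
      (if (eq? (car r1) 'a)
          (values (cdr r1) (cdr r2))
          (values (cdr r2) (cdr r1))))))
```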
  • Procedures in the parallelized code path

    Looking at all (recursively) participating procedures in the code that runs in parallel should give a clear picture of whether any of them could be at fault for bugs occurring during parallel execution.

    • select-min-cost-split
      • can raise a condition in case of wrong arguments (an empty list of splits)
    • get-best-split-for-column
    • make-split
      • is a constructor
      • only ever creates new struct instances
    • dataset-get-col
    • dataset-map
      • uses only map internally, is just part of an abstraction layer over lists
    • data-point-get-col
      • is just an alias for vector-ref
    • dataset-column-empty?
      • is an alias for null?, is part of an abstraction layer
    • dataset-column-first
      • is an alias for car, is part of an abstraction layer
    • split-data
      • makes use of receive, inside which the bug happened where < got an argument of the wrong type
    • dataset-partition
      • is a wrapper for partition (or actually an alias)
      • is part of the abstraction layer over lists
    • partition
      • is a library function from SRFI-1
    • data-point-get-col
      • alias for vector-ref
      • part of the abstraction layer over vectors for data points
    • <
      • standard mathematical function
    • list
      • is the list constructor
      • standard function
    • split-quality-proc
      • is in this case actually gini-index, given as an argument
    • gini-index
    • +
      • standard math function
    • apply
      • standard higher-order function
    • map
      • standard higher-order function
    • calc-proportion
    • dataset-empty?
      • alias for null?
      • part of the abstraction layer over lists
    • dataset-length
      • alias for length
      • part of the abstraction layer over lists
    • count
      • SRFI-1 higher-order function
    • data-point-get-col
      • alias for vector-ref
      • part of the abstraction layer over vectors for data points
    • /
      • standard math function
    • *
      • standard math function
    • -
      • standard math function
    • iter-col-vals
      • is a named let, defined in get-best-split-for-column itself
    • dataset-column-rest
      • is an alias for cdr
      • is part of the abstraction layer over lists

    Thoughts

    • Perhaps some function from SRFI 1 is not suitable for running in parallel?
    • Perhaps the implementation of the parallel forms is not correct?
    • Perhaps the underlying threading code for parallel forms is not correct?
    • Perhaps recursively splitting into parallel executions for branches of the tree is not a use case foreseen for parallel forms?
    • Perhaps a language primitive is not usable in parallel code? (unlikely)
    • Maybe the receive form is buggy in parallel execution.
      • This is my main suspect.
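If receive is the main suspect, one way to test that hypothesis would be to rewrite the binding in split-data with plain call-with-values, which is what receive expands to; pred and data stand in for the real arguments here:

```scheme
(use-modules (ice-9 receive) (srfi srfi-1))

;; receive (SRFI-8) version, as used in split-data:
(receive (left right)
    (dataset-partition pred data)
  (list left right))

;; Equivalent form without receive, to rule the macro out:
(call-with-values
    (lambda () (dataset-partition pred data))
  (lambda (left right)
    (list left right)))
```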