todo.org 9.2 KB

To do

Bugs [0/1]

  • [ ] There is a nasty only sometimes occurring bug, when using parallel forms for running the algorithm in parallel:

#+BEGIN_SRC shell scheme@(guile-user)> (define tree (fit #:train-data (shuffle-dataset banking-dataset #:seed 12345) #:feature-column-indices (list 0 1 2 3) #:label-column-index 4 #:max-depth 5 #:min-data-points 12 #:min-data-points-ratio 0.02 #:min-impurity-split (expt 10 -7) #:stop-at-no-impurity-improvement #t)) recursive split on depth: 1 will check for stop conditions now checking for stop condition: max-depth-reached? checking for stop condition: insufficient-data-points-for-split? checking for stop condition: insufficient-data-points-ratio-for-split? checking for stop condition: all-same-label? INFO: CONTINUING SPLITT: still got 1372 data points ice-9/threads.scm:289:22: In procedure loop: In procedure <: Wrong type argument in position 1: 0.39142673091293256

Entering a new prompt. Type `,bt' for a backtrace or `,q' to continue. #+END_SRC

This only happens when running the algorithm in parallel and has never happened when running the algorithm without parallelism. It might be a bug in Guile's parallel forms or some code, which cannot safely be run in parallel being part of the algorithm. It is also not always reproducible and the argument to < changes every time the bug happens. This makes it more likely, that this bug is connected to concurrency issues.

I am not aware of any impure code causing this problem, as I specifically made a lot of effort to implement the algorithm in a purely functional way, except for debug outputs. There is still some chance, that I overlooked something.

Here is a complete transcript of the bug happening:

#+BEGIN_SRC shell [18:24:47]:[~/development/Guile/guile-ml]: guile --debug -L . GNU Guile 2.2.6 Copyright (C) 1995-2019 Free Software Foundation, Inc.

Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'. This program is free software, and you are welcome to redistribute it under certain conditions; type `,show c' for details.

Enter `,help' for help. scheme@(guile-user)> (use-modules (utils csv) (decision-tree) (dataset) (tree) (utils string) (utils display) (prediction))

(define FILE-PATH "data_banknote_authentication.csv")

(define COLUMN-CONVERTERS (list (list string->number) (list string->number) (list string->number) (list string->number) (list #;(lambda (val) (display (simple-format #f "converting: ~a\n" val)) (display (simple-format #f "converted: ~a\n" (string->number val))) (string->number val)) (lambda (val) (string->number (string-trim-both val))))))

(define banking-dataset (all-rows "data_banknote_authentication.csv" #:converters COLUMN-CONVERTERS))

(define dev-dataset (list #(2.771244718 1.784783929 0) #(1.728571309 1.169761413 0) #(3.678319846 2.81281357 0) #(3.961043357 2.61995032 0) #(2.999208922 2.209014212 0) #(7.497545867 3.162953546 1) #(9.00220326 3.339047188 1) #(7.444542326 0.476683375 1) #(10.12493903 3.234550982 1) #(6.642287351 3.319983761 1)))

(define-public print-tree (lambda (tree label-column-index) (define tree->string (lambda (tree depth) (cond [(leaf-node? tree) (string-append (n-times-string " " depth) "[" (number->string (dataset-majority-prediction (node-data tree) label-column-index)) "]\n")] [else (string-append (string-append (n-times-string " " depth) "[feature:" (number->string (node-split-feature-index tree)) " < " (number->string (node-split-value tree)) "]\n") (tree->string (node-left tree) (+ depth 1)) (tree->string (node-right tree) (+ depth 1)))]))) (displayln (tree->string tree 0))))

;;; note: source file ./decision-tree.scm ;;; newer than compiled home/xiaolong.cache/guile/ccache/2.2-LE-8-3.A/home/xiaolong/development/Guile/guile-ml/decision-tree.scm.go ;;; note: auto-compilation is enabled, set GUILE_AUTO_COMPILE=0 ;;; or pass the --no-auto-compile argument to disable. ;;; compiling ./decision-tree.scm ;;; compiled home/xiaolong.cache/guile/ccache/2.2-LE-8-3.A/home/xiaolong/development/Guile/guile-ml/decision-tree.scm.go

scheme@(guile-user)> (define tree (fit #:train-data (shuffle-dataset banking-dataset #:seed 12345) #:feature-column-indices (list 0 1 2 3) #:label-column-index 4 #:max-depth 5 #:min-data-points 12 #:min-data-points-ratio 0.02 #:min-impurity-split (expt 10 -7) #:stop-at-no-impurity-improvement #t)) recursive split on depth: 1 will check for stop conditions now checking for stop condition: max-depth-reached? checking for stop condition: insufficient-data-points-for-split? checking for stop condition: insufficient-data-points-ratio-for-split? checking for stop condition: all-same-label? INFO: CONTINUING SPLITT: still got 1372 data points ice-9/threads.scm:289:22: In procedure loop: In procedure <: Wrong type argument in position 1: 0.39142673091293256

Entering a new prompt. Type `,bt' for a backtrace or `,q' to continue. scheme@(guile-user) [1]> ,bt In current input: 161:2 3 (_) In decision-tree.scm: 218:18 2 (recursive-split (#(-2.7028 1.6327 0.83598 -0.091393 1) #(-1.3931 1.5664 7.5382 0.78403 0) #(-5.4414 7.2363 0.10938 -7.5642 1) #(-2.7908 -5.7133 5.953 0.45946 1) #(-3.518 2.8763 0.1548 # 1) # …) …) 110:13 1 (get-best-split _ _ _ #:split-quality-proc _) In ice-9/threads.scm: 289:22 0 (loop _) scheme@(guile-user) [1]> #+END_SRC

Documentation [0/2]

  • [ ] I should write more doc strings for remaining procedures.
  • [ ] Update the readme file.

Testing [0/1]

  • [ ] I should check, whether there are more functions or procedures, which need unit tests.

Multiprocessing [0/4]

  • [ ] Make a decision which tool or library for parallelizing parts of the algorithm. Here are some choices:
  • [ ] futures (apparently GNU Guile futures are not the same as Racket futures)
  • [ ] Are there similar restrictions to futures as in Racket?
  • [ ] only for side-effect free function calls
  • [ ] parallel forms (building on futures)
  • [ ] fibers library
  • [ ] Add an abstraction layer for running something in parallel.
  • perhaps 2 interfaces are a good idea:
  1. do task in n processes (process pool basically)
  2. do task on in another process

This way no matter which Scheme is used, the multiprocessing can be ported by just changing the code behind the interface and the algorithm code does not need to care.

  • [ ] Implement parallel evaluation using a chosen tool or library.
  • [ ] Consider implementing with an additional tool or library.

Abstraction layers [0/1]

  • [ ] I should check, whether there is still any usage of things in the code, which I would like to abstract from.
  • For example list primitives car, cdr, null? instead of dataset-empty?, etc.

Logging [0/1]

  • [ ] Logging should not appear when undesired.
  • For example debug logs should not appear when running the tests.
  • Maybe there is a good logging module for Guile / Scheme? Maybe even an SRFI?

Old todo (possibly obsolete, need to check) [0/4]

  • [ ] remove data from not leaf nodes by using struct setters
  • However, that might be computationally expensive, if done in a purely functional way.
  • This might save some memory.
  • The memory cost of keeping copies is at maximum (for a perfectly balanced binary tree) the depth of the tree times the memory used by the whole dataset.
  • [ ] Find any remaining randomness (if there is any), which is not determined by random-seed keyword arguments yet.
  • [ ] Prediction
  • return not only the predicted label, but also how sure we are about the prediction (percentage of data points in the leaf node, which has the predicted label)
  • A decision tree predicts the label for a data point, by following the split features and values in the tree. How is this "How sure is the model about the prediction?" calculated? Does it even exist for decision trees? I should check scikit-learn and see what they offer.
  • One way to return how sure the model is about the prediction is to create a "Prediction" struct, which contains the prediction and the number expressing how sure the model is and return that struct.
  • Another way is to use multiple return values.
  • [ ] Pruning [0/1]
  • [ ] offer a prunning way of removing all splits with less improvement than x in cost?
  • argument against this: This can be already achieved with early stopping parameters. Why would anyone not use early stopping parameters, but then do this in a prunning step?
  • Perhaps the person doing the prunning is not the same person, who did the fitting of the model.

Testing [0/2]

  • [ ] Make a list of procedures, which are not tested yet.
  • [ ] Make decisions about which of those procedures really need tests.
  • [ ] anything that is not predefined in Scheme / GNU Guile