This repository is meant to contain all of my machine learning algorithm implementations in GNU Guile or Scheme.

zelphir.kaltstahl 77f93cb446 add more notes about using fibers 5 years ago
old-racket-code 05f7208871 add racket code test file 5 years ago
scripts 7adfd9baf1 run metrics tests in run test script 5 years ago
test f272fac139 add dataset-get-columns procedure 5 years ago
utils 191af9c8c6 add range function 5 years ago
.gitignore 81913d80f7 update gitignore to include Emacs files 5 years ago
LICENSE 198d22ffbb Initial commit 7 years ago
README.org fde8c7b798 update readme 5 years ago
columns.csv ced580d8b0 initial commit 7 years ago
data-point.scm 3df084f217 add missing procedure 5 years ago
data_banknote_authentication.csv ced580d8b0 initial commit 7 years ago
dataset.scm f272fac139 add dataset-get-columns procedure 5 years ago
decision-tree.scm 43290ee473 add todo comments for parallelism 5 years ago
metrics.scm 51ef8a8b35 update comment 5 years ago
notes.org 77f93cb446 add more notes about using fibers 5 years ago
prediction.scm 90a79c8f89 separate prediction module 5 years ago
pruning.scm 526ce93aa3 separate pruning module 5 years ago
split-quality-measure.scm 0111a9d334 remove commented out Racket expression 5 years ago
todo.org 117ede1f47 update todo items 5 years ago
tree.scm 67d801c996 move tree printing procedure to tree module 5 years ago
utils.scm 74e8f4af8a move list procedures from utils into list utils 5 years ago

README.org

Tests

You can run the tests by running the script run-tests.bash in the scripts/ directory as follows:


# from the root directory of this project:
bash scripts/run-tests.bash

Usage (outdated example)

This example is outdated: it was written for the older Racket code and still uses Racket-specific forms.


(define shuffled-dataset (shuffle dataset))

(define small-dataset
  (data-range shuffled-dataset
              0
              ;; take only a fifth of the data to make this example run faster
              (exact-floor (/ (dataset-length shuffled-dataset)
                              5))))

;; be sure to collect all garbage, apparently this should be called thrice
(collect-garbage)
(collect-garbage)
(collect-garbage)

;; requires a ~time~ macro
(time
 ;; ~for/list~ -- a Racketism, needs to be rewritten
 (for/list ([i (in-range 1)])
   (mean
    (evaluate-algorithm #:dataset (shuffle dataset)
                        #:n-folds 10
                        #:feature-column-indices (list 0 1 2 3)
                        #:label-column-index 4
                        #:max-depth 5
                        #:min-data-points 24
                        #:min-data-points-ratio 0.02
                        #:min-impurity-split (expt 10 -7)
                        #:stop-at-no-impurity-improvement #t
                        #:random-seed 0))))

;; be sure to collect all garbage, apparently this should be called thrice
(collect-garbage)
(collect-garbage)
(collect-garbage)

(time
 ;; ~for/list~ -- a Racketism, needs to be rewritten
 (for/list ([i (in-range 1)])
   ;; run with the whole dataset as an example, no random seed
   (define tree (fit #:train-data dataset
                     #:feature-column-indices (list 0 1 2 3)
                     #:label-column-index 4
                     #:max-depth 5
                     #:min-data-points 12
                     #:min-data-points-ratio 0.02
                     #:min-impurity-split (expt 10 -7)
                     #:stop-at-no-impurity-improvement #t))
   'done))

Approach

Data representation

  • A dataset is currently represented as a list of vectors, where each vector is one row of the dataset.
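
As a minimal sketch of this layout, a dataset with four feature columns and one label column could look like the following in Guile. The values are made up for illustration, and example-get-column is a hypothetical helper, not one of this repository's procedures:

```scheme
;; A tiny dataset in the "list of vectors" layout: each vector is one row.
;; The numbers are invented for illustration; column 4 plays the role of
;; the label, matching the column indices used in the usage example.
(define example-dataset
  (list (vector 3.6 8.7 -2.8 -0.4 0)
        (vector 4.5 8.2 -2.5 -1.5 0)
        (vector -1.4 3.3 1.2 0.1 1)))

;; Hypothetical helper (an assumption, not the repository's API):
;; collect the values of one column as a list.
(define (example-get-column dataset column-index)
  (map (lambda (row) (vector-ref row column-index))
       dataset))

;; Number of rows.
(display (length example-dataset))
(newline)                                        ; prints 3

;; Label of the first row.
(display (vector-ref (car example-dataset) 4))
(newline)                                        ; prints 0

;; All labels.
(display (example-get-column example-dataset 4))
(newline)                                        ; prints (0 0 1)
```

With rows as vectors, looking up a value by column index takes constant time, while the outer list keeps it easy to map over rows or split the dataset into parts.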