README.org 2.1 KB

Tests

You can run the tests by running the script run-tests.bash in the scripts/ directory as follows:


# from the root directory of this project:
bash scripts/run-tests.bash

Usage (outdated example)

This example is outdated and still for the older Racket code.


(define shuffled-dataset (shuffle dataset))

(define small-dataset
  (data-range shuffled-dataset
              0
              ;; take only a fifth of the data to make this example run faster
              (exact-floor (/ (dataset-length shuffled-dataset)
                              5))))

;; be sure to collect all garbage, apparently this should be called thrice
(collect-garbage)
(collect-garbage)
(collect-garbage)

;; requires a ~time~ macro
(time
 ;; ~for/list~ -- a Racketism, needs to be rewritten
 (for/list ([i (in-range 1)])
   (mean
    (evaluate-algorithm #:dataset (shuffle dataset)
                        #:n-folds 10
                        #:feature-column-indices (list 0 1 2 3)
                        #:label-column-index 4
                        #:max-depth 5
                        #:min-data-points 24
                        #:min-data-points-ratio 0.02
                        #:min-impurity-split (expt 10 -7)
                        #:stop-at-no-impurity-improvement #t
                        #:random-seed 0))))

;; be sure to collect all garbage, apparently this should be called thrice
(collect-garbage)
(collect-garbage)
(collect-garbage)

(time
 ;; ~for/list~ -- a Racketism, needs to be rewritten
 (for/list ([i (in-range 1)])
   ;; run with the whole dataset as an example, no random seed
   (define tree (fit #:train-data dataset
                     #:feature-column-indices (list 0 1 2 3)
                     #:label-column-index 4
                     #:max-depth 5
                     #:min-data-points 12
                     #:min-data-points-ratio 0.02
                     #:min-impurity-split (expt 10 -7)
                     #:stop-at-no-impurity-improvement #t))
   'done))

Approach

Data representation

  • A dataset is currently represented by a list of vectors. Rows are represented by vectors.