This repository is meant to contain all of my machine learning algorithm implementations written in GNU Guile or Scheme.

Tests

To run the tests, execute the script run-tests.bash in the scripts/ directory:


# from the root directory of this project:
bash scripts/run-tests.bash

Usage (outdated example)

This example is outdated: it was written for the older Racket code and has not yet been rewritten for GNU Guile.


(define shuffled-dataset (shuffle dataset))

(define small-dataset
  (data-range shuffled-dataset
              0
              ;; take only a fifth of the data to make this example run faster
              (exact-floor (/ (dataset-length shuffled-dataset)
                              5))))

;; be sure to collect all garbage, apparently this should be called thrice
(collect-garbage)
(collect-garbage)
(collect-garbage)

;; requires a ~time~ macro
(time
 ;; ~for/list~ -- a Racketism, needs to be rewritten
 (for/list ([i (in-range 1)])
   (mean
    (evaluate-algorithm #:dataset (shuffle dataset)
                        #:n-folds 10
                        #:feature-column-indices (list 0 1 2 3)
                        #:label-column-index 4
                        #:max-depth 5
                        #:min-data-points 24
                        #:min-data-points-ratio 0.02
                        #:min-impurity-split (expt 10 -7)
                        #:stop-at-no-impurity-improvement #t
                        #:random-seed 0))))

;; be sure to collect all garbage, apparently this should be called thrice
(collect-garbage)
(collect-garbage)
(collect-garbage)

(time
 ;; ~for/list~ -- a Racketism, needs to be rewritten
 (for/list ([i (in-range 1)])
   ;; run with the whole dataset as an example, no random seed
   (define tree (fit #:train-data dataset
                     #:feature-column-indices (list 0 1 2 3)
                     #:label-column-index 4
                     #:max-depth 5
                     #:min-data-points 12
                     #:min-data-points-ratio 0.02
                     #:min-impurity-split (expt 10 -7)
                     #:stop-at-no-impurity-improvement #t))
   'done))

Approach

Data representation

A dataset is currently represented as a list of vectors, where each vector is one row of the dataset.
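
As a rough illustration of this representation, the following sketch builds a tiny dataset as a list of row vectors and reads a column out of it. The values are made up for the example, and the names example-dataset, example-row-count and example-get-column are hypothetical; they are not the procedures defined in dataset.scm.

```scheme
;; A minimal sketch, assuming only standard Scheme procedures.
;; All names and values here are illustrative, not this repository's API.

;; A dataset is a list of rows; each row is a vector.
;; Here columns 0 to 3 play the role of features and column 4 the label.
(define example-dataset
  (list (vector 1 2 3 4 0)
        (vector 5 6 7 8 0)
        (vector 9 10 11 12 1)))

;; The number of rows is the length of the outer list.
(define (example-row-count dataset)
  (length dataset))

;; A column is collected by taking the same index from every row vector.
(define (example-get-column dataset index)
  (map (lambda (row) (vector-ref row index))
       dataset))

;; Usage: extract the label column and one feature column.
(define example-labels (example-get-column example-dataset 4))
(define example-first-feature (example-get-column example-dataset 0))
```

With this layout, reading one value from a row is a constant-time vector-ref, while reaching a particular row means walking the outer list.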
