This repository is supposed to contain all my GNU Guile or Scheme machine learning algorithm implementations.

zelphir.kaltstahl ce9d3d306d add FP to approach 4 年之前
test aaffde7d14 add makefile for running tests and cleaning up test log files 4 年之前
utils b0d6075bc1 renaming of utils modules 5 年之前
.gitignore 81913d80f7 update gitignore to include Emacs files 5 年之前
LICENSE 198d22ffbb Initial commit 7 年之前
Makefile aaffde7d14 add makefile for running tests and cleaning up test log files 4 年之前
README.org ce9d3d306d add FP to approach 4 年之前
columns.csv ced580d8b0 initial commit 7 年之前
data-point.scm 3df084f217 add missing procedure 5 年之前
data_banknote_authentication.csv ced580d8b0 initial commit 7 年之前
dataset.scm 634f7236e8 ad procedure dataset-num-cols 5 年之前
decision-tree.scm e172c5e1a3 parallelize k-fold crossvalidation for folds 4 年之前
metrics.scm 51ef8a8b35 update comment 5 年之前
notes.org dff2016cd3 add notes: check every procedure that runs in parallel 5 年之前
parallelism.scm c9b32bd5b7 implement parallelization using futures 4 年之前
prediction.scm 90a79c8f89 separate prediction module 5 年之前
pruning.scm b0d6075bc1 renaming of utils modules 5 年之前
run.scm 4647c37608 adapt run.scm to see parallelization in action 4 年之前
split-quality-measure.scm 0111a9d334 remove commented out Racket expression 5 年之前
todo.org d61a953550 add more detail to the bug report 5 年之前
tree.scm 594ea7d19b add run example script and move print-tree there 5 年之前
utils.scm 74e8f4af8a move list procedures from utils into list utils 5 年之前

README.org

About

(This readme file might not render 100% precisely on many git host websites. If something seems strange, you can look at the raw version of this file.)

This is a collection of machine learning algorithms implemented in GNU Guile. Currently implemented algorithms are:

  • decision tree

Nope, sorry, currently that is it.

The decision tree algorithm is implemented in a purely functional way, which helps immensely with parallelization of the algorithm and unit testing the code. My vision for this repository is, that more purely functional machine learning algorithms are added later. My preference is strongly for purely functional algorithms.

Feel free to report bugs, report possible improvements, contribute machine learning algorithms in GNU Guile or standard Scheme runnable in GNU Guile (perhaps even non-purely functional, so that we can later on transform them into purely functional ones), fork or use this code in any license adhering way.

Contributions

There are many things, that could be done. A few will be listed here, in a non-comprehensive list:

  • Find and report bugs.
  • Add more tests.
  • Implement more machine learning algorithms.
  • Based on decision trees, random forest could be a next logical step.
  • Come up with an idea for a common interface for algorithms.
  • Improve the code.
  • Make the code more readable.
  • Fix performance issues (but not at the cost of readability).
  • write documentation
  • How does the algorithm work?
  • Code documentation: How does this function / procedure work?
  • Write a usage guide.
  • Make use of the code and share experiences with it.
  • Inform about problems, which prevent usage of the code.

Tests

You can run the tests by using the make file as follows:


# from the root directory of this project:
make test

Approach

Functional programming

In general, the idea is to implement machine learning algorithms in a purely functional way, which will help with parallelization, testability and avoiding bugs due to mutable state.

Data representation

  • A dataset is currently represented by a list of vectors. Rows are represented by vectors.

Known Caveats

  • There is currently no way to limit parallelism. If you need to limit the number of cores this program uses, you need to limit the number of cores available to the Guile process.
  • Currently there is no choice of parallelism costructs to use for running the algorithm in parallel.
  • Currently there is no way to configure, whether Guile tries to run the algorithm in parallel or purely sequentially.
  • There is no concept yet for a common interface to the decision tree algorithm and other possibly later added machine learning algoritms.
  • Currently the decision tree algorithm does not make use of any heuristic for selecting split feature values. It checks every occurring value of the split feature in the data set. This is computationally expensive. However, it should deliver the best, by this algorithm learnable, decision tree for specified parameters.
  • Currently there is only one split quality measure implemented. Other implementations often offer multiple quality measures.