ultimate_programming_language_iteration1.md 15 KB

UPDATE: obsolete, made my language now (comun)

ultimate programming language - WIP

This is a draft of an idea for the ultimate, ideal programming language.

vs C

Let's start with C, a language that's pretty suckless already.

What's seems bad about C:

  • preprocessor/macros form a language different from the base language, it's basically two languages in one (bad)
  • non-minimal syntax, each syntactic structure has arbitrary syntactic rules (e.g. ; needs to follow an array initialization { } block but not a function { } body)
  • English keywords (instead of language neutral single symbols)
  • expressions as a syntactic speciality, operators are syntactically different from function calls, have arbitrary precedence rules etc.
  • has non-intuitive and confusing things for non-programmers, e.g. = vs ==
  • non-uniform data type sizes (int size depends on platform)
  • lots of undefined behavior (but to some degree this is prolly needed for performance etc.), some if unintuitive (e.g. bit shift by more than data type width, one would expect it's a 0 but it's undefined)
  • unnecessary features: enums, floats, doubles
  • is just one non-configurable language (could be parametrized and form a family of languages)
  • no binary literals
  • ...

basic philosophy

The basic goals of the language are:

  • To be extremely intuitive.
  • To be extremely elegant.
  • To be extremely readable, with short code, in both inline and long source code formats.
  • To be general purpose.
  • To not be based on any natural human language (such as English), but rather on mathematical notation.
  • Support at least imperative and functional paradigms, both any one alone as well as their combination.
  • Support compilation and interpretation alike.
  • Support high level and low level programming alike.
  • Be naturally extensible without this support adding complexity to the language.
  • Respect suckless, KISS, minimal and Unix philosophies.
  • Have small and simple specification, syntax and as few rules and concepts as possible, with as few exceptions as possible.
  • Be simple to analyze and allow very simple implementation.
  • Be easy to learn.
  • The reference implementation should be public domain free software, as well as accompanying materials (specification, documentation etc.).

The implied rules are:

  • Keywords must be single ASCII special symbols (@, !, : etc.).
  • No package managers, libraries and programs are not packages but simply files.
  • No Unicode support, only 7 bit ASCII.
  • No boilerplate or bloat.
  • No declarations.

The priorities are not:

  • To be highly productive.
  • To be highly popular.
  • To be highly safe.
  • To achieve high performance.
  • To offer many features.
  • To offer a big standard library.
  • To enforce any programming style.
  • To conform to industry or capitalism needs.
  • To be similar to other languages, based on them or be compatible with them.
  • To support OOP.
  • To be backward compatible.

drafts

Perhaps use brackets for both if branching and while cycle, like in Brainfuck. E.g.:

a == a   // pushes 1 on stack
(        // pops stack, if 1, continues on, else jumps after amatching endbracket

c == c   // pushes on stack 
)        // pops stack, if 1, jumps back to matching start bracket, else continues 

notation comparison:

name example FAVSRIMGC (pros)
C 10 * sin(PI - f(x,y)) + 1 222120002 = 11 (3)
lisp (+ (* 10 (sin (- PI (f x y)))) 1) 022011122 = 11 (5)
postfix plain 10 PI x y f - sin * 1 + 000212222 = 11 (9)
prefix plain + * 10 sin - PI f x y 1 000211222 = 10 (8)
postfix upper. vars 10 PI X Y f - sin * 1 + 002212220 = 11 (9)
postfix prefixed vars 10 $PI $x $y f - sin * 1 + 102112122 = 12 (7)
prefix terminators + * 10 sin - PI; f x; y;;;;; 1; 111001122 = 9 (4)

pros (suckless attributes, the number in the bracket, are SRIMG):

  • F: looks familiar, intuitive and similar to commonly used math notation and other programming languages
  • A: arities of functions don't have to be known for evaluation, i.e. it is clear how many parameters each function/operator has
  • V: functions and variables are distinguishable just from the expression alone
  • S: the expression can usually be fairly short
  • R: is well readable even without highlighters
  • I: simple implementation
  • M: uses minimum of extra symbols (brackets, commas, ...), i.e. doesn't "waste" characters
  • G: grammar is consistent, i.e. has (almost) no exceptions and irregularities (such as some operators being infix, function calls differing from operator use etc.)
  • C: can be case insensitive
myfunc a b c:  # comment goes either until \n or another #
  +(a -(b c))

@ < (i 10)
  print(i)
  i: -(i 1)
;

(& (>(x 10) <(x 15))) ?
  (print x)
~
  (print "no")
;



(@
  (> i 10)
  (myfunc a b c)
  (myfunc d e f)
)

(?
  (= x 50)
  (print "sasasa")
  ~
  (print "eeeee")
)

THIS SEEMS GOOD:

if (cond right after "(") :
  (
    cond ?
    ...
  )

while (cond right before ")"):
  (
    ...
    cond ?
  )

TODO: optional program parameters at the start of the file? such as extensions

TODO: lazy evaluation? likely not, but if and a loop have to be exceptions, because we don't want both branches of if be executed.

TODO: short condition evaluation? (e.g. a != 0 & b / a)

TODO: types? arrays? structs?

Every command is a function (returns a value). This function can however have side effects and doesn't have to be a pure function (for same input always return same output).

Function returns the value returned by its last statement

another attempt


IDEAS:

  • macros should use the same syntax (and semantics?) as the language itself, perhaps macros would mean two passes or running the program: first would generate the source, the second would generate the program
  • possible feature: embed source code to the output executable, pro-free-sw

Let's call the project A. The project doesn't just define one language, but a whole family and hierarchy of languages.

A specifies multiple languages at different levels.

This has many advantages. we can e.g. fork and derive multiple versions of a language from a common ancestor language, drop-in another ancestor language for an existing language, we can reuse existing parsers, have data-description languages use the same syntax as the programming language itself, which can in turn allow the programming language to easily parse programs written in the programming language itself... etc.

level 0

Languages at this level are called A0, followed by 2 parameters:

  • left bracket symbol, referenced as A0-(
  • right bracket symbol, referenced as A0-)

E.g. A0() or A0[].

Simple language for capturing nested bracket expressions.

A0-alphabet is the alphabet of the language, and is identical to the 7bit ACII. (The language is case-sensitive.) A white character is any character with decimal value less than 33 or equal to 127.

A0-name is a sequence of characters that aren't white, A0-(, A0-), " and #.

A0-string is a sequence of characters that aren't ", starting with " and ending with ".

A0-comment is a sequence not inside A0-string, of characters that aren't # and \n, starting with # and ending with either # or \n.

A0-atom is either A0-name or A0-string.

A0-sequence is a sequence composed of 0 to many A0-atoms and/or A0-brackets, called elements, separated by one or more white characters, possibly left and/or right padded with white characters.

A0-bracket is a A0-sequence with A0-( in front of it and A0-) after it, possibly left and/or right padded with white characters. An element of a bracket means an element of its sequence.

Given A0 language is any A0-sequence.

Example of A0() language sentence:

hello      # this is a comment
(
  (() 323 - XYZ)
  123
  (
    (abc def ghi)
  )
)

level 1

Each of these languages is a subset of some A0 language. Its name starts with A1 and is followed by these parameters:

  • first parameter to the parent A0 language
  • second parameter to the parent A0 language
  • F or f (supporting or not supporting functions, respectively)
  • J or j (supporting or not supporting jumps, respectively)

E.g. A1()Fg. Languages at this level capture control flow of an algorithm (ignoring other details such as operators or data types).

A1-sequence is A0-sequence of which no element is ?, @, ', ~, ^ and ``` (ASCII value 96).

A1-branch is A0-bracket whose first element is ?, which is followed by an element called A1-condition, which is followed by A1-sequence called A1-then, which may or may not be followed by ~, which may or may not be followed by A1-sequence called A1-else. This represents the if-then and if-then-else structures.

A1-loop is A0-bracket whose first element is @, which is followed by an element called A1-condition, which is followed by A1-sequence.

A1-function is A0-bracket whose first element is ', which is followed by an element, which is followed by A1-sequence.

A1-label is A0-bracket with two elements, first one being ``` (ASCII value 96) and second one being A0-name.

A1-jump is A0-bracket with two elements, first one being ^ and second one being A0-name.

A1-call is A0-bracket whose sequence is A1-sequence and which has at least one element. The sequence starting with second element up until the end is called A1-arguments. This represents a function call (either user defined or otherwise available, e.g. as built-in or via library).

Given A1 language is a A1-sequence. If the language is parametrized with f, no A1-function structure is allowed to appear. If the language is parametrized with j, no A1-jump structure is allowed to appear.

Example of A1()FJ sentence:

(' func1   # my function
  (` mylabel)
  (? (= x 1)
    (somefunc a b c)
    (func2)
  )
)

(' func2
  (^ mylabel)
  (+ 3 2)
  ("asasa" abcd) (3) (c)
)

(func 2)

level 2

Each of these languages is a subset of some of A1 language, and their name starts with A2 (followed by possible parameters). These languages define additional syntactic and semantic elements (macros, data types, operators, ...) needed for a complete programming language.

data types

An SDT (simple data type) defines a set of integers. Any SDT is described by two pieces of as SDT(w,x):

  • w: bit width, a positive non-zero integer
  • x: signedness, either signed (s) or unsigned (u)

The set of numbers defined by SDT(w,x) contains 2^w numbers, and these are all integers between 0 and 2^w - 1 (including the bounds) if x = u, or -2^w / 2 to 2^w / 2 - 1 (including the bounds) if x = s.

Representation of a value of given SDT is not defined, i.e. it is NOT guaranteed that the value will be encoded as two's complement or have specific endianity.

Sometimes we may need to compare SDTs, so let's define a linear ordering: SDTa(wa,xa) is greater than or equal to SDTb(wb,xb) if and only if

  1. wa > wb, or
  2. wa = wb and wa = s

Values can be converted between SDTs, which is called casting. The conversion is as follows:

TODO

A CDT (complex data type) is a finite sequence of SDTs: SDT1, SDT2, ... SDTn. A complex value is a specific value a CDT can take, it is a sequence of values of SDT1, ... SDTn.

TODO: pointer type

TODO: size of data type representation?

operators

Operators are basic mathematical and logical operations (performed on values called operands) that often correspond to single instructions in machine code. Operators behave like built-in functions, i.e. they are used in the same way as function calls, but their internal implementation may be (and usually should be, for performance sake) special, i.e. an addition operator should, if possible, translate to a single addition instruction as opposed to a call instruction.

The general rule of all operators is that the operation is performed in the space of (and limited by) the greatest SDT of the operands. The general algorithm of operation is as follows:

  1. Let T be the greatest SDT of all operands.
  2. Cast all operands to T.
  3. Perform the effect of the operator, cast the result to T and return it.

The operator call therefore has SDT identical to the greatest SDT of its operands.

For example when multiplying a value x of SDT(16,u) with value y of SDT(8,s), y is casted to SDT(16,u), then 16-bit unsigned multiplication is performed yielding a 16-but unsigned result (with possible overflow), which is then converted to the greatest SDT of the language and returned.

The language operators are:

  • (+,a,b): Addition, returns a + b with possible wrap around (overflow) over the upper bound.
  • (-,a,b): Substraction, returns a - b with possible wrap around (underflow) under the lower bound.
  • (*,a,b): Multiplication, performs a * b. Given the SDT of a and b is SDT(w,x), the multiplication is performed in space SDT(2 * w,x) and cast back to SDT(w,x) to give the result.
  • (/,a,b): Integer division, performs a / b. If b is 0, program should halt with a zero division error.
  • (%,a,b): Integer division remainder (modulus), e.g. 6 % 5 = 1, -3 % 2 = -1 etc. If b is 0, the program should behave the same as in case of dividing by zero with /.
  • (=,a,b): Comparison, the result is 1 if the integer values represented by a and b are equal, otherwise 0.
  • (|,a,b): Logical OR, the result is 1 if at least one of the operands is non0, otherwise 0.
  • (&,a,b): Logical AND, the result is 1 if at least one of the operands is non0, otherwise 0.
  • (!,a): Logical NOT, the result is 1 if a is 0, otherwise 1.

TODO: bitwises

specific languages

These are final programming languages of which each is a subset of some A2 language.

macro language

This is a simple interpreted language meant as a helper language to be embedded in another language. It should be very easy to implement. Features:

  • no complex data types
  • no recursion

scripting language

This is an interpreted language that can be embedded in programs and allow their scripting. This language is also easier to implement than the compiled language. Features:

  • dynamic typing, variables can change type at run time
  • no variable declaration, variable is created by assigning a value to it
  • only one sufficiently large SDT is used for all simple values
  • associative arrays (arrays can be indexed by strings)

compiled language

This is the more complex language meant to be compiled to binary forms or to other compiled languages such as C. Features:

  • static typing