
I want to design a simple programming language for educational purposes.

My goal is to understand the fundamentals of how a language works — defining syntax, grammar, and how code gets executed (interpreted or compiled).

I’m not aiming to create a production-level language, just something minimal and easy to test basic concepts.

My questions are:

  1. What are the main steps to design a very simple language?
  2. Which programming languages are best to use for implementing it (Python, C, or Java)?
  3. Are there any good tutorials, books, or open-source projects to start learning from?

What I know so far:

  • I understand basic programming and syntax (Python and JavaScript).
  • I’ve heard about concepts like “lexer”, “parser”, and “compiler”, but I’ve never built one.

What I’ve tried:

  • I read about building a simple calculator language in Python, but I didn’t fully understand how the tokens and parsing worked.

What I’m looking for: A clear explanation or example of how to design a simple programming language from scratch — even something minimal, like a language that can print text or do simple math operations.

Example of what I imagine:

print "Hello, World!"
a = 5 + 3
print a
5 Comments
  • This question is quite broad and an answer could be a whole book. Fortunately, there already is one! Crafting Interpreters would be an excellent place to start.
  • You might like to look at MARIE, a simple programming language for educational purposes.
  • Compiler implementation is the subject of a semester-long class in many CS programs.
  • (To paraphrase Carl Sagan: if you want to design a programming language from scratch, you must first invent the universe… Otherwise you're building on decades of experience and expertise.)
  • Are you asking about designing a language (defining what a valid program looks like and what it will do, e.g. by writing down its grammar and semantics), or implementing it (being able to run programs, e.g. by writing a compiler or interpreter)? The two are separate, though obviously related; it looks like you're asking about both.

2 Answers

  1. What are the main steps to design a very simple language?

First, have a purpose for the language. Let's break that down:

My goal is to understand the fundamentals of how a language works — defining syntax, grammar, and how code gets executed (interpreted or compiled).

OK, that's a purpose for you building a language, which is great. But it is not a purpose for the language itself.

You also said:

something minimal and easy to test basic concepts.

Those are design goals -- and they are good goals! -- but not purposes.

So my first bit of advice to you is to think about what you want your users -- which could be just you -- to be able to do in this language. You mentioned calculator apps, which is a pretty good start, so let's provisionally pick that as a purpose. The purpose of the language is to do arithmetic and print out the result.

First step done.

Second step: gather user requirements. Ask the users of your calculator language -- which again, could be just asking yourself -- what they like and dislike about existing calculation systems, what they find easy or hard to do in those systems, and so on. Use that to motivate a design.

Third step: Every programming language takes inspiration from the successes and shortcomings of other languages. Are there languages you admire? Are there languages you dislike? Make a list of what you find tasteful and distasteful in a language. This will inform your design goals, which you've already made a good start on.

Fourth step: Know the strengths of your development team, which again, is you. You've already done a great job on this step. You've made a list of the things you don't know how to do, and all those things you can learn how to do. Use your strengths and weaknesses to drive your development choices, such as:

Which programming languages are best to use for implementing it (Python, C, or Java)?

I've written compilers in C++, C#, Java, Python, Scala, OCaml... any general-purpose language is fine for writing a compiler. Pick one that you feel confident you can be productive in; it doesn't really matter which. Remember that you're trying to learn about the process of creating a new language; if you improve your skills in the implementation language, great, but that's not a priority goal.

Fifth step: Design the MVP, the Minimum Viable Product. What is the smallest, easiest language that meets your goals? To design the language, start with two things:

What is the lexical grammar? A "lexer" is nothing more than a function whose input is a string and whose output is a sequence of "tokens". How does the lexer know where one "token" ends and another begins? If you have xyz = 1.23+4; what precisely are the rules for dividing that up into the sequence id, space, eq, space, number, plus, number, semi? Are there comments in your language? What are the rules for whitespace? Work all that out.

Now you can start writing a lexer, which again, is just a function from texts to token sequences. You can test that independently. Write a lot of unit tests!
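To make that concrete, here is a minimal lexer sketch in Python. The token names and regular expressions are illustrative choices for a tiny grammar, not the only way to do it:

    import re

    # One pattern per token kind; names and patterns are illustrative.
    TOKEN_SPEC = [
        ("number", r"\d+(?:\.\d+)?"),
        ("id",     r"[A-Za-z_]\w*"),
        ("eq",     r"="),
        ("plus",   r"\+"),
        ("semi",   r";"),
        ("space",  r"\s+"),
    ]
    MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

    def lex(text):
        """Turn a source string into a list of (kind, value) tokens."""
        tokens, pos = [], 0
        while pos < len(text):
            match = MASTER.match(text, pos)
            if not match:
                raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
            if match.lastgroup != "space":      # this lexer discards whitespace
                tokens.append((match.lastgroup, match.group()))
            pos = match.end()
        return tokens

    print(lex("xyz = 1.23+4;"))
    # [('id', 'xyz'), ('eq', '='), ('number', '1.23'), ('plus', '+'), ('number', '4'), ('semi', ';')]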

A parser is a function that takes a sequence of tokens and produces a parse tree -- a tree data structure that describes the structure of the program. Again, come up with a specification for precisely what the rules are for breaking down a program. For example:

  • A program is one or more statements
  • An assignment is a kind of statement; an assignment consists of an identifier token, an eq token, an expression, and a semi token.
  • A number token is a kind of expression
  • An addition is a kind of expression... and so on

Now you can write a parser -- a function that takes a sequence of tokens and returns a parse tree.
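As a sketch of what that can look like, here is a small recursive descent parser in Python over the (kind, value) tokens from the lexer sketch above; the tuple-shaped tree nodes are an arbitrary choice for illustration:

    def expect(tokens, pos, kind):
        if pos >= len(tokens) or tokens[pos][0] != kind:
            raise SyntaxError(f"expected {kind} at token {pos}")
        return tokens[pos]

    def parse_program(tokens):
        """program := assignment+ ; returns a list of statement nodes."""
        pos, statements = 0, []
        while pos < len(tokens):
            stmt, pos = parse_assignment(tokens, pos)
            statements.append(stmt)
        return statements

    def parse_assignment(tokens, pos):
        """assignment := id eq expression semi"""
        name = expect(tokens, pos, "id");  pos += 1
        expect(tokens, pos, "eq");         pos += 1
        expr, pos = parse_expression(tokens, pos)
        expect(tokens, pos, "semi");       pos += 1
        return ("assign", name[1], expr), pos

    def parse_expression(tokens, pos):
        """expression := number (plus number)*  -- additions only, for brevity"""
        left = ("number", expect(tokens, pos, "number")[1]); pos += 1
        while pos < len(tokens) and tokens[pos][0] == "plus":
            pos += 1
            right = ("number", expect(tokens, pos, "number")[1]); pos += 1
            left = ("add", left, right)
        return left, pos

    print(parse_program(lex("a = 5 + 3;")))
    # [('assign', 'a', ('add', ('number', '5'), ('number', '3')))]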

Remember that both your lexer and parser must produce a helpful error message when the user gets it wrong. This is the hard part. Processing correct programs is easy. Successfully leading a user who has done something wrong to make it right is hard, and that's where you will really learn a lot about developer tool design as it is done in the real world.

I would recommend against using an automatic lexer/parser generator at this stage. Your personal goal is to learn the nuts and bolts of building a language, so build the language. Don't outsource the tedious parts to a parser generator. Parser generators are great for professionals to rapidly prototype languages, but nothing beats a hand-written lexer and recursive descent parser for building your skills and understanding how the language analysis really works.

Sixth step: If you're writing a compiled language, transform the parse tree into a different language; a compiler back end is a function whose input is a parse tree and whose output is a string in a different language.

If you're writing an interpreted language, write an interpreter that takes a parse tree and evaluates the program it represents. An interpreter is a function that takes a parse tree and an environment as inputs and modifies the environment.

Both of those are big topics.
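Still, the interpreter half can start very small. A minimal tree-walking evaluator over the tuple nodes from the parser sketch above might look like this; the environment is just a dict standing in for the language's memory:

    def execute(statements, env):
        """Run a list of statement nodes, modifying the environment."""
        for node in statements:
            if node[0] == "assign":
                _, name, expr = node
                env[name] = evaluate(expr, env)
            else:
                raise RuntimeError(f"unknown statement kind {node[0]!r}")

    def evaluate(expr, env):
        kind = expr[0]
        if kind == "number":
            return float(expr[1])
        if kind == "add":
            return evaluate(expr[1], env) + evaluate(expr[2], env)
        if kind == "id":                 # for when expressions mention variables
            return env[expr[1]]
        raise RuntimeError(f"unknown expression kind {kind!r}")

    env = {}
    execute(parse_program(lex("a = 5 + 3;")), env)
    print(env)   # {'a': 8.0}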

Are there any good tutorials, books, or open-source projects to start learning from?

There are a number of books on engineering modern compilers, and I'm sure you can find blog articles. I would mildly recommend against the Dragon Book at this stage in your career. It's a classic but it is also quite academic and I think you'd probably benefit from a more hands-on approach.

You might also look at implementations of existing languages that you admire. Python, C# and many other languages have their lexers and parsers open sourced, so take a look at them and see if you can figure out how they work. (The Python parser will be much easier to understand than the C# parser, FYI, but if you want a tour of the C# parser let me know and I can give you some pointers.)

Good luck! Designing and implementing new languages is incredibly fun and I hope you enjoy this challenge.


What is design? Making a lot of choices/decisions from the many options of a vastly larger design/problem space. Those choices lead to certain properties, advantages/benefits, limitations/constraints, etc., and most of them are trade-offs (for example, between convenience of implementation or use, versus execution performance, versus powerful features).

What is a programming language? Usually a textual notation serving as an interface for a human programmer to instruct some machine/device to operate in a certain way: essentially a textual machine interface for a human to use. There are other media, such as visual programming languages, regarding the design of the notation, but for now let's stay with text.

There are various types of (programming) languages. Typically, for general computation of formal logic by an automatic machine, one reference model is "Turing completeness": a set of properties/features a programming language needs in order to allow the standard meta-circular, self-modifying programmability via software (as opposed to non-programmable hard-wired devices or simple replay machinery, which may carry programming/software in the form of fixed runs of instructions that can be swapped out, but does not have software that can create or modify software). That's the theoretical concept; in practice, it means a "sufficient" programming language needs to have at least the foundational elements of

  1. logic (conditional logic)
  2. control (jumps)
  3. memory
  4. input
  5. output

as Robert C. Martin nicely condensed in his "Programming 101" video.

Logic means some mechanism to evaluate an expression and check its result. With information encoded in binary (per Shannon), and because we are mostly programming digital, i.e. binary, computers, at the hardware level this comes down to comparing two bits (electrical signals flowing into a transistor) for whether they are the same or not, and emitting a comparison result (whole chains of these are then combined into more complex logic/arithmetic gates). In the programming language, such evaluation of expressions that produce a binary result is what goes into the condition of an if, or determines whether a while/for loop should continue with the next iteration or stop.

Control is the mechanism for jumping to other places in the code located in memory. Depending on language design and/or programming style (whether one subscribes to Dijkstra or not), in unstructured programming, or in machine/assembly language anyway, it's goto; under structured programming, once an if has evaluated/computed/compared the binary result of some logic/arithmetic expression, a jump is made to the sequence of code associated with the then or else branch. Loops are just a jump back to the starting point of a sequence. Functions are just jumps to somewhere else (while storing on a stack where execution left off, so that at the end of the function it can return to the spot just after the call and continue).

Memory is for storing values that were the result of evaluating/computing an expression, so that these values can be looked up per named variable reference and fed into other expressions to be evaluated. It also takes what's entered as input. It also likely contains other code/instructions, so control may jump to them, based on the automated, programmed conditional rules that evaluate values as found in or stored into memory. The mechanisms described above also allow new code/software to be written and stored into memory, which can then be jumped to and executed.

Input could come from peripheral devices, onboard sensors like the clock, components hooked up to the bus, etc. This includes keyboard/mouse input, and reading data into memory from a hard disk (with the help of an operating system that provides a standardized "stream"/"file" interface abstracting the driver details away, or without one).

Output is how you obtain the results of the otherwise internal computation: usually coloring pixels on a screen, printing output on paper, sending something out into a network, things of that nature.
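To tie those five elements together, a few lines of ordinary Python touch every one of them:

    total = 0.0                      # memory: a stored, named value
    line = input("number? ")         # input: from the keyboard
    while line != "":                # logic: comparison; control: loop = jump back
        total = total + float(line)  # memory again: store the new result
        line = input("number? ")
    print(total)                     # output: show the internal result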

Depending on your design decisions, if you want your language to be of the "functional" type, you may not need to support "memory" per se in the language-as-interface. The Turing machine model maps onto the CPU hardware, but in the language you can rule out random jumps (where state set up for one context is carried into a completely different sequence of instructions that expects a different state context), avoid the side effects of sharing state/values across function contexts (as found in random spaghetti-code goto jumps, or encapsulated inside object values in object-oriented programming), and instead have all values passed as parameter arguments on the stack into functions, giving each its own localized value/state context. Let's call it "Church-"/"McCarthy-"complete. Instead of memory, it could all be input (hypothetically; likely very crude, and needing specialized hardware).

Input/output is similar, just a difference of direction.

So regarding design, the fundamental question is how you want to go about these essential elements: what to allow and enable, and what to prevent by deliberately not supporting it via a language construct. Language constructs, on top of these basic essentials, are features added to a language to save the programmer from doing the lower-level logistics themselves. What about variable names and assignment (value copy or reference)? What about allocating and freeing memory? Do you support first-class functions as expressions for interpretation? Is there a notion of arrays with dedicated access-operator syntax? How do you represent which data types (how many, or why have any at all?), and do you type-check statically or determine/switch types dynamically? How much do you offer built in, out of the box, taken care of in your language's implementation (as a base layer or "framework" for other software to build/rely upon), versus how much do you leave out and therefore offload to application or library programmers? A lot of questions like these, because for every element you've seen in a language, you could design it differently, leave it out, or add other ones.

Programming languages have the levels of syntax, which is the notation for structuring the text; semantics, which is the grammar of allowed/reasonable expressions; and constructs, which are your language features.

Syntax, as the textual human interface of notation and structuring, is what a programmer writes your language in, and what your interpreter or compiler is going to read. In order to execute any instructions using the primitive elements of a target machine, the source syntax gets parsed/tokenized. For example, you can throw out all whitespace and comments (if your language supports any), because the machine doesn't understand and has no need for information that targets other human readers of the source. Likely you're going to assign internal "types" to the tokens/words found in the source input: for example, whether something is a name/"symbol" or a literal (say, a value 5 found somewhere in the code can clearly be identified as a number literal, not a constant/variable). You categorize consecutive characters that belong together and determine their general type (independent of implementation and internal representation, just from their external textual form), so you can later analyze whether the source code supplies the expected textual token categories in the required places and order. At this stage, you're "deserializing" the source code into your internal representation inside your compiler/interpreter front end, so you get an "abstract syntax tree" (like the DOM, the document object model a browser builds from a parsed HTML page), on which your compiler/interpreter software can perform checks and translations (the syntax, which was just the human interface, is thrown away).
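As a sketch of such an internal representation in Python (the node class names here are made up for illustration), the statement a = 5 might deserialize into:

    from dataclasses import dataclass

    @dataclass
    class NumberLiteral:        # token categorized as a number literal
        value: float

    @dataclass
    class Name:                 # token categorized as a name/"symbol"
        identifier: str

    @dataclass
    class Assignment:           # a node of the abstract syntax tree
        target: Name
        value: NumberLiteral

    # "a = 5": the syntax (spaces, the '=' character) is thrown away;
    # only the structure and the categorized values remain.
    tree = Assignment(target=Name("a"), value=NumberLiteral(5.0))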

Semantics, as the grammar, is what you check next. Not all statements in your language are valid. Some you might want to reject because the feature would be too difficult or expensive to implement, others because the user applied the syntax/notation incorrectly (and there's a better/valid way to express the desired operation), and so on. For 5 = a, you decide: is this a valid assignment (first the numeric literal 5, then the assignment operator =, then the variable name a), or a conditional expression testing whether the variable a has the value 5? Are conditionals allowed to stand alone as statements with no assignment, or only expected/allowed in certain places (in which case you check whether this appears inside an if condition, or preceded by an assignment like b := 5 = a, and so on)? That is, if your language supports assignment at all. Likewise, when names like a or b are used, you check whether they have been declared/defined before, or whether your language allows them to be defined later (in which case you go back and do type checking later, if types are not dynamic), and so on. The goal is to end up with a checked, validated internal representation of the code in memory, annotated with your internal information about what these tokens are, because that is what gets fed into the interpreter for execution, into your compiler back end to generate CPU machine-code instructions, or into a transpiler targeting another language (where you start adding the different syntax of the target language).
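Continuing the illustrative node classes from the sketch above, a semantic check for the 5 = a case could be as simple as inspecting the node types (again, just a sketch of the idea):

    def check_assignment(node, declared_names):
        """Reject statements the parser accepted but that make no sense."""
        if not isinstance(node.target, Name):
            raise TypeError("left side of '=' must be a variable name, not a literal")
        if isinstance(node.value, Name) and node.value.identifier not in declared_names:
            raise NameError(f"'{node.value.identifier}' used before being defined")
        declared_names.add(node.target.identifier)

    check_assignment(Assignment(target=Name("a"), value=NumberLiteral(5.0)), set())  # fine
    check_assignment(Assignment(target=NumberLiteral(5.0), value=Name("a")), set())  # TypeError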

Often this involves some stages of optimization, which may replace the expressions written by the human programmer with a more performant, equivalent internal alternative. But this complicates matters quite a bit, so better not to start out with too much of it; as Knuth said, premature optimization is the root of all evil.

Once you have translated the text in the syntax of your language into an internal model that is validated to use your features/constructs only in the grammatically expected order and fashion, it's time to move from the design space to the implementation of the language. If it's only about language design, that can be done theoretically and on paper, just as there is pseudocode, and hypothetical languages or abstract notations that serve as a notational standard without a corresponding implementation (or with multiple ones, potentially with different syntax). Of course, it makes sense to consider in the design the difficulty, performance, and memory cost of implementing constructs, as well as notational convenience for the human interface (syntactic sugar), etc.

In the implementation, you take the cleaned-up, checked-to-be-valid internal representation of the former source code and "translate"/map it to your target. If you're building an interpreter for your language, your syntax for the addition operator might have been the character +, or the command keyword add, or the command keyword PLUS, or whatever. When the interpreter implementation encounters this instruction, a lookup table might recognize the addition operation and dispatch execution to the addition operation of the host language. As addition expects two operands, the internal model must supply two operands to pass along with it (the earlier semantics check made sure that every addition operation is supplied with exactly two), and the implementation will likely need to store the result somewhere, as later code might want to read/compare it. If the operands are variable names the interpreter created in its own memory earlier, and earlier source code stored values into them, now it's time to look up and retrieve the values currently stored there and pass them on to the host language. If internally you have variable name a, assignment operator, literal value 5, the interpreter code should allocate memory for an integer, store the value 5 into it, and record the name "a" as a reference to this memory for later lookup. If there's a print instruction, you may forward the string that has been validated to be its argument to your host language: if that host were Java, you could load the string from the source code's string literal into a variable internal to the interpreter (say, printText) and then pass it into System.out.print(printText);.
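A toy version of that dispatch idea in Python (standing in for the Java host of the example; all names here are illustrative):

    import operator

    # Lookup table: the language's operator spellings map to host operations.
    OPERATIONS = {"+": operator.add, "add": operator.add, "PLUS": operator.add}

    env = {}   # the interpreter's own memory: variable name -> stored value

    def run(instruction, *operands):
        if instruction == "=":           # a = 5: store the value under the name
            name, value = operands
            env[name] = value
        elif instruction == "print":     # forward output to the host language
            print(operands[0])
        else:                            # dispatch arithmetic via the table
            return OPERATIONS[instruction](*operands)

    run("=", "a", run("+", 5, 3))        # evaluate 5 + 3, store it as "a"
    run("print", env["a"])               # look up "a" again and print 8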

If the target of the corresponding "back end" is assembly code, or directly the machine code of a CPU, the instructions of that target language need to be written into a file. To the extent these go toward lower-level machine details, more of the memory logistics and jump coordination has to be generated by your compiler back-end implementation.

Regarding your questions: parsing is technically done by reading source code in your language one character at a time. Say you read the first character. Is it whitespace? That might be irrelevant: ignore it and continue with the next character. Do you find a special character like #, a digit like 5, or a letter like a? In designing your syntax/notation, you can dedicate characters to have a special meaning (then often also needing an escape notation to cancel the special meaning and get the character as an ordinary literal/value). These could be operators, special features, or similar. If, for example, # starts a preprocessor directive, you may continue reading characters and collect them into a token for as long as you find more letters, and stop when whitespace, a digit, or another special character is found. Do you have a constraint that says the # must be immediately followed by letters naming the preprocessor directive, and is not allowed to be followed by a digit, whitespace, or another special character? Checking this enforces the notational/syntactical demands, and at the same time collects the source into an internal representation: for example, a language item marked as a preprocessor directive with the name "include" or "define". At this stage you may not yet need to check whether you support/implement the directive name; the parsing stage often first just determines what the input tokens are (their type), collects the values/words, and strips away whitespace and the like.

The parsed/tokenized result can then be analyzed by looping over it: each time you find an item marked as a preprocessor directive, you check whether it is one of the names you support (if not: error?). If you encounter an operator that was extracted from an input of ! or NOT, you may dispatch further checking to whether the correct number of operands of a compatible type were supplied. Doing this on tokens, with whitespace stripped away, means you no longer need to do character/text handling at this later stage, as you have all the values/symbols available in cleaned-up form. Parsing is not much different from "deserializing" CSV or XML or HTML or JSON into an internal representation in memory, in order to apply operations to the cleaned internal structures instead of the source text. Your implementation code can also maintain lookup indexes and restructure the items, which would be much harder to do on a linear text stream (which might be read-only and/or unidirectional). Language code just has a few more internal name or address references to check and potentially resolve (at the time needed, which could be at runtime, to get a value referenced by a name).
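A sketch of that character-at-a-time collection for the # example; the rules enforced here are the hypothetical ones just described:

    def scan_directive(text, pos):
        """After '#', collect letters into a directive token, e.g. '#include'."""
        if text[pos] != "#":
            raise SyntaxError("not a directive")
        pos += 1
        if pos >= len(text) or not text[pos].isalpha():
            raise SyntaxError("'#' must be followed immediately by a name")
        start = pos
        while pos < len(text) and text[pos].isalpha():
            pos += 1
        # type + value, ready for the later stage that checks supported names
        return ("directive", text[start:pos]), pos

    print(scan_directive("#include <stdio.h>", 0))
    # (('directive', 'include'), 8)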

OK, some time ago I made a few videos -- Reading & Processing Data and Text Input, Structuring Data and Text Inline, Tokenization & Lexical Analysis, and Parsers, Interpretation & Control Flow -- which are not that great, but hopefully illustrate that one can do this rather easily oneself with regular, standard programming as done for many other applications. In particular, don't get confused into thinking you must deploy lex/yacc, or build a formal grammar in some notation (EBNF) that itself needs to be written in a grammar notation and then parsed again, or handle the complications of more complex notations or expression trees. For example, with the standard mathematical notation of arithmetic with the operator in the middle of the two operands, and with some operators being commutative and others not, a little extra effort is needed to parse the entire expression before evaluating it. Similarly with many language constructs: without a context-free grammar or other simplifying factors, it could well be that a long statement ends with a keyword that changes its meaning entirely, at which point one has to backtrack and fix up or mark earlier statements (single-pass, or more than one pass?). This is where a lot of optimization effort is spent, as these scenarios can be performance-expensive, whether at compile time or during interpreter runtime. But for learning language design, you can simplify by "cheating", for example by demanding Polish notation.
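For instance, evaluating Polish (prefix) notation needs no precedence rules and no backtracking, because the operator always arrives before its operands. A minimal sketch in Python:

    def eval_prefix(tokens):
        """Evaluate prefix notation: the first token decides everything."""
        token = tokens.pop(0)
        if token == "+":
            return eval_prefix(tokens) + eval_prefix(tokens)
        if token == "*":
            return eval_prefix(tokens) * eval_prefix(tokens)
        return float(token)

    print(eval_prefix("+ 5 * 3 2".split()))   # 5 + (3 * 2) = 11.0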

Separate from language design, I found "Language Implementation Patterns -- Create Your Own Domain-Specific and General Programming Languages" by Terence Parr a great help for the parsing part, as it starts out describing the mechanisms using familiar programming-language constructs. There are a few more beginner-friendly, hands-on (less academic theory) books out there, but I haven't found the time yet to read & review them :-(

For language design beyond just learning, it's more about what style or features to support and encourage, and what to take away by preventing, in order to make an interface for humans to express instructions to machines in a certain way for a particular purpose/application. This can include considerations and trade-offs in the areas of security, safety/reliability, power, correctness, convenience, performance, ease of implementation, interoperability, portability, etc. For many features and constraints, there are long-running conceptual debates and examples/experiments, unfortunately often cluttered and obfuscated by additional optimization techniques and theoretical examination.

As for syntax design: you may want to play around, which affords some insight into why existing notations are the way they are (names not allowed to start with a digit, just to make tokenization easier; numerical literals not allowed to contain arbitrary text characters, so if the first character is a letter you can assume it's not a numerical literal; etc.). Of course, you can go against that and invent & implement some other, new rules. However, new syntax should only be introduced where there's a really good reason/justification: a real benefit in convenience, conciseness, ease of implementation, compatibility, and so on. Otherwise, there are already too many languages/notations that introduced different notation for the same thing, for no better reason than being different and incompatible with everything else. Just consider the "enumerated"/iterator foreach: if there's an equivalent construct with a known notation in some other language programmers are already familiar with, no benefit is gained from arbitrarily demanding a different notation. Notwithstanding the real benefits of syntactic sugar, since languages, being textual interfaces or domain-specific languages, can help express something more clearly and concisely. But in general, for language design, syntax is largely a solved problem, thanks to parsers and grammars for parsers; notation could even simply become a feature/construct of the language itself, as every feature creeps into every language eventually. And for cross-compilation/transpiling, Language Servers, and the like, the implementation job is in any case to undo the notational interface and work with the constructs according to the grammar/semantics.

For paradigm, semantics, and/or construct design, one can look at the different approaches of current, new, and historical languages: what advantages they offer for which purposes, where their respective limitations are, or, worse, their design flaws. This is a big and serious area, because for production languages, once investments have been made and code has been written, maintaining and supporting it can lead to quite some trouble. Conversely, maybe something can be done design-wise to mitigate various issues, etc.
