Parser

1. Introduction

This is an implementation of a literate programming system in D. The goal is to be able to create books that one can read on a website, with chapters, subchapters, and sections, and additionally to be able to compile the code from the book into a working program.

Literate proogramming aims to make the source code of a program understandable. The program can be structured in any way the programmer likes, and the code should be explained.

The source code for a literate program will somewhat resemble CWEB, but differ in many key ways which simplify the source code and make it easier to read. Literate will use @ signs for commands and markdown to style the prose.

2. Directory Structure

A Literate program may be just a single file, but it should also be possible to make a book out of it, with chapters and possibly multiple programs in a single book. If the literate command line tool is run on a single file, it should compile that file, if it is run on a directory, it should search for the Summary.lit file in the directory and create a book.

What should the directory structure of a Literate book look like? I try to mimic the Gitbook software here. There will be a Summary.lit file which links to each of the different chapters in the book. An example Summary.lit file might look like this:

@title Title of the book

[Chapter 1](chapter1/intro.lit)
    [Subchapter 1](chapter1/example1.lit)
    [Subchapter 2](chapter1/example2.lit)
[Chapter 2](section2/intro.lit)
    [Subchapter 1](chapter2/example1.lit)

Sub chapters are denoted by tabs, and each chapter is linked to the correct .lit file using Markdown link syntax.

3. The Parser

As a first step, I'll make a parser for single chapters only, and leave having multiple chapters and books for later.

The parser will have 2 main parts to it: the which represent the various structures in a literate program, and the parse function.

{parser.d 3}

{Imports, 3}
{Classes, 4}
{Parse functions, 12}

I'll quickly list the imports here.

{Imports 3}

import globals;
import std.stdio;
import util;
import std.string: split, endsWith, startsWith, chomp, replace, strip;
import std.algorithm: canFind;
import std.regex: matchAll, matchFirst, regex, ctRegex, splitter;
import std.conv;
import std.path: extension;
import std.file;

4.

Now we have to define the classes used to represent a literate program. There are 7 such classes:

{Classes 4}

{Line class, 10}
{Command class, 9}
{Block class, 8}
{Section class, 7}
{Chapter class, 6}
{Program class, 5}
{Change class, 11}

Used in section 3

5. The Program Class

What is a literate program at the highest level? A program has multiple chapters, it has a title, and it has various commands associated with it (although some of these commands may be overwritten by chapters or even sections). It also has the file it originally came from.

{Program class 5}

class Program {
    public string title;
    public Command[] commands;
    public Chapter[] chapters;
    public string file;
    public string text;

    this() {
        commands = [];
        chapters = [];
    }
}

Used in section 4

6. The Chapter class

A chapter is very similar to a program. It has a title, commands, sections, and also an original file. In the case of a single file program (which is what we are focusing on for the moment) the Program's file and the Chapter's file will be the same. A chapter also has a minor number and a major number;

{Chapter class 6}

class Chapter {
    public string title;
    public Command[] commands;
    public Section[] sections;
    public string file;

    public int majorNum;
    public int minorNum;

    this() {
        commands = [];
        sections = [];
    }

    string num() {
        if (minorNum != 0) {
            return to!string(majorNum) ~ "." ~ to!string(minorNum);
        } else {
            return to!string(majorNum);
        }
    }
}

Used in section 4

7. The Sections class

A section has a title, commands, a number, and a series of blocks, which can either be blocks of code, or blocks of prose.

{Section class 7}

class Section {
    public string title;
    public Command[] commands;
    public Block[] blocks;
    public int num;

    this() {
        commands = [];
        blocks = [];
    }
}

Used in section 4

8. The Block Class

A block is more interesting. It can either be a block of code, or a block of prose, so it has a boolean which represents what type it is. It also stores a start line. If it is a code block, it also has a name. Finally, it stores an array of lines, and has a function called text() which just returns the string of the text it contains. A block also contains a codeType and a commentString.

{Block class 8}

class Block {
    public int startLine;
    public string name;
    public bool isCodeblock;
    public bool isRootBlock;
    public Line[] lines;

    public string codeType;
    public string commentString;

    public Modifier[] modifiers;

    this() {
        lines = [];
        modifiers = [];
    }

    string text() {
        string text = "";
        foreach (line; lines) {
            text ~= line.text ~ "\n";
        }
        return text;
    }

    Block dup() {
        Block b = new Block();
        b.startLine = startLine;
        b.name = name;
        b.isCodeblock = isCodeblock;
        b.codeType = codeType;
        b.commentString = commentString;
        b.modifiers = modifiers;

        foreach (Line l; lines) {
            b.lines ~= l.dup();
        }

        return b;
    }
}

Used in section 4

9. The Command Class

A command is quite simple. It has a name, and any arguments that are passed.

{Command class 9}

class Command {
    public string name;
    public string args;
    public int lineNum;
    public string filename;
    this() {}
}

Used in section 4

10. The Line Class

A line is the lowest level. It stores the line number, the file the line is from, and the text for the line itself.

{Line class 10}

class Line {
    public string file;
    public int lineNum;
    public string text;

    this(string text, string file, int lineNum) {
        this.text = text;
        this.file = file;
        this.lineNum = lineNum;
    }

    Line dup() {
        return new Line(text, file, lineNum);
    }
}

Used in section 4

11. The Change Class

The change class helps when parsing a change statement. It stores the file that is being changed, what the text to search for is and what the text to replace it with is. These two things are arrays because you can make multiple changes (search and replaces) to one file. In order to keep track of the current change, an index is also stored.

{Change class 11}

class Change {
    public string filename;
    public string[] searchText;
    public string[] replaceText;
    public int index;

    this() {
        searchText = [];
        replaceText = [];
        index = 0;
    }
}

Used in section 4

That's it for the classes. These 7 classes can be used to represent an entire literate program. Now let's get to the actual parse function to turn a text file into a program.

12. The Parse Function

Here we have two functions: parseProgram and parseChapter.

{Parse functions 12}

{parseProgram function, 13}
{parseChapter function, 14}

Used in section 3

13. parseProgram function

This function takes a literate book source and parses each chapter and returns the final program.

Here is an example book:

@title Title of the book

[Chapter 1](chapter1/intro.lit)
    [Subchapter 1](chapter1/example1.lit)
    [Subchapter 2](chapter1/example2.lit)
[Chapter 2](section2/intro.lit)
    [Subchapter 1](chapter2/example1.lit)

{parseProgram function 13}

Program parseProgram(Program p, string src) {
    string filename = p.file;
    bool hasChapters;

    string[] lines = src.split("\n");
    int lineNum;
    int majorNum;
    int minorNum;
    foreach (line; lines) {
        lineNum++;

        if (line.startsWith("@title")) {
            p.title = strip(line[6..$]);
        } else if (line.startsWith("@book")) {
            continue;
        } else if (auto matches = matchFirst(line, regex(r"\[(?P<chapterName>.*)\]\((?P<filepath>.*)\)"))) {
            if (matches["filepath"] == "") {
                error(filename, lineNum, "No filepath for " ~ matches["chapterName"]);
                continue;
            }
            if (leadingWS(line).length > 0) {
                minorNum++;
            } else {
                majorNum++;
                minorNum = 0;
            }
            Chapter c = new Chapter();
            c.file = matches["filepath"];
            c.title = matches["chapterName"];
            c.majorNum = majorNum;
            c.minorNum = minorNum;

            p.chapters ~= parseChapter(c, readall(File(matches["filepath"])));
            hasChapters = true;
        } else {
            p.text ~= line ~ "\n";
        }
    }

    return p;
}

Used in section 12

14. parseChapter function

The parseChapter function is the more complex one. It parses the source of a chapter. Before doing any parsing, we resolve the @include statements by replacing them with the contents of the file that was included. Then we loop through each line in the source and parse it, provided that it is not a comment (starting with //);

{parseChapter function 14}

Chapter parseChapter(Chapter chapter, string src) {
    {Initialize some variables, 15}

    string[] blocks = [];

    string include(string file) {
        if (file == filename) {
            error(filename, 1, "Recursive include");
            return "";
        }
        if (!exists(file)){
            error(filename, 1, "File " ~ file ~ " does not exist");
            return "";
        }
        return readall(File(file));
    }

    // Handle the @include statements
    src = std.regex.replaceAll!(match => include(match[1]))(src, regex(`\n@include (.*)`));
    string[] lines = src.split("\n");

    int lineNum = 0;
    foreach (line; lines) {
        lineNum++;

        if (strip(line).startsWith("//") && !inCodeblock) {
            continue;
        }

        {Parse the line, 16}
    }
    {Close the last section, 22}

    return chapter;
}

Used in section 12

15. The Parse Function Setup

For the initial variables, it would be nice to move the value for chapter.file into a variable called filename. Additionally, I'm going to need an array of all the possible commands that are recognized.

{Initialize some variables 15}

string filename = chapter.file;
string[] commands = ["@code_type", "@comment_type", "@compiler", "@error_format", 
                     "@add_css", "@overwrite_css", "@colorscheme", "@include"];

Added to in section 15

Used in section 14

I also need to keep track of the current section that is being parsed, and the current block that is being parsed, because the parser is going through the file one line at a time. I'll also define the current change being parsed.

{Initialize some variables 15} +=

Section curSection;
int sectionNum = 0;
Block curBlock;
Change curChange;

Added to in section 15

Used in section 14

Finally, I need 3 flags to keep track of if it is currently parsing a codeblock, a search block, or a replace block.

{Initialize some variables 15} +=

bool inCodeblock = false;
bool inSearchBlock = false;
bool inReplaceBlock = false;

Added to in section 15

Used in section 14

16. Parsing the Line

When parsing a line, we are either inside a code block, or inside a prose block, or we are transitioning from one to the other. So we'll have an if statement to separate the two.

{Parse the line 16}

if (!inCodeblock) {
    // This might be a change block
    {Parse change block, 23}
    {Parse a command, 16}
    {Parse a title command, 16}
    {Parse a section definition, 17}
    {Parse the beginning of a code block, 18}
    else if (curBlock !is null) {
        if (line.split().length > 1) {
            if (commands.canFind(line.split()[0])) {
                continue;
            }
        }
        {Add the line to the list of lines, 21}
    }
} else if (startsWith(line, "---")) {
    {Begin a new prose block, 20}
} else if (curBlock !is null) {
    {Add the line to the list of lines, 21}
}

Used in section 14

Parsing a command and the title command are both fairly simple, so let's look at them first.

To parse a command we first make sure that there is the command name, and any arguments. Then we check if the command is part of the list of commands we have. If it is, we create a new command object, fill in the name and arguments, and add it to the chapter object.

We also do something special if it is a @include command. For these ones, we take the file read it, and parse it as a chapter (using the parseChapter function). Then we add the included chapter's sections to the current chapter's sections. In this case, we don't add the @include command to the list of chapter commands.

{Parse a command 16}

if (line.split().length > 1) {
    if (commands.canFind(line.split()[0])) {
        Command cmd = new Command();
        cmd.name = line.split()[0];
        auto index = cmd.name.length;
        cmd.args = strip(line[index..$]);
        cmd.lineNum = lineNum;
        cmd.filename = filename;
        if (cmd.args == "none") {
            cmd.args = "";
        }

        if (curSection is null) {
            chapter.commands ~= cmd;
        } else {
            curSection.commands ~= cmd;
        }
    }
}

Parsing an @title command is even simpler.

{Parse a title command 16}

if (startsWith(line, "@title")) {
    chapter.title = strip(line[6..$]);
}

17. Parsing a Section Definition

When a new section is created (using @s), we should add the current section to the list of sections for the chapter, and then we should create a new section, which becomes the current section.

{Parse a section definition 17}

else if (startsWith(line, "@s")) {
    if (curBlock !is null && !curBlock.isCodeblock) {
        if (strip(curBlock.text()) != "") {
            curSection.blocks ~= curBlock;
        }
    } else if (curBlock !is null && curBlock.isCodeblock) {
        error(chapter.file, curBlock.startLine, "Unclosed block {" ~ curBlock.name ~ "}");
    }
    // Make sure the section exists
    if (curSection !is null) {
        chapter.sections ~= curSection;
    }
    curSection = new Section();
    curSection.title = strip(line[2..$]);
    curSection.commands = chapter.commands ~ curSection.commands;
    curSection.num = ++sectionNum;

    curBlock = new Block();
    curBlock.isCodeblock = false;
}

Used in section 16

18. Parse the Start of a Codeblock

Codeblocks always begin with --- title, so we can use the regex ^---.+ to represent this. Once a new codeblock starts, the old one must be appended to the current section's list of blocks, and the current codeblock must be reset.

{Parse the beginning of a code block 18}

else if (matchAll(line, regex("^---.+"))) {
    if (curSection is null) {
        error(chapter.file, lineNum, "You must define a section with @s before writing a code block");
        continue;
    }

    if (curBlock !is null) {
        curSection.blocks ~= curBlock;
    }
    curBlock = new Block();
    curBlock.startLine = lineNum;
    curBlock.isCodeblock = true;
    curBlock.name = strip(line[3..$]);

    {Parse Modifiers, 19}

    if (blocks.canFind(curBlock.name)) {
        if (!curBlock.modifiers.canFind(Modifier.redef) && !curBlock.modifiers.canFind(Modifier.additive)) {
            error(filename, lineNum, "Redefinition of {" ~ curBlock.name ~ "}, use ':=' to redefine");
        }
    } else {
        blocks ~= curBlock.name;
    }

    foreach (cmd; curSection.commands) {
        if (cmd.name == "@code_type") {
            curBlock.codeType = cmd.args;
        } else if (cmd.name == "@comment_type") {
            curBlock.commentString = cmd.args;
        }
    }

    inCodeblock = true;
}

Used in section 16

19. Check for and extract modifiers.

Modifier format for a code block: --- Block Name --- noWeave +=. The checkForModifiers ugliness is due to lack of (?|...) and friends.

First half matches for expressions with modifiers:

(?P<namea>\S.*) : Keep taking from the first non-whitespace character ...
[ \t]-{3}[ \t] : Until it matches ---
(?P<modifiers>.+) : Matches everything after the separator.

Second half matches for no modifiers: Ether Block name and with a floating separator Block Name ---.

|(?P<nameb>\S.*?) : Same thing as #1 but stores it in nameb
[ \t]*? : Checks for any amount of whitespace (Including none.)
(-{1,}$ : Checks for any floating - and verifies that nothing else is there untill end of line.
|$)) : Or just checks that there is nothing but the end of the line after the whitespace.

Returns ether namea and modifiers or just nameb.

{Parse Modifiers 19}

auto checkForModifiers = ctRegex!(`(?P<namea>\S.*)[ \t]-{3}[ \t](?P<modifiers>.+)|(?P<nameb>\S.*?)[ \t]*?(-{1,}$|$)`);
auto splitOnSpace = ctRegex!(r"(\s+)");
auto modMatch = matchFirst(curBlock.name, checkForModifiers);

// matchFirst returns unmatched groups as empty strings

if (modMatch["namea"] != "") {
    curBlock.name = modMatch["namea"];
} else if (modMatch["nameb"] != ""){
    curBlock.name = modMatch["nameb"];
    // Check for old syntax.
    if (curBlock.name.endsWith("+=")) {
        curBlock.modifiers ~= Modifier.additive;
        curBlock.name = strip(curBlock.name[0..$-2]);
    } else if (curBlock.name.endsWith(":=")) {
        curBlock.modifiers ~= Modifier.redef;
        curBlock.name = strip(curBlock.name[0..$-2]);
    }
} else {
    error(filename, lineNum, "Something went wrong with: " ~ curBlock.name);
}

if (modMatch["modifiers"]) {
    foreach (m; splitter(modMatch["modifiers"], splitOnSpace)) {
        switch(m) {
            case "+=":
                curBlock.modifiers ~= Modifier.additive;
                break;
            case ":=":
                curBlock.modifiers ~= Modifier.redef;
                break;
            case "noWeave":
                curBlock.modifiers ~= Modifier.noWeave;
                break;
            case "noTangle":
                curBlock.modifiers ~= Modifier.noTangle;
                break;
            default:
                error(filename, lineNum, "Invalid modifier: " ~ m);
                break;
        }
    }

}

Used in section 18

20. Parse the End of a Codeblock

Codeblocks end with just a ---. When a codeblock ends, we do the same as when it begins, except the new block we create is a block of prose as opposed to code.

{Begin a new prose block 20}

if (curBlock !is null) {
    curSection.blocks ~= curBlock;
}
curBlock = new Block();
curBlock.startLine = lineNum;
curBlock.isCodeblock = false;
inCodeblock = false;

Used in section 16

21. Add the current line

Finally, if the current line is nothing interesting, we just add it to the current block's list of lines.

{Add the line to the list of lines 21}

curBlock.lines ~= new Line(line, filename, lineNum);

Used in sections 16 and 16

Now we're done parsing the line.

22. Closing the last section

When the end of the file is reached, the last section has not been closed and added to the chapter yet, so we should do that. Additionally, if the last block is a prose block, it should be closed and added to the section first. If the last block is a code block, it should have been closed with ---. If it wasn't we'll throw an error.

{Close the last section 22}

if (curBlock !is null) {
    if (!curBlock.isCodeblock) {
        curSection.blocks ~= curBlock;
    } else {
        writeln(filename, ":", lineNum - 1, ":error: {", curBlock.name, "} is never closed");
    }
}
if (curSection !is null) {
    chapter.sections ~= curSection;
}

Used in section 14

23. Parsing the change block

Parsing a change block is somewhat complex. Change blocks look like this:

@change file.lit

Some comments here...

@replace
replace this text
@with
with this text
@end

More comments ...

@replace
...
@with
...
@end

...

@change_end

You can make multiple changes on one file. We've got two nice flags for keeping track of which kind of block we are in: replaceText or searchText.

{Parse change block 23}

// Start a change block
if (startsWith(line, "@change") && !startsWith(line, "@change_end")) {
    curChange = new Change();
    curChange.filename = strip(line[7..$]);
    continue;
} else if (startsWith(line, "@replace")) {
    // Begin the search block
    curChange.searchText ~= "";
    curChange.replaceText ~= "";
    inReplaceBlock = false;
    inSearchBlock = true;
    continue;
} else if (startsWith(line, "@with")) {
    // Begin the replace block and end the search block
    inReplaceBlock = true;
    inSearchBlock = false;
    continue;
} else if (startsWith(line, "@end")) {
    // End the replace block
    inReplaceBlock = false;
    inSearchBlock = false;
    // Increment the number of changes
    curChange.index++;
    continue;
} else if (startsWith(line, "@change_end")) {
    // Apply all the changes
    string text = readall(File(curChange.filename));
    foreach (i; 0 .. curChange.index) {
        text = text.replace(curChange.searchText[i], curChange.replaceText[i]);
    }
    Chapter c = new Chapter();
    c.file = curChange.filename;
    // We can ignore these, but they need to be initialized
    c.title = "";
    c.majorNum = -1;
    c.minorNum = -1;
    Chapter includedChapter = parseChapter(c, text);
    // Overwrite the current file's title and add to the commands and sections
    chapter.sections ~= includedChapter.sections;
    chapter.commands ~= includedChapter.commands;
    chapter.title = includedChapter.title;
    continue;
}

// Just add the line to the search or replace text depending
else if (inSearchBlock) {
    curChange.searchText[curChange.index] ~= line ~ "\n";
    continue;
} else if (inReplaceBlock) {
    curChange.replaceText[curChange.index] ~= line ~ "\n";
    continue;
}

Used in section 16