Documenting ŊCv9
Most of my earlier conlangs have been documented using LaTeX. In fact, I’d imagine many conlangers feel attracted to it because it excels at typesetting printed documents to a degree that few tools of reasonable cost do. However, it has some downsides:
- It outputs only PDFs (to a reasonable approximation) and nothing that can be readily browsed on a computer. In fact, I’ve found it annoying to look up something about a language in a PDF grammar.
- As a result, it’s bound to discrete pages, so a large table is likely to appear out of line, separate from the text that references it (“see Table 44”). I find this hard to read when it comes to inflection tables and the like. Without discrete pages (as on a web page), such objects can be included inline.
- Copying text from a PDF generated by LaTeX results in garbage like

  > The experiential aspect is constructed with the verb ⟨ŋačat⟩ in the past tense (and the
  > perfective aspect).

  (note the spurious newline between “the” and “perfective”), and searching is similarly broken.
- LaTeX often emits somewhat cryptic error messages, along with a lot of debug output.
Dictionary format
For a long time, my conlang documentation has used an ad-hoc text-based dictionary format, which looks like this:
# relten
: nc
@l riltes
mist
# cfiþar
: nc
leaf, page
\textsf{anten cfjoþes} (lit.~\emph{on the leaves of time}) sometimes
# crîþ
: nt
@s clîþic
forest
# flarþ
: nc
@s flalþic
metal
# trešil
: nc
park, garden, field
\textsf{sividin trešil} lit.~\emph{coward's garden} refuge, sanctuary (usually with a negative connotation)
These are converted using a Raku script into LaTeX code, which is included in the main grammar; a rough sketch of this conversion appears after the following list. Of course, the format has some downsides that I’d like to fix in an improved version:
- markup for definition is tightly coupled to LaTeX, not to mention whatever commands are defined in the main document
- no built-in support for examples or idioms
- metadata for the dictionary (such as styling or collation rules) not only shows up in a separate file (conventionally named `options.json`) but also in the grammar source (namely for the usage notes)
- there is code to automatically decline nouns, but it’s in yet another separate file; ideally, it would be defined in the dictionary file itself (or at least the main metadata file), although this would probably be a complex endeavor
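For illustration, here is a rough Python sketch of the kind of conversion the Raku script performs; this is not the actual script, and the `\entry` command and the field handling are invented for the example.

```python
SAMPLE = """\
# relten
: nc
@l riltes
mist
# crîþ
: nt
@s clîþic
forest
"""

def parse_entries(text):
    """Split the ad-hoc dictionary format into a list of entry dicts."""
    entries = []
    for line in text.splitlines():
        if line.startswith("# "):
            entries.append({"headword": line[2:], "pos": "", "tags": {}, "definition": []})
        elif line.startswith(": "):
            entries[-1]["pos"] = line[2:]
        elif line.startswith("@"):
            tag, _, value = line[1:].partition(" ")
            entries[-1]["tags"][tag] = value
        elif line.strip():
            entries[-1]["definition"].append(line)
    return entries

def to_latex(entry):
    """Emit one entry as (hypothetical) LaTeX markup."""
    body = " ".join(entry["definition"])
    return f"\\entry{{{entry['headword']}}}{{{entry['pos']}}}{{{body}}}"

for e in parse_entries(SAMPLE):
    print(to_latex(e))
```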
Proposed format
The file should have a metadata section, followed by a data section.
Metadata should include:
- Collation rules
- Usage notes for dictionary
- Explanation of part-of-speech tags: can be as simple as `nc` = “celestial-gender noun” or as complex as `vt(p:i)` = “transitive verb whose pre-thematic vowel changes to 〈i〉 in the past tense” (with support for general rules that can produce such explanations; a rough sketch follows this list). In HTML output, also mark this in entry POS tags using the `<abbr>` tag.
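As a sketch of what such general rules could look like (the rule tables and regexes below are my own assumptions, not a settled design):

```python
import re

# Hypothetical lookup tables; the real metadata section would define these.
GENDERS = {"c": "celestial"}          # other gender letters would be added here
BASE = {"n": "noun", "vt": "transitive verb"}

def explain_pos(tag: str) -> str:
    """Expand a POS tag such as 'nc' or 'vt(p:i)' into a readable explanation."""
    m = re.fullmatch(r"n([a-z])", tag)
    if m and m.group(1) in GENDERS:
        return f"{GENDERS[m.group(1)]}-gender noun"
    m = re.fullmatch(r"(vt)\(p:(\w+)\)", tag)
    if m:
        return (f"{BASE[m.group(1)]} whose pre-thematic vowel "
                f"changes to 〈{m.group(2)}〉 in the past tense")
    return tag  # unknown tags fall through unchanged

print(explain_pos("nc"))       # celestial-gender noun
print(explain_pos("vt(p:i)"))  # transitive verb whose pre-thematic vowel changes to 〈i〉 …
```

In HTML output, the generated explanation could then populate the `title` attribute of the `<abbr>` element.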
Data format:
# headword
: part of speech
< etymology
@tag1 first tag
@tag2 second tag
@tag3 third tag; note: only one value allowed per tag
Definition.
This can include multiple lines and should support some *basic* markup.
For examples or idioms, as shown below, the phrase in the target language comes first, separated from the translation by an equals sign. An optional explanation can be provided afterwards, separated by a pipe.
% headword sucks = example | explanation
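A minimal sketch of how such an example/idiom line might be split into its parts, assuming `=` and `|` don’t otherwise occur in the line:

```python
def parse_example(line: str):
    """Split '% phrase = translation | explanation' into its three parts."""
    body = line.lstrip("% ").strip()
    phrase, _, rest = body.partition("=")
    translation, _, explanation = rest.partition("|")
    return phrase.strip(), translation.strip(), explanation.strip() or None

print(parse_example("% headword sucks = example | explanation"))
# ('headword sucks', 'example', 'explanation')
```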
Data-driven inflections?
TL;DR: Probably not going to work well.
Terminology (a rough data-structure sketch follows this list):
- A paradigm encodes how a word is inflected into different forms. A paradigm contains one or more tables.
- A criterion encodes the conditions under which a particular paradigm applies. It might depend on the form of the word, as well as its part of speech.
- A table is a grid of inflected forms with row and column labels. (Some cells might be empty.) A table might or might not have a name.
- A pattern encodes how the parts of a word relevant for inflection are extracted. For instance, V-nouns in ŊCv7 have a pattern that extracts the N stem and the thematic vowel; this might look like `(.*)(j?[aeiouâêîô])`.
- A lookup is a table that maps a key to a value. These could be used to store the derivatives of thematic vowels, for instance.
- A component is a sequence of phonemes (possibly depending on the word being inflected) that is used to build one or more inflected forms.
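As a rough sketch (the field names are mine, not a settled design), these concepts might map onto data structures like this:

```python
from dataclasses import dataclass, field

@dataclass
class Pattern:
    regex: str                       # extracts the inflection-relevant parts of a word

@dataclass
class Table:
    rows: list[str]
    columns: list[str]
    entries: list[list[str]]         # cell templates such as "{M}{N}{0}n"; cells may be empty
    name: str | None = None

@dataclass
class Paradigm:
    name: str
    criteria: dict[str, str]         # e.g. {"word": "...", "pos": "n.*"}
    components: dict[str, str]       # named pieces used to build inflected forms
    lookups: dict[str, dict[str, str]] = field(default_factory=dict)
    tables: list[Table] = field(default_factory=list)
```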
Let’s walk through how we can decline V-nouns in ŊCv7.
First, we create a paradigm for V-nouns and set its criteria:
# <external definition>
paradigm "V-noun" {
criterion word ~ "((?:\*?(?:#|+\*?|@))?)(.*)(j?[aeiouâêîô])";
criterion pos ~ "n.*";
# <rest of paradigm definition>
}
Then we define the necessary components:
# <rest of paradigm definition>:
component M = word$0;
component N = word$1;
component V0 = word$2;
component V1 = thematic_vowel_derivative_v_1[V0];
component V2 = thematic_vowel_derivative_v_2[V0];
component V3 = thematic_vowel_derivative_v_3[V0];
# <rest of paradigm definition>
Of course, we need to define the appropriate tables:
# <external definition>:
table thematic_vowel_derivative_v_1 {
a o
e o
i jo
o o
u u
ja jo
je jo
jo jo
â ô
ê ô
î jô
ô ô
jâ jô
jê jô
jô jô
}
# and so on...
Then how do we get the components for the L and S forms? We have to take these forms from the entry if it lists irregular forms, but derive them regularly otherwise.
# <rest of paradigm definition>:
if criterion l: tag(l) ~ "((?:\*?(?:#|\+\*?|@))?)(.*)(j?[aeiouâêîô])s" {
L = l$1;
} else if criterion nm: N ~ "(.*)(j?[aeiouâêîô])([^aeiouâêîô]*)" {
L = nm$0 ~ thematic_vowel_derivative_v_1[nm$1] ~ nm$2;
} else error "N form is corrupted"
# L is defined in this scope because it was defined in both of the
# branches that did not error
if criterion s: tag(s) ~ "((?:\*?(?:#|\+\*?|@))?)(.*)ic" {
S = s$1;
} else if criterion nm: N ~ "(.*)(j?[aeiouâêîô])([^aeiouâêîô]*)" {
if criterion nm$2 ~ "[rl]?þ" {
SB = "ð";
} else if criterion nm$2 ~ "t|st|s" {
SB = "d";
} else {
SB1 = replace(nm$2, "r", "R");
SB2 = replace(SB1, "([aoâô])R([^aeiouâêîô])", "\\1r\\2");
SB = replace(SB2, "R", "l");
}
S = nm$0 ~ nm$1 ~ SB;
} else error "N form is corrupted"
# <rest of paradigm definition>
Whoops! We forgot about eclipsis:
# <rest of paradigm definition>:
NE = magic_eclipsis_function(N); # use your imagination
# <rest of paradigm definition>
Finally, we define a table:
# <rest of paradigm definition>:
table {
rows {
"nominative" "accusative" "dative" "genitive"
"locative-temporal" "ablative" "allative" "prolative"
"instrumental-comitative" "abessive" "semblative I" "semblative II"
}
columns {
"singular" "dual" "plural"
}
entries {
"{M}{N}{0}" "{M}{N}{0}c" "{M}{N}{1}"
"{M}{N}{0}n" "{M}{N}{0}ŋ" "{M}{N}{1}n"
"{M}{N}{0}s" "{M}{N}{0}ci" "{M}{N}{1}s"
"{M}{N}{2}n" "{M}{NE}{2}c" "{M}{NE}{3}n"
# and so on...
}
}
Of course, something resembling subroutines would make our lives easier. By this time, we might be better off using a proper programming language, especially when it comes to monosyllabic noun declensions.
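To make that concrete, here is a hedged Python sketch of the same V-noun derivation in a general-purpose language; the derivative table is truncated, and the word in the example call is made up purely for illustration.

```python
import re

V_NOUN = re.compile(r"((?:\*?(?:#|\+\*?|@))?)(.*)(j?[aeiouâêîô])")

# Truncated: only a few rows of the thematic-vowel derivative table above.
DERIV_1 = {"a": "o", "e": "o", "i": "jo", "o": "o", "u": "u"}

def decline_v_noun(word: str) -> dict[str, str]:
    """Return a few nominative forms of a V-noun, or raise if it isn't one."""
    m = V_NOUN.fullmatch(word)
    if m is None or m.group(3) not in DERIV_1:
        raise ValueError(f"cannot decline {word!r} as a V-noun")
    mut, stem, v0 = m.groups()
    v1 = DERIV_1[v0]
    return {
        "nom.sg": f"{mut}{stem}{v0}",
        "nom.du": f"{mut}{stem}{v0}c",
        "nom.pl": f"{mut}{stem}{v1}",
    }

print(decline_v_noun("nara"))  # made-up word, for illustration only
```

The irregular L and S forms and eclipsis would still need the same fallback logic as above, which is exactly where a real language’s functions and data structures start to pay off.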
Tools for documenting languages from Markdown
These can be found in the source tree of the site’s repository.
CreateGlossFilter.rb
Works on HTML sources, looking for `<ol>` tags with class `ilgloss`. Like Leipzig.js, but it transforms the HTML at compile time instead of making the browser run JS code.
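The heart of the transform is pairing the words of the `!` line with the glosses of the `@` line; here is a rough Python illustration of that pairing step (not the actual Ruby filter, which works on the rendered HTML and also processes the `%`-prefixed abbreviations):

```python
def align_gloss(words_line: str, gloss_line: str) -> list[tuple[str, str]]:
    """Pair each word of the '!' line with the corresponding '@' gloss."""
    words = words_line.lstrip("! ").split()
    glosses = gloss_line.lstrip("@ ").split()
    if len(words) != len(glosses):
        raise ValueError("word and gloss counts differ")
    return list(zip(words, glosses))

print(align_gloss("! šin-on men-at ŋ\\geð-i-þ.",
                  "@ all-%acc.%sg see-%inf %pfv\\fail_to-3%pl-%past"))
```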
Kramdown source:
{: .ilgloss}
1. ! šin-on men-at ŋ\geð-i-þ.
2. @ all-%acc.%sg see-%inf %pfv\fail_to-3%pl-%past
3. "They failed to see anything." **Stay mad, `sed`-users!**
{: .ilgloss}
1. ! šin-o nem-an racr-a.
2. @ all-%nom.%sg any-%acc.%sg know-3%sg
3. $\forall x \exists y: \text{$x$ knows $y$}$
Output:
- šin-on
- all-acc.sg
- men-at
- see-inf
- ŋ\geð-i-þ.
- pfv\fail_to-3pl-past

"They failed to see anything." **Stay mad, `sed`-users!**

- šin-o
- all-nom.sg
- nem-an
- any-acc.sg
- racr-a.
- know-3sg

∀x ∃y: x knows y
(Sorry, but this might look weird if you’re using a text browser or a screen reader.)