diff --git a/iXMLstart.md b/iXMLstart.md new file mode 100644 index 0000000..8aa02c7 --- /dev/null +++ b/iXMLstart.md @@ -0,0 +1,239 @@ +# Writing Your First ixml + +## Understanding ixml Rules + +An ixml **grammar** consists of rules. Each rule has a "left hand side" and a "right hand side". The left hand side is a single symbol, which is the title of something that's being defined. The right hand side is a list of one or more symbols that define it. + +**What is a grammar?** A grammar is a collection of rules that defines how to match parts of the input text. Your entire ixml document is a grammar, and every rule contributes to describing the structure you expect in the input. + + +### Basic Syntax: How to Write a Grammar + +A rule has the form of a name (the left hand side) followed by a colon, followed by one or more symbols (the right hand side) followed by a period. If more than one symbol appears on the right hand side, they must be separated by commas: + +```shell +symbol-name: defining, symbols, here. +``` + +Invisible XML allows only a single rule for any given name. If you want to express that a symbol can have two or more definitions, separate the alternatives with semicolons. + +This rule says a "thing" is a "this" followed by a "that": + +```shell +thing: this, that. +``` + +This rule says a "thing" is a "this" OR a "that": + +```shell +thing: this; that. +``` + +Whitespace around punctuation is insignificant: `"this;that"` is the same as `"this; that"`. + +## Understanding ixml Symbols and Structure + +**Nonterminals** are the symbols you define with rules — they're the names on the left-hand side, before the colon. They represent patterns or structures in your text that get converted to XML elements. For example, if you write: + +```shell +month: "January"; "February"; "March". +``` + +Then `month` is a **nonterminal**. When the processor matches one of those month names in your input, it creates a `` element in the XML output. + +**Terminals** are the symbols that match characters explicitly in your input. In the example above, `"January"`, `"February"`, and `"March"` are **terminals** - they match those exact words in your input. + +### Mixing and Matching + +A nonterminal may be defined using another nonterminal. Let's add to the previous example: + +```shell +date: month. +month: "January"; "February", "March". +``` +When a rule refers to other nonterminals, it might help to think of the output to better understand what's happening: + +``` + + January + +``` + +Elements can contain other elements ("nesting"), so nonterminals can - and most likely will - contain other nonterminals. + +## Organizing the Right Hand Side + +### Building blocks +The right-hand side of a rule is organized into levels: + +- **Factors** — the basic building blocks: terminals (quoted text like `"January"`), nonterminals (names of other rules), or grouped alternatives in parentheses +- **Terms** — separated by commas (`,`), forming a sequence +- **Alternatives** — separated by semicolons (`;`), offering different ways to match the same nonterminal + + +You can use parentheses to group alternatives together. For example: + +```shell +memo: recipient, (date, sender ; sender, date), content. +``` + +Here, the middle part can match either date followed by sender, OR sender followed by date. + +## Character Classes/Categories +For both letters and numbers, you can use a square brackets to indicate ranges. For instance: +```shell +num: ["0"-"5"]. +``` + +The above code will match any digit from 0 to 5, inclusive. Keep in mind that the quotation marks are necessary. + +## Exclusions +Adding a tilde ('\~') before a symbol changes a rule from matching that symbol to matching anything *except* for that symbol. This is called an **exclusion**. For example: + +```shell +notNum: ~[N]. +``` +The above code will recognize a letter, whitespace, etc. as a `notNum`, but will not recognize any numerical digit. + +### Excluding a nonterminal +There will be instances where the original file you're working with will have structural markers you don't want in your XML output. Take this situation: + +```shell +1/2/3 +``` + +Each number is separated by a slash, but maybe for our purposes, it isn't important to preserve this formatting. Our grammar might look like this: + + +```shell +line: num, slash, num, slash, num. +num: [N]. +-slash: -#2F. +``` + +The hyphen `-` before the `slash` terminal means that the `` **element** will not appear in the final output. The hyphen `-` before `#2F` suppresses the slash itself. + +## Unicode Characters and Hex Codes + +**What is Unicode?** Unicode is a universal character encoding system that assigns every character—from all languages and symbol sets—a unique code point. This lets grammars reliably match a wide range of characters. + +In ixml, you cannot put actual line breaks, tabs, or other special characters directly into quoted strings. To represent these invisible characters or symbols, ixml uses encoded characters written as a number sign ("#") followed by hexadecimal digits. Each encoded character can represent any Unicode character. + +Most special character codes in ixml match the Unicode equivalent. A tab is represented in unicode as U+0009. Replacing "U+000" with "#" yields the ixml code, #9. + +### Useful Character Codes + +You may find the following values useful as you construct your ixml grammar: + +| Character | Character Code (Hex value) | +|-----------|----------------------------| +| tab | `#9` | +| quotation mark (") | `#22` | +| newline (CR?, LF) | `#d?, #a` | + +## The Data + +There's a plain-text tab-delimited document at https://raw.githubusercontent.com/djbpitt/ixml/refs/heads/main/movies/movieData-short.txt that contains information about several films. Each line of the file contains four pieces of information about one film, separated by tab characters: + +- title +- year +- country +- runtime + +The first line of the file contains column headers; the remaining lines are actual data. You may want to open the file in and look around in order to familiarize yourself with it before you start the assignment. + +The format of the data is mostly straightforward, but here are a few details: + +- Some titles are surrounded by quotation marks and others are not. The use of quotation marks around titles is arbitrary and not informational. +- The country value includes between one and seven countries, inclusive. If there are two or more, the individual countries are separated by a comma and space character and the whole value is surrounded by quotation marks, e.g., `"UK, USA"`. The quotation marks in the country field are used only when there are two or more countries. +- The runtime is either a number of minutes followed by a space character and the abbreviation `min` (e.g., `93 min`) or just the string `N/A` (which stands for not available). If a time is given, it is always given as a number of minutes (i.e., there is no reference to hours or any other unit of time except minutes). + +## The Task + +Your task is to create an ixml grammar that will convert the plain-text file to well-formed XML. + + +For example, for the plain text line: + +``` +Goodbye to All That 1930 UK 9 min +``` + +your ixml grammar should create: + +```xml + + Goodbye to All That + 1930 + UK + 9 min + +``` + +## How to Proceed + +### Reading the Plain-Text Input + +You can use CoffeePot, the jωixml workbench, or Markup Blitz to test your grammar. You can read your plain-text input file directly from the URL above or save it and then read your local copy. + +**Using Markup Blitz (command line):** + +Markup Blitz can read directly from a URL, so you can parse the remote file against your ixml grammar with: + +```bash +blitz movies.ixml https://raw.githubusercontent.com/djbpitt/ixml/refs/heads/main/movies/movieData-short.txt +``` + +### Examining or Saving the XML Output + +If you run the code above and your ixml tagging succeeds, the output will race across the screen. Here are two ways to examine the result more carefully: + +- If you're on MacOS or in GitBash in Windows, you can pipe the output into `less`, a pager that displays one screen of data at a time. The command to do that is: + +```bash +blitz movies.ixml https://raw.githubusercontent.com/djbpitt/ixml/refs/heads/main/movies/movieData-short.txt | less +``` + +Once the first screen is displayed in `less`, pressing the space bar moves forward to the next page. Press `q` to quit. + +- To save the output to disk, you can redirect it to a file by appending `> movies.xml` to the command line, i.e.: + +```bash +blitz movies.ixml https://raw.githubusercontent.com/djbpitt/ixml/refs/heads/main/movies/movieData-short.txt > movies.xml +``` + +This will create a file called `movies.xml`, which will contain the output of your ixml operation. If the file already exists, running the command will replace the old file with the new output. + +## Handling Line Endings (Postel's Law) + +Files from different systems use different line-ending conventions: Linux and macOS use `#a` (LF), while Windows uses `#d#a` (CRLF). Since you cannot predict which format an Internet file will use, you should write your grammar to accept both. The pattern `#d?#a` will match the end-of-line code in both Linux/macOS and Windows files. Using this pattern is a way of being liberal in what you accept. + +Files may also end with or without a final newline. To handle both cases, you can use the separator pattern with an optional trailing newline: + +``` +doc: line++newline, newline?. +``` + +- `line++newline` matches one or more lines with newlines as separators between them +- The final `newline?` makes a trailing newline optional + +### Repetition operators: + +| Operator | Meaning | +|----------|---------| +| `*` | zero or more | +| `+` | one or more | +| `?` | optional (zero or one) | +| `++` | one or more with separator (e.g., `item++comma`) | + +### XML Output + +For the moment you don't have to worry about treating multiple countries specially. That is, where the input country value is something like `"US, UK"`, it's okay if your XML says `"US, UK"`. We'll process that value further, later in this unit, using XSLT within XProc. + +Invisible XML doesn't create pretty-printed output by default, but we can take care of that later. For now, you can open the output in oXygen to pretty-print and check your work. + + +## What to Submit + +Submit only your ixml grammar. Do not submit the tagged XML that it creates; we'll run it ourselves to examine the output. +