From 90d0b7705b10427a34da93c4903bbfa48ea42edb Mon Sep 17 00:00:00 2001 From: lgmccurdy Date: Thu, 22 Jan 2026 14:06:33 -0500 Subject: [PATCH 1/4] pushing iXMLstart md --- iXMLstart.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 iXMLstart.md diff --git a/iXMLstart.md b/iXMLstart.md new file mode 100644 index 0000000..689422c --- /dev/null +++ b/iXMLstart.md @@ -0,0 +1 @@ +# Writing Your First iXML From 215a8c8cafdc7c0d25d0c751f736aaa4d09402a1 Mon Sep 17 00:00:00 2001 From: lgmccurdy Date: Tue, 27 Jan 2026 00:19:56 -0500 Subject: [PATCH 2/4] Updates to iXMLstart.md --- iXMLstart.md | 187 ++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 186 insertions(+), 1 deletion(-) diff --git a/iXMLstart.md b/iXMLstart.md index 689422c..9bf2e8d 100644 --- a/iXMLstart.md +++ b/iXMLstart.md @@ -1 +1,186 @@ -# Writing Your First iXML +# Writing Your First iXML + +## The Data + +There's a plain-text tab-delimited document at https://raw.githubusercontent.com/djbpitt/ixml/refs/heads/main/movies/movieData-short.txt that contains information about several films. Each line of the file contains four pieces of information about one film, separated by tab characters: + +- title +- year +- country +- runtime + +The first line of the file contains column headers; the remaining lines are actual data. You may want to open the file in and look around in order to familiarize yourself with it before you start the assignment. + +The format of the data is mostly straightforward, but here are a few details: + +- Some titles are surrounded by quotation marks and others are not. The use of quotation marks around titles is arbitrary and not informational. +- The country value includes between one and seven countries, inclusive. If there are two or more, the individual countries are separated by a comma and space character and the whole value is surrounded by quotation marks, e.g., `"UK, USA"`. The quotation marks in the country field are used only when there are two or more countries. +- The runtime is either a number of minutes followed by a space character and the abbreviation `min` (e.g., `93 min`) or just the string `N/A` (which stands for not available). If a time is given, it is always given as a number of minutes (i.e., there is no reference to hours or any other unit of time except minutes). + +## The Task + +Your task is to create an ixml grammar that will convert the plain-text file to well-formed XML. + +**What is a grammar?** A grammar is a collection of rules that defines how to match parts of the input text. Your entire iXML document is a grammar, and every rule contributes to describing the structure you expect in the input. + +For example, for the plain text line: + +``` +Goodbye to All That 1930 UK 9 min +``` + +your ixml grammar should create: + +```xml + + Goodbye to All That + 1930 + UK + 9 min + +``` + +### XML Output + +For the moment you don't have to worry about treating multiple countries specially. That is, where the input country value is something like `"US, UK"`, it's okay if your XML says `"US, UK"`. We'll process that value further, later in this unit, using XSLT within XProc. + +Invisible XML doesn't create pretty-printed (wrapped and indented) output by default, and we'll take care of that later, once we begin using XProc. If, in the meanwhile, you'd like to make the result more legible, you can save the output file, open it in , and pretty-print it there. + +## Understanding iXML Rules + +An iXML grammar consists of rules. Each rule has a "left hand side" and a "right hand side". The left hand side is a single symbol, the one being defined, and the right hand side is a list of one or more symbols that define it. + +A symbol is either: + +- The name of something (in which case there must be a further rule that defines it, one that has it as the rule's left hand side), or +- Something that literally matches characters in your input + +### Basic Syntax: How to Write a Grammar + +A rule has the form of a name (the left hand side) followed by a colon, followed by one or more symbols (the right hand side) followed by a period. If more than one symbol appears on the right hand side, they must be separated by commas: + +``` +symbol-name: defining, symbols, here . +``` + +Invisible XML allows only a single rule for any given name. If you want to express that a symbol can have two or more definitions, separate the alternatives with semicolons. + +This rule says a "thing" is a "this" followed by a "that": + +``` +thing: this, that . +``` + +This rule says a "thing" is a "this" OR a "that": + +``` +thing: this; that . +``` + +Whitespace around punctuation is insignificant: `"this;that"` is the same as `"this; that"`. + +## Understanding iXML Symbols and Structure + +**Nonterminals** are the symbols you define with rules - they're the names on the left-hand side of your rules. They represent patterns or structures in your text that get converted to XML elements. For example, if you write: + +``` +month: "January"; "February"; "March" . +``` + +Then `month` is a nonterminal. When the processor matches one of those month names in your input, it creates a `` element in the XML output. + +### Terminals + +**Terminals** are the symbols that match characters explicitly in your input. In the example above, `"January"`, `"February"`, and `"March"` are terminals - they match those exact words in your input. + +### Organizing the Right Hand Side + +The right-hand side of a rule is organized into levels: + +- **Alternatives** — separated by semicolons (`;`), offering different ways to match the same nonterminal +- **Terms** — separated by commas (`,`), forming a sequence +- **Factors** — the basic building blocks: terminals (quoted text like `"January"`), nonterminals (names of other rules), or grouped alternatives in parentheses + +You can use parentheses to group alternatives together. For example: + +``` +memo: recipient, (date, sender ; sender, date), content . +``` + +Here, the middle part can match either date followed by sender, OR sender followed by date. + +## Unicode Characters and Hex Codes + +**What is Unicode?** Unicode is a universal character encoding system that assigns every character—from all languages and symbol sets—a unique code point. This lets grammars reliably match a wide range of characters. + +In iXML, you cannot put actual line breaks, tabs, or other special characters directly into quoted strings. To represent these invisible characters or symbols, iXML uses encoded characters written as a number sign ("#") followed by hexadecimal digits. Each encoded character can represent any Unicode character. + +### Useful Character Codes + +You may find the following values useful as you construct your ixml grammar: + +| Character | Character Code (Hex value) | +|-----------|----------------------------| +| tab | `#9` | +| quotation mark (") | `#22` | +| newline (CR?, LF) | `#d?, #a` | + +## How to Proceed + +### Reading the Plain-Text Input + +You can use CoffeePot, the jωiXML workbench, or Markup Blitz to test your grammar. You can read your plain-text input file directly from the URL above or save it and then read your local copy. + +**Using Markup Blitz (command line):** + +Markup Blitz can read directly from a URL, so you can parse the remote file against your ixml grammar with: + +```bash +blitz movies.ixml https://raw.githubusercontent.com/djbpitt/ixml/refs/heads/main/movies/movieData-short.txt +``` + +### Examining or Saving the XML Output + +If you run the code above and your ixml tagging succeeds, the output will race across the screen. Here are two ways to examine the result more carefully: + +- If you're on MacOS or in GitBash in Windows, you can pipe the output into `less`, a pager that displays one screen of data at a time. The command to do that is: + +```bash +blitz movies.ixml https://raw.githubusercontent.com/djbpitt/ixml/refs/heads/main/movies/movieData-short.txt | less +``` + +Once the first screen is displayed in `less`, pressing the space bar moves forward to the next page. Press `q` to quit. + +- To save the output to disk, you can redirect it to a file by appending `> movies.xml` to the command line, i.e.: + +```bash +blitz movies.ixml https://raw.githubusercontent.com/djbpitt/ixml/refs/heads/main/movies/movieData-short.txt > movies.xml +``` + +This will create a file called `movies.xml`, which will contain the output of your ixml operation. If the file already exists, running the command will replace the old file with the new output. + +## Handling Line Endings (Postel's Law) + +Files from different systems use different line-ending conventions: Linux and macOS use `#a` (LF), while Windows uses `#d#a` (CRLF). Since you cannot predict which format an Internet file will use, you should write your grammar to accept both. The pattern `#d?#a` will match the end-of-line code in both Linux/macOS and Windows files. Using this pattern is a way of being liberal in what you accept. + +Files may also end with or without a final newline. To handle both cases, you can use the separator pattern with an optional trailing newline: + +``` +doc: line++newline, newline? . +``` + +- `line++newline` matches one or more lines with newlines as separators between them +- The final `newline?` makes a trailing newline optional + +### Repetition operators: + +| Operator | Meaning | +|----------|---------| +| `*` | zero or more | +| `+` | one or more | +| `?` | optional (zero or one) | +| `++` | one or more with separator (e.g., `item++comma`) | + +## What to Submit + +Submit only your ixml grammar. Do not submit the tagged XML that it creates; we'll run it ourselves to examine the output. From a4d2b67d0a18d5064eaf55cd3337766f0e787f3f Mon Sep 17 00:00:00 2001 From: Mo Wright Date: Wed, 28 Jan 2026 12:52:15 -0500 Subject: [PATCH 3/4] revision work on the starter file --- iXMLstart.md | 166 ++++++++++++++++++++++++++++++++++----------------- 1 file changed, 110 insertions(+), 56 deletions(-) diff --git a/iXMLstart.md b/iXMLstart.md index 9bf2e8d..7227383 100644 --- a/iXMLstart.md +++ b/iXMLstart.md @@ -1,59 +1,11 @@ # Writing Your First iXML -## The Data - -There's a plain-text tab-delimited document at https://raw.githubusercontent.com/djbpitt/ixml/refs/heads/main/movies/movieData-short.txt that contains information about several films. Each line of the file contains four pieces of information about one film, separated by tab characters: - -- title -- year -- country -- runtime - -The first line of the file contains column headers; the remaining lines are actual data. You may want to open the file in and look around in order to familiarize yourself with it before you start the assignment. - -The format of the data is mostly straightforward, but here are a few details: - -- Some titles are surrounded by quotation marks and others are not. The use of quotation marks around titles is arbitrary and not informational. -- The country value includes between one and seven countries, inclusive. If there are two or more, the individual countries are separated by a comma and space character and the whole value is surrounded by quotation marks, e.g., `"UK, USA"`. The quotation marks in the country field are used only when there are two or more countries. -- The runtime is either a number of minutes followed by a space character and the abbreviation `min` (e.g., `93 min`) or just the string `N/A` (which stands for not available). If a time is given, it is always given as a number of minutes (i.e., there is no reference to hours or any other unit of time except minutes). - -## The Task - -Your task is to create an ixml grammar that will convert the plain-text file to well-formed XML. - -**What is a grammar?** A grammar is a collection of rules that defines how to match parts of the input text. Your entire iXML document is a grammar, and every rule contributes to describing the structure you expect in the input. - -For example, for the plain text line: - -``` -Goodbye to All That 1930 UK 9 min -``` - -your ixml grammar should create: - -```xml - - Goodbye to All That - 1930 - UK - 9 min - -``` - -### XML Output - -For the moment you don't have to worry about treating multiple countries specially. That is, where the input country value is something like `"US, UK"`, it's okay if your XML says `"US, UK"`. We'll process that value further, later in this unit, using XSLT within XProc. - -Invisible XML doesn't create pretty-printed (wrapped and indented) output by default, and we'll take care of that later, once we begin using XProc. If, in the meanwhile, you'd like to make the result more legible, you can save the output file, open it in , and pretty-print it there. - ## Understanding iXML Rules -An iXML grammar consists of rules. Each rule has a "left hand side" and a "right hand side". The left hand side is a single symbol, the one being defined, and the right hand side is a list of one or more symbols that define it. +An iXML **grammar** consists of rules. Each rule has a "left hand side" and a "right hand side". The left hand side is a single symbol, which is the title of something that's being defined. The right hand side is a list of one or more symbols that define it. -A symbol is either: +**What is a grammar?** A grammar is a collection of rules that defines how to match parts of the input text. Your entire iXML document is a grammar, and every rule contributes to describing the structure you expect in the input. -- The name of something (in which case there must be a further rule that defines it, one that has it as the rule's left hand side), or -- Something that literally matches characters in your input ### Basic Syntax: How to Write a Grammar @@ -81,7 +33,7 @@ Whitespace around punctuation is insignificant: `"this;that"` is the same as `"t ## Understanding iXML Symbols and Structure -**Nonterminals** are the symbols you define with rules - they're the names on the left-hand side of your rules. They represent patterns or structures in your text that get converted to XML elements. For example, if you write: +**Nonterminals** are the symbols you define with rules - they're the names on the left-hand side, before the colon. They represent patterns or structures in your text that get converted to XML elements. For example, if you write: ``` month: "January"; "February"; "March" . @@ -89,17 +41,35 @@ month: "January"; "February"; "March" . Then `month` is a nonterminal. When the processor matches one of those month names in your input, it creates a `` element in the XML output. -### Terminals +**Terminals** are the symbols that match characters explicitly in your input. In the example above, `"January"`, `"February"`, and `"March"` are terminals - they match those exact words in your input. -**Terminals** are the symbols that match characters explicitly in your input. In the example above, `"January"`, `"February"`, and `"March"` are terminals - they match those exact words in your input. +### Mixing and Matching -### Organizing the Right Hand Side +A nonterminal may be defined using another nonterminal. Let's add to the previous example: +``` +date: month . +month: "January"; "February", "March" . +``` +When a rule refers to other nonterminals, it might help to think of the output to better understand what's happening: + +``` + + January + +``` + +Elements can contain other elements ("nesting"), so nonterminals can - and most likely will - contain other nonterminals. + +## Organizing the Right Hand Side + +### Building blocks The right-hand side of a rule is organized into levels: -- **Alternatives** — separated by semicolons (`;`), offering different ways to match the same nonterminal -- **Terms** — separated by commas (`,`), forming a sequence - **Factors** — the basic building blocks: terminals (quoted text like `"January"`), nonterminals (names of other rules), or grouped alternatives in parentheses +- **Terms** — separated by commas (`,`), forming a sequence +- **Alternatives** — separated by semicolons (`;`), offering different ways to match the same nonterminal + You can use parentheses to group alternatives together. For example: @@ -109,12 +79,49 @@ memo: recipient, (date, sender ; sender, date), content . Here, the middle part can match either date followed by sender, OR sender followed by date. +## Inclusions +For both letters and numbers, you can use a square brackets to indicate ranges. For instance: +``` +num: ["0"-"5"] . +``` + +The above code will match any digit from 0 to 5, inclusive. + + +## Exclusions +Adding a tilde ('\~') before a symbol changes a rule from matching that symbol to matching anything *except* for that symbol. This is called an **exclusion**. For example: + +``` +notNum: ~[N] +``` +The above code will recognize a letter, whitespace, etc. as a `notNum`, but will not recognize any numerical digit. + +### Excluding a nonterminal +There will be instances where the original file you're working with will have structural elements you don't want in your XML output. Take this situation: + +`` +1/2/3 +`` + +Each number is separated by a slash, but maybe for our purposes, it isn't important to preserve this formatting. Our grammar might look like this: + + +``` +line: num, slash, num, slash, num +num: [N] +-slash: -#2F +``` + +The `-` before the `slash` terminal means that the `` **element** will not appear in the final output. The `-` before `#2F` supresses the slash itself. + ## Unicode Characters and Hex Codes **What is Unicode?** Unicode is a universal character encoding system that assigns every character—from all languages and symbol sets—a unique code point. This lets grammars reliably match a wide range of characters. In iXML, you cannot put actual line breaks, tabs, or other special characters directly into quoted strings. To represent these invisible characters or symbols, iXML uses encoded characters written as a number sign ("#") followed by hexadecimal digits. Each encoded character can represent any Unicode character. +Most special character codes in iXML match the Unicode equivalent. A tab is represented in unicode as U+0009. Replacing "U+000" with "#" yields the iXML code, #9. + ### Useful Character Codes You may find the following values useful as you construct your ixml grammar: @@ -125,6 +132,45 @@ You may find the following values useful as you construct your ixml grammar: | quotation mark (") | `#22` | | newline (CR?, LF) | `#d?, #a` | +## The Data + +There's a plain-text tab-delimited document at https://raw.githubusercontent.com/djbpitt/ixml/refs/heads/main/movies/movieData-short.txt that contains information about several films. Each line of the file contains four pieces of information about one film, separated by tab characters: + +- title +- year +- country +- runtime + +The first line of the file contains column headers; the remaining lines are actual data. You may want to open the file in and look around in order to familiarize yourself with it before you start the assignment. + +The format of the data is mostly straightforward, but here are a few details: + +- Some titles are surrounded by quotation marks and others are not. The use of quotation marks around titles is arbitrary and not informational. +- The country value includes between one and seven countries, inclusive. If there are two or more, the individual countries are separated by a comma and space character and the whole value is surrounded by quotation marks, e.g., `"UK, USA"`. The quotation marks in the country field are used only when there are two or more countries. +- The runtime is either a number of minutes followed by a space character and the abbreviation `min` (e.g., `93 min`) or just the string `N/A` (which stands for not available). If a time is given, it is always given as a number of minutes (i.e., there is no reference to hours or any other unit of time except minutes). + +## The Task + +Your task is to create an ixml grammar that will convert the plain-text file to well-formed XML. + + +For example, for the plain text line: + +``` +Goodbye to All That 1930 UK 9 min +``` + +your ixml grammar should create: + +```xml + + Goodbye to All That + 1930 + UK + 9 min + +``` + ## How to Proceed ### Reading the Plain-Text Input @@ -181,6 +227,14 @@ doc: line++newline, newline? . | `?` | optional (zero or one) | | `++` | one or more with separator (e.g., `item++comma`) | +### XML Output + +For the moment you don't have to worry about treating multiple countries specially. That is, where the input country value is something like `"US, UK"`, it's okay if your XML says `"US, UK"`. We'll process that value further, later in this unit, using XSLT within XProc. + +Invisible XML doesn't create pretty-printed output by default, but we can take care of that later. For now, you can open the output in oXygen to pretty-print and check your work. + + ## What to Submit Submit only your ixml grammar. Do not submit the tagged XML that it creates; we'll run it ourselves to examine the output. + From 738b7c61be71f514f5727e54dd3455c87c257e27 Mon Sep 17 00:00:00 2001 From: Mo Wright Date: Thu, 29 Jan 2026 13:53:46 -0500 Subject: [PATCH 4/4] updates talked about in meeting - preparation to separate into multiple documents --- iXMLstart.md | 79 ++++++++++++++++++++++++++-------------------------- 1 file changed, 39 insertions(+), 40 deletions(-) diff --git a/iXMLstart.md b/iXMLstart.md index 7227383..8aa02c7 100644 --- a/iXMLstart.md +++ b/iXMLstart.md @@ -1,55 +1,55 @@ -# Writing Your First iXML +# Writing Your First ixml -## Understanding iXML Rules +## Understanding ixml Rules -An iXML **grammar** consists of rules. Each rule has a "left hand side" and a "right hand side". The left hand side is a single symbol, which is the title of something that's being defined. The right hand side is a list of one or more symbols that define it. +An ixml **grammar** consists of rules. Each rule has a "left hand side" and a "right hand side". The left hand side is a single symbol, which is the title of something that's being defined. The right hand side is a list of one or more symbols that define it. -**What is a grammar?** A grammar is a collection of rules that defines how to match parts of the input text. Your entire iXML document is a grammar, and every rule contributes to describing the structure you expect in the input. +**What is a grammar?** A grammar is a collection of rules that defines how to match parts of the input text. Your entire ixml document is a grammar, and every rule contributes to describing the structure you expect in the input. ### Basic Syntax: How to Write a Grammar A rule has the form of a name (the left hand side) followed by a colon, followed by one or more symbols (the right hand side) followed by a period. If more than one symbol appears on the right hand side, they must be separated by commas: -``` -symbol-name: defining, symbols, here . +```shell +symbol-name: defining, symbols, here. ``` Invisible XML allows only a single rule for any given name. If you want to express that a symbol can have two or more definitions, separate the alternatives with semicolons. This rule says a "thing" is a "this" followed by a "that": -``` -thing: this, that . +```shell +thing: this, that. ``` This rule says a "thing" is a "this" OR a "that": -``` -thing: this; that . +```shell +thing: this; that. ``` Whitespace around punctuation is insignificant: `"this;that"` is the same as `"this; that"`. -## Understanding iXML Symbols and Structure +## Understanding ixml Symbols and Structure -**Nonterminals** are the symbols you define with rules - they're the names on the left-hand side, before the colon. They represent patterns or structures in your text that get converted to XML elements. For example, if you write: +**Nonterminals** are the symbols you define with rules — they're the names on the left-hand side, before the colon. They represent patterns or structures in your text that get converted to XML elements. For example, if you write: -``` -month: "January"; "February"; "March" . +```shell +month: "January"; "February"; "March". ``` -Then `month` is a nonterminal. When the processor matches one of those month names in your input, it creates a `` element in the XML output. +Then `month` is a **nonterminal**. When the processor matches one of those month names in your input, it creates a `` element in the XML output. -**Terminals** are the symbols that match characters explicitly in your input. In the example above, `"January"`, `"February"`, and `"March"` are terminals - they match those exact words in your input. +**Terminals** are the symbols that match characters explicitly in your input. In the example above, `"January"`, `"February"`, and `"March"` are **terminals** - they match those exact words in your input. ### Mixing and Matching A nonterminal may be defined using another nonterminal. Let's add to the previous example: -``` -date: month . -month: "January"; "February", "March" . +```shell +date: month. +month: "January"; "February", "March". ``` When a rule refers to other nonterminals, it might help to think of the output to better understand what's happening: @@ -73,54 +73,53 @@ The right-hand side of a rule is organized into levels: You can use parentheses to group alternatives together. For example: -``` -memo: recipient, (date, sender ; sender, date), content . +```shell +memo: recipient, (date, sender ; sender, date), content. ``` Here, the middle part can match either date followed by sender, OR sender followed by date. -## Inclusions +## Character Classes/Categories For both letters and numbers, you can use a square brackets to indicate ranges. For instance: +```shell +num: ["0"-"5"]. ``` -num: ["0"-"5"] . -``` - -The above code will match any digit from 0 to 5, inclusive. +The above code will match any digit from 0 to 5, inclusive. Keep in mind that the quotation marks are necessary. ## Exclusions Adding a tilde ('\~') before a symbol changes a rule from matching that symbol to matching anything *except* for that symbol. This is called an **exclusion**. For example: -``` -notNum: ~[N] +```shell +notNum: ~[N]. ``` The above code will recognize a letter, whitespace, etc. as a `notNum`, but will not recognize any numerical digit. ### Excluding a nonterminal -There will be instances where the original file you're working with will have structural elements you don't want in your XML output. Take this situation: +There will be instances where the original file you're working with will have structural markers you don't want in your XML output. Take this situation: -`` +```shell 1/2/3 -`` +``` Each number is separated by a slash, but maybe for our purposes, it isn't important to preserve this formatting. Our grammar might look like this: -``` -line: num, slash, num, slash, num -num: [N] --slash: -#2F +```shell +line: num, slash, num, slash, num. +num: [N]. +-slash: -#2F. ``` -The `-` before the `slash` terminal means that the `` **element** will not appear in the final output. The `-` before `#2F` supresses the slash itself. +The hyphen `-` before the `slash` terminal means that the `` **element** will not appear in the final output. The hyphen `-` before `#2F` suppresses the slash itself. ## Unicode Characters and Hex Codes **What is Unicode?** Unicode is a universal character encoding system that assigns every character—from all languages and symbol sets—a unique code point. This lets grammars reliably match a wide range of characters. -In iXML, you cannot put actual line breaks, tabs, or other special characters directly into quoted strings. To represent these invisible characters or symbols, iXML uses encoded characters written as a number sign ("#") followed by hexadecimal digits. Each encoded character can represent any Unicode character. +In ixml, you cannot put actual line breaks, tabs, or other special characters directly into quoted strings. To represent these invisible characters or symbols, ixml uses encoded characters written as a number sign ("#") followed by hexadecimal digits. Each encoded character can represent any Unicode character. -Most special character codes in iXML match the Unicode equivalent. A tab is represented in unicode as U+0009. Replacing "U+000" with "#" yields the iXML code, #9. +Most special character codes in ixml match the Unicode equivalent. A tab is represented in unicode as U+0009. Replacing "U+000" with "#" yields the ixml code, #9. ### Useful Character Codes @@ -175,7 +174,7 @@ your ixml grammar should create: ### Reading the Plain-Text Input -You can use CoffeePot, the jωiXML workbench, or Markup Blitz to test your grammar. You can read your plain-text input file directly from the URL above or save it and then read your local copy. +You can use CoffeePot, the jωixml workbench, or Markup Blitz to test your grammar. You can read your plain-text input file directly from the URL above or save it and then read your local copy. **Using Markup Blitz (command line):** @@ -212,7 +211,7 @@ Files from different systems use different line-ending conventions: Linux and ma Files may also end with or without a final newline. To handle both cases, you can use the separator pattern with an optional trailing newline: ``` -doc: line++newline, newline? . +doc: line++newline, newline?. ``` - `line++newline` matches one or more lines with newlines as separators between them