Add logic for separator/initiator/prefixed/terminator alignment by olabusayoT · Pull Request #1604 · apache/daffodil

olabusayoT · 2025-12-18T21:01:18Z

Implement logic to calculate prefix, infix, and postfix separator alignments and lengths in sequence terms.
Added new alignment logic/test cases covering initiator alignment, prefixed length elements, terminator alignment and value MTA (for optimization) behavior.
Enhanced alignment calculation by excluding prior alignment from the contentStart, as alignmentApprox is a good place to start the contentStartAlignment.
Updated schemas and TDML files to include test data for new alignment scenarios.

DAFFODIL-2295, DAFFODIL-3056, DAFFODIL-3057

stevedlawrence

This alignment stuff is very complicated and difficult to make sense. I think this is the right approach to fixing the current issues, I think with some tweaks needed.

Though we might want to consider if there's a different approach, or maybe some refactoring that's easier to make sense of for a future update.

stevedlawrence · 2025-12-19T14:48:22Z

+      case Some(s: SequenceTermBase) if s.hasSeparator =>
+        import SeparatorPosition.*
+        s.separatorPosition match {
+          case Prefix | Infix => LengthMultipleOf(s.knownEncodingAlignmentInBits)


MTA is an alignment thing, so I think these MTA approx functions what to return AligmentMultipleOf instead of LengthApprox.

Not sure that's right.

An MTA "region" like any region is a part of the data stream that has a length. So it depends are we computing the length of something, computing the alignment before or after it, or asking what it's required alignment is.

The separatorPrefixMTAApprox val, and other MTAApprox vals are calculating where the MTA region needs to align to, not how big the region is.

This way we can look at approximate end alignment (we should probably call this approximate position) of whatever came before the MTA region and determine if the MTA region is known to already be aligned and so can be excluded.

And then based on the previous alignment and the MTA, we can then calculate where that MTA will have put us. For example, we might know it just put us on a byte boundary, but we might know more and know that it actually put us on a 2-byte boundary, allowing for better optimizations if a later elements needs to be 2-byte aligned.

Gotcha. I think you are correct.

I think some of the complexity in this code could be reduced by rigorously naming the various things to avoid confusion.

Being clear about lengthApprox vs positionApprox vs. alignmentRequirement would go a long ways. We're overusing the term "alignment" here to mean position and requirement.

Definitely. The naming could definitely be improved.

Adding more variables might also be helpful. This document has great diagrams of all the different parts that are considered in our alignment

https://daffodil.apache.org/dev/design-notes/term-sharing-in-schema-compiler/

Some of our current variables kindof mush multiple boxes together which can make it difficult to tease apart what's going on and how well we actually match the document.

It might makes sense to have a variable that calculates the approximate start of every one of the boxes with matching names (e.g. allignFilleApproximateStart, initiatorMTAApproximateStart),. It adds more variables, but might make things more clear.

And that might simplify other logic that tries to statically compile out different grams. For example, the gram guards can become something like:

val needsAlignFillGram = alignFillApproxStart % alignment != 0 val needsInitiatorMTAGram = initiatorMTAApproxStart % initiator.encoding.mandatoryTextAlignment != 0

stevedlawrence · 2025-12-19T14:49:19Z

+  private lazy val separatorLengthApprox = this.optLexicalParent match {
+    case Some(s: SequenceTermBase) if s.hasSeparator =>
+      getEncodingLengthApprox(s)
+    case _ => LengthMultipleOf(0)


Instead of LengthMultipleOf(0), it might be more clear to make this LengthExact(0). I imigine all the math ends up the same so it probably doesn't really matter, but it feels a bit odd to say something has a length that is a multiple of zero.

stevedlawrence · 2025-12-19T16:18:12Z

+  // To combine AlignmentMultipleOfs, use *
  def *(that: AlignmentMultipleOf) = AlignmentMultipleOf(Math.gcd(nBits, that.nBits))
  def %(that: AlignmentMultipleOf) = nBits % that.nBits
+


Some of these changes are making me wonder if we need a def +(that: AlignmentMultipleOf) that has slightly different logic than gcd.

I believe the intention of * was to combine alignments that could all happen at the same point in data (e.g. a choice of elements, optional siblings elements).

But in the new code, we are taking an existing approximate alignment, and potentially adding a new alignment to it, which is slightly different.

For example, say we are currently at 2-byte alignment (AlignmentMultipleof(16)) and we are adding an element that needs to be 1-byte (AlignmentMultipleOf(8)). In this case, we do not need to perform any alignment here because we are already byte align (we are actually 2-byte aligned), and our logic handles that correctly.

But the contentStartAlignment of that new element still really wants to be 2-byte aligned, so it should remain AlignmentMultipleOf(16). But right now, using * it's alignment will be AlignmentMultilpeOf(8) since gcd(16,8) => 8. So we actually have less information about our true alignment, and could miss out on future optimizations.

So it feels like a + operation might want to be something like

def *(that: AlignmentMultipleOf) = if (this.nBits % that.nBits == 0) this else that

The idea is that if our new alignment (that) evenly divides our existing alignment (this), then no alignment is actually needed, and we can keep our more accurate current alignment. Otherwise we use the new alignment.

stevedlawrence · 2025-12-19T16:25:06Z

+          case Prefix | Infix => LengthMultipleOf(s.knownEncodingAlignmentInBits)
+          case Postfix => LengthMultipleOf(0)
+        }
+      case _ => LengthMultipleOf(0)


I think this really wants to be AlignmentApprox(1) (MTA is alignment after all so using length isn't quite right) and then with the new + operator mentioned above it will work as expected. This is because 1 will always evenly divide the existing alignment so the existing alignment will be used. Same with the other MTA things.

In fact, I think AligmentApprox(0) never wants to be used except for the root element, since it's kindof a 1-based thing.

stevedlawrence · 2025-12-19T16:41:11Z

-    }
+        * initiatorMTAApprox
+        + initiatorLengthApprox
+    val csa = (leftFramingApprox + prefixLengthElementLength) * valueMTAApprox


I think this also wants to use the new + operator.

stevedlawrence · 2025-12-19T16:42:12Z

-          contentStartAlignment + (elementSpecifiedLengthApprox + trailingSkipApprox)
+          contentStartAlignment + elementSpecifiedLengthApprox
        }
+        val cea = res * terminatorMTAApprox


Wants to be the new + operator

stevedlawrence · 2025-12-19T16:43:36Z

+        // to terminator MTA/length and our trailing skip to get where it would leave off were
+        // each the actual last child.
+        val lastApproxesFinal = lastApproxes.map {
+          _ * terminatorMTAApprox


This wants to be a +

stevedlawrence · 2025-12-19T16:51:30Z

+                prefixElem.elementLengthInBitsEv.optConstant.get.get
+              ) + prefixLengthElementLength
+            } else {
+              getEncodingLengthApprox(prefixElem)


Prefix elements dont' necessarily have encoding. They could just be binary numbers. And I think it's possible for prefix lengths to have lengthKind="prefixed", in which case this would have to be LengthMultipleOf(lengthUnits) or something?

If they are binary numbers, wouldn't the isKnownEncoding condition be false, and result in the value being LengthMultipleOf(8)?

It doesn't look like isKnownEncoding takes into account whether or not an element is binary or not. It just accesses the encodingEv and asks if it's constant. And I think this is the correct behavior, since binary elements can have things like initiators that do require encoding. For example:

<element name="foo" type="xs:int" dfdl:representation="binary" dfdl:length="4" dfdl:lengthUnits="bytes" dfdl:initiator="FOO:" dfdl:encoding="US-ASCII" />

This will parse the 4-character ascii string "FOO:" followed by a 4-byte two's complement binary integer.

So in some cases, like the example above, it can be reasonable to ask if a binary element has a known encoding ad it could be true, so we can't rely on it being false for binary numbers.

This sort of "stuff" really does happen. TLOG format is a published schema on DFDLSchemas site. It is a binary format with character delimiters here and there. It does sort of help if looking at text dumps of the data. The binary numbers come through as mojibake, but you can pick out the delimiters often and figure the data out via a combination of text and hex dumps.

stevedlawrence · 2026-01-09T20:07:03Z

+    LengthExact(0)
+  }
+
+  private def getEncodingLengthApprox(t: Term) = {


Could we make this a private lazy val so we don't have to recalculate it? For example, uses would look like term.encodingLengthApprox instead of getEncodingLengthApprox(someTerm)?

stevedlawrence · 2026-01-09T20:11:25Z

      }
    }.value

+  private lazy val separatorPrefixMTAApprox =


Is it worth calling this separatorPrefixInfixMTAApprox, just to make it clear this is represents both of kinds of separators?

stevedlawrence · 2026-01-09T20:18:37Z

+      case eb: ElementBase => {
+        if (eb.lengthKind == Prefixed) {
+          val prefixTypeElem = eb.prefixedLengthElementDecl
+          getElementLength(prefixTypeElem)


Is it possible to use eb.prefixedLenghtElementDecl.elementSpecifiedLengthApprox to get the length of the prefix element instead of the new getElementLength?

stevedlawrence · 2026-01-09T20:24:41Z

        Assert.invariant(lastApproxes.nonEmpty)
-        val res = lastApproxes.reduce { _ * _ }
+        val res = lastApproxes.reduce {
+          _ * _


Can we put this all on one line? Looks confusing having a line with just symbol characters.

stevedlawrence · 2026-01-13T15:09:08Z

-          // it the actual last child.
-          //
+          // for each possible last child, gather its endingAlignmentApprox
          val lceaa = lc.endingAlignmentApprox


Does this want to be endingAlignmentWithRightFramingApprox? This case is trying to figure out the ending alignment of the last child(ren) of the model group, which is where the content of this model group ends. If the last child(ren) contains a postfix separator, then I think that wants to be included as part of the content of this model group. And so this wants to use the ...WithRightFraming variable?

Or, maybe we just move the separatorPostfixMTAApprox and separatorLengthApprox additions into endingAlignmentAppprox and remove the ...WithRightFraming variable? Since with the above change, nothing would actually access ...WithRightFraming anymore?

You're right, per the spec SequenceContent ought to include Separator, terminator and trailingskip for the enclosed content.

Sequence = LeftFraming SequenceContent RightFraming
SequenceContent = [ PrefixSeparator EnclosedContent [ Separator EnclosedContent ]* PostfixSeparator ]

I think there is value in keeping the WithRightFraming variable as it more closely aligns with the spec thus we should actually get rid of the endingAlignmentApprox variable since nothing uses it

stevedlawrence · 2026-01-13T18:33:44Z

+            val prefixElem = eb.prefixedLengthElementDecl
+            if (prefixElem.lengthKind == Explicit || prefixElem.lengthKind == Implicit) {
+              LengthExact(
+                prefixElem.elementLengthInBitsEv.optConstant.get.get


I think this might be incorrect? This is saying that if an element is lengthKind="prefix" then the elements length is length of the prefixed length element. But that's not quite right. The length of this element is the length of the value of the prefix element, which can't be known until runtime. So we can never know the exact length of a prefix length element.

I might suggest we just remove the new prefix length logic (maybe add comments where logic is missing) and leave it as NYI. I think prefix length elements are broken in a number of places so it's going to be hard to test if this is even working correctly. Suggest we fix this when we fix prefixed lengths.

stevedlawrence

+1 👍 can you please also update an existing ticket or create a new ticket so we don't forget that lengthKind="prefix" isn't implemented ?

jadams-tresys

+1 with just a style suggestion

If having separate vals for results makes things easier to debug feel free to leave them in.

jadams-tresys · 2026-01-14T21:07:55Z

+        val res = (pa % aa) == 0
+        res
+      }
+    res


Thoughts on having this else statement just end in

(pa % aa) == 0

Having the multiple res vals makes things a little confusing.

jadams-tresys · 2026-01-14T21:16:10Z

+            val res = Some(contentStartAlignment)
+            res


Creating a separate val res for this seems unnecessary.

jadams-tresys · 2026-01-14T21:16:20Z

+        val res = lastApproxes.reduce(_ * _)
        res


- Implement logic to calculate prefix, infix, and postfix separator alignments and lengths in sequence terms. - Added new alignment logic/test cases covering initiator alignment, prefixed length elements, terminator alignment and value MTA (for optimization) behavior. - Add new variables to more accurately represent alignment, such as contentEndAlignment, which is useful for delimiter text alignment before the terminator and trailing skip is considered. - Updated schemas and TDML files to include test data for new alignment scenarios. DAFFODIL-2295, DAFFODIL-3056, DAFFODIL-3057

olabusayoT requested review from jadams-tresys, mbeckerle and stevedlawrence December 18, 2025 21:01

This was referenced Dec 18, 2025

Update alignment to handle terminator alignment #1601

Closed

Add support for separator alignment in sequences #1594

Closed

stevedlawrence requested changes Dec 19, 2025

View reviewed changes

stevedlawrence reviewed Jan 13, 2026

View reviewed changes

stevedlawrence approved these changes Jan 14, 2026

View reviewed changes

jadams-tresys approved these changes Jan 14, 2026

View reviewed changes

olabusayoT force-pushed the daf-alignment-tickets branch from 0e18f71 to 556fed3 Compare January 14, 2026 21:43

olabusayoT force-pushed the daf-alignment-tickets branch from 556fed3 to 0614ec6 Compare January 14, 2026 21:52

olabusayoT merged commit 62d9bb0 into apache:main Jan 15, 2026
11 checks passed

olabusayoT deleted the daf-alignment-tickets branch January 15, 2026 16:46

Conversation

olabusayoT commented Dec 18, 2025

Uh oh!

stevedlawrence left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stevedlawrence left a comment

Choose a reason for hiding this comment

Uh oh!

jadams-tresys left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants