Add logic for separator/initiator/prefixed/terminator alignment#1604
Add logic for separator/initiator/prefixed/terminator alignment#1604olabusayoT merged 1 commit intoapache:mainfrom
Conversation
stevedlawrence
left a comment
There was a problem hiding this comment.
This alignment stuff is very complicated and difficult to make sense. I think this is the right approach to fixing the current issues, I think with some tweaks needed.
Though we might want to consider if there's a different approach, or maybe some refactoring that's easier to make sense of for a future update.
| case Some(s: SequenceTermBase) if s.hasSeparator => | ||
| import SeparatorPosition.* | ||
| s.separatorPosition match { | ||
| case Prefix | Infix => LengthMultipleOf(s.knownEncodingAlignmentInBits) |
There was a problem hiding this comment.
MTA is an alignment thing, so I think these MTA approx functions what to return AligmentMultipleOf instead of LengthApprox.
There was a problem hiding this comment.
Not sure that's right.
An MTA "region" like any region is a part of the data stream that has a length. So it depends are we computing the length of something, computing the alignment before or after it, or asking what it's required alignment is.
There was a problem hiding this comment.
The separatorPrefixMTAApprox val, and other MTAApprox vals are calculating where the MTA region needs to align to, not how big the region is.
This way we can look at approximate end alignment (we should probably call this approximate position) of whatever came before the MTA region and determine if the MTA region is known to already be aligned and so can be excluded.
And then based on the previous alignment and the MTA, we can then calculate where that MTA will have put us. For example, we might know it just put us on a byte boundary, but we might know more and know that it actually put us on a 2-byte boundary, allowing for better optimizations if a later elements needs to be 2-byte aligned.
There was a problem hiding this comment.
Gotcha. I think you are correct.
I think some of the complexity in this code could be reduced by rigorously naming the various things to avoid confusion.
Being clear about lengthApprox vs positionApprox vs. alignmentRequirement would go a long ways. We're overusing the term "alignment" here to mean position and requirement.
There was a problem hiding this comment.
Definitely. The naming could definitely be improved.
Adding more variables might also be helpful. This document has great diagrams of all the different parts that are considered in our alignment
https://daffodil.apache.org/dev/design-notes/term-sharing-in-schema-compiler/
Some of our current variables kindof mush multiple boxes together which can make it difficult to tease apart what's going on and how well we actually match the document.
It might makes sense to have a variable that calculates the approximate start of every one of the boxes with matching names (e.g. allignFilleApproximateStart, initiatorMTAApproximateStart),. It adds more variables, but might make things more clear.
And that might simplify other logic that tries to statically compile out different grams. For example, the gram guards can become something like:
val needsAlignFillGram = alignFillApproxStart % alignment != 0
val needsInitiatorMTAGram = initiatorMTAApproxStart % initiator.encoding.mandatoryTextAlignment != 0| private lazy val separatorLengthApprox = this.optLexicalParent match { | ||
| case Some(s: SequenceTermBase) if s.hasSeparator => | ||
| getEncodingLengthApprox(s) | ||
| case _ => LengthMultipleOf(0) |
There was a problem hiding this comment.
Instead of LengthMultipleOf(0), it might be more clear to make this LengthExact(0). I imigine all the math ends up the same so it probably doesn't really matter, but it feels a bit odd to say something has a length that is a multiple of zero.
| // To combine AlignmentMultipleOfs, use * | ||
| def *(that: AlignmentMultipleOf) = AlignmentMultipleOf(Math.gcd(nBits, that.nBits)) | ||
| def %(that: AlignmentMultipleOf) = nBits % that.nBits | ||
|
|
There was a problem hiding this comment.
Some of these changes are making me wonder if we need a def +(that: AlignmentMultipleOf) that has slightly different logic than gcd.
I believe the intention of * was to combine alignments that could all happen at the same point in data (e.g. a choice of elements, optional siblings elements).
But in the new code, we are taking an existing approximate alignment, and potentially adding a new alignment to it, which is slightly different.
For example, say we are currently at 2-byte alignment (AlignmentMultipleof(16)) and we are adding an element that needs to be 1-byte (AlignmentMultipleOf(8)). In this case, we do not need to perform any alignment here because we are already byte align (we are actually 2-byte aligned), and our logic handles that correctly.
But the contentStartAlignment of that new element still really wants to be 2-byte aligned, so it should remain AlignmentMultipleOf(16). But right now, using * it's alignment will be AlignmentMultilpeOf(8) since gcd(16,8) => 8. So we actually have less information about our true alignment, and could miss out on future optimizations.
So it feels like a + operation might want to be something like
def *(that: AlignmentMultipleOf) = if (this.nBits % that.nBits == 0) this else thatThe idea is that if our new alignment (that) evenly divides our existing alignment (this), then no alignment is actually needed, and we can keep our more accurate current alignment. Otherwise we use the new alignment.
| case Prefix | Infix => LengthMultipleOf(s.knownEncodingAlignmentInBits) | ||
| case Postfix => LengthMultipleOf(0) | ||
| } | ||
| case _ => LengthMultipleOf(0) |
There was a problem hiding this comment.
I think this really wants to be AlignmentApprox(1) (MTA is alignment after all so using length isn't quite right) and then with the new + operator mentioned above it will work as expected. This is because 1 will always evenly divide the existing alignment so the existing alignment will be used. Same with the other MTA things.
In fact, I think AligmentApprox(0) never wants to be used except for the root element, since it's kindof a 1-based thing.
| } | ||
| * initiatorMTAApprox | ||
| + initiatorLengthApprox | ||
| val csa = (leftFramingApprox + prefixLengthElementLength) * valueMTAApprox |
There was a problem hiding this comment.
I think this also wants to use the new + operator.
| contentStartAlignment + (elementSpecifiedLengthApprox + trailingSkipApprox) | ||
| contentStartAlignment + elementSpecifiedLengthApprox | ||
| } | ||
| val cea = res * terminatorMTAApprox |
There was a problem hiding this comment.
Wants to be the new + operator
| // to terminator MTA/length and our trailing skip to get where it would leave off were | ||
| // each the actual last child. | ||
| val lastApproxesFinal = lastApproxes.map { | ||
| _ * terminatorMTAApprox |
| prefixElem.elementLengthInBitsEv.optConstant.get.get | ||
| ) + prefixLengthElementLength | ||
| } else { | ||
| getEncodingLengthApprox(prefixElem) |
There was a problem hiding this comment.
Prefix elements dont' necessarily have encoding. They could just be binary numbers. And I think it's possible for prefix lengths to have lengthKind="prefixed", in which case this would have to be LengthMultipleOf(lengthUnits) or something?
There was a problem hiding this comment.
If they are binary numbers, wouldn't the isKnownEncoding condition be false, and result in the value being LengthMultipleOf(8)?
There was a problem hiding this comment.
It doesn't look like isKnownEncoding takes into account whether or not an element is binary or not. It just accesses the encodingEv and asks if it's constant. And I think this is the correct behavior, since binary elements can have things like initiators that do require encoding. For example:
<element name="foo" type="xs:int"
dfdl:representation="binary" dfdl:length="4" dfdl:lengthUnits="bytes"
dfdl:initiator="FOO:" dfdl:encoding="US-ASCII" />This will parse the 4-character ascii string "FOO:" followed by a 4-byte two's complement binary integer.
So in some cases, like the example above, it can be reasonable to ask if a binary element has a known encoding ad it could be true, so we can't rely on it being false for binary numbers.
There was a problem hiding this comment.
This sort of "stuff" really does happen. TLOG format is a published schema on DFDLSchemas site. It is a binary format with character delimiters here and there. It does sort of help if looking at text dumps of the data. The binary numbers come through as mojibake, but you can pick out the delimiters often and figure the data out via a combination of text and hex dumps.
| LengthExact(0) | ||
| } | ||
|
|
||
| private def getEncodingLengthApprox(t: Term) = { |
There was a problem hiding this comment.
Could we make this a private lazy val so we don't have to recalculate it? For example, uses would look like term.encodingLengthApprox instead of getEncodingLengthApprox(someTerm)?
| } | ||
| }.value | ||
|
|
||
| private lazy val separatorPrefixMTAApprox = |
There was a problem hiding this comment.
Is it worth calling this separatorPrefixInfixMTAApprox, just to make it clear this is represents both of kinds of separators?
| case eb: ElementBase => { | ||
| if (eb.lengthKind == Prefixed) { | ||
| val prefixTypeElem = eb.prefixedLengthElementDecl | ||
| getElementLength(prefixTypeElem) |
There was a problem hiding this comment.
Is it possible to use eb.prefixedLenghtElementDecl.elementSpecifiedLengthApprox to get the length of the prefix element instead of the new getElementLength?
| Assert.invariant(lastApproxes.nonEmpty) | ||
| val res = lastApproxes.reduce { _ * _ } | ||
| val res = lastApproxes.reduce { | ||
| _ * _ |
There was a problem hiding this comment.
Can we put this all on one line? Looks confusing having a line with just symbol characters.
| // it the actual last child. | ||
| // | ||
| // for each possible last child, gather its endingAlignmentApprox | ||
| val lceaa = lc.endingAlignmentApprox |
There was a problem hiding this comment.
Does this want to be endingAlignmentWithRightFramingApprox? This case is trying to figure out the ending alignment of the last child(ren) of the model group, which is where the content of this model group ends. If the last child(ren) contains a postfix separator, then I think that wants to be included as part of the content of this model group. And so this wants to use the ...WithRightFraming variable?
Or, maybe we just move the separatorPostfixMTAApprox and separatorLengthApprox additions into endingAlignmentAppprox and remove the ...WithRightFraming variable? Since with the above change, nothing would actually access ...WithRightFraming anymore?
There was a problem hiding this comment.
You're right, per the spec SequenceContent ought to include Separator, terminator and trailingskip for the enclosed content.
Sequence = LeftFraming SequenceContent RightFraming
SequenceContent = [ PrefixSeparator EnclosedContent [ Separator EnclosedContent ]* PostfixSeparator ]
I think there is value in keeping the WithRightFraming variable as it more closely aligns with the spec thus we should actually get rid of the endingAlignmentApprox variable since nothing uses it
| val prefixElem = eb.prefixedLengthElementDecl | ||
| if (prefixElem.lengthKind == Explicit || prefixElem.lengthKind == Implicit) { | ||
| LengthExact( | ||
| prefixElem.elementLengthInBitsEv.optConstant.get.get |
There was a problem hiding this comment.
I think this might be incorrect? This is saying that if an element is lengthKind="prefix" then the elements length is length of the prefixed length element. But that's not quite right. The length of this element is the length of the value of the prefix element, which can't be known until runtime. So we can never know the exact length of a prefix length element.
I might suggest we just remove the new prefix length logic (maybe add comments where logic is missing) and leave it as NYI. I think prefix length elements are broken in a number of places so it's going to be hard to test if this is even working correctly. Suggest we fix this when we fix prefixed lengths.
stevedlawrence
left a comment
There was a problem hiding this comment.
+1 👍 can you please also update an existing ticket or create a new ticket so we don't forget that lengthKind="prefix" isn't implemented ?
jadams-tresys
left a comment
There was a problem hiding this comment.
+1 with just a style suggestion
If having separate vals for results makes things easier to debug feel free to leave them in.
| val res = (pa % aa) == 0 | ||
| res | ||
| } | ||
| res |
There was a problem hiding this comment.
Thoughts on having this else statement just end in
(pa % aa) == 0
Having the multiple res vals makes things a little confusing.
| val res = Some(contentStartAlignment) | ||
| res |
There was a problem hiding this comment.
Creating a separate val res for this seems unnecessary.
| val res = lastApproxes.reduce(_ * _) | ||
| res |
0e18f71 to
556fed3
Compare
- Implement logic to calculate prefix, infix, and postfix separator alignments and lengths in sequence terms. - Added new alignment logic/test cases covering initiator alignment, prefixed length elements, terminator alignment and value MTA (for optimization) behavior. - Add new variables to more accurately represent alignment, such as contentEndAlignment, which is useful for delimiter text alignment before the terminator and trailing skip is considered. - Updated schemas and TDML files to include test data for new alignment scenarios. DAFFODIL-2295, DAFFODIL-3056, DAFFODIL-3057
556fed3 to
0614ec6
Compare
DAFFODIL-2295, DAFFODIL-3056, DAFFODIL-3057