Skip to content

Commit eb164c4

Browse files
Jon Ludlam's Agentclaude
andcommitted
Rebase menhir parser onto current master
Squash-merge of FayCarsons' menhir parser work (PR ocaml#1295) onto current master. The hand-written recursive descent parser (syntax.ml) is replaced by a menhir grammar (parser.mly) with supporting modules. Key changes: - Replace syntax.ml with parser.mly (menhir grammar) - Add tokens.ml (external token definitions for menhir) - Add parser_aux.ml (parser action helpers) - Add writer.ml (writer monad for warning collection) - Rewrite lexer.mll to emit menhir-compatible tokens - Update odoc_parser.ml entry point to use Parser.main - Add menhirLib dependency to dune and opam - Preserve master's code block box model (deindent/codeblock_content) - Preserve master's new tags (@children_order, @toc_status, etc.) - Update serialize.ml for current AST types - Add test_driver/tester.ml for parser comparison testing Note: expect tests have known behavioral differences from the old parser (different error messages, locations). These are documented in TODO.md. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent c3f0f46 commit eb164c4

24 files changed

Lines changed: 3181 additions & 2210 deletions

dune-project

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@
1111

1212
(authors
1313
"Anton Bachin <antonbachin@yahoo.com>"
14-
"Daniel B\195\188nzli <daniel.buenzli@erratique.ch>"
14+
"Daniel Bünzli <daniel.buenzli@erratique.ch>"
1515
"David Sheets <sheets@alum.mit.edu>"
1616
"Jon Ludlam <jon@recoil.org>"
1717
"Jules Aguillon <juloo.dsi@gmail.com>"
@@ -23,10 +23,10 @@
2323
"Emile Trotignon <emile.trotignon@gmail.com>")
2424

2525
(maintainers
26-
"Daniel B\195\188nzli <daniel.buenzli@erratique.ch>"
26+
"Daniel Bünzli <daniel.buenzli@erratique.ch>"
2727
"Jon Ludlam <jon@recoil.org>"
2828
"Jules Aguillon <juloo.dsi@gmail.com>"
29-
"Paul-Elliot Angl\195\168s d'Auriac <paul-elliot@tarides.com>")
29+
"Paul-Elliot Anglès d'Auriac <paul-elliot@tarides.com>")
3030

3131
(cram enable)
3232

odoc-parser.opam

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,8 @@ depends: [
1717
"ocaml" {>= "4.08.0" & < "5.5"}
1818
"astring"
1919
"camlp-streams"
20+
"menhir" {>= "20211128"}
21+
"menhirLib"
2022
"ppx_expect" {with-test}
2123
"sexplib0" {with-test}
2224
]

src/parser/TODO.md

Lines changed: 46 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,46 @@
1+
- Some locations are still not accurate. This seems to be acting up in comments that span
2+
many lines. There is potentially an off-by-one error or similar in
3+
`Lexer.update_content_newlines` which is (supposed) to increment the lexbuf's line
4+
position for every newline encountered in some content (i.e. inside of a code or math block)
5+
6+
- Top-level errors like two nestable block elements or headings on the same line
7+
need to be handled. Currently, they parse correctly but do not emit a warning.
8+
9+
- Repetition in `tag_with_content` parse rule(parser.mly:207). Two productions are identical
10+
save for a newline. This is because an optional newline causes a reduce conflict due to
11+
`nestable_block_element`'s handling of whitespace.
12+
13+
- Improve error handling inside light table cells. Currently, we cannot do much besides use
14+
Menhir's `error` token, which erases all information about the error which happened and we
15+
have to use a string of the offending token to display what went wrong to users, which
16+
doesn't necessarily communicate a lot
17+
18+
- Tests. There are a few tests, like the ones which test the positions in the lexing buffer,
19+
which don't apply to the new parser. Others expect error messages which cannot be produced
20+
by the relevant parser rule
21+
22+
- Likely some error cases which have not been handled. These should be trivial to fix,
23+
you should really only need to add a new production to the relevant parser rule which
24+
handles the offending token
25+
26+
Notes for anyone working on this
27+
- Due to the nature of Menhir, this parser is difficult to work on.
28+
- Changes will have unexpected non-local consequences due to more or less tokens being consumed by
29+
some neighboring (in the parse tree) rule.
30+
- You need to familiarize yourself with the branch of the parse tree that you're working on
31+
(i.e. toplevel->nestable_block_element->paragraph) before you start making non-trivial changes.
32+
- Type errors will point towards unrelated sections of the parser or give you incorrect information
33+
about what has gone wrong.
34+
35+
- If you need to emulate some sort of context like "paragraphs can't accept '|' tokens if they're inside
36+
tables", then you need to parameterize that rule by some other rule which dictates what it can accept.
37+
For example, toplevel block elements match `paragraph(any_symbol)` and tables match
38+
`paragraph(symbols_except_bar)`
39+
40+
- Be as specific as possible. Avoid optional tokens when possible. Prefer the non-empty
41+
list rules (`sequence_nonempty`, `sequence_separated_nonempty`) over the alternatives.
42+
Ambiguity will produce a compile-time reduce/reduce rule if you're lucky, unexpected
43+
behavior at runtime if you're not.
44+
45+
- Contact me on the company slack or at faycarsons23@gmail.com if you're confused about
46+
anything!

src/parser/ast.ml

Lines changed: 15 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,17 @@
11
(** Abstract syntax tree representing ocamldoc comments *)
22

3+
(* TODO: (@faycarsons)
4+
We no longer need polymorphism in the parser, so for performance and
5+
simplicity's sake the AST should (probably, assuming no issues in other
6+
parts of Odoc) be refactored to use nominal sum types
7+
*)
8+
9+
type list_kind = [ `Ordered | `Unordered ]
10+
type list_syntax = [ `Light | `Heavy ]
11+
type list_item = [ `Li | `Dash ]
12+
13+
type table_cell_kind = [ `Header | `Data ]
14+
315
(** This is a syntactic representation of ocamldoc comments. See
416
{{:https://ocaml.org/releases/4.12/htmlman/ocamldoc.html}The manual} for a
517
detailed description of the syntax understood. Note that there is no attempt
@@ -47,8 +59,8 @@ type code_block_meta = {
4759
tags : code_block_tags;
4860
}
4961

50-
type media = Token.media
51-
type media_href = Token.media_href
62+
type media = [ `Audio | `Video | `Image ]
63+
type media_href = [ `Reference of string | `Link of string ]
5264

5365
type code_block = {
5466
meta : code_block_meta option;
@@ -69,9 +81,7 @@ and nestable_block_element =
6981
{!Odoc_parser.verbatim_content} *)
7082
| `Modules of string with_location list
7183
| `List of
72-
[ `Unordered | `Ordered ]
73-
* [ `Light | `Heavy ]
74-
* nestable_block_element with_location list list
84+
list_kind * list_syntax * nestable_block_element with_location list list
7585
| `Table of table
7686
| `Math_block of string (** @since 2.0.0 *)
7787
| `Media of reference_kind * media_href with_location * string * media

src/parser/dune

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,9 @@
11
(ocamllex lexer)
22

3+
(menhir
4+
(modules parser)
5+
(flags --table --external-tokens Tokens --explain))
6+
37
(library
48
(name odoc_parser)
59
(public_name odoc-parser)
@@ -9,4 +13,4 @@
913
(backend bisect_ppx))
1014
(flags
1115
(:standard -w -50))
12-
(libraries astring camlp-streams))
16+
(libraries astring camlp-streams menhirLib))

src/parser/lexer.mli

Lines changed: 2 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -3,9 +3,7 @@
33
type input = {
44
file : string;
55
offset_to_location : int -> Loc.point;
6-
warnings : Warning.t list ref;
7-
lexbuf : Lexing.lexbuf;
8-
string_buffer : Buffer.t;
6+
mutable warnings : Warning.t list;
97
}
108

11-
val token : input -> Lexing.lexbuf -> Token.t Loc.with_location
9+
val token : input -> Lexing.lexbuf -> Parser.token

0 commit comments

Comments
 (0)