Improve lark-cython compatibility by ornariece · Pull Request #1528 · lark-parser/lark

ornariece · 2025-04-22T15:03:43Z

This PR addresses two significant compatibility issues when using lark-cython:

1. Fix Token handling for lark-cython

In the standard lark implementation, Token inherits from str, allowing string methods to be called directly on Token objects. However, lark-cython implements Token differently - it doesn't inherit from str, which means string methods like rsplit() won't work on token objects.

Changes:

Replaced direct string method calls on Token instances with calls on token.value instead
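
As a rough illustration of the change above, the sketch below uses two stand-in token classes (`StrToken` and `PlainToken`, both hypothetical and not the actual lark or lark-cython classes) to show why calling the string method on `token.value` works whether or not the token inherits from `str`. The helper name `last_segment` is also an assumption made for this example.

```python
class StrToken(str):
    """Hypothetical stand-in for a token that inherits from str."""

    def __new__(cls, type_, value):
        obj = super().__new__(cls, value)
        obj.type = type_
        obj.value = value
        return obj


class PlainToken:
    """Hypothetical stand-in for a token that does NOT inherit from str."""

    def __init__(self, type_, value):
        self.type = type_
        self.value = value


def last_segment(token, sep="."):
    # Call the string method on token.value rather than on the token itself,
    # so the code does not depend on the token being a str subclass.
    return token.value.rsplit(sep, 1)[-1]


for tok in (StrToken("NAME", "pkg.module.func"), PlainToken("NAME", "pkg.module.func")):
    print(type(tok).__name__, last_segment(tok))
```

Calling `tok.rsplit(...)` directly would only work for the first class; going through `.value` works for both.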

2. Refactor postlexer handling architecture

The original implementation used a PostLexConnector class that wrapped a lexer but implemented only part of the Lexer interface (the lex() method but not next_token()). This partial implementation worked in standard lark through duck typing, but caused errors in lark-cython's stricter implementation.

Changes:

Removed the use of the PostLexConnector class
Added a self.postlex attribute to store the postlexer separately
Preserved the stream-based design of PostLex while ensuring proper interface implementation

This approach is more consistent with the architectural intent of postlexers as stream processors, while avoiding type compatibility issues in lark-cython.
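
A minimal sketch of the idea, assuming toy stand-ins rather than the real lark classes: `ToyLexer`, `UpperPostLex`, and `ToyLexerThread` are hypothetical names, and the state is a plain dict. The thread object keeps the postlexer as a separate attribute and applies its `process()` method to the token stream, instead of wrapping the lexer in a connector object.

```python
from copy import copy


class ToyLexer:
    """Hypothetical lexer: yields whitespace-separated words from the state."""

    def lex(self, state):
        for word in state["text"].split()[state["pos"]:]:
            state["pos"] += 1
            yield word


class UpperPostLex:
    """Hypothetical postlexer: a stream processor that upper-cases tokens."""

    def process(self, stream):
        for tok in stream:
            yield tok.upper()


class ToyLexerThread:
    """Holds the lexer, its state, and (separately) an optional postlexer."""

    def __init__(self, lexer, state, postlex=None):
        self.lexer = lexer
        self.state = state
        self.postlex = postlex

    def lex(self):
        tokens = self.lexer.lex(self.state)
        if self.postlex is None:
            return tokens
        # The postlexer only sees a token stream; it never poses as a lexer.
        return self.postlex.process(tokens)

    def __copy__(self):
        # Mirrors the diff under review: the postlexer is shared, not copied.
        return type(self)(self.lexer, copy(self.state), self.postlex)


thread = ToyLexerThread(ToyLexer(), {"text": "a b c", "pos": 0}, UpperPostLex())
print(list(thread.lex()))  # ['A', 'B', 'C']
```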

erezsh · 2025-04-22T15:20:13Z

lark/lexer.py

+        return self.postlex.process(tokens)
+
+    def __copy__(self):
+        return type(self)(self.lexer, copy(self.state), self.postlex)


erezsh · Apr 22, 2025

Are we sure that the postlexer shouldn't be copied too?

ornariece · Apr 22, 2025

i would say so? i see it as some kind of processor, it holds no data per se about the current states. i could be wrong.

erezsh · Apr 23, 2025

But it does hold a state, e.g. https://github.com/lark-parser/lark/blob/master/lark/indenter.py#L33

MegaIng · Apr 23, 2025

Although, tbh, those shouldn't be instance attributes, but local variables inside of process. I think this postlexer design should already be broken on the current main if this copy method is called since lexer=PostLexConnector doesn't get copied either.

ornariece · Apr 23, 2025

i see, i agree those don't really make sense as instance attributes. it makes sense that a postlexer instance should be agnostic to a stream of tokens. this matter is probably beyond the scope of this PR though.
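
To make the instance-attribute concern concrete, here is a toy sketch (not the actual `Indenter` code; `StatefulDepth` and `LocalDepth` are hypothetical classes). It assumes two token streams are processed in an interleaved fashion by the same postlexer instance, which is roughly what could happen if a copied lexer thread shares its postlexer with the original.

```python
class StatefulDepth:
    """Hypothetical postlexer that tracks nesting depth on the instance."""

    def __init__(self):
        self.depth = 0

    def process(self, stream):
        for tok in stream:
            if tok == "(":
                self.depth += 1
            elif tok == ")":
                self.depth -= 1
            yield (tok, self.depth)


class LocalDepth:
    """Same logic, but the depth is a local variable inside process()."""

    def process(self, stream):
        depth = 0
        for tok in stream:
            if tok == "(":
                depth += 1
            elif tok == ")":
                depth -= 1
            yield (tok, depth)


# Two streams sharing one stateful instance: depth leaks between them.
shared = StatefulDepth()
g1 = shared.process(iter(["(", "a"]))
g2 = shared.process(iter(["b"]))
print(next(g1), next(g2))  # ('(', 1) ('b', 1)

# With local state, each stream has its own depth.
local = LocalDepth()
h1 = local.process(iter(["(", "a"]))
h2 = local.process(iter(["b"]))
print(next(h1), next(h2))  # ('(', 1) ('b', 0)
```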

erezsh · 2025-04-23T05:40:18Z

The "Python type check" test should now work properly if you rebase over master.

erezsh · 2025-04-29T09:12:17Z

Overall looks good, but I have a few questions -

Did you consider implementing PostLex fully instead? I saw you wrote "This approach is more consistent with the architectural intent of postlexers as stream processors", can you elaborate on that a bit more please?
In your implementation, PostLexThread can accept a None postlex. When would that happen?

Also, I still think copy() should make a copy of the postlex too, since it's not an object we have control of. (and if there's no data to copy, the cost is negligible anyway)
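
The suggestion to copy the postlexer as well could look roughly like the following sketch. `CountingPostLex` and `ToyThread` are hypothetical stand-ins, not lark classes, and the shallow `copy()` used here is an assumption: a postlexer with mutable attributes (for example a list) might need a deeper copy to be fully isolated.

```python
from copy import copy


class CountingPostLex:
    """Hypothetical postlexer with per-instance state (a token counter)."""

    def __init__(self):
        self.seen = 0

    def process(self, stream):
        for tok in stream:
            self.seen += 1
            yield tok


class ToyThread:
    """Hypothetical lexer thread over a list of already-lexed tokens."""

    def __init__(self, tokens, postlex=None):
        self.tokens = tokens
        self.postlex = postlex

    def lex(self):
        stream = iter(self.tokens)
        return stream if self.postlex is None else self.postlex.process(stream)

    def __copy__(self):
        # copy(None) returns None, so a missing postlexer needs no special case.
        return type(self)(copy(self.tokens), copy(self.postlex))


original = ToyThread(["a", "b"], CountingPostLex())
clone = copy(original)
list(clone.lex())  # consume the clone's stream only

print(original.postlex.seen, clone.postlex.seen)  # 0 2
```

If the postlexer holds no state, the extra copy is cheap; if it does, the clone no longer mutates the original's counter.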

ornariece added 2 commits April 22, 2025 16:06

use token.value where required

f98bea2

handle postlex graciously

9b3f1a0

erezsh reviewed Apr 22, 2025

ornariece and others added 4 commits April 22, 2025 17:25

fix typing

c59bedd

Docs: Updated link of DSL article to a new version, with better formatting, and support for a dark theme

4280441

Upgrade pre-commit version

f9ba191

Merge pull request lark-parser#1529 from lark-parser/docs_apr23_2025

6d0f4b6

Docs: Updated link of DSL article to a new version

ornariece added 4 commits April 23, 2025 11:53

use token.value where required

1183ae5

handle postlex graciously

6635f12

fix typing

5374406

Merge branch 'lark-cython-compat' of https://github.com/ornariece/lark into lark-cython-compat

9f05ac9

ornariece wants to merge 10 commits into lark-parser:master from ornariece:lark-cython-compat


3 participants
