Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing error on caret-escaped special characters #111

Open
2 tasks done
zaneduffield opened this issue Nov 27, 2023 · 3 comments
Open
2 tasks done

Parsing error on caret-escaped special characters #111

zaneduffield opened this issue Nov 27, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@zaneduffield
Copy link
Collaborator

zaneduffield commented Nov 27, 2023

Prerequisites

  • This bug is in SonarDelphi, not SonarQube or my Delphi code.
  • This bug has not already been reported.

SonarDelphi version

1.0.0

SonarQube version

No response

Issue description

An undocumented Delphi feature (carried forward from TurboPascal) are escaped 'control characters'.

For example

const
  CtrlC = ^C;

While it may be intended for use with visible characters, it's valid to escape any single-byte ascii character with this technique.

const
  Space = ^ ;
  Comment = ^{;
  Plus = ^+;

The relevant section of the grammar in SonarDelphi currently has a few problems

escapedCharacter             : TkCharacterEscapeCode
                             | '^' (TkIdentifier | TkIntNumber | TkAnyChar) -> ^({changeTokenType(TkEscapedCharacter)})
  1. It's not valid to accept any identifier or integer after the ^; only one (single-byte ascii) character may be escaped
  2. Whitespace and the comment-starting { are handled by the 'hidden' channel and never make it to this section

For more info about the caret-escaped characters, see:

  1. https://stackoverflow.com/a/4916503/21058101
  2. http://www.delphibasics.co.uk/RTL.php?Name=Char

Steps to reproduce

Run SonarDelphi on the following program

program Test;

const A = ^ '';

begin
end.

observe the error

no viable alternative at input 'A'

Minimal Delphi code exhibiting the issue

No response

@zaneduffield zaneduffield added bug Something isn't working triage This needs to be triaged by a maintainer labels Nov 27, 2023
@Cirras
Copy link
Collaborator

Cirras commented Nov 27, 2023

This one is a bit cursed...

Comment = ^{;

Anyway, we'll definitely need to handle these caret-notation escapes as tokens in the lexer rather than high-level syntax structures in the parser.

Unlike most of the things in the lexer, these things are annoyingly context-dependent.

Here's how freepascal handles them:

'''','#','^' :
  begin
    len:=0;
    cstringpattern:='';
    iswidestring:=false;
    if c='^' then
    begin
      readchar;
      c:=upcase(c);
      if (block_type in [bt_type,bt_const_type,bt_var_type]) or
          (lasttoken=_ID) or (lasttoken=_NIL) or (lasttoken=_OPERATOR) or
          (lasttoken=_RKLAMMER) or (lasttoken=_RECKKLAMMER) or (lasttoken=_CARET) then
        begin
          token:=_CARET;
          goto exit_label;
        end
      else
        begin
          inc(len);
          setlength(cstringpattern,256);
          if c<#64 then
            cstringpattern[len]:=chr(ord(c)+64)
          else
            cstringpattern[len]:=chr(ord(c)-64);
          readchar;
        end;
    end;

Notes:

  • The block_type context appears to only be checked so they can surface a more appropriate error message to the user when a caret appears in an unexpected place - we don't need to check it.
  • The lasttoken=_OPERATOR check is because FPC's operator overloads allow you to refer to them by the symbol instead of a symbolic name (class operator ^( ... )). Again, we don't need a similar check.

Lexing rules:

  • match ^
  • if previous token is TkIdentifier, nil, ), ], or `'^', interpret it as the dereference operator.
  • otherwise, interpret it as a caret-notation character escape and take the following character (which must be an ascii character)

Expression rules:

  • take the ordinal value of the character following the caret
  • (value + 64) % 128
  • return the corresponding ascii character

@Cirras Cirras removed the triage This needs to be triaged by a maintainer label Nov 27, 2023
@Cirras
Copy link
Collaborator

Cirras commented Feb 29, 2024

Notes:

  • The block_type context appears to only be checked so they can surface a more appropriate error message to the user when a caret appears in an unexpected place - we don't need to check it.

This was a naive thought, we definitely do need to know what block we're in while lexing this token.
For example, ^T could be a pointer type or a caret-notation string expression depending on the block context.

This one is significantly complicated by ANTLR making the lexer and parser totally independent, which isn't generally how Pascal is parsed. It's simply not a context-free grammar.

@zaneduffield
Copy link
Collaborator Author

we definitely do need to know what block we're in while lexing this token

Yeah I worked that out the hard way when I tried implementing a lexer for this 😆 .

ANTLR making the lexer and parser totally independent

I don't know what level of freedom you have in the ANTLR lexer, but my idea for how to implement this in a pure 'lexer' was to maintain a stack of contexts (e.g. type/const/var) and check what the top context is when encountering a ^ character. Basically just doing the minimal amount of parsing required to disambiguate.

Another implementation option is to lex ^T as a special token type that is later disambiguated by the parser.

Yet another option is to just say we won't support this...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants