Apple-2-SW/prog8

Fork 0

mirror of https://github.com/irmen/prog8.git synced 2026-03-11 05:41:42 +00:00

Files

Irmen de Jong ed02b84f7d micro optimization for containment checks in compiler

2026-02-06 02:43:12 +01:00

21 KiB

Raw Blame History

ANTLR4 Grammar Improvements for Prog8

Overview

This document outlines recommended improvements to the Prog8ANTLR.g4 grammar file, focusing on clearer parse errors, rule optimization, and maintainability.

Current Issues Summary

The Prog8 ANTLR4 grammar has several areas that could be improved:

Poor Error Recovery: Uses BailErrorStrategy that fails immediately on first error
Generic Error Messages: Users receive cryptic ANTLR default messages
Complex Expression Rule: 25+ alternatives in a single left-recursive rule
Fragile Lexer Rules: NOT_IN token depends on whitespace
Manual Keyword Handling: Must explicitly list keywords that can be identifiers
"Cursed" Pointer Dereference: Complex grammar that doesn't handle chaining well

1. Error Recovery & Reporting Improvements

Current Issues

BailErrorStrategy: Fails immediately on first error with no recovery
Generic error messages: ANTLR default messages like "mismatched input" are unhelpful
No rule-specific error messages: Users get cryptic errors for common mistakes

Recommended Improvements

Add Error Alternatives with Custom Messages

// Assignment rule with error detection
assignment
    : assign_target '=' expression
    | assign_target '=' assignment
    | multi_assign_target '=' expression
    | assign_target '=' expression '=' expression+  // ERROR: chained assignment
      { notifyErrorListeners("Cannot chain assignments. Use multiple statements instead."); }
    ;

Add Error Recovery Rules

statement
    : // ... valid statements
    | 'if' expression statement  // Missing THEN or block
      { notifyErrorListeners("Expected 'then' or '{' after if condition"); }
    | 'for' identifier 'in'      // Missing expression
      { notifyErrorListeners("Expected expression after 'in'"); }
    | 'while' '}'                // Missing condition
      { notifyErrorListeners("Expected condition after 'while'"); }
    ;

Improve Error Messages for Common Mistakes

vardecl
    : datatype (arrayindex | EMPTYARRAYSIG)? TAG* identifierlist
    | datatype EMPTYARRAYSG identifierlist
      { if (!$datatype.text.equals("ubyte") && !$datatype.text.equals("byte")) 
          notifyErrorListeners("Empty array syntax [] only valid for byte/ubyte types"); }
    ;

2. Rule Optimization

A. Expression Rule Refactoring

Current Problem: Massive left-recursive rule with 25+ alternatives (lines 202-233)

Optimization: Group by precedence using sub-rules:

expression
    : primaryExpression
    | expression postfixOperator
    | prefixOperator expression
    | expression multiplicativeOp expression
    | expression additiveOp expression
    | expression shiftOp expression
    | expression relationalOp expression
    | expression equalityOp expression
    | expression bitwiseAndOp expression
    | expression bitwiseXorOp expression
    | expression bitwiseOrOp expression
    | expression rangeOp expression
    | expression 'in' expression
    | expression 'not' 'in' expression
    | expression 'and' expression
    | expression 'or' expression
    | expression 'xor' expression
    | 'if' expression 'then' expression 'else' expression  // if-expression
    ;

// Operator groups
multiplicativeOp: '*' | '/' | '%' ;
additiveOp: '+' | '-' ;
shiftOp: '<<' | '>>' ;
relationalOp: '<' | '>' | '<=' | '>=' ;
equalityOp: '==' | '!=' ;
bitwiseAndOp: '&' ;
bitwiseXorOp: '^' ;
bitwiseOrOp: '|' ;
rangeOp: 'to' | 'downto' ;
postfixOperator: '++' | '--' ;
prefixOperator: '+' | '-' | '~' ;

B. Statement Rule Grouping

Current Problem: 25 alternatives with no grouping, hard to extend (lines 98-124)

Optimization: Group by category:

statement
    : directive
    | declaration
    | controlFlow
    | assignment
    | jumpStatement
    | loopStatement
    | subroutine
    | inlineasm
    | labeldef
    | alias
    ;

declaration
    : variabledeclaration
    | structdeclaration
    | subroutinedeclaration
    ;

controlFlow
    : if_stmt
    | branch_stmt
    | whenstmt
    | ongoto
    ;

jumpStatement
    : unconditionaljump
    | returnstmt
    | breakstmt
    | continuestmt
    ;

loopStatement
    : forloop
    | whileloop
    | untilloop
    | repeatloop
    | unrollloop
    ;

C. Directive Rule Simplification

Current Problem: Complex, ambiguous alternatives (line 156)

Optimization: Simplify structure:

directive
    : '%' name=UNICODEDNAME '!'? directiveArgs
    ;

directiveArgs
    : '(' EOL? scoped_identifier (',' EOL? scoped_identifier)* ','? EOL? ')'  // List
    | directiveArg (',' directiveArg)*  // Regular args
    |  // Empty
    ;

3. Lexer Improvements

A. NOT_IN Token Fix

Current Problem: Fragile lexer rule with whitespace dependence (line 72)

NOT_IN: 'not' [ \t]+ 'in' [ \t] ;

Fix: Move to parser rule:

// Remove NOT_IN from lexer
// In parser:
expression
    : ...
    | left=expression 'not' 'in' right=expression  #NotInExpression
    ;

B. Identifier Rule Improvement

Current Problem: Must manually list keywords that can be identifiers (line 266)

identifier: UNICODEDNAME | UNDERSCORENAME | ON | CALL | INLINE | STEP ;

Fix: Use parser rule approach:

// Instead of tokens, use a parser rule that matches any keyword as identifier
identifier
    : UNICODEDNAME
    | UNDERSCORENAME
    | keywordAsIdentifier
    ;

keywordAsIdentifier
    : 'on' | 'call' | 'inline' | 'step' | 'else' | 'then' | 'goto' | 'void' | 'struct'
    ;

C. Float Number Rules Simplification

Current Problem: Complex, potentially ambiguous rules (lines 51-54)

FLOAT_NUMBER : FNUMBER (('E'|'e') ('+' | '-')? DEC_INTEGER)? ;
FNUMBER : FDOTNUMBER | FNUMDOTNUMBER ;
FDOTNUMBER : '.' (DEC_DIGIT | '_')+ ;
FNUMDOTNUMBER : DEC_DIGIT (DEC_DIGIT | '_')* FDOTNUMBER? ;

Fix: Simplify:

FLOAT_NUMBER
    : DEC_DIGIT (DEC_DIGIT | '_')* ('.' (DEC_DIGIT | '_')*)?
      (('E'|'e') ('+'|'-')? DEC_INTEGER)?
    | '.' (DEC_DIGIT | '_')+ (('E'|'e') ('+'|'-')? DEC_INTEGER)?
    ;

4. Specific Problem Areas

A. Pointer Dereference Cleanup

Current Problem: Described as "cursed mix" in comment, doesn't handle chaining well (lines 343-352)

Current Grammar:

pointerdereference: (prefix=scoped_identifier '.')? derefchain ('.' field=identifier)? ;
derefchain: singlederef ('.' singlederef)* ;
singlederef: identifier arrayindex? POINTER ;

Improved Structure:

pointerdereference
    : pointerBase ('.' pointerElement)*
    ;

pointerBase
    : scoped_identifier '.'?     // Optional base with dot
    |                           // Or start with dereference
    ;

pointerElement
    : identifier arrayindex? POINTER  // array^^ or just pointer^^
    | identifier                      // field access
    ;

B. Assignment Rule Simplification

Current Problem: Multiple ambiguous alternatives (line 182)

assignment
    : (assign_target '=' expression)
    | (assign_target '=' assignment)      // Chained assignment - problematic
    | (multi_assign_target '=' expression)
    ;

Fix: Remove chained assignment ambiguity:

assignment
    : assign_target '=' expression
    | multi_assign_target '=' expression
    // Remove: assign_target '=' assignment  // Too ambiguous
    ;

C. Module Rule EOL Handling

Current Problem: Complex EOL* patterns everywhere (line 79)

module: EOL* (module_element (EOL+ module_element)*)? EOL* EOF;

Note: This is actually necessary due to comment/EOL interleaving (issue #47), but could be cleaner with a channel-based approach.

5. Missing Syntactic Validations

Add parser-level validations to catch errors early:

// Array type validation
vardecl
    : datatype (arrayindex | EMPTYARRAYSIG)? TAG* identifierlist
    | datatype EMPTYARRAYSIG identifierlist
      { if (!$datatype.text.equals("ubyte") && !$datatype.text.equals("byte")) 
          notifyErrorListeners("Empty array syntax [] only valid for byte/ubyte types"); }
    ;

// Function call validation
functioncall_stmt
    : VOID? scoped_identifier '(' EOL? expression_list? EOL? ')'
    | VOID? scoped_identifier '(' EOL? expression_list? EOL? ')' '=' expression  // ERROR: assignment in function call
      { notifyErrorListeners("Cannot assign to function call. Use separate statement."); }
    ;

6. Implementation Priority

Priority	Issue	Impact	Effort	Files to Modify
High	Expression rule refactoring	Maintainability, error quality	Medium	`Prog8ANTLR.g4`
High	Add error recovery alternatives	User experience	Low	`Prog8ANTLR.g4`
Medium	NOT_IN to parser rule	Robustness	Low	`Prog8ANTLR.g4`
Medium	Statement rule grouping	Maintainability	Low	`Prog8ANTLR.g4`
Medium	Identifier keyword handling	Completeness	Low	`Prog8ANTLR.g4`
Low	Pointer dereference cleanup	Technical debt	Medium	`Prog8ANTLR.g4`
Low	Float number simplification	Maintainability	Low	`Prog8ANTLR.g4`

7. Example: Improved Error Messages

Before (Current)

line 5:8 mismatched input '=' expecting {<EOF>, EOL, ';', ...}
line 10:3 mismatched input 'if' expecting {'{', 'then', ...}
line 15:12 mismatched input 'on' expecting {UNICODEDNAME, ...}

After (With Custom Messages)

line 5:8 syntax error: Cannot chain assignments. Use multiple statements instead.
line 10:3 syntax error: 'if' statement missing 'then' or '{' before condition body
line 15:12 syntax error: Variable 'on' is a keyword. Use a different name or escape it.
line 20:4 syntax error: Empty array syntax [] only valid for byte/ubyte types.

8. Testing Strategy

After implementing these changes:

Regression Testing: Ensure all existing test cases still pass
Error Message Testing: Create test cases for each new error message
Edge Case Testing: Test complex expressions, pointer dereferencing, etc.
Performance Testing: Verify grammar changes don't significantly impact parsing speed

9. Migration Plan

Phase 1: Implement high-priority error recovery improvements
Phase 2: Refactor expression and statement rules
Phase 3: Fix lexer issues (NOT_IN, identifier handling)
Phase 4: Clean up specific problem areas (pointers, assignments)
Phase 5: Add comprehensive error validations

Each phase should include:

Grammar changes
Test updates
Documentation updates
Performance verification

/home/irmen/Projects/prog8/parser/src/main/antlr/Prog8ANTLR.g4 - Main grammar file
/home/irmen/Projects/prog8/compilerAst/src/prog8/parser/Prog8Parser.kt - Error handling
Test files in /home/irmen/Projects/prog8/compiler/test/ - Update for new error messages
Documentation - Update language specification if grammar semantics change

Conclusion

These improvements will significantly enhance the Prog8 parser by:

Providing clearer, more helpful error messages
Making the grammar more maintainable and extensible
Fixing known parsing issues and ambiguities
Improving overall user experience for developers using the language

The changes are designed to be backward-compatible where possible, with careful attention to maintaining existing functionality while improving the parser's robustness and usability.

Appendix: BailErrorStrategy Migration Guide

What is BailErrorStrategy?

BailErrorStrategy is an ANTLR4 error handling strategy that immediately stops parsing when it encounters the first syntax error, rather than attempting to recover and continue parsing.

Key Characteristics:

Fail-Fast: Stops on the first error - no recovery attempts
No Error Recovery: Doesn't try to skip tokens or resynchronize
Throws Exceptions: Immediately throws InputMismatchException or other parse errors
Simple but Limited: Easy to implement but poor user experience

Current Implementation in Prog8:

// File: /home/irmen/Projects/prog8/compilerAst/src/prog8/parser/Prog8Parser.kt
private object Prog8ErrorStrategy: BailErrorStrategy() {
    override fun recover(recognizer: Parser?, e: RecognitionException?) {
        fillIn(e, recognizer!!.context)
        reportError(recognizer, e)
    }
    
    override fun recoverInline(recognizer: Parser?): Token {
        val e = InputMismatchException(recognizer)
        fillIn(e, recognizer!!.context)
        reportError(recognizer, e)
        throw e
    }
}

Problems with BailErrorStrategy:

Single Error Only: Users only see the first syntax error, not all issues
Poor IDE Integration: IDEs can't highlight multiple errors simultaneously
Frustrating Workflow: Fix one error, recompile, find next error
Limited Context: No information about what might be expected

Migration Plan: Replace BailErrorStrategy

1. Replace Error Strategy

Current:

parser.errorHandler = Prog8ErrorStrategy     // BailErrorStrategy

New:

parser.errorHandler = DefaultErrorStrategy()  // Built-in recovery strategy

2. Implement Custom Error Listener

Create new error listener:

private class Prog8ErrorListener(val src: SourceCode): BaseErrorListener() {
    private val errors = mutableListOf<ParseError>()
    
    override fun syntaxError(recognizer: Recognizer<*, *>?, 
                            offendingSymbol: Any?, 
                            line: Int, 
                            charPositionInLine: Int, 
                            msg: String, 
                            e: RecognitionException?) {
        // Collect errors instead of throwing immediately
        val error = ParseError(msg, Position(src.origin, line, charPositionInLine+1, charPositionInLine+1), e ?: RuntimeException("parse error"))
        errors.add(error)
    }
    
    fun getErrors(): List<ParseError> = errors.toList()
    fun hasErrors(): Boolean = errors.isNotEmpty()
}

3. Update Parser Setup

Current:

fun parseModule(src: SourceCode): Module {
    val antlrErrorListener = AntlrErrorListener(src)
    val lexer = Prog8ANTLRLexer(CharStreams.fromString(src.text, src.origin))
    lexer.removeErrorListeners()
    lexer.addErrorListener(antlrErrorListener)
    val tokens = CommonTokenStream(lexer)
    val parser = Prog8ANTLRParser(tokens)
    parser.errorHandler = Prog8ErrorStrategy     // BailErrorStrategy
    parser.removeErrorListeners()
    parser.addErrorListener(antlrErrorListener)
    
    val parseTree = parser.module()
    // ... visitor pattern
}

New:

fun parseModule(src: SourceCode): Module {
    val errorListener = Prog8ErrorListener(src)
    val lexer = Prog8ANTLRLexer(CharStreams.fromString(src.text, src.origin))
    lexer.removeErrorListeners()
    lexer.addErrorListener(errorListener)
    val tokens = CommonTokenStream(lexer)
    val parser = Prog8ANTLRParser(tokens)
    parser.errorHandler = DefaultErrorStrategy()  // Recovery strategy
    parser.removeErrorListeners()
    parser.addErrorListener(errorListener)
    
    val parseTree = parser.module()
    
    // Check for errors after parsing
    if (errorListener.hasErrors()) {
        throw MultipleParseErrors(errorListener.getErrors())
    }
    
    // ... visitor pattern
}

4. Create Multiple Errors Exception

class MultipleParseErrors(val errors: List<ParseError>) : Exception() {
    override val message: String
        get() = "Found ${errors.size} parse errors:\n" + 
                errors.joinToString("\n") { "${it.position}: ${it.message}" }
}

5. Update Compiler Error Handling

Current in Compiler.kt:

} catch (px: ParseError) {
    args.errors.printSingleError("${px.position.toClickableStr()} parse error: ${px.message}".trim())
}

New:

} catch (mpe: MultipleParseErrors) {
    // Report all parse errors
    mpe.errors.forEach { error ->
        args.errors.printSingleError("${error.position.toClickableStr()} parse error: ${error.message}".trim())
    }
} catch (px: ParseError) {
    // Fallback for single errors
    args.errors.printSingleError("${px.position.toClickableStr()} parse error: ${px.message}".trim())
}

Grammar Changes for Better Error Recovery

1. Add Error Recovery Alternatives

// Statement with error recovery
statement
    : directive
    | ongoto
    | variabledeclaration
    | structdeclaration
    | assignment
    | augassignment
    | unconditionaljump
    | postincrdecr
    | functioncall_stmt
    | if_stmt
    | branch_stmt
    | subroutinedeclaration
    | inlineasm
    | returnstmt
    | forloop
    | whileloop
    | untilloop
    | repeatloop
    | unrollloop
    | whenstmt
    | breakstmt
    | continuestmt
    | labeldef
    | defer
    | alias
    // Error recovery alternatives
    | 'if' expression error=statement
      { notifyErrorListeners("Expected 'then' or '{' after if condition"); }
    | 'for' identifier 'in' error=statement
      { notifyErrorListeners("Expected expression after 'in'"); }
    | 'while' error=statement
      { notifyErrorListeners("Expected condition after 'while'"); }
    | 'return' error=expression
      { notifyErrorListeners("Invalid return expression"); }
    ;

2. Add Synchronization Points

// Block with synchronization
block: identifier integerliteral? EOL? '{' EOL? (block_statement | EOL)* '}' 
    | identifier integerliteral? EOL? '{' error=EOL? (block_statement | EOL)* '}'
      { notifyErrorListeners("Error in block: " + $error.text); }
    ;

// Module with synchronization
module: EOL* (module_element (EOL+ module_element)*)? EOL* EOF
      | EOL* module_element error=EOL+ module_element* EOL* EOF
      { notifyErrorListeners("Error between module elements"); }
      ;

3. Add Error Tokens

// Add to lexer
ERROR_TOKEN: . -> skip ;  // Skip unknown tokens

Benefits of Migration

1. Multiple Error Reporting

Users see all syntax errors at once
Better IDE integration with multiple error highlights
More efficient development workflow

2. Better Error Context

ANTLR's recovery provides context about expected tokens
Can suggest alternatives based on grammar
More precise error locations

3. Improved User Experience

Less frustrating compilation process
Better error messages with context
Ability to fix multiple issues in one iteration

Implementation Steps

Phase 1: Basic Migration

Replace BailErrorStrategy with DefaultErrorStrategy
Update error listener to collect instead of throw
Create MultipleParseErrors exception
Update compiler error handling

Phase 2: Grammar Improvements

Add error recovery alternatives to key rules
Add synchronization points for better recovery
Improve error messages with context

Phase 3: Advanced Features

Add error suggestion logic
Implement custom recovery strategies for specific patterns
Add error severity levels (warning vs error)

Testing Strategy

1. Create Test Cases with Multiple Errors

test("multiple parse errors") {
    val src = """
        sub main() {
            x = 1 +    // Missing right operand
            if x > 0   // Missing then/block
            y =        // Missing expression
        }
    """
    // Should report all 3 errors, not just the first
}

2. Test Error Recovery

test("error recovery continues parsing") {
    val src = """
        sub bad() { x = 1 + }
        sub good() { return 42 }
    """
    // Should parse both subroutines and report error in first
}

3. Test Synchronization

test("block synchronization") {
    val src = """
        block1 {
            x = 1 +    // Error in block1
        }
        block2 {      // Should still parse block2
            y = 2
        }
    """
    // Should recover and parse block2 correctly
}

Potential Challenges

1. Cascading Errors

One syntax error might cause multiple subsequent errors
Need to filter or prioritize errors intelligently

2. Performance Impact

Error recovery has overhead
Need to benchmark parsing performance

3. False Positives

Recovery might parse invalid constructs
Need to validate AST after parsing

Summary

Migrating from BailErrorStrategy to a recovery-based approach will:

Improve User Experience: Show all errors at once instead of one-by-one
Better IDE Integration: Enable multiple error highlights
Provide Context: Better error messages with expected tokens
Maintain Robustness: Still catch all errors, just report them differently

The migration requires changes to:

Error handling strategy in parser setup
Error listener implementation
Compiler error reporting
Grammar for better recovery
Test cases for multiple errors

This change will significantly improve the development experience for Prog8 programmers while maintaining the compiler's accuracy and robustness.

21 KiB Raw Blame History

ANTLR4 Grammar Improvements for Prog8

Overview

Current Issues Summary

1. Error Recovery & Reporting Improvements

Current Issues

Recommended Improvements

Add Error Alternatives with Custom Messages

Add Error Recovery Rules

Improve Error Messages for Common Mistakes

2. Rule Optimization

A. Expression Rule Refactoring

B. Statement Rule Grouping

C. Directive Rule Simplification

3. Lexer Improvements

A. NOT_IN Token Fix

B. Identifier Rule Improvement

C. Float Number Rules Simplification

4. Specific Problem Areas

A. Pointer Dereference Cleanup

B. Assignment Rule Simplification

C. Module Rule EOL Handling

5. Missing Syntactic Validations

6. Implementation Priority

7. Example: Improved Error Messages

Before (Current)

After (With Custom Messages)

8. Testing Strategy

9. Migration Plan

10. Related Files to Update

Conclusion

Appendix: BailErrorStrategy Migration Guide

What is BailErrorStrategy?

Key Characteristics:

Current Implementation in Prog8:

Problems with BailErrorStrategy:

Migration Plan: Replace BailErrorStrategy

1. Replace Error Strategy

2. Implement Custom Error Listener

3. Update Parser Setup

4. Create Multiple Errors Exception

5. Update Compiler Error Handling

Grammar Changes for Better Error Recovery

1. Add Error Recovery Alternatives

2. Add Synchronization Points

3. Add Error Tokens

Benefits of Migration

1. Multiple Error Reporting

2. Better Error Context

3. Improved User Experience

Implementation Steps

Phase 1: Basic Migration

Phase 2: Grammar Improvements

Phase 3: Advanced Features

Testing Strategy

1. Create Test Cases with Multiple Errors

2. Test Error Recovery

3. Test Synchronization

Potential Challenges

1. Cascading Errors

2. Performance Impact

3. False Positives

Summary

21 KiB

Raw Blame History