Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/base32: Add RFC-compliant error handling and improve reliability #4641

Merged
merged 19 commits into from
Jan 4, 2025

Conversation

zolk3ri
Copy link
Contributor

@zolk3ri zolk3ri commented Dec 31, 2024

The current implementation of base32.decode() uses assertions for (incomplete) validation, which causes programs to crash on invalid input. This PR replaces assertions with proper error handling according to RFC 4648, allowing users of the package to handle invalid input gracefully.

Motivation: While implementing a pure Odin implementation of TOTP (Time-based One-Time Password), I discovered that base32.decode() would segfault on invalid input due to assertions. As a core library package, it should provide proper error handling to allow users to gracefully handle decode errors, which these changes enable.

Problem Description

  • Programs crash with segfaults on invalid base32 input due to assertions
  • Explicit disabling of bounds checking (#no_bounds_check) prevents proper error handling
  • Buffer allocation issues can cause out-of-bounds access
  • Non-compliance with RFC 4648 padding rules

The original implementation prioritized performance by disabling bounds checking, but this tradeoff is not appropriate for a core library package. The small or potentially negligible performance overhead from bounds checking is worth the improved safety and usability.

Implementation Details

The error enum:

Error :: enum {
  None,
  Invalid_Character, // Input contains characters outside the specified alphabet
  Invalid_Length,    // Input length is not valid for base32 (must be a multiple of 8 with proper padding)
  Malformed_Input,   // Input has improper structure (wrong padding position or incomplete groups)
}

Before (crashes on invalid input):

data := decode(input)
if data == nil { // Only checks for empty input
  return nil
}
defer delete(data)

After (error handling available):

data, err := decode(input)
if err != .None {
  #partial switch err {
    case .Invalid_Character:
    // Handle error where input contains characters not in base32 alphabet
    case .Malformed_Input:
    // Handle error where invalid padding structure or position
    case .Invalid_Length:
    // Handle error where input length is not valid for base32
  }
}
defer delete(data)

Key implementation changes:

  • Function now returns ([]byte, Error) instead of just []byte
  • Removed #no_bounds_check directive for safety
  • Proper buffer size calculation
  • Added bounds validation throughout
  • Proper padding validation
  • RFC 4648 compliant error handling
  • Support for custom alphabets with custom validation

Design Decisions

  1. Bounds Checking:

    • Original code used #no_bounds_check for performance
    • Removed this directive despite small (if any) performance impact
    • Proper error handling requires bounds validation
    • Safety and usability outweigh minor performance cost
    • Follows Odin's philosophy of being explicit and safe by default
  2. Error Handling:

    • Simple enum with distinct error cases
    • Clear separation between character, length, and structure errors
    • RFC 4648 compliant validation
    • Empty input returns (nil, .None) by convention
  3. Testing:

    • RFC 4648 test vectors
    • Tests all error conditions
    • Verifies correct handling of edge cases

Breaking Changes

  1. Function Signature:

    • decode() now returns ([]byte, Error) instead of []byte
    • Error handling available but not required
  2. Performance:

    • Minor overhead from bounds checking
    • No additional allocations for error handling
  3. Error Handling:

    • Programs can handle errors if desired
    • No more unexpected crashes on invalid input

Testing

Test coverage includes:

  • RFC 4648 test vectors and validation rules
  • Character validation (invalid characters and case sensitivity)
  • Padding validation (placement and length requirements)
  • Length validation
  • Roundtrip encoding/decoding
  • Custom alphabet support

All tests pass with 100% coverage of the new functionality and error cases.

Future Work

  • Add similar error handling improvements to encoding/base64 package
  • Consider adding error handling to encode() for input validation

Thank you for considering this contribution.

Happy New Year! :)

Replace assertions with proper error handling in base32.decode() to allow
programs to handle invalid input gracefully rather than crashing.

The function now returns ([]byte, Error) instead of just []byte.
Fix buffer allocation size calculation and add proper bounds checking to
ensure output buffer has sufficient space. This fixes crashes that could
occur with inputs like "AA" and other edge cases where the output buffer
was too small.

Remove #no_bounds_check as proper bounds checking is necessary for safe
error handling. The small performance trade-off is worth the improved
robustness.
Add test suite based on RFC 4648 test vectors and validation rules:
- Add section 10 test vectors for valid encoding/decoding
- Add test cases for invalid character handling (section 3.2)
- Add test cases for padding validation (section 4)
- Add test cases for length requirements (section 6)

The test vectors verify that:
- Empty string encodes/decodes correctly
- Standard cases like "foo" -> "MZXW6===" work
- Invalid characters are rejected
- Missing or malformed padding is detected
- Invalid lengths are caught
Rework base32.decode() to properly handle all cases per RFC 4648:

- Fix error detection order:
  - Check minimum length first (Invalid_Length)
  - Check character validity (Invalid_Character)
  - Check padding and structure (Malformed_Input)

- Fix padding validation:
  - Add required padding length checks (2=6, 4=4, 5=3, 7=1 chars)
  - Ensure padding only appears at end
  - Fix handling of unpadded inputs

- Fix buffer handling:
  - Proper output buffer size calculation
  - Add bounds checking for buffer access
  - Add proper buffer validation

For example:
- "M" correctly returns Invalid_Length (too short)
- "mzxq====" correctly returns Invalid_Character (lowercase)
- "MZXQ=" correctly returns Malformed_Input (wrong padding)
- Unpadded input lengths must be multiples of 8

These changes make the decode function fully compliant with RFC 4648
requirements while providing proper error handling.
Fix memory handling throughout base32 package:
- Make padding map package-level constant (to avoid repeated allocs)
- Use passed allocator in encode's make() call
- Add defer delete for allocated memory in encode
- Add proper cleanup in test cases
- Fix memory cleanup of output buffers

The changes ensure consistent allocator usage and cleanup in both
implementation and tests.
Replace package-level map with a simple switch statement for padding
validation. This eliminates allocations we can't properly free while
maintaining the same validation logic.
Add support for custom alphabet validation through an optional validation
function parameter. The default validation follows RFC 4648 base32
alphabet rules (A-Z, 2-7).

This properly supports the documented ability to use custom alphabets.
Fix encoding to properly use provided encoding table parameter instead of
hardcoded `ENC_TABLE`.

This makes encode properly support custom alphabets as documented.
The decoding table was only 224 bytes which caused type mismatches when
using custom alphabets, so expand with zeroes to cover full byte range
while maintaining the same decoding logic.
Move existing test procedures to a dedicated test file for better
code organization and maintainability.
Add test_base32_roundtrip() to verify the encode->decode roundtrip
preserves data integrity. This test helps ensure our base32 implementation
correctly handles the full encode->decode cycle without data loss or
corruption.
Remove premature deallocation of the output buffer which was causing
use-after-free behavior. The returned string needs to take ownership
of this memory, but the defer delete was freeing it before the
string could be used. This fixes issues with encoding that were
introduced by overly aggressive memory cleanup in 93238db.
Add test case to verify custom alphabet support. The test uses a
decimal-uppercase alphabet (0-9, A-V) to test both encoding and decoding
with custom tables, including validation. This ensures the encode and
decode functions work correctly with custom encoding tables and
validation functions as documented.
Add defer delete for encoded strings across all test procedures to ensure
proper cleanup and prevent memory leaks. This completes the memory
management improvements started in 591dd87.
@zolk3ri zolk3ri force-pushed the fix/base32-error-handling branch from 60f794a to d6f4412 Compare December 31, 2024 18:00
@zolk3ri zolk3ri marked this pull request as draft December 31, 2024 18:49
Fix incorrect RFC 4648 section references:
- Add RFC URL reference at package level
- Update Error enum documentation to reference correct sections:
  - Invalid_Character: Section 3.3 (non-alphabet characters)
  - Invalid_Length: Section 6 (base32 block size requirements)
  - Malformed_Input: Section 3.2 (padding)
- Fix test file section references to match correct sections

This ensures all RFC references are accurate and adds a link to the
source RFC for reference.
@zolk3ri zolk3ri marked this pull request as ready for review December 31, 2024 22:48
@bullyingteen
Copy link

encoding/base64 needs the same treatment I guess?
See base64.decode_into

@zolk3ri
Copy link
Contributor Author

zolk3ri commented Jan 1, 2025

encoding/base64 needs the same treatment I guess? See base64.decode_into

Yup! That is why I mentioned it under "Future Work". If this happens to get merged, I will work on encoding/base64 as well.

@bullyingteen
Copy link

encoding/base64 needs the same treatment I guess? See base64.decode_into

Yup! That is why I mentioned it under "Future Work". If this happens to get merged, I will work on encoding/base64 as well.

Missed it ngl 🙂

Add `@(rodata)` attribute to `ENC_TABLE` and `DEC_TABLE` to mark
them as read-only data. This places these tables in the read-only
section of the executable, protecting them from modification
during program execution.
// Validate padding and length
data_len := len(data)
padding_count := 0
for i := data_len - 1; i >= 0; i -= 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#no_bounds_check can be used here

// Check for proper padding and length combinations
if padding_count > 0 {
// Verify no padding in the middle
for i := 0; i < data_len - padding_count; i += 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#no_bounds_check can be used here

}

@private
_encode :: proc(out, data: []byte, ENC_TBL := ENC_TABLE, allocator := context.allocator) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can #no_bounds_check can be used for this proc?

Copy link
Member

@Kelimion Kelimion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor nitpicks. Looks good to me otherwise. Also thanks for adding a test suite for the package.

core/encoding/base32/base32.odin Show resolved Hide resolved
content_len := data_len - padding_count
mod8 := content_len % 8
required_padding: int
switch mod8 {
Copy link
Member

@Kelimion Kelimion Jan 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this works fine, but I might be a bit more explicit about the values here:

// Remainders of 2, 4, 5, and 7 need to be padded to 8 bytes.
switch mod8 {
case 2, 4, 5, 7: required_padding = 8 - mod8
case: required_padding = 0
}

Alternatively, you could use a LUT:

// Remainders of 2, 4, 5, and 7 need to be padded to 8 bytes.
PAD_LENGTH := [8]int{0, 0, 6, 0, 4, 3, 0, 1}
required_padding := PAD_LENGTH[mod8]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Should we not keep the explicit mapping as it directly documents the relationship between character count and required padding (e.g., "2 chars need 6 padding chars"). Using 8 - mod8 or a LUT would obscure this relationship by hiding the actual padding requirements behind a formula or indirect mapping. The current form makes the RFC requirements more apparent.

Copy link
Member

@Kelimion Kelimion Jan 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't read this particular RFC, so I'll defer to you on that. But to me at least, 6, 4, 3 and 1 felt like magic numbers, even though they were clearly derived from 8 - mod8, so the 8 - mod8 felt more explicit than "these are the magic numbers". If you follow my train of thought.

But that could be added to the comments instead?

// Requirement is for 2, 4, 5 and 7 remaining characters to be padded to 8. So the values here are `8 - mod8`.
switch mod8 {
case 2: required_padding = 6 // 2 chars need 6 padding chars
case 4: required_padding = 4 // 4 chars need 4 padding chars
case 5: required_padding = 3 // 5 chars need 3 padding chars
case 7: required_padding = 1 // 7 chars need 1 padding char
case:   required_padding = 0
}

@Kelimion
Copy link
Member

Kelimion commented Jan 4, 2025

Motivation: While implementing a pure Odin implementation of TOTP (Time-based One-Time Password), I discovered that base32.decode() would segfault on invalid input due to assertions.

I've been considering writing a TOTP implementation for core, as it fits in nicely with core:net and core:crypto, and the work underway by Laytan to support http(s). So if you wanted to contribute TOTP, I'd certainly be willing to give it a fair shake.

@zolk3ri
Copy link
Contributor Author

zolk3ri commented Jan 4, 2025

Motivation: While implementing a pure Odin implementation of TOTP (Time-based One-Time Password), I discovered that base32.decode() would segfault on invalid input due to assertions.

I've been considering writing a TOTP implementation for core, as it fits in nicely with core:net and core:crypto, and the work underway by Laytan to support http(s). So if you wanted to contribute TOTP, I'd certainly be willing to give it a fair shake.

I would be interested in contributing my TOTP implementation to core. My current implementation (which is currently a standalone library) is available at totp.odin. As noted in the README, this was my first Odin code and first experience with the language, so there might be areas that could use improvement.

I haven't yet pushed the changes that utilize the improved error handling for the encoding/base32 package from my PR. Initially, I planned to add base32 validation in my TOTP library, but I thought it would be more beneficial to improve the encoding/base32 package directly instead.

As a side note, I noticed there are many uses of assert() throughout core (some likely valid). If you're open to exploring more robust error handling approaches (allowing users to handle errors gracefully), I would be happy to help improve packages as needed, in collaboration with others.

I've also been working on a minimal libcurl wrapper, available at odin-curl. The goal is to provide libcurl functionality with idiomatic Odin features where appropriate, rather than just a direct FFI binding.

I'm happy to collaborate on Discord (I've joined but haven't been very active), though I should mention that I may be away for a couple of months due to health reasons. If the TOTP work can't wait, feel free to use what you need from my code - it's completely fine. :)

@Kelimion
Copy link
Member

Kelimion commented Jan 4, 2025

I would be interested in contributing my TOTP implementation to core. My current implementation is available at totp.odin. As noted in the README, this was my first Odin code and first experience with the language, so there might be areas that could use improvement.

Great. I'll give it a read soon.

I haven't yet pushed the changes that utilize the improved error handling for the encoding/base32 package from my PR. Initially, I planned to add base32 validation in my TOTP library, but I thought it would be more beneficial to improve the encoding/base32 package directly instead.

Indeed.

As a side note, I noticed there are many uses of assert() throughout core (some likely valid). If you're open to exploring more robust error handling approaches (allowing users to handle errors gracefully), I would be happy to help improve packages as needed, in collaboration with others.

Many of those are intentional, as they deal with programmer/user error. If a certain state is only possible because of misuse of an API, not untrusted input, then an assert is sensible. It's the kind of feedback a programmer can't ignore and will helpfully tell them they've neglected to pass an appropriately sized buffer, for example. So the rule of thumb we've settled on for newer implementations is to have asserts for "contract violations", and errors for well, errors.

I've also been working on a minimal libcurl wrapper, available at odin-curl. The goal is to provide libcurl functionality with idiomatic Odin features where appropriate, rather than just a direct FFI binding.

We'll have to have a think about whether that fits. As a binding it would have to live in vendor, not core, but it may be better situated in your own repository and added to Mjolnir's package index and other places for people to find it.

I'm happy to collaborate on Discord (I've joined but haven't been very active), though I should mention that I may be away for a couple of months due to health reasons. If the TOTP work can't wait, feel free to use what you need from my code - it's completely fine. :)

At your leisure. Health comes first.

@zolk3ri
Copy link
Contributor Author

zolk3ri commented Jan 4, 2025

Many of those are intentional, as they deal with programmer/user error. If a certain state is only possible because of misuse of an API, not untrusted input, then an assert is sensible. It's the kind of feedback a programmer can't ignore and will helpfully tell them they've neglected to pass an appropriately sized buffer, for example. So the rule of thumb we've settled on for newer implementations is to have asserts for "contract violations", and errors for well, errors.

Exactly, that's what I was thinking as well. :)

We'll have to have a think about whether that fits. As a binding it would have to live in vendor, not core, but it may be better situated in your own repository and added to Mjolnir's package index and other places for people to find it.

Yeah, I know, it belongs to vendor, definitely not core. When you have the time, I'd appreciate it if you could let me know if it is suitable for inclusion in vendor.

@Kelimion Kelimion merged commit 397e371 into odin-lang:master Jan 4, 2025
7 checks passed
@zolk3ri zolk3ri deleted the fix/base32-error-handling branch January 4, 2025 19:29
@zolk3ri zolk3ri restored the fix/base32-error-handling branch January 4, 2025 19:29
@zolk3ri
Copy link
Contributor Author

zolk3ri commented Jan 4, 2025

Thank you a lot! :)

@Kelimion
Copy link
Member

Kelimion commented Jan 4, 2025

Thank you a lot! :)

Thank you for the package's cleanup. :-)

Don't mind the CI error. That spurious one occasionally happens on the RISCV target, and I don't really feel like special casing the test to ignore it on that target.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants