fuj-management/docs/plans/2026-05-05-2204-go-rewrite-m2-1-czech-normalize.md
Jan Novak d9a61b338c
feat(go/M2.1): port czech.Normalize — NFKD + Mn strip + lowercase
Adds internal/domain/czech.Normalize, the first pure-domain function in
the Go rewrite (M2 milestone). Matches Python czech_utils.normalize byte-
for-byte: NFKD decompose via golang.org/x/text/unicode/norm, drop Mn-
category combining marks (unicode.Mn, not IsMark, to match Python's
unicodedata.combining() semantics), then strings.ToLower.

Includes 13-case table-driven test; all inputs spot-checked against the
Python implementation before locking. Adds golang.org/x/text v0.36.0 as
first external dependency.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-05 22:23:40 +02:00


Plan: Go rewrite — M2.1 domain/czech.Normalize

Context

The Go rewrite finished M1 (skeleton, tooling, hello server) in commit cf0f176 on 2026-05-04. The next milestone, M2 (pure-domain helpers), is the current one according to the progress tracker, but no work has landed yet: all 12 sub-tasks are unchecked.

This plan covers only the first M2 task: porting Python's normalize from scripts/czech_utils.py to Go as internal/domain/czech.Normalize. It is the lowest-level helper in the domain — parse_month_references, _build_name_variants, match_members, exception keys, and reconcile all transitively depend on it. Getting it byte-equivalent first removes a class of "why does my match not fire" failures from every later M2 task.

Decision (confirmed in plan-mode Q): start with hand-written Go unit tests for fresh Czech edge cases. Defer parity-fixture wiring until M3.1/M3.2 land (separate task); add the parity test for Normalize retroactively at that point.

Scope

  • New package go/internal/domain/czech/ with Normalize and unit tests.
  • Add golang.org/x/text dependency to go/go.mod (currently zero deps).
  • Out of scope: ParseMonthReferences (M2.2), fixture tooling (M3.1/M3.2), CLI subcommand wiring (M2.11/M2.12), parity test runner.

Python contract to match

import unicodedata

def normalize(text: str) -> str:
    nfkd = unicodedata.normalize("NFKD", text)
    return "".join(c for c in nfkd if not unicodedata.combining(c)).lower()

Three semantic operations:

  1. NFKD decompose
  2. Drop characters where unicodedata.combining(c) is non-zero
  3. Lowercase
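The three operations can be traced one at a time to see what each contributes. A minimal Python sketch, assuming only the standard library; trace_normalize is a hypothetical helper for illustration and is not part of czech_utils:

```python
import unicodedata


def trace_normalize(text: str) -> tuple[str, str, str]:
    """Return the intermediate result after each of the three steps."""
    # Step 1: NFKD splits precomposed letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Step 2: drop every code point with a non-zero canonical combining class.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Step 3: lowercase what is left.
    lowered = stripped.lower()
    return decomposed, stripped, lowered


# "Růžena" written with escapes so the precomposed form is unambiguous.
decomposed, stripped, lowered = trace_normalize("R\u016f\u017eena")
# "ů" and "ž" each become two code points, so 6 characters grow to 8.
print(len(decomposed), stripped, lowered)  # 8 Ruzena ruzena
```

Lowercasing happens last, so step 2 still sees the original capital "R"; the order does not matter for Czech input, but it is what the contract specifies.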

Go implementation

go/internal/domain/czech/normalize.go:

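package czech

import (
    "strings"
    "unicode"

    "golang.org/x/text/unicode/norm"
)

// Normalize returns s NFKD-decomposed, with nonspacing marks removed, and
// lowercased. It is the Go port of Python's czech_utils.normalize.
func Normalize(s string) string {
    decomposed := norm.NFKD.String(s)
    var b strings.Builder
    b.Grow(len(decomposed))
    for _, r := range decomposed {
        // Mn only, not unicode.IsMark (Mn+Mc+Me): mirrors Python's
        // unicodedata.combining() filter for the marks Czech text produces.
        if unicode.In(r, unicode.Mn) {
            continue
        }
        b.WriteRune(r)
    }
    return strings.ToLower(b.String())
}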

Two precision points worth flagging:

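  1. unicode.Mn, not unicode.IsMark. The plan's library-choices table mentions unicode.IsMark, but that covers Mn + Mc + Me. Python's unicodedata.combining() returns the canonical combining class, which is 0 for Me and for nearly all Mc characters, so on Czech input it filters exactly the Mn marks that NFKD produces (caron U+030C, acute U+0301, ring U+030A, all class 230). The two predicates are not identical across all of Unicode (some Mn code points have class 0, and a few Mc code points have a non-zero class), so the byte-equivalence claim is scoped to the Latin/Czech text this project handles. Use unicode.In(r, unicode.Mn) and cite this in a one-line code comment; it's the kind of thing a future reader will second-guess.
  2. strings.ToLower vs Go's locale-aware tools. After decomposition and mark stripping, Czech text consists of plain ASCII letters, so Python's .lower() reduces to ASCII lowercasing. Stdlib strings.ToLower produces the same result; do not pull in golang.org/x/text/cases.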
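Both precision points can be checked from Python's side. The sketch below uses only the standard library, and its names (CZECH_MARKS, mark_profile) are illustrative. It compares general category with canonical combining class for the three marks Czech letters decompose into. It also includes one non-Czech code point where "category Mn" and "combining() non-zero" disagree, which is why the equivalence is best read as scoped to Czech input:

```python
import unicodedata

# Combining marks that Czech letters decompose into under NFKD.
CZECH_MARKS = {
    "caron": "\u030c",  # as in č, ě, ř, š, ž
    "acute": "\u0301",  # as in á, é, í, ó, ú, ý
    "ring": "\u030a",   # as in ů
}


def mark_profile(ch: str) -> tuple[str, int]:
    """Return (general category, canonical combining class) for one code point."""
    return unicodedata.category(ch), unicodedata.combining(ch)


czech_profiles = {name: mark_profile(ch) for name, ch in CZECH_MARKS.items()}

# Outside Czech the two predicates can diverge: U+0941 (a Devanagari vowel
# sign) is category Mn but has combining class 0, so Python's filter keeps it
# while a category-Mn filter drops it.
devanagari_profile = mark_profile("\u0941")

# Point 2: once marks are stripped, Czech letters are plain ASCII, so
# lowercasing needs no locale-aware rules. Input is "ČENĚK KAČER".
stripped = "".join(
    c
    for c in unicodedata.normalize("NFKD", "\u010cEN\u011aK KA\u010cER")
    if not unicodedata.combining(c)
)

print(czech_profiles)
print(devanagari_profile)
print(stripped, stripped.isascii(), stripped.lower())
```

If the project ever needs to normalize non-Latin names, this divergence is the place to revisit the choice of predicate.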

Tests

go/internal/domain/czech/normalize_test.go — table-driven, covers:

  • ASCII passthrough: "Honza" → "honza"
  • Czech lowercase diacritics: "žluťoučký" → "zlutoucky"
  • Mixed case + diacritics: "Příliš" → "prilis"
  • Czech caron + ring: "Dvořák" → "dvorak", "Růžena" → "ruzena"
  • Hard letters: "Čeněk" → "cenek", "Kačer" → "kacer"
  • Empty string: "" → ""
  • Already-normalized: "prilis" → "prilis" (idempotence)
  • Pre-composed vs decomposed input both produce the same output (NFC "é", U+00E9, and NFD "e" + U+0301 both → "e")
  • Whitespace preserved: "Jan Novák" → "jan novak"
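Two of the cases above are properties rather than single examples: idempotence and composed/decomposed equivalence. A short Python sketch of both follows. It uses a local copy of the contract function so the block runs on its own (in the repo the function would be imported from czech_utils); SAMPLES and the two helper names are illustrative:

```python
import unicodedata


def normalize(text: str) -> str:
    # Local copy of the Python contract, repeated so the sketch is self-contained.
    nfkd = unicodedata.normalize("NFKD", text)
    return "".join(c for c in nfkd if not unicodedata.combining(c)).lower()


# Honza, Dvořák, Růžena, Jan Novák, and the empty string (escapes keep the
# precomposed spelling unambiguous).
SAMPLES = ["Honza", "Dvo\u0159\u00e1k", "R\u016f\u017eena", "Jan Nov\u00e1k", ""]


def is_idempotent(text: str) -> bool:
    """Normalizing twice gives the same result as normalizing once."""
    once = normalize(text)
    return normalize(once) == once


def composed_and_decomposed_agree(text: str) -> bool:
    """The NFC and NFD spellings of the same text normalize identically."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return normalize(nfc) == normalize(nfd)


results = {s: (is_idempotent(s), composed_and_decomposed_agree(s)) for s in SAMPLES}
print(results)
```

The same two checks translate directly into extra rows or a small helper loop in the Go table-driven test.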

Run a one-shot cross-check against the live Python implementation for each test input before locking the table:

PYTHONPATH=scripts:. python -c \
  'from czech_utils import normalize; print(repr(normalize("Dvořák")))'

This is the manual stand-in for the M3 parity fixtures.
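Running the one-liner once per input gets tedious for a whole table. One possible batch variant is sketched below in Python, on the assumption that the inputs mirror the Go test table. The normalize defined here is a stand-in copy of the contract so the block runs in isolation. INPUTS and the JSON output format are illustrative choices, not existing tooling:

```python
import json
import unicodedata


def normalize(text: str) -> str:
    # Stand-in copy of the contract; in the repo, import it from czech_utils.
    nfkd = unicodedata.normalize("NFKD", text)
    return "".join(c for c in nfkd if not unicodedata.combining(c)).lower()


# Inputs taken from the test list in this plan.
INPUTS = [
    "Honza",
    "žluťoučký",
    "Příliš",
    "Dvořák",
    "Růžena",
    "Čeněk",
    "Kačer",
    "",
    "prilis",
    "Jan Novák",
]

expected = {text: normalize(text) for text in INPUTS}

# ensure_ascii=True escapes non-ASCII keys, so whether an input was typed in
# composed or decomposed form stays visible in the output.
print(json.dumps(expected, indent=2, ensure_ascii=True))
```

The printed mapping can be compared by eye against the want column of the Go table before the test file is committed.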

Wire-up

  • go get golang.org/x/text@latest (run from go/); go mod tidy.
  • No CLI changes — cmd/fuj already stubs fees/reconcile with exit code 2; no need to touch dispatcher for this task. Normalize is consumed by other domain code, not by users directly.

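Critical files

  • go/internal/domain/czech/normalize.go (new)
  • go/internal/domain/czech/normalize_test.go (new)
  • go/go.mod (gains golang.org/x/text)
  • scripts/czech_utils.py (reference only, unchanged)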

Verification

End-to-end checks before marking M2.1 done:

  1. cd go && go build ./... — clean compile.
  2. cd go && go test ./internal/domain/czech/... — all table cases green.
  3. cd go && go test -race ./... — race-clean.
  4. cd go && golangci-lint run (or make go-lint from repo root) — clean.
  5. Spot parity (manual, will be automated in M3): for each Go test input, run the Python normalize via PYTHONPATH=scripts:. python -c '...' and confirm bytes match. Capture the diff in the commit message if anything surprises.
  6. make go-build && make go-test && make go-lint from repo root — proves the existing M1 gate still passes.

Branching & follow-up

Per CLAUDE.md, this is feature work → branch + Gitea MR:

  • Branch: feat/m2-1-czech-normalize off main.
  • Single commit, Co-Authored-By trailer.
  • Push with -u, print compare URL https://gitea.home.hrajfrisbee.cz/kacerr/fuj-management/compare/main...feat/m2-1-czech-normalize
  • User opens/merges the MR.
  • After merge: tick M2.1 in the progress tracker with the commit SHA; add a one-line CHANGELOG entry; record any porting surprise in the tracker's "Notes & decisions" section (e.g. the Mn-vs-IsMark precision point if it bears noting).

Next task after this lands is M2.2 ParseMonthReferences — the larger, edge-case-heavier sibling. Whether to start it before or after M3.1/M3.2 is a separate decision the user can make then.