diff --git a/docs/plans/2026-05-05-2337-go-rewrite-m2-2-parse-month-references.md b/docs/plans/2026-05-05-2337-go-rewrite-m2-2-parse-month-references.md
new file mode 100644
index 0000000..dc6219b
--- /dev/null
+++ b/docs/plans/2026-05-05-2337-go-rewrite-m2-2-parse-month-references.md
@@ -0,0 +1,205 @@
+# Plan: Go rewrite — M2.2 `domain/czech.ParseMonthReferences`
+
+## Context
+
+M2.1 (`domain/czech.Normalize`) merged via PR #4 (`d9a61b3`) on
+2026-05-05. Per the [progress tracker](2026-05-03-2349-go-backend-rewrite-progress.md),
+**M2.2** is next: port `parse_month_references` from
+[scripts/czech_utils.py](../../scripts/czech_utils.py) to Go as
+`internal/domain/czech.ParseMonthReferences`.
+
+This function is the second-most-load-bearing pure helper after
+`reconcile`: every payment-message → month inference goes through it.
+Risk #4 in the [parent plan](2026-05-03-2349-go-backend-rewrite.md)
+specifically calls out its semantics — wrap-around year inference and
+the `m >= 10 → previous year` standalone heuristic — as easy to mis-port.
+
+This plan locks the test table against the live Python implementation
+*before* coding, so the Go port has a verified parity baseline even
+before the M3.1/M3.2 fixture infrastructure exists.
+
+## Scope
+
+- New file `go/internal/domain/czech/parse_month_references.go` in the
+  existing `czech` package (alongside [normalize.go](../../go/internal/domain/czech/normalize.go)).
+- New file `go/internal/domain/czech/parse_month_references_test.go`
+  with the test table below.
+- **Out of scope:** parity-fixture wiring (M3.1/M3.2); CLI hook-up
+  (M2.11/M2.12); any consumer call-sites.
+- **No new dependencies** — stdlib `regexp`, `sort`, `strconv`, `strings`
+  plus the existing `czech.Normalize` cover everything.
+
+## Recommended approach
+
+### Python contract to mirror
+
+Three regex passes, all run on `normalize(text)`:
+
+1. `([\d+]+)\s*/\s*(\d{2,4})` — captures `"11+12/2025"`, `"01/26"`, `"1/26"`.
+   Split the months part on `+`, keep digit-only tokens, validate `1..12`.
+   Year < 100 → year + 2000.
+2. `(\d{1,2})\s*\.\s*(\d{4})` — captures `"12.2025"`. **4-digit year only**
+   (so `"1.26"` does not match).
+3. Czech month names. First the **range** sub-pass:
+   `(name)\s*-\s*(name)` finds pairs; walk start→end with `m % 12 + 1`,
+   stopping when `m == end_m`. Wrap rule: if `start_m > end_m`, months
+   `>= start_m` are `defaultYear - 1`, the rest are `defaultYear`. Both
+   matched names go into a `foundInRanges` set.
+   Then the **standalone** sub-pass: `\b(name)\b`, skipping any name in
+   `foundInRanges`. For each remaining match, `m >= 10 → defaultYear - 1`,
+   else `defaultYear`.
+
+Output: sorted, deduplicated `[]string` of `"YYYY-MM"`.
+
+### Go signature
+
+```go
+package czech
+
+// ParseMonthReferences extracts YYYY-MM month references from Czech
+// free text. defaultYear seeds two heuristics: standalone month names
+// with m >= 10 are treated as defaultYear-1 (out-of-year backfill), and
+// wrap-around ranges (e.g. listopad-leden) place months >= start in
+// defaultYear-1.
+func ParseMonthReferences(text string, defaultYear int) []string
+```
+
+Required `defaultYear` (no default value — Go convention).
+
+### Implementation sketch
+
+```go
+var czechMonths = map[string]int{
+	"leden": 1, "ledna": 1, "lednu": 1,
+	"unor": 2, "unora": 2, "unoru": 2,
+	"brezen": 3, "brezna": 3, "breznu": 3,
+	"duben": 4, "dubna": 4, "dubnu": 4,
+	"kveten": 5, "kvetna": 5, "kvetnu": 5,
+	"cerven": 6, "cervna": 6, "cervnu": 6,
+	"cervenec": 7, "cervnce": 7, "cervenci": 7,
+	"srpen": 8, "srpna": 8, "srpnu": 8,
+	"zari": 9,
+	"rijen": 10, "rijna": 10, "rijnu": 10,
+	"listopad": 11, "listopadu": 11,
+	"prosinec": 12, "prosince": 12, "prosinci": 12,
+}
+
+// Sorted by descending length at init, so longer alternatives win in
+// the regex (e.g. "cervenec" beats "cerven"). Mirrors Python's
+// sorted(..., key=len, reverse=True).
+var monthNameAlt = buildMonthNameAlt()
+
+var (
+	numericRe = regexp.MustCompile(`([\d+]+)\s*/\s*(\d{2,4})`)
+	dotRe     = regexp.MustCompile(`(\d{1,2})\s*\.\s*(\d{4})`)
+	rangeRe   = regexp.MustCompile(`(` + monthNameAlt + `)\s*-\s*(` + monthNameAlt + `)`)
+	standRe   = regexp.MustCompile(`\b(` + monthNameAlt + `)\b`)
+)
+```
+
+Three Go-specific gotchas worth a code comment:
+
+1. **RE2 alternation is leftmost-first**, same as Python `re`. Sorting
+   month names by descending length is therefore necessary (otherwise
+   `"cervenec"` matches as `"cerven"` + leftover `"ec"`). Mirror the
+   Python sort exactly.
+2. **Map iteration is randomized in Go.** Build the alternation list
+   from a sorted slice of keys, not by iterating the map.
+3. **`\d` and `\b`** in Go RE2 are ASCII-only, which matches the
+   effective behavior on `Normalize`'d input (NFKD already collapsed
+   any Unicode digits/letters that would matter; standalone Devanagari
+   digits in member messages aren't a real-world concern).
+
+The walk loop uses a bounded counter (max 12 iterations) defensively in
+Go; Python's `while True` is fine because every range terminates within
+12 hops, but a future reader appreciates the bound.
+
+### Test table (verified against live Python — `default_year=2026`)
+
+Locked outputs from `PYTHONPATH=scripts:. python -c 'from czech_utils
+import parse_month_references; print(parse_month_references(<input>, 2026))'`
+on 2026-05-05.
+
+| # | Input | Expected | Path exercised |
+|---|---|---|---|
+| 1 | `""` | `[]` | empty |
+| 2 | `"11+12/2025"` | `["2025-11", "2025-12"]` | numeric, plus-split |
+| 3 | `"1/2026"` | `["2026-01"]` | numeric, single |
+| 4 | `"01/26"` | `["2026-01"]` | 2-digit year normalization |
+| 5 | `"11+12/25"` | `["2025-11", "2025-12"]` | plus-split + 2-digit year |
+| 6 | `"12+1+2/2026"` | `["2026-01", "2026-02", "2026-12"]` | sorting |
+| 7 | `"12.2025"` | `["2025-12"]` | dot pattern |
+| 8 | `"1.26"` | `[]` | dot pattern requires 4-digit year |
+| 9 | `"leden"` | `["2026-01"]` | standalone, m<10 |
+| 10 | `"prosinec"` | `["2025-12"]` | standalone, m≥10 → previous year |
+| 11 | `"prosince"` | `["2025-12"]` | declension |
+| 12 | `"lednu"` | `["2026-01"]` | declension |
+| 13 | `"rijen"` | `["2025-10"]` | m≥10 boundary (10 itself) |
+| 14 | `"zari"` | `["2026-09"]` | m<10 just below boundary |
+| 15 | `"listopad-leden"` | `["2025-11", "2025-12", "2026-01"]` | wrap range Nov→Jan |
+| 16 | `"rijen-leden"` | `["2025-10", "2025-11", "2025-12", "2026-01"]` | wrap from October |
+| 17 | `"unor-kveten"` | `["2026-02", "2026-03", "2026-04", "2026-05"]` | non-wrap range |
+| 18 | `"leden-leden"` | `["2026-01"]` | degenerate range |
+| 19 | `"unor-listopad"` | `["2026-02", ..., "2026-11"]` (10 entries) | range spans m≥10 — heuristic does NOT fire (range exclusion) |
+| 20 | `"cervenec-srpen"` | `["2026-07", "2026-08"]` | longest-match alt (`cervenec` not `cerven`+`ec`) |
+| 21 | `"listopad-leden, prosinec"` | `["2025-11", "2025-12", "2026-01"]` | range + standalone, dedup |
+| 22 | `"prosinec leden"` | `["2025-12", "2026-01"]` | two standalones, no range |
+| 23 | `"11+12/2025, leden-brezen"` | `["2025-11", "2025-12", "2026-01", "2026-02", "2026-03"]` | numeric + range mix |
+| 24 | `"11+12/25 a listopad"` | `["2025-11", "2025-12"]` | dedup across passes |
+| 25 | `"prosince/2025"` | `["2025-12"]` | numeric pattern fails (no digits before `/`); standalone fires |
+| 26 | `"listopad-prosinec/2025"` | `["2026-11", "2026-12"]` | range wins; numeric pattern fails |
+| 27 | `"01.2026 / 02.2026"` | `["2026-01", "2026-02"]` | dot pattern only; numeric matches `(2026, 02)` but month 2026 is out of range |
+| 28 | `"/12/2025"` | `["2025-12"]` | numeric matches at second `/` |
+| 29 | `"PROSINEC"` | `["2025-12"]` | normalize lowercases |
+| 30 | `"Žluťoučký prosinec"` | `["2025-12"]` | normalize strips diacritics |
+| 31 | `"Únor - květen"` | `["2026-02", ..., "2026-05"]` | range tolerates spaces around `-`; diacritics are stripped by normalize |
+| 32 | `"platba 11/2025 a leden"` | `["2025-11", "2026-01"]` | mixed natural-language |
+| 33 | `"December"` | `[]` | English month names not recognized |
+| 34 | `"11+12/2025 11+12/2025"` | `["2025-11", "2025-12"]` | dedup of repeated input |
+| 35 | `"leden 2026"` | `["2026-01"]` | trailing year is ignored unless dot/slash separator present |
+
+35 cases is enough to lock semantics; the M3.x corpus will pile on
+real-message fixtures later.
+
+### Wire-up
+
+- No `go.mod` changes (stdlib only).
+- No CLI changes.
+- `Normalize` is in the same package, so call it directly.
+
+## Critical files
+
+- New: [go/internal/domain/czech/parse_month_references.go](../../go/internal/domain/czech/parse_month_references.go)
+- New: [go/internal/domain/czech/parse_month_references_test.go](../../go/internal/domain/czech/parse_month_references_test.go)
+- Reference (read-only): [scripts/czech_utils.py](../../scripts/czech_utils.py) — the porting source
+- Reference (read-only): [docs/plans/2026-05-03-2349-go-backend-rewrite.md](2026-05-03-2349-go-backend-rewrite.md) — risk #4
+- Reuses: [go/internal/domain/czech/normalize.go](../../go/internal/domain/czech/normalize.go) — `Normalize` is called once at the top of `ParseMonthReferences`
+
+## Verification
+
+End-to-end checks before marking M2.2 done:
+
+1. `cd go && go build ./...` — clean compile.
+2. `cd go && go test ./internal/domain/czech/...` — all 35 table cases green.
+3. `cd go && go test -race ./...` — race-clean (regex compiles are global; verify no init races).
+4. `cd go && golangci-lint run` (or `make go-lint` from repo root) — clean, gofumpt-formatted.
+5. **Spot parity** (manual, will be automated in M3.x): each test input has its expected output captured from the live Python implementation on 2026-05-05; the test table itself is the parity record. If any case diverges during implementation, re-run Python with the exact input to confirm the truth and update either the Go code or the test entry.
+6. `make go-build && make go-test && make go-lint` from repo root — proves M1/M2.1 gate still passes.
+
+## Branching & follow-up
+
+Per [CLAUDE.md](../../CLAUDE.md), this is feature work → branch + Gitea MR via `tea`:
+
+- Branch: `feat/m2-2-parse-month-references` off `main`.
+- Single focused commit, Co-Authored-By trailer.
+- Push with `-u`.
+- Open MR with `tea pr create --title "feat(go/M2.2): port czech.ParseMonthReferences" --description ... --base main --head feat/m2-2-parse-month-references`. Print the MR URL for the user.
+- User merges/deletes the branch in Gitea — never from the CLI.
+
+After merge (small doc edits land straight on `main` per CLAUDE.md exception):
+
+- Tick `M2.2` in the [progress tracker](2026-05-03-2349-go-backend-rewrite-progress.md) with the merge SHA.
+- Add a one-line `CHANGELOG.md` entry (timestamp via `date "+%Y-%m-%d %H:%M %Z"`).
+- Record any porting surprise (e.g. an unexpected diff between Go RE2 and Python `re`) in the tracker's "Notes & decisions" section.
+
+Next task is **M2.3 `domain/fees.CalculateFee`** — straightforward constants table; no parser semantics to debate.
diff --git a/go/internal/domain/czech/parse_month_references.go b/go/internal/domain/czech/parse_month_references.go
new file mode 100644
index 0000000..1aa1309
--- /dev/null
+++ b/go/internal/domain/czech/parse_month_references.go
@@ -0,0 +1,154 @@
+package czech
+
+import (
+	"fmt"
+	"regexp"
+	"sort"
+	"strconv"
+	"strings"
+)
+
+var czechMonths = map[string]int{
+	"leden": 1, "ledna": 1, "lednu": 1,
+	"unor": 2, "unora": 2, "unoru": 2,
+	"brezen": 3, "brezna": 3, "breznu": 3,
+	"duben": 4, "dubna": 4, "dubnu": 4,
+	"kveten": 5, "kvetna": 5, "kvetnu": 5,
+	"cerven": 6, "cervna": 6, "cervnu": 6,
+	"cervenec": 7, "cervnce": 7, "cervenci": 7,
+	"srpen": 8, "srpna": 8, "srpnu": 8,
+	"zari": 9,
+	"rijen": 10, "rijna": 10, "rijnu": 10,
+	"listopad": 11, "listopadu": 11,
+	"prosinec": 12, "prosince": 12, "prosinci": 12,
+}
+
+var (
+	numericRe *regexp.Regexp
+	dotRe     *regexp.Regexp
+	rangeRe   *regexp.Regexp
+	standRe   *regexp.Regexp
+)
+
+func init() {
+	// Sort by descending length so longer alternatives win in RE2 leftmost-first
+	// matching (e.g. "cervenec" is tried before "cerven").
+	names := make([]string, 0, len(czechMonths))
+	for name := range czechMonths {
+		names = append(names, name)
+	}
+	sort.Slice(names, func(i, j int) bool {
+		if len(names[i]) != len(names[j]) {
+			return len(names[i]) > len(names[j])
+		}
+		return names[i] < names[j]
+	})
+	alt := strings.Join(names, "|")
+
+	numericRe = regexp.MustCompile(`([\d+]+)\s*/\s*(\d{2,4})`)
+	dotRe = regexp.MustCompile(`(\d{1,2})\s*\.\s*(\d{4})`)
+	rangeRe = regexp.MustCompile(`(` + alt + `)\s*-\s*(` + alt + `)`)
+	standRe = regexp.MustCompile(`\b(` + alt + `)\b`)
+}
+
+// ParseMonthReferences extracts YYYY-MM month references from Czech free text.
+//
+// defaultYear seeds two heuristics: standalone month names with m >= 10 are
+// treated as defaultYear-1 (out-of-year backfill), and wrap-around ranges
+// (e.g. listopad-leden) place months >= start_m in defaultYear-1.
+//
+// Returns a sorted, deduplicated slice of "YYYY-MM" strings.
+func ParseMonthReferences(text string, defaultYear int) []string {
+	normalized := Normalize(text)
+	seen := map[string]struct{}{}
+
+	add := func(year, m int) {
+		if m >= 1 && m <= 12 {
+			seen[fmt.Sprintf("%04d-%02d", year, m)] = struct{}{}
+		}
+	}
+
+	// Pass 1: numeric months — "11+12/2025", "01/26", "1/2026"
+	for _, groups := range numericRe.FindAllStringSubmatch(normalized, -1) {
+		monthsPart, yearStr := groups[1], groups[2]
+		year, err := strconv.Atoi(yearStr)
+		if err != nil {
+			continue
+		}
+		if year < 100 {
+			year += 2000
+		}
+		for mStr := range strings.SplitSeq(monthsPart, "+") {
+			mStr = strings.TrimSpace(mStr)
+			if mStr == "" {
+				continue
+			}
+			allDigits := true
+			for _, c := range mStr {
+				if c < '0' || c > '9' {
+					allDigits = false
+					break
+				}
+			}
+			if !allDigits {
+				continue
+			}
+			m, err := strconv.Atoi(mStr)
+			if err != nil {
+				continue
+			}
+			add(year, m)
+		}
+	}
+
+	// Pass 2: dot-separated month.year — "12.2025" (4-digit year only)
+	for _, groups := range dotRe.FindAllStringSubmatch(normalized, -1) {
+		m, _ := strconv.Atoi(groups[1])
+		year, _ := strconv.Atoi(groups[2])
+		add(year, m)
+	}
+
+	// Pass 3a: Czech month name ranges — "listopad-leden"
+	foundInRanges := map[string]struct{}{}
+	for _, groups := range rangeRe.FindAllStringSubmatch(normalized, -1) {
+		startName, endName := groups[1], groups[2]
+		foundInRanges[startName] = struct{}{}
+		foundInRanges[endName] = struct{}{}
+		startM := czechMonths[startName]
+		endM := czechMonths[endName]
+		wraps := startM > endM
+		m := startM
+		for range 12 {
+			year := defaultYear
+			if wraps && m >= startM {
+				year = defaultYear - 1
+			}
+			add(year, m)
+			if m == endM {
+				break
+			}
+			m = m%12 + 1
+		}
+	}
+
+	// Pass 3b: standalone Czech month names (not part of a range)
+	for _, groups := range standRe.FindAllStringSubmatch(normalized, -1) {
+		name := groups[1]
+		if _, inRange := foundInRanges[name]; inRange {
+			continue
+		}
+		m := czechMonths[name]
+		year := defaultYear
+		if m >= 10 {
+			year = defaultYear - 1
+		}
+		add(year, m)
+	}
+
+	result := make([]string, 0, len(seen))
+	for k := range seen {
+		result = append(result, k)
+	}
+	sort.Strings(result)
+	return result
+}
diff --git a/go/internal/domain/czech/parse_month_references_test.go b/go/internal/domain/czech/parse_month_references_test.go
new file mode 100644
index 0000000..2f43a47
--- /dev/null
+++ b/go/internal/domain/czech/parse_month_references_test.go
@@ -0,0 +1,244 @@
+package czech
+
+import (
+	"reflect"
+	"testing"
+)
+
+func TestParseMonthReferences(t *testing.T) {
+	t.Parallel()
+
+	// All expected outputs verified against live Python implementation on 2026-05-05:
+	// PYTHONPATH=scripts:. python -c 'from czech_utils import parse_month_references; print(parse_month_references("", 2026))'
+	tests := []struct {
+		name        string
+		input       string
+		defaultYear int
+		want        []string
+	}{
+		{
+			name:        "empty",
+			input:       "",
+			defaultYear: 2026,
+			want:        []string{},
+		},
+		{
+			name:        "numeric plus-split two months full year",
+			input:       "11+12/2025",
+			defaultYear: 2026,
+			want:        []string{"2025-11", "2025-12"},
+		},
+		{
+			name:        "numeric single month full year",
+			input:       "1/2026",
+			defaultYear: 2026,
+			want:        []string{"2026-01"},
+		},
+		{
+			name:        "numeric 2-digit year",
+			input:       "01/26",
+			defaultYear: 2026,
+			want:        []string{"2026-01"},
+		},
+		{
+			name:        "numeric plus-split with 2-digit year",
+			input:       "11+12/25",
+			defaultYear: 2026,
+			want:        []string{"2025-11", "2025-12"},
+		},
+		{
+			name:        "numeric three months sorted",
+			input:       "12+1+2/2026",
+			defaultYear: 2026,
+			want:        []string{"2026-01", "2026-02", "2026-12"},
+		},
+		{
+			name:        "dot pattern",
+			input:       "12.2025",
+			defaultYear: 2026,
+			want:        []string{"2025-12"},
+		},
+		{
+			name:        "dot pattern requires 4-digit year",
+			input:       "1.26",
+			defaultYear: 2026,
+			want:        []string{},
+		},
+		{
+			name:        "standalone month below m10 threshold",
+			input:       "leden",
+			defaultYear: 2026,
+			want:        []string{"2026-01"},
+		},
+		{
+			name:        "standalone month m10 heuristic",
+			input:       "prosinec",
+			defaultYear: 2026,
+			want:        []string{"2025-12"},
+		},
+		{
+			name:        "declension prosince",
+			input:       "prosince",
+			defaultYear: 2026,
+			want:        []string{"2025-12"},
+		},
+		{
+			name:        "declension lednu",
+			input:       "lednu",
+			defaultYear: 2026,
+			want:        []string{"2026-01"},
+		},
+		{
+			name:        "standalone m10 boundary (rijen = October)",
+			input:       "rijen",
+			defaultYear: 2026,
+			want:        []string{"2025-10"},
+		},
+		{
+			name:        "standalone m9 just below boundary (zari = September)",
+			input:       "zari",
+			defaultYear: 2026,
+			want:        []string{"2026-09"},
+		},
+		{
+			name:        "range wrap Nov-Jan",
+			input:       "listopad-leden",
+			defaultYear: 2026,
+			want:        []string{"2025-11", "2025-12", "2026-01"},
+		},
+		{
+			name:        "range wrap starting at October",
+			input:       "rijen-leden",
+			defaultYear: 2026,
+			want:        []string{"2025-10", "2025-11", "2025-12", "2026-01"},
+		},
+		{
+			name:        "range no wrap",
+			input:       "unor-kveten",
+			defaultYear: 2026,
+			want:        []string{"2026-02", "2026-03", "2026-04", "2026-05"},
+		},
+		{
+			name:        "degenerate range same month",
+			input:       "leden-leden",
+			defaultYear: 2026,
+			want:        []string{"2026-01"},
+		},
+		{
+			name:        "range spanning m10 — heuristic does NOT fire for range members",
+			input:       "unor-listopad",
+			defaultYear: 2026,
+			want:        []string{"2026-02", "2026-03", "2026-04", "2026-05", "2026-06", "2026-07", "2026-08", "2026-09", "2026-10", "2026-11"},
+		},
+		{
+			name:        "longest-match alternation cervenec beats cerven",
+			input:       "cervenec-srpen",
+			defaultYear: 2026,
+			want:        []string{"2026-07", "2026-08"},
+		},
+		{
+			name:        "range plus standalone — range excludes, dedup",
+			input:       "listopad-leden, prosinec",
+			defaultYear: 2026,
+			want:        []string{"2025-11", "2025-12", "2026-01"},
+		},
+		{
+			name:        "two standalones no range",
+			input:       "prosinec leden",
+			defaultYear: 2026,
+			want:        []string{"2025-12", "2026-01"},
+		},
+		{
+			name:        "numeric plus range mix",
+			input:       "11+12/2025, leden-brezen",
+			defaultYear: 2026,
+			want:        []string{"2025-11", "2025-12", "2026-01", "2026-02", "2026-03"},
+		},
+		{
+			name:        "dedup across numeric and standalone passes",
+			input:       "11+12/25 a listopad",
+			defaultYear: 2026,
+			want:        []string{"2025-11", "2025-12"},
+		},
+		{
+			name:        "no digits before slash — standalone fires instead",
+			input:       "prosince/2025",
+			defaultYear: 2026,
+			want:        []string{"2025-12"},
+		},
+		{
+			name:        "range with trailing slash-year — numeric fails, range wins",
+			input:       "listopad-prosinec/2025",
+			defaultYear: 2026,
+			want:        []string{"2026-11", "2026-12"},
+		},
+		{
+			name:        "dot pattern only — numeric matches but month out of 1-12 range",
+			input:       "01.2026 / 02.2026",
+			defaultYear: 2026,
+			want:        []string{"2026-01", "2026-02"},
+		},
+		{
+			name:        "leading slash — numeric matches at second slash",
+			input:       "/12/2025",
+			defaultYear: 2026,
+			want:        []string{"2025-12"},
+		},
+		{
+			name:        "uppercase input normalized",
+			input:       "PROSINEC",
+			defaultYear: 2026,
+			want:        []string{"2025-12"},
+		},
+		{
+			name:        "diacritics stripped by Normalize",
+			input:       "Žluťoučký prosinec",
+			defaultYear: 2026,
+			want:        []string{"2025-12"},
+		},
+		{
+			name:        "diacritics in range with spaces around dash",
+			input:       "Únor - květen",
+			defaultYear: 2026,
+			want:        []string{"2026-02", "2026-03", "2026-04", "2026-05"},
+		},
+		{
+			name:        "natural language mixed with numeric and standalone",
+			input:       "platba 11/2025 a leden",
+			defaultYear: 2026,
+			want:        []string{"2025-11", "2026-01"},
+		},
+		{
+			name:        "English month name not recognized",
+			input:       "December",
+			defaultYear: 2026,
+			want:        []string{},
+		},
+		{
+			name:        "duplicate input deduped",
+			input:       "11+12/2025 11+12/2025",
+			defaultYear: 2026,
+			want:        []string{"2025-11", "2025-12"},
+		},
+		{
+			name:        "trailing year without separator ignored",
+			input:       "leden 2026",
+			defaultYear: 2026,
+			want:        []string{"2026-01"},
+		},
+	}
+
+	for _, tc := range tests {
+		t.Run(tc.name, func(t *testing.T) {
+			t.Parallel()
+			got := ParseMonthReferences(tc.input, tc.defaultYear)
+			if got == nil {
+				got = []string{}
+			}
+			if !reflect.DeepEqual(got, tc.want) {
+				t.Errorf("ParseMonthReferences(%q, %d)\n got %v\n want %v",
+					tc.input, tc.defaultYear, got, tc.want)
+			}
+		})
+	}
+}