# Plan: Go rewrite M2.2: `domain/czech.ParseMonthReferences`

## Context

M2.1 (`domain/czech.Normalize`) merged via PR #4 (`d9a61b3`) on 2026-05-05. Per the [progress tracker](2026-05-03-2349-go-backend-rewrite-progress.md), **M2.2** is next: port `parse_month_references` from [scripts/czech_utils.py](../../scripts/czech_utils.py) to Go as `internal/domain/czech.ParseMonthReferences`.

This function is the second-most-load-bearing pure helper after `reconcile`: every payment-message → month inference goes through it. Risk #4 in the [parent plan](2026-05-03-2349-go-backend-rewrite.md) specifically calls out its semantics (wrap-around year inference and the `m >= 10 → previous year` standalone heuristic) as easy to mis-port. This plan locks the test table against the live Python implementation *before* coding, so the Go port has a verified parity baseline even before the M3.1/M3.2 fixture infrastructure exists.

## Scope

- New file `go/internal/domain/czech/parse_month_references.go` in the existing `czech` package (alongside [normalize.go](../../go/internal/domain/czech/normalize.go)).
- New file `go/internal/domain/czech/parse_month_references_test.go` with the test table below.
- **Out of scope:** parity-fixture wiring (M3.1/M3.2); CLI hook-up (M2.11/M2.12); any consumer call-sites.
- **No new dependencies:** stdlib `regexp`, `sort`, `strconv`, `strings` plus the existing `czech.Normalize` cover everything.

## Recommended approach

### Python contract to mirror

Three regex passes, all run on `normalize(text)`:

1. `([\d+]+)\s*/\s*(\d{2,4})`: captures `"11+12/2025"`, `"01/26"`, `"1/26"`. Split the months part on `+`, keep digit-only tokens, validate `1..12`. Year < 100 → year + 2000.
2. `(\d{1,2})\s*\.\s*(\d{4})`: captures `"12.2025"`. **4-digit year only** (so `"1.26"` does not match).
3. Czech month names. First the **range** sub-pass: `(name)\s*-\s*(name)` finds pairs; walk start→end inclusive, advancing with `m % 12 + 1` and stopping once `m == end_m` has been emitted. Wrap rule: if `start_m > end_m`, months `>= start_m` are `defaultYear - 1`, the rest are `defaultYear`. Both matched names go into a `foundInRanges` set. Then the **standalone** sub-pass: `\b(name)\b`, skipping any name in `foundInRanges`. For each remaining match, `m >= 10 → defaultYear - 1`, else `defaultYear`.

Output: sorted, deduplicated `[]string` of `"YYYY-MM"`.

### Go signature

```go
package czech

// ParseMonthReferences extracts YYYY-MM month references from Czech
// free text. defaultYear seeds two heuristics: standalone month names
// with m >= 10 are treated as defaultYear-1 (out-of-year backfill), and
// wrap-around ranges (e.g. listopad-leden) place months >= start in
// defaultYear-1.
func ParseMonthReferences(text string, defaultYear int) []string
```

`defaultYear` is required: Go has no default parameter values, so every caller passes it explicitly.

### Implementation sketch

```go
var czechMonths = map[string]int{
	"leden": 1, "ledna": 1, "lednu": 1,
	"unor": 2, "unora": 2, "unoru": 2,
	"brezen": 3, "brezna": 3, "breznu": 3,
	"duben": 4, "dubna": 4, "dubnu": 4,
	"kveten": 5, "kvetna": 5, "kvetnu": 5,
	"cerven": 6, "cervna": 6, "cervnu": 6,
	"cervenec": 7, "cervnce": 7, "cervenci": 7,
	"srpen": 8, "srpna": 8, "srpnu": 8,
	"zari": 9,
	"rijen": 10, "rijna": 10, "rijnu": 10,
	"listopad": 11, "listopadu": 11,
	"prosinec": 12, "prosince": 12, "prosinci": 12,
}

// Sorted by descending length at init, so longer alternatives win in
// the regex (e.g. "cervenec" beats "cerven"). Mirrors Python's
// sorted(..., key=len, reverse=True).
var monthNameAlt = buildMonthNameAlt()

var (
	numericRe = regexp.MustCompile(`([\d+]+)\s*/\s*(\d{2,4})`)
	dotRe     = regexp.MustCompile(`(\d{1,2})\s*\.\s*(\d{4})`)
	rangeRe   = regexp.MustCompile(`(` + monthNameAlt + `)\s*-\s*(` + monthNameAlt + `)`)
	standRe   = regexp.MustCompile(`\b(` + monthNameAlt + `)\b`)
)
```

Three Go-specific gotchas worth a code comment:

1. **RE2 alternation is leftmost-first**, same as Python `re`.
   Sorting month names by descending length is therefore necessary (otherwise `"cervenec"` matches as `"cerven"` + leftover `"ec"`). Mirror the Python sort exactly.
2. **Map iteration is randomized in Go.** Build the alternation list from a sorted slice of keys, not by iterating the map, so the compiled pattern is deterministic. The order among equal-length names cannot change a match, because two distinct names of the same length are never prefixes of each other.
3. **`\d` and `\b`** in Go RE2 are ASCII-only, which matches the effective behavior on `Normalize`'d input (NFKD already collapsed any Unicode digits/letters that would matter; standalone Devanagari digits in member messages aren't a real-world concern).

The walk loop uses a bounded counter (max 12 iterations) defensively in Go. Python's `while True` is fine because every range terminates within 12 hops, but the explicit bound makes termination obvious to a future reader.

### Test table (verified against live Python, `default_year=2026`)

Locked outputs from `PYTHONPATH=scripts:. python -c 'from czech_utils import parse_month_references; print(parse_month_references(<input>, 2026))'` on 2026-05-05, with each Input value below substituted for `<input>`.

| # | Input | Expected | Path exercised |
|---|---|---|---|
| 1 | `""` | `[]` | empty |
| 2 | `"11+12/2025"` | `["2025-11", "2025-12"]` | numeric, plus-split |
| 3 | `"1/2026"` | `["2026-01"]` | numeric, single |
| 4 | `"01/26"` | `["2026-01"]` | 2-digit year normalization |
| 5 | `"11+12/25"` | `["2025-11", "2025-12"]` | plus-split + 2-digit year |
| 6 | `"12+1+2/2026"` | `["2026-01", "2026-02", "2026-12"]` | sorting |
| 7 | `"12.2025"` | `["2025-12"]` | dot pattern |
| 8 | `"1.26"` | `[]` | dot pattern requires 4-digit year |
| 9 | `"leden"` | `["2026-01"]` | standalone, m<10 |
| 10 | `"prosinec"` | `["2025-12"]` | standalone, m≥10 → previous year |
| 11 | `"prosince"` | `["2025-12"]` | declension |
| 12 | `"lednu"` | `["2026-01"]` | declension |
| 13 | `"rijen"` | `["2025-10"]` | m≥10 boundary (10 itself) |
| 14 | `"zari"` | `["2026-09"]` | m<10 just below boundary |
| 15 | `"listopad-leden"` | `["2025-11", "2025-12", "2026-01"]` | wrap range Nov→Jan |
| 16 | `"rijen-leden"` | `["2025-10", "2025-11", "2025-12", "2026-01"]` | wrap from October |
| 17 | `"unor-kveten"` | `["2026-02", "2026-03", "2026-04", "2026-05"]` | non-wrap range |
| 18 | `"leden-leden"` | `["2026-01"]` | degenerate range |
| 19 | `"unor-listopad"` | `["2026-02", ..., "2026-11"]` (10 entries) | range spans m≥10; heuristic does NOT fire (range exclusion) |
| 20 | `"cervenec-srpen"` | `["2026-07", "2026-08"]` | longest-match alt (`cervenec` not `cerven`+`ec`) |
| 21 | `"listopad-leden, prosinec"` | `["2025-11", "2025-12", "2026-01"]` | range + standalone, dedup |
| 22 | `"prosinec leden"` | `["2025-12", "2026-01"]` | two standalones, no range |
| 23 | `"11+12/2025, leden-brezen"` | `["2025-11", "2025-12", "2026-01", "2026-02", "2026-03"]` | numeric + range mix |
| 24 | `"11+12/25 a listopad"` | `["2025-11", "2025-12"]` | dedup across passes |
| 25 | `"prosince/2025"` | `["2025-12"]` | numeric pattern fails (no digits before `/`); standalone fires |
| 26 | `"listopad-prosinec/2025"` | `["2026-11", "2026-12"]` | range wins; numeric pattern fails |
| 27 | `"01.2026 / 02.2026"` | `["2026-01", "2026-02"]` | dot pattern only; numeric matches `(2026, 02)` but month 2026 is out of range |
| 28 | `"/12/2025"` | `["2025-12"]` | numeric matches at second `/` |
| 29 | `"PROSINEC"` | `["2025-12"]` | normalize lowercases |
| 30 | `"Žluťoučký prosinec"` | `["2025-12"]` | normalize strips diacritics |
| 31 | `"Únor - květen"` | `["2026-02", ..., "2026-05"]` | range tolerates spaces around `-`; names with diacritics still match after normalize |
| 32 | `"platba 11/2025 a leden"` | `["2025-11", "2026-01"]` | mixed natural-language |
| 33 | `"December"` | `[]` | English month names not recognized |
| 34 | `"11+12/2025 11+12/2025"` | `["2025-11", "2025-12"]` | dedup of repeated input |
| 35 | `"leden 2026"` | `["2026-01"]` | trailing year is ignored unless dot/slash separator present |

35 cases are enough to lock semantics; the M3.x corpus will add real-message fixtures later.

### Wire-up

- No `go.mod` changes (stdlib only).
- No CLI changes.
- `Normalize` is in the same package, so call it directly.

## Critical files

- New: [go/internal/domain/czech/parse_month_references.go](../../go/internal/domain/czech/parse_month_references.go)
- New: [go/internal/domain/czech/parse_month_references_test.go](../../go/internal/domain/czech/parse_month_references_test.go)
- Reference (read-only): [scripts/czech_utils.py](../../scripts/czech_utils.py), the porting source
- Reference (read-only): [docs/plans/2026-05-03-2349-go-backend-rewrite.md](2026-05-03-2349-go-backend-rewrite.md), risk #4
- Reuses: [go/internal/domain/czech/normalize.go](../../go/internal/domain/czech/normalize.go); `Normalize` is called once at the top of `ParseMonthReferences`

## Verification

End-to-end checks before marking M2.2 done:

1. `cd go && go build ./...`: clean compile.
2. `cd go && go test ./internal/domain/czech/...`: all 35 table cases green.
3. `cd go && go test -race ./...`: race-clean (regex compiles are global; verify no init races).
4. `cd go && golangci-lint run` (or `make go-lint` from repo root): clean, gofumpt-formatted.
5. **Spot parity** (manual, will be automated in M3.x): each test input has its expected output captured from the live Python implementation on 2026-05-05; the test table itself is the parity record. If any case diverges during implementation, re-run Python with the exact input to confirm the truth and update either the Go code or the test entry.
6. `make go-build && make go-test && make go-lint` from repo root: proves M1/M2.1 gate still passes.

## Branching & follow-up

Per [CLAUDE.md](../../CLAUDE.md), this is feature work → branch + Gitea MR via `tea`:

- Branch: `feat/m2-2-parse-month-references` off `main`.
- Single focused commit, Co-Authored-By trailer.
- Push with `-u`.
- Open MR with `tea pr create --title "feat(go/M2.2): port czech.ParseMonthReferences" --description ... --base main --head feat/m2-2-parse-month-references`. Print the MR URL for the user.
- User merges/deletes the branch in Gitea, never from the CLI.

After merge (small doc edits land straight on `main` per CLAUDE.md exception):

- Tick `M2.2` in the [progress tracker](2026-05-03-2349-go-backend-rewrite-progress.md) with the merge SHA.
- Add a one-line `CHANGELOG.md` entry (timestamp via `date "+%Y-%m-%d %H:%M %Z"`).
- Record any porting surprise (e.g. an unexpected diff between Go RE2 and Python `re`) in the tracker's "Notes & decisions" section.

Next task is **M2.3 `domain/fees.CalculateFee`**: straightforward constants table; no parser semantics to debate.