fuj-management/docs/plans/2026-05-05-2337-go-rewrite-m2-2-parse-month-references.md
Jan Novak 6d971b61d4
feat(go/M2.2): port czech.ParseMonthReferences
Three-pass regex parser matching python/czech_utils.py parse_month_references:
1. Numeric slash notation — "11+12/2025", "01/26"; 2-digit year → +2000
2. Dot notation — "12.2025" (4-digit year only)
3. Czech month names — range walk (listopad-leden wrap logic) then
   standalone with m≥10 → defaultYear-1 heuristic; longest-match
   alternation (sorted desc by name length) handles cervenec vs cerven

35 table-driven tests, all expected outputs verified against live Python
on 2026-05-05 before locking. Plan at
docs/plans/2026-05-05-2337-go-rewrite-m2-2-parse-month-references.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-06 00:05:40 +02:00

Plan: Go rewrite — M2.2 domain/czech.ParseMonthReferences

Context

M2.1 (domain/czech.Normalize) merged via PR #4 (d9a61b3) on 2026-05-05. Per the progress tracker, M2.2 is next: port parse_month_references from scripts/czech_utils.py to Go as internal/domain/czech.ParseMonthReferences.

This function is the second-most-load-bearing pure helper after reconcile: every payment-message → month inference goes through it. Risk #4 in the parent plan specifically calls out its semantics — wrap-around year inference and the m >= 10 → previous year standalone heuristic — as easy to mis-port.

This plan locks the test table against the live Python implementation before coding, so the Go port has a verified parity baseline even before the M3.1/M3.2 fixture infrastructure exists.

Scope

  • New file go/internal/domain/czech/parse_month_references.go in the existing czech package (alongside normalize.go).
  • New file go/internal/domain/czech/parse_month_references_test.go with the test table below.
  • Out of scope: parity-fixture wiring (M3.1/M3.2); CLI hook-up (M2.11/M2.12); any consumer call-sites.
  • No new dependencies — stdlib regexp, sort, strconv, strings plus the existing czech.Normalize cover everything.

Python contract to mirror

Three regex passes, all run on normalize(text):

  1. ([\d+]+)\s*/\s*(\d{2,4}) — captures "11+12/2025", "01/26", "1/26". Split the months part on +, keep digit-only tokens, validate 1..12. Year < 100 → year + 2000.
  2. (\d{1,2})\s*\.\s*(\d{4}) — captures "12.2025". 4-digit year only (so "1.26" does not match).
  3. Czech month names. First the range sub-pass: (name)\s*-\s*(name) finds pairs; walk start→end with m % 12 + 1, stopping when m == end_m. Wrap rule: if start_m > end_m, months >= start_m are defaultYear - 1, the rest are defaultYear. Both matched names go into a foundInRanges set. Then the standalone sub-pass: \b(name)\b, skipping any name in foundInRanges. For each remaining match, m >= 10 → defaultYear - 1, else defaultYear.

Output: sorted, deduplicated []string of "YYYY-MM".

Go signature

package czech

// ParseMonthReferences extracts YYYY-MM month references from Czech
// free text. defaultYear seeds two heuristics: standalone month names
// with m >= 10 are treated as defaultYear-1 (out-of-year backfill), and
// wrap-around ranges (e.g. listopad-leden) place months >= start in
// defaultYear-1.
func ParseMonthReferences(text string, defaultYear int) []string

defaultYear is a required argument (Go has no default parameter values).

Implementation sketch

var czechMonths = map[string]int{
    "leden": 1, "ledna": 1, "lednu": 1,
    "unor": 2, "unora": 2, "unoru": 2,
    "brezen": 3, "brezna": 3, "breznu": 3,
    "duben": 4, "dubna": 4, "dubnu": 4,
    "kveten": 5, "kvetna": 5, "kvetnu": 5,
    "cerven": 6, "cervna": 6, "cervnu": 6,
    "cervenec": 7, "cervence": 7, "cervenci": 7,
    "srpen": 8, "srpna": 8, "srpnu": 8,
    "zari": 9,
    "rijen": 10, "rijna": 10, "rijnu": 10,
    "listopad": 11, "listopadu": 11,
    "prosinec": 12, "prosince": 12, "prosinci": 12,
}

// Sorted by descending length at init, so longer alternatives win in
// the regex (e.g. "cervenec" beats "cerven"). Mirrors Python's
// sorted(..., key=len, reverse=True).
var monthNameAlt = buildMonthNameAlt()

var (
    numericRe = regexp.MustCompile(`([\d+]+)\s*/\s*(\d{2,4})`)
    dotRe     = regexp.MustCompile(`(\d{1,2})\s*\.\s*(\d{4})`)
    rangeRe   = regexp.MustCompile(`(` + monthNameAlt + `)\s*-\s*(` + monthNameAlt + `)`)
    standRe   = regexp.MustCompile(`\b(` + monthNameAlt + `)\b`)
)

Three Go-specific gotchas worth a code comment:

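  1. RE2 alternation is leftmost-first, same as Python re. Sorting month names by descending length is therefore necessary: wherever no trailing \b forces the engine past the shorter alternative (the second name in rangeRe), "cervenec" would otherwise match as "cerven" and leave "ec" behind. Mirror the Python sort exactly.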
  2. Map iteration is randomized in Go. Build the alternation list from a sorted slice of keys, not by iterating the map.
  3. \d and \b in Go RE2 are ASCII-only, which matches the effective behavior on Normalize'd input (NFKD already collapsed any Unicode digits/letters that would matter; standalone Devanagari digits in member messages aren't a real-world concern).
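
Gotcha 1 can be checked directly against Go's regexp package. In this small demonstration `firstMatch` is a hypothetical helper and the patterns are cut down to the two names in question.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatch compiles pattern and returns the first match in text ("" if none).
func firstMatch(pattern, text string) string {
	return regexp.MustCompile(pattern).FindString(text)
}

func main() {
	// Leftmost-first: the earlier alternative wins even though it is shorter.
	fmt.Println(firstMatch(`cerven|cervenec`, "cervenec")) // cerven
	// Longest name first gives the intended match.
	fmt.Println(firstMatch(`cervenec|cerven`, "cervenec")) // cervenec
	// A trailing \b rejects "cerven" here, so order matters less in standRe.
	fmt.Println(firstMatch(`\b(cerven|cervenec)\b`, "cervenec")) // cervenec
}
```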

The walk loop uses a bounded counter (max 12 iterations) defensively in Go; Python's while True is fine because every range terminates within 12 hops, but a future reader appreciates the bound.

Test table (verified against live Python — default_year=2026)

Locked outputs from PYTHONPATH=scripts:. python -c 'from czech_utils import parse_month_references; print(parse_month_references(<input>, 2026))' on 2026-05-05.

| # | Input | Expected | Path exercised |
|---|-------|----------|----------------|
| 1 | "" | [] | empty |
| 2 | "11+12/2025" | ["2025-11", "2025-12"] | numeric, plus-split |
| 3 | "1/2026" | ["2026-01"] | numeric, single |
| 4 | "01/26" | ["2026-01"] | 2-digit year normalization |
| 5 | "11+12/25" | ["2025-11", "2025-12"] | plus-split + 2-digit year |
| 6 | "12+1+2/2026" | ["2026-01", "2026-02", "2026-12"] | sorting |
| 7 | "12.2025" | ["2025-12"] | dot pattern |
| 8 | "1.26" | [] | dot pattern requires 4-digit year |
| 9 | "leden" | ["2026-01"] | standalone, m<10 |
| 10 | "prosinec" | ["2025-12"] | standalone, m≥10 → previous year |
| 11 | "prosince" | ["2025-12"] | declension |
| 12 | "lednu" | ["2026-01"] | declension |
| 13 | "rijen" | ["2025-10"] | m≥10 boundary (10 itself) |
| 14 | "zari" | ["2026-09"] | m<10 just below boundary |
| 15 | "listopad-leden" | ["2025-11", "2025-12", "2026-01"] | wrap range Nov→Jan |
| 16 | "rijen-leden" | ["2025-10", "2025-11", "2025-12", "2026-01"] | wrap from October |
| 17 | "unor-kveten" | ["2026-02", "2026-03", "2026-04", "2026-05"] | non-wrap range |
| 18 | "leden-leden" | ["2026-01"] | degenerate range |
| 19 | "unor-listopad" | ["2026-02", ..., "2026-11"] (10 entries) | range spans m≥10; heuristic does NOT fire (range exclusion) |
| 20 | "cervenec-srpen" | ["2026-07", "2026-08"] | longest-match alt (cervenec not cerven+ec) |
| 21 | "listopad-leden, prosinec" | ["2025-11", "2025-12", "2026-01"] | range + standalone, dedup |
| 22 | "prosinec leden" | ["2025-12", "2026-01"] | two standalones, no range |
| 23 | "11+12/2025, leden-brezen" | ["2025-11", "2025-12", "2026-01", "2026-02", "2026-03"] | numeric + range mix |
| 24 | "11+12/25 a listopad" | ["2025-11", "2025-12"] | dedup across passes |
| 25 | "prosince/2025" | ["2025-12"] | numeric pattern fails (no digits before /); standalone fires |
| 26 | "listopad-prosinec/2025" | ["2026-11", "2026-12"] | range wins; numeric pattern fails |
| 27 | "01.2026 / 02.2026" | ["2026-01", "2026-02"] | dot pattern only; numeric matches (2026, 02) but month 2026 is out of range |
| 28 | "/12/2025" | ["2025-12"] | numeric matches at second / |
| 29 | "PROSINEC" | ["2025-12"] | normalize lowercases |
| 30 | "Žluťoučký prosinec" | ["2025-12"] | normalize strips diacritics |
| 31 | "Únor - květen" | ["2026-02", ..., "2026-05"] | range tolerates spaces around -; normalize strips the diacritics first |
| 32 | "platba 11/2025 a leden" | ["2025-11", "2026-01"] | mixed natural-language |
| 33 | "December" | [] | English month names not recognized |
| 34 | "11+12/2025 11+12/2025" | ["2025-11", "2025-12"] | dedup of repeated input |
| 35 | "leden 2026" | ["2026-01"] | trailing year is ignored unless dot/slash separator present |

35 cases is enough to lock semantics; the M3.x corpus will pile on real-message fixtures later.
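
One way the table could be driven, shown as a standalone sketch rather than the real test file: `monthCase`, `sampleCases`, and `failures` are hypothetical names, only four rows are copied in, and `parse` stands in for czech.ParseMonthReferences, which this sketch does not define. The actual parse_month_references_test.go would more likely use the testing package with subtests.

```go
package main

import (
	"fmt"
	"reflect"
)

// monthCase is one table row: the input text and the locked Python output.
type monthCase struct {
	input string
	want  []string
}

// A handful of rows copied from the table (default year 2026).
var sampleCases = []monthCase{
	{"", []string{}},
	{"01/26", []string{"2026-01"}},
	{"prosinec", []string{"2025-12"}},
	{"listopad-leden", []string{"2025-11", "2025-12", "2026-01"}},
}

// failures runs parse over every case and describes each mismatch.
func failures(parse func(string, int) []string, defaultYear int) []string {
	var out []string
	for _, c := range sampleCases {
		got := parse(c.input, defaultYear)
		// reflect.DeepEqual treats nil and []string{} as different,
		// so handle the "no matches" case explicitly.
		if len(got) == 0 && len(c.want) == 0 {
			continue
		}
		if !reflect.DeepEqual(got, c.want) {
			out = append(out, fmt.Sprintf("%q: got %v, want %v", c.input, got, c.want))
		}
	}
	return out
}

func main() {
	// A deliberately wrong stand-in parser, to show what a report looks like.
	stub := func(string, int) []string { return nil }
	for _, f := range failures(stub, 2026) {
		fmt.Println(f)
	}
}
```

The nil-versus-empty handling matters for row 1: whether the Go function returns nil or an empty slice for "no matches" is a decision the real test should pin down.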

Wire-up

  • No go.mod changes (stdlib only).
  • No CLI changes.
  • Normalize is in the same package, so call it directly.

Critical files

Verification

End-to-end checks before marking M2.2 done:

  1. cd go && go build ./... — clean compile.
  2. cd go && go test ./internal/domain/czech/... — all 35 table cases green.
  3. cd go && go test -race ./... — race-clean (regex compiles are global; verify no init races).
  4. cd go && golangci-lint run (or make go-lint from repo root) — clean, gofumpt-formatted.
  5. Spot parity (manual, will be automated in M3.x): each test input has its expected output captured from the live Python implementation on 2026-05-05; the test table itself is the parity record. If any case diverges during implementation, re-run Python with the exact input to confirm the truth and update either the Go code or the test entry.
  6. make go-build && make go-test && make go-lint from repo root — proves M1/M2.1 gate still passes.

Branching & follow-up

Per CLAUDE.md, this is feature work → branch + Gitea MR via tea:

  • Branch: feat/m2-2-parse-month-references off main.
  • Single focused commit, Co-Authored-By trailer.
  • Push with -u.
  • Open MR with tea pr create --title "feat(go/M2.2): port czech.ParseMonthReferences" --description ... --base main --head feat/m2-2-parse-month-references. Print the MR URL for the user.
  • User merges/deletes the branch in Gitea — never from the CLI.

After merge (small doc edits land straight on main per CLAUDE.md exception):

  • Tick M2.2 in the progress tracker with the merge SHA.
  • Add a one-line CHANGELOG.md entry (timestamp via date "+%Y-%m-%d %H:%M %Z").
  • Record any porting surprise (e.g. an unexpected diff between Go RE2 and Python re) in the tracker's "Notes & decisions" section.

Next task is M2.3 domain/fees.CalculateFee — straightforward constants table; no parser semantics to debate.