Three-pass regex parser matching python/czech_utils.py parse_month_references:
1. Numeric slash notation: "11+12/2025", "01/26"; 2-digit year → +2000
2. Dot notation: "12.2025" (4-digit year only)
3. Czech month names: range walk (listopad-leden wrap logic), then standalone with the m≥10 → defaultYear-1 heuristic; longest-match alternation (sorted desc by name length) handles cervenec vs cerven

35 table-driven tests, all expected outputs verified against live Python on 2026-05-05 before locking.
Plan at docs/plans/2026-05-05-2337-go-rewrite-m2-2-parse-month-references.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Plan: Go rewrite — M2.2 domain/czech.ParseMonthReferences
Context
M2.1 (domain/czech.Normalize) merged via PR #4 (d9a61b3) on
2026-05-05. Per the progress tracker,
M2.2 is next: port parse_month_references from
scripts/czech_utils.py to Go as
internal/domain/czech.ParseMonthReferences.
This function is the second-most-load-bearing pure helper after
reconcile: every payment-message → month inference goes through it.
Risk #4 in the parent plan
specifically calls out its semantics — wrap-around year inference and
the m >= 10 → previous year standalone heuristic — as easy to mis-port.
This plan locks the test table against the live Python implementation before coding, so the Go port has a verified parity baseline even before the M3.1/M3.2 fixture infrastructure exists.
Scope
- New file `go/internal/domain/czech/parse_month_references.go` in the existing `czech` package (alongside `normalize.go`).
- New file `go/internal/domain/czech/parse_month_references_test.go` with the test table below.
- Out of scope: parity-fixture wiring (M3.1/M3.2); CLI hook-up (M2.11/M2.12); any consumer call-sites.
- No new dependencies: stdlib `regexp`, `sort`, `strconv`, `strings` plus the existing `czech.Normalize` cover everything.
Recommended approach
Python contract to mirror
Three regex passes, all run on normalize(text):
1. `([\d+]+)\s*/\s*(\d{2,4})` captures `"11+12/2025"`, `"01/26"`, `"1/26"`. Split the months part on `+`, keep digit-only tokens, validate `1..12`. Year < 100 → year + 2000.
2. `(\d{1,2})\s*\.\s*(\d{4})` captures `"12.2025"`. 4-digit year only (so `"1.26"` does not match).
3. Czech month names. First the range sub-pass: `(name)\s*-\s*(name)` finds pairs; walk start→end with `m % 12 + 1`, stopping when `m == end_m`. Wrap rule: if `start_m > end_m`, months `>= start_m` are `defaultYear - 1`, the rest are `defaultYear`. Both matched names go into a `foundInRanges` set. Then the standalone sub-pass: `\b(name)\b`, skipping any name in `foundInRanges`. For each remaining match, `m >= 10 → defaultYear - 1`, else `defaultYear`.
Output: sorted, deduplicated []string of "YYYY-MM".
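The two numeric passes of the contract can be sketched in Go roughly as follows. This is a minimal illustration rather than the ported implementation: `parseNumericAndDot` is a hypothetical helper name, the input is assumed to be already normalized, and the `%04d-%02d` formatting is an assumption about how years are rendered.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
	"strings"
)

var (
	numericRe = regexp.MustCompile(`([\d+]+)\s*/\s*(\d{2,4})`)
	dotRe     = regexp.MustCompile(`(\d{1,2})\s*\.\s*(\d{4})`)
)

// parseNumericAndDot is a hypothetical helper covering passes 1 and 2 only.
// It assumes text has already been normalized.
func parseNumericAndDot(text string) []string {
	seen := map[string]struct{}{}
	add := func(year, month int) {
		if month >= 1 && month <= 12 { // drop out-of-range months
			seen[fmt.Sprintf("%04d-%02d", year, month)] = struct{}{}
		}
	}

	// Pass 1: slash notation such as "11+12/2025" or "01/26".
	for _, m := range numericRe.FindAllStringSubmatch(text, -1) {
		year, err := strconv.Atoi(m[2])
		if err != nil {
			continue
		}
		if year < 100 {
			year += 2000 // 2-digit year normalization
		}
		for _, tok := range strings.Split(m[1], "+") {
			month, err := strconv.Atoi(tok)
			if err != nil { // empty token, e.g. from "11++12"
				continue
			}
			add(year, month)
		}
	}

	// Pass 2: dot notation such as "12.2025" (4-digit year only).
	for _, m := range dotRe.FindAllStringSubmatch(text, -1) {
		month, errM := strconv.Atoi(m[1])
		year, errY := strconv.Atoi(m[2])
		if errM != nil || errY != nil {
			continue
		}
		add(year, month)
	}

	// Deduplicated via the set, then sorted for a stable result.
	out := make([]string, 0, len(seen))
	for k := range seen {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println(parseNumericAndDot("12+1+2/2026"))
	fmt.Println(parseNumericAndDot("01.2026 / 02.2026"))
}
```

Collecting into a set and sorting once at the end keeps deduplication across passes trivial; the month-name pass would add to the same set before the final sort.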
Go signature
package czech
// ParseMonthReferences extracts YYYY-MM month references from Czech
// free text. defaultYear seeds two heuristics: standalone month names
// with m >= 10 are treated as defaultYear-1 (out-of-year backfill), and
// wrap-around ranges (e.g. listopad-leden) place months >= start in
// defaultYear-1.
func ParseMonthReferences(text string, defaultYear int) []string
`defaultYear` is a required parameter: Go has no default argument values, so every caller passes it explicitly.
Implementation sketch
// czechMonths maps normalized (lowercase, diacritics-stripped) Czech month
// names and their declensions to month numbers.
var czechMonths = map[string]int{
	"leden": 1, "ledna": 1, "lednu": 1,
	"unor": 2, "unora": 2, "unoru": 2,
	"brezen": 3, "brezna": 3, "breznu": 3,
	"duben": 4, "dubna": 4, "dubnu": 4,
	"kveten": 5, "kvetna": 5, "kvetnu": 5,
	"cerven": 6, "cervna": 6, "cervnu": 6,
	"cervenec": 7, "cervence": 7, "cervenci": 7,
	"srpen": 8, "srpna": 8, "srpnu": 8,
	"zari": 9,
	"rijen": 10, "rijna": 10, "rijnu": 10,
	"listopad": 11, "listopadu": 11,
	"prosinec": 12, "prosince": 12, "prosinci": 12,
}
// Sorted by descending length at init, so longer alternatives win in
// the regex (e.g. "cervenec" beats "cerven"). Mirrors Python's
// sorted(..., key=len, reverse=True).
var monthNameAlt = buildMonthNameAlt()
var (
	numericRe = regexp.MustCompile(`([\d+]+)\s*/\s*(\d{2,4})`)
	dotRe     = regexp.MustCompile(`(\d{1,2})\s*\.\s*(\d{4})`)
	rangeRe   = regexp.MustCompile(`(` + monthNameAlt + `)\s*-\s*(` + monthNameAlt + `)`)
	standRe   = regexp.MustCompile(`\b(` + monthNameAlt + `)\b`)
)
Three Go-specific gotchas worth a code comment:
- RE2 alternation is leftmost-first, same as Python `re`. Sorting month names by descending length is therefore necessary (otherwise `"cervenec"` matches as `"cerven"` + leftover `"ec"`). Mirror the Python sort exactly.
- Map iteration is randomized in Go. Build the alternation list from a sorted slice of keys, not by iterating the map.
- `\d` and `\b` in Go RE2 are ASCII-only, which matches the effective behavior on `Normalize`'d input (NFKD already collapsed any Unicode digits/letters that would matter; standalone Devanagari digits in member messages aren't a real-world concern).
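The first two gotchas can be illustrated with a sketch of `buildMonthNameAlt`. The body below is an assumption about how that helper might look, not the committed code; the month table is cut down to five names for brevity, and the alphabetical tiebreak is an added choice to make the pattern string reproducible.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// Reduced table for illustration only; the real map holds every declension.
var czechMonths = map[string]int{
	"cerven": 6, "cervna": 6,
	"cervenec": 7, "cervenci": 7,
	"srpen": 8,
}

// buildMonthNameAlt returns the month names joined with "|", longest first.
// Keys are copied into a slice and sorted, so Go's randomized map iteration
// order never leaks into the compiled pattern.
func buildMonthNameAlt() string {
	names := make([]string, 0, len(czechMonths))
	for name := range czechMonths {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		if len(names[i]) != len(names[j]) {
			return len(names[i]) > len(names[j]) // longer names first
		}
		return names[i] < names[j] // deterministic tiebreak (assumption)
	})
	return strings.Join(names, "|")
}

var monthNameAlt = buildMonthNameAlt()

var rangeRe = regexp.MustCompile(`(` + monthNameAlt + `)\s*-\s*(` + monthNameAlt + `)`)

func main() {
	fmt.Println(monthNameAlt)
	// The second group has no trailing anchor, so without the length sort
	// it could stop at "cerven" and leave "ec" unmatched.
	fmt.Println(rangeRe.FindStringSubmatch("srpen-cervenec"))
}
```

Two distinct names of equal length cannot both match at the same position, so the order among ties has no effect on matching; only the longest-first ordering matters for correctness.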
The walk loop uses a bounded counter (max 12 iterations) defensively in
Go; Python's while True is fine because every range terminates within
12 hops, but a future reader appreciates the bound.
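A bounded version of the walk might look like the sketch below. `walkRange` is a hypothetical helper name, and the `%04d-%02d` formatting is an assumption; the wrap rule and the `m%12 + 1` step follow the contract described earlier.

```go
package main

import "fmt"

// walkRange is a hypothetical helper: it walks from startM to endM
// (inclusive) and applies the wrap rule. Both months are assumed to be
// in 1..12.
func walkRange(startM, endM, defaultYear int) []string {
	wrap := startM > endM // e.g. listopad-leden crosses the year boundary
	var out []string
	m := startM
	// At most 12 iterations: every valid range terminates within 12 hops,
	// and the bound guards against an unexpected input looping forever.
	for i := 0; i < 12; i++ {
		year := defaultYear
		if wrap && m >= startM {
			year = defaultYear - 1 // months before the wrap belong to the previous year
		}
		out = append(out, fmt.Sprintf("%04d-%02d", year, m))
		if m == endM {
			break
		}
		m = m%12 + 1 // 12 wraps to 1
	}
	return out
}

func main() {
	fmt.Println(walkRange(11, 1, 2026)) // listopad-leden
	fmt.Println(walkRange(2, 5, 2026))  // unor-kveten
}
```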
Test table (verified against live Python — default_year=2026)
Locked outputs from PYTHONPATH=scripts:. python -c 'from czech_utils import parse_month_references; print(parse_month_references(<input>, 2026))'
on 2026-05-05.
| # | Input | Expected | Path exercised |
|---|---|---|---|
| 1 | `""` | `[]` | empty |
| 2 | `"11+12/2025"` | `["2025-11", "2025-12"]` | numeric, plus-split |
| 3 | `"1/2026"` | `["2026-01"]` | numeric, single |
| 4 | `"01/26"` | `["2026-01"]` | 2-digit year normalization |
| 5 | `"11+12/25"` | `["2025-11", "2025-12"]` | plus-split + 2-digit year |
| 6 | `"12+1+2/2026"` | `["2026-01", "2026-02", "2026-12"]` | sorting |
| 7 | `"12.2025"` | `["2025-12"]` | dot pattern |
| 8 | `"1.26"` | `[]` | dot pattern requires 4-digit year |
| 9 | `"leden"` | `["2026-01"]` | standalone, m<10 |
| 10 | `"prosinec"` | `["2025-12"]` | standalone, m≥10 → previous year |
| 11 | `"prosince"` | `["2025-12"]` | declension |
| 12 | `"lednu"` | `["2026-01"]` | declension |
| 13 | `"rijen"` | `["2025-10"]` | m≥10 boundary (10 itself) |
| 14 | `"zari"` | `["2026-09"]` | m<10 just below boundary |
| 15 | `"listopad-leden"` | `["2025-11", "2025-12", "2026-01"]` | wrap range Nov→Jan |
| 16 | `"rijen-leden"` | `["2025-10", "2025-11", "2025-12", "2026-01"]` | wrap from October |
| 17 | `"unor-kveten"` | `["2026-02", "2026-03", "2026-04", "2026-05"]` | non-wrap range |
| 18 | `"leden-leden"` | `["2026-01"]` | degenerate range |
| 19 | `"unor-listopad"` | `["2026-02", ..., "2026-11"]` (10 entries) | range spans m≥10; heuristic does NOT fire (range exclusion) |
| 20 | `"cervenec-srpen"` | `["2026-07", "2026-08"]` | longest-match alt (cervenec not cerven+ec) |
| 21 | `"listopad-leden, prosinec"` | `["2025-11", "2025-12", "2026-01"]` | range + standalone, dedup |
| 22 | `"prosinec leden"` | `["2025-12", "2026-01"]` | two standalones, no range |
| 23 | `"11+12/2025, leden-brezen"` | `["2025-11", "2025-12", "2026-01", "2026-02", "2026-03"]` | numeric + range mix |
| 24 | `"11+12/25 a listopad"` | `["2025-11", "2025-12"]` | dedup across passes |
| 25 | `"prosince/2025"` | `["2025-12"]` | numeric pattern fails (no digits before `/`); standalone fires |
| 26 | `"listopad-prosinec/2025"` | `["2026-11", "2026-12"]` | range wins; numeric pattern fails |
| 27 | `"01.2026 / 02.2026"` | `["2026-01", "2026-02"]` | dot pattern only; numeric matches (2026, 02) but month 2026 is out of range |
| 28 | `"/12/2025"` | `["2025-12"]` | numeric matches at second `/` |
| 29 | `"PROSINEC"` | `["2025-12"]` | normalize lowercases |
| 30 | `"Žluťoučký prosinec"` | `["2025-12"]` | normalize strips diacritics |
| 31 | `"Únor - květen"` | `["2026-02", ..., "2026-05"]` | range tolerates spaces around `-`, diacritics survive normalize |
| 32 | `"platba 11/2025 a leden"` | `["2025-11", "2026-01"]` | mixed natural-language |
| 33 | `"December"` | `[]` | English month names not recognized |
| 34 | `"11+12/2025 11+12/2025"` | `["2025-11", "2025-12"]` | dedup of repeated input |
| 35 | `"leden 2026"` | `["2026-01"]` | trailing year is ignored unless dot/slash separator present |
35 cases is enough to lock semantics; the M3.x corpus will pile on real-message fixtures later.
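As a rough illustration of how rows of the table translate into a Go table, the sketch below runs a handful of the standalone cases (rows 9, 10, 13, 14, 22 and 33) against a stand-in parser. Everything here is hypothetical scaffolding: `parseStandalone` covers only the standalone sub-pass on a four-name month table and skips `Normalize` and range exclusion, and `failures` stands in for what would be a `testing`-based loop with `t.Run` in the real `_test.go` file.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// Reduced, hypothetical subset of the month table.
var months = map[string]int{"leden": 1, "zari": 9, "rijen": 10, "prosinec": 12}

var standRe = regexp.MustCompile(`\b(prosinec|leden|rijen|zari)\b`)

// parseStandalone is a stand-in for the standalone sub-pass only.
func parseStandalone(text string, defaultYear int) []string {
	seen := map[string]struct{}{}
	for _, m := range standRe.FindAllStringSubmatch(text, -1) {
		month := months[m[1]]
		year := defaultYear
		if month >= 10 { // standalone heuristic: Oct-Dec means previous year
			year = defaultYear - 1
		}
		seen[fmt.Sprintf("%04d-%02d", year, month)] = struct{}{}
	}
	out := make([]string, 0, len(seen))
	for k := range seen {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

type testCase struct {
	name  string
	input string
	want  []string
}

var cases = []testCase{
	{"standalone m<10", "leden", []string{"2026-01"}},
	{"standalone m>=10", "prosinec", []string{"2025-12"}},
	{"boundary 10", "rijen", []string{"2025-10"}},
	{"just below boundary", "zari", []string{"2026-09"}},
	{"two standalones", "prosinec leden", []string{"2025-12", "2026-01"}},
	{"english not recognized", "December", nil},
}

// equal compares by length and content, so nil and empty slices are equal.
func equal(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// failures returns the names of the cases that parse gets wrong.
func failures(parse func(string, int) []string) []string {
	var failed []string
	for _, tc := range cases {
		if !equal(parse(tc.input, 2026), tc.want) {
			failed = append(failed, tc.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failing cases:", failures(parseStandalone))
}
```

Comparing by length and content rather than with `reflect.DeepEqual` avoids spurious failures when the parser returns an empty non-nil slice where the table says `[]`.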
Wire-up
- No `go.mod` changes (stdlib only).
- No CLI changes.
- `Normalize` is in the same package, so call it directly.
Critical files
- New: go/internal/domain/czech/parse_month_references.go
- New: go/internal/domain/czech/parse_month_references_test.go
- Reference (read-only): scripts/czech_utils.py — the porting source
- Reference (read-only): docs/plans/2026-05-03-2349-go-backend-rewrite.md — risk #4
- Reuses: go/internal/domain/czech/normalize.go (`Normalize` is called once at the top of `ParseMonthReferences`)
Verification
End-to-end checks before marking M2.2 done:
- `cd go && go build ./...`: clean compile.
- `cd go && go test ./internal/domain/czech/...`: all 35 table cases green.
- `cd go && go test -race ./...`: race-clean (regex compiles are global; verify no init races).
- `cd go && golangci-lint run` (or `make go-lint` from repo root): clean, gofumpt-formatted.
- Spot parity (manual, will be automated in M3.x): each test input has its expected output captured from the live Python implementation on 2026-05-05; the test table itself is the parity record. If any case diverges during implementation, re-run Python with the exact input to confirm the truth and update either the Go code or the test entry.
- `make go-build && make go-test && make go-lint` from repo root: proves M1/M2.1 gate still passes.
Branching & follow-up
Per CLAUDE.md, this is feature work → branch + Gitea MR via tea:
- Branch: `feat/m2-2-parse-month-references` off `main`.
- Single focused commit, Co-Authored-By trailer.
- Push with `-u`.
- Open MR with `tea pr create --title "feat(go/M2.2): port czech.ParseMonthReferences" --description ... --base main --head feat/m2-2-parse-month-references`. Print the MR URL for the user.
- User merges/deletes the branch in Gitea, never from the CLI.
After merge (small doc edits land straight on main per CLAUDE.md exception):
- Tick `M2.2` in the progress tracker with the merge SHA.
- Add a one-line `CHANGELOG.md` entry (timestamp via `date "+%Y-%m-%d %H:%M %Z"`).
- Record any porting surprise (e.g. an unexpected diff between Go RE2 and Python `re`) in the tracker's "Notes & decisions" section.
Next task is M2.3 domain/fees.CalculateFee — straightforward constants table; no parser semantics to debate.