feat(go/M2.2): port czech.ParseMonthReferences

Three-pass regex parser matching python/czech_utils.py parse_month_references:
1. Numeric slash notation — "11+12/2025", "01/26"; 2-digit year → +2000
2. Dot notation — "12.2025" (4-digit year only)
3. Czech month names — range walk (listopad-leden wrap logic) then
   standalone with m≥10 → defaultYear-1 heuristic; longest-match
   alternation (sorted desc by name length) handles cervenec vs cerven

35 table-driven tests, all expected outputs verified against live Python
on 2026-05-05 before locking. Plan at
docs/plans/2026-05-05-2337-go-rewrite-m2-2-parse-month-references.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-06 00:05:40 +02:00

Plan: Go rewrite — M2.2 `domain/czech.ParseMonthReferences`

Context

M2.1 (domain/czech.Normalize) merged via PR #4 (d9a61b3) on 2026-05-05. Per the progress tracker, M2.2 is next: port parse_month_references from scripts/czech_utils.py to Go as internal/domain/czech.ParseMonthReferences.

This function is the second-most-load-bearing pure helper after reconcile: every payment-message → month inference goes through it. Risk #4 in the parent plan specifically calls out its semantics — wrap-around year inference and the m >= 10 → previous year standalone heuristic — as easy to mis-port.

This plan locks the test table against the live Python implementation before coding, so the Go port has a verified parity baseline even before the M3.1/M3.2 fixture infrastructure exists.

Scope

New file go/internal/domain/czech/parse_month_references.go in the existing czech package (alongside normalize.go).
New file go/internal/domain/czech/parse_month_references_test.go with the test table below.
Out of scope: parity-fixture wiring (M3.1/M3.2); CLI hook-up (M2.11/M2.12); any consumer call-sites.
No new dependencies — stdlib regexp, sort, strconv, strings plus the existing czech.Normalize cover everything.

Recommended approach

Python contract to mirror

Three regex passes, all run on normalize(text):

1. ([\d+]+)\s*/\s*(\d{2,4}): captures "11+12/2025", "01/26", "1/26". Split the months part on +, keep digit-only tokens, validate 1..12. Year < 100 → year + 2000.
2. (\d{1,2})\s*\.\s*(\d{4}): captures "12.2025". 4-digit year only (so "1.26" does not match).
3. Czech month names, in two sub-passes.
   Range sub-pass: (name)\s*-\s*(name) finds pairs; walk start→end with m % 12 + 1, stopping when m == end_m. Wrap rule: if start_m > end_m, months >= start_m are defaultYear - 1, the rest are defaultYear. Both matched names go into a foundInRanges set.
   Standalone sub-pass: \b(name)\b, skipping any name in foundInRanges. For each remaining match, m >= 10 → defaultYear - 1, else defaultYear.

Output: sorted, deduplicated []string of "YYYY-MM".
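
As a rough illustration of pass 1 and the output contract, the sketch below collects slash-notation matches into a set and returns them sorted. parseNumeric and sortedKeys are hypothetical helper names, not part of the plan, and the input is assumed to be already normalized.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
	"strings"
)

// numericRe is the slash-notation pattern quoted in the plan.
var numericRe = regexp.MustCompile(`([\d+]+)\s*/\s*(\d{2,4})`)

// parseNumeric (hypothetical helper) adds every valid "YYYY-MM" found by
// pass 1 to seen. text is assumed to be already normalized.
func parseNumeric(text string, seen map[string]struct{}) {
	for _, m := range numericRe.FindAllStringSubmatch(text, -1) {
		year, err := strconv.Atoi(m[2])
		if err != nil {
			continue
		}
		if year < 100 {
			year += 2000 // "26" -> 2026
		}
		for _, tok := range strings.Split(m[1], "+") {
			month, err := strconv.Atoi(tok)
			if err != nil || month < 1 || month > 12 {
				continue // empty token or out-of-range month
			}
			seen[fmt.Sprintf("%04d-%02d", year, month)] = struct{}{}
		}
	}
}

// sortedKeys (hypothetical helper) returns the set as a sorted slice.
// "YYYY-MM" strings with 4-digit years sort chronologically as plain text.
func sortedKeys(seen map[string]struct{}) []string {
	out := make([]string, 0, len(seen))
	for k := range seen {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}
```

Using a set keyed by the formatted string gives deduplication across all three passes for free; each pass only needs to insert into the same map before the final sort.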

Go signature

package czech

// ParseMonthReferences extracts YYYY-MM month references from Czech
// free text. defaultYear seeds two heuristics: standalone month names
// with m >= 10 are treated as defaultYear-1 (out-of-year backfill), and
// wrap-around ranges (e.g. listopad-leden) place months >= start in
// defaultYear-1.
func ParseMonthReferences(text string, defaultYear int) []string

defaultYear is a required argument: Go has no default parameter values, so every caller passes it explicitly.

Implementation sketch

var czechMonths = map[string]int{
    "leden": 1, "ledna": 1, "lednu": 1,
    "unor": 2, "unora": 2, "unoru": 2,
    "brezen": 3, "brezna": 3, "breznu": 3,
    "duben": 4, "dubna": 4, "dubnu": 4,
    "kveten": 5, "kvetna": 5, "kvetnu": 5,
    "cerven": 6, "cervna": 6, "cervnu": 6,
    "cervenec": 7, "cervence": 7, "cervenci": 7,
    "srpen": 8, "srpna": 8, "srpnu": 8,
    "zari": 9,
    "rijen": 10, "rijna": 10, "rijnu": 10,
    "listopad": 11, "listopadu": 11,
    "prosinec": 12, "prosince": 12, "prosinci": 12,
}

// Sorted by descending length at init, so longer alternatives win in
// the regex (e.g. "cervenec" beats "cerven"). Mirrors Python's
// sorted(..., key=len, reverse=True).
var monthNameAlt = buildMonthNameAlt()

var (
    numericRe = regexp.MustCompile(`([\d+]+)\s*/\s*(\d{2,4})`)
    dotRe     = regexp.MustCompile(`(\d{1,2})\s*\.\s*(\d{4})`)
    rangeRe   = regexp.MustCompile(`(` + monthNameAlt + `)\s*-\s*(` + monthNameAlt + `)`)
    standRe   = regexp.MustCompile(`\b(` + monthNameAlt + `)\b`)
)
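
To make the interaction between the range and standalone sub-passes concrete, here is a minimal sketch of the standalone side only. It uses an illustrative five-name subset of the month map with a hard-coded pattern; sampleMonths, sampleStandRe and standaloneMonths are hypothetical names, and the input is assumed to be already normalized.

```go
package main

import (
	"fmt"
	"regexp"
)

// sampleMonths is an illustrative subset of the full month map.
var sampleMonths = map[string]int{
	"leden": 1, "zari": 9, "rijen": 10, "listopad": 11, "prosinec": 12,
}

// sampleStandRe hard-codes the subset, longest names first.
var sampleStandRe = regexp.MustCompile(`\b(listopad|prosinec|leden|rijen|zari)\b`)

// standaloneMonths (hypothetical helper) applies the standalone heuristic:
// names already consumed by a range are skipped by name, and months >= 10
// are placed in defaultYear-1.
func standaloneMonths(text string, foundInRanges map[string]bool, defaultYear int) []string {
	var out []string
	for _, m := range sampleStandRe.FindAllStringSubmatch(text, -1) {
		name := m[1]
		if foundInRanges[name] {
			continue // the range sub-pass already handled this name
		}
		month := sampleMonths[name]
		year := defaultYear
		if month >= 10 {
			year = defaultYear - 1
		}
		out = append(out, fmt.Sprintf("%04d-%02d", year, month))
	}
	return out
}
```

Note that the skip is by name rather than by match position, as the contract describes, so a second free-standing occurrence of a name that also appears in a range is skipped as well.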

Three Go-specific gotchas worth a code comment:

RE2 alternation is leftmost-first, same as Python re. Sorting month names by descending length is therefore necessary (otherwise "cervenec" matches as "cerven" + leftover "ec"). Mirror the Python sort exactly.
Map iteration is randomized in Go. Build the alternation list from a sorted slice of keys, not by iterating the map.
\d and \b in Go RE2 are ASCII-only, which matches the effective behavior on Normalize'd input (NFKD already collapsed any Unicode digits/letters that would matter; standalone Devanagari digits in member messages aren't a real-world concern).
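
The first two gotchas can be sketched together. The block below builds the alternation from a sorted slice of keys and shows what leftmost-first matching does when the order is wrong. sampleNames, buildAlt and rangeEnd are hypothetical names used only for illustration, and the alphabetical tie-break for equal-length names is an assumption added here for determinism (names of equal length cannot be prefixes of one another, so it does not affect matching).

```go
package main

import (
	"regexp"
	"sort"
	"strings"
)

// sampleNames is an illustrative subset of the full month map.
var sampleNames = map[string]int{"cerven": 6, "cervenec": 7, "srpen": 8, "zari": 9}

// buildAlt (hypothetical helper) returns the map keys joined with "|",
// longest first. Keys are copied into a slice and sorted so the result
// does not depend on Go's randomized map iteration order.
func buildAlt(months map[string]int) string {
	names := make([]string, 0, len(months))
	for name := range months {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		if len(names[i]) != len(names[j]) {
			return len(names[i]) > len(names[j])
		}
		return names[i] < names[j] // assumed tie-break, for determinism only
	})
	return strings.Join(names, "|")
}

// rangeEnd (hypothetical helper) returns the second name captured by a
// range pattern built from alt, or "" when nothing matches.
func rangeEnd(alt, text string) string {
	re := regexp.MustCompile(`(` + alt + `)\s*-\s*(` + alt + `)`)
	m := re.FindStringSubmatch(text)
	if m == nil {
		return ""
	}
	return m[2]
}
```

With the sorted alternation, rangeEnd(buildAlt(sampleNames), "srpen-cervenec") captures "cervenec". With a hand-written unsorted alternation such as "cerven|cervenec|srpen|zari", the same input captures "cerven", because nothing after the second group forces the engine to try the longer alternative.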

The walk loop uses a bounded counter (max 12 iterations) defensively in Go; Python's while True is fine because every range terminates within 12 hops, but a future reader appreciates the bound.
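
A minimal sketch of that bounded walk, including the wrap rule, might look like the following. walkRange is a hypothetical helper name, and both month arguments are assumed to be in 1..12.

```go
package main

import "fmt"

// walkRange (hypothetical helper) lists the months from startM to endM
// inclusive as "YYYY-MM". When the range wraps past December
// (startM > endM), months >= startM fall in defaultYear-1 and the rest
// in defaultYear. The loop is capped at 12 iterations as a defensive bound.
func walkRange(startM, endM, defaultYear int) []string {
	wraps := startM > endM
	var out []string
	m := startM
	for i := 0; i < 12; i++ {
		year := defaultYear
		if wraps && m >= startM {
			year = defaultYear - 1
		}
		out = append(out, fmt.Sprintf("%04d-%02d", year, m))
		if m == endM {
			break
		}
		m = m%12 + 1 // 12 -> 1, otherwise m+1
	}
	return out
}
```

For example, walkRange(11, 1, 2026) yields 2025-11, 2025-12, 2026-01, while a non-wrapping range such as walkRange(2, 5, 2026) stays entirely in 2026.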

Test table (verified against live Python — `default_year=2026`)

Locked outputs from PYTHONPATH=scripts:. python -c 'from czech_utils import parse_month_references; print(parse_month_references(<input>, 2026))' on 2026-05-05.

#	Input	Expected	Path exercised
1	`""`	`[]`	empty
2	`"11+12/2025"`	`["2025-11", "2025-12"]`	numeric, plus-split
3	`"1/2026"`	`["2026-01"]`	numeric, single
4	`"01/26"`	`["2026-01"]`	2-digit year normalization
5	`"11+12/25"`	`["2025-11", "2025-12"]`	plus-split + 2-digit year
6	`"12+1+2/2026"`	`["2026-01", "2026-02", "2026-12"]`	sorting
7	`"12.2025"`	`["2025-12"]`	dot pattern
8	`"1.26"`	`[]`	dot pattern requires 4-digit year
9	`"leden"`	`["2026-01"]`	standalone, m<10
10	`"prosinec"`	`["2025-12"]`	standalone, m≥10 → previous year
11	`"prosince"`	`["2025-12"]`	declension
12	`"lednu"`	`["2026-01"]`	declension
13	`"rijen"`	`["2025-10"]`	m≥10 boundary (10 itself)
14	`"zari"`	`["2026-09"]`	m<10 just below boundary
15	`"listopad-leden"`	`["2025-11", "2025-12", "2026-01"]`	wrap range Nov→Jan
16	`"rijen-leden"`	`["2025-10", "2025-11", "2025-12", "2026-01"]`	wrap from October
17	`"unor-kveten"`	`["2026-02", "2026-03", "2026-04", "2026-05"]`	non-wrap range
18	`"leden-leden"`	`["2026-01"]`	degenerate range
19	`"unor-listopad"`	`["2026-02", ..., "2026-11"]` (10 entries)	range spans m≥10 — heuristic does NOT fire (range exclusion)
20	`"cervenec-srpen"`	`["2026-07", "2026-08"]`	longest-match alt (`cervenec` not `cerven`+`ec`)
21	`"listopad-leden, prosinec"`	`["2025-11", "2025-12", "2026-01"]`	range + standalone, dedup
22	`"prosinec leden"`	`["2025-12", "2026-01"]`	two standalones, no range
23	`"11+12/2025, leden-brezen"`	`["2025-11", "2025-12", "2026-01", "2026-02", "2026-03"]`	numeric + range mix
24	`"11+12/25 a listopad"`	`["2025-11", "2025-12"]`	dedup across passes
25	`"prosince/2025"`	`["2025-12"]`	numeric pattern fails (no digits before `/`); standalone fires
26	`"listopad-prosinec/2025"`	`["2026-11", "2026-12"]`	range wins and, being non-wrap, uses defaultYear; numeric pattern fails (no digits before `/`), so the `/2025` is ignored
27	`"01.2026 / 02.2026"`	`["2026-01", "2026-02"]`	dot pattern only; numeric matches `(2026, 02)` but month 2026 is out of range
28	`"/12/2025"`	`["2025-12"]`	numeric matches at second `/`
29	`"PROSINEC"`	`["2025-12"]`	normalize lowercases
30	`"Žluťoučký prosinec"`	`["2025-12"]`	normalize strips diacritics
31	`"Únor - květen"`	`["2026-02", ..., "2026-05"]`	range tolerates spaces around `-`; input with diacritics still matches after normalize strips them
32	`"platba 11/2025 a leden"`	`["2025-11", "2026-01"]`	mixed natural-language
33	`"December"`	`[]`	English month names not recognized
34	`"11+12/2025 11+12/2025"`	`["2025-11", "2025-12"]`	dedup of repeated input
35	`"leden 2026"`	`["2026-01"]`	trailing year is ignored unless dot/slash separator present

35 cases is enough to lock semantics; the M3.x corpus will pile on real-message fixtures later.
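
One possible way to encode the table is sketched below with three rows copied from it (1, 4 and 15). monthCase, sampleCases, equalStrings and checkCases are hypothetical names, and the parser is passed in as a function value so the sketch stands on its own without the real implementation.

```go
package main

import "fmt"

// monthCase is one row of the table: input text and expected output.
type monthCase struct {
	input string
	want  []string
}

// sampleCases copies rows 1, 4 and 15 from the table above.
var sampleCases = []monthCase{
	{"", nil},
	{"01/26", []string{"2026-01"}},
	{"listopad-leden", []string{"2025-11", "2025-12", "2026-01"}},
}

// equalStrings compares two slices element by element; nil and empty
// slices are treated as equal so `[]` rows match either form.
func equalStrings(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// checkCases (hypothetical helper) runs parse over every case with
// defaultYear 2026 and returns one message per mismatch.
func checkCases(parse func(string, int) []string, cases []monthCase) []string {
	var failures []string
	for _, c := range cases {
		got := parse(c.input, 2026)
		if !equalStrings(got, c.want) {
			failures = append(failures, fmt.Sprintf("%q: got %v, want %v", c.input, got, c.want))
		}
	}
	return failures
}
```

In the actual test file the same table would more likely be driven through the standard testing package with one subtest per row; the plain function here only keeps the sketch runnable in isolation.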

Wire-up

No go.mod changes (stdlib only).
No CLI changes.
Normalize is in the same package, so call it directly.

Critical files

New: go/internal/domain/czech/parse_month_references.go
New: go/internal/domain/czech/parse_month_references_test.go
Reference (read-only): scripts/czech_utils.py — the porting source
Reference (read-only): docs/plans/2026-05-03-2349-go-backend-rewrite.md — risk #4
Reuses: go/internal/domain/czech/normalize.go — Normalize is called once at the top of ParseMonthReferences

Verification

End-to-end checks before marking M2.2 done:

cd go && go build ./... — clean compile.
cd go && go test ./internal/domain/czech/... — all 35 table cases green.
cd go && go test -race ./... — race-clean (regex compiles are global; verify no init races).
cd go && golangci-lint run (or make go-lint from repo root) — clean, gofumpt-formatted.
Spot parity (manual, will be automated in M3.x): each test input has its expected output captured from the live Python implementation on 2026-05-05; the test table itself is the parity record. If any case diverges during implementation, re-run Python with the exact input to confirm the truth and update either the Go code or the test entry.
make go-build && make go-test && make go-lint from repo root — proves M1/M2.1 gate still passes.

Branching & follow-up

Per CLAUDE.md, this is feature work → branch + Gitea MR via tea:

Branch: feat/m2-2-parse-month-references off main.
Single focused commit, Co-Authored-By trailer.
Push with -u.
Open MR with tea pr create --title "feat(go/M2.2): port czech.ParseMonthReferences" --description ... --base main --head feat/m2-2-parse-month-references. Print the MR URL for the user.
User merges/deletes the branch in Gitea — never from the CLI.

After merge (small doc edits land straight on main per CLAUDE.md exception):

Tick M2.2 in the progress tracker with the merge SHA.
Add a one-line CHANGELOG.md entry (timestamp via date "+%Y-%m-%d %H:%M %Z").
Record any porting surprise (e.g. an unexpected diff between Go RE2 and Python re) in the tracker's "Notes & decisions" section.

Next task is M2.3 domain/fees.CalculateFee — straightforward constants table; no parser semantics to debate.
