Files
fuj-management/docs/plans/2026-05-04-2249-payment-name-match-exact.md
Jan Novak 81b36878b3
Some checks failed
Deploy to K8s / deploy (push) Successful in 11s
Build and Push / build (push) Successful in 7s
Build and Push / build-go (push) Failing after 5s
fix: Payment inference returns only exact-name matches when present
match_members() now short-circuits on whole-word full-name hits and
uses word-boundary regex everywhere else, so a nickname that is a
substring of another member's surname (e.g. "tov" inside "ottova")
no longer produces false positives. Adds tests/test_match_members.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-04 23:08:59 +02:00

6.2 KiB

Exact full-name match for payment inference

Context

A bank payment with the message Henrietta Ottová (Heny): 04/2026 is being inferred to two members: the correct Henrietta Ottová and the unrelated Tomáš Němeček (Tov). As a result, reconcile() splits the amount 50/50 between them, producing wrong balances.

Root cause (scripts/match_payments.py:51-115): match_members runs four substring checks via raw Python in, with no word boundaries. Tomáš's nickname Tov normalizes to tov, which is literally a substring of ottova. Check #3 (match_payments.py:79-85) treats bare nickname presence as an auto-confidence match, so Tomáš is appended even though no part of his name is actually in the message. There is also no short-circuit when a member's full canonical name appears verbatim — every other member is still scored against the same haystack.

Goal: when a member's full canonical name (diacritics-insensitive) appears in the message as whole words, return only the full-name hit(s) and skip nickname/partial scoring entirely. Additionally, harden the remaining checks with word boundaries so future substring collisions (any nickname or short name part that happens to live inside another member's surname) can't reproduce this class of bug.

Approach

Single-file change in scripts/match_payments.py. Two coordinated edits to match_members (match_payments.py:51-115):

1. Add an exact-canonical-name short-circuit (new, before the existing loop)

After computing normalized_text, do a first pass that collects every member whose normalized_base (the full name minus the parenthesized nickname, normalized) appears in the haystack as whole words. If at least one is found, return only those as auto matches and skip the rest of the function.

Implementation sketch (inserted between match_payments.py:58 and match_payments.py:61):

exact_matches = []
for name in member_names:
    variants = _build_name_variants(name)
    full_name = variants[0] if variants else ""
    if full_name and re.search(rf"\b{re.escape(full_name)}\b", normalized_text):
        exact_matches.append((name, "auto"))
if exact_matches:
    return exact_matches

This satisfies the user's primary ask: when the message literally contains the canonical name, that wins outright. Multi-member messages still work — every full-name occurrence is collected.

2. Replace remaining in normalized_text checks with \b…\b regex

For the three checks that survive the short-circuit (and the review-tier partials), swap raw in for whole-word regex so tov cannot match inside ottova, dan cannot match inside bohdan, etc. Affected lines:

Helper to keep the call sites tidy:

def _word_in(needle: str, haystack: str) -> bool:
    return bool(re.search(rf"\b{re.escape(needle)}\b", haystack))

Check #1 (line 67) becomes redundant once the short-circuit is in place, but leave it untouched as a defensive fallback in case _build_name_variants ever returns a full_name shorter than the 3-char filter would allow. (No code change there.)

3. Why this is sufficient

  • The reported message Henrietta Ottová (Heny): 04/2026 hits the new short-circuit on henrietta ottova, returns [("Henrietta Ottová", "auto")], and never even evaluates Tomáš.
  • Bare-nickname messages (e.g. Heny 04/2026) skip the short-circuit (no full name present) and fall into the existing nickname check — now word-bounded, so tov no longer collides with ottova even there.
  • Combined-payment messages listing two full names continue to work: both are collected by the short-circuit.

Files to modify

Files to read for confidence (no edits)

  • scripts/czech_utils.py — confirm normalize() semantics (NFKD strip + lowercase). Already understood; relevant because re.escape on already-normalized lowercase ASCII is safe.
  • scripts/infer_payments.py — confirm it just consumes the match_members output verbatim and writes comma-joined names. No change needed; the upstream fix propagates.
  • scripts/match_payments.py:336-362reconcile() only re-runs inference when Person is empty, so existing wrong rows in the sheet must be cleared by hand or via the manual fix/blank-cell workflow before re-running make infer.

Verification

  1. Unit test — add tests/test_match_members.py (new file, mirroring tests/test_reconcile_exceptions.py style). Cases:

    • match_members("Henrietta Ottová (Heny): 04/2026", ["Henrietta Ottová", "Tomáš Němeček (Tov)"])[("Henrietta Ottová", "auto")] only.
    • match_members("Heny 04/2026", ["Tomáš Němeček (Tov)", "Henrietta Ottová"]) → no match for Tomáš (the substring trap is closed); whatever the legitimate behavior for "Heny" is, document it.
    • Combined payment: match_members("Henrietta Ottová a Tomáš Němeček 04/2026", ["Henrietta Ottová", "Tomáš Němeček (Tov)"]) → both as auto.
    • Sanity: match_members("VS 1234 Tomáš Němeček", [...]) still returns Tomáš.
  2. Run the suite: make test.

  3. End-to-end: clear the buggy row's Person/Purpose cells in the payments sheet, then make infer, then make reconcile. Confirm the payment now allocates fully to Henrietta and balance reflects it.

  4. Changelog: per CLAUDE.md, append an entry to CHANGELOG.md once the user confirms the fix works in production. Format: ## 2026-05-04 HH:MM TZ — fix: payment inference exact-match short-circuit.