fix: Payment inference returns only exact-name matches when present

match_members() now short-circuits on whole-word full-name hits and uses word-boundary regex everywhere else, so a nickname that is a substring of another member's surname (e.g. "tov" inside "ottova") no longer produces false positives. Adds tests/test_match_members.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-04 23:08:59 +02:00
parent 97f568f49f
commit 81b36878b3
3 changed files with 161 additions and 14 deletions
--- a/docs/plans/2026-05-04-2249-payment-name-match-exact.md
+++ b/docs/plans/2026-05-04-2249-payment-name-match-exact.md
@@ -0,0 +1,81 @@
+# Exact full-name match for payment inference
+
+## Context
+
+A bank payment with the message `Henrietta Ottová (Heny): 04/2026` is being inferred to **two** members: the correct `Henrietta Ottová` *and* the unrelated `Tomáš Němeček (Tov)`. As a result, `reconcile()` splits the amount 50/50 between them, producing wrong balances.
+
+**Root cause** (`scripts/match_payments.py:51-115`): `match_members` runs four substring checks via raw Python `in`, with no word boundaries. Tomáš's nickname `Tov` normalizes to `tov`, which is literally a substring of `ottova`. Check #3 (`match_payments.py:79-85`) treats bare nickname presence as an `auto`-confidence match, so Tomáš is appended even though no part of his name is actually in the message. There is also no short-circuit when a member's full canonical name appears verbatim — every other member is still scored against the same haystack.
+
+**Goal:** when a member's full canonical name (diacritics-insensitive) appears in the message as whole words, return only the full-name hit(s) and skip nickname/partial scoring entirely. Additionally, harden the remaining checks with word boundaries so future substring collisions (any nickname or short name part that happens to live inside another member's surname) can't reproduce this class of bug.
+
+## Approach
+
+Single-file change in [scripts/match_payments.py](scripts/match_payments.py). Two coordinated edits to `match_members` (`match_payments.py:51-115`):
+
+### 1. Add an exact-canonical-name short-circuit (new, before the existing loop)
+
+After computing `normalized_text`, do a first pass that collects every member whose `normalized_base` (the full name minus the parenthesized nickname, normalized) appears in the haystack as **whole words**. If at least one is found, return *only* those as `auto` matches and skip the rest of the function.
+
+Implementation sketch (inserted between [match_payments.py:58](scripts/match_payments.py#L58) and [match_payments.py:61](scripts/match_payments.py#L61)):
+
+```python
+exact_matches = []
+for name in member_names:
+    variants = _build_name_variants(name)
+    full_name = variants[0] if variants else ""
+    if full_name and re.search(rf"\b{re.escape(full_name)}\b", normalized_text):
+        exact_matches.append((name, "auto"))
+if exact_matches:
+    return exact_matches
+```
+
+This satisfies the user's primary ask: when the message literally contains the canonical name, that wins outright. Multi-member messages still work — every full-name occurrence is collected.
+
+### 2. Replace remaining `in normalized_text` checks with `\b…\b` regex
+
+For the checks that remain after the short-circuit (first+last name, nickname, and the `review`-tier partials), swap raw `in` for whole-word regex so `tov` cannot match inside `ottova`, `dan` cannot match inside `bohdan`, etc. Affected lines:
+
+- [match_payments.py:73](scripts/match_payments.py#L73) — first+last name both present
+- [match_payments.py:82](scripts/match_payments.py#L82) — nickname presence
+- [match_payments.py:94](scripts/match_payments.py#L94) — last-name partial (`review`)
+- [match_payments.py:99](scripts/match_payments.py#L99) — first-name partial (`review`)
+- [match_payments.py:104](scripts/match_payments.py#L104) — single-name member partial
+
+Helper to keep the call sites tidy:
+
+```python
+def _word_in(needle: str, haystack: str) -> bool:
+    return bool(re.search(rf"\b{re.escape(needle)}\b", haystack))
+```
+
+Check #1 (line 67) becomes redundant once the short-circuit is in place, but leave it untouched as a defensive fallback in case `_build_name_variants` ever returns a `full_name` shorter than the 3-char filter would allow. (No code change there.)
+
+### 3. Why this is sufficient
+
+- The reported message `Henrietta Ottová (Heny): 04/2026` hits the new short-circuit on `henrietta ottova`, returns `[("Henrietta Ottová", "auto")]`, and never even evaluates Tomáš.
+- Bare-nickname messages (e.g. `Heny 04/2026`) skip the short-circuit (no full name present) and fall into the existing nickname check — now word-bounded, so `tov` no longer collides with `ottova` even there.
+- Combined-payment messages listing two full names continue to work: both are collected by the short-circuit.
+
+### Files to modify
+
+- [scripts/match_payments.py](scripts/match_payments.py) — only `match_members` (lines 51-115). Add `_word_in` helper just above it.
+
+### Files to read for confidence (no edits)
+
+- [scripts/czech_utils.py](scripts/czech_utils.py) — confirm `normalize()` semantics (NFKD strip + lowercase). Already understood; relevant because `re.escape` on already-normalized lowercase ASCII is safe.
+- [scripts/infer_payments.py](scripts/infer_payments.py) — confirm it just consumes the `match_members` output verbatim and writes comma-joined names. No change needed; the upstream fix propagates.
+- [scripts/match_payments.py:336-362](scripts/match_payments.py#L336-L362) — `reconcile()` only re-runs inference when `Person` is empty, so existing wrong rows in the sheet must be cleared by hand or via the `manual fix`/blank-cell workflow before re-running `make infer`.
+
+## Verification
+
+1. **Unit test** — add `tests/test_match_members.py` (new file, mirroring `tests/test_reconcile_exceptions.py` style). Cases:
+   - `match_members("Henrietta Ottová (Heny): 04/2026", ["Henrietta Ottová", "Tomáš Němeček (Tov)"])` → `[("Henrietta Ottová", "auto")]` only.
+   - `match_members("Heny 04/2026", ["Tomáš Němeček (Tov)", "Henrietta Ottová"])` → no match for Tomáš (the substring trap is closed); whatever the legitimate behavior for "Heny" is, document it.
+   - Combined payment: `match_members("Henrietta Ottová a Tomáš Němeček 04/2026", ["Henrietta Ottová", "Tomáš Němeček (Tov)"])` → both as `auto`.
+   - Sanity: `match_members("VS 1234 Tomáš Němeček", [...])` still returns Tomáš.
+
+2. **Run the suite**: `make test`.
+
+3. **End-to-end**: clear the buggy row's `Person`/`Purpose` cells in the payments sheet, then `make infer`, then `make reconcile`. Confirm the payment now allocates fully to Henrietta and balance reflects it.
+
+4. **Changelog**: per [CLAUDE.md](CLAUDE.md), append an entry to [CHANGELOG.md](CHANGELOG.md) once the user confirms the fix works in production. Format: `## 2026-05-04 HH:MM TZ — fix: payment inference exact-match short-circuit`.
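The exact-name short-circuit described in the plan above can be tried in isolation. The following is a minimal sketch, not the repository code: `normalize_sketch` is a stand-in for `czech_utils.normalize()` (assumed to be NFKD diacritic stripping plus lowercasing, as the plan states), and `base_name` and `exact_name_hits` are hypothetical names that approximate what `_build_name_variants(name)[0]` and the first pass of `match_members` are described as doing.

```python
import re
import unicodedata


def normalize_sketch(text: str) -> str:
    """Stand-in for czech_utils.normalize(): strip diacritics (NFKD) and lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def base_name(member: str) -> str:
    """Hypothetical helper: full name minus the parenthesized nickname, normalized."""
    return normalize_sketch(re.sub(r"\s*\([^)]*\)", "", member)).strip()


def exact_name_hits(text: str, members: list[str]) -> list[tuple[str, str]]:
    """First pass only: collect members whose full name appears as whole words."""
    haystack = normalize_sketch(text)
    hits = []
    for member in members:
        full = base_name(member)
        if full and re.search(rf"\b{re.escape(full)}\b", haystack):
            hits.append((member, "auto"))
    return hits


members = ["Henrietta Ottová", "Tomáš Němeček (Tov)"]

# The reported message: only Henrietta's full name is present.
single = exact_name_hits("Henrietta Ottová (Heny): 04/2026", members)

# Combined payment: both full names are collected.
combined = exact_name_hits("Henrietta Ottová a Tomáš Němeček 04/2026", members)

# Nickname-only message: no full name, so the caller would fall through
# to the word-bounded nickname/partial checks.
nickname_only = exact_name_hits("Heny 04/2026", members)
```

An empty result from the first pass is the signal to continue with the remaining checks; a non-empty result is returned as-is, which is why the nickname `Tov` is never evaluated for the reported message.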
--- a/scripts/match_payments.py
+++ b/scripts/match_payments.py
@@ -48,6 +48,11 @@ def _build_name_variants(name: str) -> list[str]:
    return [v for v in variants if len(v) >= 3]


+def _word_in(needle: str, haystack: str) -> bool:
+    """Return True if needle appears as a whole word in haystack."""
+    return bool(re.search(rf"\b{re.escape(needle)}\b", haystack))
+
+
 def match_members(
    text: str, member_names: list[str]
 ) -> list[tuple[str, str]]:
@@ -56,6 +61,19 @@ def match_members(
    Returns list of (member_name, confidence) where confidence is 'auto' or 'review'.
    """
    normalized_text = normalize(text)
+
+    # Short-circuit: if any member's full canonical name appears verbatim (whole words),
+    # return only those matches and skip all fuzzy/nickname checks. This prevents a
+    # nickname that is a substring of another member's surname from producing false hits.
+    exact_matches = []
+    for name in member_names:
+        variants = _build_name_variants(name)
+        full_name = variants[0] if variants else ""
+        if full_name and _word_in(full_name, normalized_text):
+            exact_matches.append((name, "auto"))
+    if exact_matches:
+        return exact_matches
+
    matches = []

    for name in member_names:
@@ -70,17 +88,16 @@ def match_members(

        # 2. Both first and last name present (any order) = high confidence
        if len(parts) >= 2:
-            if parts[0] in normalized_text and parts[-1] in normalized_text:
+            if _word_in(parts[0], normalized_text) and _word_in(parts[-1], normalized_text):
                matches.append((name, "auto"))
                continue

-        # 3. Nickname + one part of the name = high confidence
+        # 3. Nickname present = high confidence
        nickname = ""
        nickname_match = re.search(r"\(([^)]+)\)", name)
        if nickname_match:
            nickname = normalize(nickname_match.group(1))
-            if nickname and nickname in normalized_text:
-                # Nickname alone is often enough, but let's check if it's combined with a name part
+            if nickname and _word_in(nickname, normalized_text):
                matches.append((name, "auto"))
                continue

@@ -90,18 +107,15 @@ def match_members(
            last_name = parts[-1]
            _COMMON_SURNAMES = {"novak", "novakova", "prach"}

-            # Match last name
-            if len(last_name) >= 4 and last_name not in _COMMON_SURNAMES and last_name in normalized_text:
+            if len(last_name) >= 4 and last_name not in _COMMON_SURNAMES and _word_in(last_name, normalized_text):
                matches.append((name, "review"))
                continue

-            # Match first name (if not too short)
-            if len(first_name) >= 3 and first_name in normalized_text:
+            if len(first_name) >= 3 and _word_in(first_name, normalized_text):
                matches.append((name, "review"))
                continue
        elif len(parts) == 1:
-            # Single name member
-            if len(parts[0]) >= 4 and parts[0] in normalized_text:
+            if len(parts[0]) >= 4 and _word_in(parts[0], normalized_text):
                matches.append((name, "review"))
                continue

@@ -109,7 +123,6 @@ def match_members(
    # If we have any "auto" matches, discard all "review" matches
    auto_matches = [m for m in matches if m[1] == "auto"]
    if auto_matches:
-        # If multiple auto matches, keep them (ambiguous but high priority)
        return auto_matches

    return matches
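The difference between the old `in` checks and the new `_word_in` helper in the diff above can be seen on the message from the bug report. This is an illustrative sketch: `_word_in` is copied from the diff, while the `message` string is assumed to be what `normalize()` would produce for the reported payment text.

```python
import re


def _word_in(needle: str, haystack: str) -> bool:
    """Same helper as in the diff: whole-word search on normalized text."""
    return bool(re.search(rf"\b{re.escape(needle)}\b", haystack))


# Assumed normalized form (lowercase, diacritics stripped) of the reported message.
message = "henrietta ottova (heny): 04/2026"

# Old behavior: a raw substring test finds "tov" inside "ottova".
substring_hit = "tov" in message

# New behavior: "tov" is not a whole word anywhere in the message.
word_hit = _word_in("tov", message)

# Parentheses and colons are non-word characters, so \b still matches around "heny".
paren_hit = _word_in("heny", message)

# The second collision named in the plan: "dan" inside "bohdan".
dan_hit = _word_in("dan", "platba bohdan")

# Digits and underscores count as word characters, so a nickname glued to them
# is not treated as a whole word.
glued_hit = _word_in("tov", "tov_2026")
```

One consequence of relying on `\b` is visible in the last case: a nickname written directly against digits or an underscore no longer matches, which is stricter than the previous substring behavior.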
--- a/tests/test_match_members.py
+++ b/tests/test_match_members.py
@@ -0,0 +1,53 @@
+import unittest
+from scripts.match_payments import match_members
+
+
+MEMBERS = [
+    "Henrietta Ottová",
+    "Tomáš Němeček (Tov)",
+    "František Vrbík (Štrúdl)",
+    "Jana Nováková",
+]
+
+
+class TestMatchMembersExact(unittest.TestCase):
+    def test_full_name_in_message_returns_only_that_member(self):
+        # "tov" is a substring of "ottova" — the old code returned both members
+        result = match_members("Henrietta Ottová (Heny): 04/2026", MEMBERS)
+        names = [r[0] for r in result]
+        self.assertEqual(names, ["Henrietta Ottová"])
+        self.assertTrue(all(conf == "auto" for _, conf in result))
+
+    def test_nickname_tov_not_matched_inside_ottova(self):
+        # Surname-only message must NOT match Tomáš via "tov" inside "ottova"
+        result = match_members("platba ottova 04/2026", MEMBERS)
+        names = [r[0] for r in result]
+        self.assertNotIn("Tomáš Němeček (Tov)", names)
+
+    def test_combined_payment_two_full_names(self):
+        result = match_members("Henrietta Ottová a Tomáš Němeček 04/2026", MEMBERS)
+        names = [r[0] for r in result]
+        self.assertIn("Henrietta Ottová", names)
+        self.assertIn("Tomáš Němeček (Tov)", names)
+        self.assertTrue(all(conf == "auto" for _, conf in result))
+
+    def test_nickname_alone_still_matches_correctly(self):
+        # "Tov" as a whole word should still match Tomáš (no full name in the text, so the short-circuit does not fire)
+        result = match_members("Tov platba 04/2026", MEMBERS)
+        names = [r[0] for r in result]
+        self.assertIn("Tomáš Němeček (Tov)", names)
+
+    def test_full_name_no_diacritics_still_matches(self):
+        result = match_members("Henrietta Ottova 04/2026", MEMBERS)
+        names = [r[0] for r in result]
+        self.assertIn("Henrietta Ottová", names)
+        self.assertNotIn("Tomáš Němeček (Tov)", names)
+
+    def test_first_last_name_present_any_order(self):
+        result = match_members("Platba od Nemeček Tomas 04/2026", MEMBERS)
+        names = [r[0] for r in result]
+        self.assertIn("Tomáš Němeček (Tov)", names)
+
+
+if __name__ == "__main__":
+    unittest.main()