match_members() now short-circuits on whole-word full-name hits and uses word-boundary regex everywhere else, so a nickname that is a substring of another member's surname (e.g. "tov" inside "ottova") no longer produces false positives. Adds tests/test_match_members.py. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
82 lines
6.2 KiB
Markdown
82 lines
6.2 KiB
Markdown
# Exact full-name match for payment inference
|
|
|
|
## Context
|
|
|
|
A bank payment with the message `Henrietta Ottová (Heny): 04/2026` is being inferred to **two** members: the correct `Henrietta Ottová` *and* the unrelated `Tomáš Němeček (Tov)`. As a result, `reconcile()` splits the amount 50/50 between them, producing wrong balances.
|
|
|
|
**Root cause** (`scripts/match_payments.py:51-115`): `match_members` runs four substring checks via raw Python `in`, with no word boundaries. Tomáš's nickname `Tov` normalizes to `tov`, which is literally a substring of `ottova`. Check #3 (`match_payments.py:79-85`) treats bare nickname presence as an `auto`-confidence match, so Tomáš is appended even though no part of his name is actually in the message. There is also no short-circuit when a member's full canonical name appears verbatim — every other member is still scored against the same haystack.
|
|
|
|
**Goal:** when a member's full canonical name (diacritics-insensitive) appears in the message as whole words, return only the full-name hit(s) and skip nickname/partial scoring entirely. Additionally, harden the remaining checks with word boundaries so future substring collisions (any nickname or short name part that happens to live inside another member's surname) can't reproduce this class of bug.
|
|
|
|
## Approach
|
|
|
|
Single-file change in [scripts/match_payments.py](scripts/match_payments.py). Two coordinated edits to `match_members` (`match_payments.py:51-115`):
|
|
|
|
### 1. Add an exact-canonical-name short-circuit (new, before the existing loop)
|
|
|
|
After computing `normalized_text`, do a first pass that collects every member whose `normalized_base` (the full name minus the parenthesized nickname, normalized) appears in the haystack as **whole words**. If at least one is found, return *only* those as `auto` matches and skip the rest of the function.
|
|
|
|
Implementation sketch (inserted between [match_payments.py:58](scripts/match_payments.py#L58) and [match_payments.py:61](scripts/match_payments.py#L61)):
|
|
|
|
```python
|
|
exact_matches = []
|
|
for name in member_names:
|
|
variants = _build_name_variants(name)
|
|
full_name = variants[0] if variants else ""
|
|
if full_name and re.search(rf"\b{re.escape(full_name)}\b", normalized_text):
|
|
exact_matches.append((name, "auto"))
|
|
if exact_matches:
|
|
return exact_matches
|
|
```
|
|
|
|
This satisfies the user's primary ask: when the message literally contains the canonical name, that wins outright. Multi-member messages still work — every full-name occurrence is collected.
|
|
|
|
### 2. Replace remaining `in normalized_text` checks with `\b…\b` regex
|
|
|
|
For the three checks that survive the short-circuit (and the `review`-tier partials), swap raw `in` for whole-word regex so `tov` cannot match inside `ottova`, `dan` cannot match inside `bohdan`, etc. Affected lines:
|
|
|
|
- [match_payments.py:73](scripts/match_payments.py#L73) — first+last name both present
|
|
- [match_payments.py:82](scripts/match_payments.py#L82) — nickname presence
|
|
- [match_payments.py:94](scripts/match_payments.py#L94) — last-name partial (`review`)
|
|
- [match_payments.py:99](scripts/match_payments.py#L99) — first-name partial (`review`)
|
|
- [match_payments.py:104](scripts/match_payments.py#L104) — single-name member partial
|
|
|
|
Helper to keep the call sites tidy:
|
|
|
|
```python
|
|
def _word_in(needle: str, haystack: str) -> bool:
|
|
return bool(re.search(rf"\b{re.escape(needle)}\b", haystack))
|
|
```
|
|
|
|
Check #1 (line 67) becomes redundant once the short-circuit is in place, but leave it untouched as a defensive fallback in case `_build_name_variants` ever returns a `full_name` shorter than the 3-char filter would allow. (No code change there.)
|
|
|
|
### 3. Why this is sufficient
|
|
|
|
- The reported message `Henrietta Ottová (Heny): 04/2026` hits the new short-circuit on `henrietta ottova`, returns `[("Henrietta Ottová", "auto")]`, and never even evaluates Tomáš.
|
|
- Bare-nickname messages (e.g. `Heny 04/2026`) skip the short-circuit (no full name present) and fall into the existing nickname check — now word-bounded, so `tov` no longer collides with `ottova` even there.
|
|
- Combined-payment messages listing two full names continue to work: both are collected by the short-circuit.
|
|
|
|
### Files to modify
|
|
|
|
- [scripts/match_payments.py](scripts/match_payments.py) — only `match_members` (lines 51-115). Add `_word_in` helper just above it.
|
|
|
|
### Files to read for confidence (no edits)
|
|
|
|
- [scripts/czech_utils.py](scripts/czech_utils.py) — confirm `normalize()` semantics (NFKD strip + lowercase). Already understood; relevant because `re.escape` on already-normalized lowercase ASCII is safe.
|
|
- [scripts/infer_payments.py](scripts/infer_payments.py) — confirm it just consumes the `match_members` output verbatim and writes comma-joined names. No change needed; the upstream fix propagates.
|
|
- [scripts/match_payments.py:336-362](scripts/match_payments.py#L336-L362) — `reconcile()` only re-runs inference when `Person` is empty, so existing wrong rows in the sheet must be cleared by hand or via the `manual fix`/blank-cell workflow before re-running `make infer`.
|
|
|
|
## Verification
|
|
|
|
1. **Unit test** — add `tests/test_match_members.py` (new file, mirroring `tests/test_reconcile_exceptions.py` style). Cases:
|
|
- `match_members("Henrietta Ottová (Heny): 04/2026", ["Henrietta Ottová", "Tomáš Němeček (Tov)"])` → `[("Henrietta Ottová", "auto")]` only.
|
|
- `match_members("Heny 04/2026", ["Tomáš Němeček (Tov)", "Henrietta Ottová"])` → no match for Tomáš (the substring trap is closed); whatever the legitimate behavior for "Heny" is, document it.
|
|
- Combined payment: `match_members("Henrietta Ottová a Tomáš Němeček 04/2026", ["Henrietta Ottová", "Tomáš Němeček (Tov)"])` → both as `auto`.
|
|
- Sanity: `match_members("VS 1234 Tomáš Němeček", [...])` still returns Tomáš.
|
|
|
|
2. **Run the suite**: `make test`.
|
|
|
|
3. **End-to-end**: clear the buggy row's `Person`/`Purpose` cells in the payments sheet, then `make infer`, then `make reconcile`. Confirm the payment now allocates fully to Henrietta and balance reflects it.
|
|
|
|
4. **Changelog**: per [CLAUDE.md](CLAUDE.md), append an entry to [CHANGELOG.md](CHANGELOG.md) once the user confirms the fix works in production. Format: `## 2026-05-04 HH:MM TZ — fix: payment inference exact-match short-circuit`.
|