Reliability improvements: retry logic, validation, ratings sync

- Add 3-attempt retry with exponential backoff to Sreality, Realingo,
  Bezrealitky, and PSN scrapers (CityHome and iDNES already had it)
- Add shared validate_listing() in scraper_stats.py; all 6 scrapers now
  validate GPS bounds, price, area, and required fields before output
- Wire ratings to server /api/ratings on page load (merge with
  localStorage) and save (async POST); ratings now persist across
  browsers and devices
- Namespace JS hash IDs as {source}_{id} to prevent rating collisions
  between listings from different portals with the same numeric ID
- Replace manual Czech diacritic table with unicodedata.normalize()
  in merge_and_map.py for correct deduplication of all edge cases
- Correct README schedule docs: every 4 hours, not twice daily
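The retry behaviour added to the four scrapers can be sketched like this (the function and parameter names are illustrative, not the scrapers' actual helper):

```python
import time
import urllib.error
import urllib.request

def fetch_with_retry(url: str, attempts: int = 3, base_delay: float = 1.0) -> bytes:
    """Fetch a URL, retrying on failure with exponential backoff (1s, 2s, ...)."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts, propagate the error to the caller
            time.sleep(base_delay * 2 ** attempt)  # wait 1s, then 2s
```

With three attempts the scraper waits 1s after the first failure and 2s after the second, then gives up.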
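A minimal shape for the shared validator (the field names and the rough Czech-Republic GPS bounding box below are assumptions, not the actual scraper_stats.py code):

```python
# Rough Czech-Republic bounding box; the real bounds may differ.
LAT_BOUNDS = (48.5, 51.1)
LON_BOUNDS = (12.0, 18.9)
REQUIRED_FIELDS = ("id", "title", "url", "price", "area", "lat", "lon")  # assumed

def validate_listing(listing: dict) -> bool:
    """Return True if a scraped listing passes basic sanity checks."""
    if any(listing.get(f) in (None, "") for f in REQUIRED_FIELDS):
        return False  # a required field is missing or empty
    if not LAT_BOUNDS[0] <= listing["lat"] <= LAT_BOUNDS[1]:
        return False  # latitude outside the expected region
    if not LON_BOUNDS[0] <= listing["lon"] <= LON_BOUNDS[1]:
        return False  # longitude outside the expected region
    if listing["price"] <= 0 or listing["area"] <= 0:
        return False  # nonsensical price or floor area
    return True
```

Each scraper calls the validator once per listing and drops anything that fails, so bad rows never reach the output JSON.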
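The load-time merge of server ratings with localStorage could work roughly as below; the commit does not state which side wins on conflict, so the client-wins policy here is an assumption:

```python
def merge_ratings(server: dict, local: dict) -> dict:
    """Merge localStorage ratings over the server copy.

    Assumption: the local (client) value wins when both sides rated
    the same listing; server-only ratings are kept.
    """
    merged = dict(server)
    merged.update(local)  # client values overwrite server values on conflict
    return merged
```

The merged map is then written back to both localStorage and, via an async POST to /api/ratings, the server.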
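The ID namespacing is simple enough to state exactly (shown here in Python, though the actual change is in the frontend JS; the function name is illustrative):

```python
def namespaced_id(source: str, listing_id) -> str:
    """Prefix a portal's listing ID with its source, e.g. 'sreality_123456'.

    Two portals can reuse the same numeric ID, so the bare ID is not
    safe as a rating key; the source prefix makes keys globally unique.
    """
    return f"{source}_{listing_id}"
```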
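The diacritic handling replaces a hand-maintained character table with Unicode decomposition; the idea, sketched (the helper name is illustrative, not the actual merge_and_map.py function):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove diacritics: decompose characters, then drop combining marks.

    NFKD splits e.g. 'Ř' into 'R' + a combining caron; filtering out
    combining marks leaves plain ASCII letters, covering every accented
    character rather than only those listed in a manual table.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Deduplication then compares the stripped forms, so "Žižkov" and "Zizkov" map to the same key.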

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Jan Novak
2026-02-27 10:36:37 +01:00
parent 57a9f6f21a
commit 27a7834eb6
9 changed files with 212 additions and 114 deletions


@@ -151,7 +151,7 @@ The project includes a Docker setup for unattended operation with a cron-based s
 │ PID 1: python3 -m http.server :8080 │
 │ serves /app/data/ │
 │ │
-│ crond: runs run_all.sh at 06:00/18:00 │
+│ crond: runs run_all.sh every 4 hours │
 │ Europe/Prague timezone │
 │ │
 │ /app/ -- scripts (.py, .sh) │
@@ -160,7 +160,7 @@ The project includes a Docker setup for unattended operation with a cron-based s
 └─────────────────────────────────────────┘
 ```
-On startup, the HTTP server starts immediately. The initial scrape runs in the background. Subsequent cron runs update data in-place twice daily at 06:00 and 18:00 CET/CEST.
+On startup, the HTTP server starts immediately. The initial scrape runs in the background. Subsequent cron runs update data in-place every 4 hours.
 ### Quick start
@@ -208,7 +208,7 @@ Validation targets run scrapers with `--max-pages 1 --max-properties 10` for a f
 ├── build/
 │ ├── Dockerfile # Container image definition (python:3.13-alpine)
 │ ├── entrypoint.sh # Container entrypoint (HTTP server + cron + initial scrape)
-│ ├── crontab # Cron schedule (06:00 and 18:00 CET)
+│ ├── crontab # Cron schedule (every 4 hours)
 │ └── CONTAINER.md # Container-specific documentation
 └── .gitignore # Ignores byty_*.json, __pycache__, .vscode
```