# Validation Recipe

End-to-end check that scraping, data persistence, history, and the status page all work correctly in Docker.

## What it verifies

- All scrapers run and write output to `DATA_DIR` (`/app/data`)
- `stats_*.json` land in `/app/data/` (not in `/app/`)
- `status.json` and `scraper_history.json` land in `/app/data/`
- `/api/status`, `/api/status/history`, and `/scrapers-status` serve correct data
- History accumulates across runs

## Steps

### 1. Build the image

```bash
make build
```

### 2. Start a clean validation container

```bash
# Stop/remove any leftover container and volume from a previous run
docker stop maru-hleda-byt-validation 2>/dev/null; docker rm maru-hleda-byt-validation 2>/dev/null
docker volume rm maru-hleda-byt-validation-data 2>/dev/null

docker run -d --name maru-hleda-byt-validation \
  -p 8081:8080 \
  -v maru-hleda-byt-validation-data:/app/data \
  maru-hleda-byt
```

Give the container ~3 seconds to start. The entrypoint launches a background full scrape automatically; suppress it so only controlled runs execute:

```bash
sleep 3
docker exec maru-hleda-byt-validation pkill -f run_all.sh 2>/dev/null || true
docker exec maru-hleda-byt-validation rm -f /app/data/scraper_running.json 2>/dev/null || true
```

### 3. Run a limited scrape (run 1)

```bash
docker exec maru-hleda-byt-validation bash /app/run_all.sh --max-pages 1 --max-properties 10
```

Expected output (last few lines):

```
Status uložen: /app/data/status.json
Historie uložena: /app/data/scraper_history.json (1 záznamů)
```

### 4. Verify data files are in `/app/data/`

```bash
docker exec maru-hleda-byt-validation ls /app/data/
```

Expected files:

```
byty_cityhome.json
byty_idnes.json
byty_merged.json
byty_realingo.json
byty_sreality.json
mapa_bytu.html
scraper_history.json
stats_bezrealitky.json
stats_cityhome.json
stats_idnes.json
stats_realingo.json
stats_sreality.json
status.json
```
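The manual `ls` check above can be scripted. A minimal sketch, assuming the data directory has been copied out of the container first (e.g. with `docker cp maru-hleda-byt-validation:/app/data ./data`); `missing_files` is a hypothetical helper, not part of the project:

```python
# Sketch: verify that every file expected after a validation run is present.
# The EXPECTED list mirrors the step above; missing_files() is a hypothetical
# helper, not part of the project.
from pathlib import Path

EXPECTED = [
    "byty_cityhome.json", "byty_idnes.json", "byty_merged.json",
    "byty_realingo.json", "byty_sreality.json", "mapa_bytu.html",
    "scraper_history.json", "stats_bezrealitky.json", "stats_cityhome.json",
    "stats_idnes.json", "stats_realingo.json", "stats_sreality.json",
    "status.json",
]

def missing_files(data_dir: str) -> list[str]:
    """Return the expected files that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED if not (root / name).is_file()]

if __name__ == "__main__":
    import sys
    missing = missing_files(sys.argv[1] if len(sys.argv) > 1 else "./data")
    print("OK" if not missing else f"MISSING: {', '.join(missing)}")
```

Note that per the Notes section below, `byty_bezrealitky.json` is intentionally not in the list, since it may legitimately be absent with a 1-page limit.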
### 5. Run a second limited scrape (run 2)

```bash
docker exec maru-hleda-byt-validation bash /app/run_all.sh --max-pages 1 --max-properties 10
```

Expected last line: `Historie uložena: /app/data/scraper_history.json (2 záznamů)`

### 6. Verify history via API

```bash
curl -s http://localhost:8081/api/status/history | python3 -c "
import json, sys
h = json.load(sys.stdin)
print(f'{len(h)} entries:')
for i, e in enumerate(h):
    print(f'  [{i}] {e[\"timestamp\"]} total={e[\"total_accepted\"]}')
"
```

Expected: 2 entries with different timestamps.

```bash
curl -s http://localhost:8081/api/status | python3 -c "
import json, sys; s=json.load(sys.stdin)
print(f'status={s[\"status\"]} total={s[\"total_accepted\"]} ts={s[\"timestamp\"]}')
"
```

Expected: `status=done`, with `total` and `ts` matching the latest history entry.

### 7. Check the status page

Open http://localhost:8081/scrapers-status in a browser (or run `curl -s http://localhost:8081/scrapers-status | grep -c "clickable-row"`, which should print `2`).

### 8. Clean up

```bash
docker stop maru-hleda-byt-validation && docker rm maru-hleda-byt-validation
docker volume rm maru-hleda-byt-validation-data
```

Or use the Makefile shortcut:

```bash
make validation-stop
```

## Notes

- The PSN scraper does not support `--max-pages` and will always fail with this command; `success=False` in history is expected during validation.
- Bezrealitky may return 0 results with a 1-page limit; in that case `byty_bezrealitky.json` will be absent from `/app/data/`, which is normal.
- `make validation` (the Makefile target) runs the same limited scrape but does not suppress the background startup scrape, so two concurrent runs may occur. Use the manual steps above for a clean, controlled test.
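The invariants checked by hand in steps 5 and 6 can be collected into one function. A sketch, assuming the newest history entry is appended last (as the accumulating history suggests); `validate_run` is a hypothetical helper, not part of the project:

```python
# Sketch: the checks from steps 5-6 as one function. Feed it the parsed JSON
# from /api/status/history and /api/status. validate_run() is a hypothetical
# helper; it assumes the newest history entry is appended last.

def validate_run(history: list[dict], status: dict, expected_runs: int = 2) -> list[str]:
    """Return a list of problems; an empty list means validation passed."""
    problems = []
    if len(history) != expected_runs:
        problems.append(f"expected {expected_runs} history entries, got {len(history)}")
    timestamps = [e["timestamp"] for e in history]
    if len(set(timestamps)) != len(timestamps):
        problems.append("history entries share a timestamp")
    if status.get("status") != "done":
        problems.append(f"status is {status.get('status')!r}, not 'done'")
    if history and status.get("timestamp") != history[-1]["timestamp"]:
        problems.append("/api/status timestamp does not match the latest history entry")
    return problems
```

To use it, `json.load` the responses from the two endpoints and pass them in; any returned string names a failed check.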