Jan Novak 00c9144010
Fix DATA_DIR usage in stats/history paths, set env in Dockerfile, add validation docs
- scraper_stats.py: respect DATA_DIR env var when writing stats_*.json files
- generate_status.py: read stats files and write history from DATA_DIR instead of HERE
- build/Dockerfile: set DATA_DIR=/app/data as default env var
- docs/validation.md: end-to-end Docker validation recipe

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-26 09:46:16 +01:00


# Validation Recipe

End-to-end check that scraping, data persistence, history, and the status page all work correctly in Docker.

## What it verifies

- All scrapers run and write output to `DATA_DIR` (`/app/data`)
- `stats_*.json` land in `/app/data/` (not in `/app/`)
- `status.json` and `scraper_history.json` land in `/app/data/`
- `/api/status`, `/api/status/history`, and `/scrapers-status` serve correct data
- History accumulates across runs

## Steps

### 1. Build the image

```bash
make build
```

### 2. Start a clean validation container

```bash
# Stop/remove any leftover container and volume from a previous run
docker stop maru-hleda-byt-validation 2>/dev/null; docker rm maru-hleda-byt-validation 2>/dev/null
docker volume rm maru-hleda-byt-validation-data 2>/dev/null

docker run -d --name maru-hleda-byt-validation \
  -p 8081:8080 \
  -v maru-hleda-byt-validation-data:/app/data \
  maru-hleda-byt
```

Give the container about 3 seconds to start. The entrypoint automatically launches a full scrape in the background; suppress it so that only controlled runs execute:

```bash
sleep 3
docker exec maru-hleda-byt-validation pkill -f run_all.sh 2>/dev/null || true
docker exec maru-hleda-byt-validation rm -f /app/data/scraper_running.json 2>/dev/null || true
```

### 3. Run a limited scrape (run 1)

```bash
docker exec maru-hleda-byt-validation bash /app/run_all.sh --max-pages 1 --max-properties 10
```

Expected output (last few lines):

```text
Status uložen: /app/data/status.json
Historie uložena: /app/data/scraper_history.json (1 záznamů)
```
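When scripting this step, the record count can be extracted from that final log line with a small helper. This is a sketch; the pattern is taken verbatim from the expected output above, and the function name is made up for illustration:

```python
import re

def history_record_count(log_line: str) -> int:
    """Extract N from a 'Historie uložena: ... (N záznamů)' log line.

    Returns -1 when the line does not match, so callers can fail loudly
    instead of silently treating a missing line as zero records.
    """
    match = re.search(r"Historie uložena: \S+ \((\d+) záznamů\)", log_line)
    return int(match.group(1)) if match else -1

# Against the expected run-1 output:
line = "Historie uložena: /app/data/scraper_history.json (1 záznamů)"
assert history_record_count(line) == 1
```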

### 4. Verify data files are in /app/data/

```bash
docker exec maru-hleda-byt-validation ls /app/data/
```

Expected files:

```text
byty_cityhome.json   byty_idnes.json   byty_merged.json
byty_realingo.json   byty_sreality.json
mapa_bytu.html
scraper_history.json
stats_bezrealitky.json  stats_cityhome.json  stats_idnes.json
stats_realingo.json     stats_sreality.json
status.json
```
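To automate this step, the `ls` output can be compared against the expected set. A sketch: the file names mirror the listing above, and `byty_bezrealitky.json` is deliberately left out of the required set because (per the Notes below) it may legitimately be absent:

```python
# Files that must exist in /app/data/ after a limited scrape.
EXPECTED_FILES = {
    "byty_cityhome.json", "byty_idnes.json", "byty_merged.json",
    "byty_realingo.json", "byty_sreality.json",
    "mapa_bytu.html", "scraper_history.json", "status.json",
    "stats_bezrealitky.json", "stats_cityhome.json", "stats_idnes.json",
    "stats_realingo.json", "stats_sreality.json",
}

def missing_files(listing):
    """Return expected files absent from a `ls /app/data/` listing
    (an iterable of file names)."""
    return EXPECTED_FILES - set(listing)
```

Feed it the split output of `docker exec maru-hleda-byt-validation ls /app/data/`; an empty return set means the step passed.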

### 5. Run a second limited scrape (run 2)

```bash
docker exec maru-hleda-byt-validation bash /app/run_all.sh --max-pages 1 --max-properties 10
```

Expected last line: `Historie uložena: /app/data/scraper_history.json (2 záznamů)`

### 6. Verify history via API

```bash
curl -s http://localhost:8081/api/status/history | python3 -c "
import json, sys
h = json.load(sys.stdin)
print(f'{len(h)} entries:')
for i, e in enumerate(h):
    print(f'  [{i}] {e[\"timestamp\"]} total={e[\"total_accepted\"]}')
"
```

Expected: 2 entries with different timestamps.
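A structural check of the history payload can be scripted as well. A sketch, assuming only the `timestamp` and `total_accepted` keys that the command above already reads:

```python
def check_history(entries, expected_len=2):
    """Validate a /api/status/history payload: entry count, required
    keys, and strictly increasing timestamps (ISO strings sort
    chronologically, so plain string comparison suffices)."""
    assert len(entries) == expected_len, f"expected {expected_len} entries, got {len(entries)}"
    for entry in entries:
        assert "timestamp" in entry and "total_accepted" in entry, f"missing keys in {entry}"
    timestamps = [entry["timestamp"] for entry in entries]
    assert timestamps == sorted(timestamps) and len(set(timestamps)) == len(timestamps), \
        "timestamps must be strictly increasing"
    return True
```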

```bash
curl -s http://localhost:8081/api/status | python3 -c "
import json, sys; s=json.load(sys.stdin)
print(f'status={s[\"status\"]} total={s[\"total_accepted\"]} ts={s[\"timestamp\"]}')
"
```

Expected: `status=done total=<N> ts=<latest timestamp>`
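Since status.json should reflect the latest run, the two API responses can also be cross-checked against each other. A sketch, assuming only the key names used in the commands above:

```python
def status_matches_history(status, history):
    """True when the status payload mirrors the newest history entry:
    the run has finished and timestamp/total_accepted agree."""
    latest = history[-1]
    return (
        status.get("status") == "done"
        and status.get("timestamp") == latest.get("timestamp")
        and status.get("total_accepted") == latest.get("total_accepted")
    )
```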

### 7. Check the status page

Open http://localhost:8081/scrapers-status in a browser, or run `curl -s http://localhost:8081/scrapers-status | grep -c "clickable-row"`; it should print 2.
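If `grep` is unavailable, the same count can be taken from Python. A sketch; `clickable-row` is the CSS class the grep above counts, and this counts occurrences rather than matching lines, which agrees with the grep result when each row renders on its own line:

```python
def count_rows(html: str) -> int:
    """Count history rows on the status page by occurrences of the
    clickable-row class string."""
    return html.count("clickable-row")
```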

### 8. Clean up

```bash
docker stop maru-hleda-byt-validation && docker rm maru-hleda-byt-validation
docker volume rm maru-hleda-byt-validation-data
```

Or use the Makefile shortcut:

```bash
make validation-stop
```

## Notes

- The PSN scraper does not support `--max-pages` and will always fail with this command; `success=False` in history is expected during validation.
- Bezrealitky may return 0 results with a 1-page limit; in that case `byty_bezrealitky.json` will be absent from `/app/data/`, which is normal.
- `make validation` (the Makefile target) runs the same limited scrape but does not suppress the background startup scrape, so two concurrent runs may occur. Use the manual steps above for a clean, controlled test.