# Validation Recipe
End-to-end check that scraping, data persistence, history, and the status page all work correctly in Docker.
## What it verifies

- All scrapers run and write output to `DATA_DIR` (`/app/data`)
- `stats_*.json` files land in `/app/data/` (not in `/app/`)
- `status.json` and `scraper_history.json` land in `/app/data/`
- `/api/status`, `/api/status/history`, and `/scrapers-status` serve correct data
- History accumulates across runs
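The `DATA_DIR` convention the checks rely on can be sketched in Python. `stats_path` below is a hypothetical helper, not the project's actual function; it only illustrates the intended resolution: read `DATA_DIR` from the environment, fall back to `/app/data`, and never write next to the scripts.

```python
import os
import tempfile
from pathlib import Path

def stats_path(scraper: str) -> Path:
    """Hypothetical helper: resolve stats_<scraper>.json under $DATA_DIR."""
    data_dir = Path(os.environ.get("DATA_DIR", "/app/data"))
    return data_dir / f"stats_{scraper}.json"

# Point DATA_DIR at a temp dir so the sketch runs anywhere
os.environ["DATA_DIR"] = tempfile.mkdtemp()
p = stats_path("sreality")
p.write_text('{"accepted": 0}')
print(p.name)  # stats_sreality.json
```

With the fallback default of `/app/data`, files land in the mounted volume even when the env var is unset in the container.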
## Steps
### 1. Build the image

```bash
make build
```
### 2. Start a clean validation container

```bash
# Stop/remove any leftover container and volume from a previous run
docker stop maru-hleda-byt-validation 2>/dev/null; docker rm maru-hleda-byt-validation 2>/dev/null
docker volume rm maru-hleda-byt-validation-data 2>/dev/null

docker run -d --name maru-hleda-byt-validation \
  -p 8081:8080 \
  -v maru-hleda-byt-validation-data:/app/data \
  maru-hleda-byt
```
Give the container ~3 seconds to start. The entrypoint launches a background full scrape automatically; kill it and clear its lock file so only the controlled runs below execute:

```bash
sleep 3
docker exec maru-hleda-byt-validation pkill -f run_all.sh 2>/dev/null || true
docker exec maru-hleda-byt-validation rm -f /app/data/scraper_running.json 2>/dev/null || true
```
### 3. Run a limited scrape (run 1)

```bash
docker exec maru-hleda-byt-validation bash /app/run_all.sh --max-pages 1 --max-properties 10
```

Expected output (last few lines; Czech for "Status saved" / "History saved (1 record)"):

```
Status uložen: /app/data/status.json
Historie uložena: /app/data/scraper_history.json (1 záznamů)
```
### 4. Verify data files are in /app/data/

```bash
docker exec maru-hleda-byt-validation ls /app/data/
```

Expected files:

```
byty_cityhome.json  byty_idnes.json  byty_merged.json
byty_realingo.json  byty_sreality.json
mapa_bytu.html
scraper_history.json
stats_bezrealitky.json  stats_cityhome.json  stats_idnes.json
stats_realingo.json  stats_sreality.json
status.json
```
### 5. Run a second limited scrape (run 2)

```bash
docker exec maru-hleda-byt-validation bash /app/run_all.sh --max-pages 1 --max-properties 10
```

Expected last line: `Historie uložena: /app/data/scraper_history.json (2 záznamů)` ("History saved … (2 records)")
### 6. Verify history via API

```bash
curl -s http://localhost:8081/api/status/history | python3 -c "
import json, sys
h = json.load(sys.stdin)
print(f'{len(h)} entries:')
for i, e in enumerate(h):
    print(f'  [{i}] {e[\"timestamp\"]} total={e[\"total_accepted\"]}')
"
```

Expected: 2 entries with different timestamps.
```bash
curl -s http://localhost:8081/api/status | python3 -c "
import json, sys; s=json.load(sys.stdin)
print(f'status={s[\"status\"]} total={s[\"total_accepted\"]} ts={s[\"timestamp\"]}')
"
```

Expected: `status=done total=<N> ts=<latest timestamp>`
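The two API checks above can be folded into one offline sketch. The sample payloads below are hypothetical, but mirror the fields the recipe relies on (`timestamp`, `total_accepted`, `status`):

```python
import json

# Hypothetical samples shaped like /api/status/history and /api/status responses
history = json.loads("""[
  {"timestamp": "2024-01-01T10:00:00", "total_accepted": 8},
  {"timestamp": "2024-01-01T10:05:00", "total_accepted": 9}
]""")
status = json.loads(
    '{"status": "done", "total_accepted": 9, "timestamp": "2024-01-01T10:05:00"}'
)

# Two entries, distinct timestamps, and /api/status reflects the latest run
assert len(history) == 2
assert history[0]["timestamp"] != history[1]["timestamp"]
assert status["timestamp"] == history[-1]["timestamp"]
print("API checks passed")
```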
### 7. Check the status page

Open http://localhost:8081/scrapers-status in a browser, or count the history rows directly:

```bash
curl -s http://localhost:8081/scrapers-status | grep -c "clickable-row"   # should print 2
```
### 8. Clean up

```bash
docker stop maru-hleda-byt-validation && docker rm maru-hleda-byt-validation
docker volume rm maru-hleda-byt-validation-data
```

Or use the Makefile shortcut:

```bash
make validation-stop
```
## Notes

- The PSN scraper does not support `--max-pages` and will always fail with this command; `success=False` in history is expected during validation.
- Bezrealitky may return 0 results with a 1-page limit; `byty_bezrealitky.json` will then be absent from `/app/data/`. This is normal.
- `make validation` (the Makefile target) runs the same limited scrape but does not suppress the background startup scrape, so two concurrent runs may occur. Use the manual steps above for a clean, controlled test.
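Per the notes, a `success=False` entry for PSN alone should not fail validation. A minimal sketch of that tolerance check, assuming history entries carry a per-scraper `success` field (the nested `scrapers` shape is an assumption, not the project's documented schema):

```python
import json

# Hypothetical history entry; the nested "scrapers" shape is assumed
entry = json.loads("""{
  "timestamp": "2024-01-01T10:00:00",
  "total_accepted": 9,
  "scrapers": {"sreality": {"success": true}, "psn": {"success": false}}
}""")

EXPECTED_FAILURES = {"psn"}  # PSN rejects --max-pages during validation
failed = {name for name, s in entry["scrapers"].items() if not s["success"]}
unexpected = failed - EXPECTED_FAILURES
print("unexpected failures:", sorted(unexpected))  # []
```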