Compare commits: 0.06...23d208a5b7 (2 commits)

| Author | SHA1 | Date | |
|---|---|---|---|
| | 23d208a5b7 | | |
| | 00c9144010 | | |

@@ -5,6 +5,7 @@ RUN apk add --no-cache curl bash tzdata \
     && echo "Europe/Prague" > /etc/timezone

 ENV PYTHONUNBUFFERED=1
+ENV DATA_DIR=/app/data

 WORKDIR /app

123
docs/validation.md
Normal file
@@ -0,0 +1,123 @@
# Validation Recipe

End-to-end check that scraping, data persistence, history, and the status page all work correctly in Docker.

## What it verifies

- All scrapers run and write output to `DATA_DIR` (`/app/data`)
- `stats_*.json` land in `/app/data/` (not in `/app/`)
- `status.json` and `scraper_history.json` land in `/app/data/`
- `/api/status`, `/api/status/history`, and `/scrapers-status` serve correct data
- History accumulates across runs

## Steps

### 1. Build the image

```bash
make build
```

### 2. Start a clean validation container

```bash
# Stop/remove any leftover container and volume from a previous run
docker stop maru-hleda-byt-validation 2>/dev/null; docker rm maru-hleda-byt-validation 2>/dev/null
docker volume rm maru-hleda-byt-validation-data 2>/dev/null

docker run -d --name maru-hleda-byt-validation \
  -p 8081:8080 \
  -v maru-hleda-byt-validation-data:/app/data \
  maru-hleda-byt
```

Give the container ~3 seconds to start. The entrypoint launches a full background scrape automatically; suppress it so that only controlled runs execute:

```bash
sleep 3
docker exec maru-hleda-byt-validation pkill -f run_all.sh 2>/dev/null || true
docker exec maru-hleda-byt-validation rm -f /app/data/scraper_running.json 2>/dev/null || true
```

### 3. Run a limited scrape (run 1)

```bash
docker exec maru-hleda-byt-validation bash /app/run_all.sh --max-pages 1 --max-properties 10
```

Expected output (last few lines):
```
Status uložen: /app/data/status.json
Historie uložena: /app/data/scraper_history.json (1 záznamů)
```

### 4. Verify data files are in `/app/data/`

```bash
docker exec maru-hleda-byt-validation ls /app/data/
```

Expected files:
```
byty_cityhome.json byty_idnes.json byty_merged.json
byty_realingo.json byty_sreality.json
mapa_bytu.html
scraper_history.json
stats_bezrealitky.json stats_cityhome.json stats_idnes.json
stats_realingo.json stats_sreality.json
status.json
```

### 5. Run a second limited scrape (run 2)

```bash
docker exec maru-hleda-byt-validation bash /app/run_all.sh --max-pages 1 --max-properties 10
```

Expected last line: `Historie uložena: /app/data/scraper_history.json (2 záznamů)`

### 6. Verify history via API

```bash
curl -s http://localhost:8081/api/status/history | python3 -c "
import json, sys
h = json.load(sys.stdin)
print(f'{len(h)} entries:')
for i, e in enumerate(h):
    print(f' [{i}] {e[\"timestamp\"]} total={e[\"total_accepted\"]}')
"
```

Expected: 2 entries with different timestamps.

```bash
curl -s http://localhost:8081/api/status | python3 -c "
import json, sys; s=json.load(sys.stdin)
print(f'status={s[\"status\"]} total={s[\"total_accepted\"]} ts={s[\"timestamp\"]}')
"
```

Expected: `status=done total=<N> ts=<latest timestamp>`

### 7. Check the status page

Open http://localhost:8081/scrapers-status in a browser, or run `curl -s http://localhost:8081/scrapers-status | grep -c "clickable-row"`, which should print `2`.

### 8. Clean up

```bash
docker stop maru-hleda-byt-validation && docker rm maru-hleda-byt-validation
docker volume rm maru-hleda-byt-validation-data
```

Or use the Makefile shortcut:

```bash
make validation-stop
```

## Notes

- PSN scraper does not support `--max-pages` and will always fail with this command; `success=False` in history is expected during validation.
- Bezrealitky may return 0 results with a 1-page limit; in that case `byty_bezrealitky.json` will be absent from `/app/data/`. This is normal.
- `make validation` (the Makefile target) runs the same limited scrape but does not suppress the background startup scrape, so two concurrent runs may occur. Use the manual steps above for a clean controlled test.

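The checks in steps 4 and 6 of the recipe can also be scripted. The following is a minimal sketch and not part of this change: `missing_files` and `timestamps_distinct` are hypothetical helper names, the expected file list is a subset copied from step 4, and the sample history entries (including their timestamp format) are made up for the demo.

```python
import json
import tempfile
from pathlib import Path

# A few file names from step 4 of the recipe. byty_bezrealitky.json is left
# out because the notes say it may legitimately be absent.
EXPECTED_FILES = [
    "byty_merged.json",
    "mapa_bytu.html",
    "scraper_history.json",
    "status.json",
]


def missing_files(data_dir: Path, expected: list) -> list:
    """Return the expected file names that are not present in data_dir."""
    return [name for name in expected if not (data_dir / name).exists()]


def timestamps_distinct(history: list) -> bool:
    """True when every history entry carries a different 'timestamp' value."""
    stamps = [entry["timestamp"] for entry in history]
    return len(stamps) == len(set(stamps))


if __name__ == "__main__":
    # Demo against a throwaway directory instead of the real /app/data.
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp)
        history = [
            {"timestamp": "2024-01-01T10:00:00", "total_accepted": 5},
            {"timestamp": "2024-01-01T10:05:00", "total_accepted": 7},
        ]
        (data_dir / "scraper_history.json").write_text(
            json.dumps(history), encoding="utf-8"
        )
        (data_dir / "status.json").write_text("{}", encoding="utf-8")
        print(missing_files(data_dir, EXPECTED_FILES))
        print(timestamps_distinct(history))
```

Against a real run, `data_dir` would point at a local copy of the validation volume, and `history` would come from the `/api/status/history` response.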
@@ -58,7 +58,7 @@ def read_scraper_stats(path: Path) -> dict:

 def append_to_history(status: dict, keep: int) -> None:
     """Append the current status entry to scraper_history.json, keeping only `keep` latest."""
-    history_path = HERE / HISTORY_FILE
+    history_path = DATA_DIR / HISTORY_FILE
     history: list = []
     if history_path.exists():
         try:
@@ -98,7 +98,7 @@ def main():
     info["name"] = name

     # Merge in stats from the per-scraper stats file (authoritative for run data)
-    stats = read_scraper_stats(HERE / STATS_FILES[name])
+    stats = read_scraper_stats(DATA_DIR / STATS_FILES[name])
     for key in ("accepted", "fetched", "pages", "cache_hits", "excluded", "excluded_total",
                 "success", "duration_sec", "error"):
         if key in stats:

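The docstring of `append_to_history` says the history is trimmed to the `keep` most recent entries, but the trimming code falls outside the hunk. Below is a minimal sketch of how such an append-and-trim step could look, assuming the file holds a JSON list. `append_entry` is a hypothetical name, and the handling of a corrupt file is an assumption rather than the project's actual behavior.

```python
import json
import tempfile
from pathlib import Path


def append_entry(history_path: Path, entry: dict, keep: int) -> list:
    """Append entry to the JSON list at history_path, keeping the last `keep`."""
    history: list = []
    if history_path.exists():
        try:
            loaded = json.loads(history_path.read_text(encoding="utf-8"))
            if isinstance(loaded, list):
                history = loaded
        except json.JSONDecodeError:
            # Assumption: a corrupt history file is discarded, not repaired.
            history = []
    history.append(entry)
    # history[-0:] would return the whole list, so guard keep <= 0 explicitly.
    history = history[-keep:] if keep > 0 else []
    history_path.write_text(
        json.dumps(history, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return history


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "scraper_history.json"
        for run in range(1, 4):
            kept = append_entry(path, {"run": run}, keep=2)
        print([e["run"] for e in kept])  # [2, 3]
```

Slicing with `history[-keep:]` keeps the newest entries at the end of the list, which matches the order in which step 6 of the validation recipe prints them.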
@@ -2,12 +2,14 @@
 from __future__ import annotations

 import json
+import os
 from pathlib import Path

 HERE = Path(__file__).parent
+DATA_DIR = Path(os.environ.get("DATA_DIR", HERE))


 def write_stats(filename: str, stats: dict) -> None:
-    """Write scraper run stats dict to a JSON file next to this module."""
+    """Write scraper run stats dict to the data directory."""
-    path = HERE / filename
+    path = DATA_DIR / filename
     path.write_text(json.dumps(stats, ensure_ascii=False, indent=2), encoding="utf-8")

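The `DATA_DIR` line is the core of this change: it reads the environment variable and falls back to the module directory when the variable is unset. The sketch below isolates that lookup so that both cases can be exercised. `resolve_data_dir` is a hypothetical helper that takes the environment as a parameter; it is not a function from the repository.

```python
import os
from pathlib import Path


def resolve_data_dir(env: dict, fallback: Path) -> Path:
    """Mirror Path(os.environ.get("DATA_DIR", HERE)) with an explicit mapping."""
    return Path(env.get("DATA_DIR", fallback))


if __name__ == "__main__":
    here = Path("/app")  # stand-in for HERE inside the container
    # Variable set, as in the Dockerfile: the data directory wins.
    print(resolve_data_dir({"DATA_DIR": "/app/data"}, here))
    # Variable unset, e.g. a local run: falls back to the module directory.
    print(resolve_data_dir({}, here))
    # Whatever the current process environment provides.
    print(resolve_data_dir(dict(os.environ), here))
```

One edge case worth knowing: if `DATA_DIR` is set but empty, `os.environ.get` returns the empty string rather than the fallback, and `Path("")` resolves to the current working directory.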