add docker build, makefile, and some more shit before we move forward

2026-02-14 22:18:02 +01:00
parent 215b51aadb
commit 5207c48890
11 changed files with 271 additions and 26 deletions
--- a/build/.dockerignore
+++ b/build/.dockerignore
@@ -0,0 +1,5 @@
+.git
+mapa_bytu.html
+byty_*.json
+*.pyc
+__pycache__
--- a/build/CONTAINER.md
+++ b/build/CONTAINER.md
@@ -0,0 +1,100 @@
+# Container Setup
+
+OCI container image for the apartment finder. Runs two processes:
+
+1. **Web server** (`python3 -m http.server`) serving `mapa_bytu.html` on port 8080
+2. **Cron job** running `run_all.sh` (all 6 scrapers + merge) every 12 hours
+
+## Architecture
+
+```
+┌─────────────────────────────────────────┐
+│  Container (python:3.13-alpine)         │
+│                                         │
+│  PID 1: python3 -m http.server :8080    │
+│         serves /app/data/               │
+│                                         │
+│  crond:  runs run_all.sh at 06:00/18:00 │
+│          Europe/Prague timezone         │
+│                                         │
+│  /app/        ← scripts (.py, .sh)      │
+│  /app/data/   ← volume (JSON + HTML)    │
+│         ↑ symlinked from /app/byty_*    │
+└─────────────────────────────────────────┘
+```
+
+On startup, the web server starts immediately. The initial scrape runs in the background and populates the data directory as it completes. Subsequent cron runs update the data in place.
+
+## Build and Run
+
+```bash
+# Build the image (run from the repository root; the Dockerfile lives in build/)
+docker build -f build/Dockerfile -t maru-hleda-byt .
+
+# Run with persistent data volume
+docker run -d --name maru-hleda-byt \
+  -p 8080:8080 \
+  -v maru-hleda-byt-data:/app/data \
+  --restart unless-stopped \
+  maru-hleda-byt
+```
+
+Access the map at **http://localhost:8080/mapa_bytu.html**
+
+## Volume Persistence
+
+A named volume `maru-hleda-byt-data` stores:
+
+- `byty_*.json` — cached scraper data (6 source files + 1 merged)
+- `mapa_bytu.html` — the generated interactive map
+
+The JSON cache is important: each scraper skips re-fetching properties that haven't changed. Without the volume, every container restart triggers a full re-scrape of all 6 portals (several minutes with rate limiting).
+
+## Cron Schedule
+
+Scrapers run at **06:00** and **18:00 Europe/Prague time** (CET/CEST).
+
+Cron output is forwarded to the container's stdout/stderr, visible via `docker logs`.
+
+## Operations
+
+```bash
+# View logs (including cron and scraper output)
+docker logs -f maru-hleda-byt
+
+# Check cron schedule
+docker exec maru-hleda-byt crontab -l
+
+# Trigger a manual scrape
+docker exec maru-hleda-byt bash /app/run_all.sh
+
+# Stop / start (data persists in volume)
+docker stop maru-hleda-byt
+docker start maru-hleda-byt
+
+# Rebuild after code changes
+docker stop maru-hleda-byt && docker rm maru-hleda-byt
+docker build -f build/Dockerfile -t maru-hleda-byt .
+docker run -d --name maru-hleda-byt \
+  -p 8080:8080 \
+  -v maru-hleda-byt-data:/app/data \
+  --restart unless-stopped \
+  maru-hleda-byt
+```
+
+## Troubleshooting
+
+**Map shows 404**: The initial background scrape hasn't finished yet. Check `docker logs` for progress. First run takes a few minutes due to rate-limited API calls.
+
+**SSL errors from PSN scraper**: `scrape_psn.py` uses `curl` (not Python's `urllib`) specifically for Cloudflare SSL compatibility. Alpine's curl supports modern TLS via OpenSSL, so this should work. If it does not, check that `ca-certificates` is installed (`apk add ca-certificates`).
+
+**Health check failing**: The health check requests `http://localhost:8080/` every 60 seconds and has a 5-minute start period to allow the initial scrape to complete. If it still fails, verify the HTTP server is running: `docker exec maru-hleda-byt wget -q -O /dev/null http://localhost:8080/`.
+
+**Timezone verification**: `docker exec maru-hleda-byt date` should show Czech time.
+
+## Image Details
+
+- **Base**: `python:3.13-alpine` (~55 MB)
+- **Added packages**: `curl`, `bash`, `tzdata` (~10 MB)
+- **No pip packages** — all scrapers use Python standard library only
+- **Approximate image size**: ~70 MB
--- a/build/Dockerfile
+++ b/build/Dockerfile
@@ -0,0 +1,26 @@
+FROM python:3.13-alpine
+
+RUN apk add --no-cache curl bash tzdata \
+    && cp /usr/share/zoneinfo/Europe/Prague /etc/localtime \
+    && echo "Europe/Prague" > /etc/timezone
+
+ENV PYTHONUNBUFFERED=1
+
+WORKDIR /app
+
+COPY scrape_and_map.py scrape_realingo.py scrape_bezrealitky.py \
+     scrape_idnes.py scrape_psn.py scrape_cityhome.py \
+     merge_and_map.py regen_map.py run_all.sh ./
+
+COPY build/crontab /etc/crontabs/root
+COPY build/entrypoint.sh /entrypoint.sh
+RUN chmod +x /entrypoint.sh run_all.sh
+
+RUN mkdir -p /app/data
+
+EXPOSE 8080
+
+HEALTHCHECK --interval=60s --timeout=5s --start-period=300s \
+    CMD wget -q -O /dev/null http://localhost:8080/ || exit 1
+
+ENTRYPOINT ["/entrypoint.sh"]
--- a/build/Makefile
+++ b/build/Makefile
@@ -0,0 +1,31 @@
+IMAGE_NAME     := maru-hleda-byt
+CONTAINER_NAME := maru-hleda-byt
+VOLUME_NAME    := maru-hleda-byt-data
+PORT           := 8080
+
+.PHONY: build run stop logs scrape restart clean
+
+build:
+	docker build -f build/Dockerfile -t $(IMAGE_NAME) .
+
+run: build
+	docker run -d --name $(CONTAINER_NAME) \
+		-p $(PORT):8080 \
+		-v $(VOLUME_NAME):/app/data \
+		--restart unless-stopped \
+		$(IMAGE_NAME)
+	@echo "Map will be at http://localhost:$(PORT)/mapa_bytu.html"
+
+stop:
+	docker stop $(CONTAINER_NAME) && docker rm $(CONTAINER_NAME)
+
+logs:
+	docker logs -f $(CONTAINER_NAME)
+
+scrape:
+	docker exec $(CONTAINER_NAME) bash /app/run_all.sh
+
+restart: stop run
+
+clean: stop
+	docker rmi $(IMAGE_NAME)
--- a/build/crontab
+++ b/build/crontab
@@ -0,0 +1 @@
+0 6,18 * * * cd /app && bash /app/run_all.sh >> /proc/1/fd/1 2>> /proc/1/fd/2
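To make the schedule field `0 6,18 * * *` concrete, here is a minimal sketch that checks whether a given `HH:MM` time falls on it. `matches_schedule` is a hypothetical helper written for illustration only; it is not part of the commit and it ignores the day, month, and weekday fields, which are all `*` here.

```shell
#!/bin/bash
# Hypothetical helper (not in the commit): returns 0 when the HH:MM argument
# matches the minute and hour fields of the crontab entry "0 6,18 * * *".
matches_schedule() {
    local hour minute
    hour="${1%%:*}"
    minute="${1##*:}"
    # Force base 10 so values such as "06" or "08" are not parsed as octal.
    hour=$((10#$hour))
    minute=$((10#$minute))
    [ "$minute" -eq 0 ] && { [ "$hour" -eq 6 ] || [ "$hour" -eq 18 ]; }
}

for t in 06:00 12:00 18:00 18:30; do
    if matches_schedule "$t"; then
        echo "$t: run_all.sh would run"
    else
        echo "$t: no run"
    fi
done
```

The times are interpreted in the container's local time zone, which the Dockerfile sets to Europe/Prague.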
--- a/build/entrypoint.sh
+++ b/build/entrypoint.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+set -euo pipefail
+
+DATA_DIR="/app/data"
+
+# Create symlinks so scripts (which write to /app/) persist data to the volume
+for f in byty_sreality.json byty_realingo.json byty_bezrealitky.json \
+         byty_idnes.json byty_psn.json byty_cityhome.json byty_merged.json \
+         mapa_bytu.html; do
+    # Remove real file if it exists (e.g. baked into image)
+    [ -f "/app/$f" ] && [ ! -L "/app/$f" ] && rm -f "/app/$f"
+    ln -sf "$DATA_DIR/$f" "/app/$f"
+done
+
+echo "[entrypoint] Starting crond..."
+crond -b -l 2
+
+echo "[entrypoint] Starting initial scrape in background..."
+bash /app/run_all.sh &
+
+echo "[entrypoint] Starting HTTP server on port 8080..."
+exec python3 -m http.server 8080 --directory "$DATA_DIR"
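The symlink step in `entrypoint.sh` can be tried outside the container. The following is a minimal sketch under stated assumptions: two scratch directories from `mktemp -d` stand in for `/app` and `/app/data`, only two of the eight file names are used, and the JSON payload is a placeholder rather than real scraper output.

```shell
#!/bin/bash
set -euo pipefail

# Assumption: scratch directories standing in for /app and /app/data.
APP_DIR="$(mktemp -d)"
DATA_DIR="$(mktemp -d)"

# A stale regular file, as if it had been baked into the image.
echo "stale" > "$APP_DIR/byty_merged.json"

for f in byty_merged.json mapa_bytu.html; do
    # Replace a regular file (but not an existing symlink) with a link
    # that points into the data directory.
    if [ -f "$APP_DIR/$f" ] && [ ! -L "$APP_DIR/$f" ]; then
        rm -f "$APP_DIR/$f"
    fi
    ln -sf "$DATA_DIR/$f" "$APP_DIR/$f"
done

# Writing through the symlink creates the file in the data directory.
echo '{"count": 0}' > "$APP_DIR/byty_merged.json"

MERGED_CONTENT="$(cat "$DATA_DIR/byty_merged.json")"
echo "$MERGED_CONTENT"
```

Until a script writes to a given name, its link in the application directory dangles and no file exists in the data directory, which is consistent with the map returning 404 before the first scrape finishes.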