add docker build, makefile, and remaining container setup before we move forward
build/.dockerignore (new file, 5 lines)
@@ -0,0 +1,5 @@
.git
mapa_bytu.html
byty_*.json
*.pyc
__pycache__
build/CONTAINER.md (new file, 100 lines)
@@ -0,0 +1,100 @@
# Container Setup

OCI container image for the apartment finder. It runs two processes:

1. **Web server** (`python3 -m http.server`) serving `mapa_bytu.html` on port 8080
2. **Cron job** running `run_all.sh` (all 6 scrapers + merge) every 12 hours

## Architecture

```
┌─────────────────────────────────────────┐
│ Container (python:3.13-alpine)          │
│                                         │
│ PID 1: python3 -m http.server :8080     │
│        serves /app/data/                │
│                                         │
│ crond: runs run_all.sh at 06:00/18:00   │
│        Europe/Prague timezone           │
│                                         │
│ /app/       ← scripts (.py, .sh)        │
│ /app/data/  ← volume (JSON + HTML)      │
│             ↑ symlinked from /app/byty_*│
└─────────────────────────────────────────┘
```

On startup, the web server starts immediately. The initial scrape runs in the background and populates data as it completes; subsequent cron runs update the data in place.

## Build and Run

```bash
# Build the image (the Dockerfile lives in build/, the context is the repo root)
docker build -f build/Dockerfile -t maru-hleda-byt .

# Run with persistent data volume
docker run -d --name maru-hleda-byt \
  -p 8080:8080 \
  -v maru-hleda-byt-data:/app/data \
  --restart unless-stopped \
  maru-hleda-byt
```

Access the map at **http://localhost:8080/mapa_bytu.html**.

## Volume Persistence

A named volume, `maru-hleda-byt-data`, stores:

- `byty_*.json` — cached scraper data (6 source files + 1 merged)
- `mapa_bytu.html` — the generated interactive map

The JSON cache is important: each scraper skips re-fetching properties that haven't changed. Without the volume, every container restart triggers a full re-scrape of all 6 portals (several minutes with rate limiting).
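
The skip-refetch idea can be sketched as follows. This is a hypothetical illustration only: the real scrapers' cache layout and change-detection fields may differ; here a listing is re-fetched only when its advertised price changed since the cached run.

```python
import json
import os
import tempfile

def merge_with_cache(cache_path, fresh_listings):
    """fresh_listings: {listing_id: {"price": ...}} from a cheap index-page scan.

    Returns the ids that would need an expensive per-listing detail fetch."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    # Only listings whose price differs from the cached copy are re-fetched.
    to_fetch = [lid for lid, item in fresh_listings.items()
                if cache.get(lid, {}).get("price") != item["price"]]
    # ...the expensive detail fetch would happen here, for `to_fetch` only...
    cache.update({lid: fresh_listings[lid] for lid in to_fetch})
    with open(cache_path, "w") as f:
        json.dump(cache, f)
    return to_fetch

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "byty_demo.json")
    print(merge_with_cache(path, {"a1": {"price": 100}}))  # first run: ['a1']
    print(merge_with_cache(path, {"a1": {"price": 100}}))  # unchanged: []
```

Losing the cache file is what turns a restart into a full re-scrape, hence the volume.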

## Cron Schedule

Scrapers run at **06:00** and **18:00 Europe/Prague time** (CET/CEST).
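
This schedule comes from the single entry in `build/crontab`, added in this commit:

```
0 6,18 * * * cd /app && bash /app/run_all.sh >> /proc/1/fd/1 2>> /proc/1/fd/2
```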

Cron output is forwarded to the container's stdout/stderr, visible via `docker logs`.
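
The forwarding works because Linux exposes each process's open file descriptors under `/proc`: the crontab appends to `/proc/1/fd/1`, PID 1's stdout, which is what `docker logs` reads. A minimal demonstration of the same mechanism against our own stdout:

```shell
# /proc/<pid>/fd/1 is a symlink to that process's stdout; writing to it
# lands in the same stream. The crontab does this with <pid> = 1.
echo "scrape finished" >> /proc/self/fd/1
```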

## Operations

```bash
# View logs (including cron and scraper output)
docker logs -f maru-hleda-byt

# Check cron schedule
docker exec maru-hleda-byt crontab -l

# Trigger a manual scrape
docker exec maru-hleda-byt bash /app/run_all.sh

# Stop / start (data persists in volume)
docker stop maru-hleda-byt
docker start maru-hleda-byt

# Rebuild after code changes
docker stop maru-hleda-byt && docker rm maru-hleda-byt
docker build -f build/Dockerfile -t maru-hleda-byt .
docker run -d --name maru-hleda-byt \
  -p 8080:8080 \
  -v maru-hleda-byt-data:/app/data \
  --restart unless-stopped \
  maru-hleda-byt
```

## Troubleshooting

**Map shows 404**: the initial background scrape hasn't finished yet. Check `docker logs` for progress; the first run takes a few minutes due to rate-limited API calls.

**SSL errors from PSN scraper**: `scrape_psn.py` uses `curl` (not Python's `urllib`) specifically for Cloudflare SSL compatibility. Alpine's curl includes modern TLS via OpenSSL, so this should work. If not, check that `ca-certificates` is installed (`apk add ca-certificates`).

**Health check failing**: the health check has a 5-minute start period to allow the initial scrape to complete. If it still fails, verify the HTTP server is running: `docker exec maru-hleda-byt wget -q -O /dev/null http://localhost:8080/`.

**Timezone verification**: `docker exec maru-hleda-byt date` should show Czech time.

## Image Details

- **Base**: `python:3.13-alpine` (~55 MB)
- **Added packages**: `curl`, `bash`, `tzdata` (~10 MB)
- **No pip packages** — all scrapers use the Python standard library only
- **Approximate image size**: ~70 MB
build/Dockerfile (new file, 26 lines)
@@ -0,0 +1,26 @@
FROM python:3.13-alpine

RUN apk add --no-cache curl bash tzdata \
    && cp /usr/share/zoneinfo/Europe/Prague /etc/localtime \
    && echo "Europe/Prague" > /etc/timezone

ENV PYTHONUNBUFFERED=1

WORKDIR /app

COPY scrape_and_map.py scrape_realingo.py scrape_bezrealitky.py \
     scrape_idnes.py scrape_psn.py scrape_cityhome.py \
     merge_and_map.py regen_map.py run_all.sh ./

COPY build/crontab /etc/crontabs/root
COPY build/entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh run_all.sh

RUN mkdir -p /app/data

EXPOSE 8080

HEALTHCHECK --interval=60s --timeout=5s --start-period=300s \
    CMD wget -q -O /dev/null http://localhost:8080/ || exit 1

ENTRYPOINT ["/entrypoint.sh"]
build/Makefile (new file, 31 lines)
@@ -0,0 +1,31 @@
IMAGE_NAME := maru-hleda-byt
CONTAINER_NAME := maru-hleda-byt
VOLUME_NAME := maru-hleda-byt-data
PORT := 8080

.PHONY: build run stop logs scrape restart clean

build:
	docker build -f build/Dockerfile -t $(IMAGE_NAME) .

run: build
	docker run -d --name $(CONTAINER_NAME) \
		-p $(PORT):8080 \
		-v $(VOLUME_NAME):/app/data \
		--restart unless-stopped \
		$(IMAGE_NAME)
	@echo "Map will be at http://localhost:$(PORT)/mapa_bytu.html"

stop:
	docker stop $(CONTAINER_NAME) && docker rm $(CONTAINER_NAME)

logs:
	docker logs -f $(CONTAINER_NAME)

scrape:
	docker exec $(CONTAINER_NAME) bash /app/run_all.sh

restart: stop run

clean: stop
	docker rmi $(IMAGE_NAME)
build/crontab (new file, 1 line)
@@ -0,0 +1 @@
0 6,18 * * * cd /app && bash /app/run_all.sh >> /proc/1/fd/1 2>> /proc/1/fd/2
build/entrypoint.sh (new file, 22 lines)
@@ -0,0 +1,22 @@
#!/bin/bash
set -euo pipefail

DATA_DIR="/app/data"

# Create symlinks so scripts (which write to /app/) persist data to the volume
for f in byty_sreality.json byty_realingo.json byty_bezrealitky.json \
         byty_idnes.json byty_psn.json byty_cityhome.json byty_merged.json \
         mapa_bytu.html; do
    # Remove real file if it exists (e.g. baked into image); an if-block is
    # needed here because a failing bare `[ ... ] && ...` would abort under set -e
    if [ -f "/app/$f" ] && [ ! -L "/app/$f" ]; then
        rm -f "/app/$f"
    fi
    ln -sf "$DATA_DIR/$f" "/app/$f"
done

echo "[entrypoint] Starting crond..."
crond -b -l 2

echo "[entrypoint] Starting initial scrape in background..."
bash /app/run_all.sh &

echo "[entrypoint] Starting HTTP server on port 8080..."
exec python3 -m http.server 8080 --directory "$DATA_DIR"