15 Commits

Author SHA1 Message Date
59ef3274b6 Add CLAUDE.md project documentation for session context
Provides automatic context loading for new Claude Code sessions,
documenting architecture, filters, sources, and conventions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 09:58:01 +01:00
27e5b05f88 Add Bazoš.cz as new apartment scraper source
New scraper for reality.bazos.cz with full HTML parsing (no API),
GPS extraction from Google Maps links, panel/sídliště filtering,
floor/area parsing from free text, and pagination fix for Bazoš's
numeric locality codes. Integrated into merge pipeline and map
with purple (#7B1FA2) markers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 09:47:37 +01:00
63663e4b6b Merge pull request 'Move Realingo scraper to run last' (#6) from fix/scraper-order into main
All checks were successful
Build and Push / build (push) Successful in 6s
Reviewed-on: #6
2026-02-27 21:19:29 +00:00
8c052840cd Move Realingo scraper to run last in pipeline
Reorder scrapers: Sreality → Bezrealitky → iDNES → PSN+CityHome → Realingo → Merge

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 21:35:54 +01:00
39e4b9ce2a Merge pull request 'Reliability improvements and cleanup' (#5) from improve/reliability-and-fixes into main
Reviewed-on: #5
2026-02-27 10:26:04 +00:00
Jan Novak
fd3991f8d6 Remove regen_map.py references from Dockerfile and README
All checks were successful
Build and Push / build (push) Successful in 6s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-27 10:44:08 +01:00
Jan Novak
27a7834eb6 Reliability improvements: retry logic, validation, ratings sync
Some checks failed
Build and Push / build (push) Failing after 4s
- Add 3-attempt retry with exponential backoff to Sreality, Realingo,
  Bezrealitky, and PSN scrapers (CityHome and iDNES already had it)
- Add shared validate_listing() in scraper_stats.py; all 6 scrapers now
  validate GPS bounds, price, area, and required fields before output
- Wire ratings to server /api/ratings on page load (merge with
  localStorage) and save (async POST); ratings now persist across
  browsers and devices
- Namespace JS hash IDs as {source}_{id} to prevent rating collisions
  between listings from different portals with the same numeric ID
- Replace manual Czech diacritic table with unicodedata.normalize()
  in merge_and_map.py for correct deduplication of all edge cases
- Correct README schedule docs: every 4 hours, not twice daily

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-27 10:36:37 +01:00
57a9f6f21a Add NEW badge for recent listings, text input for price filter, cleanup
- New listings (≤1 day) show yellow NEW badge instead of oversized marker
- Price filter changed from dropdown to text input (max 14M)
- Cap price filter at 14M in JS
- Remove unused regen_map.py
- Remove unused HTMLParser import in scrape_idnes.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 21:14:48 +01:00
0ea31d3013 Remove tracked generated/data files and fix map link on status page
- Remove byty_*.json, mapa_bytu.html, .DS_Store and settings.local.json from git tracking
  (already in .gitignore, files kept locally)
- Fix "Otevřít mapu" link on scraper status page: / → /mapa_bytu.html

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 20:42:35 +01:00
Jan Novak
4304a42776 Track first_seen/last_changed per property, add map filters and clickable legend
All checks were successful
Build and Push / build (push) Successful in 6s
Scraper changes (all 6 sources):
- Add first_seen: date the hash_id was first scraped, never overwritten
- Add last_changed: date the price last changed (= first_seen when new)
- PSN and CityHome load previous output as a lightweight cache to compute these fields
- merge_and_map.py preserves earliest first_seen when deduplicating cross-source duplicates

Map popup:
- Show "Přidáno: YYYY-MM-DD" and "Změněno: YYYY-MM-DD" in each property popup
- NOVÉ badge and pulsing marker now driven by first_seen == today (more accurate than scraped_at)

Map filters (sidebar):
- New "Přidáno / změněno" dropdown: 1, 2, 3, 4, 5, 7, 14, 30 days or all
- Clickable price/m² legend bands: click to filter to that band, multi-select supported
- "✕ Zobrazit všechny ceny" reset link appears when any band is active

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-26 16:58:46 +01:00
23d208a5b7 Merge pull request 'Add scraper status collection and presentation' (#3) from add-scraper-statuses into main
Reviewed-on: #3
2026-02-26 09:04:23 +00:00
Jan Novak
00c9144010 Fix DATA_DIR usage in stats/history paths, set env in Dockerfile, add validation docs
All checks were successful
Build and Push / build (push) Successful in 5s
- scraper_stats.py: respect DATA_DIR env var when writing stats_*.json files
- generate_status.py: read stats files and write history from DATA_DIR instead of HERE
- build/Dockerfile: set DATA_DIR=/app/data as default env var
- docs/validation.md: end-to-end Docker validation recipe

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-26 09:46:16 +01:00
Jan Novak
44c02b45b4 Increase history retention to 20, run scrapers every 4 hours
All checks were successful
Build and Push / build (push) Successful in 7s
- generate_status.py: raise --keep default from 5 to 20 entries
- build/crontab: change schedule from 06:00/18:00 to every 4 hours (*/4)
  covers 6 runs/day ≈ 3.3 days of history at default retention

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-26 08:53:27 +01:00
Jan Novak
5fb3b984b6 Add status dashboard, server, scraper stats, and DATA_DIR support
All checks were successful
Build and Push / build (push) Successful in 7s
Key changes:
- Replace ratings_server.py + status.html with a unified server.py that
  serves the map, scraper status dashboard, and ratings API in one process
- Add scraper_stats.py utility: each scraper writes per-run stats (fetched,
  accepted, excluded, duration) to stats_<source>.json for the status page
- generate_status.py: respect DATA_DIR env var so status.json lands in the
  configured data directory instead of always the project root
- run_all.sh: replace the {"status":"running"} overwrite of status.json with
  a dedicated scraper_running.json lock file; trap on EXIT ensures cleanup
  even on kill/error, preventing the previous run's results from being wiped
- server.py: detect running state via scraper_running.json existence instead
  of status["status"] field, eliminating the dual-use race condition
- Makefile: add serve (local dev), debug (Docker debug container) targets;
  add SERVER_PORT variable
- build/Dockerfile + entrypoint.sh: switch to server.py, set DATA_DIR,
  adjust volume mounts
- .gitignore: add *.json and *.log to keep runtime data files out of VCS
- mapa_bytu.html: price-per-m² colouring, status link, UX tweaks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-26 00:30:25 +01:00
6f49533c94 Merge pull request 'Rewrite PSN + CityHome scrapers, add price/m² map coloring, ratings system, and status dashboard' (#2) from ui-tweaks/2026-02-17 into main
Reviewed-on: #2
2026-02-25 21:26:51 +00:00
33 changed files with 2035 additions and 33016 deletions

BIN
.DS_Store vendored

Binary file not shown.

View File

@@ -1,31 +0,0 @@
{
"permissions": {
"allow": [
"WebFetch(domain:github.com)",
"WebFetch(domain:www.sreality.cz)",
"WebFetch(domain:webscraping.pro)",
"WebFetch(domain:raw.githubusercontent.com)",
"Bash(python3:*)",
"Bash(open:*)",
"WebFetch(domain:www.realingo.cz)",
"WebFetch(domain:api.realingo.cz)",
"Bash(curl:*)",
"Bash(grep:*)",
"WebFetch(domain:www.realitni-pes.cz)",
"WebFetch(domain:www.bezrealitky.cz)",
"WebFetch(domain:apify.com)",
"WebFetch(domain:www.bezrealitky.com)",
"WebFetch(domain:reality.idnes.cz)",
"Bash(# Final checks: robots.txt and response time for rate limiting clues curl -s -L -H \"\"User-Agent: Mozilla/5.0 \\(Windows NT 10.0; Win64; x64\\) AppleWebKit/537.36 \\(KHTML, like Gecko\\) Chrome/120.0.0.0 Safari/537.36\"\" \"\"https://reality.idnes.cz/robots.txt\"\")",
"WebFetch(domain:www.cityhome.cz)",
"WebFetch(domain:www.psn.cz)",
"WebFetch(domain:www.city-home.cz)",
"WebFetch(domain:psn.cz)",
"WebFetch(domain:api.psn.cz)",
"Bash(done)",
"Bash(# Final summary: count total units across all projects\n# Get the total count from the unitsCountData we already extracted\necho \"\"From unitsCountData on /prodej page:\"\"\necho \"\" type_id 0 \\(Prodej bytů a ateliérů\\): 146\"\"\necho \"\" type_id 1 \\(Prodej komerčních nemovitostí\\): 14\"\"\necho \"\" type_id 2 \\(Pronájem bytů\\): 3\"\"\necho \"\" type_id 3 \\(Pronájem komerčních nemovitostí\\): 48\"\"\necho \"\"\"\"\necho \"\"Total for-sale projects: 19\"\"\necho \"\"\"\"\necho \"\"Disposition counts from the data:\"\"\npython3 << 'PYEOF'\n# Extract disposition counts from prodej page\nimport re\n\nwith open\\('/tmp/psn_prodej_p1.html', 'r', encoding='utf-8'\\) as f:\n html = f.read\\(\\)\n\n# Find disposition data\nidx = html.find\\('\\\\\\\\\"disposition\\\\\\\\\":['\\)\nif idx >= 0:\n chunk = html[idx:idx+2000].replace\\('\\\\\\\\\"', '\"'\\)\n # Extract name and count pairs\n import re\n pairs = re.findall\\(r'\"name\":\"\\([^\"]+\\)\",\"count\":\\(\\\\d+\\)', chunk\\)\n for name, count in pairs:\n print\\(f\" {name}: {count}\"\\)\nPYEOF)",
"Bash(ls:*)",
"Bash(chmod:*)"
]
}
}

5
.gitignore vendored
View File

@@ -1,3 +1,8 @@
.vscode/ .vscode/
__pycache__/ __pycache__/
.DS_Store
byty_*.json byty_*.json
*.json
*.log
mapa_bytu.html

124
CLAUDE.md Normal file
View File

@@ -0,0 +1,124 @@
# Maru hledá byt
Projekt pro hledání bytů v Praze. Scrapuje inzeráty ze 7 realitních portálů, filtruje, deduplikuje a generuje interaktivní mapu.
**Jazyk komunikace:** Čeština (uživatelka Marie). Kód a komentáře v kódu jsou mix CZ/EN.
## Architektura
```
run_all.sh (orchestrátor)
├─ scrape_and_map.py → byty_sreality.json (Sreality API)
├─ scrape_bezrealitky.py → byty_bezrealitky.json (HTML Apollo cache)
├─ scrape_idnes.py → byty_idnes.json (HTML regex)
├─ scrape_psn.py } → byty_psn.json (React API + curl)
├─ scrape_cityhome.py } → byty_cityhome.json (HTML tabulky)
├─ scrape_bazos.py → byty_bazos.json (HTML regex)
└─ scrape_realingo.py → byty_realingo.json (Next.js __NEXT_DATA__)
merge_and_map.py
├─ byty_merged.json (deduplikovaná data)
└─ mapa_bytu.html (Leaflet.js mapa)
generate_status.py → status.json + scraper_history.json
server.py (port 8080) → servíruje mapu + status page + ratings API
```
## Filtry (společné všem scraperům)
| Parametr | Hodnota | Poznámka |
|----------|---------|----------|
| Max cena | 13.5M Kč (Sreality/Realingo/Bezrealitky/iDNES), 14M Kč (PSN/CityHome/Bazoš) | Rozdíl je záměrný |
| Min plocha | 69 m² | |
| Min patro | 2. NP | 2. NP se na mapě označí varováním |
| Dispozice | 3+kk, 3+1, 4+kk, 4+1, 5+kk, 5+1, 6+ | |
| Region | Praha | |
| Vyloučit | panelové domy, sídliště | regex v popisu/polích |
## Klíčové soubory
- **scrape_and_map.py** — Sreality scraper + `generate_map()` funkce (sdílená, generuje HTML mapu)
- **merge_and_map.py** — sloučí 7 JSON zdrojů, deduplikuje (klíč: ulice + cena + plocha), volá `generate_map()`
- **scraper_stats.py** — utility: `validate_listing()` (validace povinných polí + GPS bounds) a `write_stats()`
- **generate_status.py** — generuje status.json a scraper_history.json z výstupů scraperů
- **server.py** — HTTP server (port 8080), endpointy: `/mapa_bytu.html`, `/scrapers-status`, `/api/ratings`, `/api/status`
- **run_all.sh** — orchestrátor, spouští scrapery postupně (PSN+CityHome paralelně), pak merge + status
## Mapa (mapa_bytu.html)
- Leaflet.js + CARTO tiles
- Barvy markerů podle ceny/m² (modrá < 110k → červená > 165k, šedá = neuvedeno)
- PSN/CityHome = srdíčkové markery (❤️)
- Nové inzeráty (≤ 1 den) = žlutý badge "NEW"
- Zamítnuté = zprůhledněné + 🚫 SVG overlay
- Oblíbené = hvězdička (⭐)
- Filtry: patro, max cena (input, default 13.5M, max 14M), datum přidání, skrýt zamítnuté, klik na cenový pás
- Ratings uložené v localStorage + sync na server `/api/ratings`
## Barvy zdrojů na mapě
```python
source_colors = {
"sreality": "#1976D2", # modrá
"realingo": "#00897B", # teal
"bezrealitky": "#E91E63", # růžová
"idnes": "#FF6F00", # oranžová
"psn": "#D32F2F", # červená
"cityhome": "#D32F2F", # červená
"bazos": "#7B1FA2", # fialová
}
```
## Deduplikace (merge_and_map.py)
- Klíč: `normalize_street(locality) + price + area`
- Normalizace ulice: první část před čárkou, lowercase, odstranění diakritiky, jen alfanumerické znaky
- PSN a CityHome mají prioritu (načtou se první)
## Vývoj
- **Git remote:** `https://gitea.home.hrajfrisbee.cz/littlemeat/maru-hleda-byt.git`
- **Gitea API token:** uložen v `.claude/settings.local.json`
- **Python 3.9+** kompatibilita (`from __future__ import annotations`)
- **Žádné pip závislosti** — jen stdlib (urllib, json, re, logging, pathlib, subprocess)
- **Docker:** `build/Dockerfile` (python:3.13-alpine), cron každé 4 hodiny
- Generované soubory (`byty_*.json`, `mapa_bytu.html`, `*.log`) jsou v `.gitignore`
## Typické úlohy
```bash
# Rychlý test scraperu
python3 scrape_bazos.py --max-pages 1 --max-properties 5 --log-level DEBUG
# Lokální validace (všechny scrapery s limity)
make validation-local
# Vygenerovat mapu z existujících dat
python3 merge_and_map.py
# Spustit server
python3 server.py # nebo: make serve
# Plný scrape
./run_all.sh
```
## Pořadí scraperů v run_all.sh
1. Sreality
2. Bezrealitky
3. iDNES
4. PSN + CityHome (paralelně)
5. Bazoš
6. Realingo (poslední — uživatelka ho nemá ráda)
7. Merge + mapa
8. Status generování
## Konvence
- Commit messages v angličtině, PR popis v angličtině
- Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- PRy přes Gitea API (viz create_pr.sh pattern v historii)
- Nové scrapery kopírují vzor z `scrape_bezrealitky.py`
- Každý scraper má argparse s `--max-pages`, `--max-properties`, `--log-level`

View File

@@ -3,9 +3,13 @@ CONTAINER_NAME := maru-hleda-byt
VOLUME_NAME := maru-hleda-byt-data VOLUME_NAME := maru-hleda-byt-data
VALIDATION_CONTAINER := maru-hleda-byt-validation VALIDATION_CONTAINER := maru-hleda-byt-validation
VALIDATION_VOLUME := maru-hleda-byt-validation-data VALIDATION_VOLUME := maru-hleda-byt-validation-data
DEBUG_CONTAINER := maru-hleda-byt-debug
DEBUG_VOLUME := maru-hleda-byt-debug-data
DEBUG_PORT ?= 8082
PORT := 8080 PORT := 8080
SERVER_PORT ?= 8080
.PHONY: build run stop logs scrape restart clean help validation validation-local validation-stop validation-local-debug .PHONY: build run stop logs scrape restart clean help serve validation validation-local validation-stop validation-local-debug debug debug-stop
help: help:
@echo "Available targets:" @echo "Available targets:"
@@ -20,6 +24,9 @@ help:
@echo " validation-local-debug - Run validation locally with DEBUG logging" @echo " validation-local-debug - Run validation locally with DEBUG logging"
@echo " restart - Restart the container (stop and run again)" @echo " restart - Restart the container (stop and run again)"
@echo " clean - Stop container and remove the Docker image" @echo " clean - Stop container and remove the Docker image"
@echo " serve - Start server.py locally on port 8080"
@echo " debug - Build and run debug Docker container with limited scrape (port $(DEBUG_PORT))"
@echo " debug-stop - Stop and remove the debug Docker container"
@echo " help - Show this help message" @echo " help - Show this help message"
build: build:
@@ -59,6 +66,27 @@ validation-stop:
@docker rm $(VALIDATION_CONTAINER) 2>/dev/null || true @docker rm $(VALIDATION_CONTAINER) 2>/dev/null || true
@echo "Validation container stopped and removed" @echo "Validation container stopped and removed"
debug: build
@docker stop $(DEBUG_CONTAINER) 2>/dev/null || true
@docker rm $(DEBUG_CONTAINER) 2>/dev/null || true
docker run -d --name $(DEBUG_CONTAINER) \
-p $(DEBUG_PORT):8080 \
-v $(DEBUG_VOLUME):/app/data \
-e LOG_LEVEL=DEBUG \
$(IMAGE_NAME)
@sleep 2
docker exec $(DEBUG_CONTAINER) bash /app/run_all.sh --max-pages 1 --max-properties 10
@echo "Debug app at http://localhost:$(DEBUG_PORT)/mapa_bytu.html"
@echo "Debug status at http://localhost:$(DEBUG_PORT)/scrapers-status"
debug-stop:
@docker stop $(DEBUG_CONTAINER) 2>/dev/null || true
@docker rm $(DEBUG_CONTAINER) 2>/dev/null || true
@echo "Debug container stopped and removed"
serve:
DATA_DIR=. SERVER_PORT=$(SERVER_PORT) python3 server.py
validation-local: validation-local:
./run_all.sh --max-pages 1 --max-properties 10 ./run_all.sh --max-pages 1 --max-properties 10

View File

@@ -83,10 +83,6 @@ Merges all `byty_*.json` files into `byty_merged.json` and generates `mapa_bytu.
**Deduplication logic:** Two listings are considered duplicates if they share the same normalized street name + price + area. PSN and CityHome have priority during dedup (loaded first), so their listings are kept over duplicates from other portals. **Deduplication logic:** Two listings are considered duplicates if they share the same normalized street name + price + area. PSN and CityHome have priority during dedup (loaded first), so their listings are kept over duplicates from other portals.
### `regen_map.py`
Regenerates the map from existing `byty_sreality.json` data without re-scraping. Fetches missing area values from the Sreality API, fixes URLs, and re-applies the area filter. Useful for tweaking map output after data has already been collected.
## Interactive map (`mapa_bytu.html`) ## Interactive map (`mapa_bytu.html`)
The generated map is a standalone HTML file using Leaflet.js with CARTO basemap tiles. Features: The generated map is a standalone HTML file using Leaflet.js with CARTO basemap tiles. Features:
@@ -151,7 +147,7 @@ The project includes a Docker setup for unattended operation with a cron-based s
│ PID 1: python3 -m http.server :8080 │ │ PID 1: python3 -m http.server :8080 │
│ serves /app/data/ │ │ serves /app/data/ │
│ │ │ │
│ crond: runs run_all.sh at 06:00/18:00 │ crond: runs run_all.sh every 4 hours
│ Europe/Prague timezone │ │ Europe/Prague timezone │
│ │ │ │
│ /app/ -- scripts (.py, .sh) │ │ /app/ -- scripts (.py, .sh) │
@@ -160,7 +156,7 @@ The project includes a Docker setup for unattended operation with a cron-based s
└─────────────────────────────────────────┘ └─────────────────────────────────────────┘
``` ```
On startup, the HTTP server starts immediately. The initial scrape runs in the background. Subsequent cron runs update data in-place twice daily at 06:00 and 18:00 CET/CEST. On startup, the HTTP server starts immediately. The initial scrape runs in the background. Subsequent cron runs update data in-place every 4 hours.
### Quick start ### Quick start
@@ -201,14 +197,13 @@ Validation targets run scrapers with `--max-pages 1 --max-properties 10` for a f
├── scrape_psn.py # PSN scraper ├── scrape_psn.py # PSN scraper
├── scrape_cityhome.py # CityHome scraper ├── scrape_cityhome.py # CityHome scraper
├── merge_and_map.py # Merge all sources + generate final map ├── merge_and_map.py # Merge all sources + generate final map
├── regen_map.py # Regenerate map from cached Sreality data
├── run_all.sh # Orchestrator script (runs all scrapers + merge) ├── run_all.sh # Orchestrator script (runs all scrapers + merge)
├── mapa_bytu.html # Generated interactive map (output) ├── mapa_bytu.html # Generated interactive map (output)
├── Makefile # Docker management + validation shortcuts ├── Makefile # Docker management + validation shortcuts
├── build/ ├── build/
│ ├── Dockerfile # Container image definition (python:3.13-alpine) │ ├── Dockerfile # Container image definition (python:3.13-alpine)
│ ├── entrypoint.sh # Container entrypoint (HTTP server + cron + initial scrape) │ ├── entrypoint.sh # Container entrypoint (HTTP server + cron + initial scrape)
│ ├── crontab # Cron schedule (06:00 and 18:00 CET) │ ├── crontab # Cron schedule (every 4 hours)
│ └── CONTAINER.md # Container-specific documentation │ └── CONTAINER.md # Container-specific documentation
└── .gitignore # Ignores byty_*.json, __pycache__, .vscode └── .gitignore # Ignores byty_*.json, __pycache__, .vscode
``` ```

View File

@@ -5,12 +5,14 @@ RUN apk add --no-cache curl bash tzdata \
&& echo "Europe/Prague" > /etc/timezone && echo "Europe/Prague" > /etc/timezone
ENV PYTHONUNBUFFERED=1 ENV PYTHONUNBUFFERED=1
ENV DATA_DIR=/app/data
WORKDIR /app WORKDIR /app
COPY scrape_and_map.py scrape_realingo.py scrape_bezrealitky.py \ COPY scrape_and_map.py scrape_realingo.py scrape_bezrealitky.py \
scrape_idnes.py scrape_psn.py scrape_cityhome.py \ scrape_idnes.py scrape_psn.py scrape_cityhome.py \
merge_and_map.py regen_map.py run_all.sh ratings_server.py ./ merge_and_map.py generate_status.py scraper_stats.py \
run_all.sh server.py ./
COPY build/crontab /etc/crontabs/root COPY build/crontab /etc/crontabs/root
COPY build/entrypoint.sh /entrypoint.sh COPY build/entrypoint.sh /entrypoint.sh
@@ -18,7 +20,7 @@ RUN chmod +x /entrypoint.sh run_all.sh
RUN mkdir -p /app/data RUN mkdir -p /app/data
EXPOSE 8080 8081 EXPOSE 8080
HEALTHCHECK --interval=60s --timeout=5s --start-period=300s \ HEALTHCHECK --interval=60s --timeout=5s --start-period=300s \
CMD wget -q -O /dev/null http://localhost:8080/ || exit 1 CMD wget -q -O /dev/null http://localhost:8080/ || exit 1

View File

@@ -1 +1 @@
0 6,18 * * * cd /app && bash /app/run_all.sh >> /proc/1/fd/1 2>> /proc/1/fd/2 0 */4 * * * cd /app && bash /app/run_all.sh >> /proc/1/fd/1 2>> /proc/1/fd/2

View File

@@ -1,7 +1,7 @@
#!/bin/bash #!/bin/bash
set -euo pipefail set -euo pipefail
DATA_DIR="/app/data" export DATA_DIR="/app/data"
# Create symlinks so scripts (which write to /app/) persist data to the volume # Create symlinks so scripts (which write to /app/) persist data to the volume
for f in byty_sreality.json byty_realingo.json byty_bezrealitky.json \ for f in byty_sreality.json byty_realingo.json byty_bezrealitky.json \
@@ -18,8 +18,5 @@ crond -b -l 2
echo "[entrypoint] Starting initial scrape in background..." echo "[entrypoint] Starting initial scrape in background..."
bash /app/run_all.sh & bash /app/run_all.sh &
echo "[entrypoint] Starting ratings API server on port 8081..." echo "[entrypoint] Starting server on port 8080..."
DATA_DIR="$DATA_DIR" python3 /app/ratings_server.py & exec python3 /app/server.py
echo "[entrypoint] Starting HTTP server on port 8080..."
exec python3 -m http.server 8080 --directory "$DATA_DIR"

View File

@@ -1,427 +0,0 @@
[
{
"hash_id": 990183,
"name": "Prodej bytu 3+kk 86 m²",
"price": 10385000,
"price_formatted": "10 385 000 Kč",
"locality": "Ke Tvrzi, Praha - Královice",
"lat": 50.0390519,
"lon": 14.63862,
"disposition": "3+kk",
"floor": 2,
"area": 86,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/990183-nabidka-prodej-bytu-ke-tvrzi-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 989862,
"name": "Prodej bytu 3+kk 73 m²",
"price": 12790000,
"price_formatted": "12 790 000 Kč",
"locality": "Vrázova, Praha - Smíchov",
"lat": 50.0711312,
"lon": 14.4076652,
"disposition": "3+kk",
"floor": 3,
"area": 73,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/989862-nabidka-prodej-bytu-vrazova-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 981278,
"name": "Prodej bytu 3+kk 70 m²",
"price": 11890000,
"price_formatted": "11 890 000 Kč",
"locality": "Argentinská, Praha - Holešovice",
"lat": 50.1026043,
"lon": 14.4435365,
"disposition": "3+kk",
"floor": 3,
"area": 70,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/981278-nabidka-prodej-bytu-argentinska-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 989817,
"name": "Prodej bytu 3+kk 88 m²",
"price": 13490000,
"price_formatted": "13 490 000 Kč",
"locality": "Miroslava Hajna, Praha - Letňany",
"lat": 50.1406487,
"lon": 14.5207541,
"disposition": "3+kk",
"floor": 2,
"area": 88,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/989817-nabidka-prodej-bytu-miroslava-hajna-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 970257,
"name": "Prodej bytu 3+1 106 m²",
"price": 12950000,
"price_formatted": "12 950 000 Kč",
"locality": "Novákových, Praha - Libeň",
"lat": 50.1034771,
"lon": 14.4758735,
"disposition": "3+1",
"floor": 5,
"area": 106,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/970257-nabidka-prodej-bytu-novakovych-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 972406,
"name": "Prodej bytu 3+kk 83 m²",
"price": 10490000,
"price_formatted": "10 490 000 Kč",
"locality": "Na Výrovně, Praha - Stodůlky",
"lat": 50.0396067,
"lon": 14.3167022,
"disposition": "3+kk",
"floor": 2,
"area": 83,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/972406-nabidka-prodej-bytu-na-vyrovne",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 967142,
"name": "Prodej bytu 3+kk 78 m²",
"price": 11648000,
"price_formatted": "11 648 000 Kč",
"locality": "Na Míčánkách, Praha - Vršovice",
"lat": 50.0713284,
"lon": 14.4638722,
"disposition": "3+kk",
"floor": 6,
"area": 78,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/967142-nabidka-prodej-bytu-na-micankach",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 955977,
"name": "Prodej bytu 4+kk 75 m²",
"price": 10363000,
"price_formatted": "10 363 000 Kč",
"locality": "Karla Guta, Praha - Uhříněves",
"lat": 50.03017,
"lon": 14.5940072,
"disposition": "4+kk",
"floor": 4,
"area": 75,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/955977-nabidka-prodej-bytu-karla-guta",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 974557,
"name": "Prodej bytu 4+kk 94 m²",
"price": 13499900,
"price_formatted": "13 499 900 Kč",
"locality": "V Dolině, Praha - Michle",
"lat": 50.0579963,
"lon": 14.4682887,
"disposition": "4+kk",
"floor": 8,
"area": 94,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/974557-nabidka-prodej-bytu-v-doline-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 988498,
"name": "Prodej bytu 3+1 75 m²",
"price": 11400000,
"price_formatted": "11 400 000 Kč",
"locality": "5. května, Praha - Nusle",
"lat": 50.0604096,
"lon": 14.4326302,
"disposition": "3+1",
"floor": 4,
"area": 75,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/988498-nabidka-prodej-bytu-5-kvetna-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 985285,
"name": "Prodej bytu 3+kk 70 m²",
"price": 12200000,
"price_formatted": "12 200 000 Kč",
"locality": "Klausova, Praha - Stodůlky",
"lat": 50.0370204,
"lon": 14.3432643,
"disposition": "3+kk",
"floor": 5,
"area": 70,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/985285-nabidka-prodej-bytu-klausova-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 965526,
"name": "Prodej bytu 3+kk 77 m²",
"price": 11890000,
"price_formatted": "11 890 000 Kč",
"locality": "Vinohradská, Praha - Strašnice",
"lat": 50.0776726,
"lon": 14.4870072,
"disposition": "3+kk",
"floor": 16,
"area": 77,
"building_type": "Smíšená",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/965526-nabidka-prodej-bytu-vinohradska-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 924811,
"name": "Prodej bytu 3+kk 75 m²",
"price": 13390000,
"price_formatted": "13 390 000 Kč",
"locality": "Waltariho, Praha - Hloubětín",
"lat": 50.1076717,
"lon": 14.5248559,
"disposition": "3+kk",
"floor": 4,
"area": 75,
"building_type": "Smíšená",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/924811-nabidka-prodej-bytu-waltariho-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 985859,
"name": "Prodej bytu 3+1 80 m²",
"price": 9000000,
"price_formatted": "9 000 000 Kč",
"locality": "Staňkova, Praha - Háje",
"lat": 50.0377128,
"lon": 14.5311557,
"disposition": "3+1",
"floor": 2,
"area": 80,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/985859-nabidka-prodej-bytu-stankova-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 985583,
"name": "Prodej bytu 3+kk 76 m²",
"price": 10850000,
"price_formatted": "10 850 000 Kč",
"locality": "Boloňská, Praha - Horní Měcholupy",
"lat": 50.047328,
"lon": 14.5565277,
"disposition": "3+kk",
"floor": 4,
"area": 76,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/985583-nabidka-prodej-bytu-bolonska-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 981178,
"name": "Prodej bytu 4+kk 86 m²",
"price": 11990000,
"price_formatted": "11 990 000 Kč",
"locality": "Sušilova, Praha - Uhříněves",
"lat": 50.032081,
"lon": 14.5885148,
"disposition": "4+kk",
"floor": 2,
"area": 86,
"building_type": "SKELET",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/981178-nabidka-prodej-bytu-susilova-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 973216,
"name": "Prodej bytu 4+1 82 m²",
"price": 11357000,
"price_formatted": "11 357 000 Kč",
"locality": "Nad Kapličkou, Praha - Strašnice",
"lat": 50.0839509,
"lon": 14.4904493,
"disposition": "4+1",
"floor": 2,
"area": 82,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/973216-nabidka-prodej-bytu-nad-kaplickou-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 868801,
"name": "Prodej bytu 3+kk 109 m²",
"price": 7299000,
"price_formatted": "7 299 000 Kč",
"locality": "Pod Karlovem, Praha - Vinohrady",
"lat": 50.0676313,
"lon": 14.432498,
"disposition": "3+kk",
"floor": 5,
"area": 109,
"building_type": "Cihlová",
"ownership": "Družstevní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/868801-nabidka-prodej-bytu-pod-karlovem-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 868795,
"name": "Prodej bytu 3+kk 106 m²",
"price": 6299000,
"price_formatted": "6 299 000 Kč",
"locality": "Pod Karlovem, Praha - Vinohrady",
"lat": 50.0676313,
"lon": 14.432498,
"disposition": "3+kk",
"floor": 2,
"area": 106,
"building_type": "Cihlová",
"ownership": "Družstevní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/868795-nabidka-prodej-bytu-pod-karlovem-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 981890,
"name": "Prodej bytu 3+1 84 m²",
"price": 12980000,
"price_formatted": "12 980 000 Kč",
"locality": "Novákových, Praha - Libeň",
"lat": 50.103273,
"lon": 14.4746894,
"disposition": "3+1",
"floor": 2,
"area": 84,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/981890-nabidka-prodej-bytu-novakovych-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 976276,
"name": "Prodej bytu 3+kk 75 m²",
"price": 13490000,
"price_formatted": "13 490 000 Kč",
"locality": "Svornosti, Praha - Smíchov",
"lat": 50.0673284,
"lon": 14.4095087,
"disposition": "3+kk",
"floor": 2,
"area": 75,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/976276-nabidka-prodej-bytu-svornosti-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 950787,
"name": "Prodej bytu 3+kk 70 m²",
"price": 9999000,
"price_formatted": "9 999 000 Kč",
"locality": "Sečská, Praha - Strašnice",
"lat": 50.071191,
"lon": 14.5035501,
"disposition": "3+kk",
"floor": 3,
"area": 70,
"building_type": "Smíšená",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/950787-nabidka-prodej-bytu-secska-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 978045,
"name": "Prodej bytu 3+kk 76 m²",
"price": 11133000,
"price_formatted": "11 133 000 Kč",
"locality": "K Vinoři, Praha - Kbely",
"lat": 50.1329656,
"lon": 14.5618499,
"disposition": "3+kk",
"floor": 2,
"area": 76,
"building_type": "Smíšená",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/978045-nabidka-prodej-bytu-k-vinori",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 974552,
"name": "Prodej bytu 3+1 75 m²",
"price": 11000000,
"price_formatted": "11 000 000 Kč",
"locality": "Vejražkova, Praha - Košíře",
"lat": 50.0637808,
"lon": 14.3612275,
"disposition": "3+1",
"floor": 2,
"area": 75,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/974552-nabidka-prodej-bytu-vejrazkova-praha",
"source": "bezrealitky",
"image": ""
},
{
"hash_id": 955010,
"name": "Prodej bytu 3+kk 70 m²",
"price": 12290000,
"price_formatted": "12 290 000 Kč",
"locality": "Břeclavská, Praha - Kyje",
"lat": 50.0951045,
"lon": 14.5454237,
"disposition": "3+kk",
"floor": 2,
"area": 70,
"building_type": "Cihlová",
"ownership": "Osobní",
"url": "https://www.bezrealitky.cz/nemovitosti-byty-domy/955010-nabidka-prodej-bytu-breclavska-hlavni-mesto-praha",
"source": "bezrealitky",
"image": ""
}
]

View File

@@ -1 +0,0 @@
[]

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@@ -1 +0,0 @@
[]

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

123
docs/validation.md Normal file
View File

@@ -0,0 +1,123 @@
# Validation Recipe
End-to-end check that scraping, data persistence, history, and the status page all work correctly in Docker.
## What it verifies
- All scrapers run and write output to `DATA_DIR` (`/app/data`)
- `stats_*.json` land in `/app/data/` (not in `/app/`)
- `status.json` and `scraper_history.json` land in `/app/data/`
- `/api/status`, `/api/status/history`, and `/scrapers-status` serve correct data
- History accumulates across runs
## Steps
### 1. Build the image
```bash
make build
```
### 2. Start a clean validation container
```bash
# Stop/remove any leftover container and volume from a previous run
docker stop maru-hleda-byt-validation 2>/dev/null; docker rm maru-hleda-byt-validation 2>/dev/null
docker volume rm maru-hleda-byt-validation-data 2>/dev/null
docker run -d --name maru-hleda-byt-validation \
-p 8081:8080 \
-v maru-hleda-byt-validation-data:/app/data \
maru-hleda-byt
```
Give the container ~3 seconds to start. The entrypoint launches a background full scrape automatically — suppress it so only controlled runs execute:
```bash
sleep 3
docker exec maru-hleda-byt-validation pkill -f run_all.sh 2>/dev/null || true
docker exec maru-hleda-byt-validation rm -f /app/data/scraper_running.json 2>/dev/null || true
```
### 3. Run a limited scrape (run 1)
```bash
docker exec maru-hleda-byt-validation bash /app/run_all.sh --max-pages 1 --max-properties 10
```
Expected output (last few lines):
```
Status uložen: /app/data/status.json
Historie uložena: /app/data/scraper_history.json (1 záznamů)
```
### 4. Verify data files are in `/app/data/`
```bash
docker exec maru-hleda-byt-validation ls /app/data/
```
Expected files:
```
byty_cityhome.json byty_idnes.json byty_merged.json
byty_realingo.json byty_sreality.json
mapa_bytu.html
scraper_history.json
stats_bezrealitky.json stats_cityhome.json stats_idnes.json
stats_realingo.json stats_sreality.json
status.json
```
### 5. Run a second limited scrape (run 2)
```bash
docker exec maru-hleda-byt-validation bash /app/run_all.sh --max-pages 1 --max-properties 10
```
Expected last line: `Historie uložena: /app/data/scraper_history.json (2 záznamů)`
### 6. Verify history via API
```bash
curl -s http://localhost:8081/api/status/history | python3 -c "
import json, sys
h = json.load(sys.stdin)
print(f'{len(h)} entries:')
for i, e in enumerate(h):
print(f' [{i}] {e[\"timestamp\"]} total={e[\"total_accepted\"]}')
"
```
Expected: 2 entries with different timestamps.
```bash
curl -s http://localhost:8081/api/status | python3 -c "
import json, sys; s=json.load(sys.stdin)
print(f'status={s[\"status\"]} total={s[\"total_accepted\"]} ts={s[\"timestamp\"]}')
"
```
Expected: `status=done total=<N> ts=<latest timestamp>`
### 7. Check the status page
Open http://localhost:8081/scrapers-status in a browser (or `curl -s http://localhost:8081/scrapers-status | grep -c "clickable-row"` — should print `2`).
### 8. Clean up
```bash
docker stop maru-hleda-byt-validation && docker rm maru-hleda-byt-validation
docker volume rm maru-hleda-byt-validation-data
```
Or use the Makefile shortcut:
```bash
make validation-stop
```
## Notes
- PSN scraper does not support `--max-pages` and will always fail with this command; `success=False` in history is expected during validation.
- Bezrealitky may return 0 results with a 1-page limit; `byty_bezrealitky.json` will be absent from `/app/data/` in that case — this is normal.
- `make validation` (the Makefile target) runs the same limited scrape but does not suppress the background startup scrape, so two concurrent runs may occur. Use the manual steps above for a clean controlled test.

View File

@@ -1,16 +1,15 @@
#!/usr/bin/env python3 #!/usr/bin/env python3
"""Generate status.json from scraper JSON outputs and run log.""" """Generate status.json from scraper JSON outputs and per-scraper stats files."""
from __future__ import annotations from __future__ import annotations
import argparse
import json import json
import os import os
import re
import sys
from datetime import datetime from datetime import datetime
from pathlib import Path from pathlib import Path
from typing import Optional
HERE = Path(__file__).parent HERE = Path(__file__).parent
DATA_DIR = Path(os.environ.get("DATA_DIR", HERE))
SOURCE_FILES = { SOURCE_FILES = {
"Sreality": "byty_sreality.json", "Sreality": "byty_sreality.json",
@@ -21,7 +20,17 @@ SOURCE_FILES = {
"CityHome": "byty_cityhome.json", "CityHome": "byty_cityhome.json",
} }
STATS_FILES = {
"Sreality": "stats_sreality.json",
"Realingo": "stats_realingo.json",
"Bezrealitky": "stats_bezrealitky.json",
"iDNES": "stats_idnes.json",
"PSN": "stats_psn.json",
"CityHome": "stats_cityhome.json",
}
MERGED_FILE = "byty_merged.json" MERGED_FILE = "byty_merged.json"
HISTORY_FILE = "scraper_history.json"
def count_source(path: Path) -> dict: def count_source(path: Path) -> dict:
@@ -36,105 +45,51 @@ def count_source(path: Path) -> dict:
return {"accepted": 0, "error": str(e)} return {"accepted": 0, "error": str(e)}
def parse_log(log_path: str) -> dict[str, dict]: def read_scraper_stats(path: Path) -> dict:
"""Parse scraper run log and extract per-source statistics. """Load a per-scraper stats JSON. Returns {} on missing or corrupt file."""
if not path.exists():
Scrapers log summary lines like: return {}
✓ Vyhovující byty: 12 try:
Vyloučeno (prodáno): 5 data = json.loads(path.read_text(encoding="utf-8"))
Staženo stránek: 3 return data if isinstance(data, dict) else {}
Staženo inzerátů: 48 except Exception:
Celkem bytů v cache: 120
and section headers like:
[2/6] Realingo
"""
if not log_path or not os.path.exists(log_path):
return {} return {}
with open(log_path, encoding="utf-8") as f:
content = f.read()
# Split into per-source sections by the [N/6] Step header def append_to_history(status: dict, keep: int) -> None:
# Each section header looks like "[2/6] Realingo\n----..." """Append the current status entry to scraper_history.json, keeping only `keep` latest."""
section_pattern = re.compile(r'\[(\d+)/\d+\]\s+(.+)\n-+', re.MULTILINE) history_path = DATA_DIR / HISTORY_FILE
sections_found = list(section_pattern.finditer(content)) history: list = []
if history_path.exists():
try:
history = json.loads(history_path.read_text(encoding="utf-8"))
if not isinstance(history, list):
history = []
except Exception:
history = []
if not sections_found: history.append(status)
return {}
stats = {} # Keep only the N most recent entries
for i, match in enumerate(sections_found): if keep > 0 and len(history) > keep:
step_name = match.group(2).strip() history = history[-keep:]
start = match.end()
end = sections_found[i + 1].start() if i + 1 < len(sections_found) else len(content)
section_text = content[start:end]
# Identify which sources this section covers history_path.write_text(json.dumps(history, ensure_ascii=False, indent=2), encoding="utf-8")
# "PSN + CityHome" covers both print(f"Historie uložena: {history_path} ({len(history)} záznamů)")
source_names = []
for name in SOURCE_FILES:
if name.lower() in step_name.lower():
source_names.append(name)
if not source_names:
continue
# Parse numeric summary lines
def extract(pattern: str) -> Optional[int]:
m = re.search(pattern, section_text)
return int(m.group(1)) if m else None
# Lines present in all/most scrapers
accepted = extract(r'Vyhovující byty[:\s]+(\d+)')
fetched = extract(r'Staženo inzerátů[:\s]+(\d+)')
pages = extract(r'Staženo stránek[:\s]+(\d+)')
cached = extract(r'Celkem bytů v cache[:\s]+(\d+)')
cache_hits = extract(r'Cache hit[:\s]+(\d+)')
# Rejection reasons — collect all into a dict
excluded = {}
for m in re.finditer(r'Vyloučeno\s+\(([^)]+)\)[:\s]+(\d+)', section_text):
excluded[m.group(1)] = int(m.group(2))
# Also PSN-style "Vyloučeno (prodáno): N"
total_excluded = sum(excluded.values()) if excluded else extract(r'Vyloučen\w*[:\s]+(\d+)')
entry = {}
if accepted is not None:
entry["accepted"] = accepted
if fetched is not None:
entry["fetched"] = fetched
if pages is not None:
entry["pages"] = pages
if cached is not None:
entry["cached"] = cached
if cache_hits is not None:
entry["cache_hits"] = cache_hits
if excluded:
entry["excluded"] = excluded
elif total_excluded is not None:
entry["excluded_total"] = total_excluded
for name in source_names:
stats[name] = entry
return stats
def main(): def main():
start_time = None parser = argparse.ArgumentParser(description="Generate status.json from scraper outputs.")
duration_sec = None parser.add_argument("--start-time", dest="start_time", default=None,
help="ISO timestamp of scrape start (default: now)")
parser.add_argument("--duration", dest="duration", type=int, default=None,
help="Run duration in seconds")
parser.add_argument("--keep", dest="keep", type=int, default=20,
help="Number of history entries to keep (default: 20, 0=unlimited)")
args = parser.parse_args()
if len(sys.argv) >= 3: start_time = args.start_time or datetime.now().isoformat(timespec="seconds")
start_time = sys.argv[1] duration_sec = args.duration
try:
duration_sec = int(sys.argv[2])
except ValueError:
pass
if not start_time:
start_time = datetime.now().isoformat(timespec="seconds")
log_path = sys.argv[3] if len(sys.argv) >= 4 else None
log_stats = parse_log(log_path)
sources = [] sources = []
for name, filename in SOURCE_FILES.items(): for name, filename in SOURCE_FILES.items():
@@ -142,14 +97,12 @@ def main():
info = count_source(path) info = count_source(path)
info["name"] = name info["name"] = name
# Merge log stats # Merge in stats from the per-scraper stats file (authoritative for run data)
ls = log_stats.get(name, {}) stats = read_scraper_stats(DATA_DIR / STATS_FILES[name])
for k in ("fetched", "pages", "cached", "cache_hits", "excluded", "excluded_total"): for key in ("accepted", "fetched", "pages", "cache_hits", "excluded", "excluded_total",
if k in ls: "success", "duration_sec", "error"):
info[k] = ls[k] if key in stats:
# Override accepted from log if available (log is authoritative for latest run) info[key] = stats[key]
if "accepted" in ls:
info["accepted"] = ls["accepted"]
sources.append(info) sources.append(info)
@@ -168,17 +121,21 @@ def main():
duplicates_removed = total_accepted - deduplicated if deduplicated else 0 duplicates_removed = total_accepted - deduplicated if deduplicated else 0
# Top-level success: True if no source has an error
success = not any("error" in s for s in sources)
status = { status = {
"status": "done", "status": "done",
"timestamp": start_time, "timestamp": start_time,
"duration_sec": duration_sec, "duration_sec": duration_sec,
"success": success,
"total_accepted": total_accepted, "total_accepted": total_accepted,
"deduplicated": deduplicated, "deduplicated": deduplicated,
"duplicates_removed": duplicates_removed, "duplicates_removed": duplicates_removed,
"sources": sources, "sources": sources,
} }
out = HERE / "status.json" out = DATA_DIR / "status.json"
out.write_text(json.dumps(status, ensure_ascii=False, indent=2), encoding="utf-8") out.write_text(json.dumps(status, ensure_ascii=False, indent=2), encoding="utf-8")
print(f"Status uložen: {out}") print(f"Status uložen: {out}")
print(f" Celkem bytů (před dedup): {total_accepted}") print(f" Celkem bytů (před dedup): {total_accepted}")
@@ -197,6 +154,8 @@ def main():
parts.append(f"[CHYBA: {err}]") parts.append(f"[CHYBA: {err}]")
print(" " + " ".join(parts)) print(" " + " ".join(parts))
append_to_history(status, args.keep)
if __name__ == "__main__": if __name__ == "__main__":
main() main()

File diff suppressed because it is too large Load Diff

View File

@@ -1,6 +1,6 @@
#!/usr/bin/env python3 #!/usr/bin/env python3
""" """
Sloučí data ze Sreality, Realinga, Bezrealitek, iDNES, PSN a CityHome, Sloučí data ze Sreality, Realinga, Bezrealitek, iDNES, PSN, CityHome a Bazoše,
deduplikuje a vygeneruje mapu. deduplikuje a vygeneruje mapu.
Deduplikace: stejná ulice (z locality) + stejná cena + stejná plocha = duplikát. Deduplikace: stejná ulice (z locality) + stejná cena + stejná plocha = duplikát.
PSN a CityHome mají při deduplikaci prioritu (načtou se první). PSN a CityHome mají při deduplikaci prioritu (načtou se první).
@@ -9,6 +9,7 @@ from __future__ import annotations
import json import json
import re import re
import unicodedata
from pathlib import Path from pathlib import Path
from scrape_and_map import generate_map, format_price from scrape_and_map import generate_map, format_price
@@ -19,14 +20,8 @@ def normalize_street(locality: str) -> str:
# "Studentská, Praha 6 - Dejvice" → "studentska" # "Studentská, Praha 6 - Dejvice" → "studentska"
# "Rýnská, Praha" → "rynska" # "Rýnská, Praha" → "rynska"
street = locality.split(",")[0].strip().lower() street = locality.split(",")[0].strip().lower()
# Remove diacritics (simple Czech) # Remove diacritics using Unicode decomposition (handles all Czech characters)
replacements = { street = unicodedata.normalize("NFKD", street).encode("ascii", "ignore").decode("ascii")
"á": "a", "č": "c", "ď": "d", "é": "e", "ě": "e",
"í": "i", "ň": "n", "ó": "o", "ř": "r", "š": "s",
"ť": "t", "ú": "u", "ů": "u", "ý": "y", "ž": "z",
}
for src, dst in replacements.items():
street = street.replace(src, dst)
# Remove non-alphanumeric # Remove non-alphanumeric
street = re.sub(r"[^a-z0-9]", "", street) street = re.sub(r"[^a-z0-9]", "", street)
return street return street
@@ -49,6 +44,7 @@ def main():
("Realingo", "byty_realingo.json"), ("Realingo", "byty_realingo.json"),
("Bezrealitky", "byty_bezrealitky.json"), ("Bezrealitky", "byty_bezrealitky.json"),
("iDNES", "byty_idnes.json"), ("iDNES", "byty_idnes.json"),
("Bazoš", "byty_bazos.json"),
] ]
all_estates = [] all_estates = []
@@ -79,6 +75,10 @@ def main():
if key in seen_keys: if key in seen_keys:
dupes += 1 dupes += 1
existing = seen_keys[key] existing = seen_keys[key]
# Preserve earliest first_seen across sources
dup_fs = e.get("first_seen", "")
if dup_fs and (not existing.get("first_seen") or dup_fs < existing["first_seen"]):
existing["first_seen"] = dup_fs
# Log it # Log it
print(f" Duplikát: {e['locality']} | {format_price(e['price'])} | {e.get('area', '?')}" print(f" Duplikát: {e['locality']} | {format_price(e['price'])} | {e.get('area', '?')}"
f"({e.get('source', '?')} vs {existing.get('source', '?')})") f"({e.get('source', '?')} vs {existing.get('source', '?')})")

View File

@@ -1,116 +0,0 @@
#!/usr/bin/env python3
"""
Minimal HTTP API server for persisting apartment ratings.
GET /api/ratings → returns ratings.json contents
POST /api/ratings → saves entire ratings object
GET /api/ratings/export → same as GET, but with download header
Ratings file: /app/data/ratings.json (or ./ratings.json locally)
"""
import json
import logging
import os
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
PORT = int(os.environ.get("RATINGS_PORT", 8081))
DATA_DIR = Path(os.environ.get("DATA_DIR", "."))
RATINGS_FILE = DATA_DIR / "ratings.json"
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [ratings] %(levelname)s %(message)s",
datefmt="%Y-%m-%dT%H:%M:%S",
)
log = logging.getLogger(__name__)
def load_ratings() -> dict:
try:
if RATINGS_FILE.exists():
return json.loads(RATINGS_FILE.read_text(encoding="utf-8"))
except Exception as e:
log.error("Failed to load ratings: %s", e)
return {}
def save_ratings(data: dict) -> None:
RATINGS_FILE.write_text(
json.dumps(data, ensure_ascii=False, indent=2),
encoding="utf-8",
)
class RatingsHandler(BaseHTTPRequestHandler):
def log_message(self, format, *args):
# Suppress default HTTP access log (we use our own)
pass
def _send_json(self, status: int, body: dict, extra_headers=None):
payload = json.dumps(body, ensure_ascii=False).encode("utf-8")
self.send_response(status)
self.send_header("Content-Type", "application/json; charset=utf-8")
self.send_header("Content-Length", str(len(payload)))
self.send_header("Access-Control-Allow-Origin", "*")
self.send_header("Access-Control-Allow-Methods", "GET, POST, OPTIONS")
self.send_header("Access-Control-Allow-Headers", "Content-Type")
if extra_headers:
for k, v in extra_headers.items():
self.send_header(k, v)
self.end_headers()
self.wfile.write(payload)
def do_OPTIONS(self):
# CORS preflight
self.send_response(204)
self.send_header("Access-Control-Allow-Origin", "*")
self.send_header("Access-Control-Allow-Methods", "GET, POST, OPTIONS")
self.send_header("Access-Control-Allow-Headers", "Content-Type")
self.end_headers()
def do_GET(self):
if self.path in ("/api/ratings", "/api/ratings/export"):
ratings = load_ratings()
extra = None
if self.path == "/api/ratings/export":
extra = {"Content-Disposition": 'attachment; filename="ratings.json"'}
log.info("GET %s%d ratings", self.path, len(ratings))
self._send_json(200, ratings, extra)
else:
self._send_json(404, {"error": "not found"})
def do_POST(self):
if self.path == "/api/ratings":
length = int(self.headers.get("Content-Length", 0))
if length == 0:
self._send_json(400, {"error": "empty body"})
return
try:
raw = self.rfile.read(length)
data = json.loads(raw.decode("utf-8"))
except Exception as e:
log.warning("Bad request body: %s", e)
self._send_json(400, {"error": "invalid JSON"})
return
if not isinstance(data, dict):
self._send_json(400, {"error": "expected JSON object"})
return
save_ratings(data)
log.info("POST /api/ratings → saved %d ratings", len(data))
self._send_json(200, {"ok": True, "count": len(data)})
else:
self._send_json(404, {"error": "not found"})
if __name__ == "__main__":
log.info("Ratings server starting on port %d, data dir: %s", PORT, DATA_DIR)
log.info("Ratings file: %s", RATINGS_FILE)
server = HTTPServer(("0.0.0.0", PORT), RatingsHandler)
try:
server.serve_forever()
except KeyboardInterrupt:
log.info("Stopped.")
sys.exit(0)

View File

@@ -1,114 +0,0 @@
#!/usr/bin/env python3
"""
Přegeneruje mapu z již stažených dat (byty_sreality.json).
Doplní chybějící plochy ze Sreality API, opraví URL, aplikuje filtry.
"""
from __future__ import annotations
import json
import time
import urllib.request
from pathlib import Path
from scrape_and_map import (
generate_map, format_price, MIN_AREA, HEADERS, DETAIL_API
)
def api_get(url: str) -> dict:
req = urllib.request.Request(url, headers=HEADERS)
with urllib.request.urlopen(req, timeout=30) as resp:
return json.loads(resp.read().decode("utf-8"))
def fix_sreality_url(estate: dict) -> str:
"""Fix the Sreality URL to include disposition segment (only if missing)."""
disp = estate.get("disposition", "")
slug_map = {
"1+kk": "1+kk", "1+1": "1+1", "2+kk": "2+kk", "2+1": "2+1",
"3+kk": "3+kk", "3+1": "3+1", "4+kk": "4+kk", "4+1": "4+1",
"5+kk": "5+kk", "5+1": "5+1", "6+": "6-a-vice", "Atypický": "atypicky",
}
slug = slug_map.get(disp, "byt")
old_url = estate.get("url", "")
parts = old_url.split("/")
try:
byt_idx = parts.index("byt")
# Only insert if disposition slug is not already there
if byt_idx + 1 < len(parts) and parts[byt_idx + 1] == slug:
return old_url # already correct
parts.insert(byt_idx + 1, slug)
return "/".join(parts)
except ValueError:
return old_url
def fetch_area(hash_id: int) -> int | None:
"""Fetch area from detail API."""
try:
url = DETAIL_API.format(hash_id)
detail = api_get(url)
for item in detail.get("items", []):
name = item.get("name", "")
if "žitná ploch" in name or "zitna ploch" in name.lower():
return int(item["value"])
except Exception:
pass
return None
def main():
json_path = Path("byty_sreality.json")
if not json_path.exists():
print("Soubor byty_sreality.json nenalezen. Nejprve spusť scrape_and_map.py")
return
estates = json.loads(json_path.read_text(encoding="utf-8"))
print(f"Načteno {len(estates)} bytů z byty_sreality.json")
# Step 1: Fetch missing areas
missing_area = [e for e in estates if e.get("area") is None]
print(f"Doplňuji plochu u {len(missing_area)} bytů...")
for i, e in enumerate(missing_area):
time.sleep(0.3)
area = fetch_area(e["hash_id"])
if area is not None:
e["area"] = area
if (i + 1) % 50 == 0:
print(f" {i + 1}/{len(missing_area)} ...")
# Count results
with_area = sum(1 for e in estates if e.get("area") is not None)
print(f"Plocha doplněna: {with_area}/{len(estates)}")
# Step 2: Fix URLs
for e in estates:
e["url"] = fix_sreality_url(e)
# Step 3: Filter by min area
filtered = []
excluded = 0
for e in estates:
area = e.get("area")
if area is not None and area < MIN_AREA:
excluded += 1
continue
filtered.append(e)
print(f"Vyloučeno (< {MIN_AREA} m²): {excluded}")
print(f"Zbývá: {len(filtered)} bytů")
# Save updated data
filtered_path = Path("byty_sreality.json")
filtered_path.write_text(
json.dumps(filtered, ensure_ascii=False, indent=2),
encoding="utf-8",
)
# Generate map
generate_map(filtered)
if __name__ == "__main__":
main()

View File

@@ -13,15 +13,17 @@ RED='\033[0;31m'
BOLD='\033[1m' BOLD='\033[1m'
NC='\033[0m' NC='\033[0m'
TOTAL=6 TOTAL=7
CURRENT=0 CURRENT=0
FAILED=0 FAILED=0
START_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S") START_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
START_EPOCH=$(date +%s) START_EPOCH=$(date +%s)
LOG_FILE="$(pwd)/scrape_run.log" LOG_FILE="$(pwd)/scrape_run.log"
# Mark status as running # Mark scraper as running; cleaned up on exit (even on error/kill)
echo '{"status":"running"}' > status.json LOCK_FILE="${DATA_DIR:-.}/scraper_running.json"
echo '{"running":true,"started_at":"'"$START_TIME"'"}' > "$LOCK_FILE"
trap 'rm -f "$LOCK_FILE"' EXIT
show_help() { show_help() {
echo "Usage: ./run_all.sh [OPTIONS]" echo "Usage: ./run_all.sh [OPTIONS]"
@@ -32,16 +34,19 @@ show_help() {
echo " --max-pages N Maximální počet stránek ke stažení z každého zdroje" echo " --max-pages N Maximální počet stránek ke stažení z každého zdroje"
echo " --max-properties N Maximální počet nemovitostí ke stažení z každého zdroje" echo " --max-properties N Maximální počet nemovitostí ke stažení z každého zdroje"
echo " --log-level LEVEL Úroveň logování (DEBUG, INFO, WARNING, ERROR)" echo " --log-level LEVEL Úroveň logování (DEBUG, INFO, WARNING, ERROR)"
echo " --keep N Počet běhů v historii (výchozí: 5, 0=neomezeno)"
echo " -h, --help Zobrazí tuto nápovědu" echo " -h, --help Zobrazí tuto nápovědu"
echo "" echo ""
echo "Examples:" echo "Examples:"
echo " ./run_all.sh # plný běh" echo " ./run_all.sh # plný běh"
echo " ./run_all.sh --max-pages 1 --max-properties 10 # rychlý test" echo " ./run_all.sh --max-pages 1 --max-properties 10 # rychlý test"
echo " ./run_all.sh --log-level DEBUG # s debug logováním" echo " ./run_all.sh --log-level DEBUG # s debug logováním"
echo " ./run_all.sh --keep 10 # uchovej 10 běhů v historii"
} }
# Parse arguments # Parse arguments
SCRAPER_ARGS="" SCRAPER_ARGS=""
KEEP_ARG=""
while [[ $# -gt 0 ]]; do while [[ $# -gt 0 ]]; do
case $1 in case $1 in
-h|--help) -h|--help)
@@ -52,6 +57,10 @@ while [[ $# -gt 0 ]]; do
SCRAPER_ARGS="$SCRAPER_ARGS $1 $2" SCRAPER_ARGS="$SCRAPER_ARGS $1 $2"
shift 2 shift 2
;; ;;
--keep)
KEEP_ARG="--keep $2"
shift 2
;;
*) *)
echo "Unknown argument: $1" echo "Unknown argument: $1"
echo "" echo ""
@@ -75,9 +84,6 @@ exec > >(tee -a "$LOG_FILE") 2>&1
step "Sreality" step "Sreality"
python3 scrape_and_map.py $SCRAPER_ARGS || { echo -e "${RED}✗ Sreality selhalo${NC}"; FAILED=$((FAILED + 1)); } python3 scrape_and_map.py $SCRAPER_ARGS || { echo -e "${RED}✗ Sreality selhalo${NC}"; FAILED=$((FAILED + 1)); }
step "Realingo"
python3 scrape_realingo.py $SCRAPER_ARGS || { echo -e "${RED}✗ Realingo selhalo${NC}"; FAILED=$((FAILED + 1)); }
step "Bezrealitky" step "Bezrealitky"
python3 scrape_bezrealitky.py $SCRAPER_ARGS || { echo -e "${RED}✗ Bezrealitky selhalo${NC}"; FAILED=$((FAILED + 1)); } python3 scrape_bezrealitky.py $SCRAPER_ARGS || { echo -e "${RED}✗ Bezrealitky selhalo${NC}"; FAILED=$((FAILED + 1)); }
@@ -92,6 +98,12 @@ PID_CH=$!
wait $PID_PSN || { echo -e "${RED}✗ PSN selhalo${NC}"; FAILED=$((FAILED + 1)); } wait $PID_PSN || { echo -e "${RED}✗ PSN selhalo${NC}"; FAILED=$((FAILED + 1)); }
wait $PID_CH || { echo -e "${RED}✗ CityHome selhalo${NC}"; FAILED=$((FAILED + 1)); } wait $PID_CH || { echo -e "${RED}✗ CityHome selhalo${NC}"; FAILED=$((FAILED + 1)); }
step "Bazoš"
python3 scrape_bazos.py $SCRAPER_ARGS || { echo -e "${RED}✗ Bazoš selhalo${NC}"; FAILED=$((FAILED + 1)); }
step "Realingo"
python3 scrape_realingo.py $SCRAPER_ARGS || { echo -e "${RED}✗ Realingo selhalo${NC}"; FAILED=$((FAILED + 1)); }
# ── Sloučení + mapa ────────────────────────────────────────── # ── Sloučení + mapa ──────────────────────────────────────────
step "Sloučení dat a generování mapy" step "Sloučení dat a generování mapy"
@@ -103,12 +115,12 @@ python3 merge_and_map.py || { echo -e "${RED}✗ Merge selhal${NC}"; FAILED=$((F
END_EPOCH=$(date +%s) END_EPOCH=$(date +%s)
DURATION=$((END_EPOCH - START_EPOCH)) DURATION=$((END_EPOCH - START_EPOCH))
python3 generate_status.py "$START_TIME" "$DURATION" "$LOG_FILE" python3 generate_status.py --start-time "$START_TIME" --duration "$DURATION" $KEEP_ARG
echo "" echo ""
echo "============================================================" echo "============================================================"
if [ $FAILED -eq 0 ]; then if [ $FAILED -eq 0 ]; then
echo -e "${GREEN}${BOLD}Hotovo! Všech 6 zdrojů úspěšně staženo.${NC}" echo -e "${GREEN}${BOLD}Hotovo! Všech 7 zdrojů úspěšně staženo.${NC}"
else else
echo -e "${RED}${BOLD}Hotovo s $FAILED chybami.${NC}" echo -e "${RED}${BOLD}Hotovo s $FAILED chybami.${NC}"
fi fi

View File

@@ -13,8 +13,11 @@ import math
import time import time
import urllib.request import urllib.request
import urllib.parse import urllib.parse
from datetime import datetime from datetime import datetime, timedelta
from pathlib import Path from pathlib import Path
from scraper_stats import write_stats, validate_listing
STATS_FILE = "stats_sreality.json"
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@@ -42,19 +45,26 @@ HEADERS = {
def api_get(url: str) -> dict: def api_get(url: str) -> dict:
"""Fetch JSON from Sreality API.""" """Fetch JSON from Sreality API with retry."""
logger.debug(f"HTTP GET request: {url}") for attempt in range(3):
logger.debug(f"Headers: {HEADERS}") logger.debug(f"HTTP GET request (attempt {attempt + 1}/3): {url}")
req = urllib.request.Request(url, headers=HEADERS) req = urllib.request.Request(url, headers=HEADERS)
try: try:
with urllib.request.urlopen(req, timeout=30) as resp: with urllib.request.urlopen(req, timeout=30) as resp:
response_data = resp.read().decode("utf-8") response_data = resp.read().decode("utf-8")
logger.debug(f"HTTP response: status={resp.status}, size={len(response_data)} bytes") logger.debug(f"HTTP response: status={resp.status}, size={len(response_data)} bytes")
logger.debug(f"Response preview: {response_data[:200]}") logger.debug(f"Response preview: {response_data[:200]}")
return json.loads(response_data) return json.loads(response_data)
except (urllib.error.URLError, ConnectionError, OSError) as e: except urllib.error.HTTPError:
logger.error(f"HTTP request failed for {url}: {e}", exc_info=True) raise
raise except (urllib.error.URLError, ConnectionError, OSError) as e:
if attempt < 2:
wait = (attempt + 1) * 2
logger.warning(f"Connection error (retry {attempt + 1}/3 after {wait}s): {e}")
time.sleep(wait)
else:
logger.error(f"HTTP request failed after 3 attempts: {e}", exc_info=True)
raise
def build_list_url(disposition: int, page: int = 1) -> str: def build_list_url(disposition: int, page: int = 1) -> str:
@@ -209,6 +219,8 @@ def load_cache(json_path: str = "byty_sreality.json") -> dict[int, dict]:
def scrape(max_pages: int | None = None, max_properties: int | None = None): def scrape(max_pages: int | None = None, max_properties: int | None = None):
"""Main scraping function. Returns list of filtered estates.""" """Main scraping function. Returns list of filtered estates."""
_run_start = time.time()
_run_ts = datetime.now().isoformat(timespec="seconds")
all_estates_raw = [] all_estates_raw = []
cache = load_cache() cache = load_cache()
@@ -348,7 +360,11 @@ def scrape(max_pages: int | None = None, max_properties: int | None = None):
"url": sreality_url(hash_id, seo), "url": sreality_url(hash_id, seo),
"image": (estate.get("_links", {}).get("images", [{}])[0].get("href", "") if estate.get("_links", {}).get("images") else ""), "image": (estate.get("_links", {}).get("images", [{}])[0].get("href", "") if estate.get("_links", {}).get("images") else ""),
"scraped_at": datetime.now().strftime("%Y-%m-%d"), "scraped_at": datetime.now().strftime("%Y-%m-%d"),
"first_seen": cached.get("first_seen", datetime.now().strftime("%Y-%m-%d")) if cached else datetime.now().strftime("%Y-%m-%d"),
"last_changed": datetime.now().strftime("%Y-%m-%d"),
} }
if not validate_listing(result, "sreality"):
continue
results.append(result) results.append(result)
details_fetched += 1 details_fetched += 1
@@ -366,6 +382,21 @@ def scrape(max_pages: int | None = None, max_properties: int | None = None):
logger.info(f" ✓ Vyhovující byty: {len(results)}") logger.info(f" ✓ Vyhovující byty: {len(results)}")
logger.info(f"{'=' * 60}") logger.info(f"{'=' * 60}")
write_stats(STATS_FILE, {
"source": "Sreality",
"timestamp": _run_ts,
"duration_sec": round(time.time() - _run_start, 1),
"success": True,
"accepted": len(results),
"fetched": len(unique_estates),
"cache_hits": cache_hits,
"excluded": {
"panel/síd": excluded_panel,
"<69 m²": excluded_small,
"bez GPS": excluded_no_gps,
"bez detailu": excluded_no_detail,
},
})
return results return results
@@ -409,18 +440,30 @@ def generate_map(estates: list[dict], output_path: str = "mapa_bytu.html"):
] ]
for bcolor, blabel in bands: for bcolor, blabel in bands:
price_legend_items += ( price_legend_items += (
f'<div style="display:flex;align-items:center;gap:6px;margin:2px 0;">' f'<div class="price-band" data-color="{bcolor}" onclick="toggleColorFilter(\'{bcolor}\')" '
f'style="display:flex;align-items:center;gap:6px;margin:2px 0;padding:2px 4px;'
f'border-radius:4px;border:2px solid transparent;">'
f'<span style="width:14px;height:14px;border-radius:50%;background:{bcolor};' f'<span style="width:14px;height:14px;border-radius:50%;background:{bcolor};'
f'display:inline-block;border:2px solid white;box-shadow:0 1px 3px rgba(0,0,0,0.3);flex-shrink:0;"></span>' f'display:inline-block;border:2px solid white;box-shadow:0 1px 3px rgba(0,0,0,0.3);flex-shrink:0;"></span>'
f'<span>{blabel}</span></div>' f'<span>{blabel}</span></div>'
) )
price_legend_items += (
'<div id="price-filter-reset" style="display:none;margin:3px 0 0 4px;">'
'<a href="#" onclick="resetColorFilter();return false;" '
'style="font-size:11px;color:#1976D2;text-decoration:none;">✕ Zobrazit všechny ceny</a>'
'</div>'
)
# New marker indicator — bigger dot, no extra border # New marker indicator — bigger dot, no extra border
price_legend_items += ( price_legend_items += (
'<div style="display:flex;align-items:center;gap:6px;margin:6px 0 0 0;' '<div style="display:flex;align-items:center;gap:6px;margin:6px 0 0 0;'
'padding-top:6px;border-top:1px solid #eee;">' 'padding-top:6px;border-top:1px solid #eee;">'
'<span style="width:18px;height:18px;border-radius:50%;background:#66BB6A;' '<span style="display:inline-flex;align-items:center;gap:3px;flex-shrink:0;">'
'display:inline-block;box-shadow:0 1px 4px rgba(0,0,0,0.35);flex-shrink:0;"></span>' '<span style="width:14px;height:14px;border-radius:50%;background:#66BB6A;'
'<span>Nové (z dnešního scrapu) — větší</span></div>' 'display:inline-block;box-shadow:0 1px 3px rgba(0,0,0,0.3);"></span>'
'<span style="font-size:8px;font-weight:700;background:#FFD600;color:#333;'
'padding:1px 3px;border-radius:2px;">NEW</span>'
'</span>'
'<span>Nové (≤ 1 den)</span></div>'
) )
markers_js = "" markers_js = ""
@@ -437,23 +480,37 @@ def generate_map(estates: list[dict], output_path: str = "mapa_bytu.html"):
floor_note = '<br><span style="color:#FF9800;font-weight:bold;">⚠ 2. NP — zvážit klidnost lokality</span>' floor_note = '<br><span style="color:#FF9800;font-weight:bold;">⚠ 2. NP — zvážit klidnost lokality</span>'
source = e.get("source", "sreality") source = e.get("source", "sreality")
source_labels = {"sreality": "Sreality", "realingo": "Realingo", "bezrealitky": "Bezrealitky", "idnes": "iDNES", "psn": "PSN", "cityhome": "CityHome"} source_labels = {"sreality": "Sreality", "realingo": "Realingo", "bezrealitky": "Bezrealitky", "idnes": "iDNES", "psn": "PSN", "cityhome": "CityHome", "bazos": "Bazoš"}
source_colors = {"sreality": "#1976D2", "realingo": "#00897B", "bezrealitky": "#E91E63", "idnes": "#FF6F00", "psn": "#D32F2F", "cityhome": "#D32F2F"} source_colors = {"sreality": "#1976D2", "realingo": "#00897B", "bezrealitky": "#E91E63", "idnes": "#FF6F00", "psn": "#D32F2F", "cityhome": "#D32F2F", "bazos": "#7B1FA2"}
source_label = source_labels.get(source, source) source_label = source_labels.get(source, source)
source_color = source_colors.get(source, "#999") source_color = source_colors.get(source, "#999")
hash_id = e.get("hash_id", "") hash_id = f"{source}_{e.get('hash_id', '')}"
scraped_at = e.get("scraped_at", "") first_seen = e.get("first_seen", "")
is_new = scraped_at == datetime.now().strftime("%Y-%m-%d") last_changed = e.get("last_changed", "")
today = datetime.now().strftime("%Y-%m-%d")
yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
is_new = first_seen in (today, yesterday)
new_badge = ( new_badge = (
'<span style="margin-left:6px;font-size:11px;background:#FFD600;color:#333;' '<span style="margin-left:6px;font-size:11px;background:#FFD600;color:#333;'
'padding:1px 6px;border-radius:3px;font-weight:bold;">NOVÉ</span>' 'padding:1px 6px;border-radius:3px;font-weight:bold;">NOVÉ</span>'
if is_new else "" if is_new else ""
) )
date_parts = []
if first_seen:
date_parts.append(f'Přidáno: {first_seen}')
if last_changed and last_changed != first_seen:
date_parts.append(f'Změněno: {last_changed}')
date_row = (
f'<span style="font-size:11px;color:#888;">{"&nbsp;·&nbsp;".join(date_parts)}</span><br>'
if date_parts else ""
)
popup = ( popup = (
f'<div style="min-width:280px;font-family:system-ui,sans-serif;" data-hashid="{hash_id}">' f'<div style="min-width:280px;font-family:system-ui,sans-serif;" data-hashid="{hash_id}" data-first-seen="{first_seen}" data-last-changed="{last_changed}">'
f'<b style="font-size:14px;">{format_price(e["price"])}</b>' f'<b style="font-size:14px;">{format_price(e["price"])}</b>'
f'<span style="margin-left:8px;font-size:11px;background:{source_color};color:white;' f'<span style="margin-left:8px;font-size:11px;background:{source_color};color:white;'
f'padding:1px 6px;border-radius:3px;">{source_label}</span>{new_badge}<br>' f'padding:1px 6px;border-radius:3px;">{source_label}</span>{new_badge}<br>'
@@ -461,7 +518,9 @@ def generate_map(estates: list[dict], output_path: str = "mapa_bytu.html"):
f'{floor_note}<br><br>' f'{floor_note}<br><br>'
f'<b>{e["locality"]}</b><br>' f'<b>{e["locality"]}</b><br>'
f'Stavba: {building_text}<br>' f'Stavba: {building_text}<br>'
f'Vlastnictví: {ownership_text}<br><br>' f'Vlastnictví: {ownership_text}<br>'
f'{date_row}'
f'<br>'
f'<a href="{e["url"]}" target="_blank" ' f'<a href="{e["url"]}" target="_blank" '
f'style="color:{source_color};text-decoration:none;font-weight:bold;">' f'style="color:{source_color};text-decoration:none;font-weight:bold;">'
f'→ Otevřít na {source_label}</a>' f'→ Otevřít na {source_label}</a>'
@@ -493,7 +552,7 @@ def generate_map(estates: list[dict], output_path: str = "mapa_bytu.html"):
else: else:
marker_fn = "addMarker" marker_fn = "addMarker"
markers_js += ( markers_js += (
f" {marker_fn}({e['lat']}, {e['lon']}, '{color}', '{popup}', '{hash_id}');\n" f" {marker_fn}({e['lat']}, {e['lon']}, '{color}', '{popup}', '{hash_id}', '{first_seen}', '{last_changed}');\n"
) )
# Build legend — price per m² bands + disposition counts # Build legend — price per m² bands + disposition counts
@@ -559,12 +618,12 @@ def generate_map(estates: list[dict], output_path: str = "mapa_bytu.html"):
.heart-icon-fav svg path {{ stroke: gold !important; stroke-width: 2.5 !important; filter: drop-shadow(0 0 4px rgba(255,193,7,0.7)); }} .heart-icon-fav svg path {{ stroke: gold !important; stroke-width: 2.5 !important; filter: drop-shadow(0 0 4px rgba(255,193,7,0.7)); }}
.heart-icon-rej {{ opacity: 0.4 !important; filter: grayscale(1); }} .heart-icon-rej {{ opacity: 0.4 !important; filter: grayscale(1); }}
.reject-overlay {{ background: none !important; border: none !important; pointer-events: none !important; }} .reject-overlay {{ background: none !important; border: none !important; pointer-events: none !important; }}
@keyframes pulse-new {{ .new-badge-icon {{ background: none !important; border: none !important; pointer-events: none !important; }}
0% {{ stroke-opacity: 1; stroke-width: 3px; r: 11; }} .new-badge {{
50% {{ stroke-opacity: 0.4; stroke-width: 6px; r: 12; }} font-size: 9px; font-weight: 700; color: #333; background: #FFD600;
100% {{ stroke-opacity: 1; stroke-width: 3px; r: 11; }} padding: 1px 4px; border-radius: 3px; white-space: nowrap;
box-shadow: 0 1px 3px rgba(0,0,0,0.3); letter-spacing: 0.5px;
}} }}
.marker-new {{ animation: pulse-new 2s ease-in-out infinite; }}
.info-panel {{ .info-panel {{
position: absolute; top: 10px; right: 10px; z-index: 1000; position: absolute; top: 10px; right: 10px; z-index: 1000;
background: white; padding: 16px; border-radius: 10px; background: white; padding: 16px; border-radius: 10px;
@@ -597,6 +656,10 @@ def generate_map(estates: list[dict], output_path: str = "mapa_bytu.html"):
.info-panel .stats {{ color: #666; margin-bottom: 10px; padding-bottom: 10px; border-bottom: 1px solid #eee; }} .info-panel .stats {{ color: #666; margin-bottom: 10px; padding-bottom: 10px; border-bottom: 1px solid #eee; }}
.filter-section {{ margin-top: 10px; padding-top: 10px; border-top: 1px solid #eee; }} .filter-section {{ margin-top: 10px; padding-top: 10px; border-top: 1px solid #eee; }}
.filter-section label {{ display: flex; align-items: center; gap: 6px; margin: 3px 0; cursor: pointer; }} .filter-section label {{ display: flex; align-items: center; gap: 6px; margin: 3px 0; cursor: pointer; }}
.price-band {{ cursor: pointer; transition: background 0.12s; }}
.price-band:hover {{ background: #f0f0f0; }}
.price-band.active {{ border-color: #333 !important; background: #e8f0fe; }}
.price-band.dimmed {{ opacity: 0.35; }}
.filter-section input[type="checkbox"] {{ accent-color: #1976D2; }} .filter-section input[type="checkbox"] {{ accent-color: #1976D2; }}
#floor-filter {{ margin-top: 8px; }} #floor-filter {{ margin-top: 8px; }}
#floor-filter select {{ width: 100%; padding: 4px; border-radius: 4px; border: 1px solid #ccc; }} #floor-filter select {{ width: 100%; padding: 4px; border-radius: 4px; border: 1px solid #ccc; }}
@@ -635,11 +698,23 @@ def generate_map(estates: list[dict], output_path: str = "mapa_bytu.html"):
</div> </div>
<div style="margin-top:6px;"> <div style="margin-top:6px;">
<label>Max cena: <label>Max cena:
<select id="max-price" onchange="applyFilters()"> <input type="number" id="max-price" value="13500000" max="14000000" step="500000"
<option value="13500000">13 500 000 Kč</option> style="width:130px;padding:2px 4px;border:1px solid #ccc;border-radius:3px;"
<option value="12000000">12 000 000 Kč</option> onchange="applyFilters()" onkeyup="applyFilters()"> Kč
<option value="10000000">10 000 000 Kč</option> </label>
<option value="8000000">8 000 000 Kč</option> </div>
<div style="margin-top:6px;">
<label>Přidáno / změněno:
<select id="days-filter" onchange="applyFilters()" style="width:100%;padding:4px;border-radius:4px;border:1px solid #ccc;">
<option value="0">Vše</option>
<option value="1">za 1 den</option>
<option value="2">za 2 dny</option>
<option value="3">za 3 dny</option>
<option value="4">za 4 dny</option>
<option value="5">za 5 dní</option>
<option value="7">za 7 dní</option>
<option value="14">za 14 dní</option>
<option value="30">za 30 dní</option>
</select> </select>
</label> </label>
</div> </div>
@@ -653,7 +728,7 @@ def generate_map(estates: list[dict], output_path: str = "mapa_bytu.html"):
Skrýt zamítnuté Skrýt zamítnuté
</label> </label>
</div> </div>
<div class="status-link"><a href="status.html">Scraper status</a></div> <div class="status-link"><a href="/scrapers-status">Scraper status</a></div>
</div> </div>
<script> <script>
@@ -673,9 +748,39 @@ L.tileLayer('https://{{s}}.basemaps.cartocdn.com/light_only_labels/{{z}}/{{x}}/{
pane: 'shadowPane', pane: 'shadowPane',
}}).addTo(map); }}).addTo(map);
var selectedColors = [];
function toggleColorFilter(color) {{
var idx = selectedColors.indexOf(color);
if (idx >= 0) selectedColors.splice(idx, 1);
else selectedColors.push(color);
document.querySelectorAll('.price-band').forEach(function(el) {{
var c = el.getAttribute('data-color');
if (selectedColors.length === 0) {{
el.classList.remove('active', 'dimmed');
}} else if (selectedColors.indexOf(c) >= 0) {{
el.classList.add('active'); el.classList.remove('dimmed');
}} else {{
el.classList.add('dimmed'); el.classList.remove('active');
}}
}});
document.getElementById('price-filter-reset').style.display =
selectedColors.length > 0 ? 'block' : 'none';
applyFilters();
}}
function resetColorFilter() {{
selectedColors = [];
document.querySelectorAll('.price-band').forEach(function(el) {{
el.classList.remove('active', 'dimmed');
}});
document.getElementById('price-filter-reset').style.display = 'none';
applyFilters();
}}
var allMarkers = []; var allMarkers = [];
function addMarker(lat, lon, color, popup, hashId) {{ function addMarker(lat, lon, color, popup, hashId, firstSeen, lastChanged) {{
var marker = L.circleMarker([lat, lon], {{ var marker = L.circleMarker([lat, lon], {{
radius: 8, radius: 8,
fillColor: color, fillColor: color,
@@ -684,26 +789,35 @@ function addMarker(lat, lon, color, popup, hashId) {{
opacity: 1, opacity: 1,
fillOpacity: 0.85, fillOpacity: 0.85,
}}).bindPopup(popup); }}).bindPopup(popup);
marker._data = {{ lat: lat, lon: lon, color: color, hashId: hashId }}; marker._data = {{ lat: lat, lon: lon, color: color, hashId: hashId, firstSeen: firstSeen || '', lastChanged: lastChanged || '' }};
allMarkers.push(marker); allMarkers.push(marker);
marker.addTo(map); marker.addTo(map);
}} }}
function addNewMarker(lat, lon, color, popup, hashId) {{ function addNewMarker(lat, lon, color, popup, hashId, firstSeen, lastChanged) {{
var marker = L.circleMarker([lat, lon], {{ var marker = L.circleMarker([lat, lon], {{
radius: 12, radius: 8,
fillColor: color, fillColor: color,
color: color, color: '#fff',
weight: 4, weight: 2,
opacity: 0.35, opacity: 1,
fillOpacity: 0.95, fillOpacity: 0.85,
}}).bindPopup(popup); }}).bindPopup(popup);
marker._data = {{ lat: lat, lon: lon, color: color, hashId: hashId, isNew: true }}; marker._data = {{ lat: lat, lon: lon, color: color, hashId: hashId, isNew: true, firstSeen: firstSeen || '', lastChanged: lastChanged || '' }};
allMarkers.push(marker); allMarkers.push(marker);
marker.addTo(map); marker.addTo(map);
marker.on('add', function() {{ var badge = L.marker([lat, lon], {{
if (marker._path) marker._path.classList.add('marker-new'); icon: L.divIcon({{
className: 'new-badge-icon',
html: '<span class="new-badge">NEW</span>',
iconSize: [32, 14],
iconAnchor: [-6, 7],
}}),
interactive: false,
pane: 'markerPane',
}}); }});
badge.addTo(map);
marker._newBadge = badge;
}} }}
function heartIcon(color) {{ function heartIcon(color) {{
@@ -736,11 +850,11 @@ function starIcon() {{
}}); }});
}} }}
function addHeartMarker(lat, lon, color, popup, hashId) {{ function addHeartMarker(lat, lon, color, popup, hashId, firstSeen, lastChanged) {{
var marker = L.marker([lat, lon], {{ var marker = L.marker([lat, lon], {{
icon: heartIcon(color), icon: heartIcon(color),
}}).bindPopup(popup); }}).bindPopup(popup);
marker._data = {{ lat: lat, lon: lon, color: color, hashId: hashId, isHeart: true }}; marker._data = {{ lat: lat, lon: lon, color: color, hashId: hashId, isHeart: true, firstSeen: firstSeen || '', lastChanged: lastChanged || '' }};
allMarkers.push(marker); allMarkers.push(marker);
marker.addTo(map); marker.addTo(map);
}} }}
@@ -759,6 +873,11 @@ function loadRatings() {{
function saveRatings(ratings) {{ function saveRatings(ratings) {{
localStorage.setItem(RATINGS_KEY, JSON.stringify(ratings)); localStorage.setItem(RATINGS_KEY, JSON.stringify(ratings));
fetch('/api/ratings', {{
method: 'POST',
headers: {{'Content-Type': 'application/json'}},
body: JSON.stringify(ratings)
}}).catch(function() {{}});
}} }}
function addRejectStrike(marker) {{ function addRejectStrike(marker) {{
@@ -806,6 +925,7 @@ function applyMarkerStyle(marker, status) {{
}} else {{ }} else {{
if (status === 'fav') {{ if (status === 'fav') {{
removeRejectStrike(marker); removeRejectStrike(marker);
if (marker._newBadge && map.hasLayer(marker._newBadge)) map.removeLayer(marker._newBadge);
if (!marker._data._origCircle) marker._data._origCircle = true; if (!marker._data._origCircle) marker._data._origCircle = true;
var popup = marker.getPopup(); var popup = marker.getPopup();
var popupContent = popup ? popup.getContent() : ''; var popupContent = popup ? popup.getContent() : '';
@@ -829,6 +949,7 @@ function applyMarkerStyle(marker, status) {{
}} }}
// Add strikethrough line over the marker // Add strikethrough line over the marker
addRejectStrike(marker); addRejectStrike(marker);
if (marker._newBadge && map.hasLayer(marker._newBadge)) map.removeLayer(marker._newBadge);
}} else {{ }} else {{
if (marker._data._origCircle && !(marker instanceof L.CircleMarker)) {{ if (marker._data._origCircle && !(marker instanceof L.CircleMarker)) {{
revertToCircle(marker, {{ radius: 8, fillColor: marker._data.color, color: '#fff', weight: 2, fillOpacity: 0.85 }}); revertToCircle(marker, {{ radius: 8, fillColor: marker._data.color, color: '#fff', weight: 2, fillOpacity: 0.85 }});
@@ -841,6 +962,7 @@ function applyMarkerStyle(marker, status) {{
}} }}
if (marker._path) marker._path.classList.remove('marker-rejected'); if (marker._path) marker._path.classList.remove('marker-rejected');
removeRejectStrike(marker); removeRejectStrike(marker);
if (marker._newBadge && !map.hasLayer(marker._newBadge)) marker._newBadge.addTo(map);
}} }}
}} }}
}} }}
@@ -996,11 +1118,21 @@ map.on('popupopen', function(e) {{
// ── Filters ──────────────────────────────────────────────────── // ── Filters ────────────────────────────────────────────────────
function applyFilters() {{ function applyFilters() {{
var minFloor = parseInt(document.getElementById('min-floor').value); var minFloor = parseInt(document.getElementById('min-floor').value);
var maxPrice = parseInt(document.getElementById('max-price').value); var maxPriceEl = document.getElementById('max-price');
var maxPrice = parseInt(maxPriceEl.value) || 14000000;
if (maxPrice > 14000000) {{ maxPrice = 14000000; maxPriceEl.value = 14000000; }}
var hideRejected = document.getElementById('hide-rejected').checked; var hideRejected = document.getElementById('hide-rejected').checked;
var daysFilter = parseInt(document.getElementById('days-filter').value) || 0;
var ratings = loadRatings(); var ratings = loadRatings();
var visible = 0; var visible = 0;
var cutoff = null;
if (daysFilter > 0) {{
cutoff = new Date();
cutoff.setDate(cutoff.getDate() - daysFilter);
cutoff.setHours(0, 0, 0, 0);
}}
allMarkers.forEach(function(m) {{ allMarkers.forEach(function(m) {{
var popup = m.getPopup().getContent(); var popup = m.getPopup().getContent();
var floorMatch = popup.match(/(\\d+)\\. NP/); var floorMatch = popup.match(/(\\d+)\\. NP/);
@@ -1013,6 +1145,14 @@ function applyFilters() {{
if (floor !== null && floor < minFloor) show = false; if (floor !== null && floor < minFloor) show = false;
if (price > maxPrice) show = false; if (price > maxPrice) show = false;
if (cutoff) {{
var fs = m._data.firstSeen ? new Date(m._data.firstSeen) : null;
var lc = m._data.lastChanged ? new Date(m._data.lastChanged) : null;
if (!((fs && fs >= cutoff) || (lc && lc >= cutoff))) show = false;
}}
if (selectedColors.length > 0 && selectedColors.indexOf(m._data.color) < 0) show = false;
var r = ratings[m._data.hashId]; var r = ratings[m._data.hashId];
if (hideRejected && r && r.status === 'reject') show = false; if (hideRejected && r && r.status === 'reject') show = false;
@@ -1021,10 +1161,12 @@ function applyFilters() {{
visible++; visible++;
// Show strike line if rejected and visible // Show strike line if rejected and visible
if (m._rejectStrike && !map.hasLayer(m._rejectStrike)) m._rejectStrike.addTo(map); if (m._rejectStrike && !map.hasLayer(m._rejectStrike)) m._rejectStrike.addTo(map);
if (m._newBadge && !map.hasLayer(m._newBadge)) m._newBadge.addTo(map);
}} else {{ }} else {{
if (map.hasLayer(m)) map.removeLayer(m); if (map.hasLayer(m)) map.removeLayer(m);
// Hide strike line when marker hidden // Hide strike line when marker hidden
if (m._rejectStrike && map.hasLayer(m._rejectStrike)) map.removeLayer(m._rejectStrike); if (m._rejectStrike && map.hasLayer(m._rejectStrike)) map.removeLayer(m._rejectStrike);
if (m._newBadge && map.hasLayer(m._newBadge)) map.removeLayer(m._newBadge);
}} }}
}}); }});
@@ -1039,8 +1181,25 @@ function applyFilters() {{
document.getElementById('visible-count').textContent = visible; document.getElementById('visible-count').textContent = visible;
}} }}
// Initialize ratings on load // Initialize ratings: load from server, merge with localStorage, then restore
restoreRatings(); function initRatings() {{
var local = loadRatings();
fetch('/api/ratings')
.then(function(r) {{ return r.ok ? r.json() : null; }})
.then(function(server) {{
if (server && typeof server === 'object') {{
var merged = Object.assign({{}}, local, server);
localStorage.setItem(RATINGS_KEY, JSON.stringify(merged));
}}
restoreRatings();
updateRatingCounts();
}})
.catch(function() {{
restoreRatings();
updateRatingCounts();
}});
}}
initRatings();
// ── Panel toggle ────────────────────────────────────────────── // ── Panel toggle ──────────────────────────────────────────────
function togglePanel() {{ function togglePanel() {{
@@ -1089,8 +1248,22 @@ if __name__ == "__main__":
handlers=[logging.StreamHandler()] handlers=[logging.StreamHandler()]
) )
_run_ts = datetime.now().isoformat(timespec="seconds")
start = time.time() start = time.time()
estates = scrape(max_pages=args.max_pages, max_properties=args.max_properties) try:
estates = scrape(max_pages=args.max_pages, max_properties=args.max_properties)
except Exception as e:
logger.error(f"Scraper failed: {e}", exc_info=True)
write_stats(STATS_FILE, {
"source": "Sreality",
"timestamp": _run_ts,
"duration_sec": round(time.time() - start, 1),
"success": False,
"accepted": 0,
"fetched": 0,
"error": str(e),
})
raise
if estates: if estates:
# Save raw data as JSON backup # Save raw data as JSON backup

560
scrape_bazos.py Normal file
View File

@@ -0,0 +1,560 @@
#!/usr/bin/env python3
"""
Bazoš.cz scraper.
Stáhne byty na prodej v Praze a vyfiltruje podle kritérií.
Výstup: byty_bazos.json
"""
from __future__ import annotations
import argparse
from datetime import datetime
import json
import logging
import math
import re
import time
import urllib.request
import urllib.parse
from pathlib import Path
from scraper_stats import write_stats, validate_listing
STATS_FILE = "stats_bazos.json"
logger = logging.getLogger(__name__)
# ── Konfigurace ─────────────────────────────────────────────────────────────
MAX_PRICE = 14_000_000
MIN_AREA = 69
MIN_FLOOR = 2
PER_PAGE = 20 # Bazoš vrací 20 na stránku
WANTED_DISPOSITIONS = {"3+kk", "3+1", "4+kk", "4+1", "5+kk", "5+1", "6+kk", "6+1"}
# Regex patterns pro parsování dispozice, plochy a patra z textu
DISP_RE = re.compile(r'(\d)\s*\+\s*(kk|1)', re.IGNORECASE)
AREA_RE = re.compile(r'(\d+(?:[.,]\d+)?)\s*m[²2\s,.]', re.IGNORECASE)
FLOOR_RE = re.compile(r'(\d+)\s*[./]\s*(\d+)\s*(?:NP|patr|podlaž|floor)', re.IGNORECASE)
FLOOR_RE2 = re.compile(r'(\d+)\.\s*(?:NP|patr[eouě]|podlaž[ií])', re.IGNORECASE)
FLOOR_RE3 = re.compile(r'(?:patr[eouě]|podlaž[ií]|NP)\s*[:\s]*(\d+)', re.IGNORECASE)
PANEL_RE = re.compile(r'panel(?:ov|ák|\.)', re.IGNORECASE)
SIDLISTE_RE = re.compile(r'sídliště|sidliste|panelák', re.IGNORECASE)
HEADERS = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "cs,en;q=0.9",
}
BASE_URL = "https://reality.bazos.cz"
SEARCH_PARAMS = "hledat=&rubriky=reality&hlokalita=Praha&humkreis=25&cenado={max_price}&kitx=ano"
def fetch_url(url: str, retries: int = 3) -> str:
"""Fetch URL and return HTML string with retry on transient errors."""
for attempt in range(retries):
try:
logger.debug(f"HTTP GET request (attempt {attempt + 1}/{retries}): {url}")
req = urllib.request.Request(url, headers=HEADERS)
resp = urllib.request.urlopen(req, timeout=30)
html = resp.read().decode("utf-8", errors="replace")
logger.debug(f"HTTP response: status={resp.status}, size={len(html)} bytes")
return html
except urllib.error.HTTPError:
raise
except (ConnectionResetError, ConnectionError, urllib.error.URLError, OSError) as e:
if attempt < retries - 1:
wait = (attempt + 1) * 3
logger.warning(f"Connection error (retry {attempt + 1}/{retries} after {wait}s): {e}")
time.sleep(wait)
else:
logger.error(f"HTTP request failed after {retries} attempts: {e}", exc_info=True)
raise
def format_price(price: int) -> str:
s = str(price)
parts = []
while s:
parts.append(s[-3:])
s = s[:-3]
return " ".join(reversed(parts)) + ""
def parse_price(text: str) -> int:
"""Parse price from text like '5 250 000 Kč' → 5250000."""
cleaned = re.sub(r'[^\d]', '', text)
return int(cleaned) if cleaned else 0
def parse_disposition(text: str) -> str | None:
"""Parse disposition from title/description like '3+kk', '4+1'."""
m = DISP_RE.search(text)
if m:
rooms = m.group(1)
suffix = m.group(2).lower()
return f"{rooms}+{suffix}"
return None
def parse_area(text: str) -> float | None:
"""Parse area from text like '82 m²' → 82.0."""
m = AREA_RE.search(text)
if m:
return float(m.group(1).replace(',', '.'))
return None
def parse_floor(text: str) -> int | None:
"""Parse floor number from description."""
for pattern in [FLOOR_RE, FLOOR_RE2, FLOOR_RE3]:
m = pattern.search(text)
if m:
return int(m.group(1))
return None
def is_panel(text: str) -> bool:
"""Check if description mentions panel construction."""
return bool(PANEL_RE.search(text))
def is_sidliste(text: str) -> bool:
"""Check if description mentions housing estate."""
return bool(SIDLISTE_RE.search(text))
def fetch_listing_page(offset: int = 0, pagination_params: str | None = None) -> tuple[list[dict], int, str | None]:
"""
Fetch a page of listings from Bazoš.
Returns (list of basic listing dicts, total count, pagination_params for next pages).
"""
if pagination_params and offset > 0:
# Use resolved numeric params from first page's pagination links
url = f"{BASE_URL}/prodam/byt/{offset}/?{pagination_params}"
else:
params = SEARCH_PARAMS.format(max_price=MAX_PRICE)
if offset > 0:
url = f"{BASE_URL}/prodam/byt/{offset}/?{params}"
else:
url = f"{BASE_URL}/prodam/byt/?{params}"
html = fetch_url(url)
# Parse total count: "Zobrazeno 1-20 z 727"
total = 0
total_match = re.search(r'z\s+([\d\s]+)\s', html)
if total_match:
total = int(total_match.group(1).replace(' ', ''))
# Extract resolved pagination params from first page (Bazoš converts
# hlokalita=Praha → hlokalita=11000, and pagination only works with numeric form)
resolved_params = None
pag_link = re.search(r'href="/prodam/byt/\d+/\?([^"]+)"', html)
if pag_link:
resolved_params = pag_link.group(1)
# Parse listings — split by listing blocks (class="inzeraty inzeratyflex")
listings = []
all_blocks = re.split(r'<div class="inzeraty\s+inzeratyflex">', html)[1:] # skip before first
for block in all_blocks:
# Extract URL and ID from first link (/inzerat/XXXXXX/slug.php)
url_match = re.search(r'href="(/inzerat/(\d+)/[^"]*)"', block)
if not url_match:
continue
detail_path = url_match.group(1)
listing_id = int(url_match.group(2))
# Title — class=nadpis (without quotes) or class="nadpis"
title_match = re.search(r'class=.?nadpis.?[^>]*>\s*<a[^>]*>([^<]+)</a>', block)
title = title_match.group(1).strip() if title_match else ""
# Price — inside <span translate="no"> within inzeratycena
price_match = re.search(r'class="inzeratycena"[^>]*>.*?<span[^>]*>([^<]+)</span>', block, re.DOTALL)
if not price_match:
# Fallback: direct text in inzeratycena
price_match = re.search(r'class="inzeratycena"[^>]*>\s*(?:<b>)?([^<]+)', block)
price_text = price_match.group(1).strip() if price_match else ""
price = parse_price(price_text)
# Location
loc_match = re.search(r'class="inzeratylok"[^>]*>(.*?)</div>', block, re.DOTALL)
location = ""
if loc_match:
location = re.sub(r'<[^>]+>', ' ', loc_match.group(1)).strip()
location = re.sub(r'\s+', ' ', location)
# Date — [5.3. 2026]
date_match = re.search(r'\[(\d+\.\d+\.\s*\d{4})\]', block)
date_str = date_match.group(1).strip() if date_match else ""
# Description preview — class=popis (without quotes) or class="popis"
desc_match = re.search(r'class=.?popis.?[^>]*>(.*?)</div>', block, re.DOTALL)
description = ""
if desc_match:
description = re.sub(r'<[^>]+>', ' ', desc_match.group(1)).strip()
description = re.sub(r'\s+', ' ', description)
# Image — <img ... class="obrazek" ... src="...">
img_match = re.search(r'<img[^>]*src="([^"]+)"[^>]*class="obrazek"', block)
if not img_match:
img_match = re.search(r'class="obrazek"[^>]*src="([^"]+)"', block)
image = img_match.group(1) if img_match else ""
if "empty.gif" in image:
image = ""
listings.append({
"id": listing_id,
"title": title,
"price": price,
"location": location,
"date": date_str,
"description": description,
"detail_path": detail_path,
"image": image,
})
logger.debug(f"Offset {offset}: found {len(listings)} listings, total={total}")
return listings, total, resolved_params
def fetch_detail(path: str) -> dict | None:
"""Fetch listing detail page and extract GPS, full description."""
try:
url = f"{BASE_URL}{path}"
html = fetch_url(url)
result = {}
# GPS from Google Maps link
gps_match = re.search(r'google\.com/maps[^"]*place/([\d.]+),([\d.]+)', html)
if gps_match:
result["lat"] = float(gps_match.group(1))
result["lon"] = float(gps_match.group(2))
# Full description — Bazoš uses unquoted class=popisdetail
desc_match = re.search(r'class=.?popisdetail.?[^>]*>(.*?)</div>', html, re.DOTALL)
if desc_match:
desc = re.sub(r'<[^>]+>', ' ', desc_match.group(1)).strip()
desc = re.sub(r'\s+', ' ', desc)
result["description"] = desc
# Location from detail
loc_match = re.search(r'Lokalita:</td>\s*<td[^>]*>(.*?)</td>', html, re.DOTALL)
if loc_match:
loc = re.sub(r'<[^>]+>', ' ', loc_match.group(1)).strip()
loc = re.sub(r'\s+', ' ', loc)
result["detail_location"] = loc
return result
except Exception as e:
logger.warning(f"Detail fetch failed for {path}: {e}")
return None
def load_cache(json_path: str = "byty_bazos.json") -> dict[int, dict]:
"""Load previously scraped data as cache keyed by hash_id."""
path = Path(json_path)
if not path.exists():
return {}
try:
data = json.loads(path.read_text(encoding="utf-8"))
return {e["hash_id"]: e for e in data if "hash_id" in e}
except (json.JSONDecodeError, KeyError):
return {}
def scrape(max_pages: int | None = None, max_properties: int | None = None):
_run_start = time.time()
_run_ts = datetime.now().isoformat(timespec="seconds")
cache = load_cache()
today = datetime.now().strftime("%Y-%m-%d")
logger.info("=" * 60)
logger.info("Stahuji inzeráty z Bazoš.cz")
logger.info(f"Cena: do {format_price(MAX_PRICE)}")
logger.info(f"Min. plocha: {MIN_AREA}")
logger.info(f"Patro: od {MIN_FLOOR}. NP")
logger.info(f"Region: Praha")
if cache:
logger.info(f"Cache: {len(cache)} bytů z minulého běhu")
if max_pages:
logger.info(f"Max. stran: {max_pages}")
if max_properties:
logger.info(f"Max. bytů: {max_properties}")
logger.info("=" * 60)
# Step 1: Fetch listing pages
logger.info("\nFáze 1: Stahování seznamu inzerátů...")
all_listings = {} # id -> listing dict (dedup)
page = 1
offset = 0
total = None
pagination_params = None # resolved numeric params from first page
while True:
if max_pages and page > max_pages:
logger.debug(f"Max pages limit reached: {max_pages}")
break
logger.info(f"Strana {page} (offset {offset}) ...")
listings, total_count, resolved = fetch_listing_page(offset, pagination_params)
if resolved and not pagination_params:
pagination_params = resolved
logger.debug(f"Resolved pagination params: {pagination_params}")
if total is None and total_count > 0:
total = total_count
total_pages = math.ceil(total / PER_PAGE)
logger.info(f"→ Celkem {total} inzerátů, ~{total_pages} stran")
if not listings:
logger.debug(f"No listings found on page {page}, stopping")
break
for lst in listings:
lid = lst["id"]
if lid not in all_listings:
all_listings[lid] = lst
page += 1
offset += PER_PAGE
if total and offset >= total:
break
time.sleep(0.5)
logger.info(f"\nStaženo: {len(all_listings)} unikátních inzerátů")
# Step 2: Pre-filter by disposition, price, area from listing data
pre_filtered = []
excluded_disp = 0
excluded_price = 0
excluded_area = 0
excluded_no_disp = 0
for lst in all_listings.values():
title_and_desc = f"{lst['title']} {lst['description']}"
# Parse disposition
disp = parse_disposition(title_and_desc)
if not disp:
excluded_no_disp += 1
logger.debug(f"Filter: id={lst['id']} - excluded (no disposition found in '{lst['title']}')")
continue
if disp not in WANTED_DISPOSITIONS:
excluded_disp += 1
logger.debug(f"Filter: id={lst['id']} - excluded (disposition {disp})")
continue
# Price
price = lst["price"]
if price <= 0 or price > MAX_PRICE:
excluded_price += 1
logger.debug(f"Filter: id={lst['id']} - excluded (price {price})")
continue
# Area (if parseable from listing)
area = parse_area(title_and_desc)
if area is not None and area < MIN_AREA:
excluded_area += 1
logger.debug(f"Filter: id={lst['id']} - excluded (area {area} m²)")
continue
lst["_disposition"] = disp
lst["_area"] = area
pre_filtered.append(lst)
logger.info(f"\nPo předfiltraci:")
logger.info(f" Vyloučeno (bez dispozice): {excluded_no_disp}")
logger.info(f" Vyloučeno (dispozice): {excluded_disp}")
logger.info(f" Vyloučeno (cena): {excluded_price}")
logger.info(f" Vyloučeno (plocha): {excluded_area}")
logger.info(f" Zbývá: {len(pre_filtered)}")
# Step 3: Fetch details (for GPS + full description)
logger.info(f"\nFáze 2: Stahování detailů ({len(pre_filtered)} bytů)...")
results = []
excluded_panel = 0
excluded_floor = 0
excluded_no_gps = 0
excluded_detail = 0
excluded_area_detail = 0
cache_hits = 0
properties_fetched = 0
for i, lst in enumerate(pre_filtered):
if max_properties and properties_fetched >= max_properties:
logger.debug(f"Max properties limit reached: {max_properties}")
break
listing_id = lst["id"]
price = lst["price"]
# Check cache
cached = cache.get(listing_id)
if cached and cached.get("price") == price:
cache_hits += 1
logger.debug(f"Cache hit for id={listing_id}")
results.append(cached)
continue
time.sleep(0.4)
detail = fetch_detail(lst["detail_path"])
if not detail:
excluded_detail += 1
logger.debug(f"Filter: id={listing_id} - excluded (detail fetch failed)")
continue
# GPS required
lat = detail.get("lat")
lon = detail.get("lon")
if not lat or not lon:
excluded_no_gps += 1
logger.debug(f"Filter: id={listing_id} - excluded (no GPS)")
continue
# Full text for filtering
full_desc = detail.get("description", "")
full_text = f"{lst['title']} {lst['description']} {full_desc}"
# Panel check
if is_panel(full_text):
excluded_panel += 1
logger.info(f"✗ Vyloučen #{listing_id}: panelová stavba")
continue
# Sídliště check
if is_sidliste(full_text):
excluded_panel += 1
logger.info(f"✗ Vyloučen #{listing_id}: sídliště")
continue
# Floor
floor = parse_floor(full_text)
if floor is not None and floor < MIN_FLOOR:
excluded_floor += 1
logger.debug(f"Filter: id={listing_id} - excluded (floor {floor})")
continue
# Area — re-check from detail if not found before
area = lst.get("_area") or parse_area(full_desc)
if area is not None and area < MIN_AREA:
excluded_area_detail += 1
logger.debug(f"Filter: id={listing_id} - excluded (area {area} m² from detail)")
continue
disp = lst["_disposition"]
locality = detail.get("detail_location") or lst["location"]
result = {
"hash_id": listing_id,
"name": f"Prodej bytu {disp} {int(area) if area else '?'}",
"price": price,
"price_formatted": format_price(price),
"locality": locality,
"lat": lat,
"lon": lon,
"disposition": disp,
"floor": floor,
"area": area,
"building_type": "neuvedeno",
"ownership": "neuvedeno",
"url": f"{BASE_URL}{lst['detail_path']}",
"source": "bazos",
"image": lst.get("image", ""),
"scraped_at": today,
"first_seen": cached.get("first_seen", today) if cached else today,
"last_changed": today if not cached or cached.get("price") != price else cached.get("last_changed", today),
}
if not validate_listing(result, "bazos"):
continue
results.append(result)
properties_fetched += 1
if (i + 1) % 20 == 0:
logger.info(f"Zpracováno {i + 1}/{len(pre_filtered)} ...")
logger.info(f"\n{'=' * 60}")
logger.info(f"Výsledky Bazoš:")
logger.info(f" Předfiltrováno: {len(pre_filtered)}")
logger.info(f" Z cache (přeskočeno): {cache_hits}")
logger.info(f" Vyloučeno (panel/síd): {excluded_panel}")
logger.info(f" Vyloučeno (patro): {excluded_floor}")
logger.info(f" Vyloučeno (bez GPS): {excluded_no_gps}")
logger.info(f" Vyloučeno (bez detailu): {excluded_detail}")
logger.info(f" Vyloučeno (plocha det): {excluded_area_detail}")
logger.info(f" ✓ Vyhovující byty: {len(results)}")
logger.info(f"{'=' * 60}")
write_stats(STATS_FILE, {
"source": "Bazoš",
"timestamp": _run_ts,
"duration_sec": round(time.time() - _run_start, 1),
"success": True,
"accepted": len(results),
"fetched": len(all_listings),
"pages": page - 1,
"cache_hits": cache_hits,
"excluded": {
"bez dispozice": excluded_no_disp,
"dispozice": excluded_disp,
"cena": excluded_price,
"plocha": excluded_area + excluded_area_detail,
"bez GPS": excluded_no_gps,
"panel/síd": excluded_panel,
"patro": excluded_floor,
"bez detailu": excluded_detail,
},
})
return results
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Scrape apartments from Bazoš.cz")
parser.add_argument("--max-pages", type=int, default=None,
help="Maximum number of listing pages to scrape")
parser.add_argument("--max-properties", type=int, default=None,
help="Maximum number of properties to fetch details for")
parser.add_argument("--log-level", type=str, default="INFO", choices=["DEBUG", "INFO", "WARNING", "ERROR"],
help="Logging level (default: INFO)")
args = parser.parse_args()
logging.basicConfig(
level=getattr(logging, args.log_level),
format="[%(levelname)s] %(asctime)s - %(name)s - %(message)s",
handlers=[logging.StreamHandler()]
)
_run_ts = datetime.now().isoformat(timespec="seconds")
start = time.time()
try:
estates = scrape(max_pages=args.max_pages, max_properties=args.max_properties)
except Exception as e:
logger.error(f"Scraper failed: {e}", exc_info=True)
write_stats(STATS_FILE, {
"source": "Bazoš",
"timestamp": _run_ts,
"duration_sec": round(time.time() - start, 1),
"success": False,
"accepted": 0,
"fetched": 0,
"error": str(e),
})
raise
if estates:
json_path = Path("byty_bazos.json")
json_path.write_text(
json.dumps(estates, ensure_ascii=False, indent=2),
encoding="utf-8",
)
elapsed = time.time() - start
logger.info(f"\n✓ Data uložena: {json_path.resolve()}")
logger.info(f"⏱ Celkový čas: {elapsed:.0f} s")
else:
logger.info("\nŽádné byty z Bazoše neodpovídají kritériím :(")

View File

@@ -15,6 +15,9 @@ import re
import time import time
import urllib.request import urllib.request
from pathlib import Path from pathlib import Path
from scraper_stats import write_stats, validate_listing
STATS_FILE = "stats_bezrealitky.json"
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@@ -68,62 +71,71 @@ HEADERS = {
BASE_URL = "https://www.bezrealitky.cz" BASE_URL = "https://www.bezrealitky.cz"
def fetch_url(url: str, retries: int = 3) -> str:
"""Fetch URL and return HTML string with retry on transient errors."""
for attempt in range(retries):
try:
logger.debug(f"HTTP GET request (attempt {attempt + 1}/{retries}): {url}")
req = urllib.request.Request(url, headers=HEADERS)
resp = urllib.request.urlopen(req, timeout=30)
html = resp.read().decode("utf-8")
logger.debug(f"HTTP response: status={resp.status}, size={len(html)} bytes")
return html
except urllib.error.HTTPError:
raise
except (ConnectionResetError, ConnectionError, urllib.error.URLError, OSError) as e:
if attempt < retries - 1:
wait = (attempt + 1) * 2
logger.warning(f"Connection error (retry {attempt + 1}/{retries} after {wait}s): {e}")
time.sleep(wait)
else:
logger.error(f"HTTP request failed after {retries} attempts: {e}", exc_info=True)
raise
def fetch_page(page: int) -> tuple[list[dict], int]: def fetch_page(page: int) -> tuple[list[dict], int]:
""" """
Fetch a listing page from Bezrealitky. Fetch a listing page from Bezrealitky.
Returns (list of advert dicts from Apollo cache, total count). Returns (list of advert dicts from Apollo cache, total count).
""" """
url = f"{BASE_URL}/vypis/nabidka-prodej/byt/praha?page={page}" url = f"{BASE_URL}/vypis/nabidka-prodej/byt/praha?page={page}"
logger.debug(f"HTTP GET request: {url}") html = fetch_url(url)
logger.debug(f"Headers: {HEADERS}")
req = urllib.request.Request(url, headers=HEADERS)
try:
resp = urllib.request.urlopen(req, timeout=30)
html = resp.read().decode("utf-8")
logger.debug(f"HTTP response: status={resp.status}, size={len(html)} bytes")
match = re.search( match = re.search(
r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>', r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
html, re.DOTALL html, re.DOTALL
) )
if not match: if not match:
logger.debug("No __NEXT_DATA__ script found in HTML") logger.debug("No __NEXT_DATA__ script found in HTML")
return [], 0 return [], 0
data = json.loads(match.group(1)) data = json.loads(match.group(1))
cache = data["props"]["pageProps"]["apolloCache"] cache = data["props"]["pageProps"]["apolloCache"]
# Extract adverts from cache # Extract adverts from cache
adverts = [] adverts = []
for key, val in cache.items(): for key, val in cache.items():
if key.startswith("Advert:") and isinstance(val, dict) and val.get("__typename") == "Advert": if key.startswith("Advert:") and isinstance(val, dict) and val.get("__typename") == "Advert":
adverts.append(val) adverts.append(val)
# Get total count from ROOT_QUERY # Get total count from ROOT_QUERY
total = 0 total = 0
root = cache.get("ROOT_QUERY", {}) root = cache.get("ROOT_QUERY", {})
for key, val in root.items(): for key, val in root.items():
if "listAdverts" in key and isinstance(val, dict): if "listAdverts" in key and isinstance(val, dict):
tc = val.get("totalCount") tc = val.get("totalCount")
if tc and tc > total: if tc and tc > total:
total = tc total = tc
logger.debug(f"Page {page}: found {len(adverts)} adverts, total={total}") logger.debug(f"Page {page}: found {len(adverts)} adverts, total={total}")
return adverts, total return adverts, total
except (urllib.error.URLError, ConnectionError, OSError) as e:
logger.error(f"HTTP request failed for {url}: {e}", exc_info=True)
raise
def fetch_detail(uri: str) -> dict | None: def fetch_detail(uri: str) -> dict | None:
"""Fetch detail page for a listing.""" """Fetch detail page for a listing."""
try: try:
url = f"{BASE_URL}/nemovitosti-byty-domy/{uri}" url = f"{BASE_URL}/nemovitosti-byty-domy/{uri}"
logger.debug(f"HTTP GET request: {url}") html = fetch_url(url)
req = urllib.request.Request(url, headers=HEADERS)
resp = urllib.request.urlopen(req, timeout=30)
html = resp.read().decode("utf-8")
logger.debug(f"HTTP response: status={resp.status}, size={len(html)} bytes")
match = re.search( match = re.search(
r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>', r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
@@ -171,6 +183,8 @@ def load_cache(json_path: str = "byty_bezrealitky.json") -> dict[int, dict]:
def scrape(max_pages: int | None = None, max_properties: int | None = None): def scrape(max_pages: int | None = None, max_properties: int | None = None):
_run_start = time.time()
_run_ts = datetime.now().isoformat(timespec="seconds")
cache = load_cache() cache = load_cache()
logger.info("=" * 60) logger.info("=" * 60)
@@ -357,7 +371,11 @@ def scrape(max_pages: int | None = None, max_properties: int | None = None):
"source": "bezrealitky", "source": "bezrealitky",
"image": "", "image": "",
"scraped_at": datetime.now().strftime("%Y-%m-%d"), "scraped_at": datetime.now().strftime("%Y-%m-%d"),
"first_seen": cached.get("first_seen", datetime.now().strftime("%Y-%m-%d")) if cached else datetime.now().strftime("%Y-%m-%d"),
"last_changed": datetime.now().strftime("%Y-%m-%d"),
} }
if not validate_listing(result, "bezrealitky"):
continue
results.append(result) results.append(result)
properties_fetched += 1 properties_fetched += 1
@@ -374,6 +392,25 @@ def scrape(max_pages: int | None = None, max_properties: int | None = None):
logger.info(f" ✓ Vyhovující byty: {len(results)}") logger.info(f" ✓ Vyhovující byty: {len(results)}")
logger.info(f"{'=' * 60}") logger.info(f"{'=' * 60}")
write_stats(STATS_FILE, {
"source": "Bezrealitky",
"timestamp": _run_ts,
"duration_sec": round(time.time() - _run_start, 1),
"success": True,
"accepted": len(results),
"fetched": len(all_adverts),
"pages": page - 1,
"cache_hits": cache_hits,
"excluded": {
"dispozice": excluded_disp,
"cena": excluded_price,
"plocha": excluded_area,
"bez GPS": excluded_no_gps,
"panel/síd": excluded_panel,
"patro": excluded_floor,
"bez detailu": excluded_detail,
},
})
return results return results
@@ -394,8 +431,22 @@ if __name__ == "__main__":
handlers=[logging.StreamHandler()] handlers=[logging.StreamHandler()]
) )
_run_ts = datetime.now().isoformat(timespec="seconds")
start = time.time() start = time.time()
estates = scrape(max_pages=args.max_pages, max_properties=args.max_properties) try:
estates = scrape(max_pages=args.max_pages, max_properties=args.max_properties)
except Exception as e:
logger.error(f"Scraper failed: {e}", exc_info=True)
write_stats(STATS_FILE, {
"source": "Bezrealitky",
"timestamp": _run_ts,
"duration_sec": round(time.time() - start, 1),
"success": False,
"accepted": 0,
"fetched": 0,
"error": str(e),
})
raise
if estates: if estates:
json_path = Path("byty_bezrealitky.json") json_path = Path("byty_bezrealitky.json")

View File

@@ -14,6 +14,9 @@ import time
import urllib.request import urllib.request
from datetime import datetime from datetime import datetime
from pathlib import Path from pathlib import Path
from scraper_stats import write_stats, validate_listing
STATS_FILE = "stats_cityhome.json"
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@@ -203,6 +206,8 @@ def extract_project_gps(html: str) -> tuple[float, float] | None:
def scrape(max_pages: int | None = None, max_properties: int | None = None): def scrape(max_pages: int | None = None, max_properties: int | None = None):
_run_start = time.time()
_run_ts = datetime.now().isoformat(timespec="seconds")
logger.info("=" * 60) logger.info("=" * 60)
logger.info("Stahuji inzeráty z CityHome (city-home.cz)") logger.info("Stahuji inzeráty z CityHome (city-home.cz)")
logger.info(f"Cena: do {format_price(MAX_PRICE)}") logger.info(f"Cena: do {format_price(MAX_PRICE)}")
@@ -250,6 +255,16 @@ def scrape(max_pages: int | None = None, max_properties: int | None = None):
else: else:
logger.info(f"{slug}: GPS nenalezeno") logger.info(f"{slug}: GPS nenalezeno")
# Load previous output for first_seen/last_changed tracking
_prev_cache: dict[str, dict] = {}
_prev_path = Path("byty_cityhome.json")
if _prev_path.exists():
try:
for _item in json.loads(_prev_path.read_text(encoding="utf-8")):
_prev_cache[str(_item["hash_id"])] = _item
except Exception:
pass
# Step 3: Filter listings # Step 3: Filter listings
logger.info(f"\nFáze 3: Filtrování...") logger.info(f"\nFáze 3: Filtrování...")
results = [] results = []
@@ -357,7 +372,11 @@ def scrape(max_pages: int | None = None, max_properties: int | None = None):
"source": "cityhome", "source": "cityhome",
"image": "", "image": "",
"scraped_at": datetime.now().strftime("%Y-%m-%d"), "scraped_at": datetime.now().strftime("%Y-%m-%d"),
"first_seen": _prev_cache.get(f"cityhome_{slug}_{listing['unit_name']}", {}).get("first_seen", datetime.now().strftime("%Y-%m-%d")),
"last_changed": datetime.now().strftime("%Y-%m-%d") if _prev_cache.get(f"cityhome_{slug}_{listing['unit_name']}", {}).get("price") != price else _prev_cache[f"cityhome_{slug}_{listing['unit_name']}"].get("last_changed", datetime.now().strftime("%Y-%m-%d")),
} }
if not validate_listing(result, "cityhome"):
continue
results.append(result) results.append(result)
properties_fetched += 1 properties_fetched += 1
@@ -374,6 +393,23 @@ def scrape(max_pages: int | None = None, max_properties: int | None = None):
logger.info(f" ✓ Vyhovující byty: {len(results)}") logger.info(f" ✓ Vyhovující byty: {len(results)}")
logger.info(f"{'=' * 60}") logger.info(f"{'=' * 60}")
write_stats(STATS_FILE, {
"source": "CityHome",
"timestamp": _run_ts,
"duration_sec": round(time.time() - _run_start, 1),
"success": True,
"accepted": len(results),
"fetched": len(all_listings),
"excluded": {
"prodáno": excluded_sold,
"typ": excluded_type,
"dispozice": excluded_disp,
"cena": excluded_price,
"plocha": excluded_area,
"patro": excluded_floor,
"bez GPS": excluded_no_gps,
},
})
return results return results
@@ -394,8 +430,22 @@ if __name__ == "__main__":
handlers=[logging.StreamHandler()] handlers=[logging.StreamHandler()]
) )
_run_ts = datetime.now().isoformat(timespec="seconds")
start = time.time() start = time.time()
estates = scrape(max_pages=args.max_pages, max_properties=args.max_properties) try:
estates = scrape(max_pages=args.max_pages, max_properties=args.max_properties)
except Exception as e:
logger.error(f"Scraper failed: {e}", exc_info=True)
write_stats(STATS_FILE, {
"source": "CityHome",
"timestamp": _run_ts,
"duration_sec": round(time.time() - start, 1),
"success": False,
"accepted": 0,
"fetched": 0,
"error": str(e),
})
raise
if estates: if estates:
json_path = Path("byty_cityhome.json") json_path = Path("byty_cityhome.json")

View File

@@ -15,8 +15,10 @@ import re
import time import time
import urllib.request import urllib.request
import urllib.parse import urllib.parse
from html.parser import HTMLParser
from pathlib import Path from pathlib import Path
from scraper_stats import write_stats, validate_listing
STATS_FILE = "stats_idnes.json"
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@@ -279,6 +281,8 @@ def load_cache(json_path: str = "byty_idnes.json") -> dict[str, dict]:
def scrape(max_pages: int | None = None, max_properties: int | None = None): def scrape(max_pages: int | None = None, max_properties: int | None = None):
_run_start = time.time()
_run_ts = datetime.now().isoformat(timespec="seconds")
cache = load_cache() cache = load_cache()
logger.info("=" * 60) logger.info("=" * 60)
@@ -460,7 +464,11 @@ def scrape(max_pages: int | None = None, max_properties: int | None = None):
"source": "idnes", "source": "idnes",
"image": "", "image": "",
"scraped_at": datetime.now().strftime("%Y-%m-%d"), "scraped_at": datetime.now().strftime("%Y-%m-%d"),
"first_seen": cached.get("first_seen", datetime.now().strftime("%Y-%m-%d")) if cached else datetime.now().strftime("%Y-%m-%d"),
"last_changed": datetime.now().strftime("%Y-%m-%d"),
} }
if not validate_listing(result, "idnes"):
continue
results.append(result) results.append(result)
properties_fetched += 1 properties_fetched += 1
@@ -478,6 +486,25 @@ def scrape(max_pages: int | None = None, max_properties: int | None = None):
logger.info(f" ✓ Vyhovující byty: {len(results)}") logger.info(f" ✓ Vyhovující byty: {len(results)}")
logger.info(f"{'=' * 60}") logger.info(f"{'=' * 60}")
write_stats(STATS_FILE, {
"source": "iDNES",
"timestamp": _run_ts,
"duration_sec": round(time.time() - _run_start, 1),
"success": True,
"accepted": len(results),
"fetched": len(all_listings),
"pages": page,
"cache_hits": cache_hits,
"excluded": {
"cena": excluded_price,
"plocha": excluded_area,
"dispozice": excluded_disp,
"panel/síd": excluded_panel,
"patro": excluded_floor,
"bez GPS": excluded_no_gps,
"bez detailu": excluded_detail,
},
})
return results return results
@@ -498,8 +525,22 @@ if __name__ == "__main__":
handlers=[logging.StreamHandler()] handlers=[logging.StreamHandler()]
) )
_run_ts = datetime.now().isoformat(timespec="seconds")
start = time.time() start = time.time()
estates = scrape(max_pages=args.max_pages, max_properties=args.max_properties) try:
estates = scrape(max_pages=args.max_pages, max_properties=args.max_properties)
except Exception as e:
logger.error(f"Scraper failed: {e}", exc_info=True)
write_stats(STATS_FILE, {
"source": "iDNES",
"timestamp": _run_ts,
"duration_sec": round(time.time() - start, 1),
"success": False,
"accepted": 0,
"fetched": 0,
"error": str(e),
})
raise
if estates: if estates:
json_path = Path("byty_idnes.json") json_path = Path("byty_idnes.json")

View File

@@ -15,6 +15,9 @@ import time
from datetime import datetime from datetime import datetime
from pathlib import Path from pathlib import Path
from urllib.parse import urlencode from urllib.parse import urlencode
from scraper_stats import write_stats, validate_listing
STATS_FILE = "stats_psn.json"
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@@ -35,19 +38,25 @@ BASE_URL = "https://psn.cz"
UNITS_API = f"{BASE_URL}/api/units-list" UNITS_API = f"{BASE_URL}/api/units-list"
def fetch_json(url: str) -> dict: def fetch_json(url: str, retries: int = 3) -> dict:
"""Fetch JSON via curl (urllib SSL may fail on Cloudflare).""" """Fetch JSON via curl (urllib SSL may fail on Cloudflare) with retry."""
logger.debug(f"HTTP GET: {url}") for attempt in range(retries):
result = subprocess.run( logger.debug(f"HTTP GET (attempt {attempt + 1}/{retries}): {url}")
["curl", "-s", "-L", "--max-time", "30", result = subprocess.run(
"-H", f"User-Agent: {UA}", ["curl", "-s", "-L", "--max-time", "30",
"-H", "Accept: application/json", "-H", f"User-Agent: {UA}",
url], "-H", "Accept: application/json",
capture_output=True, text=True, timeout=60 url],
) capture_output=True, text=True, timeout=60
if result.returncode != 0: )
raise RuntimeError(f"curl failed ({result.returncode}): {result.stderr[:200]}") if result.returncode == 0:
return json.loads(result.stdout) return json.loads(result.stdout)
if attempt < retries - 1:
wait = (attempt + 1) * 2
logger.warning(f"curl failed (retry {attempt + 1}/{retries} after {wait}s): {result.stderr[:200]}")
time.sleep(wait)
else:
raise RuntimeError(f"curl failed after {retries} attempts ({result.returncode}): {result.stderr[:200]}")
def fix_gps(lat, lng): def fix_gps(lat, lng):
@@ -67,6 +76,8 @@ def format_price(price: int) -> str:
def scrape(max_properties: int | None = None): def scrape(max_properties: int | None = None):
_run_start = time.time()
_run_ts = datetime.now().isoformat(timespec="seconds")
logger.info("=" * 60) logger.info("=" * 60)
logger.info("Stahuji inzeráty z PSN.cz") logger.info("Stahuji inzeráty z PSN.cz")
logger.info(f"Cena: do {format_price(MAX_PRICE)}") logger.info(f"Cena: do {format_price(MAX_PRICE)}")
@@ -93,11 +104,30 @@ def scrape(max_properties: int | None = None):
data = fetch_json(url) data = fetch_json(url)
except Exception as e: except Exception as e:
logger.error(f"Chyba při stahování: {e}", exc_info=True) logger.error(f"Chyba při stahování: {e}", exc_info=True)
write_stats(STATS_FILE, {
"source": "PSN",
"timestamp": _run_ts,
"duration_sec": round(time.time() - _run_start, 1),
"success": False,
"accepted": 0,
"fetched": 0,
"error": str(e),
})
return [] return []
all_units = data.get("units", {}).get("data", []) all_units = data.get("units", {}).get("data", [])
logger.info(f"Staženo jednotek celkem: {len(all_units)}") logger.info(f"Staženo jednotek celkem: {len(all_units)}")
# Load previous output for first_seen/last_changed tracking
_prev_cache: dict[str, dict] = {}
_prev_path = Path("byty_psn.json")
if _prev_path.exists():
try:
for _item in json.loads(_prev_path.read_text(encoding="utf-8")):
_prev_cache[str(_item["hash_id"])] = _item
except Exception:
pass
# Filtrování # Filtrování
results = [] results = []
excluded = { excluded = {
@@ -228,7 +258,11 @@ def scrape(max_properties: int | None = None):
"source": "psn", "source": "psn",
"image": "", "image": "",
"scraped_at": datetime.now().strftime("%Y-%m-%d"), "scraped_at": datetime.now().strftime("%Y-%m-%d"),
"first_seen": _prev_cache.get(str(unit_id), {}).get("first_seen", datetime.now().strftime("%Y-%m-%d")),
"last_changed": datetime.now().strftime("%Y-%m-%d") if _prev_cache.get(str(unit_id), {}).get("price") != int(price) else _prev_cache[str(unit_id)].get("last_changed", datetime.now().strftime("%Y-%m-%d")),
} }
if not validate_listing(result, "psn"):
continue
results.append(result) results.append(result)
properties_fetched += 1 properties_fetched += 1
@@ -241,6 +275,15 @@ def scrape(max_properties: int | None = None):
logger.info(f" ✓ Vyhovující byty: {len(results)}") logger.info(f" ✓ Vyhovující byty: {len(results)}")
logger.info(f"{'=' * 60}") logger.info(f"{'=' * 60}")
write_stats(STATS_FILE, {
"source": "PSN",
"timestamp": _run_ts,
"duration_sec": round(time.time() - _run_start, 1),
"success": True,
"accepted": len(results),
"fetched": len(all_units),
"excluded": excluded,
})
return results return results
@@ -259,8 +302,22 @@ if __name__ == "__main__":
handlers=[logging.StreamHandler()] handlers=[logging.StreamHandler()]
) )
_run_ts = datetime.now().isoformat(timespec="seconds")
start = time.time() start = time.time()
estates = scrape(max_properties=args.max_properties) try:
estates = scrape(max_properties=args.max_properties)
except Exception as e:
logger.error(f"Scraper failed: {e}", exc_info=True)
write_stats(STATS_FILE, {
"source": "PSN",
"timestamp": _run_ts,
"duration_sec": round(time.time() - start, 1),
"success": False,
"accepted": 0,
"fetched": 0,
"error": str(e),
})
raise
if estates: if estates:
json_path = Path("byty_psn.json") json_path = Path("byty_psn.json")

View File

@@ -15,6 +15,9 @@ import re
import time import time
import urllib.request import urllib.request
from pathlib import Path from pathlib import Path
from scraper_stats import write_stats, validate_listing
STATS_FILE = "stats_realingo.json"
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@@ -53,6 +56,28 @@ HEADERS = {
BASE_URL = "https://www.realingo.cz" BASE_URL = "https://www.realingo.cz"
def fetch_url(url: str, retries: int = 3) -> str:
"""Fetch URL and return HTML string with retry on transient errors."""
for attempt in range(retries):
try:
logger.debug(f"HTTP GET request (attempt {attempt + 1}/{retries}): {url}")
req = urllib.request.Request(url, headers=HEADERS)
resp = urllib.request.urlopen(req, timeout=30)
html = resp.read().decode("utf-8")
logger.debug(f"HTTP response: status={resp.status}, size={len(html)} bytes")
return html
except urllib.error.HTTPError:
raise
except (ConnectionResetError, ConnectionError, urllib.error.URLError, OSError) as e:
if attempt < retries - 1:
wait = (attempt + 1) * 2
logger.warning(f"Connection error (retry {attempt + 1}/{retries} after {wait}s): {e}")
time.sleep(wait)
else:
logger.error(f"HTTP request failed after {retries} attempts: {e}", exc_info=True)
raise
def fetch_listing_page(page: int = 1) -> tuple[list[dict], int]: def fetch_listing_page(page: int = 1) -> tuple[list[dict], int]:
"""Fetch a page of Prague listings. Returns (items, total_count).""" """Fetch a page of Prague listings. Returns (items, total_count)."""
if page == 1: if page == 1:
@@ -60,41 +85,26 @@ def fetch_listing_page(page: int = 1) -> tuple[list[dict], int]:
else: else:
url = f"{BASE_URL}/prodej_byty/praha/{page}_strana/" url = f"{BASE_URL}/prodej_byty/praha/{page}_strana/"
logger.debug(f"HTTP GET request: {url}") html = fetch_url(url)
logger.debug(f"Headers: {HEADERS}") match = re.search(
req = urllib.request.Request(url, headers=HEADERS) r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
try: html, re.DOTALL
resp = urllib.request.urlopen(req, timeout=30) )
html = resp.read().decode("utf-8") if not match:
logger.debug(f"HTTP response: status={resp.status}, size={len(html)} bytes") logger.debug("No __NEXT_DATA__ script found in HTML")
return [], 0
match = re.search( data = json.loads(match.group(1))
r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>', offer_list = data["props"]["pageProps"]["store"]["offer"]["list"]
html, re.DOTALL logger.debug(f"Page {page}: found {len(offer_list['data'])} items, total={offer_list['total']}")
) return offer_list["data"], offer_list["total"]
if not match:
logger.debug("No __NEXT_DATA__ script found in HTML")
return [], 0
data = json.loads(match.group(1))
offer_list = data["props"]["pageProps"]["store"]["offer"]["list"]
logger.debug(f"Page {page}: found {len(offer_list['data'])} items, total={offer_list['total']}")
return offer_list["data"], offer_list["total"]
except (urllib.error.URLError, ConnectionError, OSError) as e:
logger.error(f"HTTP request failed for {url}: {e}", exc_info=True)
raise
def fetch_detail(listing_url: str) -> dict | None: def fetch_detail(listing_url: str) -> dict | None:
"""Fetch detail page for a listing to get floor, building type, etc.""" """Fetch detail page for a listing to get floor, building type, etc."""
try: try:
url = f"{BASE_URL}{listing_url}" url = f"{BASE_URL}{listing_url}"
logger.debug(f"HTTP GET request: {url}") html = fetch_url(url)
req = urllib.request.Request(url, headers=HEADERS)
resp = urllib.request.urlopen(req, timeout=30)
html = resp.read().decode("utf-8")
logger.debug(f"HTTP response: status={resp.status}, size={len(html)} bytes")
match = re.search( match = re.search(
r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>', r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
html, re.DOTALL html, re.DOTALL
@@ -136,6 +146,8 @@ def load_cache(json_path: str = "byty_realingo.json") -> dict[int, dict]:
def scrape(max_pages: int | None = None, max_properties: int | None = None): def scrape(max_pages: int | None = None, max_properties: int | None = None):
_run_start = time.time()
_run_ts = datetime.now().isoformat(timespec="seconds")
cache = load_cache() cache = load_cache()
logger.info("=" * 60) logger.info("=" * 60)
@@ -316,7 +328,11 @@ def scrape(max_pages: int | None = None, max_properties: int | None = None):
"source": "realingo", "source": "realingo",
"image": "", "image": "",
"scraped_at": datetime.now().strftime("%Y-%m-%d"), "scraped_at": datetime.now().strftime("%Y-%m-%d"),
"first_seen": cached.get("first_seen", datetime.now().strftime("%Y-%m-%d")) if cached else datetime.now().strftime("%Y-%m-%d"),
"last_changed": datetime.now().strftime("%Y-%m-%d"),
} }
if not validate_listing(result, "realingo"):
continue
results.append(result) results.append(result)
properties_fetched += 1 properties_fetched += 1
@@ -333,6 +349,25 @@ def scrape(max_pages: int | None = None, max_properties: int | None = None):
logger.info(f" ✓ Vyhovující byty: {len(results)}") logger.info(f" ✓ Vyhovující byty: {len(results)}")
logger.info(f"{'=' * 60}") logger.info(f"{'=' * 60}")
write_stats(STATS_FILE, {
"source": "Realingo",
"timestamp": _run_ts,
"duration_sec": round(time.time() - _run_start, 1),
"success": True,
"accepted": len(results),
"fetched": len(all_listings),
"pages": page - 1,
"cache_hits": cache_hits,
"excluded": {
"dispozice": excluded_category,
"cena": excluded_price,
"plocha": excluded_area,
"bez GPS": excluded_no_gps,
"panel/síd": excluded_panel,
"patro": excluded_floor,
"bez detailu": excluded_detail,
},
})
return results return results
@@ -353,8 +388,22 @@ if __name__ == "__main__":
handlers=[logging.StreamHandler()] handlers=[logging.StreamHandler()]
) )
_run_ts = datetime.now().isoformat(timespec="seconds")
start = time.time() start = time.time()
estates = scrape(max_pages=args.max_pages, max_properties=args.max_properties) try:
estates = scrape(max_pages=args.max_pages, max_properties=args.max_properties)
except Exception as e:
logger.error(f"Scraper failed: {e}", exc_info=True)
write_stats(STATS_FILE, {
"source": "Realingo",
"timestamp": _run_ts,
"duration_sec": round(time.time() - start, 1),
"success": False,
"accepted": 0,
"fetched": 0,
"error": str(e),
})
raise
if estates: if estates:
json_path = Path("byty_realingo.json") json_path = Path("byty_realingo.json")

55
scraper_stats.py Normal file
View File

@@ -0,0 +1,55 @@
"""Shared utilities for scraper run statistics and listing validation."""
from __future__ import annotations
import json
import logging
import os
from pathlib import Path
HERE = Path(__file__).parent
DATA_DIR = Path(os.environ.get("DATA_DIR", HERE))
_val_log = logging.getLogger(__name__)
_REQUIRED_FIELDS = ("hash_id", "price", "locality", "lat", "lon", "url", "source")
def validate_listing(listing: dict, context: str = "") -> bool:
"""
Validate a listing dict before it is written to the output JSON.
Returns True if valid, False if the listing should be skipped.
Logs a warning for each invalid listing.
"""
prefix = f"[{context}] " if context else ""
for field in _REQUIRED_FIELDS:
val = listing.get(field)
if val is None or val == "":
_val_log.warning(f"{prefix}Skipping listing — missing field '{field}': {listing.get('hash_id', '?')}")
return False
price = listing.get("price")
if not isinstance(price, (int, float)) or price <= 0:
_val_log.warning(f"{prefix}Skipping listing — invalid price={price!r}: {listing.get('hash_id', '?')}")
return False
lat, lon = listing.get("lat"), listing.get("lon")
if not isinstance(lat, (int, float)) or not isinstance(lon, (int, float)):
_val_log.warning(f"{prefix}Skipping listing — non-numeric GPS lat={lat!r} lon={lon!r}: {listing.get('hash_id', '?')}")
return False
if not (47.0 <= lat <= 52.0) or not (12.0 <= lon <= 19.0):
_val_log.warning(f"{prefix}Skipping listing — GPS outside Czech Republic lat={lat} lon={lon}: {listing.get('hash_id', '?')}")
return False
area = listing.get("area")
if area is not None and (not isinstance(area, (int, float)) or area <= 0):
_val_log.warning(f"{prefix}Skipping listing — invalid area={area!r}: {listing.get('hash_id', '?')}")
return False
return True
def write_stats(filename: str, stats: dict) -> None:
"""Write scraper run stats dict to the data directory."""
path = DATA_DIR / filename
path.write_text(json.dumps(stats, ensure_ascii=False, indent=2), encoding="utf-8")

477
server.py Normal file
View File

@@ -0,0 +1,477 @@
#!/usr/bin/env python3
"""
General-purpose HTTP server for maru-hleda-byt.
Serves static files from DATA_DIR and additionally handles:
GET /scrapers-status → SSR scraper status page
GET /api/ratings → ratings.json contents
POST /api/ratings → save entire ratings object
GET /api/ratings/export → same as GET, with download header
GET /api/status → status.json contents (JSON)
GET /api/status/history → scraper_history.json contents (JSON)
"""
from __future__ import annotations
import functools
import json
import logging
import os
import sys
from datetime import datetime
from http.server import HTTPServer, SimpleHTTPRequestHandler
from pathlib import Path
PORT = int(os.environ.get("SERVER_PORT", 8080))
DATA_DIR = Path(os.environ.get("DATA_DIR", "."))
RATINGS_FILE = DATA_DIR / "ratings.json"
_LOG_LEVEL = getattr(logging, os.environ.get("LOG_LEVEL", "INFO").upper(), logging.INFO)
logging.basicConfig(
level=_LOG_LEVEL,
format="%(asctime)s [server] %(levelname)s %(message)s",
datefmt="%Y-%m-%dT%H:%M:%S",
)
log = logging.getLogger(__name__)
# ── Helpers ──────────────────────────────────────────────────────────────────
COLORS = {
"sreality": "#1976D2",
"realingo": "#7B1FA2",
"bezrealitky": "#E65100",
"idnes": "#C62828",
"psn": "#2E7D32",
"cityhome": "#00838F",
}
MONTHS_CZ = [
"ledna", "února", "března", "dubna", "května", "června",
"července", "srpna", "září", "října", "listopadu", "prosince",
]
def _load_json(path: Path, default=None):
"""Read and parse JSON file; return default on missing or parse error."""
log.debug("_load_json: %s", path.resolve())
try:
if path.exists():
return json.loads(path.read_text(encoding="utf-8"))
except Exception as e:
log.warning("Failed to load %s: %s", path, e)
return default
def _fmt_date(iso_str: str) -> str:
"""Format ISO timestamp as Czech date string."""
try:
d = datetime.fromisoformat(iso_str)
return f"{d.day}. {MONTHS_CZ[d.month - 1]} {d.year}, {d.hour:02d}:{d.minute:02d}"
except Exception:
return iso_str
def load_ratings() -> dict:
return _load_json(RATINGS_FILE, default={})
def save_ratings(data: dict) -> None:
RATINGS_FILE.write_text(
json.dumps(data, ensure_ascii=False, indent=2),
encoding="utf-8",
)
# ── SSR status page ──────────────────────────────────────────────────────────
_CSS = """\
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
font-family: system-ui, -apple-system, sans-serif;
background: #f5f5f5; color: #333;
padding: 24px; max-width: 640px; margin: 0 auto;
}
h1 { font-size: 22px; margin-bottom: 4px; }
.subtitle { color: #888; font-size: 13px; margin-bottom: 24px; }
.card {
background: white; border-radius: 12px; padding: 20px;
box-shadow: 0 1px 4px rgba(0,0,0,0.08); margin-bottom: 16px;
}
.card h2 { font-size: 15px; margin-bottom: 12px; color: #555; }
.timestamp { font-size: 28px; font-weight: 700; color: #1976D2; }
.timestamp-sub { font-size: 13px; color: #999; margin-top: 2px; }
.summary-row {
display: flex; justify-content: space-between; align-items: center;
padding: 10px 0; border-bottom: 1px solid #f0f0f0;
}
.summary-row:last-child { border-bottom: none; }
.summary-label { font-size: 13px; color: #666; }
.summary-value { font-size: 18px; font-weight: 700; }
.badge {
display: inline-block; padding: 2px 8px; border-radius: 4px;
font-size: 11px; font-weight: 600; color: white;
}
.badge-ok { background: #4CAF50; }
.badge-err { background: #F44336; }
.badge-skip { background: #FF9800; }
.bar-row { display: flex; align-items: center; gap: 8px; margin: 4px 0; }
.bar-track { flex: 1; height: 20px; background: #f0f0f0; border-radius: 4px; overflow: hidden; }
.bar-fill { height: 100%; border-radius: 4px; }
.bar-count { font-size: 12px; width: 36px; font-variant-numeric: tabular-nums; }
.loader-wrap {
display: flex; flex-direction: column; align-items: center;
justify-content: center; padding: 60px 0;
}
.spinner {
width: 40px; height: 40px; border: 4px solid #e0e0e0;
border-top-color: #1976D2; border-radius: 50%;
animation: spin 0.8s linear infinite;
}
@keyframes spin { to { transform: rotate(360deg); } }
.loader-text { margin-top: 16px; color: #999; font-size: 14px; }
.link-row { text-align: center; margin-top: 8px; }
.link-row a { color: #1976D2; text-decoration: none; font-size: 14px; }
.history-table { width: 100%; border-collapse: collapse; font-size: 12px; }
.history-table th {
text-align: left; font-weight: 600; color: #999; font-size: 11px;
padding: 4px 6px 8px 6px; border-bottom: 2px solid #f0f0f0;
}
.history-table td { padding: 7px 6px; border-bottom: 1px solid #f5f5f5; vertical-align: middle; }
.history-table tr:last-child td { border-bottom: none; }
.history-table tr.latest td { background: #f8fbff; font-weight: 600; }
.src-nums { display: flex; gap: 4px; flex-wrap: wrap; }
.src-chip {
display: inline-block; padding: 1px 5px; border-radius: 3px;
font-size: 10px; color: white; font-variant-numeric: tabular-nums;
}
.clickable-row { cursor: pointer; }
.clickable-row:hover td { background: #f0f7ff !important; }
/* Modal */
#md-overlay {
position: fixed; inset: 0; background: rgba(0,0,0,0.45);
display: flex; align-items: flex-start; justify-content: center;
z-index: 1000; padding: 40px 16px; overflow-y: auto;
}
#md-box {
background: white; border-radius: 12px; padding: 24px;
width: 100%; max-width: 620px; position: relative;
box-shadow: 0 8px 32px rgba(0,0,0,0.24); margin: auto;
}
#md-close {
position: absolute; top: 10px; right: 14px;
background: none; border: none; font-size: 26px; cursor: pointer;
color: #aaa; line-height: 1;
}
#md-close:hover { color: #333; }
#md-box h3 { font-size: 15px; margin-bottom: 14px; padding-right: 24px; }
.md-summary { display: flex; gap: 20px; flex-wrap: wrap; font-size: 13px; margin-bottom: 16px; color: #555; }
.md-summary b { color: #333; }
.detail-table { width: 100%; border-collapse: collapse; font-size: 12px; }
.detail-table th {
text-align: left; color: #999; font-size: 11px; font-weight: 600;
padding: 4px 8px 6px 0; border-bottom: 2px solid #f0f0f0; white-space: nowrap;
}
.detail-table td { padding: 6px 8px 6px 0; border-bottom: 1px solid #f5f5f5; vertical-align: top; }
.detail-table tr:last-child td { border-bottom: none; }
"""
_SOURCE_ORDER = ["Sreality", "Realingo", "Bezrealitky", "iDNES", "PSN", "CityHome"]
_SOURCE_ABBR = ["Sre", "Rea", "Bez", "iDN", "PSN", "CH"]
def _sources_html(sources: list) -> str:
if not sources:
return ""
max_count = max((s.get("accepted", 0) for s in sources), default=1) or 1
parts = ['<div class="card"><h2>Zdroje</h2>']
for s in sources:
name = s.get("name", "?")
accepted = s.get("accepted", 0)
error = s.get("error")
exc = s.get("excluded", {})
excluded_total = sum(exc.values()) if isinstance(exc, dict) else s.get("excluded_total", 0)
color = COLORS.get(name.lower(), "#999")
pct = round(accepted / max_count * 100) if max_count else 0
if error:
badge = '<span class="badge badge-err">chyba</span>'
elif accepted == 0:
badge = '<span class="badge badge-skip">0</span>'
else:
badge = '<span class="badge badge-ok">OK</span>'
parts.append(
f'<div style="margin-bottom:12px;">'
f'<div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">'
f'<span style="font-weight:600;font-size:14px;">{name} {badge}</span>'
f'<span style="font-size:12px;color:#999;">{excluded_total} vyloučených</span>'
f'</div>'
f'<div class="bar-row">'
f'<div class="bar-track"><div class="bar-fill" style="width:{pct}%;background:{color};"></div></div>'
f'<span class="bar-count">{accepted}</span>'
f'</div></div>'
)
parts.append("</div>")
return "".join(parts)
def _history_html(history: list) -> str:
if not history:
return ""
rows = list(reversed(history))
parts = [
'<div class="card">'
'<h2>Historie běhů <span style="font-size:11px;font-weight:400;color:#bbb;"> klikni pro detaily</span></h2>',
'<table class="history-table"><thead><tr>',
'<th>Datum</th><th>Trvání</th><th>Přijato&nbsp;/&nbsp;Dedup</th><th>Zdroje</th><th>OK</th>',
'</tr></thead><tbody>',
]
for i, entry in enumerate(rows):
row_class = ' class="latest clickable-row"' if i == 0 else ' class="clickable-row"'
src_map = {s["name"]: s for s in entry.get("sources", []) if "name" in s}
chips = "".join(
f'<span class="src-chip" style="background:{"#F44336" if (src_map.get(name) or {}).get("error") else COLORS.get(name.lower(), "#999")}" title="{name}">'
f'{abbr}&nbsp;{src_map[name].get("accepted", 0) if name in src_map else "-"}</span>'
for name, abbr in zip(_SOURCE_ORDER, _SOURCE_ABBR)
)
ok_badge = (
'<span class="badge badge-err">chyba</span>'
if entry.get("success") is False
else '<span class="badge badge-ok">OK</span>'
)
dur = f'{entry["duration_sec"]}s' if entry.get("duration_sec") is not None else "-"
parts.append(
f'<tr{row_class} data-idx="{i}">'
f'<td>{_fmt_date(entry.get("timestamp", ""))}</td>'
f'<td>{dur}</td>'
f'<td>{entry.get("total_accepted", "-")}&nbsp;/&nbsp;{entry.get("deduplicated", "-")}</td>'
f'<td><div class="src-nums">{chips}</div></td>'
f'<td>{ok_badge}</td>'
f'</tr>'
)
parts.append("</tbody></table></div>")
return "".join(parts)
def _modal_script(rows_json: str) -> str:
"""Return the modal overlay HTML + JS for the history detail popup."""
return (
'<div id="md-overlay" style="display:none">'
'<div id="md-box"><button id="md-close">&times;</button>'
'<div id="md-body"></div></div></div>\n'
'<script>\n(function(){\n'
f'var H={rows_json};\n'
'var C={"sreality":"#1976D2","realingo":"#7B1FA2","bezrealitky":"#E65100","idnes":"#C62828","psn":"#2E7D32","cityhome":"#00838F"};\n'
'var MN=["ledna","února","března","dubna","května","června","července","srpna","září","října","listopadu","prosince"];\n'
'function fd(s){var d=new Date(s);return d.getDate()+". "+MN[d.getMonth()]+" "+d.getFullYear()+", "+String(d.getHours()).padStart(2,"0")+":"+String(d.getMinutes()).padStart(2,"0");}\n'
'function openModal(idx){\n'
' var e=H[idx],src=e.sources||[];\n'
' var h="<h3>Detaily b\u011bhu \u2013 "+fd(e.timestamp)+"</h3>";\n'
' h+="<div class=\\"md-summary\\">";\n'
' if(e.duration_sec!=null) h+="<span><b>Trvání:</b> "+e.duration_sec+"s</span>";\n'
' if(e.total_accepted!=null) h+="<span><b>Přijato:</b> "+e.total_accepted+"</span>";\n'
' if(e.deduplicated!=null) h+="<span><b>Po dedup:</b> "+e.deduplicated+"</span>";\n'
' h+="</div>";\n'
' h+="<table class=\\"detail-table\\"><thead><tr>";\n'
' h+="<th>Zdroj</th><th>Přijato</th><th>Staženo</th><th>Stránky</th><th>Cache</th><th>Vyloučeno</th><th>Čas</th><th>OK</th>";\n'
' h+="</tr></thead><tbody>";\n'
' src.forEach(function(s){\n'
' var nm=s.name||"?",col=C[nm.toLowerCase()]||"#999";\n'
' var exc=s.excluded||{};\n'
' var excStr=Object.entries(exc).filter(function(kv){return kv[1]>0;}).map(function(kv){return kv[0]+":&nbsp;"+kv[1];}).join(", ")||"\u2013";\n'
' var ok=s.error?"<span class=\\"badge badge-err\\" title=\\""+s.error+"\\">chyba</span>":"<span class=\\"badge badge-ok\\">OK</span>";\n'
' var dot="<span style=\\"display:inline-block;width:8px;height:8px;border-radius:50%;background:"+col+";margin-right:5px;\\"></span>";\n'
' h+="<tr>";\n'
' h+="<td>"+dot+nm+"</td>";\n'
' h+="<td>"+(s.accepted!=null?s.accepted:"\u2013")+"</td>";\n'
' h+="<td>"+(s.fetched!=null?s.fetched:"\u2013")+"</td>";\n'
' h+="<td>"+(s.pages!=null?s.pages:"\u2013")+"</td>";\n'
' h+="<td>"+(s.cache_hits!=null?s.cache_hits:"\u2013")+"</td>";\n'
' h+="<td style=\\"font-size:11px;color:#666;\\">"+excStr+"</td>";\n'
' h+="<td>"+(s.duration_sec!=null?s.duration_sec+"s":"\u2013")+"</td>";\n'
' h+="<td>"+ok+"</td></tr>";\n'
' });\n'
' h+="</tbody></table>";\n'
' document.getElementById("md-body").innerHTML=h;\n'
' document.getElementById("md-overlay").style.display="flex";\n'
'}\n'
'function closeModal(){document.getElementById("md-overlay").style.display="none";}\n'
'var tb=document.querySelector(".history-table tbody");\n'
'if(tb)tb.addEventListener("click",function(e){var tr=e.target.closest("tr[data-idx]");if(tr)openModal(parseInt(tr.dataset.idx,10));});\n'
'document.getElementById("md-close").addEventListener("click",closeModal);\n'
'document.getElementById("md-overlay").addEventListener("click",function(e){if(e.target===this)closeModal();});\n'
'document.addEventListener("keydown",function(e){if(e.key==="Escape")closeModal();});\n'
'})();\n</script>'
)
def _render_status_html(status: dict | None, history: list, is_running: bool = False) -> str:
"""Generate the complete HTML page for /scrapers-status."""
head_open = (
'<!DOCTYPE html>\n<html lang="cs">\n<head>\n'
'<meta charset="UTF-8">\n'
'<meta name="viewport" content="width=device-width, initial-scale=1.0">\n'
f'<title>Scraper status</title>\n<style>{_CSS}</style>\n'
)
page_header = '<h1>Scraper status</h1>\n<div class="subtitle">maru-hleda-byt</div>\n'
footer = '<div class="link-row"><a href="/mapa_bytu.html">Otevřít mapu</a></div>'
if status is None:
return (
head_open + '</head>\n<body>\n' + page_header
+ '<div class="card"><p style="color:#F44336">Status není k dispozici.</p></div>\n'
+ footer + '\n</body>\n</html>'
)
if is_running:
return (
head_open
+ '<meta http-equiv="refresh" content="30">\n'
+ '</head>\n<body>\n' + page_header
+ '<div class="loader-wrap"><div class="spinner"></div>'
+ '<div class="loader-text">Scraper právě běží…</div></div>\n'
+ footer + '\n</body>\n</html>'
)
# ── Done state ────────────────────────────────────────────────────────────
ts = status.get("timestamp", "")
duration = status.get("duration_sec")
total_accepted = status.get("total_accepted", 0)
deduplicated = status.get("deduplicated")
ts_card = (
'<div class="card"><h2>Poslední scrape</h2>'
f'<div class="timestamp">{_fmt_date(ts)}</div>'
+ (f'<div class="timestamp-sub">Trvání: {round(duration)}s</div>' if duration is not None else "")
+ '</div>'
)
sum_card = (
'<div class="card"><h2>Souhrn</h2>'
f'<div class="summary-row"><span class="summary-label">Vyhovujících bytů</span>'
f'<span class="summary-value" style="color:#4CAF50">{total_accepted}</span></div>'
+ (
f'<div class="summary-row"><span class="summary-label">Po deduplikaci (v mapě)</span>'
f'<span class="summary-value" style="color:#1976D2">{deduplicated}</span></div>'
if deduplicated is not None else ""
)
+ '</div>'
)
rows_for_js = list(reversed(history))
body = (
page_header
+ ts_card + "\n"
+ sum_card + "\n"
+ _sources_html(status.get("sources", [])) + "\n"
+ _history_html(history) + "\n"
+ footer
)
modal = _modal_script(json.dumps(rows_for_js, ensure_ascii=False))
return head_open + '</head>\n<body>\n' + body + '\n' + modal + '\n</body>\n</html>'
# ── HTTP handler ──────────────────────────────────────────────────────────────
class Handler(SimpleHTTPRequestHandler):
def log_message(self, format, *args):
pass # suppress default access log; use our own where needed
def _send_json(self, status: int, body, extra_headers=None):
payload = json.dumps(body, ensure_ascii=False).encode("utf-8")
self.send_response(status)
self.send_header("Content-Type", "application/json; charset=utf-8")
self.send_header("Content-Length", str(len(payload)))
self.send_header("Access-Control-Allow-Origin", "*")
self.send_header("Access-Control-Allow-Methods", "GET, POST, OPTIONS")
self.send_header("Access-Control-Allow-Headers", "Content-Type")
if extra_headers:
for k, v in extra_headers.items():
self.send_header(k, v)
self.end_headers()
self.wfile.write(payload)
def do_OPTIONS(self):
self.send_response(204)
self.send_header("Access-Control-Allow-Origin", "*")
self.send_header("Access-Control-Allow-Methods", "GET, POST, OPTIONS")
self.send_header("Access-Control-Allow-Headers", "Content-Type")
self.end_headers()
def do_GET(self):
if self.path.startswith("/api/"):
self._handle_api_get()
elif self.path.rstrip("/") == "/scrapers-status":
self._serve_status_page()
else:
log.debug("GET %s → static file: %s", self.path, self.translate_path(self.path))
super().do_GET()
def _handle_api_get(self):
if self.path in ("/api/ratings", "/api/ratings/export"):
ratings = load_ratings()
extra = None
if self.path == "/api/ratings/export":
extra = {"Content-Disposition": 'attachment; filename="ratings.json"'}
log.info("GET %s%d ratings", self.path, len(ratings))
self._send_json(200, ratings, extra)
elif self.path == "/api/status":
data = _load_json(DATA_DIR / "status.json")
if data is None:
self._send_json(404, {"error": "status not available"})
return
log.info("GET /api/status → ok")
self._send_json(200, data)
elif self.path == "/api/status/history":
data = _load_json(DATA_DIR / "scraper_history.json", default=[])
if not isinstance(data, list):
data = []
log.info("GET /api/status/history → %d entries", len(data))
self._send_json(200, data)
else:
self._send_json(404, {"error": "not found"})
def _serve_status_page(self):
status = _load_json(DATA_DIR / "status.json")
history = _load_json(DATA_DIR / "scraper_history.json", default=[])
if not isinstance(history, list):
history = []
is_running = (DATA_DIR / "scraper_running.json").exists()
html = _render_status_html(status, history, is_running)
payload = html.encode("utf-8")
self.send_response(200)
self.send_header("Content-Type", "text/html; charset=utf-8")
self.send_header("Content-Length", str(len(payload)))
self.end_headers()
self.wfile.write(payload)
def do_POST(self):
if self.path == "/api/ratings":
length = int(self.headers.get("Content-Length", 0))
if length == 0:
self._send_json(400, {"error": "empty body"})
return
try:
raw = self.rfile.read(length)
data = json.loads(raw.decode("utf-8"))
except Exception as e:
log.warning("Bad request body: %s", e)
self._send_json(400, {"error": "invalid JSON"})
return
if not isinstance(data, dict):
self._send_json(400, {"error": "expected JSON object"})
return
save_ratings(data)
log.info("POST /api/ratings → saved %d ratings", len(data))
self._send_json(200, {"ok": True, "count": len(data)})
else:
self._send_json(404, {"error": "not found"})
if __name__ == "__main__":
log.info("Server starting on port %d, data dir: %s", PORT, DATA_DIR)
handler = functools.partial(Handler, directory=str(DATA_DIR))
server = HTTPServer(("0.0.0.0", PORT), handler)
try:
server.serve_forever()
except KeyboardInterrupt:
log.info("Stopped.")
sys.exit(0)

View File

@@ -1,204 +0,0 @@
<!DOCTYPE html>
<html lang="cs">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Scraper status</title>
<style>
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
font-family: system-ui, -apple-system, sans-serif;
background: #f5f5f5; color: #333;
padding: 24px; max-width: 640px; margin: 0 auto;
}
h1 { font-size: 22px; margin-bottom: 4px; }
.subtitle { color: #888; font-size: 13px; margin-bottom: 24px; }
.card {
background: white; border-radius: 12px; padding: 20px;
box-shadow: 0 1px 4px rgba(0,0,0,0.08); margin-bottom: 16px;
}
.card h2 { font-size: 15px; margin-bottom: 12px; color: #555; }
.timestamp {
font-size: 28px; font-weight: 700; color: #1976D2;
}
.timestamp-ago { font-size: 13px; color: #999; margin-top: 2px; }
/* Source table */
.source-table { width: 100%; border-collapse: collapse; }
.source-table td { padding: 8px 0; border-bottom: 1px solid #f0f0f0; font-size: 14px; }
.source-table tr:last-child td { border-bottom: none; }
.source-table .name { font-weight: 600; }
.source-table .count { text-align: right; font-variant-numeric: tabular-nums; }
.source-table .rejected { text-align: right; color: #999; font-size: 12px; }
.badge {
display: inline-block; padding: 2px 8px; border-radius: 4px;
font-size: 11px; font-weight: 600; color: white;
}
.badge-ok { background: #4CAF50; }
.badge-err { background: #F44336; }
.badge-skip { background: #FF9800; }
/* Summary bar */
.summary-row {
display: flex; justify-content: space-between; align-items: center;
padding: 10px 0; border-bottom: 1px solid #f0f0f0;
}
.summary-row:last-child { border-bottom: none; }
.summary-label { font-size: 13px; color: #666; }
.summary-value { font-size: 18px; font-weight: 700; }
/* Source bar chart */
.bar-row { display: flex; align-items: center; gap: 8px; margin: 4px 0; }
.bar-label { width: 90px; font-size: 12px; text-align: right; color: #666; }
.bar-track { flex: 1; height: 20px; background: #f0f0f0; border-radius: 4px; overflow: hidden; position: relative; }
.bar-fill { height: 100%; border-radius: 4px; transition: width 0.5s ease; }
.bar-count { font-size: 12px; width: 36px; font-variant-numeric: tabular-nums; }
/* Loader */
.loader-wrap {
display: flex; flex-direction: column; align-items: center;
justify-content: center; padding: 60px 0;
}
.spinner {
width: 40px; height: 40px; border: 4px solid #e0e0e0;
border-top-color: #1976D2; border-radius: 50%;
animation: spin 0.8s linear infinite;
}
@keyframes spin { to { transform: rotate(360deg); } }
.loader-text { margin-top: 16px; color: #999; font-size: 14px; }
.error-msg { color: #F44336; padding: 40px 0; text-align: center; }
.link-row { text-align: center; margin-top: 8px; }
.link-row a { color: #1976D2; text-decoration: none; font-size: 14px; }
</style>
</head>
<body>
<h1>Scraper status</h1>
<div class="subtitle">maru-hleda-byt</div>
<div id="content">
<div class="loader-wrap">
<div class="spinner"></div>
<div class="loader-text">Nacitam status...</div>
</div>
</div>
<div class="link-row"><a href="mapa_bytu.html">Otevrit mapu</a></div>
<script>
var COLORS = {
sreality: '#1976D2',
realingo: '#7B1FA2',
bezrealitky: '#E65100',
idnes: '#C62828',
psn: '#2E7D32',
cityhome: '#00838F',
};
function timeAgo(dateStr) {
var d = new Date(dateStr);
var now = new Date();
var diff = Math.floor((now - d) / 1000);
if (diff < 60) return 'prave ted';
if (diff < 3600) return Math.floor(diff / 60) + ' min zpet';
if (diff < 86400) return Math.floor(diff / 3600) + ' hod zpet';
return Math.floor(diff / 86400) + ' dni zpet';
}
function formatDate(dateStr) {
var d = new Date(dateStr);
var day = d.getDate();
var months = ['ledna','unora','brezna','dubna','kvetna','cervna',
'cervence','srpna','zari','rijna','listopadu','prosince'];
var hh = String(d.getHours()).padStart(2, '0');
var mm = String(d.getMinutes()).padStart(2, '0');
return day + '. ' + months[d.getMonth()] + ' ' + d.getFullYear() + ', ' + hh + ':' + mm;
}
function render(data) {
// Check if scrape is currently running
if (data.status === 'running') {
document.getElementById('content').innerHTML =
'<div class="loader-wrap">' +
'<div class="spinner"></div>' +
'<div class="loader-text">Scraper prave bezi...</div>' +
'</div>';
setTimeout(loadStatus, 30000);
return;
}
var sources = data.sources || [];
var totalOk = 0, totalRej = 0;
var maxCount = 0;
sources.forEach(function(s) {
totalOk += s.accepted || 0;
totalRej += s.rejected || 0;
if (s.accepted > maxCount) maxCount = s.accepted;
});
var html = '';
// Timestamp card
html += '<div class="card">';
html += '<h2>Posledni scrape</h2>';
html += '<div class="timestamp">' + formatDate(data.timestamp) + '</div>';
html += '<div class="timestamp-ago">' + timeAgo(data.timestamp) + '</div>';
if (data.duration_sec) {
html += '<div class="timestamp-ago">Trvani: ' + Math.round(data.duration_sec) + 's</div>';
}
html += '</div>';
// Summary card
html += '<div class="card">';
html += '<h2>Souhrn</h2>';
html += '<div class="summary-row"><span class="summary-label">Vyhovujicich bytu</span><span class="summary-value" style="color:#4CAF50">' + totalOk + '</span></div>';
html += '<div class="summary-row"><span class="summary-label">Vyloucenych</span><span class="summary-value" style="color:#999">' + totalRej + '</span></div>';
if (data.deduplicated !== undefined) {
html += '<div class="summary-row"><span class="summary-label">Po deduplikaci (v mape)</span><span class="summary-value" style="color:#1976D2">' + data.deduplicated + '</span></div>';
}
html += '</div>';
// Sources card
html += '<div class="card">';
html += '<h2>Zdroje</h2>';
sources.forEach(function(s) {
var color = COLORS[s.name.toLowerCase()] || '#999';
var pct = maxCount > 0 ? Math.round((s.accepted / maxCount) * 100) : 0;
var badge = s.error
? '<span class="badge badge-err">chyba</span>'
: (s.accepted === 0 ? '<span class="badge badge-skip">0</span>' : '<span class="badge badge-ok">OK</span>');
html += '<div style="margin-bottom:12px;">';
html += '<div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">';
html += '<span style="font-weight:600;font-size:14px;">' + s.name + ' ' + badge + '</span>';
html += '<span style="font-size:12px;color:#999;">' + (s.rejected || 0) + ' vyloucenych</span>';
html += '</div>';
html += '<div class="bar-row">';
html += '<div class="bar-track"><div class="bar-fill" style="width:' + pct + '%;background:' + color + ';"></div></div>';
html += '<span class="bar-count">' + (s.accepted || 0) + '</span>';
html += '</div>';
html += '</div>';
});
html += '</div>';
document.getElementById('content').innerHTML = html;
}
function loadStatus() {
fetch('status.json?t=' + Date.now())
.then(function(r) {
if (!r.ok) throw new Error(r.status);
return r.json();
})
.then(render)
.catch(function(err) {
document.getElementById('content').innerHTML =
'<div class="error-msg">Status zatim neni k dispozici.<br><small>(' + err.message + ')</small></div>';
});
}
loadStatus();
</script>
</body>
</html>