Maru hleda byt ("Maru is looking for an apartment")
Apartment search aggregator for Prague. Scrapes listings from 6 Czech real estate portals, filters them by configurable criteria, deduplicates across sources, and generates a single interactive map with all matching apartments.
Built for a specific use case: finding a 3+kk or larger apartment in Prague, excluding panel construction ("panelak") and housing estates ("sidliste"), with personal rating support.
How it works
┌─────────────────────────────────────────────────────────────┐
│ run_all.sh │
│ Orchestrates all scrapers, then merges results into map │
├─────────┬──────────┬──────────┬────────┬────────┬───────────┤
│Sreality │Realingo │Bezreal. │iDNES │PSN │CityHome │
│ (API) │ (HTML) │ (HTML) │ (HTML) │ (HTML) │ (HTML) │
├─────────┴──────────┴──────────┴────────┴────────┴───────────┤
│ merge_and_map.py │
│ Loads all byty_*.json, deduplicates, generates HTML map │
├─────────────────────────────────────────────────────────────┤
│ mapa_bytu.html │
│ Interactive Leaflet.js map with filters & ratings │
└─────────────────────────────────────────────────────────────┘
Pipeline
- Scraping -- Each scraper independently fetches listings from its portal, applies filters, and saves results to a JSON file (byty_<source>.json).
- Merging -- merge_and_map.py loads all 6 JSON files, deduplicates listings (by street name + price + area), and generates the final mapa_bytu.html.
- Serving -- The HTML map can be opened locally as a file, or served via Docker with a built-in HTTP server.
Execution order in run_all.sh
Scrapers run sequentially (to avoid overwhelming any single portal), except PSN and CityHome, which run in parallel because they target different sites. If a scraper fails, the failure is logged but does not abort the pipeline -- the remaining scrapers continue.
1. scrape_and_map.py (Sreality)
2. scrape_realingo.py (Realingo)
3. scrape_bezrealitky.py (Bezrealitky)
4. scrape_idnes.py (iDNES Reality)
5. scrape_psn.py + scrape_cityhome.py (parallel)
6. merge_and_map.py (merge + map generation)
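run_all.sh itself is a shell script; the control flow above could be sketched in Python (the project's language) roughly like this. The `run_all` helper and its structure are illustrative, not the orchestrator's actual code:

```python
import subprocess
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("run_all")

SEQUENTIAL = ["scrape_and_map.py", "scrape_realingo.py",
              "scrape_bezrealitky.py", "scrape_idnes.py"]
PARALLEL = ["scrape_psn.py", "scrape_cityhome.py"]  # different sites, safe to overlap

def run_all(extra_args=()):
    # Sequential scrapers: a failure is logged but does not abort the pipeline.
    for script in SEQUENTIAL:
        result = subprocess.run(["python3", script, *extra_args])
        if result.returncode != 0:
            log.warning("%s failed (exit %d), continuing", script, result.returncode)

    # PSN and CityHome run concurrently.
    procs = [subprocess.Popen(["python3", s, *extra_args]) for s in PARALLEL]
    for script, proc in zip(PARALLEL, procs):
        if proc.wait() != 0:
            log.warning("%s failed (exit %d), continuing", script, proc.returncode)

    # Merge + map generation always runs last.
    subprocess.run(["python3", "merge_and_map.py"])
```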
Scrapers
All scrapers share the same CLI interface and a consistent two-phase approach:
- Phase 1 -- Fetch listing pages (paginated) to get a list of all available apartments.
- Phase 2 -- Fetch detail pages for each listing to get floor, construction type, and other data needed for filtering.
Each scraper uses a JSON file cache: if a listing's hash_id and price haven't changed since the last run, the cached data is reused and the detail page is not re-fetched. This significantly reduces runtime on subsequent runs.
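The two-phase flow with the hash_id + price cache check might be sketched like this. The function names (fetch_listing_page, fetch_detail) and record fields are assumptions for illustration, not the scrapers' actual API:

```python
import json
import os

def scrape(source, fetch_listing_page, fetch_detail, max_pages=10):
    """Two-phase scrape with a JSON file cache keyed by hash_id (sketch)."""
    cache_file = f"byty_{source}.json"
    cache = {}
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as f:
            cache = {item["hash_id"]: item for item in json.load(f)}

    # Phase 1: paginated listing pages -> lightweight stubs (hash_id, price, ...).
    stubs = []
    for page in range(1, max_pages + 1):
        batch = fetch_listing_page(page)
        if not batch:
            break
        stubs.extend(batch)

    # Phase 2: fetch detail pages only for new or re-priced listings.
    results = []
    for stub in stubs:
        cached = cache.get(stub["hash_id"])
        if cached and cached["price"] == stub["price"]:
            results.append(cached)            # cache hit: skip the detail fetch
        else:
            results.append(fetch_detail(stub))

    # Overwrite the cache file with the fresh results.
    with open(cache_file, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False)
    return results
```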
Source details
| Scraper | Portal | Data source | Output file | Notes |
|---|---|---|---|---|
| scrape_and_map.py | Sreality.cz | REST API (JSON) | byty_sreality.json | Main scraper. Also contains generate_map() used by all other scripts. |
| scrape_realingo.py | Realingo.cz | __NEXT_DATA__ JSON in HTML | byty_realingo.json | Next.js app, data extracted from server-side props. |
| scrape_bezrealitky.py | Bezrealitky.cz | __NEXT_DATA__ Apollo cache in HTML | byty_bezrealitky.json | Next.js app with Apollo GraphQL cache in page source. |
| scrape_idnes.py | Reality iDNES | HTML parsing (regex) | byty_idnes.json | Traditional HTML site. GPS extracted from dataLayer.push(). Retry logic with 5 attempts and exponential backoff. |
| scrape_psn.py | PSN.cz | RSC (React Server Components) escaped JSON in HTML | byty_psn.json | Uses curl instead of urllib due to Cloudflare SSL issues. Hardcoded list of Prague projects with GPS coordinates. |
| scrape_cityhome.py | city-home.cz | HTML table parsing (data attributes on <tr>) | byty_cityhome.json | CityHome/SATPO developer projects. GPS fetched from project locality pages. |
Scraper filter criteria
All scrapers apply the same core filters (with minor per-source variations):
| Filter | Value | Notes |
|---|---|---|
| Max price | 13 500 000 CZK | PSN and CityHome use 14 000 000 CZK |
| Min area | 69 m^2 | |
| Min floor | 2. NP (2nd floor) | 2nd floor apartments are included but flagged on the map |
| Dispositions | 3+kk, 3+1, 4+kk, 4+1, 5+kk, 5+1, 6+ | |
| Region | Praha | |
| Construction | Excludes panel ("panelak") | |
| Location | Excludes housing estates ("sidliste") | |
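The shared core filter could be expressed as a predicate like the one below. The field names (price, area_m2, floor, disposition, description) and the accented keyword variants are assumptions for illustration, not the scrapers' actual schema:

```python
# Dispositions and thresholds taken from the filter table; field names assumed.
ALLOWED_DISPOSITIONS = {"3+kk", "3+1", "4+kk", "4+1", "5+kk", "5+1", "6+"}
EXCLUDED_KEYWORDS = ("panelak", "panelák", "sidliste", "sídliště")

def passes_filters(listing, max_price=13_500_000, min_area=69, min_floor=2):
    if listing["price"] > max_price:
        return False
    if listing["area_m2"] < min_area:
        return False
    if listing["floor"] < min_floor:      # 2. NP passes but is flagged on the map
        return False
    if listing["disposition"] not in ALLOWED_DISPOSITIONS:
        return False
    text = listing.get("description", "").lower()
    return not any(word in text for word in EXCLUDED_KEYWORDS)
```

PSN and CityHome would call this with max_price=14_000_000, per the table above.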
Utility scripts
merge_and_map.py
Merges all byty_*.json files into byty_merged.json and generates mapa_bytu.html.
Deduplication logic: Two listings are considered duplicates if they share the same normalized street name + price + area. PSN and CityHome have priority during dedup (loaded first), so their listings are kept over duplicates from other portals.
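The dedup key and priority-order merge described above might look like this sketch (field names street, price, area_m2 are assumptions; normalization strips Czech diacritics so "Vinohradská" and "vinohradska" match):

```python
import unicodedata

def dedup_key(listing):
    """Normalized street name + price + area identifies one apartment across portals."""
    street = unicodedata.normalize("NFD", listing["street"].lower())
    street = "".join(c for c in street if not unicodedata.combining(c)).strip()
    return (street, listing["price"], listing["area_m2"])

def merge(sources):
    """Sources are iterated in priority order (PSN and CityHome first),
    so the first listing seen for a key wins over later duplicates."""
    seen = {}
    for listings in sources:
        for listing in listings:
            seen.setdefault(dedup_key(listing), listing)
    return list(seen.values())
```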
regen_map.py
Regenerates the map from existing byty_sreality.json data without re-scraping. Fetches missing area values from the Sreality API, fixes URLs, and re-applies the area filter. Useful for tweaking map output after data has already been collected.
Interactive map (mapa_bytu.html)
The generated map is a standalone HTML file using Leaflet.js with CARTO basemap tiles. Features:
- Color-coded markers by disposition (3+kk = blue, 3+1 = green, 4+kk = orange, etc.)
- Heart-shaped markers for PSN and CityHome listings (developer favorites)
- Source badge in each popup (Sreality, Realingo, Bezrealitky, iDNES, PSN, CityHome)
- Client-side filters: minimum floor, maximum price, hide rejected
- Rating system (persisted in localStorage):
  - Star -- mark as favorite (enlarged marker with pulsing glow)
  - Reject -- dim the marker, optionally hide it
  - Notes -- free-text notes per listing
- 2nd floor warning -- listings on 2. NP show an orange warning in the popup
- Statistics panel -- total count, price range, average price, disposition breakdown
CLI arguments
All scrapers accept the same arguments. When run via run_all.sh, these arguments are forwarded to every scraper.
--max-pages N Maximum number of listing pages to scrape per source.
Limits the breadth of the initial listing fetch.
(For PSN: max pages per project)
--max-properties N Maximum number of properties to fetch details for per source.
Limits the depth of the detail-fetching phase.
--log-level LEVEL Logging verbosity. One of: DEBUG, INFO, WARNING, ERROR.
Default: INFO.
DEBUG shows HTTP request/response details, filter decisions
for every single listing, and cache hit/miss info.
-h, --help Show help message (run_all.sh only).
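A minimal argparse sketch of this shared CLI, matching the flags documented above (the actual parser setup in the scrapers may differ):

```python
import argparse

def build_parser():
    # Shared scraper CLI: breadth (--max-pages), depth (--max-properties), verbosity.
    p = argparse.ArgumentParser(description="Apartment scraper")
    p.add_argument("--max-pages", type=int, default=None,
                   help="Maximum listing pages per source (for PSN: per project)")
    p.add_argument("--max-properties", type=int, default=None,
                   help="Maximum properties to fetch details for per source")
    p.add_argument("--log-level", default="INFO",
                   choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                   help="Logging verbosity")
    return p
```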
Examples
# Full scrape (all pages, all properties)
./run_all.sh
# Quick validation run (1 page per source, max 10 properties each)
./run_all.sh --max-pages 1 --max-properties 10
# Full scrape with debug logging
./run_all.sh --log-level DEBUG
# Run a single scraper
python3 scrape_bezrealitky.py --max-pages 2 --max-properties 5 --log-level DEBUG
Running with Docker
The project includes a Docker setup for unattended operation with a cron-based schedule.
Container architecture
┌─────────────────────────────────────────┐
│ Container (python:3.13-alpine) │
│ │
│ PID 1: python3 -m http.server :8080 │
│ serves /app/data/ │
│ │
│ crond: runs run_all.sh at 06:00/18:00 │
│ Europe/Prague timezone │
│ │
│ /app/ -- scripts (.py, .sh) │
│ /app/data/ -- volume (JSON + HTML) │
│ ^ symlinked from /app/byty_* │
└─────────────────────────────────────────┘
On startup, the HTTP server starts immediately. The initial scrape runs in the background. Subsequent cron runs update data in-place twice daily at 06:00 and 18:00 CET/CEST.
Quick start
make run # Build image + start container on port 8080
# Map available at http://localhost:8080/mapa_bytu.html
Makefile targets
| Target | Description |
|---|---|
| make help | Show all available targets |
| make build | Build the Docker image |
| make run | Build and run the container (port 8080) |
| make stop | Stop and remove the container |
| make logs | Tail container logs |
| make scrape | Trigger a manual scrape inside the running container |
| make restart | Stop and re-run the container |
| make clean | Stop container and remove the Docker image |
| make validation | Run a limited scrape in a separate Docker container (port 8081) |
| make validation-stop | Stop the validation container |
| make validation-local | Run a limited scrape locally (1 page, 10 properties) |
| make validation-local-debug | Same as above with --log-level DEBUG |
Validation mode
Validation targets run scrapers with --max-pages 1 --max-properties 10 for a fast smoke test (~30 seconds instead of several minutes). The Docker validation target runs on port 8081 in a separate container so it doesn't interfere with production data.
Project structure
.
├── scrape_and_map.py # Sreality scraper + map generator (generate_map())
├── scrape_realingo.py # Realingo scraper
├── scrape_bezrealitky.py # Bezrealitky scraper
├── scrape_idnes.py # iDNES Reality scraper
├── scrape_psn.py # PSN scraper
├── scrape_cityhome.py # CityHome scraper
├── merge_and_map.py # Merge all sources + generate final map
├── regen_map.py # Regenerate map from cached Sreality data
├── run_all.sh # Orchestrator script (runs all scrapers + merge)
├── mapa_bytu.html # Generated interactive map (output)
├── Makefile # Docker management + validation shortcuts
├── build/
│ ├── Dockerfile # Container image definition (python:3.13-alpine)
│ ├── entrypoint.sh # Container entrypoint (HTTP server + cron + initial scrape)
│ ├── crontab # Cron schedule (06:00 and 18:00 CET)
│ └── CONTAINER.md # Container-specific documentation
└── .gitignore # Ignores byty_*.json, __pycache__, .vscode
Dependencies
None. All scrapers use only the Python standard library (urllib, json, re, argparse, logging, html.parser). The only external tool required is curl (used by scrape_psn.py for Cloudflare TLS compatibility).
The Docker image is based on python:3.13-alpine (~70 MB) with curl, bash, and tzdata added.
Caching behavior
Each scraper maintains a JSON file cache (byty_<source>.json). On each run:
- The previous JSON file is loaded and indexed by hash_id.
- For each listing found in the current run, if its hash_id exists in the cache and the price is unchanged, the cached record is reused without fetching the detail page.
- New or changed listings trigger a detail page fetch.
- The JSON file is overwritten with the fresh results at the end.
This means the first run is slow (fetches every detail page with rate-limiting delays), but subsequent runs are much faster as they only fetch details for new or changed listings.
Rate limiting
Each scraper includes polite delays between requests:
| Scraper | Delay between requests |
|---|---|
| Sreality | 0.3s (details), 0.5s (pages) |
| Realingo | 0.3s (details), 0.5s (pages) |
| Bezrealitky | 0.4s (details), 0.5s (pages) |
| iDNES | 0.4s (details), 1.0s (pages) + retry backoff (3/6/9/12s) |
| PSN | 0.5s (per project page) |
| CityHome | 0.5s (per project GPS fetch) |
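The iDNES pattern of a fixed politeness delay plus stepped retry backoff (3/6/9/12 s across 5 attempts) could be sketched as follows. The helper name and signature are hypothetical, not the project's actual function:

```python
import time
import urllib.request
import urllib.error

def fetch_with_backoff(url, attempts=5, base_delay=3.0, page_delay=1.0):
    """Fetch a URL politely: sleep page_delay before every request,
    and back off base_delay * attempt number between retries (3, 6, 9, 12 s)."""
    for attempt in range(1, attempts + 1):
        try:
            time.sleep(page_delay)                 # politeness delay between requests
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.URLError:
            if attempt == attempts:
                raise                              # out of retries
            time.sleep(base_delay * attempt)       # stepped backoff: 3, 6, 9, 12 s
```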