Maru hleda byt

Apartment search aggregator for Prague. Scrapes listings from 6 Czech real estate portals, filters them by configurable criteria, deduplicates across sources, and generates a single interactive map with all matching apartments.

Built for a specific use case: finding a 3+kk or larger apartment in Prague, excluding panel construction ("panelak") and housing estates ("sidliste"), with personal rating support.

How it works

┌─────────────────────────────────────────────────────────────┐
│                        run_all.sh                           │
│  Orchestrates all scrapers, then merges results into map    │
├─────────┬──────────┬──────────┬────────┬────────┬───────────┤
│Sreality │Realingo  │Bezreal.  │iDNES   │PSN     │CityHome   │
│ (API)   │ (HTML)   │ (HTML)   │ (HTML) │ (HTML) │ (HTML)    │
├─────────┴──────────┴──────────┴────────┴────────┴───────────┤
│              merge_and_map.py                               │
│  Loads all byty_*.json, deduplicates, generates HTML map    │
├─────────────────────────────────────────────────────────────┤
│              mapa_bytu.html                                 │
│  Interactive Leaflet.js map with filters & ratings          │
└─────────────────────────────────────────────────────────────┘

Pipeline

  1. Scraping -- Each scraper independently fetches listings from its portal, applies filters, and saves results to a JSON file (byty_<source>.json).
  2. Merging -- merge_and_map.py loads all 6 JSON files, deduplicates listings (by street name + price + area), and generates the final mapa_bytu.html.
  3. Serving -- The HTML map can be opened locally as a file, or served via Docker with a built-in HTTP server.

Execution order in run_all.sh

Scrapers run sequentially to avoid overwhelming any single portal, except PSN and CityHome, which run in parallel (they target different sites). If a scraper fails, the failure is logged but does not abort the pipeline; the remaining scrapers continue.

1. scrape_and_map.py   (Sreality)
2. scrape_realingo.py  (Realingo)
3. scrape_bezrealitky.py (Bezrealitky)
4. scrape_idnes.py     (iDNES Reality)
5. scrape_psn.py + scrape_cityhome.py  (parallel)
6. merge_and_map.py    (merge + map generation)

Scrapers

All scrapers share the same CLI interface and a consistent two-phase approach:

  1. Phase 1 -- Fetch listing pages (paginated) to get a list of all available apartments.
  2. Phase 2 -- Fetch detail pages for each listing to get floor, construction type, and other data needed for filtering.
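
The two phases can be sketched as a single loop. This is an illustrative outline, not code from the repo: `fetch_page` and `fetch_detail` are hypothetical callables standing in for each portal's real fetch/parse logic.

```python
def scrape(fetch_page, fetch_detail, max_pages=None, max_properties=None):
    """Two-phase scrape: paginated listing fetch, then per-listing details.

    fetch_page(n) returns a list of listing dicts for page n ([] when done);
    fetch_detail(listing) returns extra fields (floor, construction, ...).
    """
    # Phase 1: walk the paginated listing pages
    listings = []
    page = 1
    while max_pages is None or page <= max_pages:
        batch = fetch_page(page)
        if not batch:
            break
        listings.extend(batch)
        page += 1

    # Phase 2: fetch detail data for each listing (depth-limited)
    results = []
    for item in listings[:max_properties]:
        results.append({**item, **fetch_detail(item)})
    return results

# Stubbed example: two pages of listings, constant detail data
pages = {1: [{"id": "a"}, {"id": "b"}], 2: [{"id": "c"}]}
out = scrape(lambda n: pages.get(n, []),
             lambda item: {"floor": 3},
             max_pages=2, max_properties=2)
print(out)  # [{'id': 'a', 'floor': 3}, {'id': 'b', 'floor': 3}]
```

The `--max-pages` and `--max-properties` CLI arguments map directly onto the two loop bounds above.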

Each scraper uses a JSON file cache: if a listing's hash_id and price haven't changed since the last run, the cached data is reused and the detail page is not re-fetched. This significantly reduces runtime on subsequent runs.

Source details

| Scraper | Portal | Data source | Output file | Notes |
|---|---|---|---|---|
| scrape_and_map.py | Sreality.cz | REST API (JSON) | byty_sreality.json | Main scraper. Also contains generate_map(), used by all other scripts. |
| scrape_realingo.py | Realingo.cz | `__NEXT_DATA__` JSON in HTML | byty_realingo.json | Next.js app; data extracted from server-side props. |
| scrape_bezrealitky.py | Bezrealitky.cz | `__NEXT_DATA__` Apollo cache in HTML | byty_bezrealitky.json | Next.js app with an Apollo GraphQL cache in the page source. |
| scrape_idnes.py | Reality iDNES | HTML parsing (regex) | byty_idnes.json | Traditional HTML site. GPS extracted from dataLayer.push(). Retry logic: 5 attempts with increasing backoff. |
| scrape_psn.py | PSN.cz | RSC (React Server Components) escaped JSON in HTML | byty_psn.json | Uses curl instead of urllib due to Cloudflare SSL issues. Hardcoded list of Prague projects with GPS coordinates. |
| scrape_cityhome.py | city-home.cz | HTML table parsing (data attributes on `<tr>`) | byty_cityhome.json | CityHome/SATPO developer projects. GPS fetched from project locality pages. |
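
For the two Next.js portals, the embedded payload can be pulled out of the page source with a regex. A minimal sketch; the script-tag shape follows the standard Next.js convention, and the sample HTML below is made up for illustration:

```python
import json
import re

def extract_next_data(html):
    """Pull the JSON payload out of a Next.js __NEXT_DATA__ script tag."""
    m = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html, re.DOTALL)
    if not m:
        return None
    return json.loads(m.group(1))

html = ('<html><script id="__NEXT_DATA__" type="application/json">'
        '{"props": {"pageProps": {"listings": [{"price": 12900000}]}}}'
        '</script></html>')
data = extract_next_data(html)
print(data["props"]["pageProps"]["listings"][0]["price"])  # 12900000
```

The path inside the JSON (`props.pageProps...`) differs per portal, which is why each scraper has its own parsing code even though the extraction step is shared in spirit.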

Scraper filter criteria

All scrapers apply the same core filters (with minor per-source variations):

| Filter | Value | Notes |
|---|---|---|
| Max price | 13 500 000 CZK | PSN and CityHome use 14 000 000 CZK |
| Min area | 69 m² | |
| Min floor | 2. NP (2nd floor) | 2nd-floor apartments are included but flagged on the map |
| Dispositions | 3+kk, 3+1, 4+kk, 4+1, 5+kk, 5+1, 6+ | |
| Region | Praha | |
| Construction | Excludes panel buildings ("panelak") | |
| Location | Excludes housing estates ("sidliste") | |
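
These criteria could be expressed as a single predicate, roughly as below. The constant names and listing fields are illustrative, not the repo's actual code, and real matching would also need to handle Czech diacritics ("panelák", "sídliště"):

```python
# Illustrative constants; each scraper hard-codes its own variant.
MAX_PRICE = 13_500_000        # CZK; PSN and CityHome use 14_000_000
MIN_AREA = 69                 # m²
MIN_FLOOR = 2                 # 2. NP; included but flagged on the map
DISPOSITIONS = {"3+kk", "3+1", "4+kk", "4+1", "5+kk", "5+1", "6+"}
EXCLUDED_WORDS = ("panelak", "sidliste")  # construction/location exclusions

def passes_filters(listing, max_price=MAX_PRICE):
    """Return True if a listing dict satisfies all core criteria."""
    text = (listing.get("name", "") + " "
            + listing.get("description", "")).lower()
    return (listing["price"] <= max_price
            and listing["area"] >= MIN_AREA
            and listing["floor"] >= MIN_FLOOR
            and listing["disposition"] in DISPOSITIONS
            and not any(word in text for word in EXCLUDED_WORDS))

ok = passes_filters({"price": 12_900_000, "area": 75, "floor": 3,
                     "disposition": "3+kk", "name": "Byt 3+kk, Praha 6",
                     "description": "Cihlovy dum"})
print(ok)  # True
```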

Utility scripts

merge_and_map.py

Merges all byty_*.json files into byty_merged.json and generates mapa_bytu.html.

Deduplication logic: Two listings are considered duplicates if they share the same normalized street name + price + area. PSN and CityHome have priority during dedup (loaded first), so their listings are kept over duplicates from other portals.
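
The dedup key and priority ordering described above might look like this. A sketch only; the `street`/`price`/`area` field names and the diacritic-stripping normalization are assumptions, not code lifted from merge_and_map.py:

```python
import unicodedata

def dedup_key(listing):
    """Normalized street name + price + area identifies a duplicate."""
    street = listing["street"].lower().strip()
    # Strip diacritics so "Vinohradská" matches "Vinohradska"
    street = unicodedata.normalize("NFKD", street)
    street = "".join(c for c in street if not unicodedata.combining(c))
    return (street, listing["price"], listing["area"])

def merge(sources):
    """Merge source lists in priority order; the first occurrence wins."""
    seen, merged = set(), []
    for listings in sources:          # PSN/CityHome loaded first => kept
        for listing in listings:
            key = dedup_key(listing)
            if key not in seen:
                seen.add(key)
                merged.append(listing)
    return merged

psn = [{"street": "Vinohradská", "price": 13_900_000, "area": 80, "src": "psn"}]
sre = [{"street": "Vinohradska", "price": 13_900_000, "area": 80,
        "src": "sreality"}]
print([l["src"] for l in merge([psn, sre])])  # ['psn']
```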

regen_map.py

Regenerates the map from existing byty_sreality.json data without re-scraping. Fetches missing area values from the Sreality API, fixes URLs, and re-applies the area filter. Useful for tweaking map output after data has already been collected.

Interactive map (mapa_bytu.html)

The generated map is a standalone HTML file using Leaflet.js with CARTO basemap tiles. Features:

  • Color-coded markers by disposition (3+kk = blue, 3+1 = green, 4+kk = orange, etc.)
  • Heart-shaped markers for PSN and CityHome listings (developer favorites)
  • Source badge in each popup (Sreality, Realingo, Bezrealitky, iDNES, PSN, CityHome)
  • Client-side filters: minimum floor, maximum price, hide rejected
  • Rating system (persisted in localStorage):
    • Star -- mark as favorite (enlarged marker with pulsing glow)
    • Reject -- dim the marker, optionally hide it
    • Notes -- free-text notes per listing
  • 2nd floor warning -- listings on 2. NP show an orange warning in the popup
  • Statistics panel -- total count, price range, average price, disposition breakdown

CLI arguments

All scrapers accept the same arguments. When run via run_all.sh, these arguments are forwarded to every scraper.

--max-pages N         Maximum number of listing pages to scrape per source.
                      Limits the breadth of the initial listing fetch.
                      (For PSN: max pages per project)

--max-properties N    Maximum number of properties to fetch details for per source.
                      Limits the depth of the detail-fetching phase.

--log-level LEVEL     Logging verbosity. One of: DEBUG, INFO, WARNING, ERROR.
                      Default: INFO.
                      DEBUG shows HTTP request/response details, filter decisions
                      for every single listing, and cache hit/miss info.

-h, --help            Show help message (run_all.sh only).
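
The shared argument set could be declared once with argparse, roughly as follows (a sketch of the common interface; in the repo each scraper defines its own parser):

```python
import argparse

def build_parser():
    """Shared CLI for all scrapers."""
    p = argparse.ArgumentParser(description="Apartment scraper")
    p.add_argument("--max-pages", type=int, default=None,
                   help="max listing pages to fetch per source")
    p.add_argument("--max-properties", type=int, default=None,
                   help="max detail pages to fetch per source")
    p.add_argument("--log-level", default="INFO",
                   choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                   help="logging verbosity")
    return p

args = build_parser().parse_args(["--max-pages", "1", "--log-level", "DEBUG"])
print(args.max_pages, args.log_level)  # 1 DEBUG
```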

Examples

# Full scrape (all pages, all properties)
./run_all.sh

# Quick validation run (1 page per source, max 10 properties each)
./run_all.sh --max-pages 1 --max-properties 10

# Full scrape with debug logging
./run_all.sh --log-level DEBUG

# Run a single scraper
python3 scrape_bezrealitky.py --max-pages 2 --max-properties 5 --log-level DEBUG

Running with Docker

The project includes a Docker setup for unattended operation with a cron-based schedule.

Container architecture

┌─────────────────────────────────────────┐
│  Container (python:3.13-alpine)         │
│                                         │
│  PID 1: python3 -m http.server :8080    │
│         serves /app/data/               │
│                                         │
│  crond:  runs run_all.sh at 06:00/18:00 │
│          Europe/Prague timezone         │
│                                         │
│  /app/        -- scripts (.py, .sh)     │
│  /app/data/   -- volume (JSON + HTML)   │
│         ^ symlinked from /app/byty_*    │
└─────────────────────────────────────────┘

On startup, the HTTP server starts immediately. The initial scrape runs in the background. Subsequent cron runs update data in-place twice daily at 06:00 and 18:00 CET/CEST.

Quick start

make run         # Build image + start container on port 8080
# Map available at http://localhost:8080/mapa_bytu.html

Makefile targets

| Target | Description |
|---|---|
| make help | Show all available targets |
| make build | Build the Docker image |
| make run | Build and run the container (port 8080) |
| make stop | Stop and remove the container |
| make logs | Tail container logs |
| make scrape | Trigger a manual scrape inside the running container |
| make restart | Stop and re-run the container |
| make clean | Stop the container and remove the Docker image |
| make validation | Run a limited scrape in a separate Docker container (port 8081) |
| make validation-stop | Stop the validation container |
| make validation-local | Run a limited scrape locally (1 page, 10 properties) |
| make validation-local-debug | Same as above with --log-level DEBUG |

Validation mode

Validation targets run scrapers with --max-pages 1 --max-properties 10 for a fast smoke test (~30 seconds instead of several minutes). The Docker validation target runs on port 8081 in a separate container so it doesn't interfere with production data.

Project structure

.
├── scrape_and_map.py       # Sreality scraper + map generator (generate_map())
├── scrape_realingo.py      # Realingo scraper
├── scrape_bezrealitky.py   # Bezrealitky scraper
├── scrape_idnes.py         # iDNES Reality scraper
├── scrape_psn.py           # PSN scraper
├── scrape_cityhome.py      # CityHome scraper
├── merge_and_map.py        # Merge all sources + generate final map
├── regen_map.py            # Regenerate map from cached Sreality data
├── run_all.sh              # Orchestrator script (runs all scrapers + merge)
├── mapa_bytu.html          # Generated interactive map (output)
├── Makefile                # Docker management + validation shortcuts
├── build/
│   ├── Dockerfile          # Container image definition (python:3.13-alpine)
│   ├── entrypoint.sh       # Container entrypoint (HTTP server + cron + initial scrape)
│   ├── crontab             # Cron schedule (06:00 and 18:00 CET)
│   └── CONTAINER.md        # Container-specific documentation
└── .gitignore              # Ignores byty_*.json, __pycache__, .vscode

Dependencies

None. All scrapers use only the Python standard library (urllib, json, re, argparse, logging, html.parser). The only external tool required is curl (used by scrape_psn.py for Cloudflare TLS compatibility).

The Docker image is based on python:3.13-alpine (~70 MB) with curl, bash, and tzdata added.

Caching behavior

Each scraper maintains a JSON file cache (byty_<source>.json). On each run:

  1. The previous JSON file is loaded and indexed by hash_id.
  2. For each listing found in the current run, if the hash_id exists in cache and the price is unchanged, the cached record is reused without fetching the detail page.
  3. New or changed listings trigger a detail page fetch.
  4. The JSON file is overwritten with the fresh results at the end.

This means the first run is slow (fetches every detail page with rate-limiting delays), but subsequent runs are much faster as they only fetch details for new or changed listings.
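
The hit/miss decision can be sketched in a few lines. Field names match the README's description; `load_cache` and `resolve` are hypothetical helpers, not functions from the repo:

```python
import json
import os

def load_cache(path):
    """Index the previous run's JSON output by hash_id (empty if absent)."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return {item["hash_id"]: item for item in json.load(f)}

def resolve(listing, cache, fetch_detail):
    """Reuse the cached record when hash_id and price are unchanged."""
    cached = cache.get(listing["hash_id"])
    if cached is not None and cached["price"] == listing["price"]:
        return cached                                # hit: skip detail fetch
    return {**listing, **fetch_detail(listing)}      # miss: fetch detail page

cache = {"abc": {"hash_id": "abc", "price": 12_000_000, "floor": 4}}
hit = resolve({"hash_id": "abc", "price": 12_000_000}, cache, lambda l: {})
miss = resolve({"hash_id": "abc", "price": 11_500_000}, cache,
               lambda l: {"floor": 4})
print(hit["floor"], miss["price"])  # 4 11500000
```

A price change invalidates the cache entry because it usually means the listing was updated, so the detail page is worth re-fetching.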

Rate limiting

Each scraper includes polite delays between requests:

| Scraper | Delay between requests |
|---|---|
| Sreality | 0.3s (details), 0.5s (pages) |
| Realingo | 0.3s (details), 0.5s (pages) |
| Bezrealitky | 0.4s (details), 0.5s (pages) |
| iDNES | 0.4s (details), 1.0s (pages), plus retry backoff (3/6/9/12s) |
| PSN | 0.5s (per project page) |
| CityHome | 0.5s (per project GPS fetch) |
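
The iDNES-style retry schedule (3/6/9/12s) combined with a fixed inter-request pause could be sketched as below. Function names and the injected `fetch` callable are illustrative, not the repo's actual code:

```python
import time

def fetch_with_backoff(fetch, url, attempts=5, delays=(3, 6, 9, 12)):
    """Retry `fetch` up to `attempts` times, sleeping 3/6/9/12s between tries."""
    for i in range(attempts):
        try:
            return fetch(url)
        except OSError:
            if i == attempts - 1:
                raise                # out of attempts: propagate the error
            time.sleep(delays[min(i, len(delays) - 1)])

def polite_fetch_all(fetch, urls, delay=0.4):
    """Fetch a batch of detail pages with a fixed pause between requests."""
    results = []
    for url in urls:
        results.append(fetch_with_backoff(fetch, url))
        time.sleep(delay)
    return results

print(polite_fetch_all(lambda u: u.upper(), ["a", "b"], delay=0.0))  # ['A', 'B']
```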