Maru hleda byt

Apartment search aggregator for Prague. Scrapes listings from 6 Czech real estate portals, filters them by configurable criteria, deduplicates across sources, and generates a single interactive map with all matching apartments.

Built for a specific use case: finding a 3+kk or larger apartment in Prague, excluding panel construction ("panelak") and housing estates ("sidliste"), with personal rating support.

How it works

┌─────────────────────────────────────────────────────────────┐
│                        run_all.sh                           │
│  Orchestrates all scrapers, then merges results into map    │
├─────────┬──────────┬──────────┬────────┬────────┬───────────┤
│Sreality │Realingo  │Bezreal.  │iDNES   │PSN     │CityHome   │
│ (API)   │ (HTML)   │ (HTML)   │ (HTML) │ (HTML) │ (HTML)    │
├─────────┴──────────┴──────────┴────────┴────────┴───────────┤
│              merge_and_map.py                               │
│  Loads all byty_*.json, deduplicates, generates HTML map    │
├─────────────────────────────────────────────────────────────┤
│              mapa_bytu.html                                 │
│  Interactive Leaflet.js map with filters & ratings          │
└─────────────────────────────────────────────────────────────┘

Pipeline

  1. Scraping -- Each scraper independently fetches listings from its portal, applies filters, and saves results to a JSON file (byty_<source>.json).
  2. Merging -- merge_and_map.py loads all 6 JSON files, deduplicates listings (by street name + price + area), and generates the final mapa_bytu.html.
  3. Serving -- The HTML map can be opened locally as a file, or served via Docker with a built-in HTTP server.

Execution order in run_all.sh

Scrapers run sequentially to avoid overwhelming any single portal, except PSN and CityHome, which run in parallel (they target different sites). If a scraper fails, the failure is logged but does not abort the pipeline; the remaining scrapers continue.

1. scrape_and_map.py   (Sreality)
2. scrape_realingo.py  (Realingo)
3. scrape_bezrealitky.py (Bezrealitky)
4. scrape_idnes.py     (iDNES Reality)
5. scrape_psn.py + scrape_cityhome.py  (parallel)
6. merge_and_map.py    (merge + map generation)

Scrapers

All scrapers share the same CLI interface and a consistent two-phase approach:

  1. Phase 1 -- Fetch listing pages (paginated) to get a list of all available apartments.
  2. Phase 2 -- Fetch detail pages for each listing to get floor, construction type, and other data needed for filtering.
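
The two phases can be sketched as a single loop. This is an illustrative outline, not code from the repo: `fetch_page` and `fetch_detail` are hypothetical callables standing in for each portal's real fetch/parse logic.

```python
def scrape(fetch_page, fetch_detail, max_pages=None, max_properties=None):
    """Two-phase scrape: paginated listing fetch, then per-listing details.

    fetch_page(n) returns a list of listing dicts for page n ([] when done);
    fetch_detail(listing) returns extra fields (floor, construction, ...).
    """
    # Phase 1: walk the paginated listing pages
    listings = []
    page = 1
    while max_pages is None or page <= max_pages:
        batch = fetch_page(page)
        if not batch:
            break
        listings.extend(batch)
        page += 1

    # Phase 2: fetch detail data for each listing (depth-limited)
    results = []
    for item in listings[:max_properties]:
        results.append({**item, **fetch_detail(item)})
    return results

# Stubbed example: two pages of listings, constant detail data
pages = {1: [{"id": "a"}, {"id": "b"}], 2: [{"id": "c"}]}
out = scrape(lambda n: pages.get(n, []),
             lambda item: {"floor": 3},
             max_pages=2, max_properties=2)
print(out)  # [{'id': 'a', 'floor': 3}, {'id': 'b', 'floor': 3}]
```

The `--max-pages` and `--max-properties` CLI arguments map directly onto the two loop bounds above.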

Each scraper uses a JSON file cache: if a listing's hash_id and price haven't changed since the last run, the cached data is reused and the detail page is not re-fetched. This significantly reduces runtime on subsequent runs.

Source details

| Scraper | Portal | Data source | Output file | Notes |
|---|---|---|---|---|
| scrape_and_map.py | Sreality.cz | REST API (JSON) | byty_sreality.json | Main scraper. Also contains generate_map(), used by all other scripts. |
| scrape_realingo.py | Realingo.cz | `__NEXT_DATA__` JSON in HTML | byty_realingo.json | Next.js app; data extracted from server-side props. |
| scrape_bezrealitky.py | Bezrealitky.cz | `__NEXT_DATA__` Apollo cache in HTML | byty_bezrealitky.json | Next.js app with an Apollo GraphQL cache in the page source. |
| scrape_idnes.py | Reality iDNES | HTML parsing (regex) | byty_idnes.json | Traditional HTML site. GPS extracted from dataLayer.push(). Retry logic: 5 attempts with increasing backoff. |
| scrape_psn.py | PSN.cz | RSC (React Server Components) escaped JSON in HTML | byty_psn.json | Uses curl instead of urllib due to Cloudflare SSL issues. Hardcoded list of Prague projects with GPS coordinates. |
| scrape_cityhome.py | city-home.cz | HTML table parsing (data attributes on `<tr>`) | byty_cityhome.json | CityHome/SATPO developer projects. GPS fetched from project locality pages. |
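
For the two Next.js portals, the embedded payload can be pulled out of the page source with a regex. A minimal sketch; the script-tag shape follows the standard Next.js convention, and the sample HTML below is made up for illustration:

```python
import json
import re

def extract_next_data(html):
    """Pull the JSON payload out of a Next.js __NEXT_DATA__ script tag."""
    m = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html, re.DOTALL)
    if not m:
        return None
    return json.loads(m.group(1))

html = ('<html><script id="__NEXT_DATA__" type="application/json">'
        '{"props": {"pageProps": {"listings": [{"price": 12900000}]}}}'
        '</script></html>')
data = extract_next_data(html)
print(data["props"]["pageProps"]["listings"][0]["price"])  # 12900000
```

The path inside the JSON (`props.pageProps...`) differs per portal, which is why each scraper has its own parsing code even though the extraction step is shared in spirit.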

Scraper filter criteria

All scrapers apply the same core filters (with minor per-source variations):

| Filter | Value | Notes |
|---|---|---|
| Max price | 13 500 000 CZK | PSN and CityHome use 14 000 000 CZK |
| Min area | 69 m² | |
| Min floor | 2. NP (2nd floor) | 2nd-floor apartments are included but flagged on the map |
| Dispositions | 3+kk, 3+1, 4+kk, 4+1, 5+kk, 5+1, 6+ | |
| Region | Praha | |
| Construction | Excludes panel buildings ("panelak") | |
| Location | Excludes housing estates ("sidliste") | |
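
These criteria could be expressed as a single predicate, roughly as below. The constant names and listing fields are illustrative, not the repo's actual code, and real matching would also need to handle Czech diacritics ("panelák", "sídliště"):

```python
# Illustrative constants; each scraper hard-codes its own variant.
MAX_PRICE = 13_500_000        # CZK; PSN and CityHome use 14_000_000
MIN_AREA = 69                 # m²
MIN_FLOOR = 2                 # 2. NP; included but flagged on the map
DISPOSITIONS = {"3+kk", "3+1", "4+kk", "4+1", "5+kk", "5+1", "6+"}
EXCLUDED_WORDS = ("panelak", "sidliste")  # construction/location exclusions

def passes_filters(listing, max_price=MAX_PRICE):
    """Return True if a listing dict satisfies all core criteria."""
    text = (listing.get("name", "") + " "
            + listing.get("description", "")).lower()
    return (listing["price"] <= max_price
            and listing["area"] >= MIN_AREA
            and listing["floor"] >= MIN_FLOOR
            and listing["disposition"] in DISPOSITIONS
            and not any(word in text for word in EXCLUDED_WORDS))

ok = passes_filters({"price": 12_900_000, "area": 75, "floor": 3,
                     "disposition": "3+kk", "name": "Byt 3+kk, Praha 6",
                     "description": "Cihlovy dum"})
print(ok)  # True
```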

Utility scripts

merge_and_map.py

Merges all byty_*.json files into byty_merged.json and generates mapa_bytu.html.

Deduplication logic: Two listings are considered duplicates if they share the same normalized street name + price + area. PSN and CityHome have priority during dedup (loaded first), so their listings are kept over duplicates from other portals.
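
The dedup key and priority ordering described above might look like this. A sketch only; the `street`/`price`/`area` field names and the diacritic-stripping normalization are assumptions, not code lifted from merge_and_map.py:

```python
import unicodedata

def dedup_key(listing):
    """Normalized street name + price + area identifies a duplicate."""
    street = listing["street"].lower().strip()
    # Strip diacritics so "Vinohradská" matches "Vinohradska"
    street = unicodedata.normalize("NFKD", street)
    street = "".join(c for c in street if not unicodedata.combining(c))
    return (street, listing["price"], listing["area"])

def merge(sources):
    """Merge source lists in priority order; the first occurrence wins."""
    seen, merged = set(), []
    for listings in sources:          # PSN/CityHome loaded first => kept
        for listing in listings:
            key = dedup_key(listing)
            if key not in seen:
                seen.add(key)
                merged.append(listing)
    return merged

psn = [{"street": "Vinohradská", "price": 13_900_000, "area": 80, "src": "psn"}]
sre = [{"street": "Vinohradska", "price": 13_900_000, "area": 80,
        "src": "sreality"}]
print([l["src"] for l in merge([psn, sre])])  # ['psn']
```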

regen_map.py

Regenerates the map from existing byty_sreality.json data without re-scraping. Fetches missing area values from the Sreality API, fixes URLs, and re-applies the area filter. Useful for tweaking map output after data has already been collected.

Interactive map (mapa_bytu.html)

The generated map is a standalone HTML file using Leaflet.js with CARTO basemap tiles. Features:

  • Color-coded markers by disposition (3+kk = blue, 3+1 = green, 4+kk = orange, etc.)
  • Heart-shaped markers for PSN and CityHome listings (developer favorites)
  • Source badge in each popup (Sreality, Realingo, Bezrealitky, iDNES, PSN, CityHome)
  • Client-side filters: minimum floor, maximum price, hide rejected
  • Rating system (persisted in localStorage):
    • Star -- mark as favorite (enlarged marker with pulsing glow)
    • Reject -- dim the marker, optionally hide it
    • Notes -- free-text notes per listing
  • 2nd floor warning -- listings on 2. NP show an orange warning in the popup
  • Statistics panel -- total count, price range, average price, disposition breakdown

CLI arguments

All scrapers accept the same arguments. When run via run_all.sh, these arguments are forwarded to every scraper.

--max-pages N         Maximum number of listing pages to scrape per source.
                      Limits the breadth of the initial listing fetch.
                      (For PSN: max pages per project)

--max-properties N    Maximum number of properties to fetch details for per source.
                      Limits the depth of the detail-fetching phase.

--log-level LEVEL     Logging verbosity. One of: DEBUG, INFO, WARNING, ERROR.
                      Default: INFO.
                      DEBUG shows HTTP request/response details, filter decisions
                      for every single listing, and cache hit/miss info.

-h, --help            Show help message (run_all.sh only).
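
The shared argument set could be declared once with argparse, roughly as follows (a sketch of the common interface; in the repo each scraper defines its own parser):

```python
import argparse

def build_parser():
    """Shared CLI for all scrapers."""
    p = argparse.ArgumentParser(description="Apartment scraper")
    p.add_argument("--max-pages", type=int, default=None,
                   help="max listing pages to fetch per source")
    p.add_argument("--max-properties", type=int, default=None,
                   help="max detail pages to fetch per source")
    p.add_argument("--log-level", default="INFO",
                   choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                   help="logging verbosity")
    return p

args = build_parser().parse_args(["--max-pages", "1", "--log-level", "DEBUG"])
print(args.max_pages, args.log_level)  # 1 DEBUG
```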

Examples

# Full scrape (all pages, all properties)
./run_all.sh

# Quick validation run (1 page per source, max 10 properties each)
./run_all.sh --max-pages 1 --max-properties 10

# Full scrape with debug logging
./run_all.sh --log-level DEBUG

# Run a single scraper
python3 scrape_bezrealitky.py --max-pages 2 --max-properties 5 --log-level DEBUG

Running with Docker

The project includes a Docker setup for unattended operation with a cron-based schedule.

Container architecture

┌─────────────────────────────────────────┐
│  Container (python:3.13-alpine)         │
│                                         │
│  PID 1: python3 -m http.server :8080    │
│         serves /app/data/               │
│                                         │
│  crond:  runs run_all.sh at 06:00/18:00 │
│          Europe/Prague timezone         │
│                                         │
│  /app/        -- scripts (.py, .sh)     │
│  /app/data/   -- volume (JSON + HTML)   │
│         ^ symlinked from /app/byty_*    │
└─────────────────────────────────────────┘

On startup, the HTTP server starts immediately. The initial scrape runs in the background. Subsequent cron runs update data in-place twice daily at 06:00 and 18:00 CET/CEST.

Quick start

make run         # Build image + start container on port 8080
# Map available at http://localhost:8080/mapa_bytu.html

Makefile targets

| Target | Description |
|---|---|
| make help | Show all available targets |
| make build | Build the Docker image |
| make run | Build and run the container (port 8080) |
| make stop | Stop and remove the container |
| make logs | Tail container logs |
| make scrape | Trigger a manual scrape inside the running container |
| make restart | Stop and re-run the container |
| make clean | Stop the container and remove the Docker image |
| make validation | Run a limited scrape in a separate Docker container (port 8081) |
| make validation-stop | Stop the validation container |
| make validation-local | Run a limited scrape locally (1 page, 10 properties) |
| make validation-local-debug | Same as above with --log-level DEBUG |

Validation mode

Validation targets run scrapers with --max-pages 1 --max-properties 10 for a fast smoke test (~30 seconds instead of several minutes). The Docker validation target runs on port 8081 in a separate container so it doesn't interfere with production data.

Project structure

.
├── scrape_and_map.py       # Sreality scraper + map generator (generate_map())
├── scrape_realingo.py      # Realingo scraper
├── scrape_bezrealitky.py   # Bezrealitky scraper
├── scrape_idnes.py         # iDNES Reality scraper
├── scrape_psn.py           # PSN scraper
├── scrape_cityhome.py      # CityHome scraper
├── merge_and_map.py        # Merge all sources + generate final map
├── regen_map.py            # Regenerate map from cached Sreality data
├── run_all.sh              # Orchestrator script (runs all scrapers + merge)
├── mapa_bytu.html          # Generated interactive map (output)
├── Makefile                # Docker management + validation shortcuts
├── build/
│   ├── Dockerfile          # Container image definition (python:3.13-alpine)
│   ├── entrypoint.sh       # Container entrypoint (HTTP server + cron + initial scrape)
│   ├── crontab             # Cron schedule (06:00 and 18:00 CET)
│   └── CONTAINER.md        # Container-specific documentation
└── .gitignore              # Ignores byty_*.json, __pycache__, .vscode

Dependencies

None. All scrapers use only the Python standard library (urllib, json, re, argparse, logging, html.parser). The only external tool required is curl (used by scrape_psn.py for Cloudflare TLS compatibility).

The Docker image is based on python:3.13-alpine (~70 MB) with curl, bash, and tzdata added.

Caching behavior

Each scraper maintains a JSON file cache (byty_<source>.json). On each run:

  1. The previous JSON file is loaded and indexed by hash_id.
  2. For each listing found in the current run, if the hash_id exists in cache and the price is unchanged, the cached record is reused without fetching the detail page.
  3. New or changed listings trigger a detail page fetch.
  4. The JSON file is overwritten with the fresh results at the end.

This means the first run is slow (fetches every detail page with rate-limiting delays), but subsequent runs are much faster as they only fetch details for new or changed listings.
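
The hit/miss decision can be sketched in a few lines. Field names match the README's description; `load_cache` and `resolve` are hypothetical helpers, not functions from the repo:

```python
import json
import os

def load_cache(path):
    """Index the previous run's JSON output by hash_id (empty if absent)."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return {item["hash_id"]: item for item in json.load(f)}

def resolve(listing, cache, fetch_detail):
    """Reuse the cached record when hash_id and price are unchanged."""
    cached = cache.get(listing["hash_id"])
    if cached is not None and cached["price"] == listing["price"]:
        return cached                                # hit: skip detail fetch
    return {**listing, **fetch_detail(listing)}      # miss: fetch detail page

cache = {"abc": {"hash_id": "abc", "price": 12_000_000, "floor": 4}}
hit = resolve({"hash_id": "abc", "price": 12_000_000}, cache, lambda l: {})
miss = resolve({"hash_id": "abc", "price": 11_500_000}, cache,
               lambda l: {"floor": 4})
print(hit["floor"], miss["price"])  # 4 11500000
```

A price change invalidates the cache entry because it usually means the listing was updated, so the detail page is worth re-fetching.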

Rate limiting

Each scraper includes polite delays between requests:

| Scraper | Delay between requests |
|---|---|
| Sreality | 0.3s (details), 0.5s (pages) |
| Realingo | 0.3s (details), 0.5s (pages) |
| Bezrealitky | 0.4s (details), 0.5s (pages) |
| iDNES | 0.4s (details), 1.0s (pages), plus retry backoff (3/6/9/12s) |
| PSN | 0.5s (per project page) |
| CityHome | 0.5s (per project GPS fetch) |
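
The iDNES-style retry schedule (3/6/9/12s) combined with a fixed inter-request pause could be sketched as below. Function names and the injected `fetch` callable are illustrative, not the repo's actual code:

```python
import time

def fetch_with_backoff(fetch, url, attempts=5, delays=(3, 6, 9, 12)):
    """Retry `fetch` up to `attempts` times, sleeping 3/6/9/12s between tries."""
    for i in range(attempts):
        try:
            return fetch(url)
        except OSError:
            if i == attempts - 1:
                raise                # out of attempts: propagate the error
            time.sleep(delays[min(i, len(delays) - 1)])

def polite_fetch_all(fetch, urls, delay=0.4):
    """Fetch a batch of detail pages with a fixed pause between requests."""
    results = []
    for url in urls:
        results.append(fetch_with_backoff(fetch, url))
        time.sleep(delay)
    return results

print(polite_fetch_all(lambda u: u.upper(), ["a", "b"], delay=0.0))  # ['A', 'B']
```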