Add comprehensive project documentation
Cover the full pipeline (scrapers, merge, map generation), all 6 data sources with their parsing methods, filter criteria, CLI arguments, Docker setup, caching, rate limiting, and project structure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Maru hleda byt

Apartment search aggregator for Prague. Scrapes listings from 6 Czech real estate portals, filters them by configurable criteria, deduplicates across sources, and generates a single interactive map with all matching apartments.

Built for a specific use case: finding a 3+kk or larger apartment in Prague, excluding panel construction ("panelak") and housing estates ("sidliste"), with personal rating support.

## How it works

```
┌─────────────────────────────────────────────────────────────┐
│                         run_all.sh                          │
│   Orchestrates all scrapers, then merges results into map   │
├─────────┬──────────┬──────────┬────────┬────────┬───────────┤
│Sreality │Realingo  │Bezreal.  │iDNES   │PSN     │CityHome   │
│ (API)   │ (HTML)   │ (HTML)   │ (HTML) │ (HTML) │ (HTML)    │
├─────────┴──────────┴──────────┴────────┴────────┴───────────┤
│                      merge_and_map.py                       │
│   Loads all byty_*.json, deduplicates, generates HTML map   │
├─────────────────────────────────────────────────────────────┤
│                       mapa_bytu.html                        │
│      Interactive Leaflet.js map with filters & ratings      │
└─────────────────────────────────────────────────────────────┘
```

### Pipeline

1. **Scraping** -- Each scraper independently fetches listings from its portal, applies filters, and saves results to a JSON file (`byty_<source>.json`).
2. **Merging** -- `merge_and_map.py` loads all 6 JSON files, deduplicates listings (by street name + price + area), and generates the final `mapa_bytu.html`.
3. **Serving** -- The HTML map can be opened locally as a file, or served via Docker with a built-in HTTP server.

### Execution order in `run_all.sh`

Scrapers run sequentially (to avoid overwhelming any single portal), except PSN and CityHome, which run in parallel since they target different sites. If a scraper fails, the failure is logged but does not abort the pipeline -- the remaining scrapers continue.

```
1. scrape_and_map.py     (Sreality)
2. scrape_realingo.py    (Realingo)
3. scrape_bezrealitky.py (Bezrealitky)
4. scrape_idnes.py       (iDNES Reality)
5. scrape_psn.py + scrape_cityhome.py (parallel)
6. merge_and_map.py      (merge + map generation)
```
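
The failure-tolerant control flow above can be sketched in Python (the real orchestrator is `run_all.sh`; the `run()` helper and use of `subprocess` here are illustrative, not the actual script):

```python
# Illustrative sketch of run_all.sh's orchestration: sequential scrapers
# whose failures are logged but non-fatal, one parallel pair, then the
# merge step. Script names match the repo; run() is a hypothetical helper.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SEQUENTIAL = ["scrape_and_map.py", "scrape_realingo.py",
              "scrape_bezrealitky.py", "scrape_idnes.py"]
PARALLEL = ["scrape_psn.py", "scrape_cityhome.py"]

def run(script, runner=subprocess.run):
    """Run one scraper; log a failure instead of aborting the pipeline."""
    result = runner(["python3", script])
    if result.returncode != 0:
        print(f"WARN: {script} exited {result.returncode}, continuing")
    return result.returncode

def main():
    for script in SEQUENTIAL:           # one portal at a time
        run(script)
    with ThreadPoolExecutor(max_workers=2) as pool:
        list(pool.map(run, PARALLEL))   # different sites, safe in parallel
    run("merge_and_map.py")             # merge + map generation

if __name__ == "__main__":
    main()
```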

## Scrapers

All scrapers share the same CLI interface and a consistent two-phase approach:

1. **Phase 1** -- Fetch listing pages (paginated) to get a list of all available apartments.
2. **Phase 2** -- Fetch detail pages for each listing to get floor, construction type, and other data needed for filtering.

Each scraper uses a **JSON file cache**: if a listing's `hash_id` and `price` haven't changed since the last run, the cached data is reused and the detail page is not re-fetched. This significantly reduces runtime on subsequent runs.

### Source details

| Scraper | Portal | Data source | Output file | Notes |
|---------|--------|-------------|-------------|-------|
| `scrape_and_map.py` | [Sreality.cz](https://sreality.cz) | REST API (JSON) | `byty_sreality.json` | Main scraper. Also contains `generate_map()`, used by all other scripts. |
| `scrape_realingo.py` | [Realingo.cz](https://realingo.cz) | `__NEXT_DATA__` JSON in HTML | `byty_realingo.json` | Next.js app; data extracted from server-side props. |
| `scrape_bezrealitky.py` | [Bezrealitky.cz](https://bezrealitky.cz) | `__NEXT_DATA__` Apollo cache in HTML | `byty_bezrealitky.json` | Next.js app with Apollo GraphQL cache in the page source. |
| `scrape_idnes.py` | [Reality iDNES](https://reality.idnes.cz) | HTML parsing (regex) | `byty_idnes.json` | Traditional HTML site. GPS extracted from `dataLayer.push()`. Retry logic with 5 attempts and increasing (3/6/9/12 s) backoff. |
| `scrape_psn.py` | [PSN.cz](https://psn.cz) | RSC (React Server Components) escaped JSON in HTML | `byty_psn.json` | Uses `curl` instead of `urllib` due to Cloudflare SSL issues. Hardcoded list of Prague projects with GPS coordinates. |
| `scrape_cityhome.py` | [city-home.cz](https://city-home.cz) | HTML table parsing (data attributes on `<tr>`) | `byty_cityhome.json` | CityHome/SATPO developer projects. GPS fetched from project locality pages. |
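
For the two Next.js portals, the `__NEXT_DATA__` technique can be sketched as follows. The regex, the `extract_next_data` helper, and the sample payload are illustrative assumptions, not the scrapers' actual code:

```python
# Sketch of __NEXT_DATA__ extraction as used by the Realingo and
# Bezrealitky scrapers: Next.js embeds its server-side props as a JSON
# <script> tag, so the listing data can be read without parsing markup.
import json
import re

NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.DOTALL,
)

def extract_next_data(html: str) -> dict:
    match = NEXT_DATA_RE.search(html)
    if not match:
        raise ValueError("__NEXT_DATA__ not found")
    return json.loads(match.group(1))

# Hypothetical page snippet with a hypothetical props layout:
sample = ('<html><script id="__NEXT_DATA__" type="application/json">'
          '{"props": {"pageProps": {"listings": [{"price": 12900000}]}}}'
          '</script></html>')
data = extract_next_data(sample)
```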

### Scraper filter criteria

All scrapers apply the same core filters (with minor per-source variations):

| Filter | Value | Notes |
|--------|-------|-------|
| **Max price** | 13 500 000 CZK | PSN and CityHome use 14 000 000 CZK |
| **Min area** | 69 m^2 | |
| **Min floor** | 2. NP (2nd floor) | 2nd-floor apartments are included but flagged on the map |
| **Dispositions** | 3+kk, 3+1, 4+kk, 4+1, 5+kk, 5+1, 6+ | |
| **Region** | Praha | |
| **Construction** | Excludes panel ("panelak") | |
| **Location** | Excludes housing estates ("sidliste") | |
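
Combined, the core filters amount to a predicate like the sketch below. The field names (`price`, `area_m2`, `disposition`, `description`) and the keyword checks are assumptions for illustration; each scraper applies its own per-source variations (e.g. the higher PSN/CityHome price cap):

```python
# Illustrative filter predicate mirroring the table above.
ALLOWED_DISPOSITIONS = {"3+kk", "3+1", "4+kk", "4+1", "5+kk", "5+1", "6+"}

def passes_filters(listing: dict, max_price: int = 13_500_000) -> bool:
    if listing["price"] > max_price:          # 13.5M CZK (14M for PSN/CityHome)
        return False
    if listing["area_m2"] < 69:               # minimum area
        return False
    if listing["disposition"] not in ALLOWED_DISPOSITIONS:
        return False
    text = listing.get("description", "").lower()
    if "panel" in text or "sidlist" in text:  # panelak / sidliste exclusions
        return False
    return True
```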

## Utility scripts

### `merge_and_map.py`

Merges all `byty_*.json` files into `byty_merged.json` and generates `mapa_bytu.html`.

**Deduplication logic:** Two listings are considered duplicates if they share the same normalized street name + price + area. PSN and CityHome have priority during dedup (loaded first), so their listings are kept over duplicates from other portals.
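
That dedup key can be sketched like this. `normalize()` is a hypothetical stand-in for the real street-name normalization; loading PSN/CityHome first means their record wins any tie because later duplicates are skipped:

```python
# Sketch of first-wins deduplication on (normalized street, price, area).
import unicodedata

def normalize(street: str) -> str:
    # Strip diacritics and case, e.g. "Vinohradská" -> "vinohradska".
    ascii_form = unicodedata.normalize("NFKD", street).encode("ascii", "ignore")
    return ascii_form.decode().strip().lower()

def dedup(listings):
    seen, unique = set(), []
    for item in listings:  # sources assumed loaded in priority order
        key = (normalize(item["street"]), item["price"], item["area_m2"])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```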

### `regen_map.py`

Regenerates the map from existing `byty_sreality.json` data without re-scraping. Fetches missing area values from the Sreality API, fixes URLs, and re-applies the area filter. Useful for tweaking map output after data has already been collected.

## Interactive map (`mapa_bytu.html`)

The generated map is a standalone HTML file using Leaflet.js with CARTO basemap tiles. Features:

- **Color-coded markers** by disposition (3+kk = blue, 3+1 = green, 4+kk = orange, etc.)
- **Heart-shaped markers** for PSN and CityHome listings (developer favorites)
- **Source badge** in each popup (Sreality, Realingo, Bezrealitky, iDNES, PSN, CityHome)
- **Client-side filters:** minimum floor, maximum price, hide rejected
- **Rating system** (persisted in `localStorage`):
  - Star -- mark as favorite (enlarged marker with pulsing glow)
  - Reject -- dim the marker, optionally hide it
  - Notes -- free-text notes per listing
- **2nd floor warning** -- listings on 2. NP show an orange warning in the popup
- **Statistics panel** -- total count, price range, average price, disposition breakdown

## CLI arguments

All scrapers accept the same arguments. When run via `run_all.sh`, these arguments are forwarded to every scraper.

```
--max-pages N        Maximum number of listing pages to scrape per source.
                     Limits the breadth of the initial listing fetch.
                     (For PSN: max pages per project.)

--max-properties N   Maximum number of properties to fetch details for per source.
                     Limits the depth of the detail-fetching phase.

--log-level LEVEL    Logging verbosity. One of: DEBUG, INFO, WARNING, ERROR.
                     Default: INFO.
                     DEBUG shows HTTP request/response details, filter decisions
                     for every single listing, and cache hit/miss info.

-h, --help           Show help message (run_all.sh only).
```
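
Since the scrapers use only the standard library, the shared interface presumably maps onto `argparse`; a minimal sketch matching the flags above (not the actual code):

```python
# Illustrative shared CLI: a default of None means "no limit".
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Apartment scraper")
    parser.add_argument("--max-pages", type=int, default=None,
                        help="Maximum listing pages to scrape")
    parser.add_argument("--max-properties", type=int, default=None,
                        help="Maximum detail pages to fetch")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                        help="Logging verbosity")
    return parser

args = build_parser().parse_args(["--max-pages", "1", "--log-level", "DEBUG"])
```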

### Examples

```bash
# Full scrape (all pages, all properties)
./run_all.sh

# Quick validation run (1 page per source, max 10 properties each)
./run_all.sh --max-pages 1 --max-properties 10

# Full scrape with debug logging
./run_all.sh --log-level DEBUG

# Run a single scraper
python3 scrape_bezrealitky.py --max-pages 2 --max-properties 5 --log-level DEBUG
```

## Running with Docker

The project includes a Docker setup for unattended operation with a cron-based schedule.

### Container architecture

```
┌─────────────────────────────────────────┐
│  Container (python:3.13-alpine)         │
│                                         │
│  PID 1: python3 -m http.server :8080    │
│         serves /app/data/               │
│                                         │
│  crond: runs run_all.sh at 06:00/18:00  │
│         Europe/Prague timezone          │
│                                         │
│  /app/      -- scripts (.py, .sh)       │
│  /app/data/ -- volume (JSON + HTML)     │
│     ^ symlinked from /app/byty_*        │
└─────────────────────────────────────────┘
```

On startup, the HTTP server starts immediately. The initial scrape runs in the background. Subsequent cron runs update data in place twice daily at 06:00 and 18:00 CET/CEST.
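
The twice-daily schedule in `build/crontab` presumably looks something like the following sketch (the log path is an assumption):

```
# Hypothetical sketch of build/crontab: run the full pipeline at 06:00
# and 18:00 Europe/Prague time.
0 6,18 * * * /app/run_all.sh >> /app/data/cron.log 2>&1
```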

### Quick start

```bash
make run   # Build image + start container on port 8080
           # Map available at http://localhost:8080/mapa_bytu.html
```

### Makefile targets

| Target | Description |
|--------|-------------|
| `make help` | Show all available targets |
| `make build` | Build the Docker image |
| `make run` | Build and run the container (port 8080) |
| `make stop` | Stop and remove the container |
| `make logs` | Tail container logs |
| `make scrape` | Trigger a manual scrape inside the running container |
| `make restart` | Stop and re-run the container |
| `make clean` | Stop container and remove the Docker image |
| `make validation` | Run a limited scrape in a separate Docker container (port 8081) |
| `make validation-stop` | Stop the validation container |
| `make validation-local` | Run a limited scrape locally (1 page, 10 properties) |
| `make validation-local-debug` | Same as above with `--log-level DEBUG` |

### Validation mode

Validation targets run scrapers with `--max-pages 1 --max-properties 10` for a fast smoke test (~30 seconds instead of several minutes). The Docker validation target runs on port 8081 in a separate container so it doesn't interfere with production data.

## Project structure

```
.
├── scrape_and_map.py      # Sreality scraper + map generator (generate_map())
├── scrape_realingo.py     # Realingo scraper
├── scrape_bezrealitky.py  # Bezrealitky scraper
├── scrape_idnes.py        # iDNES Reality scraper
├── scrape_psn.py          # PSN scraper
├── scrape_cityhome.py     # CityHome scraper
├── merge_and_map.py       # Merge all sources + generate final map
├── regen_map.py           # Regenerate map from cached Sreality data
├── run_all.sh             # Orchestrator script (runs all scrapers + merge)
├── mapa_bytu.html         # Generated interactive map (output)
├── Makefile               # Docker management + validation shortcuts
├── build/
│   ├── Dockerfile         # Container image definition (python:3.13-alpine)
│   ├── entrypoint.sh      # Container entrypoint (HTTP server + cron + initial scrape)
│   ├── crontab            # Cron schedule (06:00 and 18:00 CET)
│   └── CONTAINER.md       # Container-specific documentation
└── .gitignore             # Ignores byty_*.json, __pycache__, .vscode
```

## Dependencies

**None.** All scrapers use only the Python standard library (`urllib`, `json`, `re`, `argparse`, `logging`, `html.parser`). The only external tool required is `curl` (used by `scrape_psn.py` for Cloudflare TLS compatibility).

The Docker image is based on `python:3.13-alpine` (~70 MB) with `curl`, `bash`, and `tzdata` added.

## Caching behavior

Each scraper maintains a JSON file cache (`byty_<source>.json`). On each run:

1. The previous JSON file is loaded and indexed by `hash_id`.
2. For each listing found in the current run, if the `hash_id` exists in cache **and** the price is unchanged, the cached record is reused without fetching the detail page.
3. New or changed listings trigger a detail page fetch.
4. The JSON file is overwritten with the fresh results at the end.
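
Steps 1-2 above amount to a lookup like this sketch. The `hash_id`/`price` fields follow the text; the `load_cache`/`resolve` helpers and the injected `fetch_detail` callable are illustrative:

```python
# Sketch of the cache-reuse decision: reuse the cached record when the
# listing's hash_id is known and its price is unchanged.
import json
from pathlib import Path

def load_cache(path: str) -> dict:
    """Index the previous run's JSON file by hash_id (empty if missing)."""
    file = Path(path)
    if not file.exists():
        return {}
    return {item["hash_id"]: item for item in json.loads(file.read_text())}

def resolve(listing: dict, cache: dict, fetch_detail) -> dict:
    cached = cache.get(listing["hash_id"])
    if cached and cached["price"] == listing["price"]:
        return cached                 # unchanged: skip the detail page
    return fetch_detail(listing)      # new or price changed: fetch details
```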

This means the first run is slow (it fetches every detail page with rate-limiting delays), but subsequent runs are much faster as they only fetch details for new or changed listings.

## Rate limiting

Each scraper includes polite delays between requests:

| Scraper | Delay between requests |
|---------|------------------------|
| Sreality | 0.3s (details), 0.5s (pages) |
| Realingo | 0.3s (details), 0.5s (pages) |
| Bezrealitky | 0.4s (details), 0.5s (pages) |
| iDNES | 0.4s (details), 1.0s (pages) + retry backoff (3/6/9/12s) |
| PSN | 0.5s (per project page) |
| CityHome | 0.5s (per project GPS fetch) |
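
The iDNES-style retry (5 attempts, 3/6/9/12 s between tries) can be sketched as below. The `fetch_with_retry` helper and the injectable `sleep` are illustrative, not the scraper's actual code:

```python
# Sketch of retry with increasing backoff: 5 attempts, waiting
# 3, 6, 9, then 12 seconds between failed tries.
import time

def fetch_with_retry(fetch, url, attempts=5, base_delay=3, sleep=time.sleep):
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError:                    # e.g. urllib network errors
            if attempt == attempts:
                raise                      # out of attempts: give up
            sleep(base_delay * attempt)    # 3, 6, 9, 12 seconds
```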