Seeking a credible “free-only” backtesting pipeline: data hygiene, microstructure realism, and validation standards
I am exploring whether a fully free stack for systematic strategy research can produce results that are methodologically defensible. The goal is not convenience, but a pipeline that withstands audit: no paid data, no proprietary simulators, and minimal hidden assumptions. I would like to converge on a community-validated blueprint and test suite.
Key topics where I’m looking for concrete, technically rigorous input:
Data integrity without paid vendors
- Equities: Which truly free sources include point-in-time corporate actions and, critically, delisted names to mitigate survivorship bias? If none, what is the least-bad approximation strategy that is explicitly documented and auditable?
- Futures: Methods to construct continuous contracts from free daily settlements (e.g., ratio or Panama back-adjustment) with reproducible roll logic and roll calendars; a minimal back-adjustment sketch follows this list. Any free datasets with historical contract metadata to avoid look-ahead?
- Crypto: Best free sources for historical trades and depth to support microstructure modeling; recommended normalization for symbol changes, contract specs, and funding rate histories.
- Time alignment: Robust, free solutions for trading calendars, session times, and holiday schedules across regions and asset classes; see the exchange_calendars snippet below.
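To make the futures point concrete, here is a minimal sketch of difference ("Panama") back-adjustment, assuming you have already built per-contract settlement series and a roll schedule from free contract metadata. The names `contracts` and `rolls` are illustrative, and the switch-on-roll-date convention is one explicit assumption; whichever convention you pick must be documented.

```python
# Sketch: Panama (difference) back-adjustment from per-contract settlements.
# Assumed inputs (illustrative): `contracts` maps symbol -> pd.Series of daily
# settlement prices; `rolls` is a chronological list of tuples
# (roll_date, old_symbol, new_symbol), with both contracts printing on roll_date.
import pandas as pd

def panama_back_adjust(contracts: dict, rolls: list) -> pd.Series:
    chain = [rolls[0][1]] + [new for _, _, new in rolls]   # contract sequence
    edges = [None] + [d for d, _, _ in rolls] + [None]     # segment boundaries

    # Offsets computed newest-first so the live contract stays unadjusted;
    # each earlier contract inherits the cumulative price gap at its roll.
    offsets = {chain[-1]: 0.0}
    for roll_date, old_sym, new_sym in reversed(rolls):
        gap = contracts[new_sym].loc[roll_date] - contracts[old_sym].loc[roll_date]
        offsets[old_sym] = offsets[new_sym] + gap

    # Stitch segments; convention: the roll-date bar belongs to the new contract.
    segments = []
    for i, sym in enumerate(chain):
        seg = contracts[sym]
        if edges[i] is not None:
            seg = seg[seg.index >= edges[i]]
        if edges[i + 1] is not None:
            seg = seg[seg.index < edges[i + 1]]
        segments.append(seg + offsets[sym])
    return pd.concat(segments).rename("settle_adj")
```

Ratio adjustment is the same loop with multiplicative factors instead of additive offsets; either way, log the per-roll gaps alongside the output series so the adjustment itself is auditable.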
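For the time-alignment point, the free exchange_calendars package (the maintained successor to Zipline's trading_calendars) already covers sessions, opens/closes, and holidays for most major venues. A quick usage example, with XNYS (NYSE) as an arbitrary choice:

```python
# Free trading calendars via `pip install exchange_calendars`;
# calendars are keyed by ISO 10383 MIC codes (XNYS = NYSE, CMES = CME, ...).
import exchange_calendars as xcals

nyse = xcals.get_calendar("XNYS")
sessions = nyse.sessions_in_range("2023-01-03", "2023-12-29")
print(len(sessions), "NYSE sessions in range")
print(nyse.is_session("2023-07-04"))    # False: Independence Day holiday
print(nyse.session_close(sessions[0]))  # tz-aware UTC close, useful in audit logs
```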
Microstructure and cost modeling using only free inputs
- Spread proxies from OHLCV (e.g., Corwin-Schultz and other high-low estimators) and their observed error characteristics versus quote data; guidance on when they are usable by capitalization/liquidity bucket.
- Slippage/impact models that can be calibrated from free data, e.g., expected shortfall = k1·spread + k2·(size/ADV)^alpha·vol, with literature-backed ranges for k2 and alpha; both pieces are sketched in code after this list. Are there open benchmarks for calibrating these parameters by venue and period?
- Short-selling frictions with free data: borrow availability proxy, fee estimates, and regulatory constraints (uptick rules) at daily or intraday horizons.
- Realistic fills: event-driven partial fills, queue priority, and auction participation using only OHLCV or publicly available trade prints; what minimum set of assumptions yields defensible estimates?
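Here is a sketch of both pieces: the Corwin-Schultz (2012) estimator as published, arranged so the estimate at t uses only bars t-1 and t (no look-ahead), plus the cost function above with placeholder parameters. The defaults (k1 = 0.5 for half the spread; k2 and the exponent in the square-root-impact neighborhood) are assumptions to be calibrated, not recommendations.

```python
import numpy as np
import pandas as pd

def corwin_schultz_spread(high: pd.Series, low: pd.Series) -> pd.Series:
    """Corwin-Schultz (2012) high-low spread estimator (proportional spread).
    The estimate at t uses bars t-1 and t only, so it is safe in a backtest;
    negative estimates are floored at zero, as is common practice."""
    hl = np.log(high / low) ** 2
    beta = hl + hl.shift(1)                       # sum over bars t-1, t
    h2 = np.maximum(high, high.shift(1))          # two-day high
    l2 = np.minimum(low, low.shift(1))            # two-day low
    gamma = np.log(h2 / l2) ** 2
    denom = 3.0 - 2.0 * np.sqrt(2.0)
    alpha = (np.sqrt(2.0 * beta) - np.sqrt(beta)) / denom - np.sqrt(gamma / denom)
    return (2.0 * (np.exp(alpha) - 1.0) / (1.0 + np.exp(alpha))).clip(lower=0.0)

def expected_cost(spread, daily_vol, size, adv, k1=0.5, k2=0.6, a=0.5):
    """Expected shortfall per unit notional:
        k1*spread + k2*(size/ADV)**a * daily_vol.
    Defaults are placeholders: k1=0.5 means paying half the spread; k2 and a
    sit in the square-root-impact-law neighborhood and must be calibrated."""
    return k1 * spread + k2 * (size / adv) ** a * daily_vol
```

This only pins down the computation; the error-characteristics question above, by liquidity bucket and versus real quote data, remains the open part.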
Simulation engine characteristics feasible with free tools
- Event-driven vs vectorized frameworks that handle multi-asset and multi-timeframe data, corporate actions, cash flows, and margin. Candidate stacks: the Lean engine run locally with user-supplied free data, Backtrader, QSTrader, vectorbt, and backtesting.py. Which gaps matter most in each for institutional-grade realism?
- Portfolio accounting: net vs gross exposure, financing costs, FX conversion at realistic fixing times, crypto funding, and exchange fee tiers and rebates modeled from public schedules; a minimal carry-accrual sketch follows this list.
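On the accounting side, a minimal sketch of per-position daily carry, with the day-count, rebate, and funding conventions chosen as explicit assumptions rather than claims about any particular venue:

```python
def daily_carry(notional: float, kind: str, base_rate: float,
                borrow_fee: float = 0.0, funding_rate: float = 0.0) -> float:
    """Signed daily carry in account currency for one position.

    kind: 'cash' for financed equity/futures-margin cash balances,
          'perp' for crypto perpetuals.
    Assumed conventions (replace with each venue's published schedule):
    ACT/360 day count; funding_rate is the day's funding intervals pre-summed
    and applied to full notional; shorts earn base_rate minus borrow_fee.
    """
    if kind == "perp":
        return -notional * funding_rate        # positive funding: longs pay
    day = 1.0 / 360.0
    if notional >= 0:
        return -notional * base_rate * day     # financing cost on longs
    return -notional * (base_rate - borrow_fee) * day  # short rebate net of fee
```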
Validation and overfitting control without paywalled stats
- A protocol combining anchored walk-forward, Combinatorially Symmetric Cross-Validation (CSCV), the Deflated Sharpe Ratio (DSR), and the Probability of Backtest Overfitting (PBO); which open-source implementations are accurate and maintained? A DSR sketch follows this list.
- Leakage/unit tests: reproducible checks for look-ahead, survivorship, split/dividend misapplication, timezone errors, and asynchronous signal alignment; a ground-truth noise test is also sketched below.
- Robustness checks available with free data: cross-venue replication, stress under liquidity droughts, and parameter stability maps.
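The DSR reduces to a few lines once you keep an honest log of how many trials you ran. A compact implementation following Bailey and López de Prado (2014); `n_trials` and `var_sr` must come from that trial log, and `returns` are per-period returns of the finally selected strategy.

```python
import numpy as np
from scipy import stats

EULER_GAMMA = 0.5772156649015329

def expected_max_sharpe(n_trials: int, var_sr: float) -> float:
    """Expected maximum Sharpe across n_trials under H0: true SR = 0
    (Bailey & Lopez de Prado 2014). var_sr: cross-trial variance of the
    estimated Sharpe ratios, taken from the trial log."""
    return np.sqrt(var_sr) * (
        (1.0 - EULER_GAMMA) * stats.norm.ppf(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * stats.norm.ppf(1.0 - 1.0 / (n_trials * np.e))
    )

def deflated_sharpe(returns, n_trials: int, var_sr: float) -> float:
    """Probability that the selected strategy's per-period Sharpe exceeds
    the expected maximum under multiple testing (a PSR with SR* = E[max SR])."""
    r = np.asarray(returns, dtype=float)
    t = len(r)
    sr = r.mean() / r.std(ddof=1)
    g3 = stats.skew(r)                       # skewness of per-period returns
    g4 = stats.kurtosis(r, fisher=False)     # non-excess kurtosis
    sr0 = expected_max_sharpe(n_trials, var_sr)
    z = (sr - sr0) * np.sqrt(t - 1.0) / np.sqrt(
        1.0 - g3 * sr + (g4 - 1.0) / 4.0 * sr ** 2)
    return float(stats.norm.cdf(z))
```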
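And a ground-truth leakage test in the spirit of the bullet above: run the entire pipeline on i.i.d. noise, where the true Sharpe of any strategy is zero, and flag a pipeline whose mean Sharpe across noise paths is significantly nonzero. Here `run_backtest` is an illustrative stand-in for whatever entry point your engine exposes.

```python
import numpy as np

def test_no_lookahead_on_noise(run_backtest, n_paths=200, t=1500, seed=7):
    """On i.i.d. noise prices there is nothing to predict, so a pipeline that
    shows systematically positive Sharpe is leaking future information.
    Assumes run_backtest(prices) -> per-period strategy returns."""
    rng = np.random.default_rng(seed)
    sharpes = []
    for _ in range(n_paths):
        # Geometric random walk: 1% daily vol, zero drift, known ground truth.
        prices = 100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, t)))
        rets = np.asarray(run_backtest(prices))
        sharpes.append(rets.mean() / (rets.std(ddof=1) + 1e-12))
    mean_sr = np.mean(sharpes)
    se = np.std(sharpes, ddof=1) / np.sqrt(n_paths)
    assert abs(mean_sr) < 4.0 * se, f"possible leakage: mean SR = {mean_sr:.3f}"
```

The same harness extends to survivorship and split/dividend checks by planting known corporate actions in synthetic data and asserting the accounting reproduces them exactly.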
Reproducibility standards
- Data snapshotting with hashes, deterministic random seeds, containerized environments, and metadata logs, so that another researcher can reproduce a backtest bit-for-bit from the same free sources at a later date; a manifest sketch follows this list.
- A minimal “backtest audit” checklist suitable for publication or investment committee review.
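As a concrete floor for the snapshotting bullet, hashing every input file plus environment metadata into a manifest is cheap and makes the "same data later" claim checkable by diff. Paths and field names here are illustrative:

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path

def snapshot_manifest(data_dir: str, out_path: str = "manifest.json") -> dict:
    """SHA-256 every file under data_dir and record environment metadata.
    Re-running this later and diffing the output verifies that a backtest
    is consuming byte-identical inputs."""
    files = {
        str(p.relative_to(data_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(data_dir).rglob("*")) if p.is_file()
    }
    manifest = {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "platform": platform.platform(),
        "files": files,
    }
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```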
Proposed starting point for a minimal, auditable free stack (open to critique):
- Data: Stooq/Yahoo for equities EOD with explicit caveats; NASDAQ Data Link CHRIS for daily futures settlements; Binance historical trades and order book snapshots for crypto; exchange fee schedules and funding rates where published; open trading calendars.
- Engine: Lean local or Backtrader for event-driven simulation; vectorbt for research prototyping; risk/reporting via empyrical/pyfolio-like metrics.
- Cost model baseline: a Corwin-Schultz spread proxy plus a volatility- and participation-based slippage term, calibrated on crypto (where depth data is free) and then scaled to equities/futures by spread and volatility, with explicit sensitivity bands in every report.
- Validation: anchored walk-forward, deflated Sharpe, combinatorial CV, plus a leakage test suite run on synthetic datasets with known ground truth.
Questions to the community:
- What specific free datasets or methods close the biggest realism gaps above, especially for survivorship and delistings in equities?
- Has anyone published an open calibration of slippage/impact parameters from free microstructure data that generalizes beyond a single venue?
- Which open-source engine today strikes the best balance between event-driven realism and reproducibility for multi-asset portfolios?
- What would you add to a minimal “backtest audit” that can be executed with only free tools and data?
If there is interest, I can help assemble a shared reference repository with a canonical dataset snapshot, engines configured per the above, and a test battery that strategies must pass before performance is reported.