Cross-Asset Attention Hurt the Strategy
Added MultiheadAttention across the 4 commodities before the LSTM in panel-VLSTM. Mean Sharpe Δ −0.30. Crude_Oil specifically lost 1.17.
Conventional wisdom in cross-asset DL says attention captures shared structure that per-asset architectures miss. The Oxford 2026 commodity-momentum paper uses a cross-asset block; transformer-style architectures default to it. I added it to my panel-VLSTM as Phase A item S10.
The architecture:
input x[B, A=4, L=24, F] →
per-asset VSN → h[B, A, L, hidden]
cross-asset MultiheadAttention(d_model=hidden, n_heads=4) ← NEW
applied across the 4 assets at each timestep
residual + LayerNorm
per-asset LSTM (now sees cross-asset-aware features)
per-asset head → pos[B, A]
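The NEW step in the diagram can be sketched in PyTorch as below. This is a minimal sketch, not the actual panel-VLSTM code: the class name `CrossAssetAttention` is hypothetical, and the post-norm ordering (`LayerNorm(x + attn(x))`) and zero dropout are assumptions, since the diagram only says "residual + LayerNorm". The key move is folding the time axis into the batch so that the attention sequence axis is the asset axis.

```python
import torch
import torch.nn as nn


class CrossAssetAttention(nn.Module):
    """Self-attention across the asset axis, applied independently at each timestep."""

    def __init__(self, hidden: int, n_heads: int = 4):
        super().__init__()
        # batch_first=True -> inputs are [batch, sequence, features]
        self.attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [B, A, L, hidden] (output of the per-asset VSN)
        B, A, L, H = h.shape
        # Fold time into the batch: each (sample, timestep) pair becomes
        # one "sequence" of A asset tokens.
        x = h.permute(0, 2, 1, 3).reshape(B * L, A, H)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        # Residual + LayerNorm (post-norm ordering is an assumption).
        x = self.norm(x + attn_out)
        # Restore [B, A, L, hidden] for the per-asset LSTM.
        return x.reshape(B, L, A, H).permute(0, 2, 1, 3)


if __name__ == "__main__":
    block = CrossAssetAttention(hidden=32, n_heads=4)
    h = torch.randn(8, 4, 24, 32)  # B=8, A=4, L=24, hidden=32
    print(block(h).shape)
```

Because each timestep is attended separately, the block mixes information across assets but never across time; temporal structure is left entirely to the LSTM that follows.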
Trained on H200, 5-fold walk-forward, same Sharpe-loss objective as the original panel-VLSTM. Result:
| Commodity | panel_VLSTM (orig) | panel_VLSTM_XAttn | Δ |
|---|---|---|---|
| Gold | 1.33 | 1.46 | +0.13 |
| Silver | 0.74 | 1.13 | +0.39 |
| Copper | 1.79 | 1.25 | −0.54 |
| Crude_Oil | 2.70 | 1.52 | −1.17 |
| Mean | 1.64 | 1.34 | −0.30 |
Gold and Silver gained modestly (+0.13 and +0.39). Copper and Crude_Oil collapsed (−0.54 and −1.17). On average, the architecture lost 0.30 Sharpe.
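The Δ column and the mean row can be re-derived from the rounded Sharpe values in the table. A small sketch, using only the numbers above. Recomputing from the rounded inputs gives −1.18 for Crude_Oil rather than the reported −1.17; presumably the reported Δ was computed from unrounded Sharpes, but that is an assumption.

```python
# Sharpe ratios as printed in the table (already rounded to 2 decimals).
orig = {"Gold": 1.33, "Silver": 0.74, "Copper": 1.79, "Crude_Oil": 2.70}
xattn = {"Gold": 1.46, "Silver": 1.13, "Copper": 1.25, "Crude_Oil": 1.52}

# Per-commodity change from adding cross-asset attention.
delta = {name: round(xattn[name] - orig[name], 2) for name in orig}

# Equal-weighted mean Sharpe for each architecture, and the difference.
mean_orig = sum(orig.values()) / len(orig)
mean_xattn = sum(xattn.values()) / len(xattn)
mean_delta = mean_xattn - mean_orig

if __name__ == "__main__":
    for name, d in delta.items():
        print(f"{name:10s} {d:+.2f}")
    print(f"{'Mean':10s} {mean_orig:.2f} -> {mean_xattn:.2f} ({mean_delta:+.2f})")
```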
The likely mechanism is one of these (or all):
1. Capacity over-allocation. A 4-head attention layer over a 32-dim embedding adds ~16K params. On a panel where the LSTM proper only has ~50K, that’s a 30% bump in trainable surface area on the path to a target with low SNR. Without strong cross-asset signal to extract, the attention layer becomes a fancy noise generator.
2. The 4 commodities don’t actually share much timestep-level signal. Gold and Silver co-move (gold-silver ratio is a stable 80-100). Copper trades on China industrial demand; Crude trades on OPEC + inventory. Pairing all four at every timestep dilutes the relationships that exist.
3. Pre-LSTM placement may be wrong. I put attention before the LSTM so the recurrent layer sees cross-asset-aware features. Post-LSTM placement (attention over the final hidden states) might preserve per-asset learning while still allowing cross-asset adjustment at the output.
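The parameter count in item 1 is easy to sanity-check. The sketch below assumes the layout of `torch.nn.MultiheadAttention` with default settings (packed query/key/value projection, one output projection, biases on both); the helper name `mha_param_count` is hypothetical. Under that layout the count is 4d² + 4d and does not depend on the number of heads, which gives 4,224 at d = 32 and 16,640 at d = 64, so a ~16K figure corresponds to d = 64.

```python
def mha_param_count(embed_dim: int, bias: bool = True) -> int:
    """Trainable parameters in a standard multi-head self-attention layer.

    Assumes key/value dims equal embed_dim, as in the default
    torch.nn.MultiheadAttention configuration.
    """
    in_proj = 3 * embed_dim * embed_dim   # packed Q, K, V projection weights
    out_proj = embed_dim * embed_dim      # output projection weights
    biases = 4 * embed_dim if bias else 0 # 3*d for Q/K/V plus d for output
    return in_proj + out_proj + biases


if __name__ == "__main__":
    lstm_params = 50_000  # the ~50K LSTM figure quoted in item 1
    for d in (32, 64):
        n = mha_param_count(d)
        print(f"d={d}: {n} attention params, {n / lstm_params:.0%} of the LSTM")
```

The residual LayerNorm adds a further 2d parameters, which is negligible at either width.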
I tested (1) and (2). I didn’t try (3) because Phase A’s scope is exhausted; a future iteration could sweep placement.
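For reference, the untested post-LSTM placement from item 3 might look roughly like this. It is a hypothetical sketch, not something that was run in Phase A: the class name `PostLSTMCrossAssetHead`, the single LSTM shared across assets, and the `tanh` position squash are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class PostLSTMCrossAssetHead(nn.Module):
    """Per-asset LSTM first, then attention across assets on the final hidden states."""

    def __init__(self, in_features: int, hidden: int, n_heads: int = 4):
        super().__init__()
        # Assumption: one LSTM applied to each asset's sequence (shared weights).
        self.lstm = nn.LSTM(in_features, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [B, A, L, in_features]
        B, A, L, n_feat = h.shape
        # Run the LSTM on each asset's sequence independently.
        _, (h_n, _) = self.lstm(h.reshape(B * A, L, n_feat))
        last = h_n[-1].reshape(B, A, -1)  # final hidden state: [B, A, hidden]
        # Cross-asset attention over A tokens, once per sample (not per timestep).
        attn_out, _ = self.attn(last, last, last, need_weights=False)
        z = self.norm(last + attn_out)
        # Assumption: positions squashed to [-1, 1] with tanh.
        return torch.tanh(self.head(z)).squeeze(-1)  # pos: [B, A]


if __name__ == "__main__":
    model = PostLSTMCrossAssetHead(in_features=8, hidden=32)
    print(model(torch.randn(16, 4, 24, 8)).shape)
```

In this variant the recurrent layer only ever sees its own asset's features, and attention runs once per sample instead of once per timestep, so any cross-asset adjustment is confined to the output stage.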
Bigger lesson: conventional architectural choices don’t generalize cleanly out of domain. The Oxford 2026 paper applies cross-asset attention to a different commodity universe with different cross-correlations. Their result doesn’t transfer to mine 1:1. “It worked in the paper” is not a sufficient justification for adding parameters to a model.
The Phase A v11 candidate pool keeps panel_VLSTM_XAttn for completeness. Honest walk-forward selection de-ranks it; the original panel_VLSTM stays in the top-6.
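The exact walk-forward scheme isn't spelled out here, so the following is only a sketch of one common variant: an expanding training window with contiguous, non-overlapping test blocks. The function name `walk_forward_folds` and the equal block sizing are assumptions.

```python
from typing import Iterator, Tuple


def walk_forward_folds(n_samples: int, n_folds: int = 5) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, test_idx) pairs with an expanding training window.

    The data is cut into n_folds + 1 equal blocks; fold k trains on the
    first k blocks and tests on block k + 1, so training data always
    precedes test data in time.
    """
    block = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * block
        # Let the last test block absorb any remainder.
        test_end = n_samples if k == n_folds else train_end + block
        yield range(0, train_end), range(train_end, test_end)


if __name__ == "__main__":
    for train_idx, test_idx in walk_forward_folds(600, n_folds=5):
        print(f"train [0, {train_idx.stop})  test [{test_idx.start}, {test_idx.stop})")
```

Ranking candidates on the out-of-sample Sharpe from test blocks like these is what keeps selection honest: no model is ever scored on data it trained on.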