Cross-Asset Attention Hurt the Strategy

BLIP · May 3, 2026 · Research · 2 min read

Added MultiheadAttention across the 4 commodities before the LSTM in panel-VLSTM. Mean Sharpe Δ −0.30. Crude_Oil specifically lost 1.17.

Conventional wisdom in cross-asset DL says attention captures shared structure that per-asset architectures miss. The Oxford 2026 commodity-momentum paper uses a cross-asset block; transformer-style architectures default to it. I added it to my panel-VLSTM as Phase A item S10.

The architecture:

input x[B, A=4, L=24, F]  →
  per-asset VSN → h[B, A, L, hidden]
  cross-asset MultiheadAttention(d_model=hidden, n_heads=4)  ← NEW
    applied across the 4 assets at each timestep
    residual + LayerNorm
  per-asset LSTM (now sees cross-asset-aware features)
  per-asset head → pos[B, A]
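
A minimal PyTorch sketch of the new block, under assumptions: the class name `CrossAssetAttention` is hypothetical, and folding the time axis into the batch is one plausible way to apply attention "across the 4 assets at each timestep", not necessarily how the original code does it.

```python
import torch
import torch.nn as nn


class CrossAssetAttention(nn.Module):
    """Self-attention over the asset axis, applied independently at each timestep."""

    def __init__(self, hidden: int, n_heads: int = 4):
        super().__init__()
        # batch_first=True means inputs are [batch, sequence, embed].
        self.attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [B, A, L, hidden], the output of the per-asset VSN.
        B, A, L, H = h.shape
        # Fold time into the batch so the attention "sequence" is the A assets.
        x = h.permute(0, 2, 1, 3).reshape(B * L, A, H)
        mixed, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm(x + mixed)  # residual + LayerNorm
        # Restore the [B, A, L, hidden] layout expected by the per-asset LSTM.
        return x.reshape(B, L, A, H).permute(0, 2, 1, 3)


# Shapes follow the diagram: A=4 assets, L=24 timesteps; hidden=32 is illustrative.
block = CrossAssetAttention(hidden=32, n_heads=4)
out = block(torch.randn(8, 4, 24, 32))
print(out.shape)
```

Because each timestep is attended separately, an asset only sees the other assets' features at the same step; mixing across time is left to the LSTM.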

Trained on H200, 5-fold walk-forward, same Sharpe-loss objective as the original panel-VLSTM. Result:

Commodity	panel_VLSTM (orig)	panel_VLSTM_XAttn	Δ
Gold	1.33	1.46	+0.13
Silver	0.74	1.13	+0.39
Copper	1.79	1.25	−0.54
Crude_Oil	2.70	1.52	−1.17
Mean	1.64	1.34	−0.30
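
The Δ column and the Mean row can be recomputed from the table as a sanity check. This is a small sketch that uses only the rounded values printed above, so a result can differ from a reported Δ by 0.01 (Crude_Oil comes out at −1.18 here), presumably because the reported figures were computed from unrounded Sharpes.

```python
# Sharpe ratios copied from the table above (already rounded to 2 decimals).
orig = {"Gold": 1.33, "Silver": 0.74, "Copper": 1.79, "Crude_Oil": 2.70}
xattn = {"Gold": 1.46, "Silver": 1.13, "Copper": 1.25, "Crude_Oil": 1.52}

# Per-commodity change from adding cross-asset attention.
deltas = {name: round(xattn[name] - orig[name], 2) for name in orig}

# Equal-weighted means across the 4 commodities.
mean_orig = round(sum(orig.values()) / len(orig), 2)
mean_xattn = round(sum(xattn.values()) / len(xattn), 2)
mean_delta = round(mean_xattn - mean_orig, 2)

for name, d in deltas.items():
    print(f"{name:<10} {d:+.2f}")
print(f"{'Mean':<10} {mean_orig:.2f} -> {mean_xattn:.2f} ({mean_delta:+.2f})")
```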

Gold and Silver gained (+0.13 and +0.39). Copper and Crude_Oil dropped sharply (−0.54 and −1.17). On average, the architecture lost 0.30 Sharpe.

The likely mechanism is one of these (or all):

1. Capacity over-allocation. A 4-head attention layer over a 32-dim embedding adds ~16K params. On a panel where the LSTM proper only has ~50K, that’s a 30% bump in trainable surface area on the path to a target with low SNR. Without strong cross-asset signal to extract, the attention layer becomes a fancy noise generator.
2. The 4 commodities don’t actually share much timestep-level signal. Gold and Silver co-move (gold-silver ratio is a stable 80-100). Copper trades on China industrial demand; Crude trades on OPEC + inventory. Pairing all four at every timestep dilutes the relationships that exist.
3. Pre-LSTM placement may be wrong. I put attention before the LSTM so the recurrent layer sees cross-asset-aware features. Post-LSTM placement (attention over the final hidden states) might preserve per-asset learning while still allowing cross-asset adjustment at the output.
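
The parameter arithmetic behind the capacity argument can be checked with a small helper. It assumes the standard multi-head attention layout (as in PyTorch's `nn.MultiheadAttention` with biases): four `d_model × d_model` projections for Q, K, V and the output, plus a LayerNorm. The function names and the 50K LSTM figure passed in are illustrative.

```python
def mha_param_count(d_model: int, bias: bool = True) -> int:
    """Parameters in one standard multi-head self-attention layer.

    Q, K, V and the output projection are each d_model x d_model (+ bias).
    The number of heads only splits d_model, so it does not change the total.
    """
    per_projection = d_model * d_model + (d_model if bias else 0)
    return 4 * per_projection


def layernorm_param_count(d_model: int) -> int:
    """Scale and shift vectors of a LayerNorm over d_model features."""
    return 2 * d_model


def relative_bump(d_model: int, base_params: int) -> float:
    """Added attention + LayerNorm parameters as a fraction of an existing model."""
    added = mha_param_count(d_model) + layernorm_param_count(d_model)
    return added / base_params


for d in (32, 64):
    added = mha_param_count(d) + layernorm_param_count(d)
    print(f"d_model={d}: {added} params, {relative_bump(d, 50_000):.1%} of a 50K LSTM")
```

Under this layout a 32-dim layer comes to roughly 4.3K parameters and a 64-dim layer to roughly 16.8K, so the ~16K and ~30% figures correspond to a hidden size closer to 64 than 32.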

I tested (1)+(2). Didn’t try (3) — Phase A’s scope is exhausted. Future iteration could sweep placement.
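
A placement sweep could be set up with a single switch. The toy model below is a sketch under several assumptions, none taken from the original code: `PanelSketch` is a hypothetical name, a `Linear` layer stands in for the per-asset VSN, one LSTM is shared across assets, and the head uses `tanh` to bound positions.

```python
import torch
import torch.nn as nn


class PanelSketch(nn.Module):
    """Toy panel model with switchable cross-asset attention placement."""

    def __init__(self, n_features: int, hidden: int = 32, n_heads: int = 4,
                 placement: str = "pre"):
        super().__init__()
        if placement not in {"none", "pre", "post"}:
            raise ValueError(f"unknown placement: {placement}")
        self.placement = placement
        self.embed = nn.Linear(n_features, hidden)             # stand-in for the VSN
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)  # shared across assets here
        self.attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden)
        self.head = nn.Linear(hidden, 1)

    def _mix_assets(self, z: torch.Tensor) -> torch.Tensor:
        # z: [N, A, hidden]; attention runs over the asset axis A.
        mixed, _ = self.attn(z, z, z, need_weights=False)
        return self.norm(z + mixed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, A, L, F]
        B, A, L, _ = x.shape
        h = self.embed(x)                                      # [B, A, L, H]
        H = h.shape[-1]
        if self.placement == "pre":
            # Mix assets at every timestep before the recurrent layer.
            z = h.permute(0, 2, 1, 3).reshape(B * L, A, H)
            h = self._mix_assets(z).reshape(B, L, A, H).permute(0, 2, 1, 3)
        out, _ = self.lstm(h.reshape(B * A, L, H))             # one sequence per asset
        last = out[:, -1, :].reshape(B, A, H)                  # final hidden state per asset
        if self.placement == "post":
            # Mix assets once, over the final hidden states only.
            last = self._mix_assets(last)
        return torch.tanh(self.head(last)).squeeze(-1)         # pos[B, A] in [-1, 1]


x = torch.randn(8, 4, 24, 10)  # F=10 is an arbitrary feature count
for placement in ("none", "pre", "post"):
    print(placement, PanelSketch(n_features=10, placement=placement)(x).shape)
```

With `placement="post"` the LSTM never sees another asset's features, which is the property item (3) is after; the attention cost also drops from once per timestep to once per sample.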

Bigger lesson: conventional architectural choices don’t generalize cleanly out of domain. The Oxford 2026 paper applies cross-asset attention to a different commodity universe with different cross-correlations. Their result doesn’t transfer to mine 1:1. “It worked in the paper” is not a sufficient justification for adding parameters to a model.

The Phase A v11 candidate pool keeps panel_VLSTM_XAttn for completeness. Honest walk-forward selection de-ranks it; the original panel_VLSTM stays in the top-6.
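
For readers unfamiliar with the term, walk-forward evaluation means every test block lies strictly after the data used to fit the model. The sketch below shows one common expanding-window variant with 5 folds; the function name and the equal-block split are assumptions, and the exact scheme used in Phase A may differ.

```python
from typing import Iterator, Tuple


def walk_forward_folds(n_obs: int, n_folds: int = 5) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges with an expanding training window.

    The series is cut into n_folds + 1 equal blocks. Fold k trains on the
    first k blocks and tests on block k + 1, so no test index ever precedes
    a training index.
    """
    block = n_obs // (n_folds + 1)
    if block == 0:
        raise ValueError("not enough observations for the requested folds")
    for k in range(1, n_folds + 1):
        train = range(0, k * block)
        # The last fold absorbs any remainder so every observation is used.
        stop = (k + 1) * block if k < n_folds else n_obs
        yield train, range(k * block, stop)


for train, test in walk_forward_folds(1200, n_folds=5):
    print(f"train [0, {train.stop})  test [{test.start}, {test.stop})")
```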
