Stat 214 โ Connecting Language to fMRI ยท Spring 2026
Overview
Subjects listen to podcast stories while an fMRI scanner records brain activity every 2 seconds at ~94,000 voxels. We build encoding models: text embeddings in, predicted BOLD signal out.
Notation used throughout this guide
Symbol
Meaning
Typical value
$j$
Subject index
$j \in \{2, 3\}$
$s$
Story index
$s = 1, \ldots, 101$
$i$
Voxel index
$i = 1, \ldots, V_j$
$t$
Timepoint index (within a story or stacked)
$W_s$
Number of words in story $s$
$\sim 800$
$p$
Embedding dimension
$300$ (Word2Vec), $768$ (BERT)
$T_s$
Number of fMRI timepoints for story $s$
$\sim 241$
$T$
Total timepoints across all training stories
$\approx 23{,}000$
$V_j$
Number of voxels for subject $j$
$94{,}251$ (subj 2), $95{,}556$ (subj 3)
$E$
Raw word embedding matrix for one story
$\in \mathbb{R}^{W_s \times p}$
$E'$
Downsampled embedding matrix (after alignment)
$\in \mathbb{R}^{T_s \times p}$
$X$
Design matrix (delayed features, stacked across stories)
$\in \mathbb{R}^{T \times 4p}$
$Y$
BOLD response matrix (stacked across stories, one subject)
$\in \mathbb{R}^{T \times V_j}$
$\hat{B}$
Ridge weight matrix (all voxels). Column $i$ is $\hat{\beta}_i$
$\in \mathbb{R}^{4p \times V_j}$
$\hat{\beta}_i$
Weight vector for voxel $i$
$\in \mathbb{R}^{4p}$
$\alpha_{\text{ridge}}$
Ridge regularization parameter
selected via CV
$r_i$
Pearson CC for voxel $i$ on test story
When working within a single subject (as we usually do), we drop the $j$ subscript and write $V$ for $V_j$.
One model per subject. Ridge regression is fit with the same $X$ matrix across all ~94k voxels (different $y$ column per voxel). Evaluation: Pearson correlation between predicted and actual BOLD per voxel on held-out stories.
Dimensions at Each Step (one story, BERT, subject 2)
Step
Object
Shape
Notes
Raw text
Word list
$W_s$ words
e.g., $W_s \approx 800$ for a ~5 min story
Embedding
$E$
$W_s \times p$
$p = 768$ for BERT, $300$ for Word2Vec
Downsample
$E'$
$(T_s + 15) \times p$
Uses all tr_times entries (e.g., 256 for adollshouse)
Align to fMRI
$E'$
$T_s \times p$
Drop 5 padding rows at start, 10 at end โ matches Y
Add delays
$X_{\text{story}}$
$T_s \times 4p$
$4 \times 768 = 3072$ for BERT
Stack stories
$X$
$T \times 4p$
$T = \sum_s T_s \approx 23{,}000$
BOLD response
$Y$
$T \times V$
$V = 94{,}251$ for subject 2
Ridge solution
$\hat{B}$
$4p \times V$
One weight vector $\hat{\beta}_i$ per voxel (column)
Prediction
$\hat{Y}_{\text{test}}$
$T_{\text{test}} \times V$
Evaluate CC per column
Neuroimaging Terms
fMRI and the BOLD Signal
fMRI
Functional Magnetic Resonance Imaging. Measures brain activity indirectly by detecting changes in blood oxygenation. Does not measure neural firing directly: it captures the hemodynamic consequences of neural activity.
BOLD Signal
Blood-Oxygen-Level-Dependent signal. When neurons fire, local blood flow increases and the ratio of oxygenated to deoxygenated hemoglobin shifts. This change is what fMRI detects. The BOLD signal is a proxy for neural activity, not the activity itself.
Voxel
A volumetric pixel: a tiny 3D cube of brain tissue (~2mm per side in this experiment). Each subject has approximately 94,000 voxels. Think of each voxel as one spatial "measurement point" in the brain. The data for one story $s$ and one subject $j$ is a matrix $Y \in \mathbb{R}^{T_s \times V_j}$, where $T_s$ is the number of timepoints for that story and $V_j$ is the number of voxels for that subject ($V_2 = 94{,}251$, $V_3 = 95{,}556$). Each subject has a different voxel count because brains differ in size and shape. We fit entirely separate models per subject: no data is shared across subjects.
TR (Repetition Time)
The time interval between consecutive fMRI scans: TR $\approx$ 2.0045 seconds in this experiment. Every TR interval, the scanner captures one full-brain snapshot (called a volume or timepoint). A 10-minute story yields roughly 300 timepoints. We use $T_s$ to denote the number of timepoints for story $s$.
Hemodynamic Response
HRF (Hemodynamic Response Function)
The BOLD signal does not spike instantly when neurons fire. It peaks approximately 4-6 seconds after the neural event and takes ~15 seconds to return to baseline. This lag is the hemodynamic response function. The pipeline accounts for this by adding time-lagged copies of the features (see Delays).
Consequence for modeling: if a subject hears the word "dog" at time $t$, the brain's BOLD response to "dog" appears at roughly $t + 4$ to $t + 8$ seconds. This is why we need the make_delayed function: it gives the ridge regression access to features from 2, 4, 6, and 8 seconds before each BOLD measurement.
Text Embeddings
An embedding converts a word (or subword) into a numerical vector. This is what allows us to use text as input to a regression model.
Static Embeddings
Bag of Words
Simplest approach: represent each word as a one-hot or frequency vector. No word order, no semantics. Serves as a baseline.
Word2Vec
Learns embeddings from local context windows using a shallow neural network. Two variants: CBOW (predict target from neighbors) and Skip-gram (predict neighbors from target). Each word gets a fixed vector regardless of context. Pretrained versions are available (e.g., Google News 300d).
GloVe
Global Vectors for Word Representation. Learns from word-word co-occurrence statistics across an entire corpus (not just local windows). Uses a log-bilinear regression model. Captures slightly different relationships than Word2Vec due to the global perspective. Also static: one vector per word.
Static embeddings assign the same vector to a word regardless of context. "Bank" has one representation whether it refers to a riverbank or a financial institution. This is a fundamental limitation for modeling how the brain processes language in context.
BERT
BERT
Bidirectional Encoder Representations from Transformers. A transformer-based model pretrained via masked language modeling (MLM). Unlike static embeddings, BERT produces contextual representations: the vector for each word depends on the surrounding words. Uses multi-head self-attention to capture relationships between all pairs of words in a sequence.
Key properties:
Bidirectional: attends to both preceding and following words (unlike GPT, which is left-to-right only)
Max sequence length: 512 tokens. Long stories must be processed in overlapping windows
Uses WordPiece tokenizer, which splits rare words into subword units (no OOV problems)
Embeddings are typically extracted from the last hidden layer, with mean pooling over subwords
Masked Language Modeling (MLM)
BERT's pretraining objective. 15% of tokens are selected, then: 80% are replaced with [MASK], 10% with a random token, 10% unchanged. The model learns to predict the original tokens from context.
Fine-tuning vs transfer learning:
Transfer learning: freeze BERT's weights, only train new layers on top (e.g., a classification head). Embeddings come from the frozen model
Fine-tuning: allow BERT's weights to update during a second round of training on your specific data. More expressive but risks overfitting on small datasets
LoRA (Low-Rank Adaptation)
LoRA
A parameter-efficient fine-tuning method. In a transformer like BERT, each layer contains large weight matrices $W \in \mathbb{R}^{d_1 \times d_2}$ (e.g., the query projection in attention has $d_1 = d_2 = 768$). Full fine-tuning updates all entries of $W$, which means storing gradients for all ~110M parameters.
The idea behind LoRA: the task-specific update $\Delta W$ probably doesn't need the full rank of the matrix. So LoRA freezes $W$ and learns a low-rank factorization:
$$\Delta W = \frac{\gamma}{r} AB^\top, \qquad A \in \mathbb{R}^{d_1 \times r},\; B \in \mathbb{R}^{d_2 \times r}$$
where $r \ll \min(d_1, d_2)$ is the rank (e.g., $r = 4$ or $8$), and $\gamma$ is a scaling factor that controls the update magnitude (not to be confused with $\alpha_{\text{ridge}}$, the ridge regularization parameter). Only $A$ and $B$ are trained; the original $W$ stays frozen.
Low-rank intuition: a full $768 \times 768$ matrix has ~590K parameters per layer. A rank-4 factorization has only $768 \times 4 + 768 \times 4 = 6{,}144$ parameters. The assumption is that the direction BERT needs to shift for this task lives in a low-dimensional subspace, like PCA for weight updates.
Why LoRA for this lab? Full fine-tuning of BERT on ~24k words of story transcripts risks overfitting. LoRA's low parameter count acts as implicit regularization while still allowing the model to adapt its contextual representations to the narrative domain. Students fine-tune with LoRA on the MLM objective (same pretraining task, different data), then extract embeddings from the adapted model.
The Modeling Pipeline
Illustrated Pipeline (with dimensions)
Step 1
Raw text (one story $s$)
thequickbrown...end
$W_s \approx 800$ words with onset timestamps
โ embed (Word2Vec / GloVe / BERT)
Step 2
Word embeddings $E$
$E \in \mathbb{R}^{W_s \times p}$ (e.g., $800 \times 768$) โ each row = one word's embedding vector
โ Lanczos downsample to scanner acquisition rate
โ ridge: $\hat{B} = (X^\top X + \alpha_{\text{ridge}} I)^{-1} X^\top Y$ โ inverted once, applied to all $V_j$ columns
Step 6
Weight matrix $\hat{B}$
$\hat{B} \in \mathbb{R}^{4p \times V_j}$ โ each column $\hat{\beta}_i$ = one voxel's learned weights
โ predict on test story: $\hat{Y}_{\text{test}} = X_{\text{test}} \hat{B}$, evaluate $r_i = \mathrm{corr}(y_i, \hat{y}_i)$ per voxel
Step 7
Pearson CC per voxel ($r_i$)
Most voxels have $r_i \approx 0$ (gray). Language-responsive voxels reach $r_i \sim 0.3\text{-}0.5$ (gold).
Step 1: Lanczos Downsampling
Words occur at irregular intervals (~3 per second on average, but varying). The fMRI scanner captures snapshots at regular intervals (every TR $\approx$ 2 seconds). We need to convert from "one embedding per word, at irregular times" to "one embedding per timepoint, at regular times," which means interpolating onto a new time grid.
What downsampling does
Think of each word embedding as a point in time: word $k$ has embedding vector $e_k \in \mathbb{R}^p$ at onset time $t_k^{\text{word}}$. We want an embedding vector at each scanner timepoint $t_n^{\text{scan}}$ (for $n = 1, \ldots, T_s$). The Lanczos filter computes a weighted average of nearby word embeddings:
$$E'(t_n^{\text{scan}}) = \sum_{k=1}^{W_s} w_{nk} \cdot e_k$$
where the weights $w_{nk}$ depend on how close word $k$'s onset time is to scanner timepoint $n$. Words spoken near $t_n^{\text{scan}}$ get high weight; words spoken far from $t_n^{\text{scan}}$ get near-zero weight. The Lanczos kernel (a windowed sinc function) determines the exact weights, with a window parameter controlling how many nearby words contribute (default: 3 lobes).
Dimensions: story $s$ has $W_s$ words. The tr_times array has $T_s + 15$ entries (including 5 padding entries before the story and 10 after), so the downsampled matrix $E' \in \mathbb{R}^{(T_s + 15) \times p}$ has more rows than Y. The alignment step (Step 3) drops the padding rows to get $E' \in \mathbb{R}^{T_s \times p}$.
Example:adollshouse has $W_s = 1656$ words and tr_times has 256 entries, so downsampling gives $1656 \times 768 \to 256 \times 768$. After dropping the 5 + 10 padding rows (Step 3), you get $241 \times 768$, matching Y. Each row is a weighted blend of nearby word embeddings, with words closest in time weighted most heavily.
Because each timepoint's embedding is a blend of several words, the model never sees a "pure" single-word signal. This matters for interpretation: when SHAP says "the word 'dog' is important for this voxel," it means that removing 'dog' changes the blended embeddings at the timepoints where 'dog' contributes, which in turn changes the predicted BOLD signal. The word-level attribution works because the SHAP/LIME wrapper re-runs the full pipeline (including downsampling) for each text perturbation, so the effect of removing a single word propagates naturally through the blending step.
Step 2: Temporal Delays
Due to the HRF, the brain's response to a word heard at time $t$ appears in the fMRI data at approximately $t + 4$ to $t + 8$ seconds. The function make_delayed(stim, delays=[1,2,3,4]) accounts for this.
Delayed Feature Matrix
Starting from the downsampled embedding $E' \in \mathbb{R}^{T_s \times p}$, create shifted copies at delays of 1, 2, 3, and 4 TR intervals and concatenate horizontally:
$$X_{\text{story}} = [E'_{t-1},\; E'_{t-2},\; E'_{t-3},\; E'_{t-4}] \in \mathbb{R}^{T_s \times 4p}$$
Row $t$ contains the embedding features from 1, 2, 3, and 4 TR intervals before timepoint $t$. Early rows are zero-padded (there are no words before the story starts).
Why these specific delays? The HRF peaks at 4-6 seconds post-stimulus. With TR $\approx$ 2s, delays of 1-4 TR intervals correspond to 2, 4, 6, and 8 seconds. The ridge regression learns a weight for each delay at each voxel, effectively fitting the HRF shape per voxel. Some voxels respond faster (more weight on delay 1-2), others slower (more weight on delay 3-4).
Step 3: Aligning X to Y
The scanner started ~10 seconds before each story (giving 5 TRs with negative timestamps) and continued after the story ended. The raw data had 256 timepoints for adollshouse, but the Y data provided to you has only 241 rows: 15 timepoints were removed during preprocessing because fMRI responses at the edges of stories are noisier due to scanner transients and detrending artifacts.
When you downsample your embeddings using all 256 tr_times entries, you get 256 rows. You need to drop the same 15 rows that were removed from Y so the dimensions match. The question is which 15.
The Huth lab's own tutorial code uses [5+trim:-trim] with trim=5, giving [10:-5]: drop 10 from the start (5 pre-story padding + 5 noise buffer) and 5 from the end (noise buffer). The lab instructions say "trim the first 5 and last 10," which reverses this, but the [10:-5] split has stronger reference support and tends to give better correlations in practice. Either way, you need to remove exactly 15 rows total, and document your choice.
If you want to trim a few additional TRs from both X and Y beyond these 15 (e.g., because the BOLD signal still shows some edge effects in the kept region), that's a defensible judgment call. The edge artifacts are gradual rather than sharp, so the pre-trimming removed the worst of it but didn't eliminate residual effects entirely.
Stacking stories: after alignment, the per-story matrices are vertically concatenated across all training stories to form the full design matrix $X \in \mathbb{R}^{T \times 4p}$ and response matrix $Y \in \mathbb{R}^{T \times V}$, where $T = \sum_s T_s \approx 23{,}000$ is the total number of timepoints across all stories.
Step 4: Ridge Regression
The approach is sometimes called "mass-univariate" in the neuroimaging literature: we fit one independent regression model per voxel (~94k models), all sharing the same feature matrix $X$ but with different target columns $y_i$. Here's the notation.
Setup and Notation
Fix a subject $j \in \{2, 3\}$. Let $p$ denote the embedding dimension (e.g., $p = 300$ for Word2Vec, $p = 768$ for BERT). After applying delays with lags 1 through 4, the feature dimension becomes $4p$. Let $V_j$ denote the number of voxels for subject $j$ ($V_2 = 94{,}251$, $V_3 = 95{,}556$). Let $T$ denote the total number of timepoints across all training stories.
Design matrix (shared across voxels):
$$X \in \mathbb{R}^{T \times 4p}$$
Row $t$ of $X$ is $x(t)^\top$, the delayed embedding vector at time $t$. This matrix is the same for every voxel: it depends only on the text and the embedding method, not on which voxel we are predicting.
Response matrix (one column per voxel):
$$Y \in \mathbb{R}^{T \times V_j}$$
Column $i$ of $Y$ is the BOLD time series $y_i = (y_i(1), \ldots, y_i(T))^\top$ at voxel $i$. This is the fMRI data.
The Model (per voxel)
For each voxel $i = 1, \ldots, V_j$, we fit a separate ridge regression:
$$\hat{\beta}_i = \arg\min_{\beta \in \mathbb{R}^{4p}} \|y_i - X\beta\|_2^2 + \alpha_{\text{ridge}} \|\beta\|_2^2$$
The predicted BOLD signal at voxel $i$ and time $t$ is:
$$\hat{y}_i(t) = x(t)^\top \hat{\beta}_i$$
Each voxel gets its own weight vector $\hat{\beta}_i \in \mathbb{R}^{4p}$, learned from the same design matrix $X$ but a different target column $y_i$. Intuitively, the weights encode which embedding features (and at which time lags) are predictive of activity at that specific brain location.
Efficient Joint Solution
Although each voxel has its own $\hat{\beta}_i$, we can solve all $V_j$ regressions at once:
$$\hat{B} = (X^\top X + \alpha_{\text{ridge}} I)^{-1} X^\top Y \in \mathbb{R}^{4p \times V_j}$$
Column $i$ of $\hat{B}$ is $\hat{\beta}_i$. The important thing: $X^\top X + \alpha_{\text{ridge}} I$ is inverted only once (it doesn't depend on $i$), then multiplied by each column of $Y$. That's why fitting ~94k regressions is feasible.
Cross-validation: leave-one-story-out (LOSO). Train on all stories except one, predict on the held-out story. The regularization parameter $\alpha_{\text{ridge}}$ is selected via grid search over the LOSO folds.
Incremental LOSO trick
Don't recompute $X^\top X$ and $X^\top Y$ from scratch for every fold. Precompute the global sums across all stories:
$$X^\top X = \sum_s X_s^\top X_s, \qquad X^\top Y = \sum_s X_s^\top Y_s$$
Then for fold $s$ (holding out story $s$), the training-set quantities are just:
$$(X^\top X - X_s^\top X_s + \alpha_{\text{ridge}} I)^{-1}(X^\top Y - X_s^\top Y_s)$$
One matrix inverse per fold per $\alpha$ value, instead of rebuilding $X^\top X$ from ~100 stories each time. Same idea applies across $\alpha$ values: the subtraction is the same for all $\alpha$, only the diagonal shift changes.
Evaluation
On a held-out test story with $T_{\text{test}}$ timepoints, the predicted BOLD at voxel $i$ is $\hat{y}_i^{\text{test}} = X_{\text{test}} \hat{\beta}_i \in \mathbb{R}^{T_{\text{test}}}$. Model performance at voxel $i$ is the Pearson correlation:
$$r_i = \mathrm{corr}(y_i^{\text{test}}, \hat{y}_i^{\text{test}})$$
We report aggregate statistics across all voxels: mean CC, median CC, top 1% CC, and top 5% CC. Most voxels will have $r_i \approx 0$ (they are not involved in language processing), while a subset of language-responsive voxels will show meaningful correlations.
Train/Test Split Strategies
The lab instructions say "split the stories into a training and test set" but leave the details to you. There are several reasonable approaches, and the choice matters for how you report and compare results.
Fixed holdout: reserve a small number of stories (e.g., 2-5) as a test set, train on the rest. Simple, but your CC estimates depend on which specific stories you held out. A long, complex story might be harder to predict than a short simple one, so the particular test set can shift your numbers noticeably.
Leave-one-story-out (LOSO): hold out each story in turn, train on the remaining ~100, predict the held-out story. This gives you a CC distribution across held-out stories rather than a single number, which is more informative. It's also what Huth et al. use for $\alpha_{\text{ridge}}$ selection. The downside is computational cost (101 ridge fits per embedding method), though the incremental trick mentioned above helps a lot.
Multiple random splits: randomly partition stories into train/test several times, fit and evaluate each split, report averaged CCs. This gives you error bars on your evaluation metrics, which strengthens any comparison between embedding methods. If BERT beats GloVe on one split but not another, that's important to know.
Whatever you choose, use the same splits across all embedding methods so comparisons are fair. And if you report a single number (e.g., "mean CC = 0.12"), be clear about what it's averaging over: voxels, stories, splits, or some combination.
Do not use the same held-out stories for both hyperparameter selection ($\alpha_{\text{ridge}}$) and final model evaluation. If you select $\alpha$ using LOSO across all stories, then report CC on one of those same held-out folds, you've leaked information. Either reserve a separate final test set that's never touched during CV, or be explicit that your reported CCs are cross-validated estimates, not independent test performance.
Fit entirely separate models per subject. Subjects have different numbers of voxels and different brain geometry, so sharing data across subjects would introduce data leakage and wouldn't make sense anatomically.
Memory is the main computational bottleneck. The $X$ matrix ($T \times 4p$) is manageable, but $Y$ ($T \times 94{,}000$) in float64 is ~18 GB across all stories. Practical solutions: cast to float32 (halves memory), use memory-mapped loading (np.load(..., mmap_mode='r')), and process voxels in batches rather than all at once. Students who completed 3.1 should already have a working pipeline for this.
Interpretation (Lab 3.3)
The ridge weights $\hat{\beta}_i \in \mathbb{R}^{4p}$ don't tell you "which words matter" directly, because each BERT embedding dimension doesn't correspond to anything human-readable. (With one-hot bag-of-words you'd get this for free, but we lose it with distributed representations.) To get word-level interpretations, we use post-hoc attribution methods.
SHAP
SHAP (SHapley Additive exPlanations)
Grounded in cooperative game theory. Assigns each input feature (word) a contribution to the model's prediction based on Shapley values: the average marginal contribution of the feature across all possible coalitions. For BERT models, uses a partition explainer that replaces words with [MASK] tokens, preserving positional structure while removing semantic content.
Theoretically grounded (additive, symmetric), but computationally expensive. In practice, tends to highlight function words and discourse markers.
LIME
LIME (Local Interpretable Model-agnostic Explanations)
Assumes the model is approximately linear in a small neighborhood of the input. Generates perturbed inputs by randomly deleting words, records prediction changes, then fits a sparse linear surrogate model. The surrogate's weights serve as feature importance scores.
Fast and model-agnostic, but deletion changes the sequence length and shifts positions, which can be problematic for transformers. Tends to highlight content words rather than structural ones.
The lab requires both SHAP and LIME. They highlight different types of words because their perturbation mechanisms differ (masking vs deletion), so words flagged by both methods are more trustworthy. Only run interpretation on well-predicted voxels: interpreting a bad model tells you nothing (this is the "P" in PCS).
The Wrapper Pattern (how to make SHAP/LIME work here)
SHAP and LIME expect a function f(text) โ prediction, but our model is a ridge regression on delayed, downsampled embeddings. The trick is to wrap the entire pipeline in a class so SHAP/LIME can perturb the raw text and observe how predictions change.
class fMRI_Predictor:
def __init__(self, ridge_model, tokenizer, bert_model,
data_times, tr_times, voxel_index):
# Store all pipeline components
...
def predict(self, texts):
# texts: list of strings (perturbed versions from SHAP/LIME)
predictions = []
for text in texts:
words = text.split()
embeddings = get_bert_embeddings(words) # re-embed
downsampled = lanczos_downsample(embeddings)
delayed = make_delayed(downsampled, [1,2,3,4])
pred = self.ridge_model.predict(delayed)
predictions.append(pred[:, self.voxel_index].mean())
return np.array(predictions)
Then hand the wrapper to the explainers:
# SHAP: masks words with [MASK], preserving position
masker = shap.maskers.Text(r"\s+", mask_token="[MASK]")
explainer = shap.Explainer(predictor.predict, masker)
shap_values = explainer([text_input])
# LIME: deletes words, fits local linear surrogate
lime_explainer = LimeTextExplainer()
exp = lime_explainer.explain_instance(
text_input, predictor.predict, num_features=10)
SHAP replaces words with [MASK], so BERT still sees a token at that position. LIME deletes words entirely, changing the sequence length. This is why they tend to flag different words: SHAP picks up positional/structural importance, LIME picks up semantic content.
Practical note: this is slow because every perturbation requires a full forward pass through BERT + the preprocessing pipeline. For LIME with num_samples=100, that's 100 BERT forward passes per voxel. Run this on GPU via sbatch, not the login node.
PCS Framework
Predictability
Does the model predict well on held-out data? Evaluate via CC on test stories. Only interpret voxels where the model actually predicts well: the "P" in PCS is a prerequisite for meaningful "S" and interpretation.
Computability
Can you actually compute the answer? Memory management on Bridges-2, efficient ridge regression, managing 50GB of fMRI data: these are computability challenges. If the kernel crashes, you have a computability problem.
Stability
Do conclusions change under perturbations? Several levels of stability to test:
Input stability (embedding robustness): perturb the story text (typos, synonym substitution, random word replacement) and measure how much the embeddings change. Static embeddings (Word2Vec, GloVe) are robust but insensitive; BERT is sensitive but captures real semantic shifts.
Inference stability (interpretation robustness): shift the analysis time window slightly (e.g., TR 100-120 vs TR 102-122, a ~4 second shift) and compare the top SHAP/LIME words. Measure overlap with Jaccard similarity:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where $A$ and $B$ are the top-10 word sets from each window. High Jaccard (e.g., $\geq 0.5$) means the interpretation is stable; low Jaccard means the "important words" are artifacts of the exact window choice.
Cross-subject stability: do the same voxel regions show high CC in both subjects? (Limited by the fact that voxels don't map one-to-one across subjects.)
Cross-embedding stability: do different embedding methods identify the same top voxels? Measure via Jaccard on top-5% voxel sets across methods.
Data Summary
Subject 2
Subject 3
Story files
101 (.npy)
101 (.npy)
Voxels (V)
94,251
95,556
Data type
float64
float64
Directory size
~25 GB
~25 GB
Per-story file size
~174 MB
~177 MB
Timepoints per story (example)
241
241
Transcripts:raw_text.pkl contains 109 stories (superset of fMRI stories). Each story is a DataSequence object with: word tokens (data), word onset times (data_times), and TR times (tr_times). Loading requires ridge_utils on the Python path.
Memory math: full Subject 2 in float64 $\approx$ 18 GB; in float32 $\approx$ 9 GB; single story in float32 $\approx$ 87 MB.
Infrastructure
Bridges-2 (PSC)
Pittsburgh Supercomputing Center HPC cluster. Data at: /ocean/projects/mth250011p/shared/215a/final_project/data. Use /ocean for storage (not your home directory, which has limited quota). Shared class storage: 512 GB (provisioning an additional 512 GB).
SCF (Berkeley Stats)
Stats Computing Facility, local alternative. Data mirrored at: /scratch/users/s214/lab3. Use /scratch/users/your_username for your working files. Some students have reported permission issues accessing the shared data path.
HuggingFace
100GB free storage, useful for sharing large files (embeddings, trained models) within your group. Can be accessed programmatically from either cluster.
Compute Rules
Do not run jobs on login nodes. This slows down the entire cluster for everyone. Submit jobs via sbatch or use an interactive JupyterHub session. Request memory explicitly (e.g., --mem=64G).
Only use GPUs for BERT fine-tuning and embedding extraction. Ridge regression and scikit-learn operations run on CPU only and won't benefit from a GPU. Don't hold a GPU allocation while running ridge fits.
Common Pain Points
Kernel crashes: usually from running on the login node or insufficient memory allocation
Memory: cast to float32 (halves memory), use memory-mapped loading (np.load(..., mmap_mode='r')), process voxels in batches. Bag-of-words embeddings are sparse (scipy.sparse), but Word2Vec, GloVe, and BERT embeddings are dense
Storage: don't store unnecessarily large files. Compress where possible
Permissions: ensure GitHub org and classroom invitations are accepted
Timeline & Deadlines
Milestone
Due Date
What's Expected
Lab 3.1 check-in
April 10, 23:59
1-2 page PDF: BoW, Word2Vec, GloVe embeddings โ ridge โ CC results. At least one figure. Collaboration plan