Stat 214 — Lab 3 Reference Guide

Subjects listen to podcast stories while an fMRI scanner records brain activity every 2 seconds at ~94,000 voxels. We build encoding models: text embeddings in, predicted BOLD signal out.

Notation used throughout this guide

Symbol	Meaning	Typical value
$j$	Subject index	$j \in \{2, 3\}$
$s$	Story index	$s = 1, \ldots, 101$
$i$	Voxel index	$i = 1, \ldots, V_j$
$t$	Timepoint index (within a story or stacked)
$W_s$	Number of words in story $s$	$\sim 800$
$p$	Embedding dimension	$300$ (Word2Vec), $768$ (BERT)
$T_s$	Number of fMRI timepoints for story $s$	$\sim 241$
$T$	Total timepoints across all training stories	$\approx 23{,}000$
$V_j$	Number of voxels for subject $j$	$94{,}251$ (subj 2), $95{,}556$ (subj 3)
$E$	Raw word embedding matrix for one story	$\in \mathbb{R}^{W_s \times p}$
$E'$	Downsampled embedding matrix (after alignment)	$\in \mathbb{R}^{T_s \times p}$
$X$	Design matrix (delayed features, stacked across stories)	$\in \mathbb{R}^{T \times 4p}$
$Y$	BOLD response matrix (stacked across stories, one subject)	$\in \mathbb{R}^{T \times V_j}$
$\hat{B}$	Ridge weight matrix (all voxels). Column $i$ is $\hat{\beta}_i$	$\in \mathbb{R}^{4p \times V_j}$
$\hat{\beta}_i$	Weight vector for voxel $i$	$\in \mathbb{R}^{4p}$
$\alpha_{\text{ridge}}$	Ridge regularization parameter	selected via CV
$r_i$	Pearson CC for voxel $i$ on test story

When working within a single subject (as we usually do), we drop the $j$ subscript and write $V$ for $V_j$.

The Full Pipeline

Raw text → Word embeddings → Lanczos downsample → Drop padding rows → Add delays (1-4 TR intervals) → Ridge regression → Predict BOLD

One model per subject. Ridge regression is fit with the same $X$ matrix across all ~94k voxels (different $y$ column per voxel). Evaluation: Pearson correlation between predicted and actual BOLD per voxel on held-out stories.

Dimensions at Each Step (one story, BERT, subject 2)

Step	Object	Shape	Notes
Raw text	Word list	$W_s$ words	e.g., $W_s \approx 800$ for a ~5 min story
Embedding	$E$	$W_s \times p$	$p = 768$ for BERT, $300$ for Word2Vec
Downsample	$E'$	$(T_s + 15) \times p$	Uses all `tr_times` entries (e.g., 256 for adollshouse)
Align to fMRI	$E'$	$T_s \times p$	Drop 5 padding rows at start, 10 at end → matches Y
Add delays	$X_{\text{story}}$	$T_s \times 4p$	$4 \times 768 = 3072$ for BERT
Stack stories	$X$	$T \times 4p$	$T = \sum_s T_s \approx 23{,}000$
BOLD response	$Y$	$T \times V$	$V = 94{,}251$ for subject 2
Ridge solution	$\hat{B}$	$4p \times V$	One weight vector $\hat{\beta}_i$ per voxel (column)
Prediction	$\hat{Y}_{\text{test}}$	$T_{\text{test}} \times V$	Evaluate CC per column

Neuroimaging Terms

fMRI and the BOLD Signal

Voxel A volumetric pixel: a tiny 3D cube of brain tissue (~2mm per side in this experiment). Each subject has approximately 94,000 voxels. Think of each voxel as one spatial "measurement point" in the brain. The data for one story $s$ and one subject $j$ is a matrix $Y \in \mathbb{R}^{T_s \times V_j}$, where $T_s$ is the number of timepoints for that story and $V_j$ is the number of voxels for that subject ($V_2 = 94{,}251$, $V_3 = 95{,}556$). Each subject has a different voxel count because brains differ in size and shape. We fit entirely separate models per subject: no data is shared across subjects.

TR (Repetition Time) The time interval between consecutive fMRI scans: TR $\approx$ 2.0045 seconds in this experiment. Every TR interval, the scanner captures one full-brain snapshot (called a volume or timepoint). A 10-minute story yields roughly 300 timepoints. We use $T_s$ to denote the number of timepoints for story $s$.

Hemodynamic Response

HRF (Hemodynamic Response Function) The BOLD signal does not spike instantly when neurons fire. It peaks approximately 4-6 seconds after the neural event and takes ~15 seconds to return to baseline. This lag is the hemodynamic response function. The pipeline accounts for this by adding time-lagged copies of the features (see Delays).

Consequence for modeling: if a subject hears the word "dog" at time $t$, the brain's BOLD response to "dog" appears at roughly $t + 4$ to $t + 8$ seconds. This is why we need the make_delayed function: it gives the ridge regression access to features from 2, 4, 6, and 8 seconds before each BOLD measurement.

Text Embeddings

An embedding converts a word (or subword) into a numerical vector. This is what allows us to use text as input to a regression model.

Static Embeddings

BERT

BERT Bidirectional Encoder Representations from Transformers. A transformer-based model pretrained via masked language modeling (MLM). Unlike static embeddings, BERT produces contextual representations: the vector for each word depends on the surrounding words. Uses multi-head self-attention to capture relationships between all pairs of words in a sequence.

Masked Language Modeling (MLM) BERT's pretraining objective. 15% of tokens are selected, then: 80% are replaced with [MASK], 10% with a random token, 10% unchanged. The model learns to predict the original tokens from context.

LoRA (Low-Rank Adaptation)

LoRA A parameter-efficient fine-tuning method. In a transformer like BERT, each layer contains large weight matrices $W \in \mathbb{R}^{d_1 \times d_2}$ (e.g., the query projection in attention has $d_1 = d_2 = 768$). Full fine-tuning updates all entries of $W$, which means storing gradients for all ~110M parameters.

The idea behind LoRA: the task-specific update $\Delta W$ probably doesn't need the full rank of the matrix. So LoRA freezes $W$ and learns a low-rank factorization: $$\Delta W = \frac{\gamma}{r} AB^\top, \qquad A \in \mathbb{R}^{d_1 \times r},\; B \in \mathbb{R}^{d_2 \times r}$$ where $r \ll \min(d_1, d_2)$ is the rank (e.g., $r = 4$ or $8$), and $\gamma$ is a scaling factor that controls the update magnitude (not to be confused with $\alpha_{\text{ridge}}$, the ridge regularization parameter). Only $A$ and $B$ are trained; the original $W$ stays frozen.

Low-rank intuition: a full $768 \times 768$ matrix has ~590K parameters per layer. A rank-4 factorization has only $768 \times 4 + 768 \times 4 = 6{,}144$ parameters. The assumption is that the direction BERT needs to shift for this task lives in a low-dimensional subspace, like PCA for weight updates.

Why LoRA for this lab? Full fine-tuning of BERT on ~24k words of story transcripts risks overfitting. LoRA's low parameter count acts as implicit regularization while still allowing the model to adapt its contextual representations to the narrative domain. Students fine-tune with LoRA on the MLM objective (same pretraining task, different data), then extract embeddings from the adapted model.

The Modeling Pipeline

Illustrated Pipeline (with dimensions)

Step 1

Raw text (one story $s$)

the quick brown ... end

$W_s \approx 800$ words with onset timestamps

↓ embed (Word2Vec / GloVe / BERT)

Step 2

Word embeddings $E$

$E \in \mathbb{R}^{W_s \times p}$ (e.g., $800 \times 768$) — each row = one word's embedding vector

↓ Lanczos downsample to scanner acquisition rate

Step 3

Downsampled $E'$

$E' \in \mathbb{R}^{(T_s+15) \times p}$ (e.g., $256 \times 768$) — includes padding rows

↓ drop 5 padding rows at start + 10 at end → $T_s \times p$, then make_delayed([1,2,3,4])

Step 4

Delayed features $X_{\text{story}}$

$X_{\text{story}} \in \mathbb{R}^{T_s \times 4p}$ (e.g., $241 \times 3072$) — 4 colors = 4 delay copies

↓ stack all training stories

Step 5

Combine features with fMRI response (one subject)

$X \in \mathbb{R}^{T \times 4p}$

paired with

$Y \in \mathbb{R}^{T \times V_j}$ ($V_j = 94{,}251$)

↓ ridge: $\hat{B} = (X^\top X + \alpha_{\text{ridge}} I)^{-1} X^\top Y$ — inverted once, applied to all $V_j$ columns

Step 6

Weight matrix $\hat{B}$

$\hat{B} \in \mathbb{R}^{4p \times V_j}$ — each column $\hat{\beta}_i$ = one voxel's learned weights

↓ predict on test story: $\hat{Y}_{\text{test}} = X_{\text{test}} \hat{B}$, evaluate $r_i = \mathrm{corr}(y_i, \hat{y}_i)$ per voxel

Step 7

Pearson CC per voxel ($r_i$)

Most voxels have $r_i \approx 0$ (gray). Language-responsive voxels reach $r_i \sim 0.3\text{-}0.5$ (gold).

Step 1: Lanczos Downsampling

Words occur at irregular intervals (~3 per second on average, but varying). The fMRI scanner captures snapshots at regular intervals (every TR $\approx$ 2 seconds). We need to convert from "one embedding per word, at irregular times" to "one embedding per timepoint, at regular times," which means interpolating onto a new time grid.

What downsampling does Think of each word embedding as a point in time: word $k$ has embedding vector $e_k \in \mathbb{R}^p$ at onset time $t_k^{\text{word}}$. We want an embedding vector at each scanner timepoint $t_n^{\text{scan}}$ (for $n = 1, \ldots, T_s$). The Lanczos filter computes a weighted average of nearby word embeddings: $$E'(t_n^{\text{scan}}) = \sum_{k=1}^{W_s} w_{nk} \cdot e_k$$ where the weights $w_{nk}$ depend on how close word $k$'s onset time is to scanner timepoint $n$. Words spoken near $t_n^{\text{scan}}$ get high weight; words spoken far from $t_n^{\text{scan}}$ get near-zero weight. The Lanczos kernel (a windowed sinc function) determines the exact weights, with a window parameter controlling how many nearby words contribute (default: 3 lobes).

Dimensions: story $s$ has $W_s$ words. The tr_times array has $T_s + 15$ entries (including 5 padding entries before the story and 10 after), so the downsampled matrix $E' \in \mathbb{R}^{(T_s + 15) \times p}$ has more rows than Y. The alignment step (Step 3) drops the padding rows to get $E' \in \mathbb{R}^{T_s \times p}$.

Example: adollshouse has $W_s = 1656$ words and tr_times has 256 entries, so downsampling gives $1656 \times 768 \to 256 \times 768$. After dropping the 5 + 10 padding rows (Step 3), you get $241 \times 768$, matching Y. Each row is a weighted blend of nearby word embeddings, with words closest in time weighted most heavily.

Because each timepoint's embedding is a blend of several words, the model never sees a "pure" single-word signal. This matters for interpretation: when SHAP says "the word 'dog' is important for this voxel," it means that removing 'dog' changes the blended embeddings at the timepoints where 'dog' contributes, which in turn changes the predicted BOLD signal. The word-level attribution works because the SHAP/LIME wrapper re-runs the full pipeline (including downsampling) for each text perturbation, so the effect of removing a single word propagates naturally through the blending step.

Step 2: Temporal Delays

Due to the HRF, the brain's response to a word heard at time $t$ appears in the fMRI data at approximately $t + 4$ to $t + 8$ seconds. The function make_delayed(stim, delays=[1,2,3,4]) accounts for this.

Delayed Feature Matrix Starting from the downsampled embedding $E' \in \mathbb{R}^{T_s \times p}$, create shifted copies at delays of 1, 2, 3, and 4 TR intervals and concatenate horizontally: $$X_{\text{story}} = [E'_{t-1},\; E'_{t-2},\; E'_{t-3},\; E'_{t-4}] \in \mathbb{R}^{T_s \times 4p}$$ Row $t$ contains the embedding features from 1, 2, 3, and 4 TR intervals before timepoint $t$. Early rows are zero-padded (there are no words before the story starts).

Example: Word2Vec ($p = 300$) becomes $T_s \times 1200$. BERT ($p = 768$) becomes $T_s \times 3072$.

Why these specific delays? The HRF peaks at 4-6 seconds post-stimulus. With TR $\approx$ 2s, delays of 1-4 TR intervals correspond to 2, 4, 6, and 8 seconds. The ridge regression learns a weight for each delay at each voxel, effectively fitting the HRF shape per voxel. Some voxels respond faster (more weight on delay 1-2), others slower (more weight on delay 3-4).

Step 3: Aligning X to Y

The scanner started ~10 seconds before each story (giving 5 TRs with negative timestamps) and continued after the story ended. The raw data had 256 timepoints for adollshouse, but the Y data provided to you has only 241 rows: 15 timepoints were removed during preprocessing because fMRI responses at the edges of stories are noisier due to scanner transients and detrending artifacts.

When you downsample your embeddings using all 256 tr_times entries, you get 256 rows. You need to drop the same 15 rows that were removed from Y so the dimensions match. The question is which 15.

The Huth lab's own tutorial code uses [5+trim:-trim] with trim=5, giving [10:-5]: drop 10 from the start (5 pre-story padding + 5 noise buffer) and 5 from the end (noise buffer). The lab instructions say "trim the first 5 and last 10," which reverses this, but the [10:-5] split has stronger reference support and tends to give better correlations in practice. Either way, you need to remove exactly 15 rows total, and document your choice.

If you want to trim a few additional TRs from both X and Y beyond these 15 (e.g., because the BOLD signal still shows some edge effects in the kept region), that's a defensible judgment call. The edge artifacts are gradual rather than sharp, so the pre-trimming removed the worst of it but didn't eliminate residual effects entirely.

Stacking stories: after alignment, the per-story matrices are vertically concatenated across all training stories to form the full design matrix $X \in \mathbb{R}^{T \times 4p}$ and response matrix $Y \in \mathbb{R}^{T \times V}$, where $T = \sum_s T_s \approx 23{,}000$ is the total number of timepoints across all stories.

Step 4: Ridge Regression

The approach is sometimes called "mass-univariate" in the neuroimaging literature: we fit one independent regression model per voxel (~94k models), all sharing the same feature matrix $X$ but with different target columns $y_i$. Here's the notation.

Setup and Notation Fix a subject $j \in \{2, 3\}$. Let $p$ denote the embedding dimension (e.g., $p = 300$ for Word2Vec, $p = 768$ for BERT). After applying delays with lags 1 through 4, the feature dimension becomes $4p$. Let $V_j$ denote the number of voxels for subject $j$ ($V_2 = 94{,}251$, $V_3 = 95{,}556$). Let $T$ denote the total number of timepoints across all training stories.

Design matrix (shared across voxels): $$X \in \mathbb{R}^{T \times 4p}$$ Row $t$ of $X$ is $x(t)^\top$, the delayed embedding vector at time $t$. This matrix is the same for every voxel: it depends only on the text and the embedding method, not on which voxel we are predicting.

Response matrix (one column per voxel): $$Y \in \mathbb{R}^{T \times V_j}$$ Column $i$ of $Y$ is the BOLD time series $y_i = (y_i(1), \ldots, y_i(T))^\top$ at voxel $i$. This is the fMRI data.

The Model (per voxel) For each voxel $i = 1, \ldots, V_j$, we fit a separate ridge regression: $$\hat{\beta}_i = \arg\min_{\beta \in \mathbb{R}^{4p}} \|y_i - X\beta\|_2^2 + \alpha_{\text{ridge}} \|\beta\|_2^2$$ The predicted BOLD signal at voxel $i$ and time $t$ is: $$\hat{y}_i(t) = x(t)^\top \hat{\beta}_i$$ Each voxel gets its own weight vector $\hat{\beta}_i \in \mathbb{R}^{4p}$, learned from the same design matrix $X$ but a different target column $y_i$. Intuitively, the weights encode which embedding features (and at which time lags) are predictive of activity at that specific brain location.

Efficient Joint Solution Although each voxel has its own $\hat{\beta}_i$, we can solve all $V_j$ regressions at once: $$\hat{B} = (X^\top X + \alpha_{\text{ridge}} I)^{-1} X^\top Y \in \mathbb{R}^{4p \times V_j}$$ Column $i$ of $\hat{B}$ is $\hat{\beta}_i$. The important thing: $X^\top X + \alpha_{\text{ridge}} I$ is inverted only once (it doesn't depend on $i$), then multiplied by each column of $Y$. That's why fitting ~94k regressions is feasible.

Cross-validation: leave-one-story-out (LOSO). Train on all stories except one, predict on the held-out story. The regularization parameter $\alpha_{\text{ridge}}$ is selected via grid search over the LOSO folds.

Incremental LOSO trick Don't recompute $X^\top X$ and $X^\top Y$ from scratch for every fold. Precompute the global sums across all stories: $$X^\top X = \sum_s X_s^\top X_s, \qquad X^\top Y = \sum_s X_s^\top Y_s$$ Then for fold $s$ (holding out story $s$), the training-set quantities are just: $$(X^\top X - X_s^\top X_s + \alpha_{\text{ridge}} I)^{-1}(X^\top Y - X_s^\top Y_s)$$ One matrix inverse per fold per $\alpha$ value, instead of rebuilding $X^\top X$ from ~100 stories each time. Same idea applies across $\alpha$ values: the subtraction is the same for all $\alpha$, only the diagonal shift changes.

Evaluation On a held-out test story with $T_{\text{test}}$ timepoints, the predicted BOLD at voxel $i$ is $\hat{y}_i^{\text{test}} = X_{\text{test}} \hat{\beta}_i \in \mathbb{R}^{T_{\text{test}}}$. Model performance at voxel $i$ is the Pearson correlation: $$r_i = \mathrm{corr}(y_i^{\text{test}}, \hat{y}_i^{\text{test}})$$ We report aggregate statistics across all voxels: mean CC, median CC, top 1% CC, and top 5% CC. Most voxels will have $r_i \approx 0$ (they are not involved in language processing), while a subset of language-responsive voxels will show meaningful correlations.

Train/Test Split Strategies

The lab instructions say "split the stories into a training and test set" but leave the details to you. There are several reasonable approaches, and the choice matters for how you report and compare results.

Fixed holdout: reserve a small number of stories (e.g., 2-5) as a test set, train on the rest. Simple, but your CC estimates depend on which specific stories you held out. A long, complex story might be harder to predict than a short simple one, so the particular test set can shift your numbers noticeably.

Leave-one-story-out (LOSO): hold out each story in turn, train on the remaining ~100, predict the held-out story. This gives you a CC distribution across held-out stories rather than a single number, which is more informative. It's also what Huth et al. use for $\alpha_{\text{ridge}}$ selection. The downside is computational cost (101 ridge fits per embedding method), though the incremental trick mentioned above helps a lot.

Multiple random splits: randomly partition stories into train/test several times, fit and evaluate each split, report averaged CCs. This gives you error bars on your evaluation metrics, which strengthens any comparison between embedding methods. If BERT beats GloVe on one split but not another, that's important to know.

Whatever you choose, use the same splits across all embedding methods so comparisons are fair. And if you report a single number (e.g., "mean CC = 0.12"), be clear about what it's averaging over: voxels, stories, splits, or some combination.

Do not use the same held-out stories for both hyperparameter selection ($\alpha_{\text{ridge}}$) and final model evaluation. If you select $\alpha$ using LOSO across all stories, then report CC on one of those same held-out folds, you've leaked information. Either reserve a separate final test set that's never touched during CV, or be explicit that your reported CCs are cross-validated estimates, not independent test performance.

Fit entirely separate models per subject. Subjects have different numbers of voxels and different brain geometry, so sharing data across subjects would introduce data leakage and wouldn't make sense anatomically.

Memory is the main computational bottleneck. The $X$ matrix ($T \times 4p$) is manageable, but $Y$ ($T \times 94{,}000$) in float64 is ~18 GB across all stories. Practical solutions: cast to float32 (halves memory), use memory-mapped loading (np.load(..., mmap_mode='r')), and process voxels in batches rather than all at once. Students who completed 3.1 should already have a working pipeline for this.

Interpretation (Lab 3.3)

The ridge weights $\hat{\beta}_i \in \mathbb{R}^{4p}$ don't tell you "which words matter" directly, because each BERT embedding dimension doesn't correspond to anything human-readable. (With one-hot bag-of-words you'd get this for free, but we lose it with distributed representations.) To get word-level interpretations, we use post-hoc attribution methods.

SHAP

SHAP (SHapley Additive exPlanations) Grounded in cooperative game theory. Assigns each input feature (word) a contribution to the model's prediction based on Shapley values: the average marginal contribution of the feature across all possible coalitions. For BERT models, uses a partition explainer that replaces words with [MASK] tokens, preserving positional structure while removing semantic content.

Theoretically grounded (additive, symmetric), but computationally expensive. In practice, tends to highlight function words and discourse markers.

LIME

LIME (Local Interpretable Model-agnostic Explanations) Assumes the model is approximately linear in a small neighborhood of the input. Generates perturbed inputs by randomly deleting words, records prediction changes, then fits a sparse linear surrogate model. The surrogate's weights serve as feature importance scores.

Fast and model-agnostic, but deletion changes the sequence length and shifts positions, which can be problematic for transformers. Tends to highlight content words rather than structural ones.

The lab requires both SHAP and LIME. They highlight different types of words because their perturbation mechanisms differ (masking vs deletion), so words flagged by both methods are more trustworthy. Only run interpretation on well-predicted voxels: interpreting a bad model tells you nothing (this is the "P" in PCS).

The Wrapper Pattern (how to make SHAP/LIME work here)

SHAP and LIME expect a function f(text) → prediction, but our model is a ridge regression on delayed, downsampled embeddings. The trick is to wrap the entire pipeline in a class so SHAP/LIME can perturb the raw text and observe how predictions change.

SHAP replaces words with [MASK], so BERT still sees a token at that position. LIME deletes words entirely, changing the sequence length. This is why they tend to flag different words: SHAP picks up positional/structural importance, LIME picks up semantic content.

Practical note: this is slow because every perturbation requires a full forward pass through BERT + the preprocessing pipeline. For LIME with num_samples=100, that's 100 BERT forward passes per voxel. Run this on GPU via sbatch, not the login node.

PCS Framework

Stability Do conclusions change under perturbations? Several levels of stability to test:

Input stability (embedding robustness): perturb the story text (typos, synonym substitution, random word replacement) and measure how much the embeddings change. Static embeddings (Word2Vec, GloVe) are robust but insensitive; BERT is sensitive but captures real semantic shifts.

Inference stability (interpretation robustness): shift the analysis time window slightly (e.g., TR 100-120 vs TR 102-122, a ~4 second shift) and compare the top SHAP/LIME words. Measure overlap with Jaccard similarity: $$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$ where $A$ and $B$ are the top-10 word sets from each window. High Jaccard (e.g., $\geq 0.5$) means the interpretation is stable; low Jaccard means the "important words" are artifacts of the exact window choice.

Cross-subject stability: do the same voxel regions show high CC in both subjects? (Limited by the fact that voxels don't map one-to-one across subjects.)

Cross-embedding stability: do different embedding methods identify the same top voxels? Measure via Jaccard on top-5% voxel sets across methods.

Data Summary

	Subject 2	Subject 3
Story files	101 (.npy)	101 (.npy)
Voxels (V)	94,251	95,556
Data type	float64	float64
Directory size	~25 GB	~25 GB
Per-story file size	~174 MB	~177 MB
Timepoints per story (example)	241	241

Transcripts: raw_text.pkl contains 109 stories (superset of fMRI stories). Each story is a DataSequence object with: word tokens (data), word onset times (data_times), and TR times (tr_times). Loading requires ridge_utils on the Python path.

Memory math: full Subject 2 in float64 $\approx$ 18 GB; in float32 $\approx$ 9 GB; single story in float32 $\approx$ 87 MB.

Infrastructure

Bridges-2 (PSC)

Pittsburgh Supercomputing Center HPC cluster. Data at: /ocean/projects/mth250011p/shared/215a/final_project/data. Use /ocean for storage (not your home directory, which has limited quota). Shared class storage: 512 GB (provisioning an additional 512 GB).

SCF (Berkeley Stats)

Stats Computing Facility, local alternative. Data mirrored at: /scratch/users/s214/lab3. Use /scratch/users/your_username for your working files. Some students have reported permission issues accessing the shared data path.

HuggingFace

100GB free storage, useful for sharing large files (embeddings, trained models) within your group. Can be accessed programmatically from either cluster.

Compute Rules

Do not run jobs on login nodes. This slows down the entire cluster for everyone. Submit jobs via sbatch or use an interactive JupyterHub session. Request memory explicitly (e.g., --mem=64G).

Only use GPUs for BERT fine-tuning and embedding extraction. Ridge regression and scikit-learn operations run on CPU only and won't benefit from a GPU. Don't hold a GPU allocation while running ridge fits.

Common Pain Points

Timeline & Deadlines

Quick Reference

Milestone	Due Date	What's Expected
Lab 3.1 check-in	April 10, 23:59	1-2 page PDF: BoW, Word2Vec, GloVe embeddings → ridge → CC results. At least one figure. Collaboration plan
Lab 3.2 check-in	April 24, 23:59	Pretrained BERT + fine-tuned BERT (LoRA) → ridge → comparison with 3.1 methods
Final report	May 8, 23:59	Full 20-page report covering all embeddings, modeling, SHAP/LIME interpretation, stability analysis. Push code to GitHub

Key Dimensions

TR: ~2.0045 seconds

Voxels per subject: ~94,000-96,000 (subject-native space)

Stories: 101 per subject (109 transcripts total)

Total words: ~23,800

Word2Vec dim: 300 (Google News pretrained)

GloVe dim: 50/100/200/300 (depending on version)

BERT hidden dim: 768 (bert-base-uncased)

After delays (4×): embedding dim × 4

Key Code Functions (preprocessing.py)

downsample_word_vectors(): Lanczos-filter word embeddings from word rate to TR rate

make_delayed(stim, delays): create time-lagged feature copies and concatenate horizontally

lanczosinterp2D(): low-level Lanczos interpolation (called by downsample)

Overview