If we were studying visual processing, the matrix x could be implemented as an m x n 2-D matrix, where each element of the matrix corresponds to a grayscale value in a digital photograph of the object that the subject's brain is observing. The size of the photograph is m values by n values.

You are correct, but in practice such a matrix is usually flattened into a vector. This makes the math easier.
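As a minimal sketch of that flattening step (assuming NumPy; the image values here are made up for illustration):

```python
import numpy as np

# Hypothetical 4 x 3 grayscale image, so m = 4, n = 3.
x = np.array([[ 12,  34,  56],
              [ 78,  90, 123],
              [145, 167, 189],
              [201, 223, 245]])

# Flatten the m x n matrix into a length m*n vector (row-major order).
x_vec = x.flatten()

print(x_vec.shape)  # (12,)
```

Row-major order is NumPy's default, so pixel (i, j) ends up at index i*n + j in the vector.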
To capture color, we could create three such matrices, where each contains the values from one channel in an RGB image. So the first matrix contains the red values, the second contains the green values, and the third contains the blue values.

For right now, I'm just dealing with black and white pixels. If we want to use colour in the future, one option would be to just concatenate the three vectors into one. I don't know if this is the best option though.
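A quick sketch of that concatenation option (assuming NumPy; the channel values are made up, and this is just one way to combine channels, not necessarily the best):

```python
import numpy as np

# Hypothetical 2 x 2 RGB image as three channel matrices.
r = np.array([[255,   0], [  0, 128]])
g = np.array([[  0, 255], [  0, 128]])
b = np.array([[  0,   0], [255, 128]])

# Flatten each channel and concatenate into a single vector.
x_vec = np.concatenate([r.flatten(), g.flatten(), b.flatten()])

print(x_vec.shape)  # (12,)
```

For an m x n image this gives a vector of length 3*m*n, with the red values first, then green, then blue.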
The matrix a would capture neural activation values for those populations of neurons relevant to visual processing.

Yes, but keep in mind that a is probably more useful as a theoretical construct, as we have no way of measuring it directly, and depending on how we define these populations we could get something completely different. In this post I described a as a way to obtain the feature space, but depending on the encoding model we could create a completely different feature space. See chapter 4 in the paper for more details. (Also, a is a vector as well.)
Is that a correct assumption? So if the subject started to smell the aromas of a heated lunch wafting in from a nearby break room, this should theoretically have no impact on the values in a. Is that right? If not, how would you handle that?

Well, that aroma, along with millions of other things going on inside someone's head, would just be considered noise, and we would just have to deal with it when decoding. I'm only interested in encoding models because they are important for decoding as well, but I would assume that an encoding model would choose its a with entries only from relevant populations.
The fMRI that gives rise to the values in matrix y is a three-dimensional picture of the brain. How would you propose to represent those values? Would each slice of an MRI scan (a 2-D image) populate pixel values in separate matrices, y1, y2, y3, et cetera?

I'm not actually sure how fMRI data is represented in hardware, except that it's based on voxels. The thing about tensors is that you can always flatten them to vectors, though. :) The downside of flattening is that the structure gets lost, but if your model doesn't exploit that structure anyway, it doesn't really matter. That would definitely be another thing to look into.
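The same flattening trick works on a 3-D voxel volume. A minimal sketch (assuming NumPy; the volume dimensions are made up, and real fMRI data would come from a scanner, not random numbers):

```python
import numpy as np

# Hypothetical fMRI volume: 64 x 64 voxels per slice, 30 slices.
rng = np.random.default_rng(0)
y = rng.random((64, 64, 30))

# Flatten the 3-D tensor into one long vector. This discards the
# spatial arrangement of voxels, which only matters if the model
# tries to exploit that structure.
y_vec = y.reshape(-1)

print(y_vec.shape)  # (122880,)
```

So instead of separate matrices y1, y2, y3, ..., the whole volume becomes a single vector of length 64 * 64 * 30 = 122880, and the slice structure can always be recovered by reshaping back.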
It takes a long time to take an fMRI of the entire brain (let's say something like 10 minutes). I'm assuming that the timestamps of each MRI slice are not what you had in mind for the t time parameters for a, or are they?

Well, fMRI is pretty crappy, but it isn't THAT crappy. https://en.wikipedia.org/wiki/Functional_magnetic_resonance_imaging#Temporal_resolution
Is there any way to mask the fMRI data such that it excludes neural activations not relevant to visual processing (e.g., smelling cooked food, feeling pain from arthritis in one's neck or back while lying on the lab table, auditory cues from the MRI machine making noise)?

That's called removing noise, and it's a pretty difficult task. That said, a neural network is pretty good at picking out just the interesting bits.
Thank you for these questions! They motivate me to write more.
RE: Encoding Model Basics (1/?)