Please imagine an autoencoder with the following architecture:
1. a convolution layer without padding takes an intensity image I of size WXH as input and generates 20 outputs (feature maps) using 20 sets of weights/biases. So each of those 20 outputs will be smaller than WXH
2. (optional) a pooling layer without padding generates 20 smaller 2D arrays, which are then vectorised
3. a fully connected layer generates another vector, which may or may not have the same size as the vector above;
% #1, #2, and #3 belong to the encoder part
4. another fully connected layer mirroring #3, that is, the sizes of its input and output are those of the output and input of #3, respectively. Its output is then split into sections that are stacked into 20 2D arrays
5. (optional) an unpooling layer mirroring #2 in the same way. It again outputs 20 2D arrays;
6. a transposed convolution layer mirroring #1, which applies another 20 sets of weights/biases to the 20 2D arrays (one set per array). It finally generates 20 WXH-sized outputs.
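
To make the sizes concrete, here is a minimal sketch of the shape bookkeeping for steps #1 to #6. The 28X28 input, kernel size 5, pooling window 2, and code length 128 are assumptions chosen only for illustration, and the helper names are hypothetical. It also assumes stride 1 for the convolution and non-overlapping pooling windows.

```python
def conv_out(size, kernel, stride=1):
    """Output length along one axis of an unpadded convolution or pooling."""
    return (size - kernel) // stride + 1


def transposed_conv_out(size, kernel, stride=1):
    """Output length along one axis of an unpadded transposed convolution."""
    return (size - 1) * stride + kernel


def trace_shapes(w, h, maps=20, kernel=5, pool=2, code=128):
    """Return (stage, shape) pairs for steps #1 to #6 of the autoencoder."""
    # Encoder: #1 convolution, #2 pooling and vectorisation, #3 fully connected
    cw, ch = conv_out(w, kernel), conv_out(h, kernel)
    pw, ph = conv_out(cw, pool, pool), conv_out(ch, pool, pool)
    flat = maps * pw * ph
    # Decoder: #4 fully connected and stacking, #5 unpooling, #6 transposed conv
    uw, uh = pw * pool, ph * pool
    tw, th = transposed_conv_out(uw, kernel), transposed_conv_out(uh, kernel)
    return [
        ("input", (w, h)),
        ("1 conv", (maps, cw, ch)),
        ("2 pool", (maps, pw, ph)),
        ("2 vectorised", (flat,)),
        ("3 fc", (code,)),
        ("4 fc", (flat,)),
        ("4 stacked", (maps, pw, ph)),
        ("5 unpool", (maps, uw, uh)),
        ("6 transposed conv", (maps, tw, th)),
    ]


# Assumed example input size; any size that pools evenly works the same way.
for stage, shape in trace_shapes(28, 28):
    print(f"{stage:>18}: {shape}")
```

With one set of weights per array in #6, the last stage still has 20 maps, each the size of the input, which is the situation the question below is about.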
So my question is: how should those 20 WXH-sized outputs be combined to give a single WXH-sized output of the same size as the intensity image I? Or should the transposed convolution layer instead generate 20 outputs of smaller size (lower resolution)?
Thanks a lot!