Using ResNet50 to create a [w, h, f] feature array
I'm trying to implement this paper, but there is one part I'm not following. It wants me to use ResNet50 to extract features from an image, but it says the extracted features will have dimension [w, h, f]. Everything I'm seeing with ResNet50, though, gives me back a tensor of shape [f] (as in, it turns my whole image into features rather than my pixels into features). Am I reading this wrong, or do I just not understand what I'm supposed to be doing with ResNet50?

Relevant quotes from the paper:

"We obtain an intermediate visual feature representation Fc of size f. We use the ResNet50 [26] as our backbone convolutional architecture."

"In a first step, the three-dimensional feature Fc is reshaped into a two-dimensional feature by keeping its width, i.e. obtaining a feature shape (f × h, w)."
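First install the timm and torch Python packages via pip, then create the model and load the pre-trained weights:

    import timm
    import torch

    # features_only=True makes the model return a list of feature maps
    # (one per stage) instead of classification logits.
    model = timm.create_model("resnet50", pretrained=True, features_only=True)
    model.eval()

    # The image must be a torch tensor of shape (n_images, channels, height, width),
    # e.g. (1, 3, 224, 224). A random tensor stands in for a real image here.
    image = torch.randn(1, 3, 224, 224)
    features = model(image)

    for feature_map in features:
        print(feature_map.shape)

Because the model returns a list, the shapes have to be printed one feature map at a time. For a 224 × 224 input the last (deepest) feature map has shape (1, 2048, 7, 7): the 2048 channels are f, and the 7 × 7 grid is the h and w of the feature map, not of the input image.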
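To see where the 7 × 7 comes from without downloading any weights, the spatial sizes can be predicted by hand. The following is only a sketch: it assumes the standard ResNet50 layout, in which the five stages downsample the input by cumulative factors of 2, 4, 8, 16 and 32 and output 64, 256, 512, 1024 and 2048 channels. `expected_feature_shapes` is a hypothetical helper written for this illustration, not part of timm or torch.

```python
import math

# Assumed standard ResNet50 layout: cumulative downsampling factor and
# channel count after the stem and after each of the four residual stages.
REDUCTIONS = (2, 4, 8, 16, 32)
CHANNELS = (64, 256, 512, 1024, 2048)


def expected_feature_shapes(height, width, batch=1):
    """Hypothetical helper: predicted [N, C, H, W] shape of each stage output."""
    return [
        # Each stride-2 layer rounds the size up, so the result is ceil(size / r).
        (batch, c, math.ceil(height / r), math.ceil(width / r))
        for c, r in zip(CHANNELS, REDUCTIONS)
    ]


shapes = expected_feature_shapes(224, 224)
for shape in shapes:
    print(shape)
```

The last entry is the [f, h, w] block the paper seems to be describing; a larger input gives a larger h and w while f stays at 2048.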
I didn't read the paper in detail, but when they say [w, h, f] I don't think w and h have to match the width and height of the original image. They likely just mean that if the output of your ResNet after the last convolutional block (before the global pooling and the classifier) is [w, h, f], you reshape it into 2D (making it [f × h, w]) and then pass it through a fully-connected layer to collapse the width. Something like this:

    import torch
    import torch.nn as nn
    import torchvision.models as models

    # Pre-trained weights (on torchvision < 0.13 use pretrained=True instead)
    resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

    # Remove the adaptive average pooling layer and the final fully connected layer
    resnet = nn.Sequential(*list(resnet.children())[:-2])

    # Dummy image of shape [1, 3, 224, 224]
    image = torch.randn(1, 3, 224, 224)
    intermediate_features = resnet(image)  # This will be [1, 2048, 7, 7]

    batch_size, channels, h, w = intermediate_features.size()

    # [1, 14336, 7], where f × h = 2048 × 7 = 14336 and w = 7
    reshaped_features = intermediate_features.view(batch_size, channels * h, w)

    fc_layer = nn.Linear(w, 1)  # This layer reduces the w dimension to 1
    final_output = fc_layer(reshaped_features)  # [1, 14336, 1]
    final_output = final_output.squeeze(-1)  # [1, 14336]
    print(final_output.shape)

(My example also has batch size as a dimension because in the real world you work with batches of examples.)
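To make the "keeping its width" reshape concrete, here is a small sketch that checks which original element ends up where. It assumes PyTorch's channels-first [batch, f, h, w] layout and uses a small random tensor as a stand-in for the backbone output; the sizes are made up for readability and are not taken from the paper.

```python
import torch

# Stand-in for the backbone output in PyTorch's [batch, f, h, w] layout
# (random values; small illustrative sizes, not the real 2048 x 7 x 7).
batch, f, h, w = 2, 5, 3, 4
feats = torch.randn(batch, f, h, w)

# Merge channels and height, keep width: [batch, f, h, w] -> [batch, f*h, w].
merged = feats.reshape(batch, f * h, w)

# Row c*h + i of the merged tensor is row i of channel c in the original.
c, i = 3, 2
assert torch.equal(merged[:, c * h + i, :], feats[:, c, i, :])

# If the paper's [w, h, f] ordering is wanted literally, permute the axes
# (batch stays first): [batch, f, h, w] -> [batch, w, h, f].
whf = feats.permute(0, 3, 2, 1)

print(merged.shape, whf.shape)
```

The reshape only regroups the rows of each channel map and leaves the values untouched. Whether the paper stores features as [w, h, f] or [f, h, w] in memory is not stated in the quotes, so the permute is only needed if downstream code expects that exact ordering.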

