Paper Translation: Feature Pyramid Networks for Object Detection

Technology | 2025-08-29

Abstract:

Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.

1 Introduction

Recognizing objects at vastly different scales is a fundamental challenge in computer vision. Feature pyramids built upon image pyramids (for short we call these featurized image pyramids) form the basis of a standard solution [1] (Fig. 1(a)). These pyramids are scale-invariant in the sense that an object's scale change is offset by shifting its level in the pyramid. Intuitively, this property enables a model to detect objects across a large range of scales by scanning the model over both positions and pyramid levels.

Featurized image pyramids were heavily used in the era of hand-engineered features [5, 25]. They were so critical that object detectors like DPM [7] required dense scale sampling to achieve good results (e.g., 10 scales per octave). For recognition tasks, engineered features have largely been replaced with features computed by deep convolutional networks (ConvNets) [19, 20]. Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b)). But even with this robustness, pyramids are still needed to get the most accurate results. All recent top entries in the ImageNet [33] and COCO [21] detection challenges use multi-scale testing on featurized image pyramids (e.g., [16, 35]). The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels.

Nevertheless, featurizing each level of an image pyramid has obvious limitations. Inference time increases considerably (e.g., by four times [11]), making this approach impractical for real applications. Moreover, training deep networks end-to-end on an image pyramid is infeasible in terms of memory, and so, if exploited, image pyramids are used only at test time [15, 11, 16, 35], which creates an inconsistency between train/test-time inference. For these reasons, Fast and Faster R-CNN [11, 29] opt to not use featurized image pyramids under default settings.

However, image pyramids are not the only way to compute a multi-scale feature representation. A deep ConvNet computes a feature hierarchy layer by layer, and with subsampling layers the feature hierarchy has an inherent multi-scale, pyramidal shape. This in-network feature hierarchy produces feature maps of different spatial resolutions, but introduces large semantic gaps caused by different depths. The high-resolution maps have low-level features that harm their representational capacity for object recognition.

Figure 1. (a) Using an image pyramid to build a feature pyramid. Features are computed on each of the image scales independently, which is slow. (b) Recent detection systems have opted to use only single-scale features for faster detection. (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid. (d) Our proposed Feature Pyramid Network (FPN) is fast like (b) and (c), but more accurate. In this figure, feature maps are indicated by blue outlines, and thicker outlines denote semantically stronger features.

The Single Shot Detector (SSD) [22] is one of the first attempts at using a ConvNet's pyramidal feature hierarchy as if it were a featurized image pyramid (Fig. 1(c)). Ideally, the SSD-style pyramid would reuse the multi-scale feature maps from different layers computed in the forward pass and thus come free of cost. But to avoid using low-level features, SSD forgoes reusing already computed layers and instead builds the pyramid starting from high up in the network (e.g., conv4_3 of VGG nets [36]) and then by adding several new layers. Thus it misses the opportunity to reuse the higher-resolution maps of the feature hierarchy. We show that these are important for detecting small objects.

The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet's feature hierarchy while creating a feature pyramid that has strong semantics at all scales. To achieve this goal, we rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections (Fig. 1(d)). The result is a feature pyramid that has rich semantics at all levels and is built quickly from a single input image scale. In other words, we show how to create in-network feature pyramids that can be used to replace featurized image pyramids without sacrificing representational power, speed, or memory.

Similar architectures adopting top-down and skip connections are popular in recent research [28, 17, 8, 26]. Their goals are to produce a single high-level feature map of a fine resolution on which the predictions are to be made (Fig. 2 top). On the contrary, our method leverages the architecture as a feature pyramid where predictions (e.g., object detections) are independently made on each level (Fig. 2 bottom). Our model echoes a featurized image pyramid, which has not been explored in these works.

We evaluate our method, called a Feature Pyramid Network (FPN), in various systems for detection and segmentation [11, 29, 27]. Without bells and whistles, we report a state-of-the-art single-model result on the challenging COCO detection benchmark [21] simply based on FPN and a basic Faster R-CNN detector [29], surpassing all existing heavily-engineered single-model entries of competition winners. In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16]. Our method is also easily extended to mask proposals and improves both instance segmentation AR and speed over state-of-the-art methods that heavily depend on image pyramids.

In addition, our pyramid structure can be trained end-to-end with all scales and is used consistently at train/test time, which would be memory-infeasible using image pyramids. As a result, FPNs are able to achieve higher accuracy than all existing state-of-the-art methods. Moreover, this improvement is achieved without increasing testing time over the single-scale baseline. We believe these advances will facilitate future research and applications. Our code will be made publicly available.

2 Related Work

Hand-engineered features and early neural networks. SIFT features [25] were originally extracted at scale-space extrema and used for feature point matching. HOG features [5], and later SIFT features as well, were computed densely over entire image pyramids. These HOG and SIFT pyramids have been used in numerous works for image classification, object detection, human pose estimation, and more. There has also been significant interest in computing featurized image pyramids quickly. Dollár et al. [6] demonstrated fast pyramid computation by first computing a sparsely sampled (in scale) pyramid and then interpolating missing levels. Before HOG and SIFT, early work on face detection with ConvNets [38, 32] computed shallow networks over image pyramids to detect faces across scales.

Deep ConvNet object detectors. With the development of modern deep ConvNets [19], object detectors like OverFeat [34] and R-CNN [12] showed dramatic improvements in accuracy. OverFeat adopted a strategy similar to early neural network face detectors by applying a ConvNet as a sliding window detector on an image pyramid. R-CNN adopted a region proposal-based strategy [37] in which each proposal was scale-normalized before classifying with a ConvNet. SPPnet [15] demonstrated that such region-based detectors could be applied much more efficiently on feature maps extracted on a single image scale. Recent and more accurate detection methods like Fast R-CNN [11] and Faster R-CNN [29] advocate using features computed from a single scale, because it offers a good trade-off between accuracy and speed. Multi-scale detection, however, still performs better, especially for small objects.

Methods using multiple layers. A number of recent approaches improve detection and segmentation by using different layers in a ConvNet. FCN [24] sums partial scores for each category over multiple scales to compute semantic segmentations. Hypercolumns [13] uses a similar method for object instance segmentation. Several other approaches (HyperNet [18], ParseNet [23], and ION [2]) concatenate features of multiple layers before computing predictions, which is equivalent to summing transformed features. SSD [22] and MS-CNN [3] predict objects at multiple layers of the feature hierarchy without combining features or scores.

There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation. Ghiasi et al. [8] present a Laplacian pyramid presentation for FCNs to progressively refine segmentation. Although these methods adopt architectures with pyramidal shapes, they are unlike featurized image pyramids [5, 7, 34] where predictions are made independently at all levels, see Fig. 2. In fact, for the pyramidal architecture in Fig. 2 (top), image pyramids are still needed to recognize objects across multiple scales [28].

3 Feature Pyramid Networks

Our goal is to leverage a ConvNet's pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout. The resulting Feature Pyramid Network is general-purpose, and in this paper we focus on sliding window proposers (Region Proposal Network, RPN for short) [29] and region-based detectors (Fast R-CNN) [11]. We also generalize FPNs to instance segmentation proposals in Sec. 6.

Our method takes a single-scale image of an arbitrary size as input, and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion. This process is independent of the backbone convolutional architectures (e.g., [19, 36, 16]), and in this paper we present results using ResNets [16]. The construction of our pyramid involves a bottom-up pathway, a top-down pathway, and lateral connections, as introduced in the following.

Bottom-up pathway. The bottom-up pathway is the feed-forward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. There are often many layers producing output maps of the same size and we say these layers are in the same network stage. For our feature pyramid, we define one pyramid level for each stage. We choose the output of the last layer of each stage as our reference set of feature maps, which we will enrich to create our pyramid. This choice is natural since the deepest layer of each stage should have the strongest features.

Specifically, for ResNets [16] we use the feature activations output by each stage's last residual block. We denote the output of these last residual blocks as {C2, C3, C4, C5} for conv2, conv3, conv4, and conv5 outputs, and note that they have strides of {4, 8, 16, 32} pixels with respect to the input image. We do not include conv1 in the pyramid due to its large memory footprint.
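To make the strides concrete, the following minimal sketch computes the spatial size of each bottom-up map for a given input. The function name `feature_map_sizes`, the 800×1216 example input, and the choice to round up for sizes that are not divisible by the stride are assumptions made for illustration; they are not taken from the paper.

```python
import math

# Strides of the bottom-up outputs C2..C5 relative to the input image,
# as stated in the text.
STRIDES = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}


def feature_map_sizes(height, width):
    """Return the (rows, cols) of each C_k for an input of the given size.

    Rounding up is an assumption about how a backbone pads odd sizes.
    """
    return {
        name: (math.ceil(height / stride), math.ceil(width / stride))
        for name, stride in STRIDES.items()
    }


# Hypothetical input size chosen so every stride divides it evenly.
sizes = feature_map_sizes(800, 1216)
for name, (rows, cols) in sizes.items():
    print(name, rows, cols)
```

Each level halves the resolution of the previous one, which is the scaling step of 2 described for the bottom-up pathway.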

Top-down pathway and lateral connections. The top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. These features are then enhanced with features from the bottom-up pathway via lateral connections. Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times.

Fig. 3 shows the building block that constructs our top-down feature maps. With a coarser-resolution feature map, we upsample the spatial resolution by a factor of 2 (using nearest neighbor upsampling for simplicity). The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) by element-wise addition. This process is iterated until the finest resolution map is generated. To start the iteration, we simply attach a 1×1 convolutional layer on C5 to produce the coarsest resolution map. Finally, we append a 3×3 convolution on each merged map to generate the final feature map, which is to reduce the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5} that are respectively of the same spatial sizes.
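The merge step can be sketched in plain Python on nested lists. This is an illustrative toy under stated assumptions, not the authors' implementation: feature maps are `[channels][rows][cols]` lists, each finer map is exactly twice the size of the next coarser one, the 1×1 convolutions have no bias, and the final 3×3 smoothing convolution is left out. All function names and the toy weights are hypothetical.

```python
def conv1x1(x, weight):
    """1x1 convolution without bias: x is [C_in][H][W], weight is [C_out][C_in]."""
    rows, cols = len(x[0]), len(x[0][0])
    return [
        [[sum(w * x[c][i][j] for c, w in enumerate(out_row)) for j in range(cols)]
         for i in range(rows)]
        for out_row in weight
    ]


def upsample_nearest_2x(x):
    """Nearest-neighbor upsampling by a factor of 2 in both spatial dimensions."""
    return [
        [[plane[i // 2][j // 2] for j in range(2 * len(plane[0]))]
         for i in range(2 * len(plane))]
        for plane in x
    ]


def add(a, b):
    """Element-wise addition of two [C][H][W] maps with identical shapes."""
    return [
        [[va + vb for va, vb in zip(row_a, row_b)] for row_a, row_b in zip(pa, pb)]
        for pa, pb in zip(a, b)
    ]


def top_down_merge(bottom_up, lateral_weights):
    """Merge bottom-up maps ordered fine to coarse (e.g. [C2, C3, C4, C5]).

    Returns the merged maps in the same order. The 3x3 smoothing
    convolution described in the text is omitted from this sketch.
    """
    merged = [None] * len(bottom_up)
    # Start the iteration at the coarsest level: only a 1x1 conv is applied.
    merged[-1] = conv1x1(bottom_up[-1], lateral_weights[-1])
    for k in range(len(bottom_up) - 2, -1, -1):
        lateral = conv1x1(bottom_up[k], lateral_weights[k])
        merged[k] = add(upsample_nearest_2x(merged[k + 1]), lateral)
    return merged


def constant_map(channels, rows, cols, value):
    """Helper for the toy example: a map filled with a single value."""
    return [[[value] * cols for _ in range(rows)] for _ in range(channels)]


# Toy two-level example with one output channel (d = 1 instead of 256).
c_fine = constant_map(2, 4, 4, 1.0)     # 2 channels, 4x4
c_coarse = constant_map(3, 2, 2, 2.0)   # 3 channels, 2x2
weights = [[[1.0, 1.0]], [[1.0, 1.0, 1.0]]]
merged = top_down_merge([c_fine, c_coarse], weights)
print(merged[0][0][0][0])
```

In the toy example the coarse map becomes 6.0 after its 1×1 convolution, and the fine level is the upsampled 6.0 plus its own lateral value 2.0.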

Because all levels of the pyramid use shared classifiers/regressors as in a traditional featurized image pyramid, we fix the feature dimension (number of channels, denoted as d) in all the feature maps. We set d = 256 in this paper and thus all extra convolutional layers have 256-channel outputs. There are no non-linearities in these extra layers, which we have empirically found to have minor impacts.

Simplicity is central to our design and we have found that our model is robust to many design choices. We have experimented with more sophisticated blocks (e.g., using multi-layer residual blocks [16] as the connections) and observed marginally better results. Designing better connection modules is not the focus of this paper, so we opt for the simple design described above.

4 Applications

Our method is a generic solution for building feature pyramids inside deep ConvNets. In the following we adopt our method in RPN [29] for bounding box proposal generation and in Fast R-CNN [11] for object detection. To demonstrate the simplicity and effectiveness of our method, we make minimal modifications to the original systems of [29, 11] when adapting them to our feature pyramid.

4.1 Feature Pyramid Networks for RPN

RPN [29] is a sliding-window class-agnostic object detector. In the original RPN design, a small subnetwork is evaluated on dense 3×3 sliding windows, on top of a single-scale convolutional feature map, performing object/non-object binary classification and bounding box regression. This is realized by a 3×3 convolutional layer followed by two sibling 1×1 convolutions for classification and regression, which we refer to as a network head. The object/non-object criterion and bounding box regression target are defined with respect to a set of reference boxes called anchors [29]. The anchors are of multiple pre-defined scales and aspect ratios in order to cover objects of different shapes.

We adapt RPN by replacing the single-scale feature map with our FPN. We attach a head of the same design (3×3 conv and two sibling 1×1 convs) to each level on our feature pyramid. Because the head slides densely over all locations in all pyramid levels, it is not necessary to have multi-scale anchors on a specific level. Instead, we assign anchors of a single scale to each level. Formally, we define the anchors to have areas of {32², 64², 128², 256², 512²} pixels on {P2, P3, P4, P5, P6} respectively. As in [29] we also use anchors of multiple aspect ratios {1:2, 1:1, 2:1} at each level. So in total there are 15 anchors over the pyramid.
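The anchor layout can be enumerated with a short sketch. It assumes each anchor keeps the level's area while its shape follows the aspect ratio, and it reads a ratio a:b as height:width; the text does not specify either convention, and the names `ANCHOR_SIDES` and `anchor_shapes` are hypothetical.

```python
import math

# One anchor scale per pyramid level: the side of the square with that area.
ANCHOR_SIDES = {"P2": 32, "P3": 64, "P4": 128, "P5": 256, "P6": 512}
ASPECT_RATIOS = [(1, 2), (1, 1), (2, 1)]  # the three ratios from the text


def anchor_shapes(side, ratios=ASPECT_RATIOS):
    """Return (width, height) pairs with area side**2, one per aspect ratio.

    Interpreting a:b as height:width is an assumption of this sketch.
    """
    shapes = []
    for a, b in ratios:
        ratio = a / b
        width = side / math.sqrt(ratio)
        height = side * math.sqrt(ratio)
        shapes.append((width, height))
    return shapes


pyramid_anchors = {level: anchor_shapes(side) for level, side in ANCHOR_SIDES.items()}
total = sum(len(shapes) for shapes in pyramid_anchors.values())
print(total)  # 5 levels x 3 ratios
```

With five levels and three ratios per level, the count matches the 15 anchors stated above.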

We assign training labels to the anchors based on their Intersection-over-Union (IoU) ratios with ground-truth bounding boxes as in [29]. Formally, an anchor is assigned a positive label if it has the highest IoU for a given ground-truth box or an IoU over 0.7 with any ground-truth box, and a negative label if it has IoU lower than 0.3 for all ground-truth boxes. Note that scales of ground-truth boxes are not explicitly used to assign them to the levels of the pyramid; instead, ground-truth boxes are associated with anchors, which have been assigned to pyramid levels. As such, we introduce no extra rules in addition to those in [29].
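The labeling rule can be sketched as follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, anchors that are neither positive nor negative get the label -1 (ignored), ties for the highest IoU go to the first anchor, and an anchor with zero overlap is never promoted to positive; these conventions and the function names are assumptions for illustration rather than details from the text.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 positive, 0 negative, -1 ignored."""
    if not anchors:
        return []
    ious = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in ious:
        best = max(row, default=0.0)
        if best > pos_thresh:
            labels.append(1)      # IoU over 0.7 with some ground-truth box
        elif best < neg_thresh:
            labels.append(0)      # IoU below 0.3 for all ground-truth boxes
        else:
            labels.append(-1)     # in between: not used for training
    # The anchor with the highest IoU for each ground-truth box is positive.
    for g in range(len(gt_boxes)):
        column = [row[g] for row in ious]
        best = max(column)
        if best > 0:              # assumption: skip boxes no anchor overlaps
            labels[column.index(best)] = 1
    return labels


# Hypothetical anchors and ground-truth boxes.
anchors = [(0, 0, 10, 10), (0, 0, 10, 8), (20, 20, 30, 30), (0, 0, 10, 4), (40, 40, 50, 60)]
gt_boxes = [(0, 0, 10, 10), (40, 40, 60, 60)]
labels = label_anchors(anchors, gt_boxes)
print(labels)
```

The last anchor has an IoU of only 0.5 with the second ground-truth box, yet it is labeled positive because it is that box's best match.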

We note that the parameters of the heads are shared across all feature pyramid levels; we have also evaluated the alternative without sharing parameters and observed similar accuracy. The good performance of sharing parameters indicates that all levels of our pyramid share similar semantic levels. This advantage is analogous to that of using a featurized image pyramid, where a common head classifier can be applied to features computed at any image scale.

With the above adaptations, RPN can be naturally trained and tested with our FPN, in the same fashion as in [29]. We elaborate on the implementation details in the experiments.

4.2 Feature Pyramid Networks for Fast R-CNN

Fast R-CNN [11] is a region-based object detector in which Region-of-Interest (RoI) pooling is used to extract features. Fast R-CNN is most commonly performed on a single-scale feature map. To use it with our FPN, we need to assign RoIs of different scales to the pyramid levels.

We view our feature pyramid as if it were produced from an image pyramid. Thus we can adapt the assignment strategy of region-based detectors [15, 11] in the case when they are run on image pyramids. Formally, we assign an RoI of width w and height h (on the input image to the network) to the level P_k of our feature pyramid by:

$$k = \lfloor k_0 + \log_2(\sqrt{wh} / 224) \rfloor. \quad (1)$$

Here 224 is the canonical ImageNet pre-training size, and k_0 is the target level on which an RoI with w × h = 224² should be mapped into. Analogous to the ResNet-based Faster R-CNN system [16] that uses C4 as the single-scale feature map, we set k_0 to 4. Intuitively, Eqn. (1) means that if the RoI's scale becomes smaller (say, 1/2 of 224), it should be mapped into a finer-resolution level (say, k = 3).
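The assignment rule is easy to check numerically. The sketch below takes Eqn. (1) to be k = floor(k_0 + log2(sqrt(w·h) / 224)), which agrees with the description above; clamping the result to the range of available levels P2 to P5 is an added assumption, and the function name `roi_level` is hypothetical.

```python
import math


def roi_level(width, height, k0=4, canonical=224, k_min=2, k_max=5):
    """Map an RoI of the given size (in input-image pixels) to a pyramid level k.

    Clamping to [k_min, k_max] is an assumption of this sketch; the text
    only gives the unclamped formula.
    """
    k = math.floor(k0 + math.log2(math.sqrt(width * height) / canonical))
    return max(k_min, min(k_max, k))


print(roi_level(224, 224))  # canonical size maps to k0
print(roi_level(112, 112))  # half the scale maps one level finer
print(roi_level(448, 448))  # double the scale maps one level coarser
print(roi_level(32, 32))    # very small RoIs are clamped to the finest level
```

The second call reproduces the example in the text: an RoI at half the canonical scale lands on k = 3.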

We attach predictor heads (in Fast R-CNN the heads are class-specific classifiers and bounding box regressors) to all RoIs of all levels. Again, the heads all share parameters, regardless of their levels. In [16], a ResNet's conv5 layers (a 9-layer deep subnetwork) are adopted as the head on top of the conv4 features, but our method has already harnessed conv5 to construct the feature pyramid. So unlike [16], we simply adopt RoI pooling to extract 7×7 features, and attach two hidden 1,024-d fully-connected (fc) layers (each followed by ReLU) before the final classification and bounding box regression layers. These layers are randomly initialized, as there are no pre-trained fc layers available in ResNets. Note that compared to the standard conv5 head, our 2-fc MLP head is lighter weight and faster.

Based on these adaptations, we can train and test Fast R-CNN on top of the feature pyramid. Implementation details are given in the experimental section.

5 Experiments on Object Detection

We perform experiments on the 80-category COCO detection dataset [21]. We train using the union of 80k train images and a 35k subset of val images (trainval35k [2]), and report ablations on a 5k subset of val images (minival). We also report final results on the standard test set (test-std) [21], which has no disclosed labels.

5.1 Region Proposal with RPN

We evaluate the COCO-style Average Recall (AR) and AR on small, medium, and large objects (ARs, ARm, and ARl) following the definitions in [21]. We report results for 100 and 1000 proposals per image (AR100 and AR1k).

Implementation details. All architectures in Table 1 are trained end-to-end. The input image is resized such that its shorter side has 800 pixels. We adopt synchronized SGD training on 8 GPUs. A mini-batch involves 2 images per GPU and 256 anchors per image. We use a weight decay of 0.0001 and a momentum of 0.9. The learning rate is 0.02 for the first 30k mini-batches and 0.002 for the next 10k. For all RPN experiments (including baselines), we include the anchor boxes that are outside the image for training, which is unlike [29] where these anchor boxes are ignored. Other implementation details are as in [29]. Training RPN with FPN on 8 GPUs takes about 8 hours on COCO.
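The schedule and batch arithmetic above can be written down as a small sketch. It only restates the numbers in the text; the zero-based iteration index, the function name `learning_rate`, and the constant names are assumptions, and no particular training framework is implied.

```python
GPUS = 8
IMAGES_PER_GPU = 2
ANCHORS_PER_IMAGE = 256

# Totals per synchronized mini-batch across all GPUs.
IMAGES_PER_BATCH = GPUS * IMAGES_PER_GPU
ANCHORS_PER_BATCH = IMAGES_PER_BATCH * ANCHORS_PER_IMAGE

WEIGHT_DECAY = 0.0001
MOMENTUM = 0.9


def learning_rate(iteration):
    """Step schedule: 0.02 for the first 30k mini-batches, 0.002 for the next 10k.

    The iteration index is assumed to start at 0.
    """
    if iteration < 0 or iteration >= 40_000:
        raise ValueError("schedule is defined for iterations 0..39999")
    return 0.02 if iteration < 30_000 else 0.002


print(IMAGES_PER_BATCH, ANCHORS_PER_BATCH)
print(learning_rate(0), learning_rate(30_000))
```

So each update sees 16 images and samples 4,096 anchors in total, over 40k mini-batches.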

5.1.1 Ablation Experiments

Comparisons with baselines. For fair comparisons with original RPNs [29], we run two baselines (Table 1(a, b)) using the single-scale map of C4 (the same as [16]) or C5, both using the same hyper-parameters as ours, including using 5 scale anchors of {32², 64², 128², 256², 512²}. Table 1(b) shows no advantage over (a), indicating that a single higher-level feature map is not enough because there is a trade-off between coarser resolutions and stronger semantics.

Placing FPN in RPN improves AR1k to 56.3 (Table 1(c)), which is an 8.0-point increase over the single-scale RPN baseline (Table 1(a)). In addition, the performance on small objects (AR1k_s) is boosted by a large margin of 12.9 points. Our pyramid representation greatly improves RPN's robustness to object scale variation.

How important is top-down enrichment? Table 1(d) shows the results of our feature pyramid without the top-down pathway. With this modification, the 1×1 lateral connections followed by 3×3 convolutions are attached to the bottom-up pyramid. This architecture simulates the effect of reusing the pyramidal feature hierarchy (Fig. 1(b)).

The results in Table 1(d) are just on par with the RPN baseline and lag far behind ours. We conjecture that this is because there are large semantic gaps between different levels on the bottom-up pyramid (Fig. 1(b)), especially for very deep ResNets. We have also evaluated a variant of Table 1(d) without sharing the parameters of the heads, but observed similarly degraded performance. This issue cannot be simply remedied by level-specific heads.

How important are lateral connections? Table 1(e) shows the ablation results of a top-down feature pyramid without the 1×1 lateral connections. This top-down pyramid has strong semantic features and fine resolutions. But we argue that the locations of these features are not precise, because these maps have been downsampled and upsampled several times. More precise locations of features can be directly passed from the finer levels of the bottom-up maps via the lateral connections to the top-down maps. As a result, FPN has an AR1k score 10 points higher than Table 1(e).

How important are pyramid representations? Instead of resorting to pyramid representations, one can attach the head to the highest-resolution, strongly semantic feature maps of P2 (i.e., the finest level in our pyramids). Similar to the single-scale baselines, we assign all anchors to the P2 feature map. This variant (Table 1(f)) is better than the baseline but inferior to our approach. RPN is a sliding window detector with a fixed window size, so scanning over pyramid levels can increase its robustness to scale variance. In addition, we note that using P2 alone leads to more anchors (750k, Table 1(f)) caused by its large spatial resolution. This result suggests that a larger number of anchors is not sufficient in itself to improve accuracy.
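
A rough count shows why P2 alone produces so many anchors. The sketch below is an estimate under stated assumptions, not the authors' bookkeeping: it assumes an 800×1067 input (a hypothetical size), a stride of 2^k at level P_k, 3 anchors per location on the pyramid (one scale per level, three aspect ratios), and 15 per location for the single-map variant (five scales, three aspect ratios), following the anchor setup described earlier in the paper. `anchor_count` is an illustrative helper.

```python
import math


def anchor_count(height, width, strides, anchors_per_location):
    """Total anchors over feature maps with the given strides."""
    total = 0
    for stride in strides:
        rows = math.ceil(height / stride)
        cols = math.ceil(width / stride)
        total += rows * cols * anchors_per_location
    return total


HEIGHT, WIDTH = 800, 1067  # assumed input size (shorter side 800)

# Pyramid P2..P6 (strides 4..64), 3 aspect ratios at a single scale per level.
pyramid_anchors = anchor_count(HEIGHT, WIDTH, [4, 8, 16, 32, 64], 3)

# P2 only (stride 4), 5 scales x 3 aspect ratios at every location.
p2_only_anchors = anchor_count(HEIGHT, WIDTH, [4], 15)

print(pyramid_anchors, p2_only_anchors)
```

Under these assumptions the P2-only count comes out near 800k, the same order as the 750k reported in Table 1(f), and several times the pyramid count. The exact figures depend on the image sizes in the dataset.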

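Table 1. Bounding box proposal results using RPN [29], evaluated on the COCO minival set. All models are trained on trainval35k. The columns “lateral” and “top-down” denote the presence of lateral and top-down connections, respectively. The column “feature” denotes the feature maps on which the heads are attached. All results are based on ResNet-50 and share the same hyper-parameters.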

Table 2. Object detection results using Fast R-CNN [11] on a fixed set of proposals (RPN, {Pk}, Table 1(c)), evaluated on the COCO minival set. Models are trained on the trainval35k set. All results are based on ResNet-50 and share the same hyper-parameters.

Table 3. Object detection results using Faster R-CNN [29], evaluated on the COCO minival set. The backbone networks for RPN are consistent with Fast R-CNN. Models are trained on the trainval35k set and use ResNet-50. †Provided by authors of [16].

5.2 Object Detection with Fast/Faster R-CNN

Next we investigate FPN for region-based (non-sliding window) detectors. We evaluate object detection by the COCO-style Average Precision (AP) and PASCAL-style AP (at a single IoU threshold of 0.5). We also report COCO AP on objects of small, medium, and large sizes (namely, APs, APm, and APl) following the definitions in [21].
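
Both metrics rest on the intersection-over-union (IoU) between a predicted box and a ground-truth box. The following is a minimal sketch, assuming boxes in (x1, y1, x2, y2) corner format. `box_iou` is an illustrative helper, and the ten thresholds follow the COCO convention of averaging AP over IoU 0.50 to 0.95 in steps of 0.05.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# COCO-style AP averages over these IoU thresholds; PASCAL-style AP uses 0.5 only.
COCO_IOU_THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]

prediction = (0, 0, 10, 10)
ground_truth = (2, 0, 12, 10)
iou = box_iou(prediction, ground_truth)  # intersection 80, union 120
counts_for_pascal = iou >= 0.5
thresholds_met = sum(iou >= t for t in COCO_IOU_THRESHOLDS)
```

In this made-up example the prediction is a match under the PASCAL criterion but only under the four loosest COCO thresholds, which is why COCO-style AP rewards tighter localization.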

Implementation details. The input image is resized such that its shorter side has 800 pixels. Synchronized SGD is used to train the model on 8 GPUs. Each mini-batch involves 2 images per GPU and 512 RoIs per image. We use a weight decay of 0.0001 and a momentum of 0.9. The learning rate is 0.02 for the first 60k mini-batches and 0.002 for the next 20k. We use 2000 RoIs per image for training and 1000 for testing. Training Fast R-CNN with FPN takes about 10 hours on the COCO dataset.
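
The schedule above can be written down as a small step function. This is a sketch of the stated hyper-parameters only, with hypothetical names (`learning_rate`, `IMAGES_PER_BATCH`); it is not tied to any particular training framework.

```python
NUM_GPUS = 8
IMAGES_PER_GPU = 2
IMAGES_PER_BATCH = NUM_GPUS * IMAGES_PER_GPU  # 16 images per mini-batch
WEIGHT_DECAY = 0.0001
MOMENTUM = 0.9


def learning_rate(iteration):
    """Step schedule from the text: 0.02 for 60k mini-batches, then 0.002 for 20k."""
    if iteration < 0 or iteration >= 80_000:
        raise ValueError("iteration outside the 80k-step schedule")
    return 0.02 if iteration < 60_000 else 0.002
```

For example, `learning_rate(59_999)` is still 0.02, while `learning_rate(60_000)` drops to 0.002.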

5.2.1 Fast R-CNN (on fixed proposals)

To better investigate FPN’s effects on the region-based detector alone, we conduct ablations of Fast R-CNN on a fixed set of proposals. We choose to freeze the proposals as computed by RPN on FPN (Table 1(c)), because it has good performance on small objects that are to be recognized by the detector. For simplicity we do not share features between Fast R-CNN and RPN, except when specified.

As a ResNet-based Fast R-CNN baseline, following [16], we adopt RoI pooling with an output size of 14×14 and attach all conv5 layers as the hidden layers of the head. This gives an AP of 31.9 in Table 2(a). Table 2(b) is a baseline exploiting an MLP head with 2 hidden fc layers, similar to the head in our architecture. It gets an AP of 28.8, indicating that the 2-fc head does not give us any orthogonal advantage over the baseline in Table 2(a).

Table 2(c) shows the results of our FPN in Fast R-CNN. Compared with the baseline in Table 2(a), our method improves AP by 2.0 points and small object AP by 2.1 points. Compared with the baseline that also adopts a 2-fc head (Table 2(b)), our method improves AP by 5.1 points. These comparisons indicate that our feature pyramid is superior to single-scale features for a region-based object detector.

Table 2(d) and (e) show that removing top-down connections or removing lateral connections leads to inferior results, similar to what we have observed in the above subsection for RPN. It is noteworthy that removing top-down connections (Table 2(d)) significantly degrades the accuracy, suggesting that Fast R-CNN suffers from using the low-level features at the high-resolution maps.

In Table 2(f), we adopt Fast R-CNN on the single finest scale feature map of P2. Its result (33.4 AP) is marginally worse than that of using all pyramid levels (33.9 AP, Table 2(c)). We argue that this is because RoI pooling is a warping-like operation, which is less sensitive to the region’s scales. Despite the good accuracy of this variant, it is based on the RPN proposals of {Pk} and has thus already benefited from the pyramid representation.
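
Using all pyramid levels requires routing each RoI to one level. The sketch below assumes the assignment rule given earlier in the paper, k = ⌊k0 + log2(√(wh)/224)⌋ with k0 = 4, and clamps the result to P2 through P5. `roi_level` and its parameter names are illustrative. The P2-only variant of Table 2(f) corresponds to skipping this step and always pooling from level 2.

```python
import math


def roi_level(width, height, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Pyramid level P_k for an RoI of the given size (in input-image pixels)."""
    scale = math.sqrt(width * height)
    k = math.floor(k0 + math.log2(scale / canonical_size))
    return max(k_min, min(k_max, k))
```

With these defaults a 224×224 RoI maps to P4, a 112×112 RoI to P3, and anything much smaller falls through to the finest level, P2.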

5.2.2 Faster R-CNN (on consistent proposals)

In the above we used a fixed set of proposals to investigate the detectors. But in a Faster R-CNN system [29], the RPN and Fast R-CNN must use the same network backbone in order to make feature sharing possible. Table 3 shows the comparisons between our method and two baselines, all using consistent backbone architectures for RPN and Fast R-CNN. Table 3(a) shows our reproduction of the baseline Faster R-CNN system as described in [16]. Under controlled settings, our FPN (Table 3(c)) is better than this strong baseline by 2.3 points AP and 3.8 points AP@0.5.

Sharing features. In the above, for simplicity we do not share the features between RPN and Fast R-CNN. In Table 5, we evaluate sharing features following the 4-step training described in [29]. Similar to [29], we find that sharing features improves accuracy by a small margin. Feature sharing also reduces the testing time.

Running time. With feature sharing, our FPN-based Faster R-CNN system has an inference time of 0.165 seconds per image on a single NVIDIA M40 GPU for ResNet-50, and 0.19 seconds for ResNet-101. As a comparison, the single-scale ResNet-50 baseline in Table 3(a) runs at 0.32 seconds. Our method introduces a small extra cost through the extra layers in the FPN, but has a lighter-weight head. Overall our system is faster than the ResNet-based Faster R-CNN counterpart. We believe the efficiency and simplicity of our method will benefit future research and applications.
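
The per-image latencies translate directly into throughput. The short calculation below uses only the three timings quoted above; the dictionary keys and variable names are illustrative.

```python
SECONDS_PER_IMAGE = {
    "FPN, ResNet-50": 0.165,
    "FPN, ResNet-101": 0.19,
    "single-scale baseline, ResNet-50": 0.32,
}

# Throughput is the reciprocal of the per-image latency.
frames_per_second = {name: 1.0 / t for name, t in SECONDS_PER_IMAGE.items()}

# How much faster the ResNet-50 FPN system is than the single-scale baseline.
speedup_r50 = (SECONDS_PER_IMAGE["single-scale baseline, ResNet-50"]
               / SECONDS_PER_IMAGE["FPN, ResNet-50"])

for name, fps in frames_per_second.items():
    print(f"{name}: {fps:.1f} images/s")
```

This works out to roughly 6, 5.3, and 3.1 images per second respectively, so the ResNet-50 FPN system is a little under twice as fast as the baseline on the same GPU.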

5.2.3 Comparing with COCO Competition Winners

We find that our ResNet-101 model in Table 5 is not sufficiently trained with the default learning rate schedule. So we increase the number of mini-batches by 2× at each learning rate when training the Fast R-CNN step. This increases AP on minival to 35.6, without sharing features. This model is the one we submitted to the COCO detection leaderboard, shown in Table 4. Due to limited time we have not evaluated its feature-sharing version, which should be slightly better as implied by Table 5.

Table 4 compares our method with the single-model results of the COCO competition winners, including the 2016 winner G-RMI and the 2015 winner Faster R-CNN+++. Without adding bells and whistles, our single-model entry has surpassed these strong, heavily engineered competitors. On the test-dev set, our method increases over the existing best results by 0.5 points of AP (36.2 vs. 35.7) and 3.4 points of AP@0.5 (59.1 vs. 55.7). It is worth noting that our method does not rely on image pyramids and only uses a single input image scale, but still has outstanding AP on small-scale objects. With previous methods, this could only be achieved by high-resolution image inputs.

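Table 4. Comparisons of single-model results on the COCO detection benchmark. Some results were not available on the test-std set, so we also include the test-dev results (and for Multipath [40] on minival). †: http://image-net.org/challenges/talks/2016/GRMI-COCO-slidedeck.pdf. ‡: http://mscoco.org/dataset/#detections-leaderboard. §: This entry of AttractioNet [10] adopts VGG-16 for proposals and Wide ResNet [39] for object detection, so it is not strictly a single-model result.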

Table 5. More object detection results using Faster R-CNN and our FPNs, evaluated on minival. Sharing features increases training time by 1.5× (using 4-step training [29]), but reduces test time.

Moreover, our method does not exploit many popular improvements, such as iterative regression [9], hard negative mining [35], context modeling [16], stronger data augmentation [22], etc. These improvements are complementary to FPNs and should boost accuracy further.

Recently, FPN has enabled new top results in all tracks of the COCO competition, including detection, instance segmentation, and keypoint estimation. See [14] for details.

Figure 4. FPN for object segment proposals. The feature pyramid is constructed with identical structure as for object detection. We apply a small MLP on 5×5 windows to generate dense object segments with output dimension of 14×14. Shown in orange are the sizes of the image regions the mask corresponds to for each pyramid level (levels P3−5 are shown here). Both the corresponding image region size (light orange) and canonical object size (dark orange) are shown. Half octaves are handled by an MLP on 7×7 windows (7 ≈ 5√2), not shown here. Details are in the appendix.

Table 6. Instance segmentation proposals evaluated on the first 5k COCO val images. All models are trained on the train set. DeepMask, SharpMask, and FPN use ResNet-50 while InstanceFCN uses VGG-16. DeepMask and SharpMask performance is computed with models available from https://github.com/facebookresearch/deepmask (both are the ‘zoom’ variants). †Runtimes are measured on an NVIDIA M40 GPU, except the InstanceFCN timing which is based on the slower K40.

6 Extensions: Segmentation Proposals

Our method is a generic pyramid representation and can be used in applications other than object detection. In this section we use FPNs to generate segmentation proposals, following the DeepMask/SharpMask framework [27, 28]. DeepMask/SharpMask were trained on image crops for predicting instance segments and object/non-object scores. At inference time, these models are run convolutionally to generate dense proposals in an image. To generate segments at multiple scales, image pyramids are necessary [27, 28]. It is easy to adapt FPN to generate mask proposals. We use a fully convolutional setup for both training and inference. We construct our feature pyramid as in Sec. 5.1 and set d = 128. On top of each level of the feature pyramid, we apply a small 5×5 MLP to predict 14×14 masks and object scores in a fully convolutional fashion (see Fig. 4). Additionally, motivated by the use of 2 scales per octave in the image pyramid of [27, 28], we use a second MLP of input size 7×7 to handle half octaves. The two MLPs play a similar role as anchors in RPN. The architecture is trained end-to-end; full implementation details are given in the appendix.
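
The two window sizes together give two scales per octave. The back-of-the-envelope sketch below assumes a stride of 2^k at level P_k, so that an n×n window covers n·2^k input pixels per side, and uses levels P3 to P5 as drawn in Fig. 4. `window_footprint` is a hypothetical helper, and the footprint it computes is not necessarily the canonical object size defined in the appendix.

```python
import math


def window_footprint(window, level):
    """Side length, in input pixels, covered by a window x window patch on P_level."""
    return window * 2 ** level


footprints = sorted(
    window_footprint(window, level)
    for level in (3, 4, 5)
    for window in (5, 7)
)

# Ratio between consecutive footprints; 7/5 = 1.4 is close to sqrt(2).
ratios = [b / a for a, b in zip(footprints, footprints[1:])]
half_octave = math.sqrt(2)
```

The footprints interleave as 40, 56, 80, 112, 160, 224 pixels, with each step close to a factor of √2, which is what "half octave" means here.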

6.1 Segmentation Proposal Results

Results are shown in Table 6. We report segment AR and segment AR on small, medium, and large objects, always for 1000 proposals. Our baseline FPN model with a single 5×5 MLP achieves an AR of 43.4. Switching to a slightly larger 7×7 MLP leaves accuracy largely unchanged. Using both MLPs together increases accuracy to 45.7 AR. Increasing mask output size from 14×14 to 28×28 increases AR by another point (larger sizes begin to degrade accuracy). Finally, doubling the training iterations increases AR to 48.1.

We also report comparisons to DeepMask [27], SharpMask [28], and InstanceFCN [4], the previous state-of-the-art methods in mask proposal generation. We outperform the accuracy of these approaches by over 8.3 points AR. In particular, we nearly double the accuracy on small objects.

Existing mask proposal methods [27, 28, 4] are based on densely sampled image pyramids (e.g., scaled by 2^{−2:0.5:1} in [27, 28]), making them computationally expensive. Our approach, based on FPNs, is substantially faster (our models run at 4 to 6 fps). These results demonstrate that our model is a generic feature extractor and can replace image pyramids for other multi-scale detection problems.
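
The cost of a dense image pyramid can be estimated by counting pixels. The sketch reads 2^{−2:0.5:1} as exponents from −2 to 1 in steps of 0.5 (an interpretation of the notation, giving seven scales) and assumes that compute grows with the number of input pixels. `PYRAMID_EXPONENTS` and `relative_pixels` are illustrative names.

```python
# Exponents -2, -1.5, ..., 1 (assumed reading of the notation 2^{-2:0.5:1}).
PYRAMID_EXPONENTS = [-2 + 0.5 * i for i in range(7)]
scales = [2 ** e for e in PYRAMID_EXPONENTS]

# Rescaling both sides of an image by s changes its pixel count by s**2.
relative_pixels = sum(s ** 2 for s in scales)

print(len(scales), round(relative_pixels, 2))
```

Under this reading, the image pyramid processes about 7.9 times as many pixels as a single pass over the original image, most of it at the 2× scale. An FPN instead runs the backbone once on a single-scale input.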

7 Conclusion

We have presented a clean and simple framework for building feature pyramids inside ConvNets. Our method shows significant improvements over several strong baselines and competition winners. Thus, it provides a practical solution for research and applications of feature pyramids, without the need to compute image pyramids. Finally, our study suggests that despite the strong representational power of deep ConvNets and their implicit robustness to scale variation, it is still critical to explicitly address multi-scale problems using pyramid representations.
