An efficient visual servo tracker for herd monitoring by UAV

Area and objects of study

The area selected for this study is in Maduo County, under the jurisdiction of Golog Tibetan Autonomous Prefecture, in the southern part of Qinghai Province (Fig. 1a). Maduo County is located at the source of the Yellow River and is a typical plateau area, with an average annual temperature of −4.0 °C. Owing to its unique geographical location and ecological environment, the local flora and fauna are very abundant, and animal husbandry is particularly well developed. These conditions make the area well suited for research on AI-based precision grazing technology.

Figure 1

The area selected for the present study: (a) The location of Maduo County; (b) the distribution of UAV sampling points in Maduo County; (c,d) aerial images of the studied area.

In April 2023, the authors of this paper and research colleagues went to Maduo County for aerial photography, flying a total of 20 sorties at a height of 100 m for sampling. The sampling points are shown in Fig. 1b. Domestic Tibetan yaks were selected as the research objects (Fig. 1c,d); their coats are mainly black and gray, and rarely white. The yaks move slowly and steadily, with a stride frequency of usually 120 to 140 steps per minute.

System overview

To acquire data in the selected area, a P600 intelligent UAV (Chengdu Bobei Technology Co., Ltd., China) was used (Fig. 2). The specific parameters of the UAV are detailed in Appendix A of the supplementary material. Compared with other intelligent UAV models (such as the P230, P450, and DJI Phantom), the P600 has outstanding advantages in flight stability, endurance, and load capacity, making it more suitable for long-term data collection in cold high-altitude areas. It is equipped with an RTK positioning system with centimeter-level accuracy, yielding a more precise flight path and a more stable attitude, so it can collect high-quality data and fly safely in complex high-altitude areas. The airframe carries an NX onboard computer with a computing power of up to 21 TOPS, which can run most mainstream algorithms and perform real-time data processing and analysis during collection.

Figure 2

Data acquisition equipment used for this study.

In addition, the UAV is equipped with a gimbal pod, two-dimensional planar LiDAR, GPS and other intelligent devices, enabling pod-based target selection and tracking, LiDAR obstacle avoidance, and position- and speed-guided flight. Furthermore, a Q10F 10× single-light pod with a USB interface was integrated with the P600 UAV, and a dedicated Robot Operating System (ROS2) driver was developed for the P600. This equipment captures real-time images through the pod on the onboard computer. It can also follow targets and adjust the UAV's position to keep a constant distance from moving targets. During target tracking, both the UAV and the pod are controlled fully autonomously via ROS2.

The Q10F 10× single-light pod obtains real-time images of targets with a resolution of up to 5 cm, providing target data for the tracking and detection algorithms running under ROS2. Through the vision-based tracking algorithm on the onboard computer, the system not only recognizes and tracks specific targets, but also estimates the approximate distance between the UAV and the tracked target from the changing size of the detection box in the image. In addition, ROS2 adjusts the UAV's position as the target moves, always maintaining a fixed distance from the target to avoid interfering with its activities. The combination of the Q10F 10× pod and ROS2 thus allows the P600 UAV not only to fly fully autonomously, but also to track targets with an intelligent pod.

Based on function, the system can be divided into three components: the controller, the detector, and the tracker. The overall technical route is shown in Fig. 3. In this study, two walking Tibetan yaks were selected as the tracking targets. As each camera frame is processed, the confirmed tracks are sent to the control system, which calculates the required speeds for 4 control variables according to the embedded algorithms and the real-time position of the UAV. The speeds are then sent to the autopilot to control the UAV to track the targets. As a basic component of the control system, ROS2 plays a crucial role in the information exchange between the UAV and the tracking program. The speed calculation algorithms for the 4 control variables differ from one another, so they are described separately in Section "Visual servo control".

Figure 3

The overall technical framework proposed in this study.


Since this study aims at tracking and identifying target objects in scenarios with high dynamic density and limited training data, YOLOv758 was chosen as the baseline detection model to balance the limited computational power and processing speed of the airborne computer.

The YOLOv7 model was developed in 2022 by Wang, Bochkovskiy et al., integrating strategies including E-ELAN (Extended Efficient Layer Aggregation Network)23, cascade-based model scaling59 and model reparameterization60 to appropriately balance detection efficiency and accuracy. The YOLOv7 network comprises 4 modules: the input module, backbone network, head network, and prediction network.

YOLOv7 was chosen as the detection model for the following reasons. Current deep-learning-based object detection algorithms can be divided into two-stage methods (e.g., RCNN, Fast-RCNN, Mask RCNN) and single-stage methods (e.g., SSD, YOLOv1–YOLOv8). Compared with two-stage methods, single-stage methods offer better real-time performance and are more suitable for UAV platforms. Among the single-stage methods, YOLOv7 outperforms SSD and YOLOv1–YOLOv6 in overall detection accuracy, detection rate, and network convergence speed. Although YOLOv7 may not match YOLOv8 in detection speed and accuracy, it has lower model complexity and can be deployed on UAV platforms with limited computing power.

We selected 20 domestic yak video sequences from the study area as datasets: 10 were used as the target detection dataset and 10 as the target tracking dataset. These two sets served as the benchmarks for yak detection and yak tracking, respectively, and the YOLOv7 model was used to detect the yaks.

To improve the model's ability to detect yaks, the target detection dataset was further divided into training, validation and test sets at a ratio of 7:2:1. The YOLOv7 model was trained on this dataset, with the model parameters adjusted to achieve high stability. Yak coats are generally pure black, or black and white. To capture more coat texture features, 2400 yak images were extracted from the 10 video sequences of the target detection dataset, split at the same 7:2:1 ratio, and used to train YOLOv7 again.
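The 7:2:1 split described above can be sketched as follows; the file names, counts and random seed here are illustrative, not the authors' actual data.

```python
import random

def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle a list of image paths and split it into train/val/test subsets."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

# 2400 extracted yak images -> 1680 train, 480 val, 240 test
frames = [f"yak_{i:04d}.jpg" for i in range(2400)]
train, val, test = split_dataset(frames)
```

Shuffling before splitting avoids putting temporally adjacent (and thus near-duplicate) frames of one video entirely into the training set.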


The Deep SORT algorithm was used as the baseline for the tracker, and two improvements were made. Firstly, optical flow for motion estimation7 was introduced into the scheme to improve the motion prediction accuracy of KF. Secondly, an extension of the original tracking method, named low confidence track filtering, was used to improve the tracker's handling of unreliable detection results, which can occur in real-world detection due to complex environments. In this way, the number of false positive tracks can be significantly reduced and unreliable detections suppressed. The specific process is shown in Fig. 4.

Figure 4

Multiple object tracking pipeline.

To apply Deep SORT to yak tracking and monitoring, a large yak dataset is first needed to learn yak appearance features. Since the target tracking dataset contained only 10 video sequences, which is insufficient, we re-generated the target tracking dataset by setting truncation-rate and occlusion-rate parameters and cutting, rotating and synthesizing the video frame images.

The occlusion rate defines the degree of occlusion as the proportion of the yak bounding box that is occluded. We categorize occlusion into three levels: no occlusion, partial occlusion, and heavy occlusion. Specifically, a yak is defined as partially occluded if between 1 and 50% of it is occluded, and as heavily occluded if more than 50% is occluded.

The truncation rate indicates how far the yak extends outside the bounding box and is used for training-sample selection. To minimize the effect of noise, we discarded yak data with a truncation rate or occlusion rate greater than 0.5. About 100 video segments at equal intervals were selected as a batch, and the video frames were extracted and resized to JPEG images of the same size (500 × 500), yielding a total of 6000 yak images. We annotated the 6000 images with the LabelImg software and stored the annotations in XML format as the target tracking dataset, used as the benchmark for yak tracking.
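The selection rules above reduce to two small predicates; a minimal sketch, with function names of our own choosing:

```python
def keep_sample(truncation, occlusion, max_rate=0.5):
    """Discard yak crops whose truncation or occlusion rate exceeds 0.5."""
    return truncation <= max_rate and occlusion <= max_rate

def occlusion_category(rate):
    """Map an occlusion fraction to the three categories used in the paper:
    0 -> no occlusion, (0, 0.5] -> partial, > 0.5 -> heavy."""
    if rate == 0.0:
        return "none"
    return "partial" if rate <= 0.5 else "heavy"
```

Filtering out heavily truncated or occluded crops keeps the appearance-descriptor training data dominated by whole, recognizable yaks.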

The Deep SORT algorithm adopted in this work uses KF to estimate the existing tracks in the current frame. The states applied in KF are defined as \((x,y,\gamma ,h,\dot{x},\dot{y},\dot{\gamma },\dot{h})\), in which \((x,y,\gamma ,h)\) represents the bounding box position, and \((\dot{x},\dot{y},\dot{\gamma },\dot{h})\) represents the corresponding coordinate velocities. The KF in Deep SORT is the standard version with a constant-velocity model and a linear observation model. When each new frame arrives, the position of each existing track is estimated from the previous one, and the track estimation requires only spatial information.
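The constant-velocity, linear-observation structure of this KF is fully determined by its transition and observation matrices; a minimal sketch (the prediction step only, without the covariance update):

```python
import numpy as np

def make_cv_matrices(dt=1.0, dim=4):
    """Constant-velocity transition matrix F and observation matrix H for the
    8-D state (x, y, gamma, h, vx, vy, vgamma, vh); only the first four
    components (the bounding box) are observed."""
    F = np.eye(2 * dim)
    F[:dim, dim:] = dt * np.eye(dim)                    # position += velocity * dt
    H = np.hstack([np.eye(dim), np.zeros((dim, dim))])  # observe position only
    return F, H

F, H = make_cv_matrices()
state = np.array([10.0, 20.0, 0.5, 30.0, 1.0, -2.0, 0.0, 0.0])
pred = F @ state   # predicted state one frame ahead: box at (11, 18), same size
```

Because H projects out only the box components, the association step that follows compares detections with `H @ pred`, i.e. purely spatial information, exactly as described above.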

To obtain the appearance information of detections and tracks, an appearance descriptor is used to extract features from the detection images and from the track images of previous frames. As a CNN trained on a large-scale re-identification dataset, the appearance descriptor extracts features such that features from the same identity lie close to each other in the feature space.

By estimating the position and appearance information of existing tracks, new detection results in each subsequent frame can be associated with the existing tracks. New detections must have confidence above the detection confidence threshold \(t_{d}\) to become candidates for data association; all detections that do not meet this criterion are filtered out. Deep SORT uses a cost matrix to represent the spatial and visual similarity between the new detections and the existing tracks, which contains two distance terms. The first is the Mahalanobis distance of formula (1), for spatial information:

$$ d^{(1)} (i,j) = (d_{j} - y_{i} )^{T} s_{i}^{-1} (d_{j} - y_{i} ) $$


where \(y_{i}\) represents the projected mean of the i-th track in measurement space, \(s_{i}\) its covariance matrix, so that \((y_{i} ,s_{i})\) is the projection of the i-th track into measurement space, and \(d_{j}\) represents the j-th new detection. \(d^{(1)}(i,j)\) is thus the distance between the estimated position of the i-th track and the j-th new detection. The second distance represents the appearance information, as shown in formula (2):

$$ d^{(2)} (i,j) = \min \left\{ {1 - r_{j}^{T} r_{k}^{(i)} \mid r_{k}^{(i)} \in R_{i} } \right\} $$


where \(r_{j}\) represents the appearance descriptor of the j-th detection, and \(R_{i}\) contains the appearance descriptors of the last one hundred detections associated with the i-th track. Each distance is accompanied by a gate matrix, \(b_{i,j}^{(1)}\) or \(b_{i,j}^{(2)}\), which equals 1 if the corresponding distance is less than a predefined threshold and 0 otherwise. The combined cost matrix is given in formula (3):

$$ c_{i,j} = \lambda d^{(1)} (i,j) + (1 - \lambda )d^{(2)} (i,j) $$


The gate function \(b_{i,j} = \prod_{m = 1}^{2} b_{i,j}^{(m)}\) applies the thresholds: it equals 1 only when both the spatial and appearance gate functions are 1, and 0 otherwise, indicating whether (i, j) is a valid match in both space and appearance. For each new frame, the cost matrix is used to associate the new detections with the existing gated tracks.
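Formulas (1)-(3) and the joint gate can be put together in a short numeric sketch. The thresholds and \(\lambda\) below are illustrative choices, not the authors' settings (9.4877 is the 95% chi-square quantile for 4 degrees of freedom, a common spatial gate):

```python
import numpy as np

def mahalanobis_sq(d_j, y_i, S_i):
    """Formula (1): squared Mahalanobis distance between detection d_j and
    the projected track mean y_i with covariance S_i."""
    diff = d_j - y_i
    return float(diff.T @ np.linalg.inv(S_i) @ diff)

def appearance_dist(r_j, R_i):
    """Formula (2): smallest cosine distance between the detection descriptor
    r_j and the gallery R_i of past track descriptors (all unit-norm)."""
    return min(1.0 - float(r_j @ r_k) for r_k in R_i)

def combined_cost(d1, d2, lam=0.5, t1=9.4877, t2=0.2):
    """Formula (3) blended with the joint gate b = b1 * b2; pairs failing
    either gate receive an infinite cost so they can never be matched."""
    if d1 < t1 and d2 < t2:
        return lam * d1 + (1.0 - lam) * d2
    return np.inf

y = np.array([0.0, 0.0]); S = np.eye(2) * 4.0
d = np.array([3.0, 4.0])
d1 = mahalanobis_sq(d, y, S)   # (9 + 16) / 4 = 6.25
d2 = appearance_dist(np.array([1.0, 0.0]),
                     [np.array([1.0, 0.0]), np.array([0.0, 1.0])])  # 0.0
cost = combined_cost(d1, d2)   # 0.5 * 6.25 + 0.5 * 0.0 = 3.125
```

Setting gated-out entries to infinity lets a standard assignment solver (e.g. the Hungarian algorithm) perform the per-frame association directly on this matrix.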

If a new detection is successfully associated with an existing track, the detection is added to the track and the track's non-association age is reset to zero. If a new detection cannot be associated with any existing track in frame f, it is initialized as a tentative track. The original Deep SORT algorithm then checks whether the tentative track is associated with new detections in frames \((f + 1),(f + 2),\;…\;(f + t_{tentative} )\). If the association succeeds, the track is updated to a confirmed track; otherwise, the tentative track is immediately deleted. For existing tracks without a successful association in a frame, the non-association age increases by 1; once it exceeds the threshold, the corresponding track is also removed.
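This track life cycle is a small state machine; a minimal sketch under assumed parameter values (`n_init` and `max_age` are illustrative names for \(t_{tentative}\) and the non-association age threshold):

```python
class Track:
    """Sketch of the Deep SORT track life cycle: a tentative track is
    confirmed after n_init consecutive associated frames, and any track is
    deleted once its non-association age exceeds max_age."""
    def __init__(self, n_init=3, max_age=30):
        self.state = "tentative"
        self.hits = 1                 # the initializing detection counts
        self.age_since_update = 0
        self.n_init, self.max_age = n_init, max_age

    def mark_hit(self):
        """Called when a detection is associated with this track."""
        self.hits += 1
        self.age_since_update = 0
        if self.state == "tentative" and self.hits >= self.n_init:
            self.state = "confirmed"

    def mark_missed(self):
        """Called when no detection is associated in the current frame."""
        self.age_since_update += 1
        if self.state == "tentative" or self.age_since_update > self.max_age:
            self.state = "deleted"
```

A tentative track dies on its first miss, while a confirmed track survives up to `max_age` missed frames, matching the asymmetry described above.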

Improved deep SORT algorithm

Combination of KF and optical flow

As a classic tracking algorithm, the Lucas-Kanade (LK) optical flow61 algorithm has been widely applied for its competitive real-time speed and strong robustness. To address the problems of KF, optical flow is also used to estimate object motion in this study, under several assumptions: constant brightness between adjacent frames, slow movement of the targets, and similar motion of neighboring pixels in the same image. A missed detection interrupts the KF update and breaks the trajectory; therefore, the bounding boxes of objects are additionally predicted with optical flow. Besides the bounding boxes of frame f produced by the original detector, optical flow is used to predict object positions from information in the previous frame, providing additional historical cues. As shown in Fig. 5, the yellow bounding boxes represent the original detection results and the red ones are the results of the optical flow.
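One simple way to turn per-point LK flow into a box prediction is to translate the previous box by a robust average of the flow vectors of the feature points inside it. This sketch assumes the point-wise flow has already been computed (e.g. with an LK implementation); the median is our illustrative choice of robust average, not necessarily the authors':

```python
import numpy as np

def shift_box_by_flow(box, flow_vectors):
    """Predict a bounding box in the next frame by translating it with the
    median optical-flow vector of the tracked points inside it.
    box = (x, y, w, h); flow_vectors = N x 2 array of per-point (dx, dy)."""
    dx, dy = np.median(flow_vectors, axis=0)
    x, y, w, h = box
    return (x + dx, y + dy, w, h)

# three tracked points on a yak moving right and slightly up
flows = np.array([[2.0, -1.0], [2.2, -0.8], [1.8, -1.2]])
pred_box = shift_box_by_flow((100.0, 50.0, 40.0, 30.0), flows)
# -> (102.0, 49.0, 40.0, 30.0)
```

Using the median rather than the mean keeps a few mistracked points from dragging the predicted box off the animal.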

Figure 5

Comparison of the detection results (yellow bounding boxes: Original detection results; red bounding boxes: Detection results from optical flow).

It can be observed that the former provides a more accurate tracking input; nevertheless, the original detections in complex environments cannot be ignored. To compensate for the adverse effects on performance, the two are combined as the input for current-frame tracking, providing a more reliable motion state for KF. At the same time, a constant object velocity within the frame is assumed, and KF is used to construct a linear motion model defined in 8-dimensional space:

$$ S = (x,y,\gamma ,h,\dot{x},\dot{y},\dot{\gamma },\dot{h}) $$


where (x, y) are the bounding box center coordinates, \(\gamma\) is the aspect ratio, h is the height, and \((\dot{x},\dot{y},\dot{\gamma },\dot{h})\) are the corresponding velocities of the object in the frame.
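The paper does not specify exactly how the detector box and the optical-flow prediction are merged before the KF update; a simple weighted average is one plausible scheme, sketched here with an illustrative weight:

```python
def fuse_boxes(det_box, flow_box, w_det=0.7):
    """Blend the detector box with the optical-flow prediction before the KF
    update. w_det = 0.7 is an assumed weight favoring the detector; when the
    detector misses entirely, the flow box alone (w_det = 0) can stand in."""
    return tuple(w_det * d + (1.0 - w_det) * f
                 for d, f in zip(det_box, flow_box))

fused = fuse_boxes((100.0, 50.0, 40.0, 30.0), (104.0, 48.0, 40.0, 30.0))
```

The key benefit is continuity: even on frames where the detector fails, the KF still receives a measurement, so the track is not interrupted.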

Filtering of low confidence tracks

False positive tracks derived from unreliable detection results seriously degrade tracker performance, and even the most advanced detection-based tracking methods still suffer from large numbers of false positive tracks. To address this problem, a filter for low confidence tracks was added to our tracker.

In this tracker, a confidence threshold \(t_{d}\) filters out detections with confidence below it, and in addition the average confidence of the new detections associated with a tentative track over the frames \((f + 1),(f + 2),\;…\;(f + t_{tentative} )\) is calculated. Only when this average exceeds the predefined threshold \(t_{{ave_{d} }}\) is the tentative track updated to a confirmed track; otherwise, it is deleted. The detections are thus filtered in two threshold stages, \(t_{d}\) and \(t_{{ave_{d} }}\), rather than by \(t_{d}\) alone. Therefore, \(t_{d}\) can be preset to a lower value to avoid missed detections, while the second stage suppresses the false positive tracks produced by the low \(t_{d}\). The algorithm used in this study to filter low confidence tracks is detailed in Appendix B of the supplementary material.
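The second stage of the filter reduces to a single averaging test over the tentative frames; a minimal sketch with an illustrative threshold value:

```python
def confirm_tentative(confidences, t_ave_d=0.5):
    """Second-stage filter: confirm a tentative track only if the mean
    confidence of its associated detections over the tentative frames exceeds
    t_ave_d. The threshold 0.5 is an assumed value, not the authors' setting."""
    return sum(confidences) / len(confidences) > t_ave_d

confirm_tentative([0.9, 0.8, 0.7])   # strong detections -> confirmed
confirm_tentative([0.2, 0.3, 0.4])   # weak detections -> deleted
```

A false positive track built from marginal detections that barely pass a low \(t_{d}\) rarely sustains a high average confidence, which is why averaging over several frames suppresses it.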

Visual servo control

In this study, a servo control system using helicopters and cameras62 is applied for MOT. The system consists of 4 control variables, including lateral control, longitudinal control, vertical control, as well as yaw rate control.

The lateral control keeps the camera frame center aligned with the horizontal center of the tracked objects using a PID controller that takes the sum of the horizontal distances of each object as the proportional input, the sum of the differences between the current and previous centers as the derivative input, and the cumulative error as the integral input.

According to the PID formula, the lateral speed \(\dot{x}_{UAV}\) in the lateral coordinate system of the UAV is calculated as follows:

$$ \dot{x}_{UAV} = Kp_{x} Sp_{x} + Ki_{x} Si_{x} + Kd_{x} Sd_{x} $$
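The lateral speed formula above is a textbook discrete PID law; a minimal sketch, with illustrative gains rather than the tuned values used on the P600:

```python
class PID:
    """Discrete PID controller mapping a pixel error to a speed command, as in
    the lateral-control formula; the gains here are illustrative."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt=1.0):
        self.integral += error * dt                       # Si: cumulative error
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error                           # Sd: error change rate
        return self.kp * error + self.ki * self.integral + self.kd * deriv

lateral = PID(kp=0.004, ki=0.0001, kd=0.001)
lateral.step(50.0)   # objects 50 px right of frame centre -> small right speed
```

The same class can serve the longitudinal and vertical loops with their own gains, since only the error definitions differ between the 4 control variables.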


The longitudinal control adjusts the forward and backward speed of the helicopter based on the heights of bounding boxes of the objects, which indicate the distance of objects to the camera. This control unit uses a PID controller that takes the sum of differences between current and minimum heights and between current and maximum heights as the proportional input for calculation of forward and backward speeds, respectively. Besides, it takes the sum of height change rates of each object as the derivative input.

The speed required for longitudinal control of the UAV is divided into two parts: one based on the heights of the objects, and the other based on the area they occupy. Therefore, the final longitudinal velocity \(\dot{y}_{UAV}\) in the UAV coordinate system is calculated by formula (6).

$$ \begin{aligned} \dot{y}_{UAV} & = \dot{y}h_{UAV} + \dot{y}a_{UAV} = Kp_{y} Sp_{y} \_a + Kd_{y} Sd_{y} \_a + Kp_{y} Sp_{y} \_h_{b} \\ & + b_{f} Kp_{y} Sp_{y} \_h_{f} + Kd_{y} Sd_{y} \_h \\ \end{aligned} $$


The vertical control loosely regulates the height of the UAV within a predefined range. Compared with its response to lateral speeds, the autopilot responds relatively slowly to the low vertical speeds needed for accurate height adjustment, so the UAV's height often does not change after such a vertical speed command is received.

The yaw rate control rotates the helicopter around its vertical axis to keep it perpendicular to the line connecting the two outermost objects in the camera frame. The yaw angle is estimated from the ratio between horizontal distance and image width, and the ratio between height difference and the standard height of each object class.

Since the final command is a yaw rate, the calculated yaw angle is divided by the processing time and multiplied by a factor, as shown in Eq. (7).

$$ \dot{\phi }_{UAV} = Kp_{\phi } \frac{{\phi_{UAV} }}{\Delta t} $$
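Eq. (7) is a one-line computation; a sketch with an illustrative gain value:

```python
def yaw_rate(phi_uav, dt, kp_phi=0.8):
    """Eq. (7): convert the estimated yaw misalignment angle phi_uav (rad)
    into a yaw-rate command over the frame processing time dt (s).
    kp_phi = 0.8 is an assumed gain, not the authors' tuned value."""
    return kp_phi * phi_uav / dt

yaw_rate(0.2, dt=0.05)   # 0.2 rad misalignment, 20 fps processing
```

Dividing by the processing time makes the command frame-rate independent: a slower detection loop automatically yields a gentler rotation per command.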


In summary, having introduced the calculation of the required speeds in all 4 directions, the complete final speed command based on the world transformation of Eqs. (5), (6) and (7) is as follows:

$$ \begin{aligned} V_{world} = & \left[ {\begin{array}{*{20}c} {\dot{x}} \\ {\dot{y}} \\ {\dot{z}} \\ {\dot{\psi }} \\ \end{array} } \right]_{World} = \left[ {\begin{array}{*{20}c} {\cos \psi } & { – \sin \psi } & 0 & 0 \\ {\sin \psi } & {\cos \psi } & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{array} } \right] \\ & \left[ {\begin{array}{*{20}c} {Kp_{x} Sp_{x} + Ki_{x} Si_{x} + Kd_{x} Sd_{x} } \\ \begin{gathered} Kp_{y} Sp_{y} \_a + Kd_{y} Sd_{y} \_a + Kp_{y} Sp_{y} \_h_{b} \hfill \\ + b_{f} Kd_{y} Sd_{y} \_h_{f} + Kd_{y} Sd_{y} \_h \hfill \\ \end{gathered} \\ {Kp_{z} \Delta h} \\ {Kp_{\psi } \frac{\psi }{\Delta t}} \\ \end{array} } \right]_{UAV} \\ \end{aligned} $$


The flight controller calculates the expected acceleration (that is, the three-axis expected thrust) from \(V_{World}\) (the expected velocity) and the current velocity, and then converts it into the desired attitude angles according to the UAV dynamics model. The highly dynamic control algorithm of the UAV attitude loop ensures fast and stable attitude tracking.
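The body-to-world rotation in the final speed equation only involves the yaw angle \(\psi\); a minimal sketch of applying it to a 4-D command vector:

```python
import numpy as np

def body_to_world(v_uav, psi):
    """Rotate the body-frame command (vx, vy, vz, yaw_rate) into the world
    frame using the yaw rotation matrix of the final speed equation;
    vertical speed and yaw rate are unaffected by yaw."""
    c, s = np.cos(psi), np.sin(psi)
    R = np.array([[c, -s, 0.0, 0.0],
                  [s,  c, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    return R @ np.asarray(v_uav)

# facing 90 degrees left of world-x: a forward body command becomes world-y motion
v_world = body_to_world([1.0, 0.0, 0.5, 0.1], psi=np.pi / 2)
```

Sending commands in the world frame lets the autopilot hold the intended ground track regardless of how the yaw controller is simultaneously rotating the airframe.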
