RESEARCH ARTICLE

Joint fuzzy background and adaptive foreground model for moving target detection

  • Dawei ZHANG 1 ,
  • Peng WANG 1 ,
  • Yongfeng DONG 1 ,
  • Linhao LI 1,
  • Xin LI 2
  • 1. School of Artificial Intelligence, Hebei University of Technology, Tianjin 300401, China
  • 2. Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506-6109, USA
lilinhao@hebut.edu.cn

Received date: 17 Feb 2022

Accepted date: 18 Dec 2022

Copyright

2024 Higher Education Press

Abstract

Moving target detection is one of the most basic tasks in computer vision. In conventional wisdom, the problem is solved by iterative optimization under either the Matrix Decomposition (MD) or the Matrix Factorization (MF) framework. MD utilizes foreground information to facilitate background recovery; MF uses noise-based weights to fine-tune the background. Thus both noise and foreground information contribute to background recovery. Inspired by the complementary characteristics of the two frameworks, we propose to exploit their advantages simultaneously in a unified framework called Joint Matrix Decomposition and Factorization (JMDF). To improve background extraction, a fuzzy factorization is designed, in which the fuzzy membership of each pixel to background/foreground is calculated during the factorization process to distinguish their respective contributions to background estimation. To describe the spatio-temporal continuity of the foreground more accurately, we propose to incorporate the first-order temporal difference into the group sparsity constraint, with the temporal constraint adjusted adaptively. The foreground and the background are jointly estimated through an effective alternating optimization process, and the noise is modeled with a specific probability distribution. Experimental results on numerous real videos illustrate the effectiveness of our method. Compared with current state-of-the-art techniques, our method usually recovers a cleaner background and extracts a more accurate foreground. Anti-noise experiments show the noise robustness of our method.

Cite this article

Dawei ZHANG, Peng WANG, Yongfeng DONG, Linhao LI, Xin LI. Joint fuzzy background and adaptive foreground model for moving target detection[J]. Frontiers of Computer Science, 2024, 18(2): 182306. DOI: 10.1007/s11704-022-2099-0

1 Introduction

Moving target detection [1] is a problem that has been extensively studied in computer vision and video processing applications, such as object tracking [2] and action recognition [3]. Foreground (FG), often consisting of moving targets, corresponds to the pixels that change over time; by contrast, Background (BG) is characterized by relatively stable pixels and is considered low-rank [4, 5]. The task of moving target detection is intrinsically connected with the separation of FG from BG (a.k.a. background extraction or subtraction [6]). Pioneering works in this field include both parametric models (e.g., the mixture of Gaussians [7]) and non-parametric models [8].
Rapid advances in the theory of sparse coding and matrix completion have offered new tools for modeling background subtraction in the past decade. Since then, optimization approaches for matrix-based background models can be classified into two categories: Matrix Decomposition (MD) [9-11] vs. Matrix Factorization (MF) [6,12,13], as shown in Fig.1(a) and (b). The former decomposes the video data into two additive components - i.e., low-rank BG and sparse FG. Due to the formulation of linear decomposition, the two components can be optimized iteratively [11] until a compromised solution is reached - e.g., via the Alternating Direction Method of Multipliers (ADMM) [14]. At this stage, non-convex surrogates of the nuclear norm are mainly used to constrain BG. Most studies introduce temporal and spatial prior information of FG into the model (e.g., Markov random fields (MRF) [15], total variation [16], group sparsity [11], and the graph Laplacian operator [13]).
Fig.1 Comparison of the proposed combined methodology and two conventional frameworks. (a) The matrix decomposition (MD) approach; (b) the matrix factorization (MF) approach; (c) the proposed JMDF: we simultaneously take the FG model, BG model, and noise model into account by introducing the fuzzy membership degree

The latter class of MF models [6,12,13] only considers the low-rank BG as the main component and regards the rest (FG and noise) as outliers. By imposing various additional assumptions such as Bayesian priors [17], mixed noise [18,19,12] and non-negativity requirements [20], MF aims at extracting the BG component as a low-rank matrix from the video directly. Additional constraints on FG components, such as smoothness, can be enforced by standard total variation (TV)-based regularization [6].
It is enlightening to observe the complementary nature of MD-based and MF-based approaches, both of which are designed to optimize the non-convex background modeling task. On one hand, MD-based models optimize BG and FG simultaneously, and accurately detect moving objects by exploiting the local continuity constraint of FG components. However, this class of models has shown limited capability in handling complex BG such as dynamic scenes [21] - e.g., slight changes in BG may be labelled as FG objects. Moreover, an accurate FG is expensive to obtain, because common constraints, such as the group sparsity-inducing norm [11], cannot produce satisfactory performance without extensive manual parameter tuning.
On the other hand, MF-based models focus on optimizing the underlying low-rank structure of BG only, overlooking the rich constraints of the FG component. Consequently, MF-based models tend to suffer from sensitivity to outliers - e.g., some localized objects in FG might be mistaken for noise perturbations in BG due to overemphasis on the low-rank constraint of BG. Thus, how to achieve an improved tradeoff between sensitivity and robustness for videos of diverse content has remained an open question in the literature on moving target detection.
Inspired by these observations, we propose to combine these two lines of research into a unified framework called Joint Matrix Decomposition and Factorization (JMDF), as shown in Fig.1(c). The strategy of joint optimization promotes the interaction between FG and BG, which helps both reach, or get closer to, the optimal solution. Importantly, we relax the strict low-rank constraint of the MF-based BG model by using a new fuzzy membership formulation. More specifically, motivated by the noise-derived weights for matrix factorization [6], we propose a fuzzy membership defined by noise and foreground information for each pixel in the video, denoting soft labels (FG/BG) instead of a hard classification. The introduction of fuzzy membership is particularly effective for describing the fuzzy, opposing relationship between FG and BG, as well as for the class of videos containing complex content such as interference in the BG. Additionally, an adaptively anisotropic neighborhood sparsity constraint is constructed to ensure the spatio-temporal continuity of FG pixels. Such joint modeling of BG, FG, and noise components leads to a new optimization problem, which can be solved by ADMM. The main contributions of this paper are as follows.
1) We present a unified JMDF framework for moving target detection, combining the MD-based and MF-based optimization approaches. By jointly optimizing the two classes of models, we can iteratively refine the FG and BG estimates. We have derived an ADMM-based solution algorithm for the formulated joint optimization problem.
2) We propose a fuzzy membership degree to represent the soft association between each pixel and BG. The fuzzy membership is designed based on noise analysis and FG information, and the membership degree at each position is fused with its first-order temporal information. It is adaptively updated in iterations, robustly handling FG interference and assisting the factorization model to restore a clean static BG. Moreover, the fuzzy membership can also complete the FG to a certain extent.
3) We have constructed an adaptive generalized group sparsity constraint for promoting the spatio-temporal continuity of moving objects in the FG. Existing group sparsity-based models only enforce the spatial continuity of moving objects in FG. To enforce temporal consistency along with spatial continuity, we propose to bring the first-order temporal difference of adjacent frames into structural sparsity modeling. The correlation of adjacent video frames is used to adaptively adjust the tradeoff between the spatial and temporal continuity constraints.
4) Our developed JMDF model has been experimentally verified on two commonly used moving target detection video datasets, I2R [22] and CDnet2014 [23]. In comparison with six leading algorithms, we have found that the proposed JMDF usually performs at least as well as the competing methods in terms of F-measure. The performance of JMDF on moving target detection is generally better for videos containing dynamic and complex BG.
The rest of this paper is organized as follows. In Section 2, we briefly review existing work on foreground detection and background subtraction. Section 3 gives the JMDF model definition and the derivation of the solution algorithm. Section 4 presents the experimental results, including comparisons between JMDF and other methods. Finally, we make some concluding remarks about this work and future research directions in Section 5.

2 Related work

Existing moving target detection algorithms can be divided into four categories: statistical methods and low-rank based models (MD and MF), which solve the task without supervised information; deep learning approaches, which are built on sufficient supervised training data; and signal processing techniques, among which some semi-supervised models have been constructed.
Statistical models Statistics-based BG models usually use single pixel values or neighborhood pixels as input features [24]. In recent years, a series of improved Mixture of Gaussians (MoG) distributions [25,26] have been proposed to establish BG models. Elguebaly et al. [27] adopted asymmetric Gaussian mixtures to perform background subtraction. Kernel density estimation is a non-parametric method used to describe the distribution of unknown data. Yang et al. [28] employed two thresholds to describe the uncertainty of color video detection based on Visual Background Extraction (ViBe). Self-organizing map networks can be used for moving target detection in dynamic BG and improve the robustness of the model to illumination [29]. Wang et al. [30] proposed joint Gaussian conditional random fields for BG extraction.
Low-rank and sparsity models The objective of MD-based methods is to estimate both FG and BG. In surveillance video taken by a fixed camera, the BG is stable and similar across frames. Therefore, it is usually estimated using a low-rank approximation (e.g., the nuclear norm [9], the weighted Schatten p-norm [31] and tensor-wise norms [32]). In addition, the moving objects take up only a tiny portion of the scene, so the FG pixels are considered sparse. Robust Principal Component Analysis (RPCA) [5,9] decomposes the video into a low-rank BG matrix and a sparse FG matrix. Since then, stable principal component pursuit [33], online RPCA [34], and fast low-rank approximation [35,36] methods have been proposed. Markov Random Fields (MRF) are adopted to constrain the local smoothness of FG [15,37]. These are joint optimizations of a low-rank matrix and a discrete FG indicator matrix, and show high performance in eliminating the interference of dynamic BG and completing the FG. MRF can also be used as a post-processing step for the original RPCA [38]. Javed et al. [39] considered the impact of global illumination changes on foreground detection, and proposed online RPCA based on MRF and multiple features. Li et al. [40] proposed a foreground detection method based on multi-modal fusion of grayscale and thermal video, which adaptively pursues a cross-modality low-rank representation. However, these models ignore the prior spatial structure and the temporal continuity prior of the FG, resulting in partial loss of the extracted FG and interference from dynamic BG.
Within MD, most existing work imposes spatial structural sparsity constraints and temporal continuity constraints on the FG, forming many variants. The two-dimensional Total Variation (2DTV) norm was used to constrain the spatial continuity of the FG [41]. The three-dimensional Total Variation (3DTV) norm was introduced to constrain the FG in the spatio-temporal dimensions [16,42]. Liu et al. [11] replaced the l1 norm on the FG with a group sparsity constraint and proposed Low-rank and Structured Sparse Decomposition (LSD). Zhang et al. [43] added noise modeling to LSD to improve the robustness of the model to interference noise. Javed et al. [13] adopted the graph Laplacian operator to apply spatio-temporal regularization to the sparse FG component, thereby constructing superpixel-based temporal and spatial feature maps of the video matrix. Wu et al. [44] expressed the observed video as static BG, dynamic BG and sparse FG, and used a superpixel-based l0-norm group sparsity constraint to model the FG. Liu et al. [45] used RPCA to obtain the trajectory of the sparse FG in the X-T and Y-T time slices, and then converted the FG modeling task into a structured sparse signal recovery problem. Xin et al. [46] introduced an adaptive Generalized Fused Lasso (GFL) for structural modeling of the FG. Javed et al. [47] combined superpixels and GFL constraints and proposed background subtraction via online matrix decomposition. Ye et al. [48] proposed a Motion-Assisted Matrix Restoration (MAMR) model for FG-BG separation of video clips, in which a dense motion field is estimated for each frame and mapped to a weighting matrix.
MF-based methods take BG modeling as the main target; the FG is extracted by means of background subtraction and threshold segmentation. Traditional MF algorithms mostly use the l2 or l1 norm as the noise measure, assuming that the noise obeys a Gaussian or Laplace distribution. This simple assumption on the noise distribution is not suitable for video modeling tasks: many factors in a video, such as shadows, changing light and dynamic BG, make the noise complicated. To solve this problem, Meng et al. [18] were the first to use MoG to model complex noise within low-rank factorization. They then extended it to a mixture of exponential power distributions [10] and MRF-based noise modeling [49]. Yong et al. [6] extended the offline detection method of [18] to an online form, and used the 2DTV norm as a post-processing step. Guo et al. [12] regarded foreground detection as an outlier estimation problem; considering that the discrete FG and BG indicator matrices are difficult to optimize, these two matrices were relaxed to continuous ones, and the l1 norm and maximum entropy constraints were applied. It is well known that the l1 norm is more robust to sparse outlier noise than the l2 norm, so it is more reasonable to measure the noise with the l1 norm during factorization. However, the l1 norm is non-smooth, resulting in slow solutions. To address this problem, Liu et al. [50] used a smoothed l1 norm to handle the background subtraction task.
Deep learning-based approaches A recent trend for detecting moving targets in video data is to automatically extract discriminative features with powerful deep learning architectures [51-55]. This class of data-driven approaches has strict requirements on collecting and annotating an effective training set, which is often time-consuming and resource-expensive in practical applications. Meanwhile, it is generally difficult to take varying uncertainty factors (e.g., noise interference, camera motion, and illumination conditions) into account while constructing the training datasets used by deep models.
Signal processing-based approaches A video can be decomposed into a stable background and a sparse foreground, which resembles the decomposition of a signal into a stable component and sparse noise. Therefore, some researchers try to solve the problem with signal processing-based approaches [56]. The sparsity-based models are effectively matrix-wise sparse representations of the video signal [57]. In addition, filtering-based approaches [58] and graph signal processing-based approaches have also shown their effectiveness. Notably, graph signal processing-based approaches [59] are semi-supervised methods that require less labeled data than deep learning methods.

3 Proposed method

3.1 Overall model

Given a sequence of $n$ video frames, each of height $h$ and width $w$, we stretch these frames into column vectors and stack them as a matrix $D \in \mathbb{R}^{m \times n}$, $m = h \times w$. In an ideal and over-simplified situation, each frame contains only the static BG $B \in \mathbb{R}^{m \times n}$ and the moving FG $S \in \mathbb{R}^{m \times n}$. The observation matrix $D$ can be decomposed into $D = B + S$. The task of moving target detection can be formulated as the following optimization problem:
$$\min_{B,S} \|B\|_* + \kappa \|S\|_1 \quad \mathrm{s.t.}\ D = B + S, \tag{1}$$
where $\|\cdot\|_*$ and $\|\cdot\|_1$ denote the nuclear norm and the $l_1$ norm, respectively, and $\kappa$ controls the sparsity of the matrix $S$. Later, taking into account the dynamic BG and other interference components in the video, some works introduced a noise term $E$ into Eq. (1), i.e., $D = B + S + E$. Generally, the $l_1$ norm or $l_2$ norm is used to model $E$.
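For concreteness, the following is a minimal sketch (not the authors' algorithm) of how the MD formulation in Eq. (1) can be solved by ADMM, using singular value thresholding for the nuclear norm and soft thresholding for the $l_1$ norm; the default values of $\kappa$ and $\mu$ are common heuristics from the RPCA literature, not taken from this paper.

```python
import numpy as np

def svt(X, tau):
    # Singular value thresholding: proximal operator of the nuclear norm.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft(X, tau):
    # Entrywise soft thresholding: proximal operator of the l1 norm.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca_admm(D, kappa=None, mu=None, n_iter=100):
    """Solve Eq. (1): min ||B||_* + kappa * ||S||_1  s.t.  D = B + S."""
    m, n = D.shape
    if kappa is None:
        kappa = 1.0 / np.sqrt(max(m, n))               # common default weight
    if mu is None:
        mu = 0.25 * m * n / (np.abs(D).sum() + 1e-12)  # heuristic penalty
    B, S, Z = np.zeros_like(D), np.zeros_like(D), np.zeros_like(D)
    for _ in range(n_iter):
        B = svt(D - S + Z / mu, 1.0 / mu)       # low-rank BG update
        S = soft(D - B + Z / mu, kappa / mu)    # sparse FG update
        Z += mu * (D - B - S)                   # dual ascent on D = B + S
    return B, S
```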
To better capture the low-rank structure of BG contaminated by noise of unknown distribution, matrix factorization (MF) was recently introduced to the task of BG modeling in online Mixture-of-Gaussians MF (OMoGMF) [6]. The basic idea behind low-rank MF (LRMF) is to factorize the data matrix into a pair of low-rank matrices $(U, V)$. The general form of MF is as follows:
$$\min_{U,V} \|W \odot (D - UV)\|_p, \tag{2}$$
where $W$ is a weight matrix deduced from the noise modeling results and $\odot$ denotes the element-wise product. In the above MF model, BG is considered the primary component of the video data; FG and noise are treated as outliers characterized by a mixture of unknown probability distributions.
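As an illustration of this class of models, here is a minimal alternating weighted least-squares sketch for the $p = 2$ case of Eq. (2); the small ridge term and the random initialization are our own numerical safeguards, not part of any published method.

```python
import numpy as np

def weighted_lrmf(D, W, r=1, n_iter=50):
    """Alternating weighted least squares for min ||W * (D - U V)||_F^2."""
    m, n = D.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((m, r))
    V = rng.standard_normal((r, n))
    W2 = W ** 2
    for _ in range(n_iter):
        for j in range(n):  # each column of V solves its own weighted LS problem
            A = U * W2[:, j:j + 1]                      # rows of U scaled by weights
            V[:, j] = np.linalg.solve(A.T @ U + 1e-8 * np.eye(r), A.T @ D[:, j])
        for i in range(m):  # each row of U likewise
            A = V * W2[i:i + 1, :]                      # columns of V scaled by weights
            U[i, :] = np.linalg.solve(A @ V.T + 1e-8 * np.eye(r), A @ D[i, :])
    return U, V
```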
As mentioned above, both MD-based and MF-based approaches have their own strengths and limitations. We combine their advantages and propose the JMDF framework - i.e.,
$$\min_{U,V,B,S} F(D - B) + \lambda\, \Omega(S) \quad \mathrm{s.t.}\ D = B + S + E,\ B = UV, \tag{3}$$
where $F$ denotes an objective function derived from the MF model and $\Omega$ is a regularization function designed for the moving FG $S$. We first summarize our new modeling strategies, and then elaborate on the design of the BG/FG models, respectively.
First, we adopt matrix factorization with rank $r = 1$ to extract the static BG, rather than relaxation terms such as the nuclear norm, so as to avoid time-consuming computation. Besides, we introduce fuzzy mathematics into the BG model. Specifically, FG and BG are treated as vague concepts. The background membership degree indicates the probability that a pixel belongs to BG. It is designed not only from the noise modeling result, but also takes the FG information into account, because the FG interferes more strongly with BG restoration. This strategy can restore a clean static BG and, to a certain extent, complete the FG.
Second, following the ideas of [43,44], we assume that other noise, such as dynamic BG, comes from a Gaussian distribution with zero mean, so it is reasonable to model the noise with a Gaussian kernel. The key step of noise modeling is estimating the distribution variance. The estimate of the structured sparse FG is used to mark the regions that may be affected by Gaussian noise, and the variance of the Gaussian distribution is estimated from the noise in these regions.
Third, for FG modeling, unlike the existing group sparsity-inducing norms, a novel spatio-temporal sparsity constraint is introduced in our model. Specifically, we design an adaptive generalized group sparsity constraint that trades off the spatial and temporal consistency of FG. In view of the uncertainty of temporal sampling of video frames, we propose to adaptively weight the spatio-temporal neighborhood sparsity terms to better suppress dynamic BG interference and enhance FG integrity.
In summary, our JMDF model simultaneously takes BG, FG, and noise modeling into account as shown in Fig.1(c). The detailed flowchart of our moving target detection algorithm based on JMDF model is shown in Fig.2. We will elaborate on the design of BG and FG models in the following subsections.
Fig.2 Flowchart of the proposed framework. First, the video BG is modeled by fuzzy factorization. Second, background subtraction is conducted, and the spatio-temporal constraints are applied to obtain the FG component. After that, the Gaussian noise in the residual and the FG component jointly update the membership degree. Finally, the above process is iterated until convergence

3.2 Fuzzy factorization of BG

In conventional low-rank models, a binary indicator matrix is used to denote the hard association of each pixel with BG or FG. Such a hard decision-based approach has been extended into a soft decision-based approach in MF. However, the Expectation-Maximization (EM) method adopted in [6] tends to get trapped in local optima. Meanwhile, the model in [12] relies heavily on pre-specified priors for the hyperparameters and lacks adaptivity, which degrades its performance on real-world data. Besides, soft decision-based approaches basically stay within the MF framework, so they cannot use FG information to refine the BG.
Based on the above observations, we propose a novel fuzzy factorization approach with soft decisions. Specifically, each pixel has a fuzzy association with BG/FG. If there is stronger evidence that a pixel $(i,j)$ is BG, the likelihood constraint $D_{ij} = B_{ij}$ will be strengthened; otherwise, it will be weakened. This intuition can be formalized as the following fuzzy matrix factorization:
$$\min_{U,V,M} \|M \odot (D - B)\|_F^2 \quad \mathrm{s.t.}\ B = UV, \tag{4}$$
where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{r \times n}$ are low-rank matrices; we set $r = 1$ in this paper. $B = UV$ explicitly enforces the low-rank constraint on BG. Our design of $M \in [0,1)^{m \times n}$ - the BG membership degree matrix - reflects the objective of improved robustness and adaptivity, as we elaborate next. If $M_{ij}$ represents the probability that pixel $(i,j)$ is classified as BG, then $1 - M_{ij}$ is the likelihood that the pixel belongs to FG. First, we propose to define $M_{ij}$ by a Gaussian kernel and the estimated FG:
$$M_{ij} = \begin{cases} \exp\left(-\dfrac{(D_{ij}-B_{ij})^2}{2\sigma^2}\right), & S_{ij} = 0, \\ 0, & S_{ij} \neq 0, \end{cases} \tag{5}$$
where $D_{ij}$, $B_{ij}$, and $S_{ij}$ represent the elements of $D$, $B$, and $S$, respectively. Note that $\sigma^2$ is the adaptive fuzzy factor, i.e., the estimate of the noise distribution variance. Then, the first-order temporal information is introduced into the membership matrix; the final membership expression is as follows:
$$M_{ij} = \frac{1}{1+|TN(i,j)|}\left[\exp\left(-\frac{(D_{ij}-B_{ij})^2}{2\sigma^2}\right) I(S_{ij}=0) + \sum_{(p,q)\in TN(i,j)} \exp\left(-\frac{(D_{pq}-B_{pq})^2}{2\sigma^2}\right) I(S_{pq}=0)\right], \tag{6}$$
where $TN(i,j)$ denotes the index set of the first-order temporal neighborhood of index $(i,j)$, and $I(\cdot)$ is the indicator function: if the input condition is true, the output is 1; otherwise, 0.
The Gaussian kernel and the FG information contribute to the improved robustness of our fuzzy factorization model. From the perspective of a hard decision, if $S_{ij} = 0$, the pixel at position $(i,j)$ is classified as BG. However, in each iteration, $S$ is estimated by the FG model, so its result is not fully reliable. For a soft decision, it is necessary to assign a confidence to $S_{ij} = 0$. In this case, we assume that position $(i,j)$ is BG and is affected only by Gaussian noise, i.e., $D_{ij} = B_{ij} + E_{ij}$, instead of the original decomposition form ($D = B + S + E$). The confidence of this hypothesis, namely the background membership degree, is given by the Gaussian kernel of the estimated noise, so it is negatively correlated with the noise intensity. Specifically, the lower the noise intensity, the greater the probability that position $(i,j)$ is a BG pixel; conversely, the higher the noise intensity, the smaller that probability. This reduces the chance that FG is incorrectly classified as BG. If $S_{ij} \neq 0$, we make a hard decision on position $(i,j)$: compared with BG being incorrectly classified as FG, FG interference aggravates the BG estimation error more, so we set the membership to zero to alleviate this interference and improve robustness.
It is worth noting that the FG is one of the components of the membership degree, so the membership should also exhibit spatio-temporal structure. However, the FG term mainly endows the membership with spatial characteristics. We therefore fuse the first-order temporal information of the membership degree to highlight its temporal structure.
The fuzzy factor $\sigma^2$ in Eq. (5) is updated in each iteration, which improves the adaptability of the soft decision making. It is updated according to the FG estimation results and the standard variance estimator of the Gaussian distribution:
$$\sigma^2 = \frac{1+\sum_{i,j} I(S_{ij}=0)(D_{ij}-B_{ij})^2}{1+\sum_{i,j} I(S_{ij}=0)}. \tag{7}$$
At the initial stage of the iterative optimization, the estimation error of the FG model is very large, and a sparse solution may not even be generated. To avoid a degenerate variance estimate, a constant term is added to the traditional variance estimator.
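The two updates above can be sketched in a few lines of NumPy; here we assume, as one plausible reading of $TN(i,j)$, that the first-order temporal neighborhood of a pixel consists of the same position in the previous and next frames (columns of $D$).

```python
import numpy as np

def update_sigma2(D, B, S):
    """Adaptive fuzzy factor, Eq. (7): residual variance on estimated BG pixels."""
    bg = (S == 0)
    return (1.0 + ((D - B)[bg] ** 2).sum()) / (1.0 + bg.sum())

def update_membership(D, B, S, sigma2):
    """Fused background membership, Eq. (6): per-pixel Gaussian kernel averaged
    with its first-order temporal neighbors (adjacent frames, same position)."""
    kernel = np.exp(-(D - B) ** 2 / (2.0 * sigma2)) * (S == 0)   # Eq. (5)
    n = D.shape[1]
    M = np.zeros_like(D)
    for k in range(n):
        nbrs = [t for t in (k - 1, k + 1) if 0 <= t < n]         # TN(i, k)
        M[:, k] = (kernel[:, k] + sum(kernel[:, t] for t in nbrs)) / (1 + len(nbrs))
    return M
```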
In addition, both $M_{ij}$ and $\sigma^2$ are adaptively updated in each iteration. We present the detailed formulation and solution of the JMDF model in the following sections.
To illustrate the advantages of the proposed fuzzy membership degree in BG and FG modeling more clearly, Fig.3 shows the background membership heat maps and the FG masks during the iterative process. The dynamic BG and FG are unstable components that destroy the low rank of the real BG. It can be seen from the heat maps that these unstable components are assigned smaller background membership degrees, thereby reducing their interference with the BG estimation. Moreover, although the extracted FG is partially missing in the fifth iteration, our noise-focused strategy gives these missing parts small background membership degrees, leaving an opportunity to complete the FG in subsequent iterations. As the number of iterations increases, the background membership degree plays a more significant role in FG completion; in the ninth iteration, our model estimates a substantially complete FG mask.
Fig.3 Background membership degree in the iterative process. The leftmost image is the original video frame. For the other images, from left to right, there are the FG masks and background membership heat maps in the fifth, seventh and ninth iterations

3.3 Spatio-temporal constraints of FG

To promote the spatio-temporal continuity of FG, we propose an adaptive generalized group sparsity constraint, which is defined as:
$$\Omega(S) = \lambda \sum_{k=1}^{n} \sum_{g_{ij}\in G} \left\|S_{:k}^{g_{ij}}\right\|_\infty + \alpha \sum_{k=2}^{n} \beta_{k,k-1} \left\|S_{:k} - S_{:k-1}\right\|_F^2, \tag{8}$$
where $S_{:k}$ is the $k$-th column of $S$, $S_{:k}^{g_{ij}} \in \mathbb{R}^{|g_{ij}|}$ is a locally overlapping group defined by $S_{ij}$ and its spatial neighborhood elements, $G$ is the set of overlapping groups in each image, and $\beta_{k,k-1}$ denotes an adaptive weight for temporal consistency. To characterize the spatial support of FG, we choose a circular neighborhood $g_{ij} = \{(p,q) \mid (p-i)^2 + (q-j)^2 \le r_N^2\}$ with radius $r_N$. The pair of regularization parameters $\lambda > 0$, $\alpha > 0$ jointly controls the tradeoff between the group sparsity penalties on $S$ in space and time.
The first term of Eq. (8) corresponds to the spatial continuity constraint of the FG. It divides the frame to be processed into many overlapping groups via a sliding window. $\|\cdot\|_\infty$ is the $l_\infty$ norm, which returns the maximum absolute value of the elements in a vector. The first term can thus be regarded as the $l_1$ norm of the vector composed of the $l_\infty$ norms of the overlapping groups. It forces the elements within a group to maintain similarity and drives some overlapping groups to zero.
In our current implementation, we extend the conventional rectangular sliding window [11,43] to a circular window. Using the center of the circle as an anchor point, we obtain circular neighborhoods of varying radius, as shown in Fig.4. When the radius is 1.5, the circular neighborhood degenerates into a 3×3 rectangle (green blocks in the top row). As the radius increases to 2, the shape of the circular window becomes anisotropic for pixels at the borders (the orange block is the anchor point; blue squares denote the spatial neighborhood). Thanks to the well-known isoperimetric theorem [60], the area of a circular window is strictly larger than that of a square window with the same perimeter, which means our method can capture a higher-order neighborhood relationship in space. As will be verified in the experimental section, a neighborhood radius of 1.5 is appropriate for subsampled video frames, while the model achieves better results on raw video frames when the neighborhood radius is set to 2.
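The circular neighborhood itself is straightforward to enumerate; the following snippet reproduces the two cases discussed above (9 offsets at $r_N = 1.5$, i.e., the 3×3 window, and 13 offsets at $r_N = 2$).

```python
import numpy as np

def circular_offsets(r_N):
    """Offsets (dp, dq) of the circular neighborhood with radius r_N."""
    R = int(np.floor(r_N))
    return [(dp, dq) for dp in range(-R, R + 1) for dq in range(-R, R + 1)
            if dp * dp + dq * dq <= r_N * r_N]

print(len(circular_offsets(1.5)))  # 9  -> degenerates to the 3x3 square window
print(len(circular_offsets(2.0)))  # 13 -> anisotropic, cross-shaped window
```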
Fig.4 On the left, the matrix represents the image to be processed, where the label is the index of each pixel. The top-right sub-graph shows the conventional neighborhood of pixels. The bottom-right illustrates some potential group neighborhood patterns located at different positions of the image under different circular neighborhood radii

The second term of Eq. (8) is the first-order temporal difference, which promotes the consistency of moving targets between adjacent frames. Intuitively, we would like the weighting coefficient $\beta_{k,k-1}$ to reflect the correlation between the $k$-th and $(k-1)$-th FG frames. In view of the iso-intensity constraint along the motion trajectory [61], surveillance video possesses strong temporal correlation, especially in the absence of camera motion. We advocate the use of the $l_1$ distance between adjacent video frames for constructing $\beta_{k,k-1}$ - i.e.,
$$\beta_{k,k-1} = \exp\left(-\frac{\|D_{:k}-D_{:k-1}\|_1}{\gamma}\right), \tag{9}$$
where $\gamma$ is the scale parameter of the Laplace distribution characterizing the frame differences. The smaller the distance between $D_{:k}$ and $D_{:k-1}$, the larger the coefficient $\beta_{k,k-1}$, indicating stronger temporal continuity between the two frames. In practical applications, we note that the video frames in a dataset may have undergone key-frame extraction and compression. These preprocessing operations often introduce new uncertainties and have a negative impact on the temporal continuity. Thus, it is desirable to characterize the temporal continuity by adaptively estimating the scale parameter $\gamma$. Given a sliding window in the temporal domain, the maximum-likelihood estimate of the Laplacian scale parameter is given by
$$\gamma = \frac{\sum_{k=2}^{n} \|D_{:k}-D_{:k-1}\|_1}{n-1}. \tag{10}$$
Additionally, we note that the regularization parameter $\alpha$ can further fine-tune the strength of the temporal continuity constraint.
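Eqs. (9) and (10) amount to a few lines over the frame matrix $D$; a sketch:

```python
import numpy as np

def temporal_weights(D):
    """Adaptive temporal weights beta_{k,k-1}, Eqs. (9)-(10)."""
    diffs = np.abs(np.diff(D, axis=1)).sum(axis=0)  # ||D_:k - D_:k-1||_1, k = 2..n
    gamma = diffs.mean()                            # ML estimate of the Laplacian scale
    return np.exp(-diffs / gamma)                   # beta[k-1] weights the pair (k, k-1)
```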

3.4 JMDF model formulation and solution

Based on the above BG, FG, and noise models, we formulate JMDF as the following joint optimization problem:
$$\min_{U,V,B,M,S} \|M \odot (D-B)\|_F^2 + \frac{1}{2}\|D-B-S\|_F^2 + \Omega(S) \quad \mathrm{s.t.}\ B = UV, \tag{11}$$
Here, the noise term $E$ is not modeled explicitly; $B$ and $S$ are reconstructed by minimizing the decomposition error.
Then, to solve this problem, we adopt the alternating direction method of multipliers (ADMM), the complete steps of which are given in Algorithm 1. Using the augmented Lagrangian multiplier method to remove the equality constraint in Eq. (11), the augmented Lagrangian problem is obtained as:
$$\min_{U,V,B,S,Z} \|M\odot(D-B)\|_F^2 + \frac{1}{2}\|D-B-S\|_F^2 + \Omega(S) + \langle Z, B-UV\rangle + \frac{\mu}{2}\|B-UV\|_F^2, \tag{12}$$
where $Z \in \mathbb{R}^{m \times n}$ is the Lagrangian multiplier, and $\mu > 0$.
For U and V, we need to consider the following subproblem:
$$\min_{U,V} \left\|\left(B^{(t)} + \frac{1}{\mu^{(t)}}Z^{(t)}\right) - UV\right\|_F^2, \tag{13}$$
letting $H = B^{(t)} + \frac{1}{\mu^{(t)}} Z^{(t)}$. By taking the first-order partial derivatives with respect to $U$ and $V$ respectively, their closed-form solutions are obtained:
$$U^{(t+1)} = \left(H V^{(t)\mathsf{T}}\right)\left(V^{(t)} V^{(t)\mathsf{T}}\right)^{-1}, \tag{14}$$
$$V^{(t+1)} = \left(U^{(t+1)\mathsf{T}} U^{(t+1)}\right)^{-1}\left(U^{(t+1)\mathsf{T}} H\right). \tag{15}$$
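In code, the $(U, V)$ update reduces to two least-squares solves; a sketch (with $r = 1$, the inverted matrices are just scalars):

```python
import numpy as np

def update_UV(B, Z, mu, V):
    """Closed-form updates of Eqs. (14)-(15) for the target H = B + Z / mu."""
    H = B + Z / mu                           # Eq. (13) target matrix
    U = (H @ V.T) @ np.linalg.inv(V @ V.T)   # Eq. (14)
    V = np.linalg.inv(U.T @ U) @ (U.T @ H)   # Eq. (15)
    return U, V
```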
For B, we need to consider the following sub-problem,
$$\min_{B} \|M\odot(D-B)\|_F^2 + \frac{1}{2}\left\|D-B-S^{(t)}\right\|_F^2 + \langle Z^{(t)}, B\rangle + \frac{\mu}{2}\left\|B - U^{(t+1)}V^{(t+1)}\right\|_F^2, \tag{16}$$
Following the derivation in [19], we optimize Eq. (16) in scalar form:
$$\min_{B_{ij}} M_{ij}(D_{ij}-B_{ij})^2 + \frac{1}{2}(D_{ij}-B_{ij})^2 - (D_{ij}-B_{ij})S_{ij} + Z_{ij}B_{ij} + \frac{\mu}{2}B_{ij}^2 - \mu B_{ij}(U_{i:}V_{:j}), \tag{17}$$
where $S_{ij}$ and $Z_{ij}$ represent the elements of $S$ and $Z$, respectively, $U_{i:}$ is the $i$-th row of $U$ and $V_{:j}$ is the $j$-th column of $V$. Letting $P_{ij} = D_{ij} - B_{ij}$, Eq. (17) is equivalent to
$$\min_{P_{ij}} \frac{1}{2}(P_{ij}-Q_{ij})^2 + \frac{M_{ij}}{1+\mu}P_{ij}^2, \tag{18}$$
where $Q_{ij} = \left(\mu D_{ij} - \mu(U_{i:}V_{:j}) + Z_{ij} + S_{ij}\right)/(1+\mu)$. By taking the first-order partial derivative with respect to $P_{ij}$, we obtain the solution for $P_{ij}$, and finally derive the closed-form solution for $B$:
$$B^{(t+1)} = D - \frac{\mu^{(t)}\left(D - U^{(t+1)}V^{(t+1)}\right) + Z^{(t)} + S^{(t)}}{1 + \mu^{(t)} + 2M^{(t)}}. \tag{19}$$
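Since the membership matrix $M$ enters the denominator pixel by pixel, Eq. (19) is fully element-wise apart from the product $UV$, and the B-update is essentially a one-liner:

```python
def update_B(D, U, V, Z, S, M, mu):
    """Closed-form BG update of Eq. (19); all operations besides U @ V are
    element-wise, so each pixel is corrected by its own membership M_ij."""
    return D - (mu * (D - U @ V) + Z + S) / (1.0 + mu + 2.0 * M)
```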
For S, we need to consider the following sub-problem,
$$\min_{S} \frac{1}{2}\left\|(D - B^{(t+1)}) - S\right\|_F^2 + \Omega(S), \tag{20}$$
we decompose this matrix optimization problem into a set of frame-wise problems, while accounting for the spatio-temporal continuity between frames. Since the temporal continuity constraint only applies from the second frame onward, the first FG frame $S_{:1}^{(t+1)}$ is processed as follows:
$$\min_{S_{:1}} \frac{1}{2}\left\|(D_{:1}-B_{:1}) - S_{:1}\right\|_F^2 + \lambda\sum_{g_{ij}\in G}\left\|S_{:1}^{g_{ij}}\right\|_\infty. \tag{21}$$
Then, the other FG frames $S_{:k}^{(t+1)}$ are processed as follows:
$$\min_{S_{:k}} \frac{1}{2}\left\|K - S_{:k}\right\|_F^2 + \lambda\sum_{g_{ij}\in G}\left\|S_{:k}^{g_{ij}}\right\|_\infty, \tag{22}$$
where $K = \dfrac{D_{:k}-B_{:k}+\alpha\beta_{k,k-1}S_{:k-1}^{(t+1)}}{1+\alpha\beta_{k,k-1}}$.
The solutions can be obtained by solving a quadratic min-cost flow problem [43]; we use the function spams.ProximalGraph from the SPAMS Python package to compute them.
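A frame-wise sketch of this S-update follows; `prox_group` stands in for the proximal operator of the overlapping-group $l_\infty$ penalty (e.g., the min-cost-flow solver exposed by SPAMS) - its exact call signature here is an assumption, not the SPAMS API.

```python
import numpy as np

def update_S(D, B, alpha, beta, lam, prox_group):
    """Frame-wise FG update via Eqs. (21)-(22)."""
    m, n = D.shape
    S = np.zeros_like(D)
    S[:, 0] = prox_group(D[:, 0] - B[:, 0], lam)                # Eq. (21): no temporal term
    for k in range(1, n):
        ab = alpha * beta[k - 1]                                # beta_{k,k-1} from Eq. (9)
        K = (D[:, k] - B[:, k] + ab * S[:, k - 1]) / (1.0 + ab) # Eq. (22) data term
        S[:, k] = prox_group(K, lam)                            # group-sparse proximal step
    return S
```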
For $\sigma^2$, we update via statistical parameter estimation:
$$\sigma^{2\,(t+1)} = \frac{1+\sum_{i,j}I(S_{ij}^{(t+1)}=0)\left(D_{ij}-B_{ij}^{(t+1)}\right)^2}{1+\sum_{i,j}I(S_{ij}^{(t+1)}=0)}. \tag{23}$$
The membership degree matrix M is updated via
$$M_{ij}^{(t+1)} = \frac{1}{1+|TN(i,j)|}\left[\exp\left(-\frac{\left(D_{ij}-B_{ij}^{(t+1)}\right)^2}{2\sigma^2}\right) I(S_{ij}^{(t+1)}=0) + \sum_{(p,q)\in TN(i,j)} \exp\left(-\frac{\left(D_{pq}-B_{pq}^{(t+1)}\right)^2}{2\sigma^2}\right) I(S_{pq}^{(t+1)}=0)\right]. \tag{24}$$
The iterative solution algorithm terminates when the equality constraint $B = UV$ is satisfied up to a pre-defined tolerance $\varepsilon$, that is, $\|B-UV\|_F^2 / \|D\|_F^2 \le \varepsilon$, or when the maximal number of iterations is reached. In our experiments, the tolerance factor $\varepsilon$ is set to $10^{-6}$.
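Putting the pieces together, the outer loop of Algorithm 1 has roughly the following shape. This skeleton reuses the helper functions sketched earlier (update_UV, update_B, update_S, update_sigma2, update_membership, temporal_weights), keeps $\mu$ fixed, and uses a rank-1 mean-frame initialization of our own choosing - all simplifying assumptions.

```python
import numpy as np

def jmdf(D, lam, alpha, prox_group, eps=1e-6, max_iter=30, mu=1.0):
    """Skeleton of the ADMM loop for Eq. (12)."""
    n = D.shape[1]
    U = D.mean(axis=1, keepdims=True)     # rank-1 init: mean frame as BG
    V = np.ones((1, n))
    B = U @ V
    S, Z = np.zeros_like(D), np.zeros_like(D)
    M = np.ones_like(D)                   # start by trusting every pixel as BG
    beta = temporal_weights(D)            # Eqs. (9)-(10)
    for _ in range(max_iter):
        U, V = update_UV(B, Z, mu, V)                     # Eqs. (14)-(15)
        B = update_B(D, U, V, Z, S, M, mu)                # Eq. (19)
        S = update_S(D, B, alpha, beta, lam, prox_group)  # Eqs. (21)-(22)
        sigma2 = update_sigma2(D, B, S)                   # Eq. (23)
        M = update_membership(D, B, S, sigma2)            # Eq. (24)
        Z = Z + mu * (B - U @ V)                          # dual ascent on B = UV
        if np.linalg.norm(B - U @ V) ** 2 <= eps * np.linalg.norm(D) ** 2:
            break                                         # stopping rule above
    return B, S
```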

4 Experiments

In this section, we summarize the experimental results of this work and compare the proposed algorithm with existing methods: the generalized shrinkage thresholding operator (GSTO) [62], extended low-rank and structured sparse decomposition (E-LSD) [43], robust outlier estimation for low-rank matrix recovery (ROUTE-LRMF) [12], DEtecting Contiguous Outliers in the LOw-rank Representation (DECOLOR) [15], online MoG-MF (OMoGMF+TV) [6] and PCP [9]. These algorithms were downloaded from the authors' websites or reproduced from the corresponding papers. Their parameters are set to default values or tuned within the recommended ranges. The video surveillance datasets include the I2R dataset and the ChangeDetection dataset (CDnet2014). CDnet2014 is now the most common dataset for the background modeling task, on which almost all methods are tested [63,64]. The experiments of JMDF, ROUTE-LRMF and E-LSD are run in Python 3.7 on a Linux system, and those of the other models on MATLAB R2018b; the machine is configured with an Intel(R) Core(TM) i9-10900X CPU and 128 GB RAM. To quantitatively evaluate performance, we calculate the F-measure, the harmonic mean of precision and recall, defined as:
$$F\text{-}measure = \frac{2 \times recall \times precision}{recall + precision}. \tag{25}$$
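For binary masks, this is a few lines of code (the guards against empty masks are ours, not from the paper):

```python
import numpy as np

def f_measure(pred, gt):
    """F-measure, Eq. (25), of a binary FG mask `pred` against ground truth `gt`."""
    tp = np.logical_and(pred, gt).sum()      # true positives
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    return 2.0 * recall * precision / max(recall + precision, 1e-12)
```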
In the experiments, 220 video frames are processed in each batch. In addition, we select some video clips from I2R to tune the parameter $\lambda$ of Algorithm 1. According to Fig.5, we set $\lambda = 0.12$ for scenes with strong interference, and $\lambda = 0.06$ for cases with weak dynamic BG or static BG. To explore the influence of different neighborhood radii on the model, we build the model with neighborhood radii of 1, 1.5, and 2, denoted by JMDF01, JMDF02 and JMDF03 respectively. We add a median filtering step after JMDF03, because the circular neighborhood sparsity with radius 2 produces FG with rough edges.
Fig.5 λ value in three types of scenes. (a) Weak dynamic BG; (b) strong interference; (c) static BG

At the early stage of the iteration, video frames can be processed in parallel with only the spatial structural constraint (lightweight mode). At the late stage, the joint spatio-temporal constraints are activated (full mode). If the algorithm converges before reaching the maximum number of iterations, we enforce the spatio-temporal continuity constraint after convergence.

4.1 FG estimation on I2R

We first compare the video modeling capabilities on the I2R dataset. It contains nine videos covering a variety of scenes, including static BG (“Bootstrap” and “Hall”), dynamic BG (“Campus” and “Fountain”), slow movement (“Curtain” and “WaterSurface”), and staying targets (“ShopMall” and “Lobby”). For each video, the ground truth (manually segmented FG regions) of 20 frames is provided in the dataset. In our experiments, we use these 20 annotated frames to test the performance of JMDF. The F-measure comparison results are shown in Tab.1. The overall performance of JMDF03 is the best, especially when dealing with dynamic BG; the performance of JMDF02 is slightly lower. Across the F-measures of the 9 videos, JMDF03 is more stable than JMDF02 and exceeds the comparison models on 8 videos, so we choose the neighborhood radius $r_N$ as 2.
Tab.1 Comparison of foreground detection results on the I2R dataset
Video PCP [9] DECOLOR [15] E-LSD [43] ROUTE [12] OMoGMF+TV [6] GSTO [62] JMDF01 JMDF02 JMDF03
Bootstrap 0.639 0.581 0.685 0.641 0.669 0.711 0.673 0.690 0.693
Campus 0.444 0.767 0.784 0.409 0.813 0.810 0.763 0.826 0.828
Curtain 0.692 0.781 0.832 0.785 0.836 0.812 0.883 0.892 0.914
Escalator 0.572 0.724 0.715 0.588 0.651 0.733 0.660 0.691 0.734
Fountain 0.683 0.833 0.831 0.727 0.825 0.855 0.871 0.881 0.884
Hall 0.520 0.643 0.671 0.615 0.652 0.673 0.671 0.683 0.690
ShopMall 0.692 0.671 0.744 0.707 0.701 0.747 0.742 0.741 0.748
WaterSurface 0.781 0.836 0.887 0.853 0.902 0.901 0.929 0.938 0.940
Lobby 0.652 0.607 0.742 0.711 0.787 0.813 0.835 0.844 0.834
Average 0.631 0.716 0.766 0.671 0.760 0.784 0.781 0.798 0.807
We can see from the results that JMDF has obvious advantages in dynamic BG scenes, such as “Fountain”, “WaterSurface”, “Curtain”, “Escalator”, and “Campus”, which benefits from the generalized group sparsity constraint. For the “Bootstrap” video, the detection accuracy of JMDF is equal to that of E-LSD, tied for second place. In general, the average F-measure of JMDF is higher than that of the state-of-the-art GSTO, and significantly better than all other comparison algorithms listed in this paper, especially on dynamic BG.
The detection results of randomly selected frames are shown in Fig.6. Since a neighborhood radius of 2 is selected, we only show the foreground detection results of JMDF03. The slow motion and stagnation of FG make it more challenging to accurately restore the BG, which in turn affects the accuracy of moving target detection. In “WaterSurface”, “Curtain” and “Lobby”, the moving targets are in a short-stay state. PCP is unable to capture them. ROUTE can alleviate this problem through matrix factorization, but it has no structural constraint on FG, and its detection results are still partially missing. E-LSD, with group sparsity, can also capture short-stay targets. In our BG modeling, we use the background membership degree to vaguely describe the properties of pixels and suppress FG interference with BG restoration. As a result, our method restores a cleaner BG and better handles the challenges caused by short-stay FG. Besides, the generalized group sparsity constraint on spatio-temporal association not only promotes FG integrity, but also alleviates the interference of dynamic BG. However, some dynamic BG and other complex scenes are still challenging, such as illumination changes, running escalators, and strong shadow interference. None of these models performs well on the “Bootstrap” video due to the abundant shadows. Although JMDF can largely capture FG in short-stay scenes, it still cannot achieve satisfactory results when targets stay for a long time (e.g., “Hall” and “ShopMall”).
Fig.6 Foreground detection results on the I2R dataset. From left to right, the video sequences selected in turn are “WaterSurface”, “Fountain”, “Curtain”, “Campus”, “Escalator”, “Hall”, “Bootstrap”,“ShopMall” and “Lobby”

4.2 FG estimation on CDnet2014

The CDnet2014 dataset consists of 46 videos spanning 11 scene categories, such as bad weather, night monitoring, and dynamic BG. We conduct experiments on 7 categories (the “cameraJitter”, “PTZ”, “shadow”, and “nightVideos” categories are excluded, because our goal is not to model video from moving or jittering cameras, night surveillance, or heavily shadowed scenes). We select 220 consecutive frames from each video for evaluation. The frames in this dataset are large, which leads to very time-consuming computation, so we downsample the video frames in the preprocessing stage. The F-measures obtained by JMDF and the comparison methods on each video are shown in Tab.2, and the detection results of randomly selected frames are shown in Fig.7. Since the model achieves the best results when the neighborhood radius $r_N$ is 1.5, we focus on this setting.
Tab.2 Comparison of foreground detection results on the CDnet2014 dataset
Sequence Video PCP [9] DECOLOR [15] E-LSD [43] ROUTE [12] OMoGMF+TV [6] GSTO [62] JMDF01 JMDF02 JMDF03
Baseline highway 0.791 0.881 0.963 0.752 0.935 0.937 0.982 0.992 0.979
office 0.643 0.843 0.924 0.882 0.845 0.906 0.876 0.886 0.903
pedestrians 0.952 0.701 0.990 0.953 0.982 0.943 0.987 0.994 0.984
PETS2006 0.690 0.908 0.812 0.675 0.806 0.842 0.861 0.873 0.833
Average 0.769 0.833 0.922 0.816 0.892 0.907 0.927 0.936 0.925
BadWeather skating 0.652 0.927 0.921 0.832 0.901 0.945 0.961 0.963 0.953
snowFall 0.626 0.908 0.795 0.742 0.813 0.846 0.935 0.940 0.932
blizzard 0.883 0.851 0.873 0.902 0.867 0.918 0.942 0.942 0.941
wetSnow 0.608 0.904 0.861 0.632 0.822 0.861 0.886 0.884 0.880
Average 0.692 0.898 0.863 0.777 0.851 0.893 0.931 0.932 0.927
LowFramerate port_0_17fps 0.062 0.041 0.454 0.052 0.382 0.251 0.314 0.394 0.442
tramCrossroad_1fps 0.756 0.782 0.864 0.761 0.854 0.832 0.844 0.843 0.842
tunnelExit_0_35fps 0.583 0.678 0.622 0.545 0.559 0.742 0.783 0.761 0.734
turnpike_0_5fps 0.752 0.721 0.897 0.752 0.883 0.798 0.895 0.901 0.903
Average 0.541 0.556 0.709 0.528 0.670 0.656 0.709 0.725 0.730
Turbulence turbulence0 0.692 0.344 0.754 0.632 0.728 0.581 0.826 0.882 0.883
turbulence1 0.382 0.421 0.693 0.543 0.798 0.715 0.859 0.870 0.852
turbulence2 0.043 0.472 0.987 0.512 0.993 0.871 0.982 0.984 0.973
turbulence3 0.845 0.702 0.936 0.859 0.926 0.860 0.941 0.954 0.960
Average 0.491 0.482 0.843 0.637 0.861 0.757 0.902 0.923 0.917
Thermal corridor 0.402 0.973 0.881 0.771 0.962 0.923 0.991 0.993 0.992
diningRoom 0.464 0.911 0.724 0.616 0.608 0.905 0.772 0.793 0.784
lakeSide 0.673 0.722 0.435 0.684 0.322 0.776 0.680 0.662 0.604
library 0.467 0.536 0.971 0.885 0.983 0.952 0.985 0.991 0.989
park 0.618 0.814 0.727 0.634 0.617 0.859 0.856 0.851 0.853
Average 0.525 0.791 0.748 0.718 0.698 0.883 0.857 0.858 0.844
IntermittentObjectMotion abandonedBox 0.710 0.721 0.932 0.823 0.924 0.906 0.904 0.904 0.882
parking 0.622 0.211 0.382 0.383 0.274 0.801 0.768 0.767 0.624
sofa 0.634 0.732 0.703 0.632 0.690 0.693 0.691 0.698 0.696
streetLight 0.421 0.643 0.614 0.440 0.591 0.633 0.631 0.653 0.614
tramstop 0.242 0.350 0.352 0.243 0.336 0.376 0.347 0.364 0.353
winterDriveway 0.371 0.808 0.766 0.773 0.646 0.795 0.784 0.785 0.784
Average 0.500 0.578 0.625 0.549 0.577 0.701 0.688 0.695 0.659
DynamicBackground boats 0.426 0.903 0.931 0.463 0.907 0.905 0.902 0.916 0.954
canoe 0.121 0.264 0.822 0.434 0.801 0.778 0.887 0.933 0.923
fall 0.445 0.707 0.701 0.363 0.566 0.828 0.679 0.764 0.842
fountain01 0.041 0.024 0.081 0.053 0.042 0.184 0.171 0.220 0.171
fountain02 0.722 0.726 0.727 0.744 0.805 0.824 0.854 0.832 0.784
overpass 0.492 0.845 0.793 0.711 0.802 0.872 0.858 0.875 0.871
Average 0.375 0.578 0.676 0.461 0.654 0.732 0.725 0.762 0.758
Overall average 0.556 0.674 0.769 0.641 0.743 0.790 0.820 0.833 0.823
Fig.7 Foreground detection results on the CDnet2014 dataset. From left to right, the video sequences selected in turn are “baseline”, “badWeather”, “lowFramerate”, “turbulence”, “thermal”, “intermittentObjectMotion”, and “dynamicBackground”

In a simple scene such as “baseline”, all methods show high detection accuracy, and JMDF is slightly improved compared with E-LSD. “thermal” is captured by a thermal imager; GSTO obtains the optimal average F-measure, while JMDF is in the suboptimal position. “turbulence” contains strong interference noise that causes blurry pictures, and its moving targets are relatively small. Most of the competing algorithms are not suitable for such special scenes, but our method achieves an outstanding F-measure of 0.92. OMoGMF+TV also has some success on the “turbulence” scene, because it can model complex noise mixtures. In “badWeather”, the average F-measure of JMDF reaches 0.93, higher than GSTO. Capturing FG in dynamic BG is a common challenge in moving foreground detection tasks; in the “dynamicBackground” scene, JMDF has the highest average F-measure. The robustness of JMDF to the interference in the “boats” and “canoe” videos exceeds most of the methods, which shows that our method is suitable for dealing with dynamic BG of fluctuating water surfaces. The “fall” video contains violently swaying leaves, leading to more noise in the FG extracted by JMDF. The most difficult video is “fountain01”: its moving targets are far away and very small, while multiple fountains are near and large; none of the models in this paper can accurately extract its FG. Each video in “intermittentObjectMotion” contains intermittently moving targets, and here the average F-measure of JMDF is the best; our approach is also geared towards this scene. Overall, JMDF performs well on most videos. Compared with other competitive methods, it shows excellent performance in thermal imaging, blurred scenes and dynamic BG.
After experimenting with both datasets, we find that a neighborhood radius of 2 works best without downsampling the video frames, whereas after downsampling, a radius of 1.5 is appropriate. We believe that after the images are downsampled, the overlapping groups become relatively larger and carry richer neighborhood information. In this case, the constraint strength of circular neighborhood sparsity with radius 2 is too large, which may adversely affect foreground detection. This explains why, overall, a radius of 1.5 gives the best results on the CDnet2014 dataset. However, even when the images are subsampled, JMDF03 achieves a higher F-measure under dynamic BG; therefore, the sparsity strategy with neighborhood radius 2 is more effective in that scene.

4.3 Noise robustness of JMDF

Due to camera hardware issues and the complexity of surveillance scenes, captured surveillance videos are sometimes contaminated by external noise. This paper therefore studies the robustness of JMDF to different noise types. Here, four types of noise are considered: Gaussian noise, salt-and-pepper noise, speckle noise, and Poisson noise. Gaussian additive noise is a common noise caused by uneven illumination and high sensor temperature during acquisition and transmission. Salt-and-pepper noise randomly changes some pixels to 0 or 255, completely destroying the image information there. Speckle noise is a class of multiplicative noise dependent on the image itself. Poisson noise, a.k.a. shot noise, characterizes the uncertainty associated with the measurement of light. Except for Poisson noise, our experiment sets three variance values to represent different levels of noise intensity. As shown in Fig.8, the frames contaminated by salt-and-pepper noise can be regarded as being affected by outliers. These noises increase the difficulty of moving target detection. Even under high-intensity noise, the detection results on noisy videos remain reasonably close to those on clean video, which shows that JMDF is robust to different types of noise.
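For reproducibility, noise of these four types can be injected with scikit-image's random_noise utility; a sketch, assuming frames normalized to [0, 1] (for salt-and-pepper noise, the `amount` parameter plays the role of intensity, since that mode has no variance knob):

```python
from skimage.util import random_noise

def corrupt(frame, kind, var=0.01):
    """Add one of the four studied noise types to a frame in [0, 1]."""
    if kind == "poisson":                        # Poisson noise has no variance parameter
        return random_noise(frame, mode="poisson")
    if kind == "s&p":                            # salt-and-pepper noise
        return random_noise(frame, mode="s&p", amount=var)
    return random_noise(frame, mode=kind, var=var)   # "gaussian" or "speckle"
```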
Fig.8 Robustness to different noises. The source video frame and the result from our JMDF are displayed in the lower right corner of the figure. We add Gaussian noise, speckle noise, salt and pepper noise, and Poisson noise to the source video, respectively. Except for Poisson noise, we fix the noise mean to 0, and change the noise variance, whose values are shown in the first row of each sub-figure. Then, we plot the noisy frame examples in the second row and the results from JMDF in the third row. As the variance increases, the video gradually becomes blurred, which increases the difficulty of the detection task. Our method achieves good performance even in videos with large noise variance

5 Conclusions

In this paper, we have presented a new framework called JMDF for moving target detection. In the factorization part, we propose a fuzzy membership degree that integrates noise modeling and FG information to represent the soft association between each pixel and BG/FG, allowing the model to recover a clean static BG (without FG residue). For FG modeling, we propose to bring the first-order temporal difference of adjacent frames into the group sparsity constraint adaptively, which promotes the spatio-temporal continuity of FG. Experimental results on real video datasets show that JMDF achieves better detection results than current state-of-the-art methods, especially in scenes with dynamic BG, bad weather, turbulence and intermittent motion.
In the future, we expect to study how to combine deep learning implementations with model-based methods such as MF or MD. Deep neural networks (DNNs) are widely used in image processing for their powerful feature extraction and representation capabilities; they can automatically learn shallow and deep features, bypassing the feature engineering steps required by methods such as JMDF. An open question is how to unfold the ADMM-based solution algorithm into a DNN-based implementation, so that the network and hyperparameters can be jointly optimized in an end-to-end manner. Unlike other vision tasks, BG extraction has remained an area in which deep learning has not shown apparent advantages over model-based approaches. In addition, it is also worth studying an online optimization strategy for JMDF to make it more suitable for real-time moving object detection applications.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant No. 61902106) and in part by the Natural Science Foundation of Hebei Province (No. F2020202028).
1
Garcia-Garcia B, Bouwmans T, Silva A J R . Background subtraction in real applications: challenges, current models and future directions. Computer Science Review, 2020, 35: 100204

2
Stauffer C, Grimson W E L. Adaptive background mixture models for real-time tracking. In: Proceedings of 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 1999, 246−252

3
Weinland D, Boyer E. Action recognition using exemplar-based embedding. In: Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition. 2008, 1−7

4
Lin Z, Zhang H. Low-Rank Models in Visual Analysis: Theories, Algorithms, and Applications. Sea Harbor Drive Orlando: Academic Press, 2017

5
Vaswani N, Bouwmans T, Javed S, Narayanamurthy P . Robust subspace learning: robust PCA, robust subspace tracking, and robust subspace recovery. IEEE Signal Processing Magazine, 2018, 35( 4): 32–55

6
Yong H, Meng D, Zuo W, Zhang L . Robust online matrix factorization for dynamic background subtraction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40( 7): 1726–1740

7
Zivkovic Z. Improved adaptive Gaussian mixture model for background subtraction. In: Proceedings of the 17th International Conference on Pattern Recognition. 2004, 28−31

8
Elgammal A, Harwood D, Davis L. Non-parametric model for background subtraction. In: Proceedings of the 6th European Conference on Computer Vision. 2000, 751−767

9
Candès E J, Li X, Ma Y, Wright J . Robust principal component analysis?. Journal of the ACM, 2011, 58( 3): 11

10
Chen J, Yang J . Robust subspace segmentation via low-rank representation. IEEE Transactions on Cybernetics, 2014, 44( 8): 1432–1445

11
Liu X, Zhao G, Yao J, Qi C . Background subtraction based on low-rank and structured sparse decomposition. IEEE Transactions on Image Processing, 2015, 24( 8): 2502–2514

12
Guo X, Lin Z . Low-rank matrix recovery via robust outlier estimation. IEEE Transactions on Image Processing, 2018, 27( 11): 5316–5327

13
Javed S, Mahmood A, Al-Maadeed S, Bouwmans T, Jung S K . Moving object detection in complex scene using spatiotemporal structured-sparse RPCA. IEEE Transactions on Image Processing, 2019, 28( 2): 1007–1022

14
Boyd S, Parikh N, Chu E, Peleato B, Eckstein J . Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 2011, 3( 1): 1–122

15
Zhou X, Yang C, Yu W . Moving object detection by detecting contiguous outliers in the low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35( 3): 597–610

16
Cao X, Yang L, Guo X . Total variation regularized RPCA for irregularly moving object detection under dynamic background. IEEE Transactions on Cybernetics, 2016, 46( 4): 1014–1027

17
Xie H B, Li C, Xu R Y D, Mengersen K. Robust kernelized Bayesian matrix factorization for video background/foreground separation. In: Proceedings of the 5th International Conference on Machine Learning, Optimization, and Data Science. 2019, 484−495

18
Meng D, De La Torre F. Robust matrix factorization with unknown noise. In: Proceedings of 2013 IEEE International Conference on Computer Vision. 2013, 1337−1344

19
Cao X, Chen Y, Zhao Q, Meng D, Wang Y, Wang D, Xu Z. Low-rank matrix factorization under general mixture noise distributions. In: Proceedings of 2015 IEEE International Conference on Computer Vision. 2015, 1493−1501

20
Chu Y, Wu X, Liu T, Liu J. A basis-background subtraction method using non-negative matrix factorization. In: Proceedings of SPIE 7546, Second International Conference on Digital Image Processing. 2010, 75461A

21
Li L, Hu Q, Li X . Moving object detection in video via hierarchical modeling and alternating optimization. IEEE Transactions on Image Processing, 2019, 28( 4): 2021–2036

22
Li L, Huang W, Gu I Y H, Tian Q . Statistical modeling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing, 2004, 13( 11): 1459–1472

23
Brutzer S, Höferlin B, Heidemann G. Evaluation of background subtraction techniques for video surveillance. In: Proceedings of CVPR 2011. 2011, 1937−1944

24
Bouwmans T . Recent advanced statistical background modeling for foreground detection - a systematic survey. Recent Patents on Computer Science, 2011, 4( 3): 147–176

25
Nebili W, Farou B, Seridi H . Using resources competition and memory cell development to select the best GMM for background subtraction. International Journal of Strategic Information Technology and Applications, 2019, 10( 2): 21–43

26
Chen M, Wei X, Yang Q, Li Q, Wang G, Yang M H . Spatiotemporal GMM for background subtraction with superpixel hierarchy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40( 6): 1518–1525

27
Elguebaly T, Bouguila N . Background subtraction using finite mixtures of asymmetric Gaussian distributions and shadow detection. Machine Vision and Applications, 2014, 25( 5): 1145–1162

28
Yang Y, Han D, Ding J, Yang Y. An improved ViBe for video moving object detection based on evidential reasoning. In: Proceedings of 2016 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems. 2016, 26−31

29
Ramírez-Alonso G, Chacón-Murguía M I . Auto-adaptive parallel SOM architecture with a modular analysis for dynamic object segmentation in videos. Neurocomputing, 2016, 175: 990–1000

30
Wang H C, Lai Y C, Cheng W H, Cheng C Y, Hua K L . Background extraction based on joint gaussian conditional random fields. IEEE Transactions on Circuits and Systems for Video Technology, 2018, 28( 11): 3127–3140

31
Xie Y, Gu S, Liu Y, Zuo W, Zhang W, Zhang L . Weighted Schatten p-norm minimization for image denoising and background subtraction. IEEE Transactions on Image Processing, 2016, 25( 10): 4842–4857

32
Hu W, Yang Y, Zhang W, Xie Y . Moving object detection using tensor-based low-rank and saliently fused-sparse decomposition. IEEE Transactions on Image Processing, 2017, 26( 2): 724–737

33
Yin L, Parekh A, Selesnick I . Stable principal component pursuit via convex analysis. IEEE Transactions on Signal Processing, 2019, 67( 10): 2595–2607

34
Feng J, Xu H, Yan S. Online robust PCA via stochastic optimization. In: Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013, 404−412

35
Guo K, Liu L, Xu X, Xu D, Tao D . GoDec+: fast and robust low-rank matrix decomposition based on maximum correntropy. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29( 6): 2323–2336

36
Cao H, Shang X, Wang Y, Song M, Chen S, Chang C I. GO decomposition (GoDec) approach to finding low rank and sparsity matrices for hyperspectral target detection. In: Proceedings of 2020 IEEE International Geoscience and Remote Sensing Symposium. 2020, 2807−2810

37
Shakeri M, Zhang H . COROLA: a sequential solution to moving object detection using low-rank approximation. Computer Vision and Image Understanding, 2016, 146: 27–39

38
Javed S, Oh S H, Sobral A, Bouwmans T, Jung S K. OR-PCA with MRF for robust foreground detection in highly dynamic backgrounds. In: Proceedings of the 12th Asian Conference on Computer Vision. 2014, 284−299

39
Javed S, Oh S H, Bouwmans T, Jung S K . Robust background subtraction to global illumination changes via multiple features-based online robust principal components analysis with Markov random field. Journal of Electronic Imaging, 2015, 24( 4): 043011

40
Li C, Wang X, Zhang L, Tang J, Wu H, Lin L . Weighted low-rank decomposition for robust grayscale-thermal foreground detection. IEEE Transactions on Circuits and Systems for Video Technology, 2017, 27( 4): 725–738

41
Zhu L, Hao Y, Song Y . L1/2 norm and spatial continuity regularized low-rank approximation for moving object detection in dynamic background. IEEE Signal Processing Letters, 2018, 25( 1): 15–19

42
Xu Y, Wu Z, Chanussot J, Mura M D, Bertozzi A L, Wei Z . Low-rank decomposition and total variation regularization of hyperspectral video sequences. IEEE Transactions on Geoscience and Remote Sensing, 2018, 56( 3): 1680–1694

43
Zhang J, Jia X, Hu J . Error bounded foreground and background modeling for moving object detection in satellite videos. IEEE Transactions on Geoscience and Remote Sensing, 2020, 58( 4): 2659–2669

44
Wu M, Sun Y, Hang R, Liu Q, Liu G . Multi-component group sparse RPCA model for motion object detection under complex dynamic background. Neurocomputing, 2018, 314: 120–131

45
Liu X, Zhao G . Background subtraction using multi-channel fused lasso. Electronic Imaging, 2019, 2019( 11): 269

46
Xin B, Tian Y, Wang Y, Gao W. Background subtraction via generalized fused lasso foreground modeling. In: Proceedings of 2015 IEEE conference on Computer Vision and Pattern Recognition. 2015, 4676−4684

47
Javed S, Oh S H, Sobral A, Bouwmans T, Jung S K. Background subtraction via superpixel-based online matrix decomposition with structured foreground constraints. In: Proceedings of 2015 IEEE International Conference on Computer Vision Workshop. 2015, 930−938

48
Ye X, Yang J, Sun X, Li K, Hou C, Wang Y . Foreground-background separation from video clips via motion-assisted matrix restoration. IEEE Transactions on Circuits and Systems for Video Technology, 2015, 25( 11): 1721–1734

49
Cao X, Zhao Q, Meng D, Chen Y, Xu Z . Robust low-rank matrix factorization under general mixture noise distributions. IEEE Transactions on Image Processing, 2016, 25( 10): 4677–4690

50
Liu Q, Li X . Efficient low-rank matrix factorization based on ℓ1,ε-norm for online background subtraction. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32( 7): 4900–4904

51
Zhu Z, Meng Y, Kong D, Zhang X, Guo Y, Zhao Y . To see in the dark: N2DGAN for background modeling in nighttime scene. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31( 2): 492–502

52
Patil P W, Murala S . MSFgNet: a novel compact end-to-end deep network for moving object detection. IEEE Transactions on Intelligent Transportation Systems, 2019, 20( 11): 4066–4077

53
Zhao C, Basu A . Dynamic deep pixel distribution learning for background subtraction. IEEE Transactions on Circuits and Systems for Video Technology, 2020, 30( 11): 4192–4206

54
Hou B, Liu Y, Ling N. A super-fast deep network for moving object detection. In: Proceedings of 2020 IEEE International Symposium on Circuits and Systems. 2020, 1−5

55
Patil P W, Murala S, Dhall A, Chaudhary S. MsEDNet: multi-scale deep saliency learning for moving object detection. In: Proceedings of 2018 IEEE International Conference on Systems, Man, and Cybernetics. 2018, 1670−1675

56
Yan L F, Tu X Y. Background modeling based on Chebyshev approximation. Journal of System Simulation, 2008, 20(4): 944−946, 1001

57
Stagliano A, Noceti N, Verri A, Odone F . Online space-variant background modeling with sparse coding. IEEE Transactions on Image Processing, 2015, 24( 8): 2415–2428

58
Messelodi S, Modena M C, Segata N, Zanin M. A Kalman filter based background updating algorithm robust to sharp illumination changes. In: Proceedings of the 13th International Conference on Image Analysis and Processing. 2005, 163−170

59
Giraldo J H, Javed S, Sultana M, Jung S K, Bouwmans T. The emerging field of graph signal processing for moving object segmentation. In: Proceedings of the 27th International Workshop on Frontiers of Computer Vision. 2021, 31−45

60
Pólya G. Isoperimetric Inequalities in Mathematical Physics. Princeton: Princeton University Press, 1951

61
Tekalp A M. Digital Video Processing. 2nd ed. Upper Saddle River: Prentice Hall Press, 2015

62
Li L, Wang Z, Hu Q, Dong Y . Adaptive nonconvex sparsity based background subtraction for intelligent video surveillance. IEEE Transactions on Industrial Informatics, 2021, 17( 6): 4168–4178

63
Javed S, Narayanamurthy P, Bouwmans T, Vaswani N. Robust PCA and robust subspace tracking: a comparative evaluation. In: Proceedings of 2018 IEEE Statistical Signal Processing Workshop. 2018, 836−840

64
Bouwmans T, Aybat N S, Zahzah E H. Handbook of Robust Low-Rank and Sparse Matrix Decomposition: Applications in Image and Video Processing. Boca Raton: CRC Press, 2016
