Parameters estimation and application of generalized exponential distribution under grouped and right-censored data

Yuzhu TIAN; Maozai TIAN; Ping CHEN

doi:10.3868/s140-DDD-023-0013-x

Front. Math. China ›› 2023, Vol. 18 ›› Issue (3) :165 -174. DOI: 10.3868/s140-DDD-023-0013-x

RESEARCH　ARTICLE


Abstract

The generalized exponential distribution is an important class of distributions in lifetime data analysis, especially for skewed lifetime data. The parameter estimation problem for the generalized exponential distribution model with grouped and right-censored data is considered. The maximum likelihood estimators are obtained using the EM algorithm. Simulations are carried out to illustrate that the proposed algorithm is effective for the model. Finally, a set of medical data is analyzed using the generalized exponential distribution.


Keywords

Generalized exponential distribution / grouped and right-censored data / EM algorithm


1 Introduction

Gupta and Kundu (1999) [2] proposed the two-parameter generalized exponential distribution as an alternative to the Gamma, Weibull and log-normal distributions, and investigated how its properties differ from theirs. The two-parameter generalized exponential distribution has important applications in survival analysis, product life analysis and reliability engineering, especially for skewed failure life data. Many results have been obtained for generalized exponential distributions with complete sample data [1, 3−5, 10−12]. However, in survival analysis and lifetime data studies, the more complex situation of grouped and right-censored sample data [7−9] often arises. In this paper, we use the EM algorithm to estimate the parameters of the two-parameter generalized exponential distribution from such sample data.

The probability density function, survival function and hazard rate function of the two-parameter generalized exponential distribution model are

$$
f(x; \alpha, \lambda) = \alpha \lambda (1 - e^{-\lambda x})^{\alpha - 1} e^{-\lambda x}, \quad x \geqslant 0, \tag{1}
$$

$$
S(x; \alpha, \lambda) = 1 - (1 - e^{-\lambda x})^{\alpha}, \quad x \geqslant 0, \tag{2}
$$

$$
h(x; \alpha, \lambda) = \frac{f(x; \alpha, \lambda)}{S(x; \alpha, \lambda)} = \frac{\alpha \lambda (1 - e^{-\lambda x})^{\alpha - 1} e^{-\lambda x}}{1 - (1 - e^{-\lambda x})^{\alpha}}, \quad x \geqslant 0, \tag{3}
$$

where $\alpha, \lambda > 0$ are the shape and scale parameters of the model, respectively. When the shape parameter $\alpha = 1$, the model reduces to the ordinary exponential distribution. The monotonicity of the hazard function $h(x; \alpha, \lambda)$ of model (1) does not depend on $\lambda$, but only on $\alpha$: when $\alpha > 1$, $h(x; \alpha, \lambda)$ is increasing; when $\alpha < 1$, $h(x; \alpha, \lambda)$ is decreasing; when $\alpha = 1$, $h(x; \alpha, \lambda)$ is a constant [3].

Section 2 of this paper briefly describes the maximum likelihood estimation of the parameters. Section 3 considers the estimation of the parameters of the model using the EM algorithm. Section 4 presents the numerical simulations. Section 5 analyzes a set of real data using the generalized exponential distribution model.
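As a quick numerical check of the density, survival and hazard functions (1) to (3), the following Python sketch evaluates them on a grid and reports how the hazard behaves for a shape parameter below, equal to and above 1. The function names, the grid and the value of the scale parameter are illustrative assumptions and are not taken from the paper; NumPy is assumed to be available.

```python
import numpy as np

def ge_pdf(x, alpha, lam):
    """Density (1) of the generalized exponential distribution."""
    x = np.asarray(x, dtype=float)
    return alpha * lam * (-np.expm1(-lam * x)) ** (alpha - 1.0) * np.exp(-lam * x)

def ge_sf(x, alpha, lam):
    """Survival function (2), written with log1p/expm1 for numerical stability."""
    x = np.asarray(x, dtype=float)
    return -np.expm1(alpha * np.log1p(-np.exp(-lam * x)))

def ge_hazard(x, alpha, lam):
    """Hazard rate (3), i.e. density divided by survival function."""
    return ge_pdf(x, alpha, lam) / ge_sf(x, alpha, lam)

def trend(h):
    """Describe a hazard curve evaluated on an increasing grid."""
    if np.allclose(h, h[0]):
        return "constant"
    return "increasing" if h[-1] > h[0] else "decreasing"

# Illustrative grid and scale parameter (assumed values, chosen for the demo).
x = np.linspace(0.5, 50.0, 200)
lam = 0.06
for alpha in (0.5, 1.0, 2.0):
    h = ge_hazard(x, alpha, lam)
    print(f"alpha={alpha}: hazard is {trend(h)}")
```

Changing `lam` in this sketch rescales the curves but should not change the reported trend, which is the sense in which the shape of the hazard is governed by the shape parameter alone.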

2 The log-likelihood function of model (1) for the grouped data and right-censored case

Suppose the lifetime of a product follows the two-parameter generalized exponential distribution (1). Its distribution function is

$$
F(x; \alpha, \lambda) = (1 - e^{-\lambda x})^{\alpha}, \quad x \geqslant 0. \tag{4}
$$

Now take $n$ products for a lifetime test, with the data obtained as follows. The interval $[0, +\infty)$ is divided into $N + 1$ intervals, and the first $N$ intervals are denoted by $[T_{i-1}, T_i)$, where $i = 1, 2, \ldots, N$ and $0 = T_0 < T_1 < \cdots < T_N < +\infty$. Let $c_i$ denote the number of failed products falling into $[T_{i-1}, T_i)$, and let $d_i$ denote the number of products censored at $T_i$. Then we have $n = \sum_{i=1}^{N} (c_i + d_i)$. The likelihood function is

$$
\begin{aligned}
L(\alpha, \lambda) &= \prod_{i=1}^{N} [F(T_i; \alpha, \lambda) - F(T_{i-1}; \alpha, \lambda)]^{c_i} \cdot [1 - F(T_i; \alpha, \lambda)]^{d_i} \\
&= \prod_{i=1}^{N} [(1 - e^{-\lambda T_i})^{\alpha} - (1 - e^{-\lambda T_{i-1}})^{\alpha}]^{c_i} \cdot [1 - (1 - e^{-\lambda T_i})^{\alpha}]^{d_i}.
\end{aligned}
$$

The log-likelihood function is

$$
\log L(\alpha, \lambda) = \sum_{i=1}^{N} \left\{ c_i \cdot \log[(1 - e^{-\lambda T_i})^{\alpha} - (1 - e^{-\lambda T_{i-1}})^{\alpha}] + d_i \cdot \log[1 - (1 - e^{-\lambda T_i})^{\alpha}] \right\}.
$$

Taking the partial derivatives of the log-likelihood function with respect to the parameters and setting them to zero gives

$$
\frac{\partial \log L(\alpha, \lambda)}{\partial \alpha} = \sum_{i=1}^{N} \left\{ c_i \frac{(1 - e^{-\lambda T_i})^{\alpha} \log(1 - e^{-\lambda T_i}) - (1 - e^{-\lambda T_{i-1}})^{\alpha} \log(1 - e^{-\lambda T_{i-1}})}{(1 - e^{-\lambda T_i})^{\alpha} - (1 - e^{-\lambda T_{i-1}})^{\alpha}} - d_i \frac{(1 - e^{-\lambda T_i})^{\alpha} \log(1 - e^{-\lambda T_i})}{1 - (1 - e^{-\lambda T_i})^{\alpha}} \right\} = 0,
$$

$$
\frac{\partial \log L(\alpha, \lambda)}{\partial \lambda} = \alpha \cdot \sum_{i=1}^{N} \left\{ c_i \frac{(1 - e^{-\lambda T_i})^{\alpha - 1} e^{-\lambda T_i} T_i - (1 - e^{-\lambda T_{i-1}})^{\alpha - 1} e^{-\lambda T_{i-1}} T_{i-1}}{(1 - e^{-\lambda T_i})^{\alpha} - (1 - e^{-\lambda T_{i-1}})^{\alpha}} - d_i \frac{(1 - e^{-\lambda T_i})^{\alpha - 1} e^{-\lambda T_i} T_i}{1 - (1 - e^{-\lambda T_i})^{\alpha}} \right\} = 0.
$$

Solving the above system of equations yields the maximum likelihood estimates of the parameters $\alpha$ and $\lambda$. However, because the system is non-linear, explicit expressions for the parameter estimates are not available, and even finding the maximum likelihood estimates numerically (e.g., by Newton's method) is quite complicated. In the following, the EM algorithm is used to obtain the maximum likelihood estimates of the parameters more efficiently.
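For reference, the log-likelihood of this section is straightforward to evaluate numerically even though its score equations are awkward. The sketch below, in Python with NumPy and SciPy, codes the log-likelihood for given counts and maximizes it directly with a derivative-free Nelder-Mead search over $(\log\alpha, \log\lambda)$. This direct search is a baseline chosen here for illustration and is not the method proposed in the paper; the function names, the starting values and the counts `c`, `d` are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy

def ge_cdf(x, alpha, lam):
    """Distribution function (4)."""
    return (-np.expm1(-lam * np.asarray(x, dtype=float))) ** alpha

def grouped_loglik(alpha, lam, T, c, d):
    """Log-likelihood for grouped and right-censored counts.

    T = [T_0 = 0, T_1, ..., T_N]; c[i-1] failures in [T_{i-1}, T_i);
    d[i-1] units censored at T_i.
    """
    F = ge_cdf(T, alpha, lam)
    cell = np.diff(F)      # F(T_i) - F(T_{i-1})
    surv = 1.0 - F[1:]     # 1 - F(T_i)
    # xlogy(k, p) = k * log(p), with the convention 0 * log(0) = 0
    return float(np.sum(xlogy(c, cell) + xlogy(d, surv)))

def fit_direct(T, c, d, start=(1.0, 0.1)):
    """Maximize the log-likelihood over (log alpha, log lambda)."""
    def objective(theta):
        return -grouped_loglik(np.exp(theta[0]), np.exp(theta[1]), T, c, d)
    res = minimize(objective, np.log(start), method="Nelder-Mead")
    alpha_hat, lam_hat = np.exp(res.x)
    return alpha_hat, lam_hat, -res.fun

# Hypothetical counts for illustration only (not data from the paper).
T = np.array([0, 5.5, 10.5, 15.5, 20.5, 25.5, 30.5, 40.5, 50.5, 60.5])
c = np.array([10, 18, 16, 12, 9, 6, 8, 4, 2])
d = np.array([1, 2, 2, 2, 1, 1, 1, 1, 4])
alpha_hat, lam_hat, ll = fit_direct(T, c, d)
print(alpha_hat, lam_hat, ll)
```

Optimizing on the log scale keeps both parameters positive without explicit constraints. A direct fit of this kind can also serve as a cross-check for the estimates produced by an EM implementation.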

3 Parameter estimation methods

3.1 Introduction to the EM algorithm

The EM algorithm, proposed by Dempster et al. in 1977, is an iterative algorithm for computing the MLE of a model with missing data, mainly using the observed data. Assume that the complete data $Z = (Y, X)$ consist of the observed data $Y$ and the missing data $X$. Let $f(Y, X \mid \eta)$ be the joint probability density of the complete data $Z$, and let $f(X \mid Y, \eta)$ be the conditional density of the missing data $X$ given the observed data $Y = y$, where $\eta$ is the parameter to be estimated. The MLE of $\eta$ is obtained by maximizing the log-likelihood $L(\eta \mid Y)$ of the observed data $Y$. To maximize $L(\eta \mid Y)$, consider the log-likelihood $L(\eta \mid Z) = \log[f(Y \mid \eta) \cdot f(X \mid Y, \eta)]$ given by the complete data. The EM algorithm consists of two steps, Step E and Step M.

Step E: Given an initial value $\eta^{(0)}$, assume that the estimate of $\eta$ obtained after the $(t-1)$th iteration of the algorithm is $\eta^{(t-1)}$, and define the expectation of the log-likelihood of the complete data, the so-called $Q$ function, as

$$
Q(\eta \mid \eta^{(t-1)}) = \int L(\eta \mid Z) \cdot f(X \mid Y, \eta^{(t-1)}) \, dX = E_{\eta^{(t-1)}}\{L(\eta \mid Z) \mid Y\}.
$$

Step M: Maximize $Q(\eta \mid \eta^{(t-1)})$ to give $\eta^{(t)}$ as the update of $\eta$.

Repeat Steps E and M until $\| \eta^{(t)} - \eta^{(t-1)} \|$ is less than some small value $\varepsilon$. It can be shown that an iteration never decreases $L(\eta \mid Y)$, so repeatedly maximizing the $Q$ function drives the estimate toward a maximizer of $L(\eta \mid Y)$. In practice, several different initial values $\eta^{(0)}$ should be tried and compared, to guard against the algorithm stopping at a local maximum.
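To make the two steps concrete, here is a minimal Python sketch of EM in a deliberately simplified setting: exponential lifetimes (the special case $\alpha = 1$ of model (1)) with exactly observed failure times and right censoring. This toy setting, the function name and the data are assumptions made for illustration; it is not the grouped-data algorithm of this paper. In this case Step E has the closed form $E[X \mid X > c] = c + 1/\lambda$ by the memoryless property, and Step M is the usual complete-data estimate $n / \sum_i x_i$.

```python
import numpy as np

def em_exponential_censored(times, censored, lam0=1.0, eps=1e-10, max_iter=10_000):
    """EM for an exponential rate with right-censored lifetimes.

    times[i] is the failure time, or the censoring time if censored[i] is True.
    """
    times = np.asarray(times, dtype=float)
    censored = np.asarray(censored, dtype=bool)
    lam = lam0
    for _ in range(max_iter):
        # Step E: a unit censored at time c has expected lifetime c + 1/lam.
        expected = times + censored / lam
        # Step M: the complete-data MLE of an exponential rate is n / sum(x).
        lam_new = times.size / expected.sum()
        if abs(lam_new - lam) < eps:
            return lam_new
        lam = lam_new
    return lam

# Hypothetical data: two observed failures and one unit censored at time 5.
times = [2.0, 3.0, 5.0]
censored = [False, False, True]
lam_em = em_exponential_censored(times, censored)
lam_closed_form = 2 / 10.0   # number of failures / total time on test
print(lam_em, lam_closed_form)
```

For this toy model the censored-data MLE is known in closed form (number of failures divided by total time on test), so the EM iterates can be compared against it directly.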

3.2 The steps of parameter estimation

Suppose that the lifetimes $X_1, X_2, \ldots, X_n$ of $n$ products are independent and identically distributed according to the two-parameter generalized exponential distribution (1). The $n$ products are subjected to a lifetime test, and each either fails in an interval $[T_{i-1}, T_i)$ or is censored at $T_i$. We can only observe $c_i$, the number of $X_j$ in the interval $[T_{i-1}, T_i)$, and $d_i$, the number of $X_j$ censored at $T_i$, where $i = 1, 2, \ldots, N$; $j = 1, 2, \ldots, n$; $0 = T_0 < T_1 < \cdots < T_N < +\infty$. The lifetime vector $X = (X_1, X_2, \ldots, X_n)$ is unobservable and plays the role of the missing data in the EM algorithm; the observable is $Y = (c_1, c_2, \ldots, c_N, d_1, d_2, \ldots, d_N)$, and together they form the complete data $Z = (X, Y)$. To apply the EM algorithm, we introduce random variables $X_{ih}, X_{il}$, which denote the lifetimes of products that fail in the interval $[T_{i-1}, T_i)$ and of products that are censored at $T_i$, respectively. In the following, we obtain the maximum likelihood estimates of the parameters based on Steps E and M of the EM algorithm.

Since $X$ contains all the information in the observation $Y$, we have $f(\alpha, \lambda \mid X, Y) = f(\alpha, \lambda \mid X)$. The log-likelihood of the complete data from the probability density function of the generalized exponential distribution (1) is

$$
\begin{aligned}
\log f(\alpha, \lambda \mid X) &= \log \prod_{i=1}^{N} \left[ \{\alpha \lambda (1 - e^{-\lambda x_{ih}})^{\alpha - 1} e^{-\lambda x_{ih}}\}^{c_i} \cdot \{\alpha \lambda (1 - e^{-\lambda x_{il}})^{\alpha - 1} e^{-\lambda x_{il}}\}^{d_i} \right] \\
&= n \cdot \log(\alpha \lambda) + \sum_{i=1}^{N} \left\{ c_i \cdot \log[(1 - e^{-\lambda x_{ih}})^{\alpha - 1} e^{-\lambda x_{ih}}] + d_i \cdot \log[(1 - e^{-\lambda x_{il}})^{\alpha - 1} e^{-\lambda x_{il}}] \right\}.
\end{aligned}
$$

Given the initial values $\alpha^{(0)}$ and $\lambda^{(0)}$ of the parameters, the steps of the EM algorithm are:

Step E: Given the estimates $\alpha^{(t-1)}, \lambda^{(t-1)}$ of the parameters at step $t - 1$, the $Q$ function at step $t$ is

$$
\begin{aligned}
Q(\alpha, \lambda \mid \alpha^{(t-1)}, \lambda^{(t-1)}, Y) &= E[\log f(\alpha, \lambda \mid X) \mid \alpha^{(t-1)}, \lambda^{(t-1)}, Y] \\
&= n \cdot \log(\alpha \lambda) + \sum_{i=1}^{N} c_i \cdot E\{\log[(1 - e^{-\lambda x_{ih}})^{\alpha - 1} e^{-\lambda x_{ih}}] \mid \alpha^{(t-1)}, \lambda^{(t-1)}, Y\} \\
&\quad + \sum_{i=1}^{N} d_i \cdot E\{\log[(1 - e^{-\lambda x_{il}})^{\alpha - 1} e^{-\lambda x_{il}}] \mid \alpha^{(t-1)}, \lambda^{(t-1)}, Y\}.
\end{aligned}
$$

In the above $Q$ function, the conditional probability density functions of $X_{ih}$ and $X_{il}$ are denoted as

$$
p_{ih}(x) = f_{ih}(x \mid \alpha^{(t-1)}, \lambda^{(t-1)}, Y) = \frac{\alpha^{(t-1)} \lambda^{(t-1)} (1 - e^{-\lambda^{(t-1)} x})^{\alpha^{(t-1)} - 1} e^{-\lambda^{(t-1)} x}}{(1 - e^{-\lambda^{(t-1)} T_i})^{\alpha^{(t-1)}} - (1 - e^{-\lambda^{(t-1)} T_{i-1}})^{\alpha^{(t-1)}}}, \quad x \in [T_{i-1}, T_i),
$$

and

$$
p_{il}(x) = f_{il}(x \mid \alpha^{(t-1)}, \lambda^{(t-1)}, Y) = \frac{\alpha^{(t-1)} \lambda^{(t-1)} (1 - e^{-\lambda^{(t-1)} x})^{\alpha^{(t-1)} - 1} e^{-\lambda^{(t-1)} x}}{1 - (1 - e^{-\lambda^{(t-1)} T_i})^{\alpha^{(t-1)}}}, \quad x \in [T_i, +\infty).
$$

Thus we get

$$
\begin{aligned}
Q(\alpha, \lambda \mid \alpha^{(t-1)}, \lambda^{(t-1)}, Y) &= n \cdot \log(\alpha \lambda) + \sum_{i=1}^{N} c_i \int_{T_{i-1}}^{T_i} p_{ih}(x) \cdot \log[(1 - e^{-\lambda x})^{\alpha - 1} e^{-\lambda x}] \, dx \\
&\quad + \sum_{i=1}^{N} d_i \int_{T_i}^{+\infty} p_{il}(x) \cdot \log[(1 - e^{-\lambda x})^{\alpha - 1} e^{-\lambda x}] \, dx.
\end{aligned}
$$
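It may help to see how this $Q$ function separates in $\alpha$ and $\lambda$. Since $\log[(1 - e^{-\lambda x})^{\alpha - 1} e^{-\lambda x}] = (\alpha - 1)\log(1 - e^{-\lambda x}) - \lambda x$, and each conditional density integrates to 1, the expression can be regrouped as below. The shorthand symbols $A(\lambda)$ and $B$ are introduced here for illustration only and do not appear in the original derivation.

```latex
Q(\alpha,\lambda \mid \alpha^{(t-1)},\lambda^{(t-1)},Y)
  = n\log\alpha + n\log\lambda + (\alpha-1)\,A(\lambda) - \lambda B,

A(\lambda) = \sum_{i=1}^{N} c_i \int_{T_{i-1}}^{T_i} p_{ih}(x)\,\log(1-e^{-\lambda x})\,dx
           + \sum_{i=1}^{N} d_i \int_{T_i}^{+\infty} p_{il}(x)\,\log(1-e^{-\lambda x})\,dx,

B = \sum_{i=1}^{N} c_i \int_{T_{i-1}}^{T_i} x\,p_{ih}(x)\,dx
  + \sum_{i=1}^{N} d_i \int_{T_i}^{+\infty} x\,p_{il}(x)\,dx.
```

Here $B$ does not depend on $(\alpha, \lambda)$, and $A(\lambda) < 0$ because $\log(1 - e^{-\lambda x}) < 0$ for $x > 0$. For fixed $\lambda$, $Q$ is therefore concave in $\alpha$ and is maximized at $\alpha = -n / A(\lambda)$, which leaves a one-dimensional problem in $\lambda$.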

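Step M: Maximize the $Q$ function to obtain the step-$t$ estimates $\alpha^{(t)}, \lambda^{(t)}$ of the parameters $\alpha, \lambda$, i.e., the maximum point $(\alpha^{(t)}, \lambda^{(t)})$ of $Q(\alpha, \lambda \mid \alpha^{(t-1)}, \lambda^{(t-1)}, Y)$, found by differentiating $Q(\alpha, \lambda \mid \alpha^{(t-1)}, \lambda^{(t-1)}, Y)$ with respect to $\alpha$ and $\lambda$.

Differentiating with respect to $\alpha$ and $\lambda$ gives, respectively:

$$
\frac{\partial Q}{\partial \alpha} = \frac{n}{\alpha} + \sum_{i=1}^{N} c_i \int_{T_{i-1}}^{T_i} p_{ih}(x) \cdot \log(1 - e^{-\lambda x}) \, dx + \sum_{i=1}^{N} d_i \int_{T_i}^{+\infty} p_{il}(x) \cdot \log(1 - e^{-\lambda x}) \, dx,
$$

$$
\frac{\partial Q}{\partial \lambda} = \frac{n}{\lambda} + \sum_{i=1}^{N} c_i \int_{T_{i-1}}^{T_i} p_{ih}(x) \cdot \frac{x (\alpha e^{-\lambda x} - 1)}{1 - e^{-\lambda x}} \, dx + \sum_{i=1}^{N} d_i \int_{T_i}^{+\infty} p_{il}(x) \cdot \frac{x (\alpha e^{-\lambda x} - 1)}{1 - e^{-\lambda x}} \, dx.
$$

Setting $\partial Q / \partial \alpha = 0$ and $\partial Q / \partial \lambda = 0$, we obtain

$$
\alpha = \frac{-n}{\sum_{i=1}^{N} c_i \int_{T_{i-1}}^{T_i} p_{ih}(x) \cdot \log(1 - e^{-\lambda x}) \, dx + \sum_{i=1}^{N} d_i \int_{T_i}^{+\infty} p_{il}(x) \cdot \log(1 - e^{-\lambda x}) \, dx}, \tag{5}
$$

$$
\lambda = \frac{-n}{\sum_{i=1}^{N} c_i \int_{T_{i-1}}^{T_i} p_{ih}(x) \cdot \frac{x (\alpha e^{-\lambda x} - 1)}{1 - e^{-\lambda x}} \, dx + \sum_{i=1}^{N} d_i \int_{T_i}^{+\infty} p_{il}(x) \cdot \frac{x (\alpha e^{-\lambda x} - 1)}{1 - e^{-\lambda x}} \, dx}. \tag{6}
$$

Solving (5) and (6) gives the desired $(\alpha^{(t)}, \lambda^{(t)})$, which completes one iteration $(\alpha^{(t-1)}, \lambda^{(t-1)}) \to (\alpha^{(t)}, \lambda^{(t)})$. Equations (5) and (6) are applied repeatedly until $\alpha, \lambda$ converge.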
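The iteration can be sketched in Python as follows. This is one possible implementation under stated assumptions, not the authors' Matlab code. The conditional expectations in Step E are approximated by Gauss-Legendre quadrature after the substitution $u = F(x)$, which turns each integral against $p_{ih}$ or $p_{il}$ into an average over an interval of $u$. In Step M, (5) is used in closed form for $\alpha$ given $\lambda$, and $\lambda$ is then found by a bounded one-dimensional search on $Q$ rather than by iterating (6). The function names, the search bounds `lam_bounds`, the number of quadrature nodes `k`, the starting values and the counts `c`, `d` are all hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import xlogy

def ge_cdf(x, alpha, lam):
    """Distribution function (4)."""
    return (-np.expm1(-lam * np.asarray(x, dtype=float))) ** alpha

def ge_ppf(u, alpha, lam):
    """Inverse of (4): x = -log(1 - u^(1/alpha)) / lam."""
    return -np.log1p(-u ** (1.0 / alpha)) / lam

def grouped_loglik(alpha, lam, T, c, d):
    """Observed-data log-likelihood of Section 2."""
    F = ge_cdf(T, alpha, lam)
    return float(np.sum(xlogy(c, np.diff(F)) + xlogy(d, 1.0 - F[1:])))

def e_step_nodes(alpha, lam, T, c, d, z, gw):
    """Nodes x and weights w with sum(w * g(x)) approximating
    sum_i c_i E[g(X_ih)] + sum_i d_i E[g(X_il)] at the current (alpha, lam)."""
    F = ge_cdf(T, alpha, lam)
    lo = np.concatenate([F[:-1], F[1:]])               # lower limits in u
    hi = np.concatenate([F[1:], np.ones(len(F) - 1)])  # upper limits in u
    counts = np.concatenate([c, d]).astype(float)
    u = lo[:, None] + 0.5 * (hi - lo)[:, None] * (z + 1.0)
    w = counts[:, None] * (0.5 * gw)                   # each row sums to its count
    return ge_ppf(u, alpha, lam).ravel(), w.ravel()

def m_step(x, w, lam_bounds):
    """Maximize Q: alpha from (5) given lam, lam by a bounded 1-D search."""
    n = w.sum()
    B = np.sum(w * x)                                  # expected sum of lifetimes
    def A(lam):                                        # expected sum of log(1 - e^{-lam X})
        return np.sum(w * np.log(-np.expm1(-lam * x)))
    def neg_profile_q(lam):                            # -Q with alpha = -n / A(lam)
        a = A(lam)
        return -(n * np.log(-n / a) + n * np.log(lam) - n - a - lam * B)
    res = minimize_scalar(neg_profile_q, bounds=lam_bounds, method="bounded",
                          options={"xatol": 1e-10})
    return -n / A(res.x), res.x

def em_ge_grouped(T, c, d, alpha0=1.0, lam0=0.1, eps=1e-6,
                  max_iter=2000, lam_bounds=(1e-6, 10.0), k=64):
    T, c, d = (np.asarray(v, dtype=float) for v in (T, c, d))
    z, gw = np.polynomial.legendre.leggauss(k)         # rule on [-1, 1]
    alpha, lam = alpha0, lam0
    for _ in range(max_iter):
        x, w = e_step_nodes(alpha, lam, T, c, d, z, gw)
        alpha_new, lam_new = m_step(x, w, lam_bounds)
        step = np.hypot(alpha_new - alpha, lam_new - lam)
        alpha, lam = alpha_new, lam_new
        if step < eps:
            break
    return alpha, lam

# Hypothetical counts for illustration only (not data from the paper).
T = np.array([0, 5.5, 10.5, 15.5, 20.5, 25.5, 30.5, 40.5, 50.5, 60.5])
c = np.array([10, 18, 16, 12, 9, 6, 8, 4, 2])
d = np.array([1, 2, 2, 2, 1, 1, 1, 1, 4])
alpha_hat, lam_hat = em_ge_grouped(T, c, d)
print(alpha_hat, lam_hat, grouped_loglik(alpha_hat, lam_hat, T, c, d))
```

Because the expectations are approximated, the result should be read as an approximate maximizer; comparing the observed-data log-likelihood at the start and at the end of the run, or running from several starting values, gives a simple sanity check.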

4 Simulation study

Suppose $X_i$, $i = 1, 2, \ldots, n$, are independent and identically distributed samples from the generalized exponential distribution model (1). We consider the simulation example used in [1]: the true parameters are assumed to be $\alpha = 1.50$, $\lambda = 0.06$, and the sample data are divided with $N = 9$, i.e., into 10 groups, taking $T_0 = 0$, $T_1 = 5.5$, $T_2 = 10.5$, $T_3 = 15.5$, $T_4 = 20.5$, $T_5 = 25.5$, $T_6 = 30.5$, $T_7 = 40.5$, $T_8 = 50.5$, $T_9 = 60.5$, $T_{10} = +\infty$, with an error precision of 0.001. For $j \leqslant 8$, the probability of a product being censored at $T_j$ is assumed to be $j/9$, and all products that have not failed by $T_9$ are censored. For each of the sample sizes $n = 60, 120, 200, 500, 1000$, the estimation is replicated $s = 100, 200, 500$ times. If the estimate obtained on the $k$th trial is $\eta_k = (\alpha_k, \lambda_k)$ $(k = 1, 2, \ldots, s)$, then the final estimate and the estimated mean square error are

$$
\mathrm{mean}_j = \frac{1}{s} \sum_{k=1}^{s} \eta_{jk}, \qquad \mathrm{mse}_j = \frac{1}{s - 1} \sum_{k=1}^{s} (\eta_{jk} - \mathrm{mean}_j)^2,
$$

where $\eta_j$ denotes the $j$th component of $\eta$; the corresponding results are given in Tab.1 and Tab.2. All the calculations in this paper were done using Matlab2009b.

From Tab.1 and Tab.2, we can see that the EM algorithm estimates the parameters of the generalized exponential distribution well under grouped and right-censored data, and that the overall estimation accuracy improves as the sample size and the number of simulation replications increase.
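A simulation of this kind needs three ingredients: lifetimes drawn from model (1), a rule that turns them into grouped and censored counts, and the summary statistics defined above. The Python sketch below shows one way to code them. Sampling uses the inverse of the distribution function, $x = -\log(1 - u^{1/\alpha})/\lambda$ with $u$ uniform on $(0, 1)$. The censoring mechanism coded here (each unit still on test at $T_j$ is withdrawn independently with the stated probability) is an assumed reading of the design, and the function names and random seed are hypothetical.

```python
import numpy as np

def ge_sample(n, alpha, lam, rng):
    """Inverse-transform sampling from the generalized exponential distribution."""
    u = rng.uniform(size=n)
    return -np.log1p(-u ** (1.0 / alpha)) / lam

def group_and_censor(x, T, p, rng):
    """Turn lifetimes x into counts (c, d) on the grid T = [T_0, ..., T_N].

    Assumed mechanism: a unit still on test at T_i (i < N) is withdrawn with
    probability p[i-1]; every unit still on test at T_N is censored.
    """
    N = len(T) - 1
    c = np.zeros(N, dtype=int)
    d = np.zeros(N, dtype=int)
    on_test = np.ones(x.size, dtype=bool)
    for i in range(1, N + 1):
        failed = on_test & (x < T[i])      # failures in [T_{i-1}, T_i)
        c[i - 1] = failed.sum()
        on_test &= ~failed
        if i < N:
            withdrawn = on_test & (rng.uniform(size=x.size) < p[i - 1])
        else:
            withdrawn = on_test.copy()     # everything left is censored at T_N
        d[i - 1] = withdrawn.sum()
        on_test &= ~withdrawn
    return c, d

def summarize(estimates):
    """Column-wise mean and mse as defined above (divisor s - 1)."""
    est = np.asarray(estimates, dtype=float)
    return est.mean(axis=0), est.var(axis=0, ddof=1)

rng = np.random.default_rng(0)             # arbitrary seed
T = np.array([0, 5.5, 10.5, 15.5, 20.5, 25.5, 30.5, 40.5, 50.5, 60.5])
p = np.arange(1, 9) / 9.0                  # censoring probabilities at T_1..T_8
x = ge_sample(200, 1.5, 0.06, rng)
c, d = group_and_censor(x, T, p, rng)
print(c, d, c.sum() + d.sum())
```

In a full study, each replicate would pass its `(c, d)` to an estimator and collect the resulting pairs $(\alpha_k, \lambda_k)$ as the rows given to `summarize`.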

5 Analysis of a set of clinical data

This section analyzes a set of real data to illustrate the practical implications of the methodology of this paper. Angina pectoris is a clinical syndrome caused by acute, transient ischemia and hypoxia of the myocardium due to inadequate coronary blood supply, mostly in men. The following data on 2418 male patients with angina pectoris are taken from the work of Parker et al. [6]. The survival time was calculated in years from the time of diagnosis, with 16 intervals, the first 15 intervals being one year long, i.e., $I_j = (j - 1, j]$, $j = 1, 2, \ldots, 15$, $I_{16} = (15, \infty)$. The number of deaths and cases lost to follow-up in each interval is shown in Tab.3.

The above data were analyzed in [6] using the non-parametric product-limit method to estimate the survival and hazard rate functions. It was concluded that mortality was highest in the first year after diagnosis and remained essentially constant from the end of the first year to the beginning of the 10th year, fluctuating between 0.09 and 0.12, with the hazard rate function generally higher after 10 years. Thus, regardless of age, sex or race, patients who survive beyond one year have a better prognosis than those who are newly diagnosed, with a 5-year survival rate of 0.5193. In this paper, we use the generalized exponential distribution model (1) to analyze this dataset. Our EM algorithm gives estimates of the shape parameter $\alpha$ and the scale parameter $\lambda$ of $\hat{\alpha} = 0.769$ and $\hat{\lambda} = 0.106$, respectively, so the fitted survival function and hazard rate function are

$$
\hat{S}(x) = 1 - (1 - e^{-0.106 x})^{0.769}, \quad x \geqslant 0, \tag{7}
$$

$$
\hat{h}(x) = \frac{0.0815 \cdot (1 - e^{-0.106 x})^{-0.231} e^{-0.106 x}}{1 - (1 - e^{-0.106 x})^{0.769}}, \quad x \geqslant 0. \tag{8}
$$

Since the shape parameter is estimated to be $\hat{\alpha} = 0.769 < 1$, the hazard rate function is decreasing. Fig.1 shows that the hazard rate function is monotonically decreasing, with relatively large values in the first two years (0.1501 in the 1st year and 0.1341 in the 2nd year), decreasing slowly from the 3rd year to the 10th year, and decreasing very slowly from the 10th year to the 30th year, remaining at about 0.106 to 0.115. According to the fitted models (7) and (8), the average life expectancy is 7.9264 years and the 5-year survival rate is 0.4953 (which is close to the analysis in [6]). Another important lifetime indicator is the average remaining life, which at time $t$ is given by

$$
\mu(t) = \frac{1}{S(t)} \int_{t}^{\infty} S(x) \, dx.
$$

This gives an average remaining life expectancy of 8.4710 years at 1 year, 8.9964 years at 5 years, and 9.2205 years at 10 years. Again, it can be concluded that patients who have survived for several years have a longer average remaining life expectancy than those who have just been diagnosed, regardless of age, sex, or race. Of course, with the continuous improvements in modern medical technology and the gradual improvement in the effectiveness of the drugs used to treat angina, the survival rate and the average remaining life expectancy of patients with angina have greatly improved.
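The quantities discussed in this section can be recomputed from the reported estimates $\hat{\alpha} = 0.769$ and $\hat{\lambda} = 0.106$ by numerical integration. The Python sketch below uses `scipy.integrate.quad` for the integrals of the survival function; the function names are hypothetical. Because the estimates are quoted to three decimals, the results should be close to, but need not exactly match, the figures given in the text.

```python
import numpy as np
from scipy.integrate import quad

ALPHA_HAT, LAM_HAT = 0.769, 0.106   # estimates reported in the text

def sf(x, alpha=ALPHA_HAT, lam=LAM_HAT):
    """Survival function (2) at the fitted parameters."""
    return -np.expm1(alpha * np.log1p(-np.exp(-lam * x)))

def hazard(x, alpha=ALPHA_HAT, lam=LAM_HAT):
    """Hazard rate (3) at the fitted parameters."""
    pdf = alpha * lam * (-np.expm1(-lam * x)) ** (alpha - 1.0) * np.exp(-lam * x)
    return pdf / sf(x, alpha, lam)

def mean_residual_life(t):
    """mu(t) = (1 / S(t)) * integral of S(x) over [t, infinity)."""
    area, _ = quad(sf, t, np.inf)
    return area / sf(t)

mean_life = quad(sf, 0.0, np.inf)[0]   # mean lifetime = integral of S over [0, inf)
print("hazard at years 1 and 2:", hazard(1.0), hazard(2.0))
print("5-year survival:", sf(5.0))
print("mean life:", mean_life)
for t in (1.0, 5.0, 10.0):
    print(f"mean residual life at {t:g} years:", mean_residual_life(t))
```

The mean lifetime is obtained here as the integral of the survival function over $[0, \infty)$, which is the $t = 0$ case of the average remaining life formula.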

References


[1]	Chen D G, Lio Y L. Parameter estimations for generalized exponential distribution under progressive type-I interval censoring. Comput Stat Data Anal 2010; 54(6): 1581–1591

[2]	Gupta R D, Kundu D. Generalized exponential distributions. Austr New Zealand J Statist 1999; 41(2): 173–188

[3]	Gupta R D, Kundu D. Generalized exponential distribution: existing results and some recent developments. J Statist Plann Inference 2007; 137(11): 3537–3547

[4]	Gupta R D, Kundu D. Generalized exponential distribution: Bayesian estimations. Comput Statist Data Anal 2008; 52(4): 1873–1883

[5]	Kundu D, Pradhan B. Estimating the parameters of the generalized exponential distribution in presence of hybrid censoring. Commun Stat Theory Methods 2009; 38(12): 2030–2041

[6]	Lee E T, Wang J W. Statistical Methods for Survival Data Analysis, 3rd ed. New York: John Wiley & Sons, 2003

[7]	Liu L P. Estimation of MLE for Weibull distribution with grouped and censored data. Chinese Journal of Applied Probability and Statistics 2001; 17(2): 133–138

[8]	Liu X, Chen H, Fei H L. Estimation of the parameters in the lognormal distribution with grouped and right-censored data. Chinese Journal of Applied Probability and Statistics 2008; 24(4): 371–380

[9]	Pettitt A N. Re-weighted least squares estimation with censored and grouped data: an application of the EM algorithm. Royal Statistical Society 1985; 47(2): 253–260

[10]	Raqab M Z. Inferences for generalized exponential distribution based on record statistics. J Statist Plann Inference 2002; 104(2): 339–350

[11]	Raqab M Z, Madi M T. Bayesian inference for the generalized exponential distribution. J Statist Comput Simul 2005; 75(10): 841–852

[12]	Sarhan A M. Analysis of incomplete, censored data in competing risks models with generalized exponential distribution. IEEE Trans Reliability 2007; 56(1): 132–138

RIGHTS & PERMISSIONS

Higher Education Press 2023
