Parameters estimation and application of generalized exponential distribution under grouped and right-censored data

Yuzhu TIAN , Maozai TIAN , Ping CHEN

Front. Math. China ›› 2023, Vol. 18 ›› Issue (3) : 165 -174.

DOI: 10.3868/s140-DDD-023-0013-x
RESEARCH ARTICLE

Abstract

The generalized exponential distribution is an important class of distributions in lifetime data analysis, especially for skewed lifetime data. This paper considers parameter estimation for the generalized exponential distribution model with grouped and right-censored data. The maximum likelihood estimators are obtained using the EM algorithm. Simulations are carried out to illustrate that the proposed algorithm is effective for the model. Finally, a set of medical data is analyzed using the generalized exponential distribution.

Keywords

Generalized exponential distribution / grouped and right-censored data / EM algorithm

Cite this article

Yuzhu TIAN, Maozai TIAN, Ping CHEN. Parameters estimation and application of generalized exponential distribution under grouped and right-censored data. Front. Math. China, 2023, 18(3): 165-174 DOI:10.3868/s140-DDD-023-0013-x

1 Introduction

Gupta and Kundu (1999) [2] proposed the two-parameter generalized exponential distribution as an alternative to the Gamma, Weibull and log-normal distributions, and investigated how its properties differ from theirs. The two-parameter generalized exponential distribution has important applications in survival analysis, product life analysis and reliability engineering, especially for skewed failure life data. Many results on generalized exponential distributions are available for general sample data [1, 3–5, 10–12]. However, in survival analysis and lifetime data studies, the more complex situation of grouped and right-censored sample data [7–9] often arises. In this paper, we use the EM algorithm to estimate the parameters of the two-parameter generalized exponential distribution from such sample data.

The probability density function, survival function and hazard rate function of the two-parameter generalized exponential distribution model are

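$$f(x;\alpha,\lambda)=\alpha\lambda\left(1-e^{-\lambda x}\right)^{\alpha-1}e^{-\lambda x},\quad x\ge 0, \tag{1}$$

$$S(x;\alpha,\lambda)=1-\left(1-e^{-\lambda x}\right)^{\alpha},\quad x\ge 0,$$

$$h(x;\alpha,\lambda)=\frac{f(x;\alpha,\lambda)}{S(x;\alpha,\lambda)}=\frac{\alpha\lambda\left(1-e^{-\lambda x}\right)^{\alpha-1}e^{-\lambda x}}{1-\left(1-e^{-\lambda x}\right)^{\alpha}},\quad x\ge 0,$$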

where $\alpha,\lambda>0$ are the shape and scale parameters of the model, respectively. When the shape parameter $\alpha=1$, the model reduces to the ordinary exponential distribution. Whether the hazard function $h(x;\alpha,\lambda)$ of model (1) is increasing or decreasing does not depend on $\lambda$, but only on $\alpha$: when $\alpha>1$, $h(x;\alpha,\lambda)$ is increasing; when $\alpha<1$, it is decreasing; when $\alpha=1$, it is constant [3]. Section 2 of this paper briefly describes the maximum likelihood estimation of the parameters. Section 3 considers the estimation of the parameters of the model using the EM algorithm. Section 4 presents the numerical simulations. Section 5 analyzes a set of real data using the generalized exponential distribution model.
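
As a quick numerical illustration of the three functions above and of how the shape parameter governs the hazard, the following minimal sketch evaluates them in plain Python. The function names `ge_pdf`, `ge_survival` and `ge_hazard` are illustrative choices, not names used in the paper.

```python
import math

def ge_pdf(x, alpha, lam):
    """Density alpha*lam*(1 - e^{-lam x})^(alpha-1) * e^{-lam x} for x > 0."""
    return alpha * lam * (-math.expm1(-lam * x)) ** (alpha - 1) * math.exp(-lam * x)

def ge_survival(x, alpha, lam):
    """Survival function 1 - (1 - e^{-lam x})^alpha."""
    return 1.0 - (-math.expm1(-lam * x)) ** alpha

def ge_hazard(x, alpha, lam):
    """Hazard rate f(x) / S(x)."""
    return ge_pdf(x, alpha, lam) / ge_survival(x, alpha, lam)

if __name__ == "__main__":
    # Hazard on a small grid for alpha < 1, alpha = 1 and alpha > 1 (lam = 1).
    for alpha in (0.5, 1.0, 2.0):
        values = [ge_hazard(x, alpha, 1.0) for x in (0.5, 1.0, 2.0, 4.0)]
        print(alpha, [round(v, 4) for v in values])
```

On this grid the printed hazards should decrease for the shape value below 1, stay equal to the scale parameter for shape 1, and increase for the shape value above 1.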

2 The log-likelihood function of model (1) for the grouped data and right-censored case

Suppose the lifetime of a product follows the two-parameter generalized exponential distribution (1). Its distribution function is

$$F(x;\alpha,\lambda)=\left(1-e^{-\lambda x}\right)^{\alpha},\quad x\ge 0.$$

Now take $n$ products for a lifetime test, with data recorded as follows. The half-line $[0,+\infty)$ is divided into $N+1$ intervals, and the first $N$ intervals are denoted by $[T_{i-1},T_i)$, where $i=1,2,\dots,N$ and $0=T_0<T_1<\cdots<T_N<+\infty$. Let $c_i$ denote the number of failed products falling into $[T_{i-1},T_i)$, and let $d_i$ denote the number of products censored at $T_i$. Then $n=\sum_{i=1}^{N}(c_i+d_i)$.

The likelihood function is

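$$L(\alpha,\lambda)=\prod_{i=1}^{N}\left[F(T_i;\alpha,\lambda)-F(T_{i-1};\alpha,\lambda)\right]^{c_i}\left[1-F(T_i;\alpha,\lambda)\right]^{d_i}=\prod_{i=1}^{N}\left[\left(1-e^{-\lambda T_i}\right)^{\alpha}-\left(1-e^{-\lambda T_{i-1}}\right)^{\alpha}\right]^{c_i}\left[1-\left(1-e^{-\lambda T_i}\right)^{\alpha}\right]^{d_i}.$$

The log-likelihood function is

$$\log L(\alpha,\lambda)=\sum_{i=1}^{N}\left\{c_i\log\left[\left(1-e^{-\lambda T_i}\right)^{\alpha}-\left(1-e^{-\lambda T_{i-1}}\right)^{\alpha}\right]+d_i\log\left[1-\left(1-e^{-\lambda T_i}\right)^{\alpha}\right]\right\}.$$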

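Taking the partial derivatives of the log-likelihood function with respect to the parameters and setting them equal to zero gives

$$\frac{\partial\log L(\alpha,\lambda)}{\partial\alpha}=\sum_{i=1}^{N}\left\{c_i\frac{\left(1-e^{-\lambda T_i}\right)^{\alpha}\log\left(1-e^{-\lambda T_i}\right)-\left(1-e^{-\lambda T_{i-1}}\right)^{\alpha}\log\left(1-e^{-\lambda T_{i-1}}\right)}{\left(1-e^{-\lambda T_i}\right)^{\alpha}-\left(1-e^{-\lambda T_{i-1}}\right)^{\alpha}}-d_i\frac{\left(1-e^{-\lambda T_i}\right)^{\alpha}\log\left(1-e^{-\lambda T_i}\right)}{1-\left(1-e^{-\lambda T_i}\right)^{\alpha}}\right\}=0,$$

$$\frac{\partial\log L(\alpha,\lambda)}{\partial\lambda}=\alpha\sum_{i=1}^{N}\left\{c_i\frac{\left(1-e^{-\lambda T_i}\right)^{\alpha-1}e^{-\lambda T_i}T_i-\left(1-e^{-\lambda T_{i-1}}\right)^{\alpha-1}e^{-\lambda T_{i-1}}T_{i-1}}{\left(1-e^{-\lambda T_i}\right)^{\alpha}-\left(1-e^{-\lambda T_{i-1}}\right)^{\alpha}}-d_i\frac{\left(1-e^{-\lambda T_i}\right)^{\alpha-1}e^{-\lambda T_i}T_i}{1-\left(1-e^{-\lambda T_i}\right)^{\alpha}}\right\}=0.$$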

Solving the above system of equations yields the maximum likelihood estimates of the parameters $\alpha$ and $\lambda$. However, because this system of non-linear equations is complicated, explicit expressions for the parameter estimates cannot be obtained, and even finding the maximum likelihood estimates numerically (e.g., by Newton's method) is quite involved. We therefore use the EM algorithm to obtain the maximum likelihood estimates of the parameters more efficiently.
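
To make the grouped and right-censored log-likelihood of this section concrete, here is a minimal sketch that evaluates it for given counts. The function names and the small data set in the demo are hypothetical and only serve as an illustration; the boundary list `T` is assumed to start with $T_0=0$.

```python
import math

def ge_cdf(x, alpha, lam):
    """Distribution function (1 - e^{-lam x})^alpha."""
    return (-math.expm1(-lam * x)) ** alpha

def grouped_censored_loglik(alpha, lam, T, c, d):
    """Log-likelihood for failures c[i-1] in [T[i-1], T[i]) and
    d[i-1] units censored at T[i], with T = [0, T_1, ..., T_N]."""
    total = 0.0
    for i in range(1, len(T)):
        F_prev = ge_cdf(T[i - 1], alpha, lam)
        F_curr = ge_cdf(T[i], alpha, lam)
        if c[i - 1] > 0:  # interval-failure contribution
            total += c[i - 1] * math.log(F_curr - F_prev)
        if d[i - 1] > 0:  # right-censoring contribution
            total += d[i - 1] * math.log(1.0 - F_curr)
    return total

if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    T = [0.0, 5.5, 10.5, 15.5]
    c = [10, 12, 8]
    d = [1, 2, 7]
    print(grouped_censored_loglik(1.5, 0.06, T, c, d))
```

A function of this kind can be handed to any general-purpose optimizer, which is the direct route that the EM algorithm of Section 3 is meant to avoid.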

3 Parameter estimation methods

3.1 Introduction to the EM algorithm

The EM algorithm, proposed by Dempster et al. in 1977, is an iterative algorithm for computing the MLE of a model with missing data from the observed data. Suppose the complete data $Z=(Y,X)$ consist of the observed data $Y$ and the missing data $X$. Let $f(Y,X\mid\eta)$ be the joint probability density of the complete data $Z$, and let $f(X\mid Y,\eta)$ be the conditional density of the missing data $X$ given the observed data $Y=y$, where $\eta$ is the parameter to be estimated. The MLE of $\eta$ is obtained by maximizing the log-likelihood $L(\eta\mid Y)$ of the observed data $Y$. To maximize $L(\eta\mid Y)$, consider the log-likelihood $L(\eta\mid Z)=\log[f(Y\mid\eta)f(X\mid Y,\eta)]$ of the complete data. The EM algorithm consists of two steps: Step E and Step M.

Step E: Given an initial value $\eta^{(0)}$, assume that the estimate of $\eta$ obtained after the $(t-1)$th iteration of the algorithm is $\eta^{(t-1)}$, and define the conditional expectation of the complete-data log-likelihood, the so-called Q function, as

$$Q(\eta\mid\eta^{(t-1)})=\int L(\eta\mid Z)\,f(X\mid Y,\eta^{(t-1)})\,dX=E_{\eta^{(t-1)}}\{L(\eta\mid Z)\}.$$

Step M: Maximize $Q(\eta\mid\eta^{(t-1)})$ to give $\eta^{(t)}$ as an update to $\eta$.

Repeat Steps E and M until $\|\eta^{(t)}-\eta^{(t-1)}\|$ is less than some small value $\varepsilon$. It can be shown that an update which increases the Q function does not decrease $L(\eta\mid Y)$, so repeatedly maximizing the Q function drives the estimate toward a maximizer of $L(\eta\mid Y)$. In practice, several different initial values $\eta^{(0)}$ should be tried and compared to prevent the algorithm from stopping at a local maximum.
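
The link between the Q function and the observed-data log-likelihood can be checked with a short, standard derivation, sketched here in the notation of this section. The symbol $H$ is an auxiliary quantity introduced only for this sketch, and all expectations are taken over $X$ given $Y$ under $\eta^{(t-1)}$.

```latex
\begin{aligned}
\log f(Y\mid\eta) &= \log f(Y,X\mid\eta)-\log f(X\mid Y,\eta),\\
L(\eta\mid Y) &= Q(\eta\mid\eta^{(t-1)})-H(\eta\mid\eta^{(t-1)}),
\qquad H(\eta\mid\eta^{(t-1)})=E_{\eta^{(t-1)}}\{\log f(X\mid Y,\eta)\},\\
H(\eta\mid\eta^{(t-1)}) &\le H(\eta^{(t-1)}\mid\eta^{(t-1)})
\quad\text{(Jensen's inequality)},\\
L(\eta^{(t)}\mid Y)-L(\eta^{(t-1)}\mid Y) &\ge Q(\eta^{(t)}\mid\eta^{(t-1)})-Q(\eta^{(t-1)}\mid\eta^{(t-1)})\ \ge\ 0.
\end{aligned}
```

The last line holds because Step M chooses $\eta^{(t)}$ to maximize $Q(\cdot\mid\eta^{(t-1)})$, so each iteration cannot decrease the observed-data log-likelihood.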

3.2 The steps of parameter estimation

Suppose that the lifetimes $X_1,X_2,\dots,X_n$ of $n$ products are independent and identically distributed according to the two-parameter generalized exponential distribution (1). The $n$ products are subjected to a lifetime test, and each one either fails in an interval $[T_{i-1},T_i)$ or is censored at $T_i$. We can only observe $c_i$, the number of $X_j$ in the interval $[T_{i-1},T_i)$, and $d_i$, the number of $X_j$ censored at $T_i$, where $i=1,2,\dots,N$; $j=1,2,\dots,n$; $0=T_0<T_1<\cdots<T_N<+\infty$. The product lifetimes $X=(X_1,X_2,\dots,X_n)$ are unobservable and are called the missing data in the EM algorithm, while the observable data are $Y=(c_1,c_2,\dots,c_N,d_1,d_2,\dots,d_N)$; together they form the complete data $Z=(X,Y)$. To apply the EM algorithm, we introduce random variables $X_{ih}$ and $X_{il}$, which denote the lifetime of a product that fails in the interval $[T_{i-1},T_i)$ and the lifetime of a product that is censored at $T_i$, respectively. In the following, we obtain the maximum likelihood estimates of the parameters based on Steps E and M of the EM algorithm.

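Since the information in $X$ contains all the information in the observation $Y$, we have $f(\alpha,\lambda\mid X,Y)=f(\alpha,\lambda\mid X)$. The log-likelihood of the complete data from the probability density function of the generalized exponential distribution (1) is

$$\log f(\alpha,\lambda\mid X)=\log\prod_{i=1}^{N}\left[\left\{\alpha\lambda\left(1-e^{-\lambda x_{ih}}\right)^{\alpha-1}e^{-\lambda x_{ih}}\right\}^{c_i}\left\{\alpha\lambda\left(1-e^{-\lambda x_{il}}\right)^{\alpha-1}e^{-\lambda x_{il}}\right\}^{d_i}\right]=n\log(\alpha\lambda)+\sum_{i=1}^{N}\left\{c_i\log\left[\left(1-e^{-\lambda x_{ih}}\right)^{\alpha-1}e^{-\lambda x_{ih}}\right]+d_i\log\left[\left(1-e^{-\lambda x_{il}}\right)^{\alpha-1}e^{-\lambda x_{il}}\right]\right\}.$$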

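Given the initial values $\alpha^{(0)}$ and $\lambda^{(0)}$ of the parameters, the steps of the EM algorithm are:

Step E: Given the estimates $\alpha^{(t-1)},\lambda^{(t-1)}$ of the parameters at step $t-1$, the Q function at step $t$ is

$$Q(\alpha,\lambda\mid\alpha^{(t-1)},\lambda^{(t-1)},Y)=E\left[\log f(\alpha,\lambda\mid X)\mid\alpha^{(t-1)},\lambda^{(t-1)},Y\right]=n\log(\alpha\lambda)+\sum_{i=1}^{N}c_i\,E\left\{\log\left[\left(1-e^{-\lambda x_{ih}}\right)^{\alpha-1}e^{-\lambda x_{ih}}\right]\Big|\,\alpha^{(t-1)},\lambda^{(t-1)},Y\right\}+\sum_{i=1}^{N}d_i\,E\left\{\log\left[\left(1-e^{-\lambda x_{il}}\right)^{\alpha-1}e^{-\lambda x_{il}}\right]\Big|\,\alpha^{(t-1)},\lambda^{(t-1)},Y\right\}.$$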

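In the above Q function, the conditional probability density functions of $X_{ih}$ and $X_{il}$ are denoted by

$$p_{ih}(x)=f_{ih}(x\mid\alpha^{(t-1)},\lambda^{(t-1)},Y)=\frac{\alpha^{(t-1)}\lambda^{(t-1)}\left(1-e^{-\lambda^{(t-1)}x}\right)^{\alpha^{(t-1)}-1}e^{-\lambda^{(t-1)}x}}{\left(1-e^{-\lambda^{(t-1)}T_i}\right)^{\alpha^{(t-1)}}-\left(1-e^{-\lambda^{(t-1)}T_{i-1}}\right)^{\alpha^{(t-1)}}},\quad x\in[T_{i-1},T_i),$$

and

$$p_{il}(x)=f_{il}(x\mid\alpha^{(t-1)},\lambda^{(t-1)},Y)=\frac{\alpha^{(t-1)}\lambda^{(t-1)}\left(1-e^{-\lambda^{(t-1)}x}\right)^{\alpha^{(t-1)}-1}e^{-\lambda^{(t-1)}x}}{1-\left(1-e^{-\lambda^{(t-1)}T_i}\right)^{\alpha^{(t-1)}}},\quad x\in[T_i,+\infty).$$

Thus we get

$$Q(\alpha,\lambda\mid\alpha^{(t-1)},\lambda^{(t-1)},Y)=n\log(\alpha\lambda)+\sum_{i=1}^{N}c_i\int_{T_{i-1}}^{T_i}p_{ih}(x)\log\left[\left(1-e^{-\lambda x}\right)^{\alpha-1}e^{-\lambda x}\right]dx+\sum_{i=1}^{N}d_i\int_{T_i}^{+\infty}p_{il}(x)\log\left[\left(1-e^{-\lambda x}\right)^{\alpha-1}e^{-\lambda x}\right]dx.$$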

Step M: Maximize the Q function to obtain the estimates $\alpha^{(t)},\lambda^{(t)}$ of the parameters $\alpha,\lambda$ at step $t$, i.e., find the maximizer $(\alpha^{(t)},\lambda^{(t)})$ of $Q(\alpha,\lambda\mid\alpha^{(t-1)},\lambda^{(t-1)},Y)$ by differentiating it with respect to $\alpha$ and $\lambda$.

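Differentiating with respect to $\alpha$ and $\lambda$ gives, respectively,

$$\frac{\partial Q}{\partial\alpha}=\frac{n}{\alpha}+\sum_{i=1}^{N}c_i\int_{T_{i-1}}^{T_i}p_{ih}(x)\log\left(1-e^{-\lambda x}\right)dx+\sum_{i=1}^{N}d_i\int_{T_i}^{+\infty}p_{il}(x)\log\left(1-e^{-\lambda x}\right)dx,$$

$$\frac{\partial Q}{\partial\lambda}=\frac{n}{\lambda}+\sum_{i=1}^{N}c_i\int_{T_{i-1}}^{T_i}p_{ih}(x)\frac{x\left(\alpha e^{-\lambda x}-1\right)}{1-e^{-\lambda x}}dx+\sum_{i=1}^{N}d_i\int_{T_i}^{+\infty}p_{il}(x)\frac{x\left(\alpha e^{-\lambda x}-1\right)}{1-e^{-\lambda x}}dx.$$

Setting $\partial Q/\partial\alpha=0$ and $\partial Q/\partial\lambda=0$, we obtain

$$\alpha=\frac{-n}{\sum_{i=1}^{N}c_i\int_{T_{i-1}}^{T_i}p_{ih}(x)\log\left(1-e^{-\lambda x}\right)dx+\sum_{i=1}^{N}d_i\int_{T_i}^{+\infty}p_{il}(x)\log\left(1-e^{-\lambda x}\right)dx}, \tag{5}$$

$$\lambda=\frac{-n}{\sum_{i=1}^{N}c_i\int_{T_{i-1}}^{T_i}p_{ih}(x)\frac{x\left(\alpha e^{-\lambda x}-1\right)}{1-e^{-\lambda x}}dx+\sum_{i=1}^{N}d_i\int_{T_i}^{+\infty}p_{il}(x)\frac{x\left(\alpha e^{-\lambda x}-1\right)}{1-e^{-\lambda x}}dx}. \tag{6}$$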

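The solution of equations (5) and (6) is the desired $(\alpha^{(t)},\lambda^{(t)})$, which completes one iteration $(\alpha^{(t-1)},\lambda^{(t-1)})\to(\alpha^{(t)},\lambda^{(t)})$. Equations (5) and (6) are solved repeatedly in this way until $\alpha$ and $\lambda$ converge.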
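
The iteration described in this section can be sketched in plain Python as follows. This is only one possible realization under stated assumptions, not the authors' Matlab implementation: the E-step integrals are approximated by a midpoint rule after the substitution $u=F(x)$ at the current estimates, and the M-step solves (5) in closed form for a given $\lambda$ and then finds the root of $\partial Q/\partial\lambda$ by bisection. All function names, the node count `m`, the starting values and the tolerances are illustrative choices.

```python
import math

def ge_cdf(x, alpha, lam):
    return (-math.expm1(-lam * x)) ** alpha

def ge_quantile(u, alpha, lam):
    """Inverse of ge_cdf: x = -log(1 - u^(1/alpha)) / lam."""
    return -math.log1p(-(u ** (1.0 / alpha))) / lam

def e_step_nodes(T, c, d, alpha, lam, m=40):
    """Weighted nodes (x, w) approximating c_i * p_ih and d_i * p_il.
    Under u = F(x) at the current estimates each conditional density is
    uniform in u, so midpoints in u give the quadrature nodes."""
    nodes = []
    for i in range(1, len(T)):
        a, b = ge_cdf(T[i - 1], alpha, lam), ge_cdf(T[i], alpha, lam)
        for lo, hi, count in ((a, b, c[i - 1]), (b, 1.0, d[i - 1])):
            if count > 0:
                for k in range(m):
                    u = lo + (hi - lo) * (k + 0.5) / m
                    nodes.append((ge_quantile(u, alpha, lam), count / m))
    return nodes

def m_step(nodes, n, lam_start):
    def alpha_of(lam):  # equation (5) for a fixed lam
        return -n / sum(w * math.log(-math.expm1(-lam * x)) for x, w in nodes)

    def score(lam):  # dQ/dlam evaluated at alpha = alpha_of(lam)
        a = alpha_of(lam)
        s = sum(w * x * (a * math.exp(-lam * x) - 1.0) / (-math.expm1(-lam * x))
                for x, w in nodes)
        return n / lam + s

    lo, hi = lam_start / 2.0, lam_start * 2.0
    while score(lo) < 0:  # widen the bracket until the root is enclosed
        lo /= 2.0
    while score(hi) > 0:
        hi *= 2.0
    for _ in range(40):  # bisection on the profiled score
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return alpha_of(lam), lam

def em_fit(T, c, d, alpha=1.0, lam=0.1, tol=1e-6, max_iter=500):
    """T = [0, T_1, ..., T_N]; c[i-1], d[i-1] are the counts c_i, d_i."""
    n = sum(c) + sum(d)
    for _ in range(max_iter):
        nodes = e_step_nodes(T, c, d, alpha, lam)
        new_alpha, new_lam = m_step(nodes, n, lam)
        done = max(abs(new_alpha - alpha), abs(new_lam - lam)) < tol
        alpha, lam = new_alpha, new_lam
        if done:
            break
    return alpha, lam
```

Calling `em_fit(T, c, d)` with the boundary list and the two count vectors returns the pair of estimates. Profiling out the shape parameter turns the two-dimensional M-step into a one-dimensional root search, which is a convenience of this sketch rather than something prescribed by the paper.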

4 Simulation study

Suppose $X_i$, $i=1,2,\dots,n$, are independent and identically distributed samples from the generalized exponential distribution model (1). We consider the simulation example used in [1]: the true parameters are taken to be $\alpha=1.50$, $\lambda=0.06$, and the sample data are divided with $N=9$, i.e., into 10 groups, taking $T_0=0$, $T_1=5.5$, $T_2=10.5$, $T_3=15.5$, $T_4=20.5$, $T_5=25.5$, $T_6=30.5$, $T_7=40.5$, $T_8=50.5$, $T_9=60.5$, $T_{10}=+\infty$, with a convergence tolerance of 0.001. For $j\le 8$, the probability of a product being censored at $T_j$ is assumed to be j/9, and all products that have not failed by $T_9$ are censored. We examine the estimates obtained from $s=100,200,500$ replications for each of the sample sizes $n=60,120,200,500,1000$. If the estimate obtained in the $k$th trial is $\eta_k=(\alpha_k,\lambda_k)$ $(k=1,2,\dots,s)$, then the final estimate and the estimated mean square error are

$$\mathrm{mean}_j=\frac{1}{s}\sum_{k=1}^{s}\eta_{jk},\qquad \mathrm{mse}_j=\frac{1}{s-1}\sum_{k=1}^{s}\left(\eta_{jk}-\mathrm{mean}_j\right)^2,$$

where $\eta_j$ denotes the $j$th component of $\eta$. The corresponding results are reported in Tab.1 and Tab.2. All the calculations in this paper were done using Matlab2009b.
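
The data-generating step and the two summary statistics of this section can be sketched as below. The censoring mechanism is an assumption of this sketch: `censor_prob[i]` is taken to be the probability that a unit still on test at a boundary is withdrawn there, and the values used in the demo are hypothetical rather than the paper's. The function names are likewise illustrative.

```python
import math
import random

def simulate_grouped_censored(n, alpha, lam, T, censor_prob, rng):
    """Return counts (c, d) for boundaries T = [T_1, ..., T_N].
    censor_prob[-1] should be 1.0 so that every unit surviving
    past T_N is censored there."""
    N = len(T)
    c, d = [0] * N, [0] * N
    for _ in range(n):
        # Inverse-transform draw from the generalized exponential model.
        x = -math.log1p(-(rng.random() ** (1.0 / alpha))) / lam
        for i in range(N):
            if x < T[i]:                      # failure inside the interval
                c[i] += 1
                break
            if rng.random() < censor_prob[i]:  # withdrawn at T[i]
                d[i] += 1
                break
    return c, d

def mean_and_mse(estimates):
    """Sample mean and the (s - 1)-denominator spread used in the text."""
    s = len(estimates)
    mean = sum(estimates) / s
    mse = sum((e - mean) ** 2 for e in estimates) / (s - 1)
    return mean, mse

if __name__ == "__main__":
    rng = random.Random(2023)
    T = [5.5, 10.5, 15.5, 20.5, 25.5, 30.5, 40.5, 50.5, 60.5]
    probs = [0.05] * 8 + [1.0]  # hypothetical censoring probabilities
    print(simulate_grouped_censored(200, 1.5, 0.06, T, probs, rng))
```

Each simulated pair of count vectors would then be passed to the estimation routine, and `mean_and_mse` applied to the resulting list of estimates for each parameter.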

From Tab.1 and Tab.2, we can see that the EM algorithm estimates the parameters of the generalized exponential distribution well for grouped and right-censored data, and that the estimates generally improve as the sample size and the number of simulation replications increase.

5 Analysis of a set of clinical data

This section analyzes a set of real data to illustrate the practical value of the methodology of this paper. Angina pectoris is a clinical syndrome caused by acute, transient ischemia and hypoxia of the myocardium due to inadequate coronary blood supply, and occurs mostly in men. The following data on 2418 male patients with angina pectoris are taken from the work of Parker et al. [6]. The survival time was calculated in years from the time of diagnosis, with 16 intervals, the first 15 intervals being one year long, i.e., $I_j=(j-1,j]$, $j=1,2,\dots,15$, and $I_{16}=(15,\infty)$. The number of deaths and cases lost to follow-up in each interval is shown in Tab.3.

In [6], the survival and hazard rate functions for these data were estimated using the non-parametric product-limit method. It was concluded that mortality was highest in the first year after diagnosis and remained essentially constant from the end of the first year to the beginning of the 10th year, fluctuating between 0.09 and 0.12, and that the hazard rate was generally higher after 10 years. Thus, regardless of age, sex or race, patients who survive beyond one year have a better prognosis than those who are newly diagnosed, with a 5-year survival rate of 0.5193. In this paper, we use the generalized exponential distribution model (1) to analyze this dataset, and use our EM algorithm to estimate the shape parameter $\alpha$ and scale parameter $\lambda$ as $\hat\alpha=0.769$ and $\hat\lambda=0.106$, respectively. The fitted survival function and hazard rate function are then

$$\hat S(x)=1-\left(1-e^{-0.106x}\right)^{0.769},\quad x\ge 0, \tag{7}$$

$$\hat h(x)=\frac{0.0815\left(1-e^{-0.106x}\right)^{-0.231}e^{-0.106x}}{1-\left(1-e^{-0.106x}\right)^{0.769}},\quad x\ge 0. \tag{8}$$

Since the shape parameter is estimated to be $\hat\alpha=0.769<1$, the hazard rate function is decreasing. Fig.1 shows that the hazard rate function is monotonically decreasing, with relatively large values in the first two years (0.1501 in the 1st year and 0.1341 in the 2nd year); it decreases slowly from the 3rd year to the 10th year and very slowly from the 10th year to the 30th year, remaining at about 0.106–0.115. According to the fitted models (7) and (8), the average life expectancy is 7.9264 (years) and the 5-year survival rate is 0.4953 (which is close to the analysis in [6]). Another important life indicator is the average remaining life, which at time $t$ is given by

$$\mu(t)=\frac{1}{S(t)}\int_{t}^{\infty}S(x)\,dx.$$

This gives an average remaining life expectancy of 8.4710 (years) at 1 year, 8.9964 (years) at 5 years, and 9.2205 (years) at 10 years. Again, it can be concluded that patients who have been alive for several years have a longer average remaining life expectancy than those who have just been diagnosed, regardless of age, sex, or race. Of course, with the continuous improvements in modern medical technology and the gradual improvement in the effectiveness of the drugs used to treat angina, the survival rate and the average remaining life expectancy of patients with angina have greatly improved.
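
The life indicators quoted in this section can be re-derived numerically from the rounded estimates 0.769 and 0.106 with the minimal sketch below. It assumes the model's survival form $S(x)=1-(1-e^{-\lambda x})^{\alpha}$ and integrates it with a composite Simpson rule truncated at a large upper limit; the function names, the truncation point and the step count are illustrative choices.

```python
import math

ALPHA_HAT, LAMBDA_HAT = 0.769, 0.106  # rounded estimates quoted in the text

def survival(x, alpha=ALPHA_HAT, lam=LAMBDA_HAT):
    """Fitted survival function 1 - (1 - e^{-lam x})^alpha."""
    return 1.0 - (-math.expm1(-lam * x)) ** alpha

def mean_residual_life(t, upper=500.0, steps=100_000):
    """mu(t) = (1 / S(t)) * integral of S from t to infinity, with the
    integral truncated at `upper` and computed by composite Simpson
    (steps must be even)."""
    h = (upper - t) / steps
    total = survival(t) + survival(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * survival(t + k * h)
    return total * h / 3.0 / survival(t)

if __name__ == "__main__":
    print("5-year survival:", round(survival(5.0), 4))
    for t in (0.0, 1.0, 5.0, 10.0):
        print("mu(%g) =" % t, round(mean_residual_life(t), 4))
```

Because the inputs are rounded to three decimals, the printed values should agree with the figures reported in the text only approximately, not digit for digit.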

References

[1] Chen D G, Lio Y L. Parameter estimations for generalized exponential distribution under progressive type-I interval censoring. Comput Stat Data Anal 2010; 54(6): 1581–1591

[2] Gupta R D, Kundu D. Generalized exponential distributions. Austr New Zealand J Statist 1999; 41(2): 173–188

[3] Gupta R D, Kundu D. Generalized exponential distribution: existing results and some recent developments. J Statist Plann Inference 2007; 137(11): 3537–3547

[4] Gupta R D, Kundu D. Generalized exponential distribution: Bayesian estimations. Comput Statist Data Anal 2008; 52(4): 1873–1883

[5] Kundu D, Pradhan B. Estimating the parameters of the generalized exponential distribution in presence of hybrid censoring. Commun Stat Theory Methods 2009; 38(12): 2030–2041

[6] Lee E T, Wang J W. Statistical Methods for Survival Data Analysis, 3rd ed. New York: John Wiley & Sons, 2003

[7] Liu L P. Estimation of MLE for Weibull distribution with grouped and censored data. Chinese Journal of Applied Probability and Statistics 2001; 17(2): 133–138

[8] Liu X, Chen H, Fei H L. Estimation of the parameters in the lognormal distribution with grouped and right-censored data. Chinese Journal of Applied Probability and Statistics 2008; 24(4): 371–380

[9] Pettitt A N. Re-weighted least squares estimation with censored and grouped data: an application of the EM algorithm. Royal Statistical Society 1985; 47(2): 253–260

[10] Raqab M Z. Inferences for generalized exponential distribution based on record statistics. J Statist Plann Inference 2002; 104(2): 339–350

[11] Raqab M Z, Madi M T. Bayesian inference for the generalized exponential distribution. J Statist Comput Simul 2005; 75(10): 841–852

[12] Sarhan A M. Analysis of incomplete, censored data in competing risks models with generalized exponential distribution. IEEE Trans Reliability 2007; 56(1): 132–138

RIGHTS & PERMISSIONS

Higher Education Press 2023
