Template for modules of the revised handbook



Method: Donor imputation

0. General information

0.1 Module code
Method – donor imputation

0.2 Version history
Version | Date       | Description of changes | Author         | Institute
1.0     | 02-12-2011 | First version          | Andrzej Młodak | GUS (PL)
2.0     | 15-02-2012 | Revised version        | Andrzej Młodak | GUS (PL)
3.0     | 20-06-2012 | Third version          | Andrzej Młodak | GUS (PL)

0.3 Template version and print date
Template version used: 1.0 p 3 d.d. 28-6-2011
Print date: 20-6-2012 11:14

General description – Method: Donor imputation

1. Summary

The module presents various methods of donor imputation. This approach uses data coming from specially selected response records (called donors) to fill gaps in other, incomplete records (called recipients). A donor can be selected by means of various methods. We describe random, sequential and several deterministic approaches in this context (belonging to the class of hot–deck imputation, which uses currently processed data), as well as cold–deck imputation, where older, external data are used.

2. General description

Donor imputation, described here, is one of the two main types of imputation methods (the other being model–based imputation, described in a separate theme module with that title). Of course, it is also possible to use various combinations of methods belonging to both classes in order to improve the quality of implants and, consequently, of inferences about population parameters. Donor imputation involves replacing missing values of one or more variables for non–responding units (called recipients in this context) with observed values taken from specially chosen other units (called donors), for which such values are available.
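To fix the terminology before the particular methods are discussed, the following minimal sketch (hypothetical toy data, not taken from the handbook) fills the gap of one recipient record with the observed value of a chosen donor record. This and all later code sketches in this module use plain base R and are illustrative only.

# Minimal illustration of donor imputation (hypothetical toy data).
# 'turnover' is the target variable; record 3 is a recipient (missing value),
# the remaining records are potential donors.
survey <- data.frame(
  id       = 1:5,
  size     = c(12, 45, 40, 9, 120),   # auxiliary variable, fully observed
  turnover = c(1.1, 4.8, NA, 0.9, 13.5)
)

recipients <- which(is.na(survey$turnover))
donors     <- which(!is.na(survey$turnover))

# Here the donor is simply the record with the most similar 'size';
# the sections below discuss random, sequential and distance-based choices.
for (r in recipients) {
  d <- donors[which.min(abs(survey$size[donors] - survey$size[r]))]
  survey$turnover[r] <- survey$turnover[d]
}
survey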
Donor imputation methods are divided into hot–deck imputation and cold–deck imputation. As R. R. Andridge and R. J. A. Little (2010) point out, the term “hot deck” comes from the use of computer punch cards for data storage. In the context of imputation, it refers to a deck of donors available for non–respondents. The deck is “hot” when it is currently being processed, as opposed to being “cold”, when previously processed data are used as donors, i.e. data coming from an earlier data collection, an earlier period or a different data set.

Hot–deck imputation

As indicated above, hot–deck imputation relies on currently collected and verified data to impute missing values of given variables. There are three main types of hot–deck imputation. The first one is called random hot–deck: for each recipient, its donor is selected randomly from a set of potential donors. In sequential hot–deck the donor is chosen according to an ordering of the data set with respect to a given auxiliary variable. In the third approach, called the nearest neighbour or deterministic hot–deck method, the choice of a donor for a given recipient is optimized using additional algorithms, criteria or auxiliary variables with complete data.

Random hot–deck

To compensate for item non–response, random hot–deck imputation procedures replace missing values with values occurring in the sample (or population), chosen at random. The simplest solution is a completely random choice of a donor whose data will be used for imputation for a given recipient (an example of this solution in the context of statistical matching can be found in M. D’Orazio et al. (2006)). This approach is fastest but usually seriously biased: if the number of data gaps is relatively large or the distribution of the analyzed variable is significantly asymmetric, random hot–deck imputation may generate many overly smoothed implants or, conversely, a few outlying ones. That is, the distribution of implants can be too even, or very few strong outliers can occur. In both cases the estimation of population parameters will be strongly biased (in the former case there is a danger of underestimation, in the latter of overestimation of the mean, for example). Slightly better results can be obtained when donors are sampled with probability proportional to their share in the total, i.e. if $y_1, y_2, \ldots, y_n$ ($n \in \mathbb{N}$, where $\mathbb{N}$ is the set of natural numbers) are the values of the variable $Y$ for the donors, then $y_i$ will be chosen with probability

$$P(y_i \text{ is chosen}) = \frac{|\{j : y_j \ge y_i\}|}{\sum_{j=1}^{n} |\{l : y_l \ge y_j\}|}, \qquad i = 1, 2, \ldots, n,$$

where $|A|$ denotes the number of elements of a given set $A$. For example, if we have the following y–observations: 2, 3, 1, 5, 4, 50, then the denominator equals 21, and the number 50 will be chosen with probability 1/21, whereas 2 with probability 5/21 and 5 with probability 2/21. Thus, extreme y–values are selected with the smallest probability and hence the possible bias (overestimation) is to a large extent eliminated.

A commonly used improvement of random hot–deck is a preliminary division of the set of donors into disjoint subsets called imputation cells. This division is made on the basis of some fundamental features of the analyzed units, which may be collected from external sources. For example, if we are going to study economic entities, imputation cells may be constructed by categorizing them by size (expressed, say, by the number of employees). Given several classes, we can make a pre–choice of the sampling frame by establishing the size class which a given non–responding unit belongs to (a small sketch of such a cell–based random selection is given below).
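A minimal base–R sketch of this kind of selection (hypothetical toy data and cells; the selection probability follows the count-based formula above):

# Random hot-deck within imputation cells, with selection probability
# proportional to the number of donor values >= the candidate value,
# so that large (outlying) values are drawn with the smallest probability.
set.seed(1)
y    <- c(2, 3, 1, 5, 4, 1, 50, NA, NA)   # target variable with two gaps
cell <- c(1, 1, 1, 2, 2, 1, 2, 1, 2)      # imputation cells from an auxiliary variable

recipients <- which(is.na(y))
for (r in recipients) {
  pool <- which(!is.na(y) & cell == cell[r])            # donors in the recipient's cell
  w    <- sapply(y[pool], function(v) sum(y[pool] >= v))
  pick <- pool[sample.int(length(pool), 1, prob = w)]   # draw with probability proportional to w
  y[r] <- y[pick]
}
y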
Implants for this class will come from a donor drawn randomly from this cell. R. R. Andridge and R. J. A. Little (2010) analyze the impact of the variable used to create imputation cells depending on whether it is associated with the imputed variable or with the binary variable indicating whether the imputed variable is missing or not. They conclude that, in order to observe a reduction in the bias of the population mean of the imputed variable, the variable defining the imputation cells must be associated with both of the above aspects. The variance of the mean decreases significantly in this case.

G. Chauvet et al. (2011) believe that the random imputation method has an important advantage. Namely, while nearest neighbour imputation methods (described in detail in the next section) usually lead to asymptotically unbiased estimators of population totals if the underlying imputation or non–response model is correctly specified, they are not appropriate when the objective is to estimate a quantile (e.g. a median). This is an important problem, because an estimation of order statistics rather than of the classical mean is often preferred (e.g. to reduce the impact of outliers). The drawback is due to the fact that imputation methods of this type tend to distort the distribution of the variables to be imputed. As a result, estimators of quantiles can be severely biased, especially if the non–response rate is appreciable. To preserve the distribution, it is customary to use a random imputation method. On the other hand, due to the random selection of residuals, random imputation usually generates an additional amount of variability (called imputation variance). To overcome this inconvenience, the authors of the above paper introduce the balanced random imputation method. Their idea is based on a functional generalization of imputation (including donor imputation) using a function of available auxiliary variables and their parameters, supplemented by the product of a variance factor and a residual factor. In balanced imputation, the structural parameters and the variance are replaced with their estimators computed from the data available for Y, random quantities are generated from the empirical distribution function of residuals, and the balanced random implants are defined by the main model with these estimators and the values of the auxiliary variables for a given recipient. The authors also analyze the properties of the resulting imputed estimators of structural and population parameters. These estimators preserve the distribution of the variable to be imputed and are almost as efficient as the corresponding deterministic imputation methods. The study of variance for these methods is complex and in many respects still poses a research challenge.

Summarizing the presentation of random hot–deck imputation methods, we can say that, by introducing additional assumptions and constraints, one can improve the primarily strictly random choice of implants so as to obtain unbiased and efficient estimators of population parameters. The use of this type of imputation is recommended especially if the variables are categorical (nominal or ordinal), and especially polychotomous.

Sequential hot–deck imputation

A. Israëls et al. (2011) also identify the category of sequential hot–deck imputation.
According to their classification, the main difference is the following. In random hot–deck imputation we usually use imputation classes, formed on the basis of categorical auxiliary variables, and for a given recipient the implants are randomly chosen from a group of potential donors with the same characteristics as the recipient. In sequential hot–deck imputation, groups are not actively formed; instead, for each missing item, the value of the target variable is imputed from the next record in the data file with the same values on certain background characteristics expressed by relevant categorical variables. The authors of the above paper argue that for very large files to which hot–deck imputation is applied, the sequential hot–deck method is sometimes used for practical reasons: in their opinion, the processing time would otherwise increase substantially, while the quality of imputation would not change significantly. To obtain a random donor, the records would first have to be placed in a random order in the file, but given a random selection mechanism this is no longer necessary. On the other hand, if there are more categorical auxiliary variables, it would be better – to improve the quality of such imputation – to introduce an arbitrary order within each imputation cell. Implants assigned by sequential hot–deck methods will then be much more reliable, but the implementation of a relevant sorting procedure will require more time.

Nearest neighbour imputation (deterministic hot–deck)

In this type of imputation, which A. Israëls et al. (2011) call nearest neighbour imputation (NNI), no randomness is used to select a donor for a recipient. Instead, a record in the same file with similar values of key auxiliary variables is found: for example, we look for a person of a similar age or an enterprise with a similar number of employees. The idea is that the more correspondence there is between the values of important auxiliary variables of two objects, the more reliable are the values of the variable to be imputed. The main difference between random (or sequential) and deterministic hot–deck is that in the former methods donors must have exactly the same values of key categorical variables, while in NNI only some similarity is expected and other measurement scales of such variables (interval or ratio) are also welcome. In the NNI approach, no imputation classes are formed, and some discrepancy in terms of auxiliary variables between recipients and their donors is allowed. The best choice of a donor then depends on the rank of the auxiliary variable or on the distance between records computed using auxiliary information. If nearest neighbour imputation is based on the distance between records, it is sometimes also called distance hot–deck imputation. The general idea is to find a donor which is closest to a given recipient in terms of the values of a collection of auxiliary variables $X_1, X_2, \ldots, X_m$ with fully available data, using a distance between the points of a multivariate space which represent the given records. This distance is usually measured by a specially established formula. When defining the distance between objects, we have to remember that it should satisfy the requirements of a topological metric, i.e. as a function $d: U \times U \to \mathbb{R}$ it should have the reflexivity, symmetry and triangle inequality properties.

Let D denote the set of donors and B the set of recipients.
Let us assume that $X_1, X_2, \ldots, X_m$ ($\mathbf{X} = [x_{ij}]$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$) are the data for the auxiliary variables. Having defined the distance, we can determine for a given recipient $h \in B$ its nearest donor $i^* \in D$, for which the assumed distance in terms of the auxiliary variables is minimized, i.e.

$$i^* = \arg\min_{i \in D} d_{hi}. \qquad (1)$$

Of course, optimum donors can be determined ambiguously. This may occur if there is more than one record satisfying the right–hand side of (1), or if we fix a tolerance value for the minimization, i.e. we look for donors $i \in D$ such that $d_{hi} < \varepsilon$ for a small $\varepsilon > 0$. Then we have a set of possible donors for such a recipient. In this case, we can choose the final donor by analyzing the similarity of the values of particular auxiliary variables between the donors and the recipient according to an arbitrarily established hierarchy of importance, starting from the most important one. A completely random choice of a donor from this selected set is also possible, but this solution seems to be rather ineffective and should therefore be treated as a last resort.

The most popular (from the practical point of view) metric is the Minkowski metric given as

$$d_{hi} = \left(\sum_{j=1}^{m} |x_{hj} - x_{ij}|^p\right)^{1/p}, \qquad (2)$$

where $p$ is any natural number. If $p = 1$, then the metric (2) is called the urban metric (alternatively: Hamming distance, city block, or Manhattan metric). Its ‘urbanistic’ attributes were added as a result of associating the two–dimensional form of the formula (2) with the rectangular arrangement of streets in New York: in this case the distance between two points on a plane is the sum of the absolute differences of the relevant coordinates, i.e. the sum of the lengths of the two sides of a right–angled triangle (parallel to the OX and OY axes) whose hypotenuse connects the two points. If $p = 2$, then (2) is the classical Euclidean distance. If $p$ tends to infinity, the Minkowski metric yields the maximum deviation. The Minkowski metric is sometimes used in an averaged version, i.e.

$$d_{hi} = \left(\frac{1}{m}\sum_{j=1}^{m} |x_{hj} - x_{ij}|^p\right)^{1/p}. \qquad (3)$$

If $p = 1$, this metric is called the Czekanowski metric, and if $p = 2$, it is called the average distance. The metrics of the Minkowski class are sensitive to outliers, and therefore in some practical situations it is also worth looking for other possibilities in this context.

Weighting/normalization. It is worth noting that, to standardize the scale of measurement (and therefore some important basic characteristics), a normalization of the analyzed auxiliary variables is recommended. The general formula for such a normalization is

$$z_{ij} = \frac{x_{ij} - a_j}{b_j},$$

where $a_j$ and $b_j$ are constant parameters related to $X_j$, for $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$. If the formula is aimed at transforming a given variable into a variable with zero mean (or median) and standard deviation (or median absolute deviation) equal to one – i.e. if we take $a_j = \bar{x}_j$ and $b_j = s_j$, or $a_j = \mathrm{med}(X_j)$ and $b_j = \mathrm{mad}(X_j)$, respectively – it is called standardization. If we would like the range to be equal to one, we usually take $a_j = \min_{i=1,2,\ldots,n} x_{ij}$ and $b_j = \max_{i=1,2,\ldots,n} x_{ij} - \min_{i=1,2,\ldots,n} x_{ij}$, $j = 1, 2, \ldots, m$, and this operation is called unitarization.
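As a small base–R sketch of the nearest–neighbour search defined by (1), using unitarized auxiliary variables and the averaged city–block (Czekanowski) metric, i.e. (3) with p = 1; the data are hypothetical:

# Nearest-neighbour (deterministic hot-deck) donor search on hypothetical data.
set.seed(2)
X <- matrix(rnorm(20 * 3), ncol = 3)           # auxiliary variables X1..X3, complete
y <- c(rnorm(15), rep(NA, 5))                  # target variable, 5 recipients

# Unitarization: shift to the minimum and divide by the range of each column.
Z <- apply(X, 2, function(x) (x - min(x)) / (max(x) - min(x)))

donors     <- which(!is.na(y))
recipients <- which(is.na(y))

czekanowski <- function(a, b) mean(abs(a - b))  # averaged city-block metric

for (h in recipients) {
  d    <- sapply(donors, function(i) czekanowski(Z[h, ], Z[i, ]))
  y[h] <- y[donors[which.min(d)]]               # donor minimizing the distance, as in (1)
}
y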
If $a_j = 0$ and $b_j$ is the sum, mean, minimum, maximum or median of the values of the j–th variable, the sum of their squares or the square root of that sum, then the normalization is called a quotient transformation. Instead of classical statistics, one can also use much more efficient formulas based on a single point of the multivariate space (i.e. $\mathbb{R}^m$) in which each unit can be represented: e.g. $a_j$ can be the j–th coordinate of the Weber median and $b_j$ the median absolute deviation from the relevant coordinate of the Weber median, multiplied by 1.4826. This variant enables us to exploit all (even hidden) connections between the analyzed variables (cf. A. Młodak (2006a, 2006b) and A. Zeliaś (2002)). Normalization usually does not affect the distance value (this is one of the basic conditions of its correctness), but it is useful for purposes of analyzing the final results and constructing implants or complex measures.

On the other hand, in some situations a weighting of the auxiliary variables can also be desirable. The term “weighting” can be understood in two ways: as assigning weights to units or to variables. In the former case, the weights can be defined as the reciprocals of the probabilities of inclusion of a given unit in the sample (assuming that some units have a larger probability of being drawn than others) or as numerical corrections for the non–response problem. However, the authors of the theme module on imputation are right to point out that such weighting has practically no influence on hot–deck imputation, but only on the quality of model–based imputation and estimation. Another option is the weighting of the X–variables. This can have an impact on donor imputation, especially in the case of the nearest neighbour method, because the values and structure of the weights affect the values of the distances between records. For example, the weighted Minkowski distance has the form

$$d_{hi}(w) = \left(\sum_{j=1}^{m} w_j |x_{hj} - x_{ij}|^p\right)^{1/p},$$

where $w_j$ is the weight associated with $X_j$, under the assumptions that $w_j \in [0, 1]$, $j = 1, 2, \ldots, m$, and $\sum_{j=1}^{m} w_j = 1$. The weights of the variables can be defined in various ways. The most popular solution in this context is to define the weights according to the taxonomical importance of the variables, expressed as a share in the total sum of the coefficients of variation of all auxiliary variables (i.e. $w_j = V_j / \sum_{j=1}^{m} V_j$, where $V_j$ is the coefficient of variation of $X_j$, $j = 1, 2, \ldots, m$). Another option, suggested by Ø. Langsrud, is to define the distance taking into account a hierarchy of importance established in a more subjective way. It is also a good idea, but it depends on how ‘importance’ is defined: it may be understood in terms of a share in the total variation, or as an ordering of the variables according to the subjective opinions of experts, expressed by relevant ratings, with the mean of such assessments being taken.

R. R. Andridge and R. J. A. Little (2010) have formulated several other suggestions concerning possible distance measures. Namely, they propose using the maximum deviation,

$$d_{hi} = \max_{j=1,2,\ldots,m} |x_{hj} - x_{ij}|,$$

the Mahalanobis distance,

$$d_{hi} = (x_h - x_i)^{T} \hat{\Sigma}_X^{-1} (x_h - x_i),$$

where $\hat{\Sigma}_X$ is an estimate of the covariance matrix of $X_1, X_2, \ldots, X_m$ ($\mathbf{X} = [x_{ij}]$, $i \in D$, $j = 1, 2, \ldots, m$), or the predictive mean,

$$d_{hi} = \left(\hat{Y}(x_h) - \hat{Y}(x_i)\right)^2,$$

where $\hat{Y}(x_h) = x_h^{T} \hat{\beta}$ is the predicted value of $Y$ for $h \in B$ from the regression of $Y$ on $\mathbf{X}$ using the available data, and $\hat{Y}(x_h) = y_h$ if $h \in D$.

If there is more than one variable to be imputed, i.e.
$Y_1, Y_2, \ldots, Y_p$ for some $p \in \mathbb{N}$, then two separate cases are possible. In the first one, the data gaps occur for the same set of records. Of course, one can perform imputation separately for each imputed variable. The drawback of this approach is that the mutual connections between $Y_1, Y_2, \ldots, Y_p$ will be neither preserved nor exploited. For this reason, D. A. Marker and D. R. Judkins (2002) have proposed the single partition hot deck method, leading to the creation of a donor pool for a given recipient using the multivariate counterpart of the predictive mean metric of the form

$$d_{hi} = \left(\hat{Y}(x_h) - \hat{Y}(x_i)\right)^{T} \hat{\Sigma}_{Y}^{-1}(x_i) \left(\hat{Y}(x_h) - \hat{Y}(x_i)\right),$$

where $\hat{\Sigma}_{Y}(x_i)$ is the estimated residual covariance matrix of $Y = (Y_1, Y_2, \ldots, Y_p)$ given $x_i$, $i = 1, 2, \ldots, n$. Another approach in this context is to create a donor pool for $Y_j$ using the previously imputed values of $Y_1, Y_2, \ldots, Y_{j-1}$ and a metric which can connect the value of $Y_j$ with $\mathbf{X}$ and $Y_1, Y_2, \ldots, Y_{j-1}$, $j = 1, 2, \ldots, p$. The proper choice of the metric has an impact on the degree of association with the auxiliary variables and the relevant values of the Y’s, and therefore on the quality of the final results. If $Y_j$ can be observed only whenever $Y_1, Y_2, \ldots, Y_{j-1}$ are observed, the above method should be used sequentially. The second case occurs if the distribution of gaps differs between $Y_1, Y_2, \ldots, Y_p$. Then we fill them using a simple method (separately for each variable); in the next steps, partitions based on the best set of adjustment variables for each record are defined and a relevant re–imputation is performed. The procedure is iterative and stops when the required level of convergence is achieved. Alternatively, instead of relying on such partitions, we can use a hierarchy of usefulness for imputation, i.e. the value of an item higher up in the hierarchy can be used for the imputation of the value of a lower item. Another possibility is to draw residuals of an appropriate regression model fitted to the fully available data (cf. R. R. Andridge and R. J. A. Little (2010)).

The second problem is what to do if an auxiliary variable is categorical, i.e. its values are measured on a nominal or ordinal scale. In this situation such arithmetic operations as addition, division or averaging are not allowed; we can only compare values directly and infer whether they are equal or not, while other variables may be measured on other scales. A formula which takes the character of the variables into account is the Gower distance (cf. J. C. Gower and P. Legendre (1986)). It is defined as

$$d_{hi} = 1 - \delta_{hi}, \qquad \text{where} \qquad \delta_{hi} = \frac{\sum_{j=1}^{m} w_j \rho_{hij}}{\sum_{j=1}^{m} w_j},$$

$w_j$ is the weight associated with the variable $X_j$, and $\rho_{hij}$ denotes the Gower similarity measure established as follows:

– if $X_j$ is nominal, then

$$\rho_{hij} = \begin{cases} 1 & \text{if } x_{hj} = x_{ij}, \\ 0 & \text{if } x_{hj} \ne x_{ij}, \end{cases}$$

– if $X_j$ is ordinal, interval or ratio, then

$$\rho_{hij} = 1 - \frac{|x_{hj} - x_{ij}|}{R_j},$$

where $R_j$ is the range of $X_j$, for every $j = 1, 2, \ldots, m$. In this way, each variable is treated according to its measurement scale. Moreover, the obtained measure reflects a practical sense of distance and can be easily interpreted. However, usually no special weighting is introduced, i.e. all weights are assumed equal to 1 ($w_j = 1$ for every $j = 1, 2, \ldots, m$). The above description follows the form preferred in the SAS software; in the StatMatch package of the R software, the contribution for interval or ratio variables is defined as $\rho_{hij} = 1 - |x_{hj} - x_{ij}|/R_j$, where $R_j$ is the range of $X_j$ (if $X_j$ is normalized, the use of the range seems to be unnecessary).

The next method is called rank hot–deck and is described e.g.
by M. D’Orazio et al. (2006) in terms of its usefulness for the statistical matching of two data sets with categorical variables. The categorical character of the analyzed variable is most often the best motivation to use it. Let X be the auxiliary variable and Y the variable to be imputed for the set of recipients B. Keeping the notation adopted earlier, we use the empirical distribution function of X for a set $A \subseteq U$, given as

$$F_X^{A}(x) = \frac{1}{|A|} \sum_{i \in A} I_{(-\infty, x]}(x_i) \qquad (4)$$

for every $x \in \mathbb{R}$ and $A = B, D$ ($\mathbb{R}$ is the set of real numbers, $|A|$ denotes the number of elements of $A$, and $I_A(\cdot)$ denotes the indicator of $A$, i.e. $I_A(x) = 1$ if $x \in A$ and $I_A(x) = 0$ otherwise). That is, the formula (4) reflects the fraction of observations of X not larger than x. For each recipient record, the closest donor is chosen by considering the distance between the values of the empirical cumulative distribution functions. In other words, the method finds for each record in the recipient data set a donor record which has the closest distance to the given recipient in terms of the empirical cumulative distribution, computed by considering the estimated empirical cumulative distribution of the reference variable in the sets of recipients and donors (for more details see the description of the StatMatch package in the manual of the R software or M. D’Orazio et al. (2006)). Hence, record $i \in B$ is associated with the record $i^* \in D$ for which the value of the empirical distribution function is the closest, i.e. for which

$$\left|F_X^{B}(x_i) - F_X^{D}(x_{i^*})\right| = \min_{k \in D} \left|F_X^{B}(x_i) - F_X^{D}(x_k)\right|.$$

The value of Y for the record $i^*$ is therefore the implant for the record i. The StatMatch package of the R software describes the method as follows: “For each recipient record the closest donors are chosen by considering the distance among the percentage points of the empirical cumulative distribution function.” This description may be slightly misleading because, in general, percentage points are only units of measurement; the general idea is to minimize the absolute difference between the realizations of the cumulative distribution functions, and the question of how this difference is expressed is a secondary problem.

We now present some examples of other interesting metrics which can be used in the nearest neighbour method to define the distance between records. The first of them is the river metric. Its name is derived from practical analogies of its construction on a plane. The distance between two points is defined to be zero if these points coincide. Otherwise, it is assumed to be the sum of the distances of these points from the horizontal axis (i.e. the absolute values of their second coordinates) and the absolute value of the difference of their first coordinates. The OX axis is then a ‘river’, which is the only way one can take to get from one point to the other.

Fig. 1. The river metric

In the multivariate case it is defined recursively and depends on the definition of the distances from hyperplanes in each step of the recursion (cf. A. Młodak (2006a)). In the simplest case we have

$$d_{hi} = |x_{h1} - x_{i1}| + \sum_{j=2}^{m} \left(|x_{hj}| + |x_{ij}|\right).$$

This method has the advantage of being – depending on the choice of the indirect metric – a conglomerate of various measurement algorithms, so one can adjust it to current needs. The river metric can be used especially if we prefer to analyse the impact of partial differences between variables on the final results of imputation.

Of course, it is possible to use non–metric methods of distance measurement.
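Returning to the rank hot–deck method described above, a minimal base–R sketch of the matching on empirical distribution functions defined in (4) could look as follows (hypothetical data):

# Rank hot-deck: recipients and donors are matched on the values of the
# empirical distribution function of the auxiliary variable X.
set.seed(3)
xD <- rexp(30)                          # donor set D: X observed ...
yD <- 2 * xD + rnorm(30, sd = 0.2)      # ... together with the target variable Y
xB <- rexp(10)                          # recipient set B: only X observed

FD <- ecdf(xD)                          # empirical distribution function on D
FB <- ecdf(xB)                          # empirical distribution function on B

# For each recipient, take the donor whose ecdf value is closest.
yB <- sapply(xB, function(x) yD[which.min(abs(FB(x) - FD(xD)))])
yB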
Such non–metric formulas must satisfy the conditions of reflexivity, symmetry and non–negativity. A good example of such a formula is the median distance, defined as the median of the absolute partial distances:

$$d_{hi} = \mathrm{med}_{j=1,2,\ldots,m} |x_{hj} - x_{ij}|.$$

Its main advantage is its robustness to outliers. Another interesting proposal is the so–called Walesiak distance, also known as the Generalized Distance Measure (GDM; K. Jajuga and M. Walesiak (2003)), defined as

$$d_{hi} = \frac{1}{2} - \frac{\sum_{j=1}^{m} f(x_{hj}, x_{ij}) f(x_{ij}, x_{hj}) + \sum_{j=1}^{m} \sum_{l=1,\; l \ne h, i}^{m} f(x_{hj}, x_{lj}) f(x_{ij}, x_{lj})}{2 \left[\sum_{j=1}^{m} \sum_{l=1}^{m} f(x_{hj}, x_{lj})^2 \cdot \sum_{j=1}^{m} \sum_{l=1}^{m} f(x_{ij}, x_{lj})^2\right]^{1/2}}. \qquad (5)$$

An important advantage of this measure is its universality with respect to various scales of measurement. For variables measured on the ratio or interval scale the following formula is applied:

$$f(x_{aj}, x_{bj}) = x_{aj} - x_{bj},$$

whereas for ordinal variables

$$f(x_{aj}, x_{bj}) = \begin{cases} 1 & \text{if } x_{aj} > x_{bj}, \\ 0 & \text{if } x_{aj} = x_{bj}, \\ -1 & \text{if } x_{aj} < x_{bj}. \end{cases}$$

This measure takes values from [0, 1]. It is equal to zero if the analyzed objects are identical and to 1 when they are extremely dissimilar. If all variables are ordinal, then the value 1 can also be obtained if only one type of order relation holds, i.e. if the values for object a are smaller than (larger than, equal to) those for b in the case of all variables. Another advantage of this approach is that the formula (5) is invariant under transformations of the variables by linear functions (and, on the ordinal scale, under any strictly increasing transformations). This construction has, however, one additional tacit assumption: in the set of analyzed objects there exists at least one pair of objects with different observed values (to avoid a zero value of the denominator in (5)). The Walesiak GDM is derived from the general correlation coefficient between objects and can therefore be used when we would like to assess the degree of similarity between objects and exploit all their statistical interconnections. It is implemented in the clusterSim package of the R software.

If for every $j = 1, 2, \ldots, m$ we have defined imputation cells, and $C_j^h$ denotes the cell with respect to the variable $X_j$ which the record h belongs to, then R. R. Andridge and R. J. A. Little (2010) also suggest defining

$$d_{hi} = \sum_{j=1}^{m} \lambda_{ihj}, \qquad \text{where} \qquad \lambda_{ihj} = \begin{cases} 0 & \text{if } h \in C_j^i, \\ 1 & \text{if } h \notin C_j^i, \end{cases}$$

for every $j = 1, 2, \ldots, m$.

The Walesiak (GDM) and Gower formulas are recommended when the character of the auxiliary variables depends on the measurement scale. The Mahalanobis distance is useful in the case of clearly correlated auxiliary variables, but it can be sensitive to variables with little predictive power; to avoid this inconvenience, it may be better to use the predictive mean formula.

There are many special treatments of hot–deck imputation which can improve the quality of implants and, consequently, the quality of estimates of population parameters. The first of them can be called the ratio nearest neighbour (RNNI) method (cf. R. R. Andridge and R. J. A. Little (2010)). Sometimes it is better to impute log Y and then exponentiate the obtained value; this option can be useful if we use the relevant special case of the general imputation model formulated by G. Chauvet et al. (2011) related to the nearest neighbour method, where the modelling of some parameters is required and it can be better to use log Y instead of Y. The ratio approach refers also to a situation when the target variable Y is strongly correlated with an auxiliary variable Z measuring e.g. the unit size. In this case, it may be reasonable to treat the missing variable as the ratio S = Y/Z, impute $s_j = s_i$ from the donor $i \in D$ and hence take $y_j = z_j s_j$ ($j \in B$).
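A minimal base–R sketch of this ratio variant (hypothetical data; for brevity the nearest donor is found on a single auxiliary variable):

# Ratio nearest neighbour (RNNI): the target Y is strongly related to a size
# variable Z, so the donor's ratio s = y/z is imputed and rescaled by the
# recipient's own z.
set.seed(4)
z <- runif(12, 10, 100)            # size variable, fully observed
x <- z + rnorm(12, sd = 5)         # auxiliary variable used for the distance
y <- 3 * z + rnorm(12, sd = 10)    # target variable
y[c(3, 8)] <- NA                   # two recipients

donors     <- which(!is.na(y))
recipients <- which(is.na(y))

for (h in recipients) {
  i    <- donors[which.min(abs(x[donors] - x[h]))]   # nearest donor on x
  y[h] <- z[h] * (y[i] / z[i])                       # y_h = z_h * s_i
}
y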
The ratio procedure will be useful if Z is not included in the set of auxiliary variables or if it is used only for a crude categorization of the population. Among other improved procedures, one can also mention the weighted sequential hot–deck procedure, which can be applied when the weights are connected with the variable to be imputed and multiple use of the same donor for various recipients is allowed. To limit the increase in imputation variance, it is recommended that respondents and non–respondents be divided into two subsets and sorted. Next, the sample weights are rescaled to add up to the total of the weights of the respondents. Thus, the set of potential donors is determined by the sort order and the weight adjustment of both types of records. Consequently, the weighted mean obtained from imputation should be equal, in terms of the expected value, to the population mean (R. L. Williams and R. E. Folsom (1981)).

A. Israëls et al. (2011) remark that nearest neighbour imputation is used especially for imputation with the help of auxiliary variables when dividing the records into classes based on these variables would imply a serious loss of information. The fact that two records belong to the same, relatively large class is not necessarily equivalent to their being essentially similar (e.g. one of them can be very close to the lower limit of the class and the second to the upper one). Because in the NNI method the distance function between the potential donor and the recipient is minimized, it is essential that the importance of every variable can be controlled by specially adjusted weights. Hot–deck imputation can be conducted using the StatMatch package of the R software (M. D’Orazio et al. (2006)) or using specially constructed algorithms in the SAS Enterprise Guide 4.2 environment.

Cold–deck imputation

As opposed to the hot–deck approach, this type of imputation uses a set of data not coming directly from the currently processed survey to impute the missing values. Such data sets may be derived from a previous survey or from another source (e.g. an administrative database). Cold–deck imputation is commonly used for data sets containing time series. The practical use of this method may be motivated by the necessity to combine data from different sources in order to complete a time series for a particular variable of interest for which little information is available. The cold–deck method typically involves using data from other records, perhaps in the same industry, sometimes from earlier years, to impute missing data items. This approach is described by T. K. White and J. P. Reiter (2007), who studied how the U.S. Census Bureau implements the cold–deck method in the Census of Manufactures (CMF) and the Annual Survey of Manufactures (ASM). The Bureau uses a variety of methods to impute data in the ASM and CMF; based on a special analysis of the item–edit flags in the 2002 CMF and the 2003–2005 ASM, “cold deck” turns out to be one of the most common of them. In general, cold–deck imputation seems to be slightly better for estimating higher quantiles of the population. If we rely on old data, we can use – as auxiliary tools – some popular time series models, such as the Autoregressive Distributed Lag (ADL) or the Autoregressive Integrated Moving Average (ARIMA) process.
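A minimal base–R sketch of the simplest cold–deck variant, where gaps in the current period are filled with values of the same units taken from an earlier, external data set (hypothetical data):

# Cold-deck imputation: fill current-period gaps with values of the same
# units taken from a previous period (or another external source).
current  <- data.frame(id = 1:6, revenue = c(120, NA, 95, NA, 410, 73))
previous <- data.frame(id = 1:6, revenue = c(115, 230, 90, 61, 395, 70))

gaps <- which(is.na(current$revenue))
current$revenue[gaps] <- previous$revenue[match(current$id[gaps], previous$id)]
current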
For time–series modelling of this kind, the use of software like X–12 ARIMA is recommended.

Extensions of donor imputation

The first and natural extension of classical nearest neighbour imputation is the k–nearest neighbour imputation method (KNNI). According to this approach, we use similar tools as in the classical variant, but for a given recipient we find k (where k is a natural number greater than 1) donors which are consecutively closest to the recipient in terms of the variables for which data are complete. Next, the missing value is estimated as the average of the corresponding values in the selected k records. The KNNI method has no theoretical criteria for selecting the best k–value; k has to be determined empirically, on the basis of the intuition of the researcher or the experience of experts. There is also a weighted option of the KNNI approach, the so–called weighted k–nearest neighbour imputation (WKNNI). It is based on the main assumptions of KNNI, but the imputed value now takes into account the different distances to the neighbours, using a weighted mean or the most frequent value according to a distance–based measure of similarity. The reciprocal of the Euclidean distance calculated over the observed attributes is used here as the measure of similarity of records (of course, instead of the Euclidean formula, another distance function from the collection presented in this module can be chosen), and the missing value of the target variable is imputed as an average weighted by this similarity measure. The remark on the subjective character of the selection of the k parameter remains valid. For more details see e.g. J. Luengo et al. (2009).

In the case of sample surveys, J. K. Kim and W. A. Fuller (2004) analyze fractional hot–deck imputation, where each missing observation is replaced with a set of imputed values and a weight is assigned to each imputed value. This analysis refers to a model in which the observations that could be potential donors are independently and identically distributed. The same authors (W. A. Fuller and J. K. Kim (2005)) investigate another option – a model with a population divided into imputation cells, where the response probabilities are equal within each cell. In this much more general model, for each recipient a special factor (called the imputation fraction) is applied to the original weight of an element when it is used as a donor for a given unit; it reflects the fraction that a given donor donates to the value of the missing item. Thus, the implant is, in fact, defined as the weighted mean of the respondent values, weighted by the imputation fractions. W. A. Fuller and J. K. Kim (2005) define the fully efficient estimator of totals, apply it to the situation of fractional imputation – obtaining the fully efficient fractionally imputed estimator – and consider replicate variance estimation based on this estimator. On the other hand, given that in this model a large number of donors can be used for each recipient (and therefore the estimator could be inefficient in a subpopulation), and supposing that the same number of donors is assigned to each recipient, they suggest assigning donors to recipients so as to approximate the distribution of all respondents in the cell, and then constructing a procedure which adjusts the imputation fractions using a specially established categorization of Y within each cell. Interestingly, J. N. K. Rao and J.
Shao (1992) point out that the common practice of treating imputed values as if they were true values, and then computing variance estimates using standard formulae, can lead to a serious underestimation of the true variance when the proportion of missing values of an item is appreciable. Thus, they propose adjusted jackknife methods of imputation (their idea is to repeat estimation attempts by dropping some units from the main sample), where the first–phase units are selected using random sampling with replacement.

3. Example – not tool specific

The above methods of donor imputation were applied to the data presented in Table 4 of the module “Model based imputation”. By way of reminder, they concern public companies, either casinos or other companies conducting activities classified primarily as gaming, whose market capitalization is greater than $100 million and which were included in the Capital IQ Company Screening Report prepared by Aswath Damodaran, Professor of Finance at the Stern School of Business at New York University. To maintain consistency with the computations presented in that module, we assume that the variable to be imputed is the same as the one used previously (revenue) and that the randomly sampled gaps in the data remain unchanged. We conducted imputation trials using the following methods:

– Random hot–deck: an explanatory variable most correlated with revenue was identified (it was capital invested – Pearson’s coefficient amounted to 0.9570; this result may be slightly surprising, e.g. in the context of pre–tax margin, which is ‘by definition’ dependent on revenue) and the records were divided according to its values into four classes (starting from the minimum and ending at the maximum, with this interval divided into four equal sub–intervals). Next, for each recipient a donor was sampled from the class which the recipient belongs to;
– Sequential hot–deck: performed similarly to random hot–deck, with one difference: for a given recipient, the donor from the class which the recipient belongs to was chosen in such a way that its value of the analyzed auxiliary variable was the smallest;
– Nearest neighbour (deterministic hot–deck): the nearest neighbour approach using the Czekanowski metric (3) was used, with operational income and capital invested as covariates, so that this metric can produce effective results;
– Ratio nearest neighbour: according to the suggestion by G. Chauvet et al. (2011) described earlier, we treated the missing variable Y (revenue) as the ratio S = Y/Z, where Z is the auxiliary variable most correlated with it (i.e. capital invested), imputed s_j = s_i from the relevant donor i ∈ D using the remaining variables and the nearest neighbour method, and hence took y_j = z_j s_j (j ∈ B);
– Mean imputation: all gaps were filled with the mean of the available data for revenue;
– Ratio imputation: the relation between revenue and capital invested was used to impute the missing data.

The use of more than one explanatory variable in the case of the nearest neighbour method was motivated both by the construction of this method (it is efficient when more than one auxiliary variable is used) and by the fact that in the single–variable case it can serve as an equivalent of some other methods. The remaining methods are meant to be applied with one auxiliary variable, but its choice also depends on the realization of the remaining possible variables (e.g. by maximizing their correlation with the dependent variable). Thus, the bias usually generated by the sequential hot–deck method is reduced.
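The methods are compared below by the sum of the absolute deviations of the imputed values from the true ones over the imputed records. As a check, the following base–R lines reproduce this criterion for the nearest neighbour column of Table 1 below:

# Sum of absolute deviations between imputed and true revenue for the six
# imputed records (values taken from the nearest neighbour column of Table 1).
true_values    <- c(950.7, 53.0, 797.9, 948.1, 974.7, 3084.3)
imputed_values <- c(1008.1, 145.3, 1487.7, 1363.9, 1487.7, 2117.9)
sum(abs(imputed_values - true_values))   # 2734.7, as reported in the text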
The last two methods (mean and ratio imputation) were used to enable a full comparison with the methods described in the module on model–based imputation. The summary results of the imputation are presented in Table 1. Data which had to be imputed (the relevant units were sampled from the uniform distribution) are marked with an asterisk (*).

Table 1. Original and imputed data for revenue

Company | Original data | Random hot–deck | Nearest neighbour (deterministic hot–deck) | Sequential hot–deck | Ratio nearest neighbour | Mean imputation | Ratio imputation
Ameristar Casinos Inc. | 1277.0 | 1277.0 | 1277.0 | 1277.0 | 1277.0 | 1277.0 | 1277.0
Archon Corp. | 43.2 | 43.2 | 43.2 | 43.2 | 43.2 | 43.2 | 43.2
Bally Technologies, Inc.* | 950.7 | 2117.9 | 1008.1 | 43.2 | 758.0 | 1373.9 | 306.0
Berjaya Land Bhd | 904.1 | 904.1 | 904.1 | 904.1 | 904.1 | 904.1 | 904.1
Boyd Gaming Corp. | 1837.0 | 1837.0 | 1837.0 | 1837.0 | 1837.0 | 1837.0 | 1837.0
Codere, S.A. | 1487.7 | 1487.7 | 1487.7 | 1487.7 | 1487.7 | 1487.7 | 1487.7
Crown Limited | 2117.9 | 2117.9 | 2117.9 | 2117.9 | 2117.9 | 2117.9 | 2117.9
Dover Downs Gaming & Entertainment Inc. | 239.1 | 239.1 | 239.1 | 239.1 | 239.1 | 239.1 | 239.1
Gamehost Income Fund* | 53.0 | 1006.9 | 145.3 | 43.2 | 88.3 | 1373.9 | 45.3
Genting Berhad | 2595.6 | 2595.6 | 2595.6 | 2595.6 | 2595.6 | 2595.6 | 2595.6
Gold Reef Casino Resorts Ltd. | 254.1 | 254.1 | 254.1 | 254.1 | 254.1 | 254.1 | 254.1
Great Canadian Gaming Corp. | 384.7 | 384.7 | 384.7 | 384.7 | 384.7 | 384.7 | 384.7
Groupe Partouche SA* | 797.9 | 1277.0 | 1487.7 | 43.2 | 1432.4 | 1373.9 | 540.7
Kangwon Land Corp.* | 948.1 | 384.7 | 1363.9 | 43.2 | 1609.0 | 1373.9 | 964.6
Melco Crown Entertainment Ltd. | 1342.5 | 1342.5 | 1342.5 | 1342.5 | 1342.5 | 1342.5 | 1342.5
MGM Mirage | 7513.0 | 7513.0 | 7513.0 | 7513.0 | 7513.0 | 7513.0 | 7513.0
Monarch Casino & Resort Inc. | 145.3 | 145.3 | 145.3 | 145.3 | 145.3 | 145.3 | 145.3
NagaCorp Ltd. | 188.4 | 188.4 | 188.4 | 188.4 | 188.4 | 188.4 | 188.4
Olympic Entertainment Group | 258.0 | 258.0 | 258.0 | 258.0 | 258.0 | 258.0 | 258.0
Paradise Co., Ltd. | 337.3 | 337.3 | 337.3 | 337.3 | 337.3 | 337.3 | 337.3
Penn National Gaming Inc. | 2423.1 | 2423.1 | 2423.1 | 2423.1 | 2423.1 | 2423.1 | 2423.1
Pinnacle Entertainment Inc. | 1006.9 | 1006.9 | 1006.9 | 1006.9 | 1006.9 | 1006.9 | 1006.9
Resorts World Bhd | 1363.9 | 1363.9 | 1363.9 | 1363.9 | 1363.9 | 1363.9 | 1363.9
Sky City Entertainment Group Ltd. | 610.9 | 610.9 | 610.9 | 610.9 | 610.9 | 610.9 | 610.9
Sun International Ltd.* | 974.7 | 2595.6 | 1487.7 | 43.2 | 1399.4 | 1373.9 | 528.3
Tabcorp Holdings Ltd. | 2888.6 | 2888.6 | 2888.6 | 2888.6 | 2888.6 | 2888.6 | 2888.6
The Rank Group Plc | 1008.1 | 1008.1 | 1008.1 | 1008.1 | 1008.1 | 1008.1 | 1008.1
Wynn Resorts Ltd.* | 3084.3 | 1006.9 | 2117.9 | 43.2 | 2165.1 | 1373.9 | 1881.8

Source: Own tabulation on the basis of data from Table 4 in the “Model based imputation” module, using the SAS Enterprise Guide 4.2. For records that were not imputed, all columns repeat the original value.

As can be seen, the best quality of imputation is produced by the ratio approach: the sum of the absolute values of the partial deviations from the true values amounted to 2575.1. This is probably the result of the very strong correlation between revenue and the auxiliary variable (capital invested). The output of the remaining imputation methods is much worse: the deviation from reality in the case of random hot–deck is 6861.9, for nearest neighbour imputation 2734.7, for sequential hot–deck 6549.5, and for mean imputation 4855.6. Please recall that, within the broad field of regression imputation, the most precise approach was predictive mean matching (2605.5). Of course, such a performance of the ratio method does not guarantee a similar precision in other situations: the correlation between the target and the auxiliary variables is often weaker and the fit of a multiple regression may be better. To better assess the usefulness of the nearest neighbour approach (deterministic hot–deck),
we performed one more experiment, which we have called ratio deterministic hot–deck (i.e. the ratio nearest neighbour method). That is, according to the suggestion by G. Chauvet et al. (2011) described earlier, we treated the missing variable Y (revenue) as the ratio S = Y/Z, where Z is the auxiliary variable most correlated with it (i.e. capital invested), imputed s_j = s_i from the relevant donor i ∈ D using the remaining variables and the nearest neighbour method, and hence took y_j = z_j s_j (j ∈ B). In this case we obtained the following implants: 758.0 for Bally Technologies, Inc., 88.3 for Gamehost Income Fund, 1432.4 for Groupe Partouche SA, 1609.0 for Kangwon Land Corp., 188.4 for NagaCorp Ltd., 1399.4 for Sun International Ltd. and 2165.1 for Wynn Resorts Ltd. The total deviation from the true values was relatively small, but higher than in the case of the ‘classical’ nearest neighbour method, and amounted to 2867.25. The differences between the true and imputed summary statistics are presented in Table 2.

Table 2. Basic descriptive statistics for the original and imputed revenue

Specification | Mean | Variance | Minimum | Maximum | Lower quartile | Median | Upper quartile | Skewness | Kurtosis
Original full data | 1322.68 | 2237519.13 | 43.20 | 7513.0 | 297.65 | 962.7 | 1662.4 | 2.8250 | 10.531
Random hot–deck | 1379.12 | 2158047.79 | 43.20 | 7513.0 | 361.00 | 1007.5 | 1977.5 | 2.8400 | 10.971
Sequential hot–deck | 1088.77 | 2344922.41 | 43.20 | 7513.0 | 94.250 | 497.80 | 1425.8 | 2.9563 | 11.238
Nearest neighbour (deterministic hot–deck) | 1351.32 | 2117526.40 | 43.20 | 7513.0 | 297.65 | 1142.6 | 1662.4 | 2.9495 | 11.775
Ratio nearest neighbour | 1345.66 | 2135551.16 | 43.20 | 7513.0 | 297.65 | 1142.6 | 1723.0 | 2.9206 | 11.579
Mean imputation | 1373.93 | 2035728.98 | 43.20 | 7513.0 | 361.00 | 1353.2 | 1430.8 | 3.0963 | 12.759
Ratio imputation | 1231.90 | 2190386.81 | 43.20 | 7513.0 | 281.98 | 934.4 | 1662.4 | 3.0276 | 11.894

Source: Own tabulation using the SAS Enterprise Guide 4.2.

The best estimation of the mean was obtained using the ratio nearest neighbour approach (the absolute deviation from the true value amounted to only 22.98) and the classical nearest neighbour method (28.64); in the remaining cases it exceeded 50. So, the quality of the estimation of overall descriptive statistics can be improved by using the ratio–corrected nearest neighbour method. For the odd quartiles the situation is even better: their deviations in the case of the nearest neighbour approach amount to zero. However, in the case of the median, ratio imputation (a deviation of 28.3) is the best. The skewness and kurtosis are similar to those observed in the original data. Thus, we can conclude that, depending on the target statistics for the population, various methods of imputation should be applied.

4. Examples – tool specific

This example includes the syntax for the use of the method with particular software. In this case it is the code for SAS Enterprise Guide 4.2 used to perform the computations presented in section 3.
/* Loading the data */
data dfirmy;
  set sasied.firmy;
run;

proc iml;
/* Identification of the variable best correlated with revenues */
use dfirmy;
read all var _NUM_ into danef;
mcor=corr(danef);
a=nrow(mcor);
n=nrow(danef);
m=ncol(danef);
maxc=0;
do i=2 to a;
  if abs(mcor[i,1])>maxc then do; maxc=mcor[i,1]; zmp=i; end;
end;
print "The best auxiliary variable is" zmp "coefficient of correlation with revenues" maxc;

/* Division of records into classes according to the best auxiliary variable */
cp=j(n,1,0.);
dimp=j(n,7,0.);
do i=1 to n; dimp[i,1]=danef[i,1]; end;
do i=1 to n; cp[i]=danef[i,zmp]; end;
cls=j(n,1,0);
maxcp=max(cp);
mincp=min(cp);
ncl=4;
do i=1 to n;
  do p=0 to 3;
    if p<3 then do;
      if (mincp+(p*(maxcp-mincp)/4)<=cp[i])&(cp[i]<mincp+((p+1)*(maxcp-mincp))/4) then cls[i]=p+1;
    end;
    else do;
      if (mincp+(p*(maxcp-mincp)/4)<=cp[i])&(cp[i]<=mincp+((p+1)*(maxcp-mincp))/4) then cls[i]=p+1;
    end;
  end;
end;
numc=j(ncl,1,0);
do g=1 to ncl;
  lc=0;
  do i=1 to n; if cls[i]=g then lc=lc+1; end;
  numc[g]=lc;
end;
do g=1 to ncl;
  if numc[g]>0 then do;
    odc=j(numc[g],1,0.);
    k=0;
    do until (k=numc[g]);
      do i=1 to n; if cls[i]=g then do; k=k+1; odc[k]=danef[i,zmp]; end; end;
    end;
    do i=1 to numc[g]-1;
      q=i; klucz=odc[i];
      do p=i+1 to numc[g]; if odc[p]<klucz then do; q=p; klucz=odc[p]; end; end;
      odc[q]=odc[i]; odc[i]=klucz;
    end;
    /* random hot-deck imputation */
    dl=0.;
    do i=1 to n;
      if (i=3|i=9|i=13|i=14|i=25|i=28) then do;
        if cls[i]=g then do;
          do until (int(numc[g]*dl)>0&int(numc[g]*dl)^=3&int(numc[g]*dl)^=9
                    &int(numc[g]*dl)^=13&int(numc[g]*dl)^=14&int(numc[g]*dl)^=25
                    &int(numc[g]*dl)^=28);
            call randgen(dl,'uniform');
          end;
          _dl=int(numc[g]*dl);
          dimp[i,2]=danef[_dl,1];
        end;
      end;
      else dimp[i,2]=danef[i,1];
    end;
    /* sequential hot-deck imputation */
    do i=1 to n;
      if (i=3|i=9|i=13|i=14|i=25|i=28) then do;
        if cls[i]=g then do;
          cjk=0; jk=1;
          do until (cjk=1);
            do _i=1 to n;
              if (_i^=3&_i^=9&_i^=13&_i^=14&_i^=25&_i^=28&odc[jk]=danef[_i,zmp]) then do; ic=_i; cjk=1; end;
            end;
            if cjk=0 then jk=jk+1;
          end;
          dimp[i,3]=danef[ic,1];
        end;
      end;
      else dimp[i,3]=danef[i,1];
    end;
  end;
end;

/* deterministic imputation (Czekanowski metric) */
wyr=j(n,1,0);
do i=1 to n;
  if (i=3|i=9|i=13|i=14|i=25|i=28) then do;
    wyr[i]=1;
    mind=10E60; _mind=10E60;
    do h=1 to n;
      czek=0.; _czek=0.;
      do j=2 to m;
        if h^=i then do;
          czek=czek+abs(danef[i,j]-danef[h,j])/(m-1);
          if j^=zmp then _czek=_czek+abs(danef[i,j]-danef[h,j])/(m-2);
        end;
      end;
      if czek<mind&h^=i then do; mind=czek; imin=h; end;
      if _czek<_mind&h^=i then do; _mind=_czek; _imin=h; end;
    end;
    dimp[i,4]=danef[imin,1];
    dimp[i,5]=(danef[_imin,1]/danef[_imin,zmp])*danef[i,zmp];
  end;
  else do; dimp[i,4]=danef[i,1]; dimp[i,5]=danef[i,1]; end;
end;

/* Mean imputation */
ms=0.;
sl=0;
do i=1 to n;
  if wyr[i]=0 then do; ms=ms+danef[i,1]; sl=sl+1; end;
end;
imean=ms/sl;
do i=1 to n;
  if (i=3|i=9|i=13|i=14|i=25|i=28) then do; dimp[i,6]=imean; end;
  else dimp[i,6]=danef[i,1];
end;

/* Ratio imputation */
imean1=j(4,1,0.);
imean2=j(4,1,0.);
do g=1 to ncl;
  ms=0.; _ms=0.; sl=0;
  do i=1 to n;
    if wyr[i]=0 then do; ms=ms+danef[i,1]; _ms=_ms+danef[i,zmp]; sl=sl+1; end;
  end;
  imean1[g]=ms/sl;
  imean2[g]=_ms/sl;
end;
do i=1 to n;
  if (i=3|i=9|i=13|i=14|i=25|i=28) then do; dimp[i,7]=(imean1[cls[i]]/imean2[cls[i]])*danef[i,zmp]; end;
  else dimp[i,7]=danef[i,1];
end;

/* Release of the output */
cw={"Original" "Random hot deck" "Sequential hot deck" "Deterministic hot deck" "Ratio deterministic hot deck" "Mean" "Ratio"};
create daneimp from dimp[colname=cw];
append from dimp;
quit;

/* Output analysis */
proc means data=daneimp fw=12 printalltypes chartype qmethod=os alpha=0.05 vardef=df
  mean var min max q1 median
  q3 skew kurt;
  var 'Original'n 'Random hot deck'n 'Sequential hot deck'n 'Deterministic hot deck'n 'Ratio deterministic hot deck'n 'Mean'n 'Ratio'n;
run;

5. Glossary

Hot–deck imputation – A method of imputation where the imputed data for a recipient are derived from a donor specially chosen from the pool (or from relevant groups determined by a categorical criterion) of possible donors, using currently processed available data.

Random hot–deck imputation – A method of imputation where, for each recipient, the donor is selected randomly from a set (or relevant subset) of potential donors.

Sequential hot–deck imputation – A method of imputation where groups of donors are not actively formed; instead, for each missing item of a non–respondent, the value of the target variable is imputed from the next record in the data file with the same values of certain background characteristics expressed by relevant categorical variables.

Nearest neighbour imputation (NNI) – A method of imputation where a record in the same file with similar values of key auxiliary variables is found (the similarity is assessed using a specially chosen distance formula). It is also called deterministic hot–deck imputation.

Ratio nearest neighbour imputation (RNNI) – A variant of the NNI method referring to a situation when the target variable is strongly correlated with an auxiliary variable measuring e.g. the unit size. In this case, it may be reasonable to treat the missing variable as the ratio of these two variables, impute it using NNI from the best donor, and multiply the imputed value by the value of the aforementioned auxiliary variable for the recipient.

Cold–deck imputation – A method of imputation where a data set other than the currently processed one is used to impute the missing values by hot–deck–type methods. Such data sets may be derived from a previous survey or from another source (e.g. an administrative database).

R – A free software environment for statistical computing and graphics, consisting of several thousand packages which enable professional and highly specialized statistical research and analysis.

SAS – Formerly the Statistical Analysis System; complex software supporting statistical surveys, data analysis and the dissemination of results (including business analysis and business intelligence).

X–12 ARIMA – The commonly used seasonal adjustment software produced, distributed and maintained by the U.S. Census Bureau. It enables effective modelling of time series, assessment of model selection for linear regression models with ARIMA process errors, investigation of seasonal and trend tendencies, as well as diagnostics of the quality and stability of the adjustments.

6. Literature

Andridge R. R., Little R. J. A. (2010), A Review of Hot Deck Imputation for Survey Non–response, International Statistical Review, vol. 70, pp. 40–64.

Chauvet G., Deville J.-C., Haziza D. (2011), On Balanced Random Imputation in Surveys, Biometrika, vol. 98, pp. 459–471.

Fuller W. A., Kim J. K. (2005), Hot Deck Imputation for the Response Model, Survey Methodology, vol. 31, pp. 139–149.

Israëls A., Kuyvenhoven L., van der Laan J., Pannekoek J., Nordholt E. S. (2011), Imputation, Series: Statistical Methods (201112), Statistics Netherlands, The Hague/Heerlen, Netherlands.

Jajuga K., Walesiak M. (2003), On the general distance measure, [in:] M. Schweiger, O.
Opitz (eds.), Exploratory Data Analysis in Statistical Research, Proceedings of the 25th Annual Conference of the Gesellschaft für Klassifikation e.V., University of Munich, Series: Studies in Classification, Data Analysis and Knowledge Organization, Springer Verlag, Berlin–Heidelberg, pp. 104–109.

Kim J. K., Fuller W. A. (2004), Fractional hot deck imputation, Biometrika, vol. 91, pp. 559–578.

Luengo J., García S., Herrera F. (2009), Imputation of Missing Values. Methods' Description, University of Granada, Department of Computer Science and Artificial Intelligence, Granada, Spain.

Marker D. A., Judkins D. R. (2002), Large scale imputation for complex surveys, [in:] Survey Nonresponse, John Wiley and Sons, New York, pp. 329–341.

Młodak A. (2006a), Taxonomic analysis in regional statistics, DIFIN – Advisory and Information Centre, Warszawa, Poland (in Polish).

Młodak A. (2006b), Multilateral normalisations of diagnostic features, Statistics in Transition, vol. 7, pp. 1125–1139.

D'Orazio M., Di Zio M., Scanu M. (2006), Statistical Matching. Theory and Practice, John Wiley & Sons, New York.

Rao J. N. K., Shao J. (1992), Jackknife variance estimation with survey data under hot deck imputation, Biometrika, vol. 79, pp. 811–822.

White T. K., Reiter J. P. (2007), Multiple Imputation in the Annual Survey of Manufactures, 2007 Research Conference Papers, Federal Committee on Statistical Methodology, Office of Management and Budget, U.S.A.

Williams R. L., Folsom R. E. (1981), Weighted hot–deck imputation of medical expenditures based on a record check subsample, ASA Proceedings of the Section on Survey Research Methods, pp. 406–411.

Zeliaś A. (2002), Some Notes on the Selection of Normalization of Diagnostic Variables, Statistics in Transition, vol. 5, pp. 787–802.

Specific description – Method: Donor imputation

A.1 Purpose of the method

Donor imputation methods are aimed at producing efficient imputations of missing data for some records when other data are available for all records in the analyzed database. The imputed data are taken from those records for which they are complete.

A.2 Recommended use of the method

– Donor imputation is a safe method that can be used in many situations where making no model assumptions is preferred, e.g. if there is no auxiliary variable, if auxiliary variables are expressed on various measurement scales (for instance, some of them ordinal and others interval), or if the correlation between them is weak or difficult to determine (e.g. due to differences between measurement scales, where not all elementary operations are allowed). Model–based imputation can be better only if we are sure that the model is correct; in many practical situations there is no such certainty, and hence donor imputation seems more useful.
– This class of methods is especially useful for variables of various types, measured on various measurement scales (nominal, ordinal, interval or ratio, depending on the particular method).
– This class offers users a wide range of possibilities with respect to sub–methods and formulas for the distance between records, adequate to the given data and the assumed purposes. It is computationally efficient for larger sets of donors and recipients.

A.3 Possible disadvantages of the method

– Sensitivity to outliers in the case of some distance formulas, or to random errors in the randomized variants.
– Threat of an ineffective choice of weights in the weighted options.
A.4 Variants of the method

– Various alternative distance formulas can be used.
– Various weighting possibilities can be used, depending on the situation.
– The methods can be applied with varying degrees of randomness in the choice of donors.

A.5 Input data sets

– The set of donors, from which data are to be imputed.
– The set of recipients, where data are to be imputed.

A.6 Logical preconditions

1. Missing values: without missing values the method cannot be used; however, the larger their number, the lower the precision of imputation.
2. Erroneous values: rather not allowed, since they negatively affect the results of imputation; however, if they are identified, they can be removed and the respective records then treated as recipients, taking remark A.6.1 into account.
3. Other preconditions: proper choice of the method and its sub–formulas.

A.7 Tuning parameters

A calibration of the weights used in the weighted options may be desirable in special situations.

A.8 Recommended use of the individual variants of the method

The choice of an effective variant (in terms of formula and parameters) should be developed in relevant case studies.

A.9 Output data sets

Records with imputed data.

A.10 Properties of output data sets

An output set can be a good basis for estimating required data for small areas at various territorial levels.

A.11 Unit of input data suitable for the method

Each economic and spatial unit.

A.12 User interaction – not tool specific

Not applicable.

A.13 Logging indicators

Variables available in administrative data sources and obtained in sample surveys (the latter only for sampled records).

A.14 Quality indicators of the output data

Experiments show that the precision of imputation (i.e. the complex distance between implants and ‘true’ values), as well as the deviation of estimates of aggregated descriptive statistics based on imputed values from their ‘true’ values, seems to be satisfactory.

A.15 Actual use of the method

Not applicable.

A.16 Interconnections with other modules

Themes that refer explicitly to this module: Weighting; Quality assessment; Estimation; Sampling.

Related methods described in other modules: Calibration; Theme – Imputation; Model based imputation; Design of imputation.

Mathematical techniques used by the method described in this module: metrics and distance formulas; sampling of donors (usage of the uniform distribution).

GSBPM phases where the method described in this module is used: 5.4 Impute.

Tools that implement the method described in this module: the StatMatch package of the R software; the GDM distance can be computed using the clusterSim package of R. One can also mention original algorithms written in the SAS Enterprise Guide by the author of this module, which can be used in practice.

The process step performed by the method: data imputation and calculation of aggregates.

