Template for modules of the revised handbook



Theme: Design of sampling methods

0 General information

0.1 Module code

Theme: Design of sampling methods

0.2 Version history

|Version |Date |Description of changes |Author |Institute |

|1.0p3 |21-6-2012 |Second version |Ioannis Nikolaidis |ELSTAT |

0.3 Template version and print date

|Template version used |1.0 p 3 d.d. 28-6-2011 |

|Print date |21-6-2012 14:25 |

Contents

General section – Theme: Design of the sampling methods

1. Summary

2. General description

2.1 Sampling

2.2 Sample design

2.3 Application of sample design methods

2.4 Examples on sample design methods

3. Design issues

4. Available software tools

5. Decision tree of methods

6. Glossary

7. Literature

Specific section – Theme: Design of the sampling methods

A.1 Interconnections with other modules

2. Determining the boundaries of size classes used as strata

General section – Theme: Design of the sampling methods

Summary

Sampling provides a means of gathering information about a population without examining it entirely. Sample design comprises the selection method, the sample structure and plans for drawing inferences about the entire population. Sample designs can vary from simple to complex and depend on the type of information required and the selection method applied. The design affects the sample size and the way the analysis of sample results is conducted. The greater the required precision of estimates and the greater the complexity of the design applied, the larger the required sample size.

Many sample designs apply random selection, because this allows inferences to be made from the sample to the population at quantified levels of precision. Random selection prevents bias from arising, unlike non-random selection (purposive or quota sampling). However, random selection may not always be required, e.g. when the samples are small and the sample data will not be extrapolated to draw inferences about the entire population.

General description

2.1 Sampling

Sampling is a tool for selecting a subset of units from a target population in order to draw inferences about the entire population on the basis of data collected for these units. The subset of selected units is called a sample. The units that make up the target population must be described in terms of characteristics that clearly identify them (Cochran, 1977). However, some units of the target population may be excluded due to operational constraints, such as the high cost of data collection in some remote areas or the difficulty of identifying and contacting certain units. The population that remains after these exclusions is called the survey population. The list of units from which the sample will be drawn is the sampling frame. The frame population can be obtained and identified from existing information (Rensen, 1998).

2.2 Sample design

A sample design is a set of rules that specifies how a sample of a given size is to be selected. It provides information on the target and final sample sizes, strata definitions and the sample selection methodology. In the theory of finite population sampling, a sampling design specifies for every possible sample its probability of being drawn. Mathematically, a sampling design is denoted by the function $p(\cdot)$, which gives the probability $p(s)$ of drawing a sample $s$ (Cochran, 1977).

Sample design methods generally refer to the technique used to select the sample units for measurement (Rensen, 1998). There are two types of sampling:

• Probability or random sampling: Known positive selection probabilities for all population units. It provides reliable inferences about the entire population.

• Non-probability sampling: A subjective method is applied to select the sample and some units have zero selection probabilities.

There are different types of probability sample designs. The most basic one is simple random sampling. The designs increase in complexity to encompass systematic sampling, probability-proportional-to-size sampling, cluster sampling, stratified sampling, multistage sampling and multi-phase sampling. Each of these sampling techniques is useful in different situations. If the objective of the survey is simply to provide overall population estimates and stratification would be inappropriate or impossible, simple random sampling may be the best choice. If the survey is carried out by interviewers (which is not typical in business statistics), so that data collection is costly, and resources are limited, cluster sampling is used. If an effective stratification is possible and/or subpopulation estimates are also desired (such as estimates by region or size of business), stratified sampling is usually performed. The main advantage of probability sampling is that, since each unit is randomly selected and the inclusion probabilities of the sampling units can be determined, reliable estimates of the survey parameters and estimates of their sampling errors can be calculated.

Non-probability sampling is often used as an inexpensive, quick and convenient alternative to probability sampling. However, it is not a valid substitute for probability sampling, since the inclusion probabilities of elements cannot be calculated due to selection bias and sometimes the absence of a frame. As a result, there is no way of producing reliable estimates or estimates of their sampling error. In order to make inferences about the population, the characteristics of the population must follow an adequate model or be uniformly or randomly distributed over the population. The commonly used non-probability sampling techniques are purposive sampling, based on the deliberate choice of a sample; quota sampling, based on population proportions; and balanced sampling, which resembles quota sampling in the sense that control variables are used to guide the sample selection. Cut-off sampling applies probabilistic selection to one part of the population, whereas for the remainder of the population the selection probability is equal to zero (‘take-no’ part). This type of sampling is intermediate between probability and non-probability sampling, because part of the target population is deliberately excluded from the sample selection.

The different ways of probability and non-probability sample selection as well as the different estimation methods are described among others in the Statistics Canada publication “Survey Methods and Practices” (2003), the Eurostat publication “Survey sampling reference guidelines” (2008), Rensen (1998), Dalen (2005) and Montaquila and Kalton (2011).

Which sample design fits a particular survey best depends on the availability of suitable sampling frames, the auxiliary information included in the frame, the required level of precision, the detailed information to be obtained for subpopulations (called domains of analysis), the estimators to be applied for the parameter estimates and other operational constraints such as budget and time. The choice of a suitable probability sample ensures that the sample can support the required inferences without considerable biases. If an inappropriate probability sample design is used, the survey estimates will have larger variances than under a more suitable design. Biases may appear in the estimates due to non-response and non-coverage errors. Suitable weighting of sampling units reduces the bias due to non-response, but it may increase the variance of the estimates because the weights introduce additional variability. Corrections and weightings for non-coverage are more difficult than those for non-response, because coverage rates cannot be obtained from the sample, but only from outside sources (Kish, 1992).

2.3 Application of sample design methods

Simple Random Sampling

In simple random sampling (SRS), each population unit has an equal selection probability and each combination of n units has an equal selection probability. The formulas for estimating population parameters and conducting hypothesis tests are not complicated. However, SRS needs a complete population frame and, by chance, some domains may be over-represented in the sample while others may not be sampled at all.

Simple random sampling allows researchers to use simple weights and formulas for estimating the population parameters, conducting hypothesis tests and applying distribution theory for complex statistics. However, this type of sampling is not widely applied on its own, because it produces less efficient estimates than more complex designs. Additionally, with SRS there is a probability that some domains may be over-represented or not be sampled at all. Thus, in most cases complex sample designs are applied that have simple random sampling as their basis (e.g. stratified element sampling, cluster and multistage sampling, usually also stratified).
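The selection and weighting steps of SRS can be sketched in a few lines of Python. This is a minimal illustration, not part of the handbook: the function names and the small frame of turnover values are hypothetical, and the estimator shown is the usual expansion estimator in which every sampled unit carries the weight N/n.

```python
import random

def srs_without_replacement(frame, n, seed=None):
    """Draw n distinct units; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(list(frame), n)

def expansion_total(sample_values, population_size):
    """Estimate a population total under SRS with the weight N/n per unit."""
    n = len(sample_values)
    weight = population_size / n
    return weight * sum(sample_values)

# Hypothetical frame: 10 enterprises with one turnover value each
frame = {f"ent{i}": v for i, v in enumerate([12, 7, 30, 5, 18, 9, 22, 14, 6, 11])}
units = srs_without_replacement(frame.keys(), n=4, seed=1)
estimate = expansion_total([frame[u] for u in units], population_size=len(frame))
```

Because the selection is left entirely to chance, nothing in this sketch guarantees that any particular domain is represented, which is the weakness noted above.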

Stratification

Stratified sampling is used both to decrease the variances of the estimates and to obtain efficient domain estimators. Commonly, the major domains (e.g. economic activity branches, regions) for which separate and accurate estimates are sought can be defined as strata. For these domains (called design domains) the sampling methods and the sampling fractions can deliberately vary. In order to improve the efficiency of a sampling strategy compared with SRS, additional strata can be defined within the domains, taking into account the distribution of the main population variables. In doing so, the aim is to ensure strong homogeneity within the various strata (the units in a stratum should be similar with respect to the variables of interest) and the greatest possible difference between them. This is achieved if the stratification variables are strongly correlated with the survey variables of interest. The benefits of stratification diminish if the variables of interest are not highly correlated with the stratification variables. Stratification variables that are unrelated to each other, but related to the survey variables, should be preferred (Kish, 1965).

When the variables of interest have skewed distributions, as commonly happens in business surveys, a small number of large units account for a large share of the parameters of interest. In this case, within the design domains, a separate stratum is created that includes all large units (take-all stratum), which are surveyed exhaustively (Glasser, 1962; Hidiroglou, 1986; Lavallee and Hidiroglou, 1988; Hidiroglou, Choudhry and Lavallee, 1991). Generally, the take-all stratum is indispensable if the sampling rates would not yield enough large elements and if the ‘upper tail’ of the distribution of the variable [pic], although small, accounts for a large portion of the aggregate [pic] and substantially influences the estimates (Kish, 1965).

Commonly, in stratified element sampling, the sample within the strata that are actually sampled (take-some strata) is selected by systematic sampling. Before the systematic sample is drawn, the population is often ordered by variables considered strongly related to the main survey variables, in order to gain the precision that systematic selection offers in a population with a monotonic trend (Kalton, 1983).
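The ordering and selection within one take-some stratum can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function name, the `size_of` argument and the use of a fractional interval N/n with a random start are choices made here for the example, not prescriptions of the handbook.

```python
import random

def systematic_sample(units, n, size_of, seed=None):
    """Systematic sample of n units from one stratum.

    The units are first ordered by a size measure, then every k-th unit
    is taken, with k = N/n and a random start in [0, k). Requires n <= N.
    """
    ordered = sorted(units, key=size_of)
    N = len(ordered)
    k = N / n
    start = random.Random(seed).random() * k
    # min(...) guards against a floating-point index equal to N
    return [ordered[min(int(start + i * k), N - 1)] for i in range(n)]

# Hypothetical stratum: 20 units identified by their number of employees
stratum = [35, 12, 48, 20, 15, 41, 27, 18, 33, 22,
           45, 11, 29, 38, 16, 24, 31, 43, 14, 26]
chosen = systematic_sample(stratum, n=5, size_of=lambda u: u, seed=3)
```

Ordering by size before selection spreads the sample evenly over the size range of the stratum, which acts as an implicit further stratification.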

Two-phase sampling

When the sampling frame lacks auxiliary information that could be used to stratify the population, a two-phase sampling scheme may be applied (Kish, 1965). In the first phase a large sample is selected to obtain the required stratification information. This first sample is then stratified to provide, in the second phase, a subsample from each stratum, in order to collect more detailed information.

For example, suppose that detailed information is needed about the local units of enterprises. The sampling frame only lists the local units, with no auxiliary information about their sizes (number of employees), which is needed for stratification. In the first phase, a large sample of local units is selected so as to gather the required information on the stratification variable (e.g. number of employees). This first sample is then stratified and, in the second phase, a smaller sample of these units is drawn in each stratum in order to collect more detailed information.

Subclasses

A subclass is the sample of a domain. If a domain was not treated as a design domain, then selecting subclass members from the sample has the effect that zero values are assigned to the variables of non-members. The proportion of zero values (blanks) increases as the subclass proportion decreases. When crossclasses (subclasses that cut across the strata) become smaller, the variability of survey variables increases greatly and, as a result, the gains in accuracy from stratification tend to be lost, especially for small crossclasses (Kish and Frankel, 1974). This occurs because in a stratum [pic] only [pic] elements out of the [pic] elements of the population and only [pic] elements out of the [pic] elements of the sample belong to the subclass. Although [pic] is fixed, the number [pic] of subclass members in the sample is a random variable. As an example of crossclasses, one can consider the size classes of enterprises defined by employment when the stratification variable is not employment, but annual turnover (Kish, 1965).

Boundaries of size classes used as strata

The choice of boundaries for size classes used as strata depends both on the need to create design domains and on the nature of the distribution of the stratification variables. For continuous variables, a practical procedure for obtaining these boundaries is the cumulative $\sqrt{f}$ (CUM $\sqrt{f}$) rule (Dalenius and Hodges, 1959), which creates optimal boundaries for both small and large numbers of size classes. The rule is roughly equivalent to making $W_h S_h$ constant, as conjectured by Dalenius and Gurney (Cochran, 1977), where $W_h$ is the relative size of size class $h$ and $S_h$ is the standard deviation of the variable $x$ in that class. Ekman’s similar rule makes $W_h d_h$ constant, where $d_h$ is the width of size class $h$ (Ekman, 1959; Cochran, 1977). Additionally, for creating boundaries of continuous variables the rule of equal aggregate stratum sizes may be applied, making the values $W_h \bar{X}_h$ equal across the size classes (Hansen, Hurwitz and Madow, 1953; Sethi, 1963), where $\bar{X}_h$ is the mean of the stratification variable $x$ in size class $h$. This works well when the coefficients of variation $S_h / \bar{X}_h$ are about the same for each size class. Then the equality of $W_h \bar{X}_h$ implies that $W_h S_h$ is equal between strata, which yields a solution close to the best one (Hess, Sethi and Balakrishnan, 1966).
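The cumulative square root of frequency rule lends itself to a short computational sketch. The Python function below is an illustration only: its name, the choice of 50 equal-width histogram bins and the convention of placing each boundary at the upper edge of the bin where the cumulative total is reached are assumptions made here, not part of the handbook.

```python
import math

def cum_sqrt_f_boundaries(values, n_strata, n_bins=50):
    """Approximate Dalenius-Hodges boundaries for one stratification variable.

    Steps: bin the values into equal-width classes, cumulate the square
    roots of the class frequencies, and cut the cumulative total into
    n_strata equal parts.
    """
    lo, hi = min(values), max(values)
    if hi <= lo:
        raise ValueError("values must not all be equal")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    cum, running = [], 0.0
    for c in counts:
        running += math.sqrt(c)
        cum.append(running)
    boundaries = []
    for h in range(1, n_strata):
        target = running * h / n_strata
        # first bin whose cumulative sqrt(f) reaches the target
        i = next(j for j, c in enumerate(cum) if c >= target)
        boundaries.append(lo + (i + 1) * width)
    return boundaries

# Hypothetical skewed size variable (e.g. turnover) for 200 units
turnover = [i ** 2 for i in range(1, 201)]
bounds = cum_sqrt_f_boundaries(turnover, n_strata=3)
```

With very coarse bins or extremely skewed data two boundaries can coincide; in practice the number of bins is then increased or the largest units are first moved to a take-all stratum.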

Number of size classes used as strata

The question relevant to deciding the number of size classes is at what rate the variance of [pic] decreases as L (the number of size classes) increases ([pic] is the estimated value of [pic] in stratified sampling, given the sample size). To decide the number of size classes, we may first apply the CUM $\sqrt{f}$ rule, the Dalenius and Gurney rule or Ekman’s rule to stratify units into L=4 to 8 strata, and subsequently, given the sample size, calculate in each separate case the variance [pic] of the estimate [pic]. As L increases, the values [pic] decrease. If very little reduction in variance appears beyond L=H, then we decide that the ideal number of size classes is H.

When homogeneous strata are defined using categorical variables, cluster analysis may be applied both for creating the boundaries and for determining the number of strata (Holland, 2007). This analysis combines categorical data and finds the “natural groupings” of units.

Sample allocation to strata

Disproportional sampling fractions [pic] for strata [pic] can be used deliberately in order to decrease the element variances of the variables of interest or the costs. These are achieved by applying ‘optimal allocations’ to strata, according to the allocation formula [pic] ([pic]: per-element cost of sampling in stratum [pic]). Applying this method, important gains in precision and costs are achieved, especially in business surveys, where the variables of interest have skewed distributions. These deliberate differences in the sampling fractions [pic] should be large to be effective, by factors from 2 to 10 and even greater (Kish, 1992). Smaller differences seldom produce suitably larger effects (Kish 1965, 1987, 1992). The differences among the [pic]’s should be highly related to survey variables and especially to the stratification variables. The [pic]’s may be simple integral multiples of a basic sampling rate [pic], like [pic] or [pic] (Kish, 1992). In order to avoid biases in the produced estimates, these differences among sampling fractions are always compensated with inverse weights. In business surveys, the sampling fractions are reduced for small size class units, which implies that small units get large sampling weights.

In stratified sampling, the application of formulae for ‘optimum allocation’ often depends on unknown population parameters. Thus, Neyman allocation requires knowledge of the stratum standard deviations of the variables of interest (mainly that of the stratification variable) to be used in the allocation [pic]. If the values of [pic] are not available, the strata means [pic] of variable [pic] (usually the stratification variable) can be used, assuming a constant coefficient of variation across strata. In this case, the optimal allocation becomes [pic], where [pic] is the stratum total (Kalton, 1983; Murthy, 1967). If up-to-date values of [pic] are not available, the optimal allocation may alternatively become [pic], where [pic] is the width of size class [pic]. Additionally, if [pic] is constant, Neyman allocation gives a constant sample size [pic] in all size strata, thereby providing a satisfactory allocation rule (Cochran, 1977).
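The optimum allocation described above can be illustrated with a short Python sketch. It assumes the standard textbook form in which the stratum sample size is proportional to the stratum size times the stratum standard deviation, divided by the square root of the per-element cost; the symbols N_h, S_h and c_h, the function name and the example figures are introduced here for illustration. With equal costs the formula reduces to Neyman allocation.

```python
import math

def optimal_allocation(n, N_h, S_h, c_h=None):
    """Allocate a total sample n to strata with n_h proportional to
    N_h * S_h / sqrt(c_h). If no costs are given, all costs are taken
    as equal, which gives Neyman allocation.

    Returns real-valued sizes; rounding and capping n_h at N_h
    (take-all strata) are deliberately left out of this sketch.
    """
    if c_h is None:
        c_h = [1.0] * len(N_h)
    scores = [N * S / math.sqrt(c) for N, S, c in zip(N_h, S_h, c_h)]
    total = sum(scores)
    return [n * s / total for s in scores]

# Hypothetical size classes: many small units with low variability,
# few large units with high variability
sizes = optimal_allocation(n=200, N_h=[5000, 800, 120], S_h=[2.0, 15.0, 90.0])
```

In the hypothetical example the smallest stratum by count receives a sampling fraction far above that of the largest one, which mirrors the disproportionate fractions typical of business surveys.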

Neither proportional allocation [pic] nor square root allocation [pic] gives significant gains in precision for non-uniform population distributions. However, in business surveys, for a given design domain, the sample belonging to a particular size class can be allocated to regions by proportional or square root allocation if the survey variables in that size stratum do not have skewed distributions over the population.
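For comparison, the two simpler rules can be sketched in Python. The sketch assumes the usual definitions, namely sample sizes proportional to the stratum counts N_h for proportional allocation and proportional to the square roots of N_h for square root allocation; the function names and the regional counts are hypothetical.

```python
import math

def proportional_allocation(n, N_h):
    """n_h proportional to the stratum population counts N_h."""
    total = sum(N_h)
    return [n * N / total for N in N_h]

def square_root_allocation(n, N_h):
    """n_h proportional to sqrt(N_h); favours small strata
    relative to proportional allocation."""
    roots = [math.sqrt(N) for N in N_h]
    total = sum(roots)
    return [n * r / total for r in roots]

# Hypothetical example: allocating the sample of one size class to 3 regions
regions = [900, 400, 100]
prop = proportional_allocation(60, regions)
root = square_root_allocation(60, regions)
```

Neither rule uses the variability within strata, which is why they bring little gain when the survey variables are skewed.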

Allocations to design domains can be carried out using different sampling fractions [pic]. The sampling fraction may increase from [pic] to [pic] ([pic]) in order to reduce the sampling errors of estimates in one or more design domains, especially for small domains. In other cases the sampling fractions may be reduced to [pic] to save overall costs (Kish, 1988; 1992).

Cluster sampling

The first intention in a sample design is the use of elements as sampling units. However, when individual selection of elements seems too expensive or the list of elements to select in the sample is not available, then the survey tasks can be facilitated by selecting clusters, which are units containing several elements. The number of elements in a cluster is called the size of a cluster. The clusters in most population surveys are of unequal size. As cluster sizes are unequal, selection of clusters is improved with stratification. The selection of clusters in strata may be done with equal probabilities, with systematic sampling or with probabilities proportional to their sizes. In business surveys, cluster sampling can be applied with the enterprise as a sampling unit and all its kind-of-activity or local units as observation units.

Clustering reduces the cost of data collection and also makes frame construction within clusters more efficient, but it results in a loss of precision of the estimates compared to a sample of elements of the same size. Cluster sampling increases the element variances, but these increases can be ameliorated by stratification, which should accompany cluster sampling. Both the reductions due to stratification and the increases due to clustering are expressed by “design effects” (Kish, 1965, 8.2; 1995). In cluster sampling the loss of precision occurs because elements within clusters are usually homogeneous, which is measured and assessed through the intra-class correlation. The design effect represents the combined effect of a number of components such as stratification, clustering and unequal weighting. Clustering should be preferred over individual selection when the lower cost per element compensates for the loss of precision, as often happens in large and widespread samples (Kish, 1965).
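The link between intra-class correlation and loss of precision can be made concrete with a small Python sketch. It uses a widely quoted approximation attributed to Kish, deff = 1 + (b - 1) * rho, where b is the average number of sampled elements per cluster and rho is the intra-class correlation. This formula is not stated in the text above and is supplied here as an assumption; it covers the clustering component only, not stratification or unequal weighting.

```python
def design_effect(mean_cluster_take, intraclass_corr):
    """Approximate design effect of clustering: 1 + (b - 1) * rho."""
    return 1.0 + (mean_cluster_take - 1.0) * intraclass_corr

def effective_sample_size(n, deff):
    """Size of a simple random sample with the same precision."""
    return n / deff

# Hypothetical example: 11 elements per cluster, rho = 0.1
deff = design_effect(11, 0.1)
n_eff = effective_sample_size(1000, deff)
```

Under these hypothetical figures a clustered sample of 1000 elements is about as precise as a simple random sample of half that size, which is the trade-off against the lower cost per element.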

Multistage sampling

The greater the spread of a sample of elements over the population, the greater the precision obtained, but also the greater the cost. A good trade-off between the two conflicting effects of clustering and survey cost can be achieved by subsampling within clusters. In multistage sampling, we first select the clusters (primary sampling units or first stage units) and then, in one or more further stages, the elements (secondary units or second stage units, …, final units or final stage units). For example, in the structure of earnings survey a two-stage sampling may be applied with the local unit as the primary sampling unit and the employees as observation units (final units).

Subsampling decreases the effect of clustering, which may decrease the variance of estimates greatly, without incurring a proportional increase in cost. Through subsampling the larger clusters are divided into smaller clusters. Additionally, the second stage frames used for subsampling are needed only for the selected first stage units and so on (United Nations, 1950).

In two-stage sampling, the overall uniform sampling fraction $f$ can be written as the product of the first and second stage sampling fractions $f_1$ and $f_2$, respectively, as follows:

$f = f_1 \times f_2$.

The fractions $f_1$ and $f_2$ refer not only to the sampling fractions of the first and second stage population units, but also to true selection probabilities. The equal probability $f$ for all population elements comprises two probability operations: selection of clusters with $f_1$ in the first stage and second stage selection with $f_2$ within the selected clusters (Kish, 1989).

Multistage stratified sampling

Multistage sampling is generally performed with stratification, since stratification offers even greater advantages for cluster sampling than for element sampling. For example, the relative gains of proportional stratification are greater for cluster than for element sampling from the same set of strata.

In multistage stratified sampling, the primary sampling units (PSUs) within strata may be selected either with equal probabilities or with probabilities proportional to their sizes (PPS). Equal probabilities may be applied when the sizes of the clusters (PSUs) are approximately equal within the strata. For unequal clusters the selection is usually done with PPS. The precision of the estimates can be increased if the variables of interest are strongly correlated with the size of the PSUs. As the elements within clusters commonly do not have skewed distributions, equal probabilities are used to subsample elements without replacement in the selected PSUs. For example, the employees in the selected local units can be selected with equal probabilities. However, if the primary sampling units (clusters) are the enterprises and the final units are the local units, then, as the local units usually have skewed distributions, these are selected by applying stratified sampling within the selected clusters (enterprises). This is commonly applied if the observation unit is the local unit (e.g. in the labour cost survey) and a frame of local units is not available for the sample design. Using a frame of enterprises based on the business register, enterprises are selected by stratified sampling and, in the next step, a frame (list) of local units is compiled for the selected enterprises. Extra information is attached to each local unit in the list (such as economic activity, region, number of employees), so that it can be used for determining the stratification criteria for the selection of enterprises and for defining the inclusion probabilities of elements within the clusters.

In multistage stratified sampling, self-weighting samples are often preferred in the design domains, especially when the survey variables are uniformly distributed over the population. The sampling fractions [pic] in strata [pic] are constant and equal to the basic sampling fraction [pic] of the design domain. Thus, we should focus on methods for selecting elements with equal probabilities, denoted by the constant overall sampling fraction [pic]. One way to obtain a self-weighting sample of elements is to take equal-sized subsamples from unequal-sized clusters that have been selected with probabilities proportional to their sizes.
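Why equal subsamples from PPS-selected clusters give a self-weighting sample can be checked numerically. The Python sketch below is illustrative: the symbols a (number of clusters selected), b (elements subsampled per cluster) and the cluster sizes are hypothetical, and it assumes that a times the cluster size never exceeds the total size and that b does not exceed any cluster size.

```python
def overall_inclusion_probabilities(cluster_sizes, a, b):
    """Overall element inclusion probability per cluster when
    a clusters are selected with PPS and b elements are then
    subsampled with equal probability in each selected cluster."""
    total = sum(cluster_sizes)
    probs = []
    for M in cluster_sizes:
        first_stage = a * M / total   # proportional to cluster size
        second_stage = b / M          # b elements out of M
        probs.append(first_stage * second_stage)
    return probs

# Hypothetical clusters (e.g. local units with 50 to 600 employees)
probs = overall_inclusion_probabilities([50, 100, 250, 600], a=1, b=10)
```

The cluster size cancels in the product, so every element has the same overall probability a*b divided by the total size, whatever the size of its cluster.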

In multistage stratified sampling, selecting the PSUs with probabilities proportional to their sizes is encouraged, because this greatly facilitates variance estimation. Variance estimates can be computed by taking into account only the PSUs and the strata, without the need to consider the complexities of the subsequent stages of the sample selection.

Determination of sample size

After both the sample design and the estimator have been decided, the sample size should be determined. The level of precision needed for the survey estimates affects the sample size. However, the sample size of a survey is a trade-off between the level of precision to be achieved, the survey budget and any other operational constraints such as time. In order to achieve a certain level of precision, the sample size depends on the following factors (National Audit Office, 2001; Trochim, 2002; Dalen 2005):

• Heterogeneity of the survey population: A larger sample is necessary to survey a more diverse population.

• Desired precision: A larger sample is needed to obtain smaller sampling errors of estimates, and therefore higher precision.

• Population size: The bigger the population, the bigger the sample needed. However, the precision of estimates is usually influenced more by the total sample size than by the sampling fraction.

• Type of sample design: A smaller sample is sufficient in the case of stratified sampling, whereas a larger sample is needed in the case of clustering.

• Nature of analysis: Complex multivariate statistics need larger samples for estimates and analysis, as the sample size should be determined taking into account the precision of all main survey variables and all domains of analysis.

When the sample size is optimized for more than one variable, the optimum sample size is first determined for each main variable (Bethel, 1989). Next, the maximum of the resulting sample sizes is taken as the final sample size. In business surveys, turnover and the number of employees are considered to be important variables, which are available in the frame and are strongly correlated with most of the survey variables. Thus, the optimum sample sizes are calculated for these variables and the maximum of their values constitutes the final sample size.
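The "maximum over the main variables" rule can be sketched in Python. The per-variable formula used below is the common textbook sample size for estimating a mean under simple random sampling with a finite population correction; it is an assumption introduced for illustration, as are the function names, the coefficients of variation and the 10% relative error. It does not implement the multivariate algorithm of Bethel (1989).

```python
import math

def srs_sample_size(N, cv, rel_error, z=1.96):
    """Sample size so that the relative margin of error of a mean
    is about rel_error, for a variable with coefficient of variation cv
    in a population of N units (SRS, finite population correction)."""
    n0 = (z * cv / rel_error) ** 2
    return math.ceil(n0 / (1 + n0 / N))

def final_sample_size(N, cvs, rel_error, z=1.96):
    """Largest of the per-variable sample sizes."""
    return max(srs_sample_size(N, cv, rel_error, z) for cv in cvs.values())

# Hypothetical frame of 10 000 enterprises and two main variables
n_final = final_sample_size(
    10000, {"turnover": 1.0, "employees": 0.5}, rel_error=0.1, z=2.0
)
```

The more variable of the two characteristics, here turnover, determines the final size, so the precision target for the other variable is met with room to spare.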

Non-probability sampling

a. Purposive sampling

Compiling short-term indices

For rapid estimation of short-term indicators the samples are often small. Commonly, the calculation of indices is based on comparisons between the current period and the period of the base year. The values of the surveyed variables [pic] at the current period [pic] are strongly related to the respective values [pic] of the base year, which are known for all elements. Thus, model-based estimators (e.g. ratio estimators) are used for compiling indices. As the samples for data collection are small, purposive sampling may be preferred for selecting the largest units, purposive stratified sampling for selecting the largest units in each stratum, and cut-off stratified sampling for selecting elements randomly above a certain size threshold (Särndal, Swensson, and Wretman, 1992).

When the above methods are applied for compiling indices, bias is created, because the sample data cannot be extrapolated to the entire population. However, some of the bias of short-term indices (or of changes over time, or of ratios between two periods) is removed because similar biases of the estimates [pic] at different time points cancel out. This removal of bias makes comparisons between potentially biased periodic surveys meaningful (Bell and Hillmer, 1990; Kish, 1994).
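A ratio estimator of the kind used for short-term indices can be sketched as follows. The Python function is an illustration under stated assumptions: the sample consists of a few purposively chosen large units, the base-year total is known for the whole population, and all names and figures are hypothetical.

```python
def ratio_estimate(sample_current, sample_base, base_total):
    """Ratio estimator for an index and a current-period total.

    sample_current: values of the sampled units in the current period
    sample_base:    values of the same units in the base year
    base_total:     known base-year total for the whole population
    """
    ratio = sum(sample_current) / sum(sample_base)  # index (base = 1)
    return ratio, ratio * base_total

# Hypothetical example: the three largest units of a small population
index, current_total = ratio_estimate(
    sample_current=[1320, 880, 550],
    sample_base=[1200, 800, 500],
    base_total=4000,
)
```

The estimate is reliable only to the extent that the units outside the sample develop like the sampled ones, which is the model assumption behind the bias discussed above.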

Ratio and regression estimators

For ratio estimators, apart from selecting the largest units or applying stratified sampling with selection of the largest units in each stratum, Karmel and Jain (1987) found that ratio estimation performs well when [pic] (where [pic] is the auxiliary variable used as stratification variable) is applied for sample allocation to strata, together with purposive selection of the largest units within each stratum.

The regression estimator works well when the largest and the smallest units in each stratum are selected (Karmel and Jain, 1987). Additionally, balanced sampling may be used when applying model-based estimators. With small samples it is preferable to use non-probability sampling (purposive sampling), which results in a small bias, rather than probability sampling, which is associated with a large variance. However, modelling with purposive sampling may not be suitable for all variables studied in a multipurpose survey. In practice, such purposive selections are not used with large samples because of the bias caused by failures in the model assumptions (Kalton, 1983).

The ratio and regression estimators are superior to the simple expansion estimators when the samples are small (Särndal, Swensson and Wretman, 1992). The multipurpose nature of surveys complicates the use of models in sample design. While models may suggest efficient non-random sampling, a random sample design is a necessary compromise that reflects the data requirements of survey objectives.

b. Quota sampling

If expansion estimators based on data from small samples for subgroups are used for parameter estimates, quota sampling may be applied. With this method, the population is first divided into strata or population cells using control variables (e.g. region, economic activity, size class of enterprises); then a number of sample units are selected in each cell, so that the distribution of these characteristics in the total sample matches their distribution in the population.

When quota sampling is applied, a simple method for determining the sample sizes [pic] in population cells [pic] is to take the [pic]’s proportionally to the population cell counts [pic], assuming that these are known. Another choice is to take the [pic]’s proportionally to [pic] or [pic] if the standard deviation values [pic] or the population mean values [pic] of the main variable [pic] are known in the different cells (Särndal, Swensson and Wretman, 1992).

c. Balanced sampling

Balanced sampling may be used for both model based and expansion estimators. Control variables (quantitative or qualitative with known population totals) are used to guide the sample selection, in such a way that the sample mean of each control variable is approximately equal to the population mean of the respective control variable.

d. Cut-off sampling

In business surveys, cut-off sampling is commonly applied for estimating short-term indices (e.g. turnover index) and for applying model-based estimators (ratio or regression estimators). In cut-off sampling, a part of the target population is deliberately excluded from sample selection, because surveying it would cost too much, while the bias caused by the cut-off is deemed negligible.

Concluding remarks

The population covered by business surveys changes continually on account of births, deaths, splits, mergers and classification changes. The sampling design should reflect the changing structure of the population. When determining the sample size, we have to take into account the levels of precision required for the survey estimates, the type of design (stratification, clustering), the estimator to be used, the availability of auxiliary information, and operational constraints such as budget and time.

Stratified random sampling with a systematic sample within each stratum is the sampling design that satisfies the requirements of business statistics production. For highly skewed populations, the survey has to include a stratum of large units that are sampled with certainty. Furthermore, for given domains of interest, unbiased (or nearly unbiased) estimates are developed along with the associated measures of reliability (variances, coefficients of variation). A desirable property of the estimation is that the estimates of domain totals add up to the estimate of the population total when the domains are exhaustive and non-overlapping (design domains). This can be ensured by using weights that do not depend on the domains.
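The additivity property can be illustrated with a small Python sketch. It assumes that each sampled unit carries a single design weight that is used for every domain; the data values and names are hypothetical.

```python
# Each sampled unit: (domain label, design weight, observed value)
sample = [
    ("construction", 10.0, 120.0),
    ("construction", 10.0, 80.0),
    ("manufacturing", 4.0, 300.0),
    ("manufacturing", 1.0, 2500.0),  # take-all unit, weight 1
]


def weighted_total(units, domain=None):
    """Expansion estimate: sum of weight * value over the domain (all units if None)."""
    return sum(w * y for d, w, y in units if domain is None or d == domain)


domains = {d for d, _, _ in sample}
domain_totals = {d: weighted_total(sample, d) for d in domains}
population_total = weighted_total(sample)
```

Because the same weight is applied to a unit whichever domain is estimated, the two domain estimates (2,000 and 3,700) add up to the estimate for the whole population (5,700).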

The sample design can be improved assuming good probability samples without considerable biases. The biases of non-response and non-coverage tend to show much similarity across countries. The pressure to reduce sample sizes substantially or to produce accurate estimates for small domains has increased the need for the effective use of auxiliary information in business surveys, combining data from other surveys and administrative record systems. However, the multipurpose nature and the large number of surveyed characteristics of most business surveys provide several reasons to prefer the simple expansion estimators in most circumstances.

2.4 Examples on sample design methods

Design of one-stage stratified sampling

Design of the Greek annual survey on constructions

In this survey, one-stage stratified sampling was applied, with the enterprise as the sampling unit. The sample design (determination of sample size, selection of sampling units) was based on the updated business register (BR) of the Hellenic Statistical Authority.

The enterprises included in the survey were stratified as follows:

• By Region – NUTS 2

• By class of NACE Rev.2 (4-digit level of economic activity), within each Region

• By size class of the enterprise. In each of the major strata (major stratum = Region x economic activity), the enterprises were stratified into L=5 size classes, according to their size, determined by their annual turnover in the business register, as follows:

|Size Class |Turnover (in €) |

|Class 1 |1 – 89.999 |

|Class 2 |90.000 – 249.999 |

|Class 3 |250.000 – 1.499.999 |

|Class 4 |1.500.000 – 9.999.999 |

|Class 5 |10.000.000+ |

The final size class (Class 5) is a census (take-all) stratum.

The variable used for the construction of size classes, the size class boundaries and the number of classes were determined as follows:

• The variable used to create the size classes of the enterprises in the BR is the annual turnover [pic] of the enterprise, since in every economic activity turnover is highly correlated with most of the survey characteristics. If we could stratify the enterprises by the value of [pic] within Regions and economic activities (4-digit code of NACE Rev.2), there would be no overlap between strata, and the variance within strata would be much smaller than the overall variance, particularly in the case of many strata.

• Given the number of final strata, the Dalenius and Gurney rule was applied to determine the best size class boundaries ([pic] = constant and [pic]).

• The question relevant to deciding the number of size classes is at what rate the variance of [pic] decreases as L (the number of size classes) increases ([pic] is the estimated value of [pic] in stratified sampling, given the sample size). So, applying the Dalenius and Gurney rule, the enterprises were stratified into L=4 to 7 strata and, given the sample size, the variance [pic] of [pic] was calculated in each case. As L increased, the values of [pic] decreased. As very little reduction in variance appeared beyond L=5, we decided that the number of size classes should be 5.

The sample size is 3,080 enterprises and the sampling rate is 2.9%. The sample size was defined in such a way that the relative standard error (coefficient of variation, CV) of the variables “number of employees” and “turnover” at the 2-digit level of economic activity and for the whole country should not exceed 3%. The sampling units (enterprises) were distributed among the size strata by applying optimal (Neyman) allocation.
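A minimal Python sketch of Neyman allocation with a take-all stratum is given below. It is not the procedure actually used for the survey: the stratum sizes and standard deviations are hypothetical, and the rounding rule and the minimum of two units per stratum are assumptions made for illustration.

```python
def neyman_allocation(strata, total_sample):
    """Neyman allocation: sample size proportional to (stratum size * standard deviation).

    strata maps a label to (N_h, S_h, take_all); take-all strata get n_h = N_h.
    """
    alloc = {h: N for h, (N, S, take_all) in strata.items() if take_all}
    remaining = total_sample - sum(alloc.values())
    sampled = {h: N * S for h, (N, S, take_all) in strata.items() if not take_all}
    weight_sum = sum(sampled.values())
    for h, weight in sampled.items():
        N_h = strata[h][0]
        # Round to the nearest unit, keep at least 2 units for variance estimation,
        # and never exceed the stratum size
        alloc[h] = min(N_h, max(2, round(remaining * weight / weight_sum)))
    return alloc


# Hypothetical size classes: (number of enterprises, std. deviation of turnover, take-all?)
size_classes = {
    "Class 1": (5000, 20.0, False),
    "Class 2": (2000, 50.0, False),
    "Class 3": (800, 150.0, False),
    "Class 4": (150, 600.0, False),
    "Class 5": (50, 0.0, True),
}
allocation = neyman_allocation(size_classes, total_sample=460)
```

Because each stratum sample size is rounded separately, the allocations need not sum exactly to the requested total in general; in this hypothetical example they do.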

In each final stratum [pic], a sample of [pic] enterprises was selected. The enterprises to be surveyed were selected from the total of [pic] enterprises with equal probabilities by applying systematic sampling. The sampling units (enterprises) were selected from the sampling frame based on data from the BR.
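Equal-probability systematic selection within a stratum can be sketched as follows. This is an illustrative implementation using a fractional sampling interval, not the code used for the survey; the function name and the frame are hypothetical, and the frame is assumed to be already sorted in the desired order.

```python
import random


def systematic_sample(frame, n, rng=random):
    """Equal-probability systematic sample of n units from an ordered frame."""
    N = len(frame)
    if not 0 < n <= N:
        raise ValueError("n must be between 1 and the stratum size")
    step = N / n                   # sampling interval, may be fractional
    start = rng.uniform(0, step)   # random start within the first interval
    # Take the unit whose position contains each point start + k * step
    return [frame[min(int(start + k * step), N - 1)] for k in range(n)]


# Hypothetical stratum of 103 enterprises, identified by their position in the frame
stratum_frame = list(range(103))
selected_units = systematic_sample(stratum_frame, 10, random.Random(1))
```

Sorting the frame by a size measure before selection spreads the sample across the size range, which is one common reason for preferring systematic selection within strata.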

Design of multistage sampling

Greek Structure of Earnings Survey

The survey covered all local units belonging to enterprises with 10 or more employees on annual average in the areas of economic activity defined by sections C–K plus M, N and O of NACE Rev.1. The sampling frame used for the selection of the primary sampling units (enterprises) was the register of enterprises with reference year 2004. This register is compiled from administrative data (Social Insurance Foundation and Tax Authorities). For sections M and N special registers were used with reference year 2005.

The local units used for data collection, that is, the survey units, were not included in the register of enterprises. Therefore, we applied two-stage stratified random sampling. At the first stage we selected enterprises together with all their local units, so that the primary sampling unit is in effect the local unit; at the second stage we selected employees (final units).

More specifically, the process for the selection of the sampling units was the following:

The enterprises with 10 or more employees included in the survey were stratified by:

a. Geographical Region – NUTS 1 (Northern Greece, Central Greece, Attica, Aegean Islands and Crete)

b. Two-digit NACE Rev.1.1 code of economic activity within each Geographical Region

(Geography x Economic Activity = Major stratum), and

c. Size class of the enterprise. In each of the major strata, the enterprises were stratified into 4 size classes, according to their size, determined by their number of employees in the business register, as follows:

|Size Class |Annual average number of employees |

|Class 1 |10–19.9 |

|Class 2 |20–49.9 |

|Class 3 |50–99.9 |

|Class 4 |100 + |

The size class 4 (100+ employees) was exhaustively surveyed (take-all class).

In each final stratum the sampling units were selected as follows:

1st stage: First, a sample of [pic] enterprises was selected out of the [pic] enterprises in the ultimate stratum h with equal probabilities of selection. If an enterprise had more than one local unit, all local units of the enterprise were selected. In effect, the primary sampling unit is the local unit.

2nd stage: In each selected local unit (namely local unit [pic]) a sample of [pic] employees was selected out of the [pic] employees included in the local unit during the survey period ([pic]=1,2, …,[pic]) with equal probabilities of selection.
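The two stages can be sketched in Python as follows. This is an illustration only: the data structure, the names and the seed are hypothetical, and taking a fixed number of employees per local unit is an assumption made for brevity rather than the rule applied in the survey.

```python
import random


def two_stage_sample(enterprises, n_enterprises, employees_per_unit, seed=None):
    """Stage 1: equal-probability sample of enterprises, keeping all their local units.
    Stage 2: equal-probability sample of employees within each selected local unit.

    enterprises maps enterprise id -> {local unit id: [employee ids]}.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(enterprises), n_enterprises)
    sample = {}
    for ent in chosen:
        for unit, staff in enterprises[ent].items():
            m = min(employees_per_unit, len(staff))  # cannot take more than are employed
            sample[(ent, unit)] = rng.sample(staff, m)
    return sample


# Hypothetical final stratum with three enterprises and their local units
stratum = {
    "A": {"A-1": ["a1", "a2", "a3", "a4"], "A-2": ["a5", "a6"]},
    "B": {"B-1": ["b1", "b2", "b3"]},
    "C": {"C-1": ["c1", "c2", "c3", "c4", "c5"]},
}
result = two_stage_sample(stratum, n_enterprises=2, employees_per_unit=3, seed=42)
```

Each key of the result is an (enterprise, local unit) pair and each value is the list of employees selected in that local unit.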

In each Section (one-digit economic-activity code), the sample size of enterprises (1st stage of sampling) and the number of employees to be surveyed (2nd stage of sampling) were defined in such a way that the coefficient of variation should not exceed 3% for the variable “number of employees” and 5% for the variable “gross monthly earnings”, based on data from the previous survey (reference year 2002).

Accordingly, the sample of employees in each final stratum was calculated proportionally to the total number of employees in the stratum.

Within each stratum, the employee sample was distributed in such a way that it was self-weighting.
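The self-weighting condition can be written out as a short derivation. The symbols below are assumptions introduced for illustration, since the original notation is not recoverable here: n_h of the N_h enterprises in stratum h are selected with equal probabilities together with all their local units, and m_hi of the M_hi employees are selected with equal probabilities in local unit i.

```latex
% Inclusion probability and design weight of employee j in local unit i of stratum h
\pi_{hij} = \frac{n_h}{N_h}\cdot\frac{m_{hi}}{M_{hi}},
\qquad
w_{hij} = \frac{1}{\pi_{hij}} = \frac{N_h}{n_h}\cdot\frac{M_{hi}}{m_{hi}} .

% The weight is the same for every employee in stratum h exactly when the
% second-stage sampling fraction is constant across local units:
\frac{m_{hi}}{M_{hi}} = f_{2h} \ \text{for all } i
\;\Longrightarrow\;
w_{hij} = \frac{N_h}{n_h\, f_{2h}} .
```

Under these assumptions, taking employees in each selected local unit in proportion to its number of employees yields a self-weighting sample within the stratum.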

The final sample size of local units was 8,540 out of 49,995 (sampling rate=17.1% according to estimations based on sampling data) and the active sample size of employees was 47,940 out of 1,489,184 (sampling rate=3.2%).

Design of purposive sampling

Greek Industrial Production Index (IPI)

Data are disseminated for the activities of Sections B, C, D and E (Mining – Quarrying, Manufacturing, Electricity and Water Supply) of Eurostat NACE Rev. 2 at the 2-digit, 3-digit and 4-digit level and for the main industrial groupings (capital goods, intermediate goods, durable consumer goods, non-durable consumer goods and energy). No geographical breakdown is made for the above data. Data are monthly and presented in the form of indices and growth rates. Each month, a gross series and a working-day adjusted series are calculated. The base year is the year 2005 (2005=100.0%).

The sampling frame for selecting the products is based on the results of the Greek annual survey of the Production and Sales of Industrial Products (PRODCOM) for the year 2005. The local units of the Greek annual industrial survey for 2005 and the Greek annual mining-quarrying survey for 2005 are used as a sampling frame for selecting the establishments (local units) and the KAUs (Kind-of-Activity Units) that produce the selected products.

In the first phase, the survey for the Industrial Production Index (2005=100.0%) covers 349 products. These are selected by purposive sampling in such a way that the sample of products should represent at least 40% of the total production value at the four-digit level of economic activity and 70% of the total production value at the two-digit level of economic activity, according to the results of the annual PRODCOM survey for the year 2005. The measurement of the surveyed products is made in terms of output quantities or in terms of production value or turnover, according to the specific situation in each branch of economic activity.

In the second phase, the sampling unit used is the KAU, and the sample of the units surveyed for the Industrial Production Index (2005=100.0%) comprises 1,468 units. Units are also selected by purposive sampling in such a way that the sample of units should a) produce the selected products and b) represent at least 40% of the total production at the four-digit level of economic activity and 70% of the total production at the two-digit level of economic activity, according to the results of the annual PRODCOM survey for the year 2005. In the manufacturing section, only establishments employing ten or more persons are included, but there is no such restriction in the other sectors covered by the Index. In terms of the volume of production, the sample of units surveyed for the index covers approximately 96% of the total production.
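A coverage target of this kind can be checked with a short Python sketch. The largest-first selection rule used here is an assumption chosen for illustration, since the source does not state how items were ranked; the product codes and values are hypothetical.

```python
def select_until_coverage(production_value, target_share):
    """Pick items in decreasing order of value until they cover target_share of the total."""
    total = sum(production_value.values())
    chosen, covered = [], 0.0
    ranked = sorted(production_value.items(), key=lambda kv: kv[1], reverse=True)
    for item, value in ranked:
        if covered >= target_share * total:
            break
        chosen.append(item)
        covered += value
    return chosen, covered / total


# Hypothetical production values of the products in one class of economic activity
products = {"P1": 50, "P2": 30, "P3": 15, "P4": 5}
chosen_40, share_40 = select_until_coverage(products, 0.40)
chosen_70, share_70 = select_until_coverage(products, 0.70)
```

With these values a single product already exceeds the 40% target (50% coverage), while two products are needed to pass the 70% target (80% coverage).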

Design issues

The following sampling techniques are applied in Business Surveys (alternatively or in combination):

• One-stage stratified random sampling with enterprise or local unit (establishment) as surveyed unit and economic activity, geography and size class of surveyed units as stratification criteria. Systematic sampling in each final stratum.

• Two-stage stratified random sampling with establishment as primary sampling unit (PSU), employee as final unit and economic activity, geography and size class of enterprises as stratification criteria. Selection of PSUs with equal probabilities or probabilities proportional to their sizes in each final stratum.

• Cut-off or purposive stratified sampling for selecting small samples of enterprises, local units or kind-of-activity units used for compiling short-term economic indices.

Available software tools

Packages for sample designs:

• AM Software from American Institutes for Research.

• Bascula from Statistics Netherlands.

• CENVAR from U.S. Bureau of the Census.

• CLUSTERS from University of Essex.

• Epi Info from Centers for Disease Control.

• Generalized Estimation System (GES) from Statistics Canada.

• IVEware from University of Michigan.

• PCCARP from Iowa State University.

• R survey package from the R Project.

• SAS/STAT from SAS Institute.

• SPSS Complex Samples from SPSS Inc.

• Stata from Stata Corporation.

• SUDAAN from Research Triangle Institute.

• VPLX from U.S. Bureau of the Census.

• WesVar from Westat, Inc.

Decision tree of methods

[pic]

Glossary

[Only mention terms in this module-specific local glossary that are independent of a particular tool and with no SDMX equivalent. Copies of SDMX definitions from the Statistical Data and Metadata Exchange, or some other global glossary, can be included for the convenience of the reader. Local terms are marked by an asterisk (*)]

|Term |Definition |Source of definition (link) |Synonyms (optional) |

|Coefficient of variation |The coefficient of variation is defined as the ratio of the square root of the variance of the estimator to the expected value. |efficient_of_variation | |

|Covariance |In probability theory and statistics, covariance is a measure of how much two random variables change together. |variance; s/covariance/ | |

|Cluster analysis |The task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters. |v13.pdf; uster_analysis | |

|Cluster sampling |Cluster sampling is a sampling technique used when "natural" groupings are evident in a statistical population. The total population is divided into these groups (or clusters) and a sample of the groups is selected. |uster_sampling | |

|Correspondence analysis |Correspondence analysis is an exploratory data analytic technique designed to analyze simple two-way and multi-way tables containing some measure of correspondence between the rows and columns. As opposed to traditional hypothesis testing designed to verify a priori hypotheses about relations between variables, exploratory data analysis is used to identify systematic relations between variables when there are no (or rather incomplete) a priori expectations as to the nature of those relations. |idams/advguide/Chapt6_5.htm; rrespondence_analysis | |

|Cut-off sampling |In cut-off sampling a probabilistic selection is applied for a part of the population, whereas for the remainder of the population the selection probability is equal to zero. According to the OECD definition, cut-off is a sampling procedure in which a predetermined threshold is established with all units in the universe at or above the threshold being included in the sample and all units below the threshold being excluded. The threshold is usually specified in terms of the size of some known relevant variable. In the case of establishments, size is usually defined in terms of employment or output. |detail.asp?ID=5713; toff | |

|Design effect |The ratio of the variance of an estimate under the complex sample design to the variance of the same estimate that would have been obtained from a simple random sample of the same size. |sign_effect | |

|Element |Objects that possess the information sought and about which inferences are to be made. According to the OECD definition, a data element is a unit of data for which the definition, identification, representation, and permissible values are specified by means of a set of attributes. |detail.asp?ID=538 | |

|Estimator |An estimator is the mathematical function by means of which the estimate of a particular parameter is calculated. |int_estimation; timator | |

|Intra-class or intra-cluster correlation |A measure of the homogeneity in clusters. |traclass_correlation | |

|Multistage sampling |Multistage sampling is a technique in which the sample is selected by stages, the sampling units at each stage being sub-sampled from the (larger) units chosen at the previous stage. The sampling units pertaining to the first stage are called primary or first stage units; and similarly for second stage units, etc. |detail.asp?ID=3726; ltistage_sampling | |

|Parameter |Parameters are functions of the study variable values. They are unknown, quantitative measures (e.g., totals, means etc.) for the entire population or for specified domains which are of interest to the investigation. |detail.asp?ID=4903 | |

|Population |The aggregate of all elements, sharing some common set of characteristics, that comprises the universe for the purpose of the statistical research problem. |detail.asp?ID=2079; mpling_(statistics) | |

|Probability distribution |In probability theory, a probability distribution is a function that describes the probability of a random variable taking certain values. |obability_distribution | |

|Probability-proportional-to-size sampling |Samples are drawn in proportion to their size, giving a higher chance of selection to the larger items. |detail.asp?ID=3839 | |

|Purposive sampling |Selection on deliberate choice of a sample. |detail.asp?ID=3902; nprobability_sampling | |

|Sampling error |In statistics, sampling error is the amount of inaccuracy in estimating some value that is caused by observing only a portion of a population (i.e. a sample) rather than the whole population. |mpling_error | |

|Sampling unit |A sampling unit is one of the units into which an aggregate is divided for the purpose of sampling, each unit being regarded as individual and indivisible when the selection is made. |detail.asp?ID=2381; mpling_(statistics) | |

|Sampling with replacement |When a sampling unit is drawn from a finite population and is returned to that population, after its characteristic(s) have been recorded, before the next unit is drawn, the sampling is said to be “with replacement”. In the contrary case the sampling is “without replacement”. |detail.asp?ID=3835; mple_random_sample | |

|Sampling without replacement |A sampling technique in which an element cannot be included in the sample more than once. |detail.asp?ID=3822; mple_random_sample | |

|Simple random sampling |Each population unit has an equal selection probability and also each combination of n units has an equal selection probability. |detail.asp?ID=3841; mple_random_sample | |

|Standard deviation |Standard deviation is a widely used measure of variability or diversity in statistics and probability theory. It shows how much variation or "dispersion" exists from the average (mean, or expected value). A low standard deviation indicates that the data points tend to be very close to the mean, whereas a high standard deviation indicates that the data points are spread out over a large range of values. |andard_deviation | |

|Stratified sampling |The population is divided into homogeneous, non-overlapping subpopulations called strata, and independent samples are selected from each stratum. |ratified_sampling | |

|Systematic sampling |Each element has an equal selection probability, but combinations of elements have different probabilities. The sampling units are selected from the population at regular intervals. |stematic_sampling; detail.asp?ID=3864 | |

|Target population |Set of elements or objects that possess the information sought by the researcher and about which inferences are to be made. |detail.asp?ID=2079; mpling_(statistics) | |

Literature

Bell, W. R. and Hillmer, S. C. (1990). The Time Series Approach to Estimation for Repeated Surveys. Survey Methodology, 16, No 2, 195-212.

Bethel, J. (1989). Sample allocation in multivariate surveys. Survey Methodology, 15, 47-57.

Cochran, W. G. (1977). Sampling Techniques (3rd edition). John Wiley and Sons. New York, Chichester, Brisbane, Toronto, Singapore.

Dalen, J. (2005). Sampling Issues in Business Surveys. Pilot Project 1 of the European Community's Phare 2002 Multi Beneficiary Statistics Programme (Lot 1).

Dalenius, T. and Hodges, J. L. (1959). Minimum variance stratification. Journal of the American Statistical Association, 54, 88-101.

Ekman, G. (1959). An approximation useful in univariate stratification. Ann. Math. Stat., 30, 219-229.

Eurostat (2008). Survey sampling reference guidelines. Introduction to sample design and estimation techniques.

Glasser, G. J. (1962). On the complete coverage of large units in a statistical study. Int. Stat. Rev., 30, 28-32.

Greenacre, M. J. (2007). Correspondence Analysis in Practice. Chapman & Hall/CRC.

Hansen, M. H., Hurwitz, W. N. and Madow, W. G. (1953). Sample Survey Methods and Theory. Vol. 1, John Wiley and Sons, New York.

Hess, I., Sethi, V. K. and Balakrishnan, T. R. (1966). Stratification: A practical investigation. Journal of the American Statistical Association, 61, 74-90.

Hidiroglou, M. A. (1986). The construction of a self-representing stratum of large units in survey design. The American Statistician, 40, 27-31.

Hidiroglou, M. A., Choudhry, G. H. and Lavallee, P. (1991). A Sampling and Estimation Methodology for Sub-Annual Business Surveys. Survey Methodology, 17, 195-210.

Holland, S. M. (2006). Cluster Analysis. Department of Geology, University of Georgia, Athens, GA 30602-2501.

Kalton, G. (1983). Models in the Practice of Survey Sampling. Int. Stat. Rev., 51, 175-188.

Karmel, T. S. and Jain, M. (1987). Comparison of Purposive and Random Sampling Schemes for Estimating Capital Expenditures. Journal of the American Statistical Association, 82, 52-57.

Kish, L. (1965). Survey Sampling. John Wiley and Sons. New York.

Kish, L. and Frankel, M. R. (1974). Inference from Complex Samples. Journal of the Royal Statistical Society, Ser. B 36, 1-37.

Kish, L. and Anderson, D. W. (1978). Multivariate and Multipurpose Stratification. Journal of the American Statistical Association, 73, 24-34.

Kish, L. (1987). Statistical Design for Research. John Wiley and Sons. New York, Chichester, Brisbane, Toronto, Singapore.

Kish, L. (1988). Multipurpose sample designs. Survey Methodology, 14, 19-32.

Kish, L. (1989). Sampling methods for agricultural surveys. FAO Statistical development series 3. Food and Agriculture Organization of the United Nations. Rome.

Kish, L. (1992). Weighting for Unequal Pi. Journal of Official Statistics, Vol. 8, No 2, 183-200.

Kish, L. (1994). Multipopulation Survey Designs. Int. Stat. Rev., 62, 167-186.

Kish, L. (1995). Methods for design effects. Journal of Official Statistics, Vol. 11, No 1, 55-77.

Lavallee, P. and Hidiroglou, M. A. (1988). On the Stratification of Skewed Populations. Survey Methodology, 14, 33-43.

Montaquila, J. M. and Kalton, G. (2011). Sampling from Finite Populations. Westat, 1600 Research Blvd., Rockville, MD 20850, U.S.A. Based on an article from Lovric, Miodrag (2011), International Encyclopedia of Statistical Science. Heidelberg: Springer Science+Business Media, LLC.

National Audit Office (2001). A Practical Guide to Sampling. Statistical and Technical Team.

Renssen, R. H. (1998). A Course in Sampling Theory. Statistics Netherlands.

Särndal, C. E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. Springer-Verlag, New York.

Sethi, V. K. (1963). A note on optimum stratification for estimating the population means. Journal of the American Statistical Association, 5, 20-23.

Statistics Canada (2003). Survey Methods and Practices. Catalogue no. 12-587-X. Ottawa.

Statistics Canada (2009). Statistics Canada Quality Guidelines. Catalogue no. 12-539-X. Ottawa.

Trochim, W. M. K. (2002). Sampling.

United Nations Statistical Office (1950). The preparation of Sampling Survey Reports, New York: U.N. Series C, No 1.



Specific section – Theme: Design of the sampling methods

1 Interconnections with other modules

[The links to other modules yield additional information of various type relevant to the method described in this module. It also indicates which information is covered by other modules that should therefore be deleted from this module]

1 Related themes described in other modules

2 Methods explicitly referred to in this module

1. Stratification

Boundaries of size classes used as strata

2. Number of size classes used as strata

3. Sample allocation to strata

4. Cluster sampling

5. Multistage sampling

6. Determination of sample size

7. Non-probability sampling

1 Mathematical techniques explicitly referred to in this module

2 GSBPM phases explicitly referred to in this module

1. GSBPM Phase 2.2

3 Tools explicitly referred to in this module

1. Software tools for sample design

4 Process steps explicitly referred to in this module

Sample design

[Figure text of the decision tree of methods. Node labels: Choice of sampling method; High precision needed; No high precision needed; Domain estimations needed; No domain estimations needed; Stratification sampling; Strata = Domains x Size classes; Strata = Size classes; SRS or Systematic sampling or Cluster sampling or Multistage sampling; Simple Random Sampling (SRS); Systematic sampling; Cluster sampling; Multistage sampling; Quota sampling; SRS or Purposive sampling or Cut-off sampling.]
