Current and Alternative Sources of Data on the Science and Engineering Workforce
As has been noted, the S&E population is rare with respect to a general population sampling frame and there is no separate, complete, and up-to-date frame of the S&E population for use throughout the decade. Lacking a separate frame, a large-scale screening exercise is needed to identify a sample of the S&E population. In this situation, it is natural to seek another large-scale data collection to provide the screening. Currently, the decennial census long form serves this purpose, but it is not an ideal vehicle. Because the long form does not include field of degree, further screening is needed to distinguish individuals with S&E degrees from non-S&E degree holders. Also, since the census is conducted only every 10 years, supplementary sampling is needed to update it. This is the role of the NSRCG and SED but, as noted earlier, the update does not include new non-S&E degree earners working in S&E jobs and new immigrants with only foreign degrees.
This section examines the possible use of some large-scale ongoing federal surveys to provide the screening phase of surveys of the S&E population. In considering alternative surveys for this purpose, many factors need to be addressed including the coverage of the target S&E population they provide, the ability to screen out nonmembers of the S&E population, the timeliness and frequency of the surveys, the response rates to the surveys and the likely response rates to the follow-up surveys of the S&E population, issues of operational feasibility, and cost implications. However, as indicated below, an overriding factor that rules out most potential surveys is that their sample sizes are too small for screening purposes; the screening sample size must be large enough to generate the sample sizes needed for the S&E population and for specified subpopulations.
The required sample sizes are, of course, a direct function of the reliability standards established for key survey estimates for the total S&E population and for subpopulations. Appendix D gives the desired and acceptable coefficients of variation (CVs) for estimates of the sizes of various S&E subpopulations of interest that were developed by NSF (e.g., the desired CV for the total S&E population is 0.10% but 0.20% is acceptable). The discussion below explores the screening sample sizes needed to achieve those CVs.
Variance formulas for simple random sampling are employed to provide rough guidance on the size of screener sample needed to satisfy the reliability specifications given in appendix D. In practice, these formulas will likely underestimate the actual variances because the design effects will exceed 1 as a result of unequal selection probabilities, clustering, and weighting adjustments for nonresponse. Also, no allowance is made for the sample loss from nonresponse. Thus, the screener sample sizes presented are underestimates of those actually required.
To achieve the desired CV of 0.10% for the total S&E population, a screening sample of the general population would need to be large enough to yield a sample of about 893,000 scientists and engineers. If the frame consists of about 101.1 million households, approximately 7.25 million households would have to be screened to obtain a sample of 893,000 members of an S&E population of about 12.5 million people. Moreover, this very large sample size is calculated without any allowance for a design effect greater than 1 or for survey nonresponse. A sample of this size, already larger than is operationally feasible, still would not satisfy the CV requirements for some rare subgroups of the S&E population, such as race/ethnicity groups. These rare subgroups would have to be oversampled substantially to achieve the required CVs, which would further increase the screening cost. To obtain the acceptable CV of 0.20% for the estimate of the size of the total S&E population, more than 1.8 million households would need to be screened, again assuming no clustering, equal selection probabilities, and complete response.
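As a rough check, the screener-size arithmetic can be sketched under the same simple-random-sampling assumptions. The average household size (about 2.67 persons) and the one-pass treatment of the finite population correction are illustrative assumptions, not figures quoted above, so the results only approximate the household counts cited in the text.

```python
# Rough screener-size check under simple random sampling (design effect = 1).
# Figures from the text: ~12.5M S&E members, ~101.1M U.S. households.
# The persons-per-household value (~2.67) is an assumption for illustration.

SE_POP = 12.5e6          # total S&E population
HOUSEHOLDS = 101.1e6     # total U.S. households
PERSONS_PER_HH = 2.67    # assumed average household size
US_POP = HOUSEHOLDS * PERSONS_PER_HH
P_SE = SE_POP / US_POP   # probability a sampled person is in the S&E population

def households_needed(target_cv):
    """Households to screen so the estimated S&E total reaches target_cv.

    Uses the SRS approximation CV^2 ~= (1 - f) * (1 - p) / m, where m is the
    expected S&E sample count and f the household sampling fraction; the
    finite population correction is applied in a single pass, so the result
    is approximate.
    """
    m = (1 - P_SE) / target_cv**2        # required S&E sample, ignoring fpc
    hh = m / (P_SE * PERSONS_PER_HH)     # households yielding that sample
    f = hh / HOUSEHOLDS
    return hh * (1 - f)                  # one-pass finite population correction

for cv in (0.001, 0.002):
    hh = households_needed(cv)
    print(f"CV {cv:.1%}: screen ~{hh/1e6:.2f}M households "
          f"(~{hh * PERSONS_PER_HH * P_SE/1e3:.0f}K S&E sample members)")
```

Under these assumptions the desired 0.10% CV requires roughly 7 million screened households and the acceptable 0.20% CV roughly 1.9 million, in line with the figures above.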
We consider the following three options of screener sample size in terms of number of households for a general household population survey:
Tables 1, 2, and 3 give the approximate CVs of the survey estimates for various sized subgroups of the S&E population for the above three options assuming equal probabilities of selection, no clustering of households, and a total of 101.1 million households in the United States. The achieved CVs actually would be higher, depending on the design effect.
The CVs for selected subgroups of the S&E population known to be of interest to user groups are presented in table 4. These CVs are computed under the three options, and assume no oversampling and no clustering effect.
As table 4 shows, none of the options can satisfy the precision requirement for estimating the number of female scientists and engineers given in appendix D. The acceptable CV is 0.3%, whereas even option C, with a screener sample size of 2.0 million households, achieves a CV only as low as 0.4%. The screener sample size would have to be increased to about 3.6 million households to achieve the acceptable CV. Similarly, the required (or acceptable) CVs for estimating the numbers of Hispanic and black scientists and engineers cannot be achieved even under option C. The screener sample size would have to be increased to about 2.6 million households to achieve the required CV of 1.0% for Hispanics and blacks. On the other hand, the required CV of 1.0% for the estimate of the size of the Asian S&E population can be met even under option A, using a screener sample size of 1.0 million households. It should be noted that the above CVs are obtained under the assumption that the survey design effect is equal to 1 (i.e., simple random sampling of households and no clustering of S&E within households). In practice, however, the design effect would be somewhat larger because of clustering of the sample and unequal selection probabilities.
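The pattern behind these comparisons is the rare-subgroup rule of thumb: under simple random sampling, the CV of an estimated subgroup total is roughly one over the square root of the expected subgroup sample count. A short sketch follows; the 0.031 females-per-screened-household rate is inferred from the figures above for illustration, not a quoted value.

```python
import math

# Rule-of-thumb check: for a rare subgroup under simple random sampling,
# the CV of the estimated subgroup total is roughly 1/sqrt(m), where m is
# the expected number of subgroup members in the screened sample.

def subgroup_sample_for_cv(target_cv):
    """Subgroup sample count m needed so that 1/sqrt(m) <= target_cv."""
    return math.ceil(1.0 / target_cv**2)

def households_for_cv(target_cv, subgroup_per_hh):
    """Households to screen, given the expected subgroup members found per
    screened household (an assumed rate, not a figure from the text)."""
    return subgroup_sample_for_cv(target_cv) / subgroup_per_hh

# A 1.0% CV needs ~10,000 subgroup members in sample; a 0.3% CV needs
# ~111,112.  At ~0.031 female S&E members per screened household, the
# 0.3% target implies ~3.6 million households, matching the text.
print(subgroup_sample_for_cv(0.01))                       # 10000
print(subgroup_sample_for_cv(0.003))                      # 111112
print(round(households_for_cv(0.003, 0.031) / 1e6, 1))    # 3.6
```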
The above discussion indicates that a very large screener sample size is needed to generate sample sizes that meet the CV requirements for some small S&E population subgroups. Note, however, that not all of the rest of the screened sample of the S&E population is needed for the survey. A subsample of sufficient size can be selected to satisfy the reliability requirements for other subgroups and for combinations of them.
With the very large number of households that would need to be screened to meet the various reliability requirements, conducting a new independent survey of scientists and engineers with its own screening would be extremely costly. For this reason, the approach adopted here is to explore the effectiveness of using an existing large-scale household survey for the screening.
Use of multiple frames is beneficial in sample surveys when the combined frames offer a more efficient means of accomplishing the survey objectives. The current SESTAT is in fact an example of a multiple-frame approach. A multiple-frame approach would also likely be needed with the use of one of the alternative surveys under investigation for screening for the S&E population. These surveys have, in theory, almost complete coverage, and they could be used to screen for a representative sample of the entire S&E population. However, given their current sample sizes, they would yield very few individuals with S&E doctoral degrees, and a general screening of the population for scientists and engineers with doctoral degrees would be very expensive. To address this problem, the SDR could be continued to provide the required sample sizes for doctorate recipients, with new cohorts being sampled from the frame provided by the SED, as at present. The estimates for the S&E population with doctoral degrees could then be constructed using one of the following options:
Similarly, a dual-frame design could be considered for the S&E population with graduate degrees from abroad. For example, one could explore the feasibility of using lists from the Immigration and Naturalization Service (INS) to identify immigrant scientists and engineers who obtained their degrees abroad. Such a design would be difficult to implement, but might be worth considering if the alternative survey frame lacked coverage of this portion of the S&E population.
As the preceding discussion has shown, if another household survey is to be used as a first-phase screening survey for locating members of the S&E population, the survey would need to have a very large sample size. The main advantages of linking the S&E workforce survey with an existing ongoing federal survey would be decreased cost and improved efficiency by consolidating field data collection operations that are common to the different surveys. The remainder of this section examines the possible use of the American Community Survey and other large-scale federal household surveys, including the Current Population Survey, for screening for the S&E population. The possibility of using an existing large-scale telephone survey, such as the National Immunization Survey, to screen for members of the S&E population is also examined. With a household telephone survey, consideration needs to be given to the increasingly serious problem of nonresponse that such surveys often face.
Another very different approach is to sample members of the S&E population at their places of work, using an establishment-based sample design. The use of a list of establishments as the frame to survey employed scientists and engineers is examined in the section "Establishment-Based Sample Design."
The American Community Survey (ACS) is a rolling sample survey (Kish 1990) that is being developed by the U.S. Census Bureau as an intended replacement for the census long form. The ACS will include the same detailed socioeconomic subject areas as the census long form questionnaire. Instead of collecting data from about 17 million housing units at one time (as is done during the decennial census), the ACS will sample about 250,000 addresses each month, or some 3 million each year throughout the decade (Alexander, Dahl, and Weidman 1997). The ACS is currently in the development and testing stage. The sampling frame will be the Census Bureau's Master Address File, which will be updated throughout the decade to keep it fully current. The sample will be distributed throughout the country with no clustering, and with higher sampling fractions in small governmental units. Each address could be sampled at most once in a 5-year period.
The ACS will be conducted using a mail-out, mail-return, self-response approach, combined with initial CATI follow-up, supplemented by a CAPI follow-up of a subsample of the remaining nonrespondents. As an ongoing survey, ACS is a flexible vehicle capable of adapting to changing user needs. Once fully implemented, there is the potential to add supplemental questions on subjects of current interest or to help identify special population groups. It must be noted, however, that it is not necessarily easy to add questions to the ACS and in no case will changes be made before 2008. It is likely that a legal reason will be required for the inclusion of new questions, in addition to several years of lead time for testing.
Once the ACS is fully implemented, estimates will be available annually for areas and population groups of 65,000 or more people. Estimates for smaller areas will be provided on a multiple-year average basis. Estimates for the smallest areas or for small population groups will be available on a 5-year average cycle, with reliability consistent with that provided for these groups in recent decennial censuses.
The ACS could offer a range of sampling options, flexibility in design and content, and more current data for analysis than the census long form. It also could be used to efficiently identify households or individuals with unique characteristics. Once identified, follow-up interviews could be conducted by mail, by telephone, or in person, if required, to meet the needs for more detailed information about these households. Furthermore, the ACS could not only provide more timely data for use in designing the S&E workforce surveys but could also improve survey coverage. If the ACS were used as a screener for the S&E population, the problems associated with pooling S&E cases across multiple years to achieve sufficient samples of scientists and engineers would need to be addressed.
Because ACS is planned to be a continuous survey, each round of the NSF data collection could be based on a fresh sample selected from the ACS, thus covering the full current S&E population (a repeated cross-sectional survey approach). An approach using recent ACS samples could lead to changes in content, frequency, or sample design of the NSCG. The ACS might be able to provide the coverage updating function of the NSRCG samples, although the small size of this subpopulation may present problems in having enough cases in the ACS samples. Because the ACS would not provide a large enough sample of doctorate recipients at the small domain levels necessary (e.g., field by race/ethnicity by sex), it would seem desirable to continue with a separate SDR survey. However, the ACS could provide representation of scientists and engineers graduating abroad and non-S&E graduates working in S&E occupations. If the census long form is replaced by the ACS, the use of the ACS as a vehicle for conducting the NSF S&E workforce surveys needs to be explored. Furthermore, NSF needs to determine whether using the ACS could result in significant cost savings and provide improved coverage of the S&E population.
If implemented, the ACS will be an attractive option for collaboration with an existing survey. Given the ACS annual sample size of 3 million housing units and assuming a completion rate of 75%, data will be collected from some 2.25 million housing units, or more than 6 million people, which will yield an annual S&E sample of more than 275,000 individuals.
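This yield can be traced directly. The average household size and the S&E share of persons below are assumed values chosen to be consistent with the population figures used earlier in this section, not quantities quoted for the ACS itself.

```python
# Back-of-envelope ACS screener yield, using the text's sample size and
# completion rate plus two assumed rates.

ACS_ANNUAL_HH = 3.0e6     # annual ACS sample of housing units (from text)
COMPLETION_RATE = 0.75    # assumed completion rate (from text)
PERSONS_PER_HH = 2.67     # assumed average household size
SE_SHARE = 12.5 / 270     # assumed S&E share: ~12.5M of ~270M persons

responding_hh = ACS_ANNUAL_HH * COMPLETION_RATE   # ~2.25M housing units
persons = responding_hh * PERSONS_PER_HH          # ~6.0M people
se_sample = persons * SE_SHARE                    # ~278K S&E sample members

print(f"{responding_hh/1e6:.2f}M households, {persons/1e6:.1f}M persons, "
      f"~{se_sample/1e3:.0f}K annual S&E sample")
```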
Issues to Consider. Issues include the following:
The ongoing National Immunization Survey (NIS), conducted by the National Center for Health Statistics (NCHS) and the National Immunization Program of the Centers for Disease Control and Prevention, could offer an alternative opportunity for screening for a sample of the S&E population. The NIS is a large-scale random digit dialing (RDD) survey that uses CATI methods to screen more than 900,000 households each year. The survey was initiated in April 1994 to monitor vaccination coverage levels of children 19–35 months of age on an ongoing basis. The NIS covers all 50 states and the District of Columbia. The sample is allocated to produce national estimates and separate estimates for 78 Immunization Action Plan (IAP) areas.
The use of a list-assisted RDD approach for sampling households in the NIS results in noncoverage of households without telephones and households with unlisted numbers in 100-banks (blocks of 100 consecutive telephone numbers) that contain no listed numbers. Although approximately 5% of the households in the United States do not have a telephone, this is probably a minor concern for households containing individuals in the S&E population. Noncoverage as a result of residential unlisted telephone numbers falling in 100-banks with no listed numbers is less than 2%, which is negligible.
For purposes of identifying the eligible S&E population, it would be necessary to modify the NIS screening questionnaire by adding special questions. If the NIS sample is not large enough to meet all the sample size requirements for the S&E data system, the set of special questions could also be made into a separate instrument to use in screening an additional sample of households for members of the S&E population. The cost of administering such a screener, which might be substantial, is of real concern and would need to be explored completely and carefully. The main advantage of an independent screener is that the screening instrument would be developed to suit the needs of S&E data user requirements. In view of the rarity of those with S&E doctorates, it would almost certainly still be necessary to augment the screened sample with a sample of S&E doctoral degree holders from the SDR.
Sample Design. The NIS uses a two-phase sample design. For the first phase, a quarterly sample of telephone numbers is drawn for each IAP area, and a screening questionnaire is administered to locate households with one or more children 19–35 months of age. When an eligible child is found, the person most knowledgeable about the child's vaccinations is identified. If that person is available, the full interview is administered at that time; otherwise the interviewer arranges a time to call back. Both the screening and the immunization interviews are conducted by CATI.
The NIS seeks to attain a coefficient of variation no larger than 5.0% for the annual vaccination coverage estimates in each IAP area. To satisfy this precision requirement, the sample size target for NIS is set at 440 eligible children per IAP area for each four-quarter period, or 110 per quarter. To achieve a sample of 110 children per quarter for each IAP area, the sampling rates vary across IAP areas according to the sizes of the areas. For example, the larger IAP areas are sampled at lower sampling rates.
The NIS target sample size is 8,580 completed household interviews per quarter for all 78 IAPs (110 per IAP area). For 1994, an average of 512,800 sample telephone numbers were drawn per quarter, and 420,500 sample telephone numbers were actually dialed per quarter by interviewers after the prescreening for business and nonworking numbers had removed 18% of the initial sample. Assuming that approximately 60% of the numbers are residential numbers and there is no nonresponse, the sample yield would be 252,300 residential telephone numbers per quarter. By accumulating the sample over a period of 2 years, the sample size would be more than 2 million households, or 5.4 million people. The expected sample of individuals in scope for the S&E population would be in excess of 240,000. However, because the NIS sample is designed to produce estimates for each of the 78 IAPs, and the IAP areas vary considerably in population size, the NIS file contains highly differential weights that lead to a sizable loss in precision for national estimates. Also, allowance needs to be made for nonresponse to the S&E screener and follow-up data collection.
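The quarterly yield arithmetic above can be traced step by step. The residential-number rate follows the text's assumption; the household size and S&E share are assumed values consistent with the figures used elsewhere in this section.

```python
# Tracing the NIS yield arithmetic from the 1994 figures in the text.

DRAWN_PER_QUARTER = 512_800   # sample telephone numbers drawn per quarter
PRESCREEN_REMOVED = 0.18      # business/nonworking numbers removed by prescreen
RESIDENTIAL_RATE = 0.60       # assumed share of dialed numbers that are residential
PERSONS_PER_HH = 2.67         # assumed average household size
SE_SHARE = 12.5 / 270         # assumed S&E share of persons (~4.6%)

dialed = DRAWN_PER_QUARTER * (1 - PRESCREEN_REMOVED)   # ~420,500 per quarter
residential = dialed * RESIDENTIAL_RATE                # ~252,300 per quarter
households_2yr = residential * 8                       # 8 quarters of accumulation
persons_2yr = households_2yr * PERSONS_PER_HH
se_sample = persons_2yr * SE_SHARE                     # in-scope S&E individuals

print(f"{dialed:,.0f} dialed/qtr -> {residential:,.0f} residential/qtr; "
      f"2 years: {households_2yr/1e6:.2f}M households, "
      f"{persons_2yr/1e6:.1f}M persons, ~{se_sample/1e3:.0f}K S&E")
```

Under these assumptions the 2-year accumulation comes to just over 2 million households and roughly 250,000 in-scope S&E individuals, consistent with the figures above.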
Issues to Consider. Issues include the following:
Several other federally sponsored household surveys are conducted on a continuous basis. For example, the Current Population Survey (CPS) has been conducted monthly by the Census Bureau since 1942. Other examples of household survey vehicles that could be explored for studying the S&E workforce are the Census Bureau's Survey of Income and Program Participation (SIPP) and the National Health Interview Survey (NHIS) of NCHS.
Two options would be available for using these surveys to screen for the S&E population: (1) accumulate samples over time and conduct the S&E workforce survey at the required intervals; or (2) collect data on a continuing basis and accumulate data over time to produce estimates at the required intervals. Under option 1, the costs associated with tracing and locating respondents could be very high. On the other hand, the estimates produced under option 2 would not reflect the S&E population for a single point in time, but rather average values over the data collection time period (as is also the case with the ACS).
Current Population Survey. The CPS is a monthly survey of about 50,000 households conducted jointly by the Census Bureau and the Bureau of Labor Statistics (BLS). The CPS is the primary source of information on the labor force characteristics of the U.S. population. The sample is scientifically selected to represent the civilian noninstitutionalized population. Respondents are interviewed to obtain information about the employment status of each member of the household who is 15 years of age and older.
Estimates obtained from the CPS include employment, unemployment, earnings, hours of work, and other indicators. They are available by a variety of demographic characteristics including age, sex, race, marital status, and educational attainment. They are also available by occupation, industry, and class of worker.
The CPS employs a stratified multistage clustered sample design. Since the inception of the survey, there have been various changes in the design of the CPS sample. The survey is traditionally redesigned and a new sample selected after each decennial census. The current sample design, introduced in January 1996, includes about 59,000 housing units from 754 sample areas. The number of eligible households is about 50,000, and the number actually interviewed is about 46,800 every month, i.e., about 94% of eligible households respond to the survey. The CPS monthly sample of 46,800 households cannot support the analytic needs of SRS. The CPS sample is designed to support the measurement of the U.S. labor force as a whole rather than that of specialized populations such as the S&E workforce. The CPS sample would need to be aggregated over time to be a possible option for studying the S&E workforce.
The CPS employs a rotating panel design in which only one-eighth of the sample is changed each month. Each monthly sample comprises eight representative subsamples, or rotation groups. A given rotation group is interviewed for a total of 8 months, divided into two equal periods: the group is in the sample for 4 consecutive months, leaves the sample during the following 8 months, and then returns for another 4 consecutive months. In each monthly sample, one of the rotation groups is in the first month of enumeration, another is in the second month, and so on. Because of this rotating panel design, only about 158,000 unique households are interviewed over a 12-month period. Over a 24-month period, the accumulated sample size would be 228,000 households, and over a 3-year period, 298,000 households. Even after accumulating the sample over 3 years, the screened sample of individuals in scope for the S&E workforce surveys would be only approximately 36,000 individuals. The effective sample size would be even smaller for two reasons: design effects attributable to disproportional allocation of the sample, and intracluster correlation arising because the unique households are drawn from the same primary sampling units and segments (a clustering effect). Thus, the sample sizes would not be large enough to produce S&E estimates with the required reliability.
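The unique-household counts implied by the 4-8-4 rotation can be verified with a small steady-state sketch. The per-group size of 5,850 households is simply 46,800 divided by 8; the assumption is that the rotation has been running long enough that groups from earlier entry months are already in the field.

```python
# Counting unique households under the CPS 4-8-4 rotation: each month one
# new rotation group enters, serves 4 months, rests 8, and returns for 4.

GROUP_SIZE = 46_800 // 8   # households per rotation group (~5,850)

def in_sample(entry_month, month):
    """True if the group that entered at entry_month is interviewed in month."""
    offset = month - entry_month
    return 0 <= offset <= 3 or 12 <= offset <= 15   # 4 in, 8 out, 4 in

def unique_households(window_months):
    """Unique households interviewed during months 1..window_months, assuming
    the rotation is in steady state (earlier cohorts already in the field)."""
    cohorts = [e for e in range(-20, window_months + 1)
               if any(in_sample(e, m) for m in range(1, window_months + 1))]
    return len(cohorts) * GROUP_SIZE

for window in (12, 24, 36):
    print(window, unique_households(window))   # 157950, 228150, 298350
```

The sketch reproduces the approximate 158,000, 228,000, and 298,000 figures: each additional year adds only 12 new cohorts of about 5,850 households.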
In response to a legislative mandate under the State Children's Health Insurance Program (SCHIP), the Census Bureau expanded the monthly sample for the CPS in 2000. This expansion was introduced over a 3-month period, beginning with the September 2000 survey, and occurred in 31 states and the District of Columbia. In all, the total number of households eligible for the monthly survey increased from about 50,000 to about 60,000 households.
The SCHIP legislation requires that the Census Bureau improve state estimates of the number of children who live in low-income families and lack health insurance. The expansion of the monthly CPS sample is one part of the Census Bureau's plan for improving the SCHIP estimates. Other parts of the plan include an increase in the number of households that are asked the questions from the annual March supplement to the CPS, the source of information on income and access to health insurance.
The increased sample yields roughly 357,600 interviewed unique households over a 3-year period. Therefore, even the increased CPS sample would not meet the S&E sample size requirements. Moreover, the increase is aimed at improving state-level estimates; because most of the additional sample is allocated to smaller states, it would also lead to larger design effects for national estimates.
Survey of Income and Program Participation. The SIPP, conducted by the Census Bureau, provides information on the economic situation of households and individuals in the United States. SIPP began in late 1983 with a design that attempted a compromise between the twin goals of collecting accurate cross-sectional and longitudinal data on income and program participation by using a multiple-panel overlapping design. A revised design was introduced in April 1996 to focus primarily on providing accurate and useful longitudinal data by using abutting 4-year panels.
There are three basic elements contained in the overall design of the survey content. The first is a control card, which is used to record basic social and demographic characteristics for each person in the household at the time of the initial interview. The second major element of the survey content is the core portion of the questionnaire. The core questions are repeated at each interview and cover labor force activity, the types and amounts of income received during the 4-month reference period, and participation status in various programs. The third major element is the various supplements or topical modules that are embedded during selected household visits.
The sample for the first SIPP panel in 1983 consisted of about 20,000 households selected to represent the noninstitutionalized population of the United States. The 1996 panel has a sample size of approximately 36,800 households. Households in this SIPP panel were interviewed at 4-month intervals over a period of 4 years. The reference period for the questions is the 4-month period preceding the interview. The sample households within a given panel are divided into four samples of nearly equal size. These subsamples are called rotation groups, and one rotation group is interviewed each month. In general, one cycle of four rotation groups covering the entire sample using the same questionnaire is called a wave. The rotation group design was chosen because it provides a steady workload for data collection and processing.
Data collection operations are managed through the Census Bureau's 12 permanent regional offices. A staff of interviewers assigned to SIPP conducts interviews during monthly personal visits, with most interviewing completed during the first 2 weeks of that month. Completed questionnaires are transmitted to the regional offices where they undergo an extensive clerical edit before being entered into the Bureau's SIPP data processing system.
The Census Bureau's current working plan for future panels is shown in table 5. Each of these planned panels will be interviewed every 4 months over a 3-year period. A large panel is started every third year, with smaller panels starting in other years. The Census Bureau delayed the beginning of the second large panel from 2000 to 2001 because of operational considerations associated with the 2000 decennial census. Moreover, the 2004 (and subsequent) panels will be state-representative, but they will not produce reliable state-level estimates unless some additional sample can be included. The additional sample would not only improve the reliability of the state-level estimates but would also compensate for the loss of efficiency for national estimates resulting from differential weights.
As can be seen from table 5, even after accumulating data over all the proposed panels from 2000 through 2004, the sample size will be only 107,600 households. Because that number is not sufficient for the S&E workforce survey, SIPP is not a viable option for studying the S&E workforce.
National Health Interview Survey. The NHIS, which has been in continuous operation since 1957, is designed to produce national and selected subnational estimates of health indicators, health care utilization and access, and health-related behaviors for the U.S. resident civilian noninstitutionalized population. The NHIS is conducted by NCHS, a component of the Centers for Disease Control and Prevention, U.S. Public Health Service, in the Department of Health and Human Services. In accordance with specifications established by NCHS, the Census Bureau participates in the planning of the NHIS and in the data collection. The NHIS sample has been redesigned after each decennial census, and the specific parameters of the design have changed over time. For example, the 1973–84 NHIS design was based on a sample of 386 primary sampling units (PSUs), the 1985–94 design on a sample of 198 PSUs, and the current 1995–2004 design on a sample of 358 PSUs. The NHIS sample for the data collection years 1995 to 2004 was designed to improve the precision for various domains defined by race and ethnicity and to enhance the survey's ability to provide state estimates.
The estimates from NHIS can be produced for various population subgroups, including those defined by age, sex, race, family income, geographic region, and place of residence. The number of interviewed households per year is about 41,500. The total number of screened housing units is much larger, about 71,500 per year including nonrespondent and vacant housing units. Compared with other households, the NHIS oversamples Hispanic households at a relative ratio of 2.1:1 and black households at a relative ratio of 1.4:1 to improve the estimates for the Hispanic and black populations. The approximately 30,000 housing units screened but not interviewed are white and "other race" housing units, as well as vacant and nonrespondent units.
The NHIS is based on a stratified multistage area sampling design with clustering of housing units. The PSU is a metropolitan statistical area or a group of one or more counties. The PSUs are stratified by region and state, with some oversampling occurring in small states. An area segment sample of housing units is selected at the second stage of selection. Most housing units constructed since 1990 are separately sampled from new construction permit records.
The NHIS sample is randomly partitioned into four nationally representative subsamples of approximately equal size and conceptually similar statistical features. These partitions are referred to as "panels," each of which contains about 104 PSUs. The largest self-representing PSUs are included in all panels, and no non-self-representing (NSR) PSU is included in more than one panel. The sample is also divided into a number of temporal subdesigns. The sample is first divided into subdesigns that are assigned for data collection for each year in the period from 1995 to 2004. Annual NHIS samples are then divided into 52 weekly interviewer assignment samples, with each weekly sample constituting a national probability sample of housing units. An average NSR PSU has five weekly assignments during a year. Large self-representing PSUs have assignments in many more weeks per year. NCHS processes groups of 13 weekly samples that correspond to quarters of the calendar year to produce national estimates for each quarter.
Given the very small NHIS sample size, which in fact would have to be accumulated for more than 10 years to obtain an adequate S&E sample, the option of using the NHIS to screen for the S&E population is not feasible.
Another approach is to sample the S&E population by place of work, using an establishment-based sample design. As part of the ongoing SRS program, one survey is already being conducted using lists of companies as a sampling frame. The annual Survey of Research and Development in Industry collects information on R&D expenditures and employment of scientists and engineers from a nationally representative sample of about 23,000 companies. The survey was started in 1992 and includes data from both manufacturing and nonmanufacturing companies. The frame for the annual Survey of Research and Development in Industry is a potential establishment-based frame for finding members of the S&E workforce.
Alternatively, the sampling frame for the S&E data collection could be constructed using establishment-level primary and secondary Standard Industrial Classification (SIC) codes to identify establishments likely to employ members of the S&E population. A sample of employees who are in scope for the S&E workforce could then be selected from the sampled establishments. There are several business frames and databases that could be used as frame sources. Some business frames, such as the Dun's Market Identifiers (DMI) file of Dun and Bradstreet Information Services, are fairly comprehensive. The DMI contains more than 10 million records on establishments, both small and large. It is an establishment-based frame but also contains data on corporate structure. The DMI database was used as a sampling frame for the 1994 National Employee Health Insurance Survey (NEHIS) sponsored by NCHS. According to an assessment study by Marker and Edwards (1997), the database covers about 99% of the universe. Coverage of family farms and the self-employed, however, is somewhat weak. Although family farms are not especially important for an S&E workforce study, the undercoverage of the self-employed may prove much more significant. Recently established small businesses are also likely to be missed disproportionately. Coverage of employees is probably higher than coverage of establishments, because large establishments are much more likely to be included in the frame.
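A two-stage design of this kind (establishments first, then employees within them) is commonly implemented by selecting establishments with probability proportional to an employment measure of size. The following is a minimal sketch of systematic PPS selection, not drawn from any NSF or NEHIS system; the employment counts and the function name `pps_systematic` are hypothetical:

```python
import random

def pps_systematic(sizes, m, seed=1):
    """Systematic PPS selection: choose m units with probability
    proportional to the given measures of size (e.g., employment).
    Returns the indices of the selected units. For simplicity this
    sketch ignores certainty selections (units larger than the
    sampling interval)."""
    rng = random.Random(seed)
    total = float(sum(sizes))
    interval = total / m
    start = rng.uniform(0, interval)
    points = [start + k * interval for k in range(m)]
    selected, cum, i = [], 0.0, 0
    for p in points:
        # Advance through the cumulated sizes until the random point
        # falls inside unit i's size interval.
        while cum + sizes[i] <= p:
            cum += sizes[i]
            i += 1
        selected.append(i)
    return selected

# Hypothetical establishment employment counts; larger establishments
# have a proportionally higher chance of selection.
employment = [500, 40, 25, 300, 10, 125]
sample = pps_systematic(employment, m=2)
```

Employees would then be subsampled within each selected establishment, which keeps the overall probability of selecting any given employee approximately constant.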
A weakness of the DMI file is its continued inclusion of many small establishments that are no longer in business. This weakness does not cause bias, but it entails increased costs to identify and eliminate sampled establishments that are no longer in operation, and it increases the variances of the survey estimates. The reliability of the estimates can, however, be improved through poststratification.
The DMI file contains information about business establishments (individual business locations) in the private sector as well as government entities. Each record in the private-sector portion of the file represents a business establishment and includes basic information such as company name, address, telephone number, and names of corporate officers. The file also provides information about total sales and the number of employees. To reduce cost, a DMI abstract file containing a limited number of design-related variables could be acquired for survey design and sample selection purposes, with the full range of data items obtained only for sampled records; purchasing the full file for all records would be very costly. The number of employees is missing for about 13% of the records, but these records could be treated as a separate category for sample design purposes.
Both the U.S. Census Bureau and BLS maintain business registers for use as sampling frames for their business surveys (see appendix E). Because of confidentiality regulations, these registers are not available to other government agencies. They can be used only if the Census Bureau or BLS conducts the survey and controls the data. Otherwise, the DMI file is the only establishment list that can be used as a sampling frame for the NSF workforce data. The population control totals can, however, be tabulated from the BLS list of establishments for ratio adjustment (poststratification). Wallace et al. (1995) describe how the respondent weights in the private-sector portion of the NEHIS were ratio-adjusted (poststratified) to align with independent estimates of the number of employees provided by BLS. This weight ratio adjustment method reduces the sampling variability in estimates that are correlated with the number of employees, and it also provides a mechanism for adjusting for sampling frame undercoverage. It should be noted that there would still be a risk of potential bias for subgroups for which there is systematic undercoverage on the frame that is not accounted for in the poststratification adjustment.
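The ratio adjustment described by Wallace et al. can be illustrated schematically. In this sketch the poststratum labels, weights, and control totals are all hypothetical; within each poststratum, respondent weights are scaled so that they sum to the independent (e.g., BLS) employee total:

```python
def poststratify(weights, strata, controls):
    """Ratio-adjust weights so that, within each poststratum, they
    sum to the independent control total for that stratum."""
    # Sum the current weights in each poststratum.
    sums = {}
    for w, s in zip(weights, strata):
        sums[s] = sums.get(s, 0.0) + w
    # Multiply each weight by (control total / current weighted sum).
    return [w * controls[s] / sums[s] for w, s in zip(weights, strata)]

# Hypothetical respondents in two poststrata, with control totals of
# the kind BLS could supply.
weights = [10.0, 12.0, 8.0, 20.0]
strata = ["A", "A", "B", "B"]
controls = {"A": 44.0, "B": 35.0}
adjusted = poststratify(weights, strata, controls)
# Within each poststratum the adjusted weights now sum to the control
# total, which also compensates for frame undercoverage in that cell.
```

As the text notes, this adjustment helps only for undercoverage that varies across the poststrata used; systematic undercoverage within a poststratum remains a potential source of bias.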
The DMI file could be used as a sampling frame to study only the employed portion of the S&E workforce population. The frame does not cover those who are unemployed or not in the labor force, and it misses some of the self-employed. These groups are often important subjects for SRS workforce information efforts. A further concern, even for studying the employed, is whether businesses (or government or academic institutions, for that matter) would provide NSF or its contractor either with access to the employees or with the names and addresses needed to contact them. Perhaps establishment surveys should be considered as a collection vehicle only for special modules in which the data would be collected directly from the establishments.
With a simple random sample, the estimated number of elements in the population with a given characteristic is given by Np, where N is the population size and p is the sample estimate of the proportion of the population with that characteristic. The CV of this estimated number is approximately √((1 − P)/(nP)), where P = R/N, R is the number of elements in the population with the characteristic, and n is the sample size (see, for example, Cochran 1977, section 3.2). These formulas have been used in the calculations in this section under the simplifying assumption that each household contains at most one member of the S&E population. With this assumption, the simple random sampling formulas can be applied with the household as the unit of analysis. The total number of households in the United States has been taken to be N = 101.1 million, and R denotes the number of members of the S&E population or subpopulation of interest.
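Inverting the CV formula above gives the required screening sample size, n = (1 − P)/(P · CV²). A small illustration, using the N = 101.1 million household total from the text and a hypothetical subpopulation size R (the function names are illustrative only):

```python
import math

def required_screening_sample(N, R, target_cv):
    """Screening sample size n such that the estimated count Np has
    approximately the target CV under simple random sampling of
    households (at most one S&E member per household)."""
    P = R / N
    return (1 - P) / (P * target_cv ** 2)

def cv(N, R, n):
    """CV of the estimated count Np for a screening sample of size n."""
    P = R / N
    return math.sqrt((1 - P) / (n * P))

# Hypothetical subpopulation of R = 1 million among N = 101.1 million
# households, targeting a CV of 0.10: roughly 10,000 screened
# households would be needed. Rarer subpopulations drive n up sharply,
# since n grows roughly as 1/P for small P.
n = required_screening_sample(101_100_000, 1_000_000, 0.10)
```

The 1/P behavior for small P is what makes rare-population screening so expensive, and it is the quantitative basis for the sample-size objection raised throughout this section.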
However, because RDD methods can reach only landline telephones, the increasing number of households with only cell phones is an emerging coverage problem.