The Survey of Industrial Research and Development is the primary source of information on R&D performed by industry within the fifty states and the District of Columbia. The results of the survey are used to assess trends in R&D expenditures. Government agencies, corporations, and research organizations use the data to investigate productivity determinants, formulate tax policy, and compare individual company performance with industry averages. Individual researchers in industry and academia use the data to investigate a variety of topics and to prepare professional papers, dissertations, and books. The usefulness of the information collected in this survey is enhanced by the linkage of the data file to the Census Bureau's Longitudinal Establishment Data file, which contains information on the outputs and inputs of companies' manufacturing plants. Further, total R&D expenditure statistics are used by the Bureau of Economic Analysis for inclusion in its System of National Accounts and Foreign Direct Investment programs. Prior to 2001, completion of four items on the questionnaire was mandated by law: sales, total number of employees, total R&D, and Federally funded R&D. Beginning in 2001, response to the item that asks for the distribution of total R&D by state was also required.
The Survey of Industrial Research and Development is an annual sample survey that intends to include or represent all for-profit R&D-performing companies, either publicly or privately held. The survey is completed by representatives at manufacturing and nonmanufacturing companies known to conduct R&D and by representatives from samples of companies in both sectors that may conduct R&D. A company is defined as one or more establishments under common domestic ownership or control. In some cases representatives at the establishment level return forms, so that more than one form per company is returned. In these cases, company data are aggregated during processing.
c. Key variables
- Expenditures for research and development
- North American Industry Classification System (NAICS) code
- Company size
- Total employment
- Source of financing (company or Federal)
- Character of R&D work (basic research, applied research, and development)
- Geographic location (within the 50 United States and D.C.)
- Research and development scientists and engineers (full-time equivalent)
- Type of cost (salaries, fringe benefits, etc.)
2. Survey Design
a. Target population and sample frame
The target population consists of all industrial companies with 5 or more employees that perform R&D in the United States. Companies represented on the Business Register (BR), a Bureau of the Census compilation that contains information on more than 3 million establishments with paid employees, make up the population from which the frame used to select the survey sample is created. For companies with more than one establishment, data are summed to the company level. The frame from which the survey sample is drawn includes all for-profit companies classified in nonfarm industries. For surveys prior to 1992, the frame was limited to companies above certain size criteria based on number of employees. These criteria varied by industry. Some industries were excluded from the frame because it was believed that they contributed little or no R&D activity to the final survey estimates. Beginning with the 1992 sample, new industries were added to the frame, and the size criteria were lowered considerably and applied uniformly to firms in all industries. As a result, more companies with 5 or more employees were added to the frame and given a chance of selection for the annual samples. (NOTE: A specification error caused companies with 4 or more employees to be eligible for inclusion in the sample frame for 2007. This resulted in the selection of an additional 25 companies, accounting for about $1.6 million in industrial R&D, for the 2007 sample.)
Frame Partitioning. For the 2007 survey, the frame was partitioned into three groups: (1) companies known to conduct R&D in any of the previous five survey years or in the most recent Company Organization Survey, (2) companies that only reported zero R&D in all of the previous five survey years, and (3) companies for which information about the extent of R&D activity is uncertain. There were 14,335 companies in the first group, 79,867 companies in the second group, and 1,806,522 companies in the third group, for a total of 1,900,724 companies.
b. Sample design
For surveys after 1999, data were summed to the company level, and each company then was assigned a single North American Industry Classification System (NAICS) code based on payroll. The method used followed the hierarchical structure of the NAICS. (The 1999 survey was the first in which companies were classified using NAICS. Prior to 1999, the Standard Industrial Classification (SIC) system was used. The two systems are discussed later under Comparability of Statistics.) The company was first assigned to the economic sector, defined by a 2-digit NAICS code, or combination thereof, representing manufacturing, mining, trade, etc., that accounted for the highest percentage of its aggregated payroll. Then the company was assigned to a subsector, defined by a 3-digit NAICS code, that accounted for the highest percentage of its payroll within the economic sector. Then the company was assigned a 4-digit NAICS code within the subsector, again based on the highest percentage of its aggregated payroll within the subsector. Finally, the company was assigned a 6-digit NAICS code within the 4-digit NAICS, based on the highest percentage of its aggregated payroll within the 4-digit NAICS. Assignment below the 6-digit level was not done because the 6-digit level was the lowest level needed to guarantee publication-level industry classification. Some originally assigned industry codes were revised later during statistical processing; see Industry Reclassification below.
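The hierarchical assignment described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Census Bureau's processing code: the function names, the input layout (a list of establishment-level code and payroll pairs), and the tie-breaking behavior are assumptions. The sector groupings shown are the standard multi-code NAICS sectors (for example, manufacturing spans codes 31 through 33).

```python
from collections import defaultdict

# Two-digit codes that belong to a combined NAICS sector.
SECTOR_GROUPS = {
    "31": "31-33", "32": "31-33", "33": "31-33",  # manufacturing
    "44": "44-45", "45": "44-45",                  # retail trade
    "48": "48-49", "49": "48-49",                  # transportation and warehousing
}


def sector_of(code):
    """Return the economic sector key for a NAICS code string."""
    two_digit = code[:2]
    return SECTOR_GROUPS.get(two_digit, two_digit)


def top_by_payroll(establishments, key):
    """Return the grouping key with the largest aggregated payroll."""
    totals = defaultdict(float)
    for code, payroll in establishments:
        totals[key(code)] += payroll
    return max(totals, key=totals.get)


def assign_naics(establishments):
    """Assign one 6-digit NAICS code to a company (hypothetical helper).

    `establishments` is a list of (six_digit_naics, payroll) pairs.
    """
    best_sector = top_by_payroll(establishments, sector_of)
    pool = [(c, p) for c, p in establishments if sector_of(c) == best_sector]
    best = None
    # Narrow successively: subsector (3), industry group (4), industry (6).
    for digits in (3, 4, 6):
        best = top_by_payroll(pool, lambda c: c[:digits])
        pool = [(c, p) for c, p in pool if c[:digits] == best]
    return best


company = [("325412", 500.0), ("424210", 400.0), ("334111", 300.0)]
print(assign_naics(company))
```

In this made-up example the company is classified in manufacturing because its manufacturing payroll, summed across subsectors, exceeds its wholesale payroll. If the wholesale establishment's payroll were raised above that combined manufacturing total, the same procedure would place the company in wholesale trade, which is the kind of outcome the later Industry Reclassification step addresses.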
Sampling Strata. For the first and third partitioned groups in the sample frame, the sampling strata were defined corresponding to the 4-digit industries and groups of industries for which statistics were developed and published. For the 2007 survey, there were 28 manufacturing and 22 nonmanufacturing strata in each of these partitioned groups. (Note: one manufacturing and one nonmanufacturing were "unclassifieds.") The second partitioned group was divided into two strata, one certainty and the other noncertainty.
Certainty Companies. Before 1994, companies with 1,000 or more employees had been selected with certainty, but it was observed that the level of spending varied considerably and that many of these companies reported no R&D expenditures each year. For these reasons, it was determined that these companies should be given chances of selection based upon the size of their R&D spending if they were in the previous survey or upon an estimated R&D value if they were not. Consequently, the employment size criteria were dropped for surveys after 1994. The criterion based on the estimated amount of R&D spending for identifying companies selected for the survey with certainty was set at $1 million for 1995. With a fixed total sample size, there was concern that the representation of the very large noncertainty universe by a smaller sample each year would be inadequate. To limit the growth occurring each year in the number of certainty cases within the total sample, the certainty criterion was raised for the 1996-2001 surveys to $5 million in total R&D expenditures. Beginning with the 2002 survey, ad hoc certainty companies were companies selected with certainty independent of relative standard error (RSE) constraints. The criteria defining an ad hoc certainty company differed depending on the partitioned group the company was in. Companies in the first partitioned group that had previously reported or imputed R&D of $3 million or more were ad hoc certainties. Companies in the second partition that had any establishments in NAICS 5417 were ad hoc certainties. Companies in the third partition that were in the top 50 of their strata by payroll or in the top 50 of their state by payroll were ad hoc certainties.
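The three ad hoc certainty rules in effect from the 2002 survey onward can be written as a single predicate. The sketch below is illustrative: the function name, argument names, and the idea of passing precomputed payroll ranks are assumptions, while the thresholds ($3 million, NAICS 5417, top 50) come from the text above.

```python
def is_ad_hoc_certainty(partition, prior_rd=0.0, establishment_naics=(),
                        stratum_payroll_rank=None, state_payroll_rank=None):
    """Return True if a company qualifies as an ad hoc certainty.

    partition            -- frame partition number (1, 2, or 3)
    prior_rd             -- previously reported or imputed R&D, in dollars
    establishment_naics  -- NAICS code strings of the company's establishments
    *_payroll_rank       -- 1-based payroll rank within stratum or state
    """
    if partition == 1:
        return prior_rd >= 3_000_000
    if partition == 2:
        return any(code.startswith("5417") for code in establishment_naics)
    if partition == 3:
        in_top_stratum = stratum_payroll_rank is not None and stratum_payroll_rank <= 50
        in_top_state = state_payroll_rank is not None and state_payroll_rank <= 50
        return in_top_stratum or in_top_state
    raise ValueError("partition must be 1, 2, or 3")


print(is_ad_hoc_certainty(1, prior_rd=4_500_000))
print(is_ad_hoc_certainty(2, establishment_naics=["423430", "541711"]))
print(is_ad_hoc_certainty(3, stratum_payroll_rank=120, state_payroll_rank=12))
```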
Sample Selection: Probability Proportionate to Size. The distribution of companies by R&D in the first partitioned group or by payroll in the third partitioned group was skewed as in earlier frames. Because of this skewness, a fixed-sample probability-proportionate-to-size (pps) method remained the appropriate selection technique for these partitioned groups. That is, with the pps method, large companies had higher probabilities of selection than did small companies. The fixed-sample-size methodology has been replicated for every survey since the 1998 survey.
Companies in the first partitioned group received a measure of size equal to the most recent reported positive R&D expenditures. Companies in the third partitioned group received a measure of size equal to their company payroll. RSE constraints by industry were imposed separately in the first and third partitioned groups. Each company in these two partitions was classified into one industry and assigned a probability of selection based on its total measure of size, either total R&D expenditures or total company payroll, respectively.
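As a rough illustration of probability-proportionate-to-size selection, the sketch below assigns each company a selection probability proportional to its measure of size for a fixed target sample size, treating any company whose proportional share reaches 1 as a certainty and redistributing the remainder. This is a textbook pps allocation, not the survey's actual procedure: the survey derived its probabilities from industry-level RSE constraints, which this sketch does not model, and the function name and target sample size `n` are hypothetical.

```python
def pps_probabilities(sizes, n):
    """Selection probabilities proportional to size for a target sample of n.

    sizes -- measure of size per company (prior R&D or payroll)
    n     -- hypothetical target sample size within one stratum
    """
    if not 0 < n <= len(sizes):
        raise ValueError("n must be between 1 and the number of companies")
    certain = set()
    while True:
        remaining = n - len(certain)
        total = sum(s for i, s in enumerate(sizes) if i not in certain)
        if remaining <= 0 or total <= 0:
            break
        # Companies whose proportional share would reach 1 become certainties.
        newly = {i for i, s in enumerate(sizes)
                 if i not in certain and remaining * s >= total}
        if not newly:
            break
        certain |= newly
    probs = []
    for i, s in enumerate(sizes):
        if i in certain:
            probs.append(1.0)
        elif remaining > 0 and total > 0:
            probs.append(remaining * s / total)
        else:
            probs.append(0.0)
    return probs


print(pps_probabilities([100.0, 10.0, 10.0, 10.0, 10.0], n=2))
```

With these made-up sizes the largest company is selected with certainty and the four small companies share the one remaining expected selection equally, so the probabilities sum to the target sample size.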
Simple Random Sampling. The second partitioned group was split into two strata, certainty and noncertainty. The noncertainty stratum was sampled using simple random sampling (srs). Companies in the noncertainty stratum received a probability of selection of roughly 0.01 for manufacturing companies and 0.004 for other companies.
Sample Stratification and Relative Standard Error Constraints. The particular sample selected was one of a large number of samples of the same type and size that by chance might have been selected. Statistics resulting from the different samples would differ somewhat from each other. These differences are represented by estimates of sampling error or variance. The smaller the sampling error, the less variable the statistic. The accuracy of the estimate, that is, how close it is to the true value, is also a function of nonsampling error.
Controlling Sampling Error. Historically, it has been difficult to achieve control over the sampling error of survey estimates. Efforts were confined to controlling the amount of error due to sample size variation, but this was only one component of the overall sampling error. The other component depended on the correlation between the data from the sampling frame used to assign probabilities (namely R&D values either imputed or reported in the previous survey) and the actual current year reported data. The nature of R&D is such that these correlations could not be predicted with any reliability. Consequently, precise controls on overall sampling error were difficult to achieve.
Sampling Strata and Standard Error Estimates. The constraints used to control the sample size in each stratum were based on a universe total that was estimated. That is, as previously noted, a prior R&D value for the first partitioned group and payroll for the third partitioned group were assigned to companies in their respective groups. Assignment of sampling probability was, nevertheless, based on this distribution. The presumption was that actual variation in the sample design would be less than that estimated, because many of the sampled companies in the third partitioned group have true R&D values of zero, not the widely varying values that were imputed using total payroll as a predictor of R&D. Previous sample selections indicate that in general, this presumption held, but exceptions have occurred when companies with large sampling weights have reported large amounts of R&D spending.
Nonsampling Error. In addition to sampling error, estimates are subject to nonsampling error. Nonsampling errors are grouped into five categories: specification, coverage, response, nonresponse, and processing. Efforts are made to minimize the effects of any and all of these nonsampling errors.
Sample Size. The parameters set to control sampling error resulted in samples of 10,751 companies from the first frame partition, 1,339 companies from the second frame partition, and 19,910 companies from the third frame partition. The overall final sample consisted of 32,000 companies. This total included an adjustment to the sample size based on a minimum probability rule and changes in the operational status of some companies.
Minimum Probability Rule. A minimum probability rule was imposed for both the first and third partitions. As noted earlier, probabilities of selection proportionate to size were assigned to each company, where size was the prior reported R&D or payroll value assigned to each company. Selected companies received a sample weight that was the inverse of their probability. Selected companies that ultimately report R&D expenditures vastly larger than their assigned values can have adverse effects on the statistics, which were based on the weighted value of survey responses. In order to minimize these effects on the final statistics, a minimum probability rule was imposed to control the maximum weight of a company. If the probability based on company size was less than the minimum probability, then it was reset to this minimum value. The consequence of raising these original probabilities to the specific minimum probability was to raise the final sample size.
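The effect of the minimum probability rule can be shown with a small worked example. The sketch below is illustrative only: the function names are hypothetical, and the floor of 0.05 is an assumed value chosen because it corresponds to a maximum weight of 20, the approximate maximum reported elsewhere in this document for the first partition.

```python
def apply_minimum_probability(probabilities, min_prob):
    """Raise any selection probability below min_prob up to min_prob."""
    return [max(p, min_prob) for p in probabilities]


def sample_weights(probabilities):
    """Sample weight is the inverse of the selection probability."""
    return [1.0 / p for p in probabilities]


original = [0.5, 0.05, 0.002]          # made-up size-based probabilities
adjusted = apply_minimum_probability(original, min_prob=0.05)

print(sample_weights(original))         # third company would carry a weight near 500
print(sample_weights(adjusted))         # no weight exceeds roughly 20 after the floor
print(sum(original), sum(adjusted))     # expected sample size rises after the floor
```

The sum of the probabilities is the expected number of companies selected, so raising small probabilities to the floor increases the expected sample size, as the paragraph above notes.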
Changes in Operational Status. Between the time that the frame was created and the survey was prepared for mailing, the operational status of some companies changed. That is, they were merged with or acquired by another company, or they were no longer in business. Before preparing the survey for mailing, the operational status was updated to identify these changes. As a result, the number of companies mailed a survey questionnaire was somewhat smaller than the number of companies initially selected for the survey.
Weighting, Maximum Weights, and Probabilities of Selection. Sample weights were applied to each company record to produce national estimates. Within the first partition of the sample, consisting of known R&D performers (positive R&D expenditures), the maximum sample weight was roughly 20. For the third partition, consisting of companies with uncertain R&D activity, the maximum sample weight was roughly 100 for companies classified in manufacturing and 250 for those classified in nonmanufacturing. The weight for any sampled company was calculated as the reciprocal of its selection probability.
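A weighted national estimate of this kind is commonly computed by summing each respondent's reported value multiplied by its weight. The sketch below shows that calculation under the stated rule that the weight is the reciprocal of the selection probability. The function name and the record layout are assumptions, the figures are invented, and the survey's actual estimator may include adjustments not shown here.

```python
def weighted_total(records):
    """Sum of reported values, each multiplied by its sample weight.

    records -- iterable of (reported_rd, selection_probability) pairs
    """
    total = 0.0
    for reported_rd, prob in records:
        if not 0.0 < prob <= 1.0:
            raise ValueError("selection probability must be in (0, 1]")
        total += reported_rd / prob   # weight = 1 / probability
    return total


sample = [
    (50.0, 1.0),    # certainty company, weight 1
    (4.0, 0.05),    # weight 20
    (0.0, 0.01),    # weight 100, reported no R&D
]
print(weighted_total(sample))
```

The example also hints at why large weights matter: if the third company had reported even a modest amount of R&D, its weight of 100 would make that amount a large share of the estimate.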
c. Data collection techniques
The survey is conducted by the Census Bureau in accordance with an interagency agreement with NSF/SRS. Two questionnaires are used each year to collect data for the survey. Known large R&D performers are sent a detailed survey form, Form RD-1. The Form RD-1 requests data on sales or receipts, total employment, employment of scientists and engineers, expenditures for R&D performed within the company with federal funds and with company and other funds, character of work (basic research, applied research, and development), company-sponsored R&D expenditures in foreign countries, R&D performed by others, R&D performed in collaboration with others, federally funded R&D by contracting agency, R&D costs by type of expenditure, R&D costs by technology area, domestic R&D expenditures by state, energy-related R&D, and foreign R&D by country. Because companies receiving the Form RD-1 have participated in previous surveys, computer-imprinted data reported by the company for the previous year are supplied for reference. Companies are encouraged to revise or update the prior-year data if they have more current information; however, prior-year statistics that have been previously published are revised only if large disparities are reported. Small R&D performers and firms included in the sample for the first time are sent Form RD-1A. This questionnaire collects the same information as Form RD-1 except for five items: Federal R&D support to the firm by contracting agency, R&D costs by type of expenditure, domestic R&D expenditures by state, energy-related R&D, and foreign R&D by country. It also includes a screening item that allows respondents to indicate that they do not perform R&D. No prior-year information is made available since the majority of the companies that receive the Form RD-1A have not been surveyed in the previous year.
For 2007, which was an Economic Census year, in addition to the five items that are mandatory in non-Economic Census years—total R&D expenditures, federally funded R&D, net sales, total employment, and the distribution of R&D by state—response to all of the other survey items was mandatory. One other change was implemented for 2007. A question that asked for the R&D by type of expenditure was added to the Form RD-1A. This question was added to conform with the Form RD-1 using results of cognitive interviews that concluded RD-1A respondents were able to report these expenditures.
The 2007 survey questionnaires were mailed in February 2008. Recipients of Form RD-1A were asked to respond within 30 days, while Form RD-1 recipients were given 60 days. A follow-up questionnaire and letter were mailed to RD-1A recipients every 30 days up to a total of three times, if their completed survey form had not been received. After questionnaire and letter follow-ups, one additional automated telephone follow-up was conducted for the remaining delinquent RD-1A recipients.
A letter was mailed to Form RD-1 recipients 30 days after the initial mailing, reminding them that their completed survey questionnaires were due within the next 30 days. A second questionnaire and reminder letter were mailed to Form RD-1 recipients after 60 days. Two additional follow-ups (by telephone) were conducted for delinquent Form RD-1 recipients not ranked among the 500 largest R&D performers based on total R&D expenditures reported in the previous survey. For the top 500 performers, multiple special telephone follow-ups were used to encourage response.
d. Imputation techniques
For various reasons, many firms chose to return the survey questionnaire with one or more blank items. For some firms, internal accounting systems and procedures may not have allowed quantification of specific expenditures. Others may have refused to answer any voluntary questions as a matter of company policy. When respondents did not provide the requested information, estimates for the missing data were made using imputation algorithms. In general, the imputation algorithms computed values for missing items by applying the average percentage change for the target item in the nonresponding firm's industry to the item's prior-year value for that firm, reported or imputed. This approach, with minor variation, was used for most items.
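The general imputation approach described above can be sketched as follows. This is a simplified illustration, not the survey's production algorithm: the function name and input layout are hypothetical, and the sketch assumes the "average percentage change" is an unweighted mean of firm-level changes among responding firms in the same industry, which the text does not specify.

```python
def impute_item(prior_value, industry_pairs):
    """Impute a missing current-year value from the firm's prior-year value.

    prior_value     -- the nonresponding firm's prior-year value (reported or imputed)
    industry_pairs  -- (prior, current) values for responding firms in the
                       same industry
    """
    changes = [(current - prior) / prior
               for prior, current in industry_pairs if prior > 0]
    if not changes:
        # No usable responders: carry the prior value forward (an assumption).
        return prior_value
    average_change = sum(changes) / len(changes)
    return prior_value * (1.0 + average_change)


# Two responders grew 10 percent and 20 percent, so the average change is 15 percent.
print(impute_item(1000.0, [(100.0, 110.0), (200.0, 240.0)]))
```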
3. Survey Quality Measures
a. Sampling variability
The sample is designed to produce RSEs for estimates of total R&D performance of 2 percent for industries designated as "high priority" industries and 5 percent for other industries. The designation of "high priority" is assigned when prior surveys have identified an industry as one in which there is a large amount of R&D expenditures.
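For reference, a relative standard error is conventionally the standard error of an estimate expressed as a percentage of the estimate itself. The sketch below computes it and checks it against the 2 percent and 5 percent design targets stated above. The function names and the example figures are hypothetical.

```python
def relative_standard_error(estimate, standard_error):
    """RSE in percent: 100 * standard error / estimate."""
    if estimate == 0:
        raise ValueError("RSE is undefined for a zero estimate")
    return 100.0 * standard_error / abs(estimate)


def meets_rse_target(estimate, standard_error, high_priority):
    """Compare an RSE with the design target for the industry's priority level."""
    target = 2.0 if high_priority else 5.0
    return relative_standard_error(estimate, standard_error) <= target


# Made-up industry estimate of 1,000 with a standard error of 30 (RSE of 3 percent).
print(relative_standard_error(1000.0, 30.0))
print(meets_rse_target(1000.0, 30.0, high_priority=True))
print(meets_rse_target(1000.0, 30.0, high_priority=False))
```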
Coverage error constitutes a possible source of error for the survey because the Business Register (BR) is undoubtedly missing some in-scope companies, especially relatively new ones. It should be noted that coverage errors for surveys prior to 1992 were more likely, because not all companies on the BR were subject to selection. The Census Bureau continually strengthens and updates the BR so that coverage error is minimized.
(1) Unit nonresponse - Of the companies surveyed for 2007, 19.2 percent did not respond. Nonresponse studies of companies that do not respond to the survey are conducted periodically to improve response rates in future surveys. Overall, the magnitude of unit nonresponse bias is manageable because even if no response can be elicited from a company, other sources of information about the company are used to estimate its R&D data.
(2) Item nonresponse - Companies are encouraged to estimate information when actual data are unavailable. Even so, item nonresponse rates for key data elements in the survey can be high. When estimates are not reported and cannot be elicited by following up with the respondent, complex, comprehensive imputation techniques developed over the survey's long history are used to minimize the effects of item nonresponse. Imputation rates for the key source of funding and character of work elements for 2007 ranged from 2.7 percent to 13.2 percent. Imputation rates for other items and detailed industries tended to be higher.
Variations in respondent interpretations of the definitions of R&D activities and variations in accounting procedures are of particular concern. Specifically, some companies have difficulty separating basic research from applied research, locating geographically where R&D is performed, and reporting the cost of energy R&D. The sophistication and comprehensiveness of a company's accounting system often depend on its size and activities and on its willingness to accommodate Government-sponsored surveys. Work was conducted in the mid-1990s, using cognitive lab approaches, to evaluate ways in which the form could be modified to ease reporting difficulties and reduce measurement error, and recommendations resulting from that work were incorporated into the survey questionnaires. Other ongoing efforts to minimize measurement error include questionnaire pre-testing, improvement of questionnaire wording and format, inclusion of more cues and examples in the questionnaire instructions, consultations with respondents, post-survey evaluations, record check studies, and computer editing.
4. Trend Data
The statistics resulting from this survey are better indicators of changes in, rather than absolute levels of, R&D spending and personnel. Nevertheless, the statistics are often taken to be a continuous time series prepared using the same collection, processing, and tabulation methods. Such uniformity has not been the case. Since the survey was first fielded, improvements have been made to increase the reliability of the statistics and to make the survey results more useful. To that end, past practices have been changed and new procedures instituted. Preservation of the comparability of the statistics has, however, been an important consideration in making these improvements. Nonetheless, changes to survey definitions, the industry classification system, and the procedure used to assign industry codes to multi-establishment companies have had some, though not substantial, effects on the comparability of statistics.
Industry Reclassification. Beginning with the 2004 survey and continuing for 2007, some companies' industry codes assigned by the process described under "sample design" were manually examined and changed. Beginning in the late 1990s, increasingly large amounts of R&D were attributed to the wholesale trade industries, resulting from the payroll-based methodology used to assign industry classifications and the change from the Standard Industrial Classification (SIC) system to the North American Industry Classification System (NAICS) in 1999. Such classification artifacts were of particular concern for companies traditionally thought of as pharmaceutical or computer-manufacturing firms. As these firms increasingly marketed their own products and more of their payroll involved employees in selling and distribution activities, the potential for the companies to be classified among the wholesale trade industries increased. To increase the relevance and usefulness of the industrial R&D statistics, NSF evaluated ways to ameliorate the negative effects of the industry classification methodology and change in classification systems. In addition to firms originally assigned NAICS codes among the wholesale trade industries (NAICS 42), firms assigned to the scientific R&D services industry (NAICS 5417) and the management of companies and enterprises (NAICS 55) using the payroll-based methodology were manually reviewed by NSF and Census. These firms were reclassified based on primary R&D activity, which in most cases corresponded to their primary products or service activities. The result was that most of the R&D previously attributed to NAICS 42 and 55 industries was redistributed. Statistics resulting from the old and new industry classification methods were published in Tables A-9 and A-10 in Research and Development in Industry: 2004 (NSF 09-301) at http://www.nsf.gov/statistics/nsf09301/.
For detailed information, see SRS InfoBrief: "Revised Industry Classification Better Reflects Structure of Business R&D in the United States" (NSF 07-313) at http://www.nsf.gov/statistics/infbrief/nsf07313/.
5. Availability of Data
The data from this survey are published annually in NCSES InfoBriefs, and in the series Research and Development in Industry, all available on the NCSES web site (http://www.nsf.gov/statistics/). Detailed historical statistics for 1953-1998 can be obtained from NSF's Industrial Research and Development Information System (IRIS) at http://www.nsf.gov/statistics/iris/, an online interface to the Survey of Industrial Research and Development Historical Database (SIRDHD). The SIRDHD is a collection of more than 2,500 statistical tables containing all of the statistics produced and published from the 1953-1998 cycles of the annual Survey of Industrial Research and Development. Information from this survey is also included in Science and Engineering Indicators and in National Patterns of R&D Resources.
b. Electronic access
Data from this survey are available on the NCSES Web site.
c. Contact for more information
Additional information about this survey can be obtained by contacting:
Research and Development Statistics Program
National Center for Science and Engineering Statistics
National Science Foundation
4201 Wilson Boulevard, Suite 965
Arlington, VA 22230
Phone: (703) 292-7789