Estimates of Stability
It is important that participating colleges and universities, as well as others who use the results from the NSSE survey, be confident that the benchmarks and norms accurately and consistently measure the student behaviors and perceptions represented on the survey. The minimum sample sizes established for institutions of various sizes and the random sampling process used in the NSSE project ensure that each school will have enough respondents to generate accurate point estimates at the institutional level. It is also important to assure institutions and others who use the data that the results from The Report are relatively stable from year to year, indicating that the instrument produces reliable measurements from one year to the next. That is, are students with similar characteristics responding approximately the same way from year to year?

Over longer periods of time, of course, one might expect to see statistically significant and even practically important improvements in the quality of the undergraduate experience. But changes from one year to the next should be minimal if the survey is producing reliable results.

The approaches that have been developed in psychological testing to estimate stability of measurements make some assumptions about the domain to be tested that do not hold for the NSSE project. Among the most important is that the respondent and the environment in which the testing occurs do not change. This is contrary, of course, to the goals of higher education. Students are supposed to change, by learning more and changing the way they think and act. Not only is the college experience supposed to change people, the rates at which individuals change or grow are highly variable. In addition, during the past decade many colleges have made concerted efforts to improve the undergraduate experience, especially that of first-year students. All this is to say that attempts to estimate the stability of students' responses to surveys about the nature of their experience are tricky at best.

With these caveats in mind, we have to date estimated the stability of NSSE data in three different ways to determine if students at the same institutions report their experiences in similar ways from one year to the next. Two of these approaches are based on responses from students at the colleges and universities where the NSSE survey was administered in 2000, 2001, and 2002.

Are Student Engagement Scores Stable from One Year to the Next? The first stability estimate is a correlation of concordance, which measures the strength of the association between scores from two time periods. NSSE has conducted three national administrations since 2000. This analysis is based on student responses from institutions that used NSSE in two or more years. That is, 127 schools administered NSSE in both 2000 and 2001; 156 schools administered NSSE in 2001 and 2002; and 144 institutions used the survey in 2000 and again in 2002. We also analyzed separately the 80 colleges and universities that administered the survey in all three years. This assured that institutional characteristics are fully controlled. We computed Spearman's rho correlations for the five benchmarks using the aggregated institutional level data. The benchmarks were calculated using unweighted student responses to survey items that were essentially the same for the three years. The rho values for the benchmarks range from .74 to .92 for the 2000-2001 comparison, .79 to .92 for the 2001-2002 comparison, .79 to .90 for the 2000 and 2002 comparison, and .74 to .93 for the three-year comparison (Table 4). These findings suggest that the NSSE data at the institutional level are relatively stable from year to year.
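The rank-order correlation described above can be illustrated with a minimal sketch. It assumes one aggregated benchmark score per institution for each of two years; the scores, variable names, and helper functions below are hypothetical and are not NSSE data or the project's actual analysis code.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson product moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical benchmark means for six institutions in two years.
scores_2000 = [48.2, 51.7, 55.1, 46.9, 53.4, 50.0]
scores_2001 = [49.0, 51.1, 56.3, 47.5, 52.0, 52.8]

rho = spearman_rho(scores_2000, scores_2001)
print(f"Spearman's rho = {rho:.2f}")
```

Because rho depends only on the ordering of institutions, a high value indicates that schools kept roughly the same relative standing across years, even if every school's absolute score shifted.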

We did a similar analysis using data from seven institutions that participated in both the 1999 spring field test (n=1,773) and NSSE 2000 (n=1,803) by computing Spearman's rho for five clusters of items. These clusters and their rho values are: College Activities (.86), Reading and Writing (.86), Mental Activities Emphasized in Classes (.68), Educational and Personal Growth (.36), and Opinions About Your School (.89). Except for the Educational and Personal Growth cluster, the Spearman rho correlations of concordance indicated a reasonably stable relationship between the 1999 spring field test and the NSSE 2000 results.

As with the findings from the schools common to NSSE 2000, 2001, and 2002, these results are what one would expect. The higher correlations are associated with institutional characteristics that are less likely to change from one year to the next, such as the amount of reading and writing, and with the types of activities that can be directly influenced by curricular requirements, such as community service and working with peers during class to solve problems. The lower correlations are in areas more directly influenced by student characteristics, such as estimates of educational and personal growth.

A second approach to estimating stability from one year to the next used matched sample t-tests to determine if differences existed in student responses to individual survey items within a two-year period. For both first-year and senior students, about 18% of the items common to 2000 and 2001 have large effect sizes and fewer than 16% of the items common to 2000 and 2002 have large effect size differences; only 3% of NSSE items common to 2001 and 2002 have large mean difference effect sizes (> .80). For both first-year students and seniors, NSSE items are highly or moderately correlated between any two of the years, with all coefficients being statistically significant, ranging from .60 to .96. The few exceptions that fall below the .60 threshold are items where changes were made in wording or response options or where student changes may occur (e.g., use of technology, co-curricular activities, and student-reported gains in voting and elections).
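The matched sample comparison can be sketched as follows. The text does not state how the effect sizes were computed, so this example assumes a standardized mean difference (Cohen's d) based on the pooled standard deviation of the two years; the item means and function names are hypothetical. Only the t statistic and degrees of freedom are computed here; a p-value would additionally require the t distribution (for instance from a statistics package).

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y):
    """Matched sample t statistic and degrees of freedom for paired values."""
    diffs = [b - a for a, b in zip(x, y)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation.

    Assumption: this is one common effect size definition, not necessarily
    the one used in the NSSE analysis.
    """
    pooled_sd = sqrt((stdev(x) ** 2 + stdev(y) ** 2) / 2)
    return (mean(y) - mean(x)) / pooled_sd

# Hypothetical mean responses to one survey item at five matched
# institutions in two administration years.
item_2001 = [2.8, 3.1, 2.5, 3.0, 2.9]
item_2002 = [2.9, 3.1, 2.7, 3.0, 3.1]

t_stat, df = paired_t(item_2001, item_2002)
d = cohens_d(item_2001, item_2002)
is_large = abs(d) > 0.80  # the "large" threshold quoted in the text

print(f"t({df}) = {t_stat:.2f}, d = {d:.2f}, large effect: {is_large}")
```

Reporting the effect size alongside the t statistic matters with samples of this kind, because even trivial differences can reach statistical significance when the number of respondents is large.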

We used a similar approach to estimate the stability of NSSE results from the seven schools that were common to the spring 1999 pilot and the spring 2000 survey. This analysis did not yield any statistically significant differences (p<.001). We then compared item cluster means (those described earlier in this section) for the individual institutions using a somewhat lower threshold of statistical significance (p<.05, two-tailed). Only four of 35 comparisons reached statistical significance. Moreover, the effect sizes of these differences again were relatively small, in the .25 range.

Test-Retest. The third approach to estimating stability was a form of test-retest analysis. We have two sources of test-retest data that provide some clues about the relative stability of the instrument at the individual student level, though the information is far from definitive evidence. In response to a financial incentive (a $10 long distance telephone calling card), 129 students at a university participating in NSSE 2000 agreed to complete The Report a second time. Both the "test" (first administration) and "retest" were done via the Web. The other source of data is students (n=440) who completed the survey twice without any inducement. Some of these students simply completed the form twice, apparently either forgetting they had done it in response to the original mailing or, more likely according to anecdotal information obtained from the NSSE Help Line staff, worrying that the survey they returned got lost in the mail. All these students completed the paper version, as the Web mode has a built-in security system that does not permit the same student to submit the survey more than once. Another group of students was recruited during focus groups we conducted on eight campuses in spring 2000 (we describe this project later). We asked students in the focus groups to complete The Report a second time. Some of these students used the Web, others used the paper version, and others used a combination. It is therefore possible that mode of administration effects are influencing the test-retest results in unknown ways, as some data were obtained using the Web, some using paper only, and some using a combination of Web (test) and paper (retest). We examine administration mode effects in the next section.

Using the Pearson product moment correlation, as suggested by Anastasi and Urbina (1997) for test-retest analysis, the overall test-retest reliability coefficient for all students (N=569) across all items on The Report was a respectable .83. This indicates a fair degree of stability in students' responses, consistent with other psychometric tools measuring attitudes and experiences (Crocker & Algina, 1986).
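A test-retest coefficient of this kind can be sketched as below. The text does not specify how item-level correlations were combined into the overall coefficient, so this example assumes a simple mean of per-item Pearson correlations; the item names and responses are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical responses (1-4 scale) from six students who completed
# the survey twice: (first administration, second administration).
responses = {
    "asked_questions": ([1, 2, 3, 4, 2, 3], [1, 2, 3, 4, 3, 3]),
    "worked_with_peers": ([4, 3, 2, 1, 3, 2], [4, 3, 2, 1, 3, 2]),
}

# One test-retest coefficient per item.
item_r = {name: pearson(test, retest)
          for name, (test, retest) in responses.items()}

# Assumed summary: the unweighted mean of the item coefficients.
overall_r = sum(item_r.values()) / len(item_r)

for name, r in item_r.items():
    print(f"{name}: r = {r:.2f}")
print(f"mean across items: r = {overall_r:.2f}")
```

Grouping the per-item coefficients by survey section before averaging would give section-level values comparable in form to those reported for the individual parts of the survey.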

Some sections of the survey were more stable than others. For example, the reliability coefficient for the 20 College Activities items was .77. The coefficient for the 10 Opinions About Your School items was .70, for the 14 Educational and Personal Growth items .69, for the five reading, writing, and nature of examinations items .66, and for the six time usage items .63. The mental activities and program items were the least stable, with coefficients of .58 and .57 respectively.

In 2002, we conducted a similar test-retest analysis with 1,226 respondents who completed the paper survey twice. For this analysis, we used the Pearson product moment correlation to examine the reliability coefficients for the items used to construct our benchmarks. For the items related to three of the benchmarks (academic challenge, enriching educational experiences, and active and collaborative learning), the reliability coefficients were .74. The student interaction with faculty members items and supportive campus environment items had reliability coefficients of .75 and .78 respectively.

Summary. Taken together, these analyses suggest that the NSSE survey appears to be reliably measuring the constructs it was designed to measure. Assuming that respondents were representative of their respective institutions, data aggregated at the institutional level on an annual basis should yield reliable results. The correlations are high between the questions common to any two years. Some of the lower correlations (e.g., nature of exams, rewriting papers, tutoring) may be a function of slight changes in item wording and modified response options for other items on the later surveys (e.g., number of papers written). At the same time, compared with the 2000 data, the 2001 and 2002 data reflect a somewhat higher level of student engagement on a number of NSSE items, though the relative magnitude of these differences is small.