It is important that participating
colleges and universities as well as others who use the
results from the NSSE survey be confident that the benchmarks
and norms accurately and consistently measure the student
behaviors and perceptions represented on the survey. The
minimum sample sizes established for various size institutions
and the random sampling process used in the NSSE project
ensure that each school will have enough respondents
to generate accurate point estimates at the institutional
level. It is also important to assure institutions and
others who use the data that the results from The Report
are relatively stable from year to year, indicating that
the instrument produces reliable measurements from one
year to the next. That is, are students with similar characteristics
responding approximately the same way from year to year?
Over longer periods of time, of course, one might expect
to see statistically significant and even practically
important improvements in the quality of the undergraduate
experience. But changes from one year to the next should
be minimal if the survey is producing reliable results.
The approaches that have been developed in psychological
testing to estimate stability of measurements make some
assumptions about the domain to be tested that do not
hold for the NSSE project. Among the most important is
that the respondent and the environment in which the testing
occurs do not change. This is contrary, of course, to
the goals of higher education. Students are supposed to
change, by learning more and changing the way they think
and act. Not only is the college experience supposed to
change people, but the rates at which individuals change or
grow are highly variable. In addition, during the past
decade many colleges have made concerted efforts to improve
the undergraduate experience, especially that of first-year
students. All this is to say that attempts to estimate
the stability of students' responses to surveys about
the nature of their experience are tricky at best.
With these caveats in mind, we have to date estimated
the stability of NSSE data in three different ways to
determine if students at the same institutions report
their experiences in similar ways from one year to the
next. Two of these approaches are based on responses from
students at the colleges and universities where the NSSE
survey was administered in 2000, 2001, and 2002.
Are Student Engagement Scores Stable from One Year to
the Next? The first stability estimate is a correlation
of concordance, which measures the strength of the association
between scores from two time periods. NSSE has conducted
three national administrations since 2000. This analysis
is based on student responses from institutions that used
NSSE in two or more years. Specifically, 127 schools administered
NSSE in both 2000 and 2001; 156 schools administered NSSE in 2001 and
2002; and 144 institutions used the survey in 2000 and
again in 2002. In addition, we analyzed separately
the 80 colleges and universities that administered the
survey in all three years. This ensured that institutional
characteristics were fully controlled. We computed Spearman's
rho correlations for the five benchmarks using the aggregated
institutional level data. The benchmarks were calculated
using unweighted student responses to survey items that
were essentially the same for the three years. The rho values
for these benchmarks range from .74 to .92 for the 2000-2001
comparison, .79 to .92 for the 2001-2002 comparison, .79
to .90 for the 2000 and 2002 comparison, and .74 to .93
for the three-year comparison (Table 4). These findings
suggest that the NSSE data at the institutional level
are relatively stable from year to year.
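The rank correlation described above can be illustrated with a minimal sketch. It assumes one benchmark mean per institution per year and computes Spearman's rho as the Pearson correlation of the ranks, with tied values given their average rank. The six institutional means and the function names are hypothetical and are not NSSE data or NSSE code.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical benchmark means for the same six institutions in two years.
year_a = [52.1, 48.3, 55.0, 50.2, 46.7, 53.4]
year_b = [51.5, 49.0, 54.2, 52.3, 47.3, 52.0]

rho = spearman_rho(year_a, year_b)
print(round(rho, 2))
```

A rho near 1 means that institutions keep roughly the same rank order on the benchmark from one year to the next, which is the sense of stability examined here.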
We did a similar analysis using data from
seven institutions that participated in both the 1999
spring field test (n=1,773) and NSSE 2000 (n=1,803) by
computing Spearman's rho for five clusters of items. These
clusters and their rho values are: College Activities
(.86), Reading and Writing (.86), Mental Activities Emphasized
in Classes (.68), Educational and Personal Growth (.36),
and Opinions About Your School (.89). Except for the Educational
and Personal Growth cluster, the Spearman rho correlations
of concordance indicated a reasonably stable relationship
between the 1999 spring field test and the NSSE 2000 results.
As with the findings from the schools common to NSSE 2000,
2001, and 2002, these results are what one would expect.
The higher correlations are associated with institutional
characteristics that are less likely to change from one
year to the next, such as the amount of reading and writing,
and with the types of activities that can be directly influenced
by curricular requirements, such as community service
and working with peers during class to solve problems.
The lower correlations are in areas more directly influenced
by student characteristics, such as estimates of educational
and personal growth.
A second approach to estimating stability from one year
to the next used matched-sample t-tests to determine
whether student responses to individual survey items
differed within a two-year period. For both first-year
and senior students, about 18% of the items common to
2000 and 2001 have large effect
sizes and fewer than 16% of the items common to 2000 and
2002 have large effect size differences; only 3% of NSSE
items common to 2001 and 2002 have large mean difference
effect sizes (> .80). For both first-year students
and seniors, NSSE items are highly or moderately correlated
between any two of the years, with all coefficients being
statistically significant, ranging from .60 to .96. The
few exceptions that fall below the .60 threshold are items
where changes were made in wording or response options
or where student changes may occur (e.g., use of technology,
co-curricular activities, and student-reported gains in
voting and elections).
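As a rough illustration of the item-level comparison above, the sketch below computes a matched-pairs t statistic and a standardized mean difference for one item, then checks the difference against the .80 threshold for a large effect. The source does not state which effect size formula was used, so the pooled-standard-deviation form here is an assumption, and the item means and function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y):
    """t statistic for matched pairs: mean difference over its standard error."""
    diffs = [b - a for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

def standardized_difference(x, y):
    """Mean difference divided by the pooled standard deviation (assumed form)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(y) - mean(x)) / sqrt(pooled_var)

def is_large(effect_size, threshold=0.80):
    """Flag an effect as large when its magnitude exceeds the threshold."""
    return abs(effect_size) > threshold

# Hypothetical means on one survey item for five matched units in two years.
item_2001 = [2.8, 3.1, 2.5, 3.0, 2.7]
item_2002 = [2.9, 3.2, 2.5, 3.1, 2.9]

t_stat = paired_t(item_2001, item_2002)
d = standardized_difference(item_2001, item_2002)
print(round(t_stat, 2), round(d, 2), is_large(d))
```

The t statistic addresses whether a difference is statistically detectable, while the standardized difference addresses whether it is large enough to matter, which is why the text reports both kinds of evidence.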
We used a similar approach to estimate the stability of
NSSE results from the seven schools that were common to
the spring 1999 pilot and the spring 2000 survey. This
analysis did not yield any statistically significant differences
(p<.001). We then compared item cluster means (those
described earlier in this section) for the individual
institutions using a somewhat lower threshold of statistical
significance (p<.05, two-tailed). Only four of 35 comparisons
reached statistical significance. Moreover, the effect
sizes of these differences again were relatively small,
in the .25 range.
Test-Retest. The third approach
to estimating stability was a form of test-retest analysis.
We have two sources of test-retest data that provide some
clues about the relative stability of the instrument at
the individual student level, though the information is
far from definitive evidence. In response to a financial
incentive (a $10 long distance telephone calling card),
129 students at a university participating in NSSE 2000
agreed to complete The Report a second time. Both the
"test" (first administration) and "retest"
were done via the Web. The other source of data is students
(n=440) who completed the survey twice without any inducement.
Some of these students simply completed the form twice,
apparently either because they forgot they had done so in
response to the original mailing or, more likely according
to anecdotal information obtained from the NSSE Help Line
staff, because they were worried that the survey they returned
had been lost in the mail. All these students completed the paper version,
as the Web mode has a built-in security system that does
not permit the same student to submit the survey more
than once. Another group of students was recruited during
focus groups we conducted on eight campuses in spring
2000 (we describe this project later). We asked students
in the focus groups to complete The Report a second time.
Some of these students used the Web, others used the paper
version, and others a combination. It is therefore possible that
mode of administration effects are influencing the test-retest
results in unknown ways, as some data were obtained
using the Web, some using paper only, and some using a
combination of Web (test) and paper (retest). We examine
administration mode effects in the next section.
Using the Pearson product-moment correlation, as suggested
by Anastasi and Urbina (1997) for test-retest analysis,
we found that the overall test-retest reliability coefficient
for all students (N=569) across all items on The Report was a
respectable .83. This indicates a fair degree of stability
in students' responses, consistent with other psychometric
tools measuring attitudes and experiences (Crocker &
Algina, 1986).
Some sections of the survey were more stable than others.
For example, the reliability coefficient for the 20 College
Activities items was .77. The coefficient for the 10 Opinions
About Your School items was .70, for the 14 Educational
and Personal Growth items .69, for the five reading, writing,
and nature of examinations items .66, and for the six
time usage items .63. The mental activities and program
items were the least stable, with coefficients of .58
and .57 respectively.
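The section-level coefficients above can be sketched as follows. This minimal example assumes that each student's responses are summed within a section and that the test and retest sums are then correlated across students. The source does not spell out how items were combined, so that scoring rule, the section layout, and the response data are all hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def section_scores(responses, item_indexes):
    """Sum each student's responses over the items in one section."""
    return [sum(row[i] for i in item_indexes) for row in responses]

# Hypothetical data: one row per student, one column per item (1-4 scale).
test = [[3, 2, 4, 1], [2, 2, 3, 3], [4, 3, 4, 2], [1, 2, 2, 1], [3, 4, 3, 2]]
retest = [[3, 2, 3, 1], [2, 3, 3, 3], [4, 3, 4, 1], [2, 2, 2, 1], [3, 4, 4, 2]]

# Hypothetical grouping of item columns into survey sections.
sections = {"college_activities": [0, 1], "opinions": [2, 3]}

coefficients = {
    name: pearson_r(section_scores(test, idx), section_scores(retest, idx))
    for name, idx in sections.items()
}
all_items = range(len(test[0]))
overall = pearson_r(section_scores(test, all_items), section_scores(retest, all_items))

for name, r in coefficients.items():
    print(name, round(r, 2))
print("overall", round(overall, 2))
```

In this sketch the overall coefficient is computed from total scores, so it can be higher than some of the section coefficients, a pattern similar to the one reported above.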
In 2002, we conducted a similar test-retest analysis with
1,226 respondents who completed the paper survey twice.
For this analysis, we used the Pearson product moment
correlation to examine the reliability coefficients for
the items used to construct our benchmarks. For the items
related to three of the benchmarks (academic challenge,
active and collaborative learning, and enriching educational experiences),
the reliability coefficients were .74. The student interaction
with faculty members items and supportive campus environment
items had reliability coefficients of .75 and .78 respectively.
Summary. Taken
together, these analyses suggest that the NSSE survey
appears to be reliably measuring the constructs it was
designed to measure. Assuming that respondents were representative
of their respective institutions, data aggregated at the
institutional level on an annual basis should yield reliable
results. The correlations are high between the questions
common to both years. Some of the lower correlations (e.g.,
nature of exams, rewriting papers, tutoring) may be a
function of slight changes in item wording and modified
response options for other items on the later surveys
(e.g., number of papers written). At the same time, compared
with 2000, the 2001 and 2002 data reflect a somewhat higher
level of student engagement on a number of NSSE items,
though the relative magnitude of these differences is
small.