International Journal of Teaching and Learning in Higher Education 2022, Volume 33, Number 3, 407-428
http://www.isetl.org/ijtlhe/ ISSN 1812-9129
Development and Validation of a Formative Evaluation Instrument
for College Teaching
Ren Liu, Xiufeng Liu, and Lara Hutson
University at Buffalo
This study aims to provide preliminary validation for a newly designed instrument to evaluate teaching
effectiveness through student classroom engagement and learning gains. The instrument is titled the
Middle Semester Classroom Survey of Student Engagement and Learning (MS-CSSEL); it consists
of 31 items to measure student classroom engagement in three dimensions and 19 items to measure
student learning gains in three dimensions. To validate the instrument, 634 undergraduate students at
a four-year research university participated in this study. The multidimensional Rasch model was used
to conduct the analysis. The findings indicated that (a) items displayed a good fit to the Rasch model;
(b) dimensions were distinct from each other; and (c) the items displayed high reliability. This
instrument measures teaching effectiveness from a new perspective and provides college teachers with a
new tool to gauge student engagement and learning gains for conducting evidence-based teaching
improvement.
Student ratings of instruction (SRI) is among the
most predominant approaches adopted by higher
education institutions to measure overall teaching
performance or effectiveness (Berk, 2005; Chen &
Hoshower, 2003; Yao & Grady, 2005; Zabaleta,
2007). Typically, SRI are administrated near the end
of each semester, and students are asked to rate the
characteristics of teachers and courses, such as course
organization, teachers' enthusiasm, and clarity of
explanation (Uttl et al., 2017).
Although initially SRI was intended to provide
formative feedback to improve teaching quality, since
the 1970s, it has been commonly used for making
high-stakes decisions for faculty members
(Berk, 2005; Clayson, 2013; Clayson & Haley, 2011;
Galbraith et al., 2012; Nasser & Fresko, 2002; Stowell
et al., 2012). However, the original formative purpose
of SRI has not been fully achieved. There is little
evidence to support the usefulness of SRI to improve
and shape the quality of teaching. Faculty members
rarely make changes in their teaching styles or course
content based on students’ ratings, especially not
senior faculty (Yao & Grady, 2005). Many instructors
are not well trained formally in pedagogy and do not
have the necessary skills to interpret students’ ratings
(Gravestock & Gregor-Greenleaf, 2008; Yao &
Grady, 2005). Further undermining the use of
summative evaluation for teaching improvement,
faculty have doubts about the validity and reliability
of the measurement instruments used for SRI because
of diverse interpretations of effective teaching
(Spooren et al., 2013). While instructors may be
willing to accept students’ feedback, at the same time
they hold negative attitudes towards the summative
use of SRI (Spooren et al., 2013). Therefore, it is
necessary to develop a separate teaching evaluation
for formative purposes, which is the purpose of this
study.
What is Teaching Effectiveness?
Teaching effectiveness is not a new concept in
higher education. However, there is no commonly
agreed-upon definition or universal criteria to answer
what effective teaching is (Benton & Cashin, 2012;
Lehrer-Knafo, 2019; Shevlin et al., 2000).
When thinking about teaching effectiveness, it is
necessary to consider the desired goals of teaching and
learning in different contexts (Atkins & Brown, 2002).
In other words, the meaning of effective in one context
may not be the same in another (Atkins & Brown, 2002).
Moreover, the concept of teaching effectiveness is stakeholder relative. Students, teachers, and evaluation agencies may have different understandings of the meaning of effectiveness (Fauth et al., 2020; Henard &
Leprince-Ringuet, 2008). For example, exemplary teachers may judge effective teaching in terms of lesson organization, lesson clarity, how interesting the lesson is, and a positive classroom environment (Hativa
et al., 2001). Delaney and colleagues (2010) conducted a
study with 17,000 graduate and undergraduate students
to explore the key factors that students perceived as
essential for effective teaching. Students identified nine
behavioral characteristics, including respectful,
knowledgeable, approachable, engaging,
communicative, organized, responsive, professional, and
humorous (Delaney et al., 2010).
Recently, the focus of teaching effectiveness has shifted from observable teaching behaviors to
students’ learning. In theory, teaching quality should be
conceptualized as a complex social process relying on
the interactions between students and instructors (Fauth
et al., 2020). Handelsman and colleagues (2007)
stated that “The instructor needs to consider what
they want their students to know, understand, and be able
to do and work back from there.” Similarly, Hativa and
colleagues (2001) defined effective teaching as the
“teaching that brings about effective and successful
student learning that is deep and meaningful.” In
addition, Darling-Hammond and colleagues (2012)
considered effective teaching as the instruction that
enabled students to learn. Effective teaching should meet
the demands of discipline, instructional goals, and
students’ needs in the teaching and learning environment
(Darling-Hammond et al., 2012). “Effective teaching is
about reaching achievement goals; it is about students
learning what they are supposed to in a particular
context, grade, or subject” (Berliner, 2005, p. 207).
Carnell (2007) conducted a qualitative study with eight
instructors teaching in higher education to examine
university teachers’ conceptions of effective teaching.
Although the instructors had different teaching experiences, they all considered effective teaching to be teaching that enables students’ learning (Carnell, 2007).
Student Ratings of Instruction (SRI)
The use of SRI to measure and interpret teaching
effectiveness has increased in higher education
institutions since the 1900s. However, SRI has
limitations on measuring teaching quality (Clayson &
Haley, 2011; Pounder, 2007; Shevlin et al., 2000;
Spooren et al., 2013; Uttl et al., 2017). First, both the
content validity and construct validity of the commonly
used teaching evaluation measurement instruments have
been questioned (Spooren et al., 2013). Due to the lack
of consensus on the characteristics of effective teachers, there is a large variation in scope among the instruments used to
measure teaching effectiveness, especially in the defined
dimensions of teaching effectiveness (Spooren et al.,
2013). In addition, the majority of measurement
instruments used currently are designed by
administrators without consideration of other essential
stakeholders’ views of effective teaching, which raises
the question of content validity for the design of the
instruments (Spooren et al., 2013).
Second, students’ ratings can be affected by a
variety of factors other than teaching practices, which
raises issues about the accuracy of using SRI results for high-stakes decisions (Pounder, 2007; Shevlin et al.,
2000; Worthington, 2002). Shevlin et al. (2000)
conducted a study with 213 undergraduate students
within a social science department at a UK University
exploring the potential relationship between students’
perception of the lecturer and their ratings for
instruction. The results indicate that charisma factors
account for 69% and 37% of the variation in lecturer
ability and module attributes respectively (Shevlin et al.,
2000). In addition, in a comprehensive review of the
literature, Pounder (2007) synthesized a variety of
potential student-level, course-level, and teacher-level
factors that affected student ratings, concluding that
relying only on SRI to measure teaching and learning in
the classroom was problematic since the questions on the
SRI failed to capture what happened in the classroom
settings (Pounder, 2007). Worthington (2002) conducted
a case study in a finance course to investigate the effects of students’ characteristics and perceptions on SRI. The results suggest that the expected
grade in the subject, student age, race, gender, and their
perceptions of the evaluation process all have significant
impacts on the ratings given to the instructors.
Third, SRI may discriminate against instructors based on their background characteristics, especially female
instructors. Centra and Gaubatz (2000) conducted a
study to investigate the relationship between students’
gender and instructors’ gender across classes regarding
instruction ratings. The results indicate that both in the
same class and across all classes, there is a significant
difference between male and female students when
rating female instructors, but no significant difference is detected for male instructors (Centra & Gaubatz, 2000).
Female instructors are more likely to receive lower
ratings from male students, even controlling for the
effects of class size (Centra & Gaubatz, 2000).
Similarly, Young and colleagues (2009) conducted a study to explore gender bias in ratings of instructors and instruction, and the interaction between students’ and instructors’ characteristics, especially the effects of gender. The results show a potential gender preference when rating instructors on pedagogical characteristics and course content
characteristics (Young et al., 2009). A more recent
qualitative study conducted by Mitchell and Martin
(2018) aimed to investigate the underlying relationship
between gender and student evaluation of teaching.
Based on analysis of the student comments, the authors
found that students did evaluate their professors with
gender bias. Students tended to comment on female instructors’ appearance and personality more than on male instructors’ and showed less professional respect toward women instructors (Mitchell & Martin, 2018).
Considering the limitations of traditional SRI, the
purpose of this study is to develop and validate a
formative evaluation measurement instrument titled
Middle Semester Classroom Survey of Student
Engagement and Learning (MS-CSSEL) that can be
used in college teaching for instructional improvement
through the lens of students’ engagement and learning
gains. The measures of engagement and learning gains
are intended to provide comprehensive information for
instructors to adjust teaching strategies during the course
of instruction.
Theoretical Framework
Reliability and validity are two essential
psychometric properties for any measurement
instrument to be used. According to the Standards for
Educational and Psychological Testing (Joint
Committee, 2014), validity is defined as “the degree to
which evidence and theory support the interpretations of
test scores for proposed uses of tests. Validity is,
therefore, the most fundamental consideration in
developing tests and evaluating tests” (p. 11). The
concept of validity has evolved during the past century,
from considering criterion validity to content validity
(Strauss & Smith, 2009). Around 1980, construct
validity was developed and accepted by scholars (Strauss
& Smith, 2009). Today, it is commonly accepted that
validity claims can be established based on content-
related evidence, alignment of items with the defined
theory, internal structure of items, response process
evidence, criterion-related evidence, and consequence-
related evidence.
Reliability refers to the “consistency of scores
across replications of a testing procedure, regardless of
how this consistency is estimated or reported” (Joint
Committee, 2014). Reliability has been evaluated by a
variety of coefficients depending on the measurement
model being used, such as the Cronbach’s Alpha
reliability coefficient, generalizability coefficient, and
Item Response Theory (IRT) information functions
(Joint Committee, 2014).
In this study, we employed Rasch modeling to
validate the psychometric properties of the newly
developed measurement instrument. The item
dimensionality estimates, the correlation between
dimensions, fit statistics, item-person maps, and
threshold estimates were generated to make claims about
the instrument's construct validity. The Expected A
Posteriori (EAP) reliability can be calculated to represent
the degree of consistency for the instrument.
The research questions of this study are:
1) What evidence supports the multidimensional
construct assumption of student engagement and
learning gains?
2) Does the MS-CSSEL survey produce valid and
reliable measures to assess student classroom
engagement and learning gains?
Method
Purpose and Population of the Measurement
Instrument
The purpose of the MS-CSSEL survey is to measure
college student engagement and learning gains in the
middle of the semester. The measures of engagement and
learning gains will be used to infer whether the current
instruction is effective and to help the instructors adjust
teaching for the rest of the course. This survey's target
population is college students, and the setting for this
survey is face-to-face college classrooms.
Student Engagement
While the conceptualizations of student engagement
are diverse among researchers, there is an agreement that
student engagement is a multidimensional concept with
three coherent dimensions: behavioral, cognitive, and
affective (Appleton et al., 2006; Fredricks et al., 2004;
Kahu, 2013). The definition of behavioral engagement
primarily relies on the idea of active involvement in
academic, social, and extracurricular activities and the
absence of negative behaviors for accomplishing
positive learning outcomes (Fredricks et al., 2004;
Fredricks & McColskey, 2012; Trowler, 2010).
According to Kahu’s (2013) framework of engagement,
there are three specific subscales attached to behavioral engagement: time and effort allocated to educational
activities, interactions with peers and instructors for
educational purposes, and the extent of participation in
learning activities (see Figure 1).
Cognitive engagement incorporates the idea of
willingness to invest effort beyond the requirements for
the course to understand complicated concepts and
master skills (Fredricks et al., 2004; Fredricks &
McColskey, 2012; Trowler, 2010). Cognitive
engagement includes many latent factors that cannot be
observed directly in the classroom such as self-
regulation, values and beliefs about learning, cognitive
and metacognitive strategies for learning (e.g.,
memorizing, synthesizing, understanding, evaluating,
etc.), and personal goals and autonomy (Appleton et al.,
2006; Fredricks et al., 2004; see Figure 1).
Emotional engagement focuses on the affective
reactions to teachers, classmates, academics, and the
institution, which impact the willingness to participate in
school work (Fredricks et al., 2004; Fredricks &
McColskey, 2012; Trowler, 2010). Kahu (2013)
considered emotional engagement as a kind of
attachment to schools or classes, while others considered
it as enjoyment or interest in the learning activities. This
study accepted the conceptual framework of Kahu
(2013), which decomposed emotional engagement into enthusiasm for the course, interest in the course, and a sense of belonging to the class (see Figure 1).
Learning Gains
After conducting a systematic literature review of
learning gains and how they were measured, Rogaten et
al. (2019) categorized learning gains into three different
types following the commonly used ABC (i.e.,
Affective, Behavioral, and Cognitive) classification.
Behavioral learning gains refer to skills, including study
skills, leadership skills, team-work skills, and critical
thinking skills (Rogaten et al., 2019). Cognitive learning
gains are defined as understanding and cognitive
abilities, such as analyzing, memorizing, synthesizing, evaluating, and applying (Rogaten et al., 2019). Affective learning gains are mainly measured as changes in attitude during a course, such as confidence, motivation, and interest (Rogaten et al., 2019). In addition, how much students have learned about content knowledge is also an important indicator of students’ learning gains. For this study, questions about content knowledge are incorporated into the behavioral dimension. Figure 1 presents the conceptual framework of the constructs measured in this study.

Figure 1
Conceptual Framework
Instrument Construction
Based on Kahu’s (2013) framework, the MS-
CSSEL survey utilized five-point Likert-scale questions
to assess student engagement. Behavioral engagement
was assessed using 11 questions relating to how much
time and effort students allocated to educational tasks,
interactions with peers and instructors both inside and
outside the classrooms, and attendance and participation.
Cognitive engagement was assessed using 14 questions
relating to how much extra work students did for the
course, beliefs and values related to the course, cognitive
strategies used to study the course content, and self-
regulation strategies. Affective engagement was
assessed using six items relating to sense of belonging in
the course, enthusiasm, and interests in the subject.
Regarding the items measuring student learning
gains, learning gains in skills and knowledge were
assessed using seven questions that covered gains in
content knowledge, critical thinking, problem-solving,
communication, and cooperation. Cognitive learning
gains were assessed using six questions related to
understanding, analyzing and synthesizing, evaluating,
and applying the course contents. Learning gains in
attitude were assessed using six questions relating to
enthusiasm, interest, and confidence in the content and the course.
In addition to the 50 items above, the MS-CSSEL survey includes two sets of questions asking to what
extent a student’s skills and knowledge gains, cognitive
gains, and affective gains have been affected by some
instructional teaching practices, such as the lecture,
assigned class activities, graded assignments, etc.
At the end of the engagement section, two open-
ended questions asked students how they had
participated and engaged in this course and how the
instructor could maintain and improve their
engagement for the rest of the semester. Another two
open-ended questions were attached to the learning
gains section, asking students to describe what they
had learned so far and provide suggestions to the
instructor for the rest of the semester (see Appendix for
the entire survey).
Data Collection
The MS-CSSEL survey was administered at one research university. In total, 634
undergraduate students in two introductory biology courses participated in this study (see Table 1).
Table 1
Descriptive Statistics of the Sample (N = 634)

Gender: Male, Female, Missing
Classification: Freshmen, Sophomore, Junior, Senior, Missing
Race: White, Hispanic/Latino, African American, Native American, Asian, Others, Missing
The MS-CSSEL survey was administered in the middle of the Spring 2019 semester via a commercial online survey platform, SelectSurvey, which is similar to common platforms such as SurveyMonkey and Qualtrics. The online survey platform collected students’ responses and generated descriptive reports automatically. Students
were invited by email with a survey link and they were
given two weeks to complete the survey. The survey was
conducted anonymously. The instructor provided 5 extra
credit points (0.75% of the final grade) to students to encourage their participation in the survey. In total,
713 students were invited by email and 634 of them
completed the survey with a response rate of 88.9%. As
recommended by Wright and Douglas (1975), in order to
obtain stable item calibrations within ± ½ logit at the 95% confidence level, the minimum sample size is between 64 and 144. Thus, it is reasonable to
assume that this study has an appropriate sample size to
create a stable item calibration.
Traditionally, many Likert-scale evaluation
instruments used in higher education have been
developed according to the Classical Test Theory (CTT),
which assumes that all items on the survey have the same
standard error, and threshold estimates between
categories for all items are equal (Van Zile-Tamsen,
2017). Researchers have demonstrated limitations when
using CTT to analyze rating-scale or Likert-scale data
(Bode & Wright, 1999).
Based on validity and reliability theory, “using Rasch models to develop measurement instruments, or Rasch modeling, is a systematic process in which items are purposefully constructed according to theory and empirically tested through Rasch models in order to produce a set of items that define a linear measurement scale” (Liu, 2020, p. 34). In order to address the
limitations of CTT, this study applied the Rasch modeling
approach to provide more accurate statistical evidence
for the reliability and validity claim of the MS-CSSEL
survey.
According to the literature, student engagement and
learning gains are two multidimensional constructs.
Thus, this study applied the multidimensional rating
scale Rasch model for data analysis. All the items that
measure classroom engagement adopted the same
Likert-scaled categories from strongly agree to strongly
disagree. Because all item statements are positively worded, strongly agree was coded as 5 and strongly disagree was coded as 1. The items that measured
learning gains adopted a 5-point rating scale from 5 to 1.
Items measuring students’ classroom engagement and
learning gains were analyzed separately following the
same modeling procedures.
We used the “TAM” (Robitzsch et al., 2019) and “WrightMap” (Torres Irribarra & Freund, 2014) packages in RStudio (RStudio Team, 2018) to conduct the Rasch analysis. Item fit
statistics, EAP reliability coefficients, and item-person
maps were generated. Additionally, we used Winsteps 4.5.4 (Linacre, 2020) to test the appropriateness of the
category structure. The item probability curves for
student engagement and learning gains were drawn
separately.
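As an illustration of this analysis workflow, the following sketch shows how the unidimensional and 3-dimensional rating scale Rasch models described above might be fit in R with the TAM package. The data frame name, file name, and column ranges are hypothetical placeholders; the item-to-dimension assignment in the Q-matrix follows the structure described in the Instrument Construction section.

    # Minimal sketch, assuming 'engagement_items.csv' holds the 31 engagement
    # items scored 1-5 (columns 1-11 behavioral, 12-25 cognitive, 26-31 affective).
    library(TAM)

    engage  <- read.csv("engagement_items.csv")   # hypothetical file name
    engage0 <- engage - 1                         # TAM expects categories coded 0-4

    # Q-matrix assigning each item to one of the three engagement dimensions
    Q <- matrix(0, nrow = 31, ncol = 3)
    Q[1:11, 1]  <- 1   # behavioral engagement
    Q[12:25, 2] <- 1   # cognitive engagement
    Q[26:31, 3] <- 1   # affective engagement

    # Unidimensional and 3-dimensional rating scale Rasch models
    mod1 <- tam.mml(engage0, irtmodel = "RSM")
    mod3 <- tam.mml(engage0, Q = Q, irtmodel = "RSM")

The same steps would be repeated for the 19 learning gains items with a corresponding 19 x 3 Q-matrix.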
Results
Dimension Structure
Although student classroom engagement and
learning gains are widely believed to be
multidimensional constructs, there is little statistical
evidence to prove the dimensional structure of these
concepts. To test the necessity of using multidimensional models, we ran both unidimensional and 3-dimensional rating scale Rasch models and then compared the results. Because the unidimensional model is hierarchically nested within the multidimensional model, the two models can be compared by testing the
significant change in their deviance that describes the
difference between the estimated model and the true
model of the concept (Baghaei, 2012). Briggs and
Wilson (2003) indicated that the difference in deviance between two nested models approximately follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters estimated in the two
models. Janssen and De Boeck (1999) recommended selecting the model with the significantly smaller deviance.
Table 2
Model Comparison

                      Deviance    Npars   Chi-square   df   p
Engagement
  1-dimension         47697.30    35      570.76       5    <.05
  3-dimension         47126.54    40
Learning gains
  1-dimension         25527.94    23      339.20       5    <.05
  3-dimension         25188.75    28
Table 3
Correlation Matrix for the Dimensions of Classroom Engagement

              Behavioral   Cognitive   Affective
Behavioral    1
Cognitive     0.78         1
Affective     0.71         0.94        1
Table 4
Correlation Matrix for the Dimensions of Learning Gains

                       Skills and Knowledge   Cognitive   Attitude
Skills and Knowledge   1
Cognitive              0.95                   1
Attitude               0.84                   0.88        1
As shown in Table 2, the 3-dimensional Rasch model had a significantly smaller deviance than the unidimensional model for student engagement, χ²(5) = 570.76, p < .05, suggesting that the 3-dimensional Rasch model was a better solution for modeling student engagement than the unidimensional model.
The same modeling process was applied to test the
difference of deviance between the unidimensional
model and 3-dimensional model used to analyze learning
gains. The results indicated that the 3-dimensional Rasch model fit the data better than the unidimensional model, with a significant change in deviance, χ²(5) = 339.20, p < .05. Thus, we selected the 3-dimensional Rasch model for examining the quality of the measures of engagement and learning gains.
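To make the model comparison concrete, the chi-square statistics in Table 2 are simply the differences in deviance between the nested models, evaluated with degrees of freedom equal to the difference in the number of estimated parameters. A brief sketch of this computation for the engagement models, reusing the hypothetical mod1 and mod3 objects from the earlier sketch, is:

    # Likelihood-ratio comparison of the unidimensional and 3-dimensional models,
    # using the engagement deviances and parameter counts reported in Table 2.
    dev_diff <- 47697.30 - 47126.54   # change in deviance = 570.76
    df_diff  <- 40 - 35               # 5 additional parameters
    pchisq(dev_diff, df = df_diff, lower.tail = FALSE)   # p < .05

    # With fitted TAM objects, an equivalent test can be requested directly:
    # anova(mod1, mod3)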
In addition to the comparison of deviance, the correlations between the dimensions of student engagement and of learning gains provide additional information about the precision of the measurement instrument (see Tables 3 and 4). When the multidimensional approach is used, “the higher the correlations, the greater the number of latent traits, and the shorter the subtests,” the more measurement precision improves (Wang et al., 2004, p. 125). As shown in Table 3, students’ behavioral engagement had a high correlation with students’ cognitive engagement (r = 0.78) and affective engagement (r = 0.71). In addition, students’ cognitive engagement had a high correlation with students’ affective engagement (r = 0.94). Although smaller correlation estimates help to differentiate dimensions, a higher correlation estimate does not necessarily imply an identical dimension (Baghaei, 2012). Table 4 presents the correlation matrix for learning gains. Overall, the dimensions of learning gains show a high degree of correlation (i.e., ranging from 0.84 to 0.95), which suggests that the measurement instrument has a high degree of precision.
Item Fit Statistics
The Rasch modeling approach provides four indices for determining how well the data fit the expected Rasch model. The mean square fit statistics (MNSQs) indicate how much misfit is observed between the item performance expected by the Rasch model and the actual performance in the data matrix (Bond & Fox, 2015). The closer the mean square statistic is to 1, the better the model-data fit. For Likert-scale and rating scale questions, the commonly accepted range of the mean square statistics is from 0.6 to 1.4 (Bond & Fox, 2015; Linacre, 2019).
Table 5
Fit Statistics for Classroom Engagement

              Outfit Statistics            Infit Statistics
              MNSQ    ZSTD     p           MNSQ    ZSTD     p
Behavioral
  Item 1      1.25    4.55     0.00        1.23    4.35     0.00
  Item 2      1.32    5.56     0.00        1.31    5.49     0.00
  Item 3*     1.42    6.04     0.00        1.48    6.78     0.00
  Item 4      0.96    -0.76    0.45        0.95    -0.89    0.37
  Item 5*     1.54    9.03     0.00        1.53    8.91     0.00
  Item 6      1.03    0.68     0.50        1.04    0.82     0.41
  Item 7      0.97    -0.41    0.68        1.00    0.00     1.00
  Item 8      0.82    -3.43    0.00        0.84    -3.18    0.00
  Item 9      0.85    -3.21    0.00        0.84    -3.39    0.00
  Item 10     1.37    6.09     0.00        1.37    6.22     0.00
  Item 11*    2.93    23.58    0.00        2.92    23.46    0.00
Cognitive
  Item 12     1.23    3.81     0.00        1.19    3.21     0.00
  Item 13     1.26    4.42     0.00        1.21    3.65     0.00
  Item 14     1.08    1.32     0.19        1.13    2.07     0.04
  Item 15     0.69    -5.70    0.00        0.72    -5.17    0.00
  Item 16     0.80    -4.10    0.00        0.78    -4.52    0.00
  Item 17     0.74    -5.43    0.00        0.69    -6.41    0.00
  Item 18     0.78    -4.34    0.00        0.77    -4.54    0.00
  Item 19     0.66    -6.72    0.00        0.67    -6.67    0.00
  Item 20     0.63    -7.49    0.00        0.63    -7.61    0.00
  Item 21     0.76    -4.61    0.00        0.74    -5.15    0.00
  Item 22     0.74    -5.06    0.00        0.72    -5.29    0.00
  Item 23     0.73    -5.17    0.00        0.73    -5.22    0.00
  Item 24     0.70    -5.81    0.00        0.68    -6.19    0.00
  Item 25     0.91    -1.56    0.12        0.90    -1.72    0.09
Affective
  Item 26     0.78    -4.22    0.00        0.78    -4.17    0.00
  Item 27     0.75    -5.06    0.00        0.73    -5.45    0.00
  Item 28     0.79    -3.93    0.00        0.78    -4.20    0.00
  Item 29     0.65    -6.70    0.00        0.63    -7.22    0.00
  Item 30     0.65    -6.50    0.00        0.69    -5.57    0.00
  Item 31     0.83    -2.92    0.00        0.86    -2.40    0.02
The standardized fit statistics (ZSTDs) indicate how likely the degree of misfit expressed by the mean square statistics is to be observed (Bond & Fox, 2015). When the sample size is between 30 and 300, the acceptable range for the standardized fit statistics is from -2.0 to 2.0 (Bond & Fox, 2015). Typically, decisions about model-data fit rest on all four indices equally, but it is reasonable to make decisions based on selected indices for a particular purpose (Bond & Fox, 2015). For example, if the sample size is larger than 300, the ZSTDs are likely to be overly sensitive (i.e., many items fail to fit the model; Linacre, 2019). The standardized fit statistics depend heavily on the sample size, which inflates putative Type I error rates; the mean square statistics, however, are comparatively insensitive to sample size (Smith et al., 2008).
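As an illustration of how these criteria can be applied, the following sketch shows how infit and outfit MNSQs might be obtained from the hypothetical TAM model fitted in the earlier sketch and how items falling outside the 0.6-1.4 range could be flagged:

    # Item fit statistics for the 3-dimensional engagement model (see earlier sketch)
    fit3    <- tam.fit(mod3)
    itemfit <- fit3$itemfit

    # Flag items whose Infit or Outfit MNSQ falls outside the 0.6-1.4 range
    misfit <- subset(itemfit, Outfit < 0.6 | Outfit > 1.4 |
                              Infit  < 0.6 | Infit  > 1.4)
    misfit[, c("parameter", "Outfit", "Infit")]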
Over 600 students participated in this study; therefore, decisions about model-data fit were made primarily based on the MNSQs.
Table 6
Fit Statistics of Learning Gains

              Outfit Statistics            Infit Statistics
              MNSQ    ZSTD     p           MNSQ    ZSTD     p
Skills and knowledge
  Item 32     0.87    -2.18    0.03        0.90    -1.66    0.10
  Item 33     1.00    0.01     1.00        0.92    -1.28    0.20
  Item 34     0.91    -1.62    0.11        0.95    -0.76    0.45
  Item 35     0.78    -4.09    0.00        0.82    -3.29    0.00
  Item 36     1.28    4.57     0.00        1.30    4.85     0.00
  Item 37     1.06    0.99     0.32        1.06    1.05     0.29
  Item 38     1.54*   7.83     0.00        1.60*   8.54     0.00
Cognitive
  Item 39     1.37    5.94     0.00        1.37    5.90     0.00
  Item 40     0.70    -5.52    0.00        0.73    -4.79    0.00
  Item 41     0.65    -6.60    0.00        0.68    -5.94    0.00
  Item 42     0.72    -5.06    0.00        0.74    -4.67    0.00
  Item 43     0.69    -5.81    0.00        0.72    -5.11    0.00
  Item 44     0.87    -2.14    0.03        0.90    -1.74    0.08
Attitude
  Item 45     0.95    -0.76    0.44        1.02    0.39     0.70
  Item 46     1.12    1.99     0.05        1.15    2.33     0.02
  Item 47     0.84    -2.77    0.01        0.93    -1.24    0.21
  Item 48     0.88    -2.13    0.03        0.88    -2.08    0.04
  Item 49     0.71    -5.48    0.00        0.73    -5.02    0.00
  Item 50     1.70*   9.47     0.00        1.70*   9.48     0.00
Overall, the majority of the 31 items measuring student classroom engagement fit the expected Rasch model well (see Table 5). For classroom behavioral engagement, 8 out of 11 items demonstrated good model-data fit, with MNSQs ranging from 0.82 to 1.37; the MNSQs of items 3, 5, and 11 were outside the acceptable range. For all 14 items measuring cognitive engagement, MNSQs ranged from 0.63 to 1.23, indicating an acceptable level of model-data fit. All six items measuring student affective engagement also fit the expected Rasch model, with MNSQs ranging from 0.63 to 0.86.
As presented in Table 6, 6 out of 7 items measuring learning gains in skills and knowledge (i.e., items 32 to 38) had good model-data fit, with MNSQs ranging from 0.78 to 1.30. Item 38, which asked how much students learned to communicate and work with peers to improve their learning, showed greater misfit (Outfit MNSQ = 1.54; Infit MNSQ = 1.60). All items that measured learning gains in cognition (items 39-44) had acceptable MNSQs, ranging from 0.65 to 1.37. Five out of the six items that measured learning gains in attitude (items 45-50) fit the expected model well, with MNSQs ranging from 0.84 to 1.15. Item 50, which asked whether students were willing to seek help from others when necessary, failed to fit the model based on both the Infit and Outfit MNSQs.
Internal Structure of Items
The item-person map, also called a “Wright map,” places the person and item estimates on the same logit scale, with the person ability estimates distributed on the left and the item difficulty estimates on the right. Generally, a good measurement instrument should match the sample’s ability distribution with the items’ difficulty distribution (Liu, 2020, p. 40). In this study, person ability estimates represent levels of student engagement and the extent of learning gains. On the Wright map, items are arranged by difficulty estimates, from easier to agree with at the bottom to harder to agree with at the top of the map. Individuals are arranged by their levels of engagement and learning gains, from higher at the top to lower at the bottom.
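A minimal sketch of how such an item-person map might be produced from the hypothetical fitted model in the earlier sketches, using the WrightMap package named in the Method section, is:

    # Wright (item-person) map for the 3-dimensional engagement model
    library(WrightMap)

    wle    <- tam.wle(mod3)                                  # person estimates (logits)
    theta  <- as.matrix(wle[, grep("theta", colnames(wle))]) # one column per dimension
    thresh <- tam.threshold(mod3)                            # item category thresholds

    wrightMap(theta, thresh)   # persons plotted on the left, item thresholds on the right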
Evidence from the combined Wright Map for
student classroom engagement (see Figure 2) indicated
that levels of student classroom engagement in
behavioral, cognitive, and affective dimensions were
normally distributed, and the spread of student engagement estimates was sufficient, ranging from -2.5 to 3.5 logits. The mean logit for student engagement was slightly higher than the average of the item estimates, suggesting that this set of questions was relatively easy for most students to agree with.
Separate Wright maps were produced for each dimension of student engagement. Student behavioral engagement was determined from the time and effort put into the course, interactions with peers and instructors, and participation in the course. The threshold Wright map (Figure 3) indicated that there were sufficient items to measure student classroom behavioral engagement. The results also indicated that the higher end of student engagement extended beyond the highest item thresholds, which suggests that more items should be developed to assess the top end of student engagement precisely.
Figure 2
Combined Wright Map for Students' Classroom Engagement
As shown in Figure 3, the distribution of cognitive engagement estimates was acceptable, with a range of -2.0 to 4.5 logits. Overall, the threshold map indicated that the spread of items was sufficient to measure most levels of cognitive engagement. However, the higher end of cognitive engagement extended beyond the most “difficult” item, which suggests that more “difficult” questions are needed to capture higher cognitive engagement. Additionally, the efficiency of this set of questions was not ideal, with five items (items 1, 8, 11, 12, and 13) found to be at the same level of difficulty for this population. Thus, for further development, some of these questions should be combined or deleted in order to improve the efficiency of the survey.
As shown in Figure 3, the spread of students’ affective engagement was acceptable, with estimates ranging from -2.5 to 3.5 logits. The mean logit of the six items was slightly lower than the mean logit of students’ affective engagement, which suggested that the items were relatively easy for students to agree with.
The distribution of the threshold estimates had a sufficient spread to measure most levels of affective engagement, with the exception of the higher end. Thus, for further improvement, some more difficult questions should be considered to make this dimension more comprehensive.

Figure 3
Separate Wright Maps for Behavioral, Cognitive, and Affective Engagement
For the items measuring student learning gains, the combined Wright map (see Figure 4) suggested that the spread of person estimates was acceptable, ranging from -2.5 to 4 logits, and that the shape of the distribution was nearly normal. The mean logit of the items was slightly smaller than the average logit of students’ learning gains, which indicates that students gained more through classroom learning, in terms of skills and knowledge, cognition, and attitude, than the 19 items together measured.
Figure 4
Combined Wright Map for Learning Gains
Figure 5
Separate Wright Maps for Learning Gains in Skills and Knowledge, Cognition, and Attitude
The threshold map (Figure 5) of learning gains in skills and knowledge indicates that the separation of item difficulty is acceptable but not sufficient to measure all levels of student learning gains in skills and knowledge. Two gaps, one between items 6 and 3 and another between items 7 and 1, indicate that more questions should be added.
The threshold map of learning gains in cognition (Figure 5) indicates that the six items measured some levels of cognitive learning gains, but the items were not sufficient to capture all levels of cognitive learning gains. The gap between items 1 and 3 suggests that more items should be added in this area.
Results from the threshold map for learning gains
in attitude (Figure 5) indicate that the spread of the threshold difficulties is acceptable, though there was a
gap between items 4 and 6. Additionally, none of the
questions successfully measured the highest and lowest
levels of learning gains in attitude. This should also be
addressed in future improvement of the survey.
Reliability
The Expected A Posteriori (EAP) reliability coefficients were calculated to examine the reliability of the MS-CSSEL survey. As presented in Table 7, the results indicated a high degree of consistency for every sub-dimension of classroom engagement and learning gains, with reliability coefficients ranging from 0.83
to 0.92. In particular, the reliability was 0.83 for
behavioral engagement, 0.87 for cognitive engagement,
and 0.83 for affective engagement. Regarding the
section measuring students’ learning gains, the
reliability was 0.91 for learning gains in skills and
knowledge, 0.92 for learning gains in cognition, and
0.90 for learning gains in attitude.
Table 7
Reliability of Engagement and Learning Gains

                             EAP Reliability
Classroom engagement
  Behavioral engagement      0.83
  Cognitive engagement       0.87
  Affective engagement       0.83
Learning gains
  Skills and knowledge       0.91
  Cognitive                  0.92
  Attitude                   0.90
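For reference, EAP reliability coefficients of this kind can be read directly from a fitted TAM model object; a brief sketch, again using the hypothetical mod3 object from the earlier sketches, is:

    # EAP reliability for each dimension of the fitted 3-dimensional model
    round(mod3$EAP.rel, 2)   # one coefficient per dimension (behavioral,
                             # cognitive, and affective engagement)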
Category Threshold Estimates
For the rating scale Rasch model, it is required that the average person estimates advance monotonically from lower-level categories to higher ones and that adjacent threshold estimates advance by at least 1.4 logits for a 5-point scale (Bond & Fox, 2015). The acceptable range of MNSQs for each category is from 0.6 to 1.4 (Bond & Fox, 2015). In this study, none of the MNSQs of the categories were outside the acceptable range. The observed averages for each category in both student engagement and learning gains
increased monotonically. However, the difference in Andrich thresholds between "disagree" and "neutral" was 0.88, and the difference between "neutral" and "agree" was 0.74, which suggests that students had difficulty distinguishing "neutral" from "disagree" and "agree." As shown in Figure 6, the highest probability peak of “neutral” was less than .5, which indicates that this category was not functioning well. The results suggest that the response option "neutral" should be investigated further or deleted.
In terms of the category probability statistics for
learning gains, as presented in Table 8, the majority of
threshold estimates were acceptable. However, the
difference between the second and third threshold
estimates was 0.75. Because this is less than the
minimum threshold of 1.0, it suggests that students might
not be able to differentiate categories 2 and 3. As shown
in Figure 7, the highest probability peak for the second
category was less than 0.5, which suggested that this
category was not functioning well. For further
improvement, it is reasonable to consider using a 4-point
rating-scale category structure for the items measuring
learning gains.
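A minimal sketch of the kind of category collapsing suggested here, merging the adjacent response options 2 and 3 of the learning gains scale before re-estimating the model (object names hypothetical), is:

    # Collapse the 5-point learning gains scale to 4 points by merging options 2 and 3,
    # then refit the rating scale model. 'gains' is assumed to hold the 19 learning
    # gains items on the original 1-5 scale, and 'Q_gains' a 19 x 3 Q-matrix.
    recode4 <- function(x) ifelse(x >= 3, x - 1, x)        # 1,2,3,4,5 -> 1,2,2,3,4
    gains4  <- as.data.frame(lapply(gains, recode4)) - 1   # shift to 0-3 for TAM

    modGains4 <- tam.mml(gains4, Q = Q_gains, irtmodel = "RSM")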
Discussion
Validity and Reliability of the MS-CSSEL Survey
Overall, the Rasch analysis results suggest that the MS-CSSEL survey is a valid and reliable tool that can provide useful information to instructors about student classroom engagement and learning gains in the middle of the semester. Regarding reliability, the EAP reliability estimates for each sub-dimension suggest that the survey has a high degree of consistency. In addition, the validity of the MS-CSSEL survey is supported by the following evidence: (a) the comparison of the 3-dimensional model with the unidimensional model supports the need to treat student engagement and learning gains as two multidimensional constructs, (b) the moderately high correlation coefficients between sub-dimensions suggest a good internal relationship between dimensions, (c) the item fit statistics indicate that 47 out of 50 items contribute to the defined constructs, (d) the item-person maps show a good match between “person ability” and “item difficulty” for both the measures of classroom engagement and learning gains, which suggests a good internal structure of items, and (e) the category threshold estimates for the measures of student engagement and learning gains indicate that each category adequately measured the constructs.
Figure 6
Item Probability Curve for Student Engagement
Table 8
Category Threshold Estimates

                      Infit MNSQ   Outfit MNSQ   Observed Average   Andrich Threshold
Student engagement
  Strongly Disagree   1.17         1.30          -.55               None
  Disagree            1.00         1.03          -.10               -1.41
  Neutral             0.93         0.94          .36                -.53
  Agree               0.89         0.86          .90                .21
  Strongly Agree      1.03         1.02          1.52               1.73
Learning Gains
  1                   1.22         1.36          -.89               None
  2                   1.00         1.06          -.35               -1.80
  3                   0.90         0.96          .34                -1.05
  4                   0.89         0.86          1.30               .43
  5                   1.06         1.04          2.60               2.42

Figure 7
Item Probability Curve for Learning Gains
Although most items in the MS-CSSEL survey
functioned as expected, a few items have been identified
for further improvement. First, the question regarding
student attendance showed poor model-data-fit.
According to the definition of student behavioral
engagement, attendance is an essential aspect that
reflects how students engage in classroom learning.
However, although the survey was conducted
anonymously, students might not report their attendance
accurately. Thus, this question should remain but needs
revision. Second, the questions “During class time, I regularly work with other students on work assigned by the instructor” and “Learning to communicate and work with peers to improve my learning” also showed misfit
to the Rasch model. Those two questions are important
aspects of engagement and learning gains. However,
because most current large-enrollment university classes
are lecture-based, collaborative learning between and
among students occurs rarely. Thus, the instructor should
consider how to incorporate collaborative learning in
large classes or whether the items are meaningful to the
course when using this survey. Third, the item-person
map suggests that the items effectively measure
students’ engagement and learning gains, but some
apparent gaps between items need to be taken into
consideration for further improvement. Fourth, the
results of category probability estimates implied that the
response option “Neutral” could be deleted for the items
measuring students’ classroom engagement, and the response options "2" and "3" for the rating scale used for measuring learning gains could be collapsed. These
items may be improved in continuous development and
validation of the instrument.
Use of Results for Teaching Improvement
Instructors can use data produced by the MS-CSSEL
survey for teaching improvement. Instructors do not need to conduct the statistical analysis reported in this paper; instead, they can rely on each item's descriptive statistics to plan for teaching improvement. Any online survey platform, such as SurveyMonkey, can generate
descriptive statistics for each item automatically. The
mean scores of students' responses to each question
provide detailed information on students’ average
performance regarding each aspect of engagement and
learning gains. Since the MS-CSSEL survey adopts a 5-
point category structure, we recommend instructors pay
attention to the items that have mean scores below 3. For
example, the results for behavioral engagement in this
pilot study showed that students did not always ask
questions in class (M=2.71) and had limited interaction
with the instructor outside the classroom (M=2.25).
Thus, for teaching improvement, the instructor may
consider providing more opportunities for students to ask
questions while teaching in the classroom and encourage
them to interact with the instructor after class.
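A short sketch of this kind of descriptive screening, computing the mean response for each item and flagging items whose means fall below the scale midpoint of 3 (using the hypothetical data frame from the earlier sketch), is:

    # Descriptive screening of the engagement items on the original 1-5 scale
    item_means <- colMeans(engage, na.rm = TRUE)
    sort(item_means[item_means < 3])   # items to target for teaching improvement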
Additionally, the information on how the designed
instructional components affect students’ learning gains
can provide additional evidence for the instructor to
consider further teaching improvement. For example, in
this study, the average scores of “interactions with the
instructor about learning” are lower than other
instructional components in terms of the contributions to
students’ learning gains. Aligning with the findings
identified through the Likert-type questions, the
interaction between faculty and students is an essential
aspect for further teaching improvement.
Researchers who use the instrument should conduct Rasch analysis to obtain interval measures of student engagement and learning gains. First, the individual Rasch scores of engagement and learning gains will help researchers understand the relationship between students’ academic performance and their engagement and learning gains, which can suggest better instructional strategies, especially for low-performing students. For example, the individual “ability” scores could help researchers reach low-performing students to identify their strengths and weaknesses in engagement and learning gains and then prepare more targeted mentoring and suggestions for improving their academic performance in the rest of the course.
Second, by looking at the item-person map and the
average scores of each item, researchers can identify the aspects with which the majority of students struggle. For example, the separate item-person map for behavioral engagement suggests that taking notes was the question
that students most often agreed with, which meant that
most students took notes during classroom learning.
However, the item that asked whether students always
discussed their learning process with the instructor
outside the classroom was less commonly agreed with.
Thus, to maximize students’ engagement, the instructor
should think about how to provide more opportunities for
student-faculty interactions.
Finally, this survey can benefit students in terms of
self-regulation and train them to be self-directed
learners. By taking this survey, students will have an
opportunity to monitor their engagement and reflect on
what they have learned. Instructors can provide students with information on average engagement and learning gains, as well as how high-performing students engage in classroom learning, as examples to help other students adjust their learning strategies. In this way, students will
have an opportunity to learn their strengths and
weaknesses in the course.
Administration of the MS-CSSEL Survey
The MS-CSSEL survey is intended to measure
student engagement and learning gains in the middle of
the semester. Thus, it is appropriate to administer the
survey around the middle of the semester in order to
collect data for instructors to consider potential
improvement while teaching the course.
The purpose of the MS-CSSEL survey is to help
instructors identify the strengths and weaknesses of
students’ engagement and learning gains for adjusting
teaching strategies. Thus, although the survey demonstrated high internal consistency and excellent construct validity, the results indicate it should be used for formative evaluation of teaching rather than for high-stakes faculty evaluation. To protect students’ privacy and
encourage them to take the survey without concerns, we
recommend administering the survey anonymously.
Finally, because the survey is quite long, having students
complete it on a purely voluntary basis will likely
decrease the response rate. Thus, we recommend the
instructor provide incentives to encourage student
participation. For example, in this study, the instructor
provided five extra points (~0.75% of the final grade) to
the students who completed the survey on time. In this way, 88.9% of the students responded to the survey.
In conclusion, assessing teaching effectiveness
through measuring student classroom engagement and
learning gains is a viable way to address the lack of a
unified definition of teaching effectiveness.
Incorporating the MS-CSSEL survey in the middle of the
semester will help instructors: (a) monitor and
understand students’ engagement and learning process,
(b) gauge whether students’ learning gains match with
instructor’s expectations, and (c) adjust instruction for
the remainder of the course. Furthermore, it will prepare
students to be self-directed learners.
References
American Educational Research Association, American
Psychological Association, Joint Committee on
Standards for Educational, Psychological Testing
(US), & National Council on Measurement in
Education. (2014). Standards for educational and
psychological testing. American Educational
Research Association.
Appleton, J. J., Christenson, S. L., Kim, D., & Reschly,
A. L. (2006). Measuring cognitive and
psychological engagement: Validation of the
Student Engagement Instrument. Journal of School
Psychology, 44(5), 427-445.
Atkins, M., & Brown, G. (2002). Effective teaching in
higher education. Routledge.
Baghaei, P. (2012). The application of multidimensional
Rasch models in large scale assessment and
validation: an empirical example. Electronic
Journal of Research in Educational Psychology,
10(1), 233-252.
Benton, S. L., & Cashin, W. E. (2012). IDEA Paper #50: Student ratings of teaching: A summary of research and literature.
Berk, R. A. (2005). Survey of 12 strategies to measure
teaching effectiveness. International Journal of
Teaching and Learning in Higher Education, 17(1),
48-62.
Berliner, D. (2005). The near impossibility of testing for
teacher quality. Journal of Teacher Education,
56(3), 205-214.
Bode, R. K., & Wright, B. D. (1999). In Higher
education: Handbook of theory and research.
Springer.
Bond, T., & Fox, C. M. (2015). Applying the Rasch
model: Fundamental measurement in the human
sciences. Routledge.
Briggs, D. C., & Wilson, M. (2003). An introduction to
multidimensional measurement using Rasch
models. Journal of Applied Measurement, 4(1), 87-
100.
Carnell, E. (2007). Conceptions of effective teaching in
higher education: extending the boundaries.
Teaching in Higher Education, 12(1), 25-40.
Centra, J. A., & Gaubatz, N. B. (2000). Is there gender
bias in student evaluations of teaching? The Journal
of Higher Education, 71(1), 17-33.
Chen, Y., & Hoshower, L. B. (2003). Student evaluation
of teaching effectiveness: An assessment of student
perception and motivation. Assessment &
Evaluation in Higher Education, 28(1), 71-88.
Clayson, D. E. (2013). Initial impressions and the student
evaluation of teaching. Journal of Education for
Business, 88(1), 26-35.
Clayson, D. E., & Haley, D. A. (2011). Are students
telling us the truth? A critical look at the student
evaluation of teaching. Marketing Education
Review, 21(2), 101-112.
Darling-Hammond, L., Jaquith, A., & Hamilton, M.
(2012). Creating a comprehensive system for
evaluating and supporting effective teaching.
Stanford Center for Opportunity Policy in
Education.
Delaney, J. G., Johnson, A., Johnson, T. D., & Treslan,
D. (2010). Students’ perceptions of effective
teaching in higher education.
https://www.mun.ca/educ/faculty/mwatch/laura_tre
slan_SPETHE_Paper.pdf
Fauth, B., Göllner, R., Lenske, G., Praetorius, A., & Wagner, W. (2020). Who sees what? Conceptual
considerations on the measurement of teaching
quality from different perspectives. Zeitschrift Für
Pädagogik, 66, 138-155.
Fredricks, J. A., & McColskey, W. (2012). Handbook
of research on student engagement. Springer.
Galbraith, C. S., Merrill, G. B., & Kline, D. M. (2012).
Are student evaluations of teaching effectiveness
valid for measuring student learning outcomes in
business related classes? A neural network and
bayesian analyses. Research in Higher Education,
53(3), 353-374.
Gravestock, P., & Gregor-Greenleaf, E. (2008). Student
course evaluations: Research, models and trends.
Higher Education Quality Council of Ontario.
Handelsman, J., Miller, S., & Pfund, C. (2007).
Scientific teaching. Macmillan.
Hativa, N., Barak, R., & Simhi, E. (2001). Exemplary
university teachers: Knowledge and beliefs
regarding effective teaching dimensions and
strategies. The Journal of Higher Education, 72(6),
699-729.
Henard, F., & Leprince-Ringuet, S. (2008). The path to
quality teaching in higher education.
https://www.oecd.org/education/imhe/41692318.p
df
Janssen, R., & De Boeck, P. (1999). Confirmatory
analyses of componential test structure using
multidimensional item response theory.
Multivariate Behavioral Research, 34(2), 245-268.
Kahu, E. R. (2013). Framing student engagement in
higher education. Studies in Higher Education,
38(5), 758-773.
Lehrer-Knafo, O. (2019). How to improve the quality
of teaching in higher education? The application of
the feedback conversation for the effectiveness of
interpersonal communication. EDUKACJA
Quarterly, 149(2).
Linacre, J. M. (2020). Winsteps® Rasch measurement
computer program. Winsteps.com
Liu, X. (2020). Using and developing measurement
instruments in science education: A Rasch
Modeling Approach (2nd ed.). IAP.
Mitchell, K. M., & Martin, J. (2018). Gender bias in
student evaluations. PS: Political Science &
Politics, 51(3), 648-652.
Nasser, F., & Fresko, B. (2002). Faculty views of
student evaluation of college teaching. Assessment
& Evaluation in Higher Education, 27(2), 187-
198.
Pounder, J. (2007). Is student evaluation of teaching
worthwhile? An analytical framework for
answering the question. Quality Assurance in
Education, 15(2), 178-191.
Robitzsch, A., Kiefer, T., & Wu, M. (2019). TAM: Test
analysis modules. R package version 3.3-10.
https://CRAN.R-project.org/package=TAM
Rogaten, J., Rienties, B., Sharpe, R., Cross, S.,
Whitelock, D., Lygo-Baker, S., & Littlejohn, A.
(2019). Reviewing affective, behavioural and
cognitive learning gains in higher education.
Assessment & Evaluation in Higher Education,
44(3), 321-337.
RStudio Team (2018). RStudio: Integrated
development for R. RStudio, Inc.
http://www.rstudio.com/.
Shevlin, M., Banyard, P., Davies, M., & Griffiths, M.
(2000). The validity of student evaluation of
teaching in higher education: love me, love my
lectures? Assessment & Evaluation in Higher
Education, 25(4), 397-405.
Smith, A. B., Rush, R., Fallowfield, L. J., Velikova,
G., & Sharpe, M. (2008). Rasch fit statistics and
sample size considerations for polytomous data.
BMC Medical Research Methodology, 8(1), 33.
Spooren, P., Brockx, B., & Mortelmans, D. (2013). On
the validity of student evaluation of teaching: The
state of the art. Review of Educational Research,
83(4), 598-642.
Stowell, J. R., Addison, W. E., & Smith, J. L. (2012).
Comparison of online and classroom-based
student evaluations of instruction. Assessment &
Evaluation in Higher Education, 37(4), 465-473.
Strauss, M. E., & Smith, G. T. (2009). Construct
validity: Advances in theory and methodology.
Annual Review of Clinical Psychology, 5, 1-25.
Torres Irribarra, D., & Freund, R. (2014). WrightMap:
IRT item-person map with ConQuest integration.
http://github.com/david-ti/wrightmap
Trowler, V. (2010). Student engagement literature
review. The Higher Education Academy, 11(1), 1-
15.
Uttl, B., White, C. A., & Gonzalez, D. W. (2017).
Meta-analysis of faculty's teaching effectiveness:
Student evaluation of teaching ratings and student
learning are not related. Studies in Educational
Evaluation, 54, 22-42.
Van Zile-Tamsen, C. (2017). Using Rasch analysis to
inform rating scale development. Research in
Higher Education, 58(8), 922-933.
Wang, W. C., Chen, P. H., & Cheng, Y. Y. (2004).
Improving measurement precision of test
batteries using multidimensional item response
models. Psychological Methods, 9(1), 116-136.
Worthington, A. C. (2002). The impact of student perceptions and characteristics on teaching evaluations: A case study in finance education. Assessment & Evaluation in Higher Education, 27(1), 49-64.
Yao, Y., & Grady, M. L. (2005). How do faculty make
formative use of student evaluation feedback? A
multiple case study. Journal of Personnel
Evaluation in Education, 18(2), 107.
Young, S., Rush, L., & Shaw, D. (2009). Evaluating
gender bias in ratings of university instructors'
teaching effectiveness. International Journal for the
Scholarship of Teaching and Learning, 3(2), n2.
Zabaleta, F. (2007). The use and misuse of student
evaluations of teaching. Teaching in Higher
Education, 12(1), 55-76.
____________________________
REN LIU is a Ph.D. candidate in the Department of Learning and Instruction, Graduate School of Education, University at Buffalo, State University of New York. He conducts research in the areas of college teaching improvement, measurement development using the Rasch model, program evaluation, and applications of Rasch measurement in higher education.
XIUFENG LIU is a Professor in the Department of
Learning and Instruction, Graduate School of Education,
University at Buffalo, State University of New York. He
conducts research in closely related areas of
measurement and evaluation of STEM education,
applications of Rasch measurement, and STEM teaching
and learning. He was the inaugural director of the Center for Educational Innovation, whose mission is to improve university teaching and student learning. Among the
books he has published is Using and Developing
Measurement Instruments in Science Education: A
Rasch Modeling Approach (Second Edition, 2020,
Information Age Publishing).
LARA HUTSON is a Clinical Associate Professor and
Director of Undergraduate Studies in the Department of
Biological Sciences at the University at Buffalo
(SUNY). She is also coordinator and instructor of the
second-semester introductory majors’ biology course
(Cell Biology) and teaches biochemistry. Dr. Hutson’s
most recent projects aim to reduce D/R/F/W rates in
introductory STEM courses, including Attendance
Tracking and Intervention (funded by the UB President’s
Circle); SUNY Excels, a collaboration between STEM
instructors at the SUNY Centers (funded by the State of
New York); and UB Excite, a course redesign program
(also funded by the State of New York).
Appendix
Part I: Classroom Engagement
Behavioral Engagement
Response options: Strongly agree, Neutral, Disagree, Strongly disagree
1. I always ask content-related questions in class.
2. I always complete assigned readings before coming
to class.
3. I always take notes during class.
4. I always review my notes of previous classes before
coming to the next class.
5. During class time, I regularly work with other
students on work assigned by the instructor.
6. I always discuss my learning progress (e.g., grades, assignments, learning difficulties) with the instructor outside of class.
7. I always pay attention to the instruction during class
time.
8. If I have difficulty understanding something, I always seek additional help.
9. I always contribute to class discussion in class.
10. I always discuss assignments with other students.
11. I have been absent from class fewer than 2 times so far.
Cognitive Engagement
12. I spend more time and effort on this course (e.g.,
assignments, studying, reviewing notes) than on
other courses.
13. I spend much time and effort finding additional resources to help me complete the course work.
14. I clearly understand the value and the importance
of this course for my future learning and career.
15. I work hard in order to meet the instructor's expectations.
16. I fully understand the course content.
17. I memorize the course content (e.g., definitions,
facts, ideas, or methods) and can recall them well.
18. I always try to decompose an idea or theory to
identify its components or elements.
19. I synthesize knowledge (e.g., ideas, information,
experiences) into more comprehensive
interpretations and relationships.
20. I critically examine information, arguments, or methods learned from this course in order to make judgments about their value.
21. I apply new knowledge and skills learned in this
course to solve practical problems.
22. Before beginning a task, I plan for appropriate
strategies and allocate sufficient time.
23. I monitor my learning progress during the course.
24. I consider the instructor’s feedback of my learning
performance carefully and adjust my learning
accordingly.
25. I fully understand my strengths and weaknesses in learning in this course.
Affective Engagement
26. I feel I belong to this course as a learning
community.
27. I feel I have a voice in this course.
28. I feel comfortable talking to the instructor.
29. I feel supported by the instructor.
30. I am enthusiastic about learning new things.
31. I am interested in the course content.
1. Please describe how you have participated and engaged in this course so far.
2. Please suggest how the instructor may maintain and further improve your participation and engagement
during the rest of the course.
Part II: Learning Gains: How much have you learned so far in this course?
Please rate your learning gains in the following aspects from 1 (lowest) to 5 (highest).
Skills and Knowledge
32. Gaining factual knowledge (terminology, classifications,
methods, trends).
33. Learning fundamental principles, generalizations or theories.
34. Learning how to find and use resources for answering
questions or solving problems.
35. Developing specific skills, competencies and points of view
needed by professionals in the field most closely related to
this course.
36. Developing skill in expressing yourself orally or in writing.
37. Learning to communicate with the instructor to improve my
learning.
38. Learning to communicate and work with peers to improve my
learning.
Cognitive
39. Developing creative capacities (writing, inventing, designing).
40. Gaining a broader understanding and appreciation of key
concepts of this course.
41. Learning to analyze and critically evaluate ideas, arguments
and points of view related to the key topics in this course.
42. Learning to synthesize and organize new knowledge into a
more complex and comprehensive way.
43. Learning to apply course material (to improve thinking, problem solving, and decision making).
44. Learning to apply ideas from this class to ideas encountered in
other classes within this subject area.
In general, how much has each of the following aspects of the course helped your cognitive learning gains (Q32-Q44)?
Lecture presentations
Assigned class activities (e.g., discussions, problem solving, case studies)
Graded assignments
Feedback from instructors on my class assignments and exams
Course materials (e.g., textbooks and supplementary
readings)
Examinations and quizzes
Online notes or resources posted by instructor
Interaction with the instructor about your learning
Attitude
45. Acquiring an interest in learning more knowledge and skills from this course, and being interested in taking additional classes in this field.
46. Developing a clearer understanding of, and commitment to,
personal values.
47. Enthusiasm for the subject.
48. Confidence that you understand the course materials.
49. Feeling comfortable in working with complex ideas or tasks
in this field.
50. Willingness to seek help from others (e.g., teacher, peers, TA) when necessary.
In general, how much has each of the following aspects of the
course helped your affective learning gains (Q45-Q50)?
Lecture presentations
Assigned class activities (e.g., discussions, problem solving,
case studies)
Graded assignments
Feedback from instructors on my class assignments and exams
Course materials (e.g., textbooks and supplementary
readings)
Examinations and quizzes
Online notes or resources posted by instructor
Interactions with the instructor about your learning
1. Please describe what you have learned in this course so far.
2. Please suggest how the instructor may maintain and further improve instruction in order to maximize your learning gains during the rest of the course.
Part III: Demographics
If you do not want to respond to any of the questions below, you may skip it.
What is your gender? Male / Female
What is your race? (please choose all that apply) White / Hispanic/Latino / African American / Native American / Asian / Other (please specify)
Are you a first-generation college student? Yes / No
What is your classification? Freshman / Sophomore / Junior / Senior
What are your reasons for registering in this course? Required / Elective / Other