Appendix B — AI-Integrated Assessment Rubric System

B.1 What This System Does

This appendix provides a complete rubric generation pipeline: a generic template that works across disciplines, a structured guide for adapting it to your specific AI modality and discipline, and a worked example showing the full adaptation in practice.

The system is built on a single principle: assess the quality of the student’s thinking about AI, not the quality of the AI’s output. The rubric criteria operationalise the engagement spectrum described in the Assessment chapter, translating the conceptual distinction between collaborative thinking and pure delegation into markable performance levels.

B.1.1 Why Critical Engagement Is Weighted at 30%

This is the differentiator. Any tool can generate fast output. Critical evaluation (challenging AI assumptions, validating findings, recognising when AI is speculating) is the skill worth rewarding. If your rubric rewards only the final product, students who delegate effectively are indistinguishable from students who think deeply. If it rewards critical engagement, they are not.

B.1.2 How the Rubric Connects to Transcript Analysis

The transcript analysis metrics described in the Assessment chapter (Flesch readability scores, turn counts, prompt specificity over time, evidence of pushback) provide a scalable triage layer that sits underneath this rubric. The metrics help markers quickly identify which performance band a student is likely operating in before reading the submission in detail:

| Transcript Signal | Likely Rubric Level |
| --- | --- |
| High turn count, increasing specificity, evidence of pushback | Excellent: genuine dialogue |
| Moderate turns, some follow-ups, reasonable prompt length | Good: reasonable engagement |
| Few turns, short prompts, limited follow-up | Satisfactory: limited interaction |
| Single prompts, no follow-up, very short | Developing: delegation, not conversation |

The metrics are a triage tool, not a verdict. They tell the marker where to look; the rubric tells the marker what to assess.
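
For markers who want to automate this first pass, the sketch below shows one way to compute the transcript signals in the table above. It is a minimal illustration in Python: the transcript format, keyword list, and thresholds are assumptions rather than calibrated values, and a readability measure such as Flesch could be layered on with a library like textstat. The output is a provisional band only; the rubric judgement stays with the marker.

```python
# Minimal triage sketch (assumed format: each transcript is a list of the
# student's prompts, in order). Thresholds and keywords are illustrative only.

PUSHBACK_MARKERS = ("are you sure", "that contradicts", "what evidence",
                    "verify", "double-check", "that seems wrong", "source for")

def transcript_signals(prompts: list[str]) -> dict:
    """Compute rough engagement signals from a list of student prompts."""
    turns = len(prompts)
    words = [len(p.split()) for p in prompts]
    avg_words = sum(words) / turns if turns else 0.0
    # Specificity proxy: do later prompts get longer than earlier ones?
    half = turns // 2
    early = sum(words[:half]) / half if half else 0.0
    late = sum(words[half:]) / (turns - half) if turns > half else 0.0
    pushback = any(m in p.lower() for p in prompts for m in PUSHBACK_MARKERS)
    return {"turns": turns, "avg_prompt_words": avg_words,
            "increasing_specificity": late > early, "pushback": pushback}

def likely_band(signals: dict) -> str:
    """Map signals to a provisional rubric band; the marker still reads the work."""
    if signals["turns"] >= 10 and signals["pushback"] and signals["increasing_specificity"]:
        return "Excellent: genuine dialogue"
    if signals["turns"] >= 5 and signals["avg_prompt_words"] >= 15:
        return "Good: reasonable engagement"
    if signals["turns"] >= 2:
        return "Satisfactory: limited interaction"
    return "Developing: delegation, not conversation"

# Hypothetical example: few turns but clear pushback. Triage says "Satisfactory",
# which is precisely why the metrics point the marker at the transcript rather than grade it.
demo = ["Summarise the organisation's access control policy.",
        "You said MFA is enforced everywhere. Are you sure? The IT manager said contractors are exempt.",
        "What evidence would verify that claim?"]
print(likely_band(transcript_signals(demo)))
```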


B.2 The Generic Template

Five criteria, four performance levels, adaptable to any AI-supported assessment. Start here, then customise using the adaptation guide that follows.

| Criterion | Weight | Excellent | Good | Satisfactory | Developing |
| --- | --- | --- | --- | --- | --- |
| Critical Engagement with AI | 30% | Demonstrates sophisticated evaluation of AI outputs. Clearly articulates when AI is reliable vs. speculative. Actively challenges AI assumptions and validates findings through independent verification. | Shows solid evaluation of AI outputs with some critical reflection. Generally identifies AI limitations. Attempts verification of key claims. | Engages with AI but evaluation is surface-level. Limited critical questioning. Minimal independent verification of AI suggestions. | Minimal or no critical evaluation of AI. Accepts AI outputs without question. No evidence of validation or verification. |
| Quality of Inquiry & Problem-Solving | 25% | Investigation is strategic, well-structured, and adaptive. Clear evidence of iterative refinement. Responds effectively to obstacles or constraints. Shows intellectual curiosity and initiative. | Investigation is generally well-organised with some evidence of adaptation. Most key questions explored systematically. Some refinement visible. | Investigation covers main points but lacks depth or strategic planning. Limited evidence of adaptation to challenges. Inquiry feels mechanical. | Investigation is disorganised or incomplete. Minimal evidence of planning or adaptability. Key areas unexplored. |
| Conversation Quality & Accountability | 20% | AI conversation transcript shows genuine dialogue: challenging assumptions, asking follow-ups, refining outputs, and steering toward specific context. Student can explain all decisions and takes clear accountability for final work. Evidence of conversation, not delegation. | Conversation transcript shows reasonable engagement with AI. Follow-up questions present. Generally explains how outputs were validated. Takes appropriate responsibility for work. | Conversation transcript shows limited interaction. Few follow-up questions or challenges. Some evidence of refinement but mostly accepts initial outputs. Accountability is unclear in places. | Conversation shows single prompts with no follow-up. No evidence of critical engagement or refinement. Unclear who is accountable for outputs. |
| Communication & Clarity | 15% | Clear, coherent, professionally structured. Complex ideas are explained accessibly. Findings/recommendations are well-justified and easy to follow. Appropriate for intended audience. | Generally clear and well-organised. Most ideas are explained adequately. Findings supported by reasoning. Generally appropriate tone/style. | Communication is adequate but may lack clarity in places. Ideas present but not always well-connected. Some findings need stronger justification. | Communication is unclear or disorganised. Difficulty following main ideas. Minimal justification for conclusions. |
| Integration of Disciplinary Knowledge | 10% | Meaningfully connects AI engagement to unit concepts. Demonstrates how theoretical/practical knowledge informs critical evaluation of AI. Shows synthesis of learning. | Connects AI work to unit content. Shows understanding of how discipline-specific knowledge applies to AI evaluation. | Makes some connections to unit content but links are basic or limited. | Few or no connections between AI work and disciplinary learning. |

B.2.1 Mapping to the Engagement Spectrum

The Conversation Quality & Accountability criterion directly operationalises the engagement spectrum from the Assessment chapter:

| Engagement Level | What It Looks Like in the Transcript | Rubric Level |
| --- | --- | --- |
| Genuine collaborative thinking | Student drives inquiry, pushes back on AI, iterates toward own understanding | Excellent |
| Guided drafting | Student provides direction, evaluates critically, modifies toward coherent submission | Good |
| Curated delegation | Student uses AI to produce submission, exercises some judgment about what passes | Satisfactory |
| Pure delegation | Student hands task to AI, submits with minimal engagement | Developing |

This mapping gives markers a concrete vocabulary for discussing student work. Rather than debating whether a submission “feels like AI,” the marker asks: where on the engagement spectrum does the evidence place this student?

B.2.2 General Scoring

  • Excellent (85–100%): Sophisticated, independent thinking; strong evidence of learning outcomes
  • Good (75–84%): Solid performance; demonstrates competence; minor areas for development
  • Satisfactory (65–74%): Meets minimum expectations; completes task; some areas lack depth
  • Developing (<65%): Below expectations; significant gaps in performance or understanding
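
Where criterion scores are recorded numerically, the overall band is simply a weighted average checked against these cut-offs. The sketch below is a minimal illustration using the generic template weights; the criterion keys and the 0–100 marking scale are assumptions about how scores might be recorded, not a prescribed workflow.

```python
# Weighted total and band mapping for the generic template (0-100 per criterion).
# The key names are illustrative; adapt them to your renamed criteria.

GENERIC_WEIGHTS = {
    "critical_engagement": 0.30,
    "inquiry_quality": 0.25,
    "conversation_quality": 0.20,
    "communication": 0.15,
    "disciplinary_knowledge": 0.10,
}

def overall_band(scores: dict[str, float], weights=GENERIC_WEIGHTS) -> tuple[float, str]:
    """Return the weighted total (0-100) and the band it falls into."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 100%"
    total = sum(scores[criterion] * weight for criterion, weight in weights.items())
    if total >= 85:
        band = "Excellent"
    elif total >= 75:
        band = "Good"
    elif total >= 65:
        band = "Satisfactory"
    else:
        band = "Developing"
    return total, band

# Example: strong critical engagement, weaker communication -> (78.75, "Good").
print(overall_band({"critical_engagement": 88, "inquiry_quality": 80,
                    "conversation_quality": 78, "communication": 65,
                    "disciplinary_knowledge": 70}))
```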

B.2.3 Academic Integrity Notes

Make clear to students:

  • AI use must be disclosed and documented
  • Students remain responsible for all outputs, regardless of AI involvement
  • Critical evaluation of AI is a valued skill, not “cheating”
  • Honest AI conversation transcripts (including dead ends) demonstrate integrity, not weakness

For the broader integrity framework that supports this approach, including why transparency-based assessment is more defensible than detection-based assessment, see Section 8.2 and Section 8.13.


B.3 Adapting the Template

B.3.1 For Different AI Modalities

The generic template works as-is for most contexts, but the Critical Engagement criterion should be renamed and its performance descriptors adjusted to match the specific AI modality students are using.

B.3.1.1 AI Chatbot Simulations (e.g., Role-Playing Employees)

Best for: Business analysis, consulting, auditing, investigative work

Rename: “Critical Engagement with AI” → “Evaluation of AI-Generated Intelligence”

Excellent Performance Indicators:

  • Distinguishes between reliable and speculative AI responses
  • Adapts questioning strategy based on source credibility
  • Recognises when AI is operating outside its domain
  • Cross-validates information between multiple AI agents
  • Shows sophisticated prompting (follow-ups, clarifications)

Example Rubric Cell (Excellent):

“Demonstrates sophisticated judgement about different AI agents. Clearly explains which sources are reliable for different topics and why. Strategically sequences conversations and asks targeted follow-ups. Cross-validates findings. Recognises when AI is speculating vs. reporting factual information.”

B.3.1.2 LLM as Writing/Research Assistant

Best for: Essays, reports, research papers, creative writing

Rename: “Critical Engagement with AI” → “Validation & Integration of AI-Generated Content”

Excellent Performance Indicators:

  • Verifies factual claims made by LLM
  • Identifies AI hallucinations or overgeneralisations
  • Integrates AI suggestions into own argument, not replacing it
  • Shows evidence of iterative refinement (multiple drafts, prompts)
  • Clearly distinguishes AI-suggested ideas from own thinking

Example Rubric Cell (Excellent):

“Critically evaluates all LLM outputs before integration. Fact-checks key claims against sources. Uses AI for brainstorming/drafting but substantially refines all output. Clear evidence of multiple prompts and iterative refinement. Distinguishes own analysis from AI suggestions throughout.”

Example Rubric Cell (Developing):

“Accepts LLM output with minimal validation. No evidence that AI claims were fact-checked. Large portions appear to be unedited AI generation. Difficult to distinguish student thinking from AI output.”

B.3.1.3 RAG Systems or AI-Assisted Data Analysis

Best for: Data science, research, business analytics

Rename: “Critical Engagement with AI” → “Understanding & Validating AI Data Interpretation”

Excellent Performance Indicators:

  • Understands what data/sources the AI is drawing from
  • Questions AI interpretations; considers alternative explanations
  • Validates AI findings against raw data when possible
  • Recognises limitations of training data or AI knowledge cutoff
  • Acknowledges uncertainty where appropriate

Weights might shift:

  • Critical Engagement: 35% (data validation is crucial)
  • Communication: 20% (must explain methodology clearly)
  • Integration of Knowledge: 10% (same)

B.3.1.4 AI Code Assistant / Programming Helper

Best for: Computer science, software engineering

Rename: “Critical Engagement with AI” → “Code Validation & Testing of AI-Generated Solutions”

Excellent Performance Indicators:

  • Tests all AI-generated code before integrating
  • Understands logic of generated code; can explain it
  • Recognises when AI solution is inefficient or incorrect
  • Modifies/improves AI output rather than blindly using it
  • Documents which parts were AI-assisted and which were manual

Weights might shift:

  • Critical Engagement: 35% (testing/validation essential)
  • Quality of Inquiry: 15% (different meaning in this context)
  • Communication: 20% (code comments, documentation)
  • Conversation Quality: 10% (how decisions were made)

B.3.1.5 Multimodal AI (Image Generation, Video Tools, etc.)

Best for: Creative fields, design, media studies

Rename: “Critical Engagement with AI” → “Creative Direction & Critical Evaluation of AI-Generated Media”

Excellent Performance Indicators:

  • Clearly communicates intention to AI (via prompts, iterations)
  • Recognises aesthetic/conceptual limitations of AI output
  • Substantially modifies/refines AI output rather than using as-is
  • Makes intentional creative decisions about when/how to use AI
  • Conversation transcript shows iterative creative direction

B.3.1.6 Oral Exams / Viva Voce

Best for: High-stakes assessment verification, professional communication development, cohorts of approximately 25 students or fewer

Rename: “Critical Engagement with AI” → “Depth of Understanding & Ability to Defend”

Excellent Performance Indicators:

  • Explains concepts clearly without relying on notes or memorised phrasing
  • Responds to follow-up questions with nuance and relevant examples
  • Connects ideas across multiple topics or readings
  • Acknowledges limitations and considers alternative perspectives
  • Communicates with confidence, clarity, and appropriate vocabulary

Additional Criteria to Consider:

| Criterion | What to Assess |
| --- | --- |
| Understanding | Depth and accuracy of knowledge, ability to go beyond surface-level recall |
| Argument | Ability to articulate a position and respond to counter-arguments |
| Evidence | Relevance and accuracy of examples used to support claims |
| Structure & Coherence | Logical progression of ideas, accessibility to the listener |
| Speaking Skills | Clarity, eye contact, confidence, vocabulary, minimal verbal clutter |

Weights might shift:

  • Ability to Defend: 35% (the core of the format)
  • Quality of Reasoning: 25% (synthesis and argumentation)
  • Evidence & Examples: 20% (supporting claims)
  • Communication: 20% (verbal delivery and structure)

This modality connects directly to the Tier 1 oral checkpoint described in the Assessment chapter. Even a five-minute conversation after submission closes the most significant integrity gap.

B.3.2 For Different Disciplines

Adjust the Integration of Disciplinary Knowledge criterion to match your field:

Business/Management:

  • How does this analysis align with strategic frameworks and ethical considerations?
  • Add/adjust: “Strategic Application” criterion
  • Does the student consider stakeholder perspectives?

STEM/Sciences:

  • How does the methodology reflect scientific reasoning and experimental validity?
  • Add/adjust: “Methodological Rigour” criterion
  • Are limitations and uncertainties clearly articulated?

Humanities/Social Sciences:

  • How does this argument engage with theoretical traditions and interpretive methods?
  • Add/adjust: “Interpretive Depth” criterion
  • Does the student engage with multiple perspectives?

Law/Professional Practice:

  • Are ethical implications considered? Is accountability clear?
  • Add/adjust: “Professional Ethics & Accountability” criterion
  • Does the student recognise when AI judgement conflicts with professional standards?

B.3.3 Quick Reference: Weighting by Context

| Context | Critical Engagement | Inquiry Quality | Conversation Quality | Communication | Disciplinary Depth |
| --- | --- | --- | --- | --- | --- |
| Chatbot Simulation | 35% | 25% | 15% | 15% | 10% |
| LLM Writing Tool | 30% | 15% | 20% | 20% | 15% |
| Data Analysis | 35% | 20% | 15% | 20% | 10% |
| Code/Programming | 35% | 15% | 10% | 20% | 20% |
| Creative/Media | 30% | 25% | n/a | 20% | 20% |
| Generic Template | 30% | 25% | 20% | 15% | 10% |

Wait, correction to the table above: the Creative/Media row is 30% / 20% / 20% / 15% / 15%, and the Oral Exam/Viva row is 35% / 25% / n/a / 20% / 20%.

| Context | Critical Engagement | Inquiry Quality | Conversation Quality | Communication | Disciplinary Depth |
| --- | --- | --- | --- | --- | --- |
| Chatbot Simulation | 35% | 25% | 15% | 15% | 10% |
| LLM Writing Tool | 30% | 15% | 20% | 20% | 15% |
| Data Analysis | 35% | 20% | 15% | 20% | 10% |
| Code/Programming | 35% | 15% | 10% | 20% | 20% |
| Creative/Media | 30% | 20% | 20% | 15% | 15% |
| Oral Exam/Viva | 35% | 25% | n/a | 20% | 20% |
| Generic Template | 30% | 25% | 20% | 15% | 10% |

For the Oral Exam/Viva row, the AI conversation-transcript criterion does not apply; its weight is redistributed as described in Section B.3.1.6.
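
A minimal sketch for anyone maintaining these variants as data: the profiles above restated as a dictionary with a sum check, so any adjusted variant still totals 100%. The key names, tuple ordering, and the use of None for the non-applicable viva criterion are assumptions for illustration.

```python
# Quick-reference weighting profiles as data, with a sanity check.
# Tuple order: critical engagement, inquiry, conversation, communication, disciplinary depth.
# None marks a criterion that does not apply to that modality.

WEIGHT_PROFILES = {
    "chatbot_simulation": (35, 25, 15, 15, 10),
    "llm_writing_tool":   (30, 15, 20, 20, 15),
    "data_analysis":      (35, 20, 15, 20, 10),
    "code_programming":   (35, 15, 10, 20, 20),
    "creative_media":     (30, 20, 20, 15, 15),
    "oral_exam_viva":     (35, 25, None, 20, 20),
    "generic_template":   (30, 25, 20, 15, 10),
}

for name, weights in WEIGHT_PROFILES.items():
    total = sum(w for w in weights if w is not None)
    assert total == 100, f"{name} weights sum to {total}, expected 100"
```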

B.4 Worked Example: CloudCore Audit Rubric

This example shows the generic template fully adapted for a specific assessment and demonstrates the adaptation process in practice.

Context: Students conduct an audit of a simulated AI-driven organisation by interacting with AI “employees” (chatbots) and using LLMs as research tools. This rubric balances investigation quality, critical AI evaluation, and professional reporting.

| Criterion | Weight | Excellent | Good | Satisfactory | Developing |
| --- | --- | --- | --- | --- | --- |
| Critical Evaluation of AI Sources | 30% | Demonstrates sophisticated judgement about different AI “employees.” Clearly explains which sources are reliable for different questions and why. Questions inconsistencies between AI responses. Validates findings through multiple sources or independent verification. Recognises AI speculation vs. factual responses. | Shows good discernment among AI employees. Generally identifies which sources are stronger for different topics. Attempts to cross-reference findings. Recognises some AI limitations. | Engages with multiple AI employees but evaluation is surface-level. Limited comparison between sources. Accepts contradictions without investigation. Minimal awareness of AI unreliability. | Treats all AI employees as equally reliable. No attempt to validate information across sources. Accepts AI responses uncritically. |
| Audit Investigation Strategy & Execution | 25% | Investigation plan is systematic and adaptive. Clear evidence of strategic sequencing of AI interviews (e.g., starting with CFO for financials, then IT Manager for technical risks). Responds creatively to access constraints and cancellations. Shows persistence in clarifying unclear answers. Iterative refinement visible. | Investigation follows a logical structure. Manages scheduled appointments effectively. Asks follow-up questions when unclear. Generally adapts to constraints. Some evidence of iterative inquiry. | Investigation covers main areas but lacks clear strategy. Accepts initial AI responses without deep follow-up. Limited adaptation to access challenges. Feels somewhat ad-hoc. | Investigation is disorganised or incomplete. Minimal engagement with access scheduling. Key areas unexplored or surface-level inquiry. |
| Professional Use of LLM as Research Tool | 20% | Critically reviews LLM outputs before integrating into audit. Clearly distinguishes own analysis from AI-generated content. Uses LLM strategically (e.g., drafting frameworks, checking logic) rather than passively accepting output. Shows evidence of multiple prompts/refinement. Acknowledges where LLM was helpful or speculative. | Generally validates LLM suggestions against evidence. Most AI outputs are appropriately reviewed. Clear attribution of own vs. AI work. Some evidence of iterative prompting. | Integrates LLM output with limited critical review. Attribution of sources is sometimes unclear. Limited evidence of validation or iterative refinement. | Heavily relies on unvalidated LLM output. No clear distinction between own and AI-generated content. Minimal critical review. |
| Audit Report: Professional Communication | 15% | Report is clear, well-structured, and professionally formatted. Complex security findings are explained accessibly to non-technical stakeholders. Recommendations are specific and justified. Transparent about methodology (how AI was used, limitations encountered). | Report is well-organised and clearly written. Findings are generally well-explained. Recommendations are reasonable. Mentions methodology but could be more transparent. | Report covers main findings but organisation could be clearer. Some findings lack sufficient justification. Recommendations are general. Limited transparency about AI use in methodology. | Report is difficult to follow or incomplete. Findings poorly explained. Recommendations are vague or unjustified. No mention of methodology or AI use. |
| AI Conversation Quality & Professional Judgement | 10% | Conversation transcripts show genuine dialogue with AI: challenging responses, following up on inconsistencies, refining questions based on previous answers. Student clearly articulates how own judgement validated, rejected, or refined AI suggestions. Takes clear accountability for final audit conclusions. | Conversation transcripts show reasonable engagement. Follow-up questions present. Explains how own expertise was applied to evaluate AI outputs. Generally accountable for work product. | Conversation transcripts show limited interaction. Few follow-up questions. Limited evidence of steering or refining the AI conversation. Accountability is somewhat unclear. | Conversation shows single prompts with no follow-up. No evidence of critical engagement. Unclear who is responsible for audit conclusions. |

B.4.1 Performance Level Descriptors

Excellent (85–100%):

  • Student demonstrates sophisticated professional judgement about which AI sources are credible for different investigation areas
  • Audit strategy is evident and strategic; clear sequencing of inquiry
  • All AI outputs are meaningfully evaluated; student shows independent reasoning throughout
  • Report would be valuable to actual security leadership
  • Conversation transcripts show genuine learning about managing AI as a tool

Good (75–84%):

  • Student shows competent judgement about AI reliability; identifies key differences between sources
  • Investigation is reasonably strategic and organised
  • Most AI outputs are reviewed; some independent analysis visible
  • Report is professionally written and would be useful to stakeholders
  • Conversation shows adequate critical engagement with AI

Satisfactory (65–74%):

  • Student engages with AI but evaluation lacks sophistication; treats sources somewhat generically
  • Investigation covers main areas but strategy is not always clear
  • AI outputs are sometimes accepted with limited review
  • Report communicates findings but could be more polished
  • Conversation engagement is present but limited in depth

Developing (<65%):

  • Student does not meaningfully evaluate AI reliability; little critical engagement
  • Investigation is incomplete or disorganised; key areas missing
  • AI outputs heavily influence conclusions with minimal independent review
  • Report is unclear or incomplete
  • Conversation engagement is minimal or absent

B.4.2 Usage Notes for Markers

Before Students Start:

  • Clearly explain that this rubric rewards critical thinking about AI, not just efficient tool use
  • Emphasise that asking good follow-up questions of AI employees (and the LLM) is an evaluated skill
  • Normalise that showing where AI misled you (visible in the transcript) is valued equally to successful findings

During Marking:

  • Look for evidence in the audit transcript: Can you see quality prompting? Validation? Follow-ups?
  • Check the LLM transcript (if submitted): Does it show iterative refinement or one-shot prompting?
  • Read the conversation transcript carefully: This is where students reveal their critical judgement. Look for conversation, not delegation
  • Use the transcript analysis metrics from the Assessment chapter as a first-pass triage before reading in detail

Adapting to Other Audit Contexts:

This rubric works well for ISO 27001 readiness audits, tech vendor assessments, process efficiency reviews, or any scenario where students interview AI agents and synthesise findings. Adjust the specific domain knowledge criterion and the types of findings expected.


B.5 Creating Your Own Variant

B.5.1 Step 1: Answer These Questions

  • What AI tool(s) will students actually use?
  • What matters most in your discipline?
  • What would a professional in your field do with AI?
  • What skills do you want students to develop?
  • Where on the engagement spectrum (see the Assessment chapter) do you want students operating, and how will you know?

B.5.2 Step 2: Customise in This Order

  1. Rename criteria to match actual activities
  2. Adjust weights based on importance (use the Quick Reference table as a starting point)
  3. Rewrite performance descriptors using examples from your field
  4. Add discipline-specific criterion if the generic five do not capture what matters
  5. Add any specific constraints or requirements (e.g., “code must be tested,” “citations required”)

B.5.3 Step 3: Share Early with Students

  • Include with assignment briefing
  • Walk through an “Excellent” example specific to your unit
  • Clarify what you mean by critical engagement in your context
  • Share the rubric before students start work, not after

B.5.4 Step 4: Test and Calibrate

  • Apply your version to 2–3 sample student submissions (real or imagined)
  • Calibrate scoring with colleagues to ensure consistency
  • Collect student feedback: Was the rubric clear? Did it help them understand expectations?
  • Build a portfolio of marked examples for future years

B.6 Implementation Checklist

When adapting for your specific assessment:


B.7 Common Mistakes to Avoid

  • Don’t make the rubric so specific it cannot be adapted. Do include adaptability notes and examples.
  • Don’t ignore the discipline (one-size-fits-all rarely works). Do customise weights and descriptors to your field.
  • Don’t focus only on tool efficiency (faster is not better). Do reward critical thinking about AI, not speed of output.
  • Don’t punish AI use in ways that incentivise hiding it. Do make transparency and critical evaluation rewarded behaviours.
  • Don’t assume all students know how to “prompt well.” Do teach and demonstrate effective AI engagement first. This is especially important if the process mark carries significant weight; see the AI literacy prerequisite discussed in the Assessment chapter.

B.8 Questions to Ask Before Finalising

  • Will this rubric encourage the thinking you want?
  • Can colleagues easily adapt this for their units?
  • Does it make clear what “critical engagement with AI” means in your context?
  • Have you tested it with actual student work?
  • Is the language accessible to students (or does it need translation)?
  • Does it align with the engagement spectrum? Can a marker use it to place a student on that spectrum from the evidence?

B.9 Further Reading

  • Dawson, P. (2017). Assessment rubrics: Towards clearer and more replicable design, research and practice. Assessment & Evaluation in Higher Education, 42(3), 347–360.
  • Perkins, M., Furze, L., Roe, J., & MacVaugh, J. (2024). The Artificial Intelligence Assessment Scale (AIAS): A framework for ethical integration of generative AI in educational assessment. Journal of University Teaching and Learning Practice, 21(6).
  • Boud, D., Ajjawi, R., Dawson, P., & Tai, J. (Eds.). (2018). Developing evaluative judgement in higher education: Assessment for knowing and producing quality work. Routledge.
  • Bearman, M., Tai, J., Dawson, P., Boud, D., & Ajjawi, R. (2024). Developing evaluative judgement for a time of generative artificial intelligence. Assessment & Evaluation in Higher Education, 49(6), 893–905.
  • Villarroel, V., Bloxham, S., Bruna, D., Bruna, C., & Herrera-Seda, C. (2018). Authentic assessment: Creating a blueprint for course design. Assessment & Evaluation in Higher Education, 43(5), 840–854.
  • Swiecki, Z., Khosravi, H., Chen, G., Martinez-Maldonado, R., Lodge, J. M., Milligan, S., Selwyn, N., & Gasevic, D. (2022). Assessment in the age of artificial intelligence. Computers and Education: Artificial Intelligence, 3, Article 100075.
  • Mollick, E. R., & Mollick, L. (2023). Assigning AI: Seven approaches for students, with prompts. arXiv preprint.