Erfan Nourbakhsh | Publications

Are LLM Benchmarks Already Contaminated? A Systematic Review of Contamination Detection Methods Published

E Nourbakhsh, MS Sirjani, A Mousavi, K Nguyen, J Quarles, M Xie, R Slavin

GEM Workshop — The 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026) | Oral Paper

Paper

Abstract

Large Language Models (LLMs) are trained on web-scale corpora, increasing the risk that benchmark test data appears in training sets and inflates reported performance. We present a systematic literature review of 55 studies on LLM benchmark contamination through late 2025. Our contributions are: (1) a four-tier contamination taxonomy (Exact, Syntactic, Semantic, Task-Level; T1–T4); (2) a comparative analysis of five detection families (string-matching, likelihood-based, membership inference, LLM-prompted detection, and benchmark auditing), including access assumptions and failure modes; (3) a synthesis of contamination evidence on MMLU, GSM8K, HumanEval, and HellaSwag by measurement construct; (4) a comparative evaluation of mitigation strategies across lifecycle points, access assumptions, and evidence maturity; and (5) a Contamination Transparency Card (CTC) framework for future releases. Across studies, no detection method is consistently reliable across contamination tiers, model-access settings, and training stages. We identify instruction tuning as a persistent blind spot, note that RL/post-training contamination auditing is only beginning to mature, and report inflation estimates spanning roughly 6%–40% under benchmark- and setting-dependent assumptions.
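
As a rough illustration of the string-matching detection family named in the abstract, the sketch below scores how many word n-grams of a benchmark item also occur in a candidate training document. It is not taken from the paper: the function names, whitespace tokenization, lowercasing, and the default n = 8 are all assumptions chosen for brevity.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in the document.

    Hypothetical helper: a high ratio hints at exact (verbatim) overlap,
    but says nothing about paraphrased or task-level contamination.
    """
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n tokens, so there is nothing to compare.
        return 0.0
    return len(item_grams & word_ngrams(corpus_doc, n)) / len(item_grams)
```

A check of this kind can only catch verbatim or near-verbatim overlap, which is consistent with the abstract's point that no single detection method is reliable across all contamination tiers.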

BibTeX

@inproceedings{nourbakhsh-etal-2026-llm,
    title = "Are {LLM} Benchmarks Already Contaminated? A Systematic Review of Contamination Detection Methods",
    author = "Nourbakhsh, Erfan  and
      Sirjani, Mohammad Sadegh  and
      Mousavi, Amir  and
      Nguyen, Khoa  and
      Quarles, John  and
      Xie, Mimi  and
      Slavin, Rocky",
    editor = "Mille, Simon  and
      Gehrmann, Sebastian  and
      Schmidtov{\'a}, Patr{\'i}cia  and
      Du{\v{s}}ek, Ond{\v{r}}ej  and
      Fadaee, Marzieh  and
      Lo, Kyle  and
      Santus, Enrico  and
      Stanovsky, Gabriel",
    booktitle = "Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics ({GEM})",
    month = jul,
    year = "2026",
    address = "San Diego, California, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.gem-main.50/",
    doi = "10.18653/v1/2026.gem-main.50",
    pages = "518--539",
    ISBN = "979-8-89176-423-1",
    abstract = "Large Language Models (LLMs) are trained on web-scale corpora, increasing the risk that benchmark test data appears in training sets and inflates reported performance. We present a systematic literature review of 55 studies on LLM benchmark contamination through late 2025. Our contributions are: (1) a four-tier contamination taxonomy (Exact, Syntactic, Semantic, Task-Level; T1{--}T4); (2) a comparative analysis of five detection families (string-matching, likelihood-based, membership inference, LLM-prompted detection, and benchmark auditing), including access assumptions and failure modes; (3) a synthesis of contamination evidence on MMLU, GSM8K, HumanEval, and HellaSwag by measurement construct; (4) a comparative evaluation of mitigation strategies across lifecycle points, access assumptions, and evidence maturity; and (5) a Contamination Transparency Card (CTC) framework for future releases. Across studies, no detection method is consistently reliable across contamination tiers, model-access settings, and training stages. We identify instruction tuning as a persistent blind spot, note that RL/post-training contamination auditing is only beginning to mature, and report inflation estimates spanning roughly 6{\%}{--}40{\%} under benchmark- and setting-dependent assumptions."
}

When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG Published

E Nourbakhsh, R Slavin, K Yang, A Rios

BioNLP Workshop — The 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

DOI: 10.48550/arXiv.2606.04127

Paper GitHub

Abstract

Medical question answering is a high-stakes setting where factual errors can have serious consequences. Retrieval-augmented generation (RAG) is widely viewed as a promising solution, and prior work has reported substantial gains for large medical QA models. We revisit this assumption across a broad range of open-weight instruction-tuned models spanning 7B to 72B parameters. Across five models, ten biomedical QA datasets, four retrieval methods, and four retrieval corpora, we find that retrieval yields only small and inconsistent improvements over a no-retrieval baseline, typically within 1-2 points. In contrast, the choice of backbone model has a much larger effect than the choice of retriever or corpus, and expert and layman retrieval sources perform similarly in most settings. These results suggest that the main bottleneck is not retrieval quality alone, but the model's limited ability to use retrieved evidence effectively.

BibTeX

@inproceedings{nourbakhsh-etal-2026-retrieval,
    title = "When Retrieval Doesn{'}t Help: A Large-Scale Study of Biomedical {RAG}",
    author = "Nourbakhsh, Erfan  and
      Slavin, Rocky  and
      Yang, Ke  and
      Rios, Anthony",
    editor = "Demner-Fushman, Dina  and
      Ananiadou, Sophia  and
      Roberts, Kirk  and
      Tsujii, Junichi",
    booktitle = "{B}io{NLP} 2026",
    month = jul,
    year = "2026",
    address = "San Diego, California",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.bionlp-1.72/",
    doi = "10.18653/v1/2026.bionlp-1.72",
    pages = "890--910",
    ISBN = "979-8-89176-434-7",
    abstract = "Medical question answering is a high-stakes setting where factual errors can have serious consequences. Retrieval-augmented generation (RAG) is widely viewed as a promising solution, and prior work has reported substantial gains for large medical QA models. We revisit this assumption across a broad range of open-weight instruction-tuned models spanning 7B to 72B parameters. Across five models, ten biomedical QA datasets, four retrieval methods, and four retrieval corpora, we find that retrieval yields only small and inconsistent improvements over a no-retrieval baseline, typically within 1{--}2 points. In contrast, the choice of backbone model has a much larger effect than the choice of retriever or corpus, and expert and layman retrieval sources perform similarly in most settings. These results suggest that the main bottleneck is not retrieval quality alone, but the model{'}s limited ability to use retrieved evidence effectively."
}

MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking Data Under Review

A Mousavi, MS Sirjani, E Nourbakhsh, M Xie, R Slavin, L Neely, J Davis, J Quarles

IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2026)

DOI: 10.48550/arXiv.2605.22775

Paper GitHub(Soon)

Abstract

Real-time cognitive load assessment from eye-tracking signals could enable adaptive human-centered AI in safety-critical applications such as driver vigilance monitoring or automated flight deck assistance, yet two challenges persist: handling frequent data missingness from blinks and tracking failures, and efficiently modeling long-range temporal dependencies. We propose MambaGaze, a framework that addresses these challenges through (1) XMD encoding, which augments raw features with observation masks and time-deltas to explicitly model data uncertainty, and (2) bidirectional Mamba-2, which captures temporal dependencies with linear computational complexity. Experiments on the CLARE and CL-Drive datasets under leave-one-subject-out evaluation show that MambaGaze achieves 76.8% and 73.1% accuracy, respectively, outperforming CNN, Transformer, ResNet, and VGG baselines by 4-12 percentage points. Edge deployment benchmarks on NVIDIA Jetson platforms demonstrate real-time inference at 43-68 FPS with power consumption below 7.5 W, confirming feasibility for wearable cognitive load monitoring.
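
The idea of augmenting features with observation masks and time-deltas can be sketched for a single feature channel as follows. This is a minimal illustration under assumptions, not the paper's XMD implementation: the function name `mask_delta_encode`, the zero fill value, and the definition of the delta as time elapsed since the last observed sample are hypothetical choices borrowed from common missing-data encodings.

```python
from typing import List, Optional, Tuple


def mask_delta_encode(
    values: List[Optional[float]],
    times: List[float],
    fill: float = 0.0,
) -> List[Tuple[float, int, float]]:
    """Encode one channel as (value, mask, delta) triples.

    mask  = 1 if the sample was observed, 0 if missing (e.g. a blink).
    delta = time since the most recent observed sample (0.0 before any).
    Missing values are replaced by `fill` (an assumed placeholder).
    """
    if len(values) != len(times):
        raise ValueError("values and times must have the same length")

    encoded = []
    last_observed_time: Optional[float] = None
    for value, t in zip(values, times):
        mask = 0 if value is None else 1
        delta = 0.0 if last_observed_time is None else t - last_observed_time
        encoded.append((fill if value is None else value, mask, delta))
        if mask:
            # Only real observations reset the elapsed-time counter.
            last_observed_time = t
    return encoded
```

With this layout a downstream sequence model sees not only the filled value but also whether it was measured and how stale the last real measurement is.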

BibTeX

@misc{mousavi2026mambagazebidirectionalmambaexplicit,
      title={MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking Data},
      author={Amir Mousavi and Mohammad Sadegh Sirjani and Erfan Nourbakhsh and Mimi Xie and Rocky Slavin and Leslie Neely and John Davis and John Quarles},
      year={2026},
      eprint={2605.22775},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.22775},
}

CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation Under Review

A Mousavi, E Nourbakhsh, MS Sirjani, M Xie, R Slavin, L Neely, J Davis, J Quarles

IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2026)

DOI: 10.48550/arXiv.2605.22774

Paper GitHub(Soon)

Abstract

Real-time cognitive load assessment is essential for adaptive human-computer interaction but remains challenging due to limited labeled data and poor cross-subject generalization. Recent ECG foundation models pre-trained on millions of clinical recordings offer rich representations, but cannot be directly applied to wearable devices due to sensor configuration mismatch and task differences. In this paper, we propose CogAdapt, a framework that adapts clinical ECG foundation models to wearable cognitive load assessment. CogAdapt introduces LeadBridge, a learnable adapter that transforms 3-lead wearable signals into anatomically consistent 12-lead representations, and ProFine, a progressive fine-tuning strategy that gradually unfreezes encoder layers while preventing catastrophic forgetting. Evaluations on two public datasets (CLARE and CL-Drive) under leave-one-subject-out cross-validation show that CogAdapt substantially outperforms baselines trained from scratch, achieving macro-F1 scores of 0.626 and 0.768. These results demonstrate the promise of foundation model adaptation for subject-independent cognitive load assessment from wearable sensors.

BibTeX

@misc{mousavi2026cogadapttransferringclinicalecg,
      title={CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation},
      author={Amir Mousavi and Erfan Nourbakhsh and Mohammad Sadegh Sirjani and Mimi Xie and Rocky Slavin and Leslie Neely and John Davis and John Quarles},
      year={2026},
      eprint={2605.22774},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.22774},
}

Prompting Underestimates LLM Capability for Time Series Classification Under Review

D Schumacher, E Nourbakhsh, R Slavin, A Rios

The 2026 Conference on Empirical Methods in Natural Language Processing (EMNLP 2026)

DOI: 10.48550/arXiv.2601.03464

Paper GitHub(Soon)

Abstract

Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure. We show that this conclusion reflects limitations of prompt-based generation rather than the model's representational capacity by directly comparing prompt outputs with linear probes over the same internal representations. While zero-shot prompting performs near chance, linear probes improve average F1 from 0.15-0.26 to 0.61-0.67, often matching or exceeding specialized time series models. Layer-wise analyses further show that class-discriminative time series information emerges in early transformer layers and is amplified by visual and multimodal inputs. Together, these results demonstrate a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading current evaluations to underestimate their time series understanding.

BibTeX

@misc{schumacher2026promptingunderestimatesllmcapability,
      title={Prompting Underestimates LLM Capability for Time Series Classification},
      author={Dan Schumacher and Erfan Nourbakhsh and Rocky Slavin and Anthony Rios},
      year={2026},
      eprint={2601.03464},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.03464},
}

Optimizing Task Scheduling in Fog Computing with Deadline Awareness Published

MS Sirjani, M Ahmad, A Mousavi, E Nourbakhsh, K Nguyen

2026 IEEE 2nd International Conference on Secure IoT, Assured and Trusted Computing (SATC)

DOI: 10.1109/SATC69565.2026.11542230

Paper

Abstract

The rise of Internet of Things (IoT) devices has led to the development of numerous time-sensitive applications that require quick responses and low latency. Fog computing has emerged as a solution for processing these IoT applications, but it faces challenges such as resource allocation and job scheduling. Therefore, it is crucial to determine how to assign and schedule tasks on Fog nodes. This work aims to schedule tasks in IoT while minimizing the total energy consumption of nodes and enhancing the Quality of Service (QoS) requirements of IoT tasks, taking into account task deadlines. This paper classifies Fog nodes into two categories based on their traffic level: low and high. It schedules short-deadline tasks on low-traffic nodes using an Improved Golden Eagle Optimization (IGEO) algorithm, an enhancement that utilizes genetic operators for discretization. Long-deadline tasks are processed on high-traffic nodes using reinforcement learning (RL). This combined approach is called the Reinforcement Improved Golden Eagle Optimization (RIGEO) algorithm. Experimental results demonstrate that RIGEO achieves up to a 29% reduction in energy consumption, up to an 86% improvement in response time, and up to a 19% reduction in deadline violations compared to state-of-the-art algorithms.

BibTeX

@INPROCEEDINGS{11542230,
  author={Sirjani, Mohammad Sadegh and Ahmad, Mohammad and Mousavi, Amir and Nourbakhsh, Erfan and Nguyen, Khoa},
  booktitle={2026 IEEE 2nd International Conference on Secure IoT, Assured and Trusted Computing (SATC)},
  title={Optimizing Task Scheduling in Fog Computing with Deadline Awareness},
  year={2026},
  pages={1-5},
  keywords={Internet of Things;Timing;Schedules;Scheduling;Optimization;Equations;Energy consumption;Algorithms;Tagging;Printing;Internet of Things;Fog Computing;Job Scheduling;Golden Eagle Optimization Algorithm;Reinforcement Learning},
  doi={10.1109/SATC69565.2026.11542230}
}

Unveiling User Perceptions in the Generative AI Era: A Sentiment-Driven Evaluation of AI Educational Apps' Role in Digital Transformation of e-Teaching Accepted

A Mazaherian, E Nourbakhsh

The 19th National and the 13th International Conference on e-Learning and e-Teaching (ICeLeT 2026) | Oral Paper

DOI: 10.48550/arXiv.2512.11934

Paper GitHub Hugging Face

Abstract

The rapid integration of generative artificial intelligence into education has driven digital transformation in e-teaching, yet user perceptions of AI educational apps remain underexplored. This study performs a sentiment-driven evaluation of user reviews from top AI ed-apps on the Google Play Store to assess efficacy, challenges, and pedagogical implications. Our pipeline scraped app data and reviews, applied RoBERTa for binary sentiment classification, used GPT-4o for key point extraction, and used GPT-5 to synthesize top positive/negative themes. Apps were categorized into seven types (e.g., homework helpers, math solvers, language tools), with overlaps reflecting multifunctional designs. Results indicate predominantly positive sentiments, with homework apps like Edu AI (95.9% positive) and Answer.AI (92.7%) leading in accuracy, speed, and personalization, while language/LMS apps (e.g., Teacher AI at 21.8% positive) lag due to instability and limited features. Positives emphasize efficiency in brainstorming, problem-solving, and engagement; negatives center on paywalls, inaccuracies, ads, and glitches. Trends show that homework helpers outperform specialized tools, highlighting AI's democratizing potential amid risks of dependency and inequity. The discussion proposes future ecosystems with hybrid AI-human models, VR/AR for immersive learning, and a roadmap for developers (adaptive personalization) and policymakers (monetization regulation for inclusivity). This underscores generative AI's role in advancing e-teaching by enabling ethical refinements that foster equitable, innovative environments.

BibTeX

@misc{mazaherian2025unveilinguserperceptionsgenerative,
      title={Unveiling User Perceptions in the Generative AI Era: A Sentiment-Driven Evaluation of AI Educational Apps' Role in Digital Transformation of e-Teaching},
      author={Adeleh Mazaherian and Erfan Nourbakhsh},
      year={2025},
      eprint={2512.11934},
      archivePrefix={arXiv},
      primaryClass={cs.CY},
      url={https://arxiv.org/abs/2512.11934},
}

KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification Published

E Nourbakhsh, N Sanjari, A Nourbakhsh

2025 11th International Conference on Signal Processing and Intelligent Systems (ICSPIS) | Oral Paper

DOI: 10.1109/ICSPIS68676.2025.11551784

Paper GitHub Hugging Face

Abstract

Age-related macular degeneration (AMD) and choroidal neovascularization (CNV)-related conditions are leading causes of vision loss worldwide, with optical coherence tomography (OCT) serving as a cornerstone for early detection and management. However, deploying state-of-the-art deep learning models like ConvNeXtV2-Large in clinical settings is hindered by their computational demands. Therefore, it is desirable to develop efficient models that maintain high diagnostic performance while enabling real-time deployment. In this study, a novel knowledge distillation framework, termed KD-OCT, is proposed to compress a high-performance ConvNeXtV2-Large teacher model, enhanced with advanced augmentations, stochastic weight averaging, and focal loss, into a lightweight EfficientNet-B2 student for classifying normal, drusen, and CNV cases. KD-OCT employs real-time distillation with a combined loss balancing soft teacher knowledge transfer and hard ground-truth supervision. The effectiveness of the proposed method is evaluated on the Noor Eye Hospital (NEH) dataset using patient-level cross-validation. Experimental results demonstrate that KD-OCT outperforms comparable multi-scale or feature-fusion OCT classifiers in efficiency-accuracy balance, achieving near-teacher performance with substantial reductions in model size and inference time. Despite the compression, the student model exceeds most existing frameworks, facilitating edge deployment for AMD screening.
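
A combined loss that balances soft teacher knowledge with hard ground-truth supervision is often written in the standard temperature-scaled form shown below. This is a generic sketch for a single sample, not the KD-OCT code: the weighting `alpha`, the temperature value, the KL-divergence soft term, and the squared-temperature scaling are assumptions taken from the common distillation formulation.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    label: int,
    alpha: float = 0.5,        # assumed weight on the soft (teacher) term
    temperature: float = 4.0,  # assumed softening temperature
) -> float:
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft term: how far the softened student distribution is from the teacher's.
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Hard term: ordinary cross-entropy against the ground-truth class index.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce
```

Setting `alpha` closer to 1 leans on the teacher's softened predictions, while values closer to 0 reduce the loss to plain supervised cross-entropy.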

BibTeX

@INPROCEEDINGS{11551784,
  author={Nourbakhsh, Erfan and Sanjari, Nasrin and Nourbakhsh, Ali},
  booktitle={2025 11th International Conference on Signal Processing and Intelligent Systems (ICSPIS)},
  title={KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification},
  year={2025},
  pages={605-611},
  keywords={Modeling;Printing;Retina;Training;Labeling;Accuracy;Convolutional neural networks;Aging;Architecture;Computer architecture;Retinal OCT;Knowledge Distillation;AMD Classification;ConvNeXt;Healthcare AI;Model Compression},
  doi={10.1109/ICSPIS68676.2025.11551784}
}

Beyond the Hype: Critical Analysis of Student Motivations and Ethical Boundaries in Educational AI Use in Higher Education Published

A Mazaherian, E Nourbakhsh

6th Congress on Education, Social and Cultural Studies with Futurist Education (ESCSSCONG06)

DOI: 10.48550/arXiv.2511.11369

Paper

Abstract

The rapid integration of generative artificial intelligence (AI) in higher education since 2023 has outpaced institutional preparedness, creating a persistent gap between student practices and established ethical standards. This paper draws on mixed-method surveys and a focused literature review to examine student motivations, ethical dilemmas, gendered responses, and institutional readiness for AI adoption. We find that 92% of students use AI tools primarily to save time and improve work quality, yet only 36% receive formal guidance, producing a de facto "shadow pedagogy" of unguided workflows. Notably, 18% of students reported integrating AI-constructed material into assignments, which suggests confusion about integrity expectations and compromises the integrity of the assessment. Female students expressed greater concern about abuse and distortion of information than male students, revealing a gendered difference in awareness of risk and AI literacies. Correspondingly, 72% of educators use AI, but only 14% feel at ease doing so, reflecting limited training and uneven policy responses. We argue that institutions must adopt comprehensive AI literacy programs that integrate technical skills and ethical reasoning, alongside clear AI-use policies and assessment practices that promote transparency. The paper proposes an Ethical AI Integration Model centered on literacy, gender-inclusive support, and assessment redesign to guide responsible adoption, protect academic integrity, and foster equitable educational outcomes in an AI-driven landscape.

BibTeX

@misc{mazaherian2025hypecriticalanalysisstudent,
      title={Beyond the Hype: Critical Analysis of Student Motivations and Ethical Boundaries in Educational AI Use in Higher Education},
      author={Adeleh Mazaherian and Erfan Nourbakhsh},
      year={2025},
      eprint={2511.11369},
      archivePrefix={arXiv},
      primaryClass={cs.CY},
      url={https://arxiv.org/abs/2511.11369},
}

ConHGNN-SUM: A Contextualized Heterogeneous Graph Neural Network for Extractive Text Summarization Published

E Nourbakhsh, HB Kashani

The 20th CSI International Symposium on Artificial Intelligence and Signal Processing (AISP 2024)

DOI: 10.1109/AISP61396.2024.10475307

Paper GitHub

Abstract

Text summarization is a valuable method for extracting important details from large volumes of text data, facilitating tasks like text data analysis. Various text summarization techniques have been developed over time, with some focusing on selecting and summarizing short sentences, while others overlook the semantic relationship between sentences. Extractive document summarization involves learning cross-sentence relations, a critical aspect that has been extensively explored using various approaches. One effective method is to employ graph-based neural networks, which offer an intricate structure capable of capturing relations among sentences. In this paper, we present a contextualized heterogeneous graph neural network for extractive text summarization (ConHGNN-SUM), which incorporates semantic nodes that extend beyond individual sentences and emphasizes the importance of capturing the relationship between selected sentences as a final step in the summarization process. These extra nodes function as intermediaries connecting sentences and enhancing the interrelationships between them. Our model enhances conventional graph-based extractive methods and delivers comparable performance to other advanced systems for extractive summarization.

BibTeX

@INPROCEEDINGS{10475307,
  author={Nourbakhsh, Seyed Erfan and Kashani, Hamidreza Baradaran},
  booktitle={2024 20th CSI International Symposium on Artificial Intelligence and Signal Processing (AISP)},
  title={ConHGNN-SUM: A Contextualized Heterogeneous Graph Neural Network for Extractive Text Summarization},
  year={2024},
  pages={1-6},
  keywords={Data analysis;Semantics;Focusing;Signal processing;Graph neural networks;Data mining;Task analysis;Extractive Text Summarization;Graph Neural Networks;Natural Language Processing(NLP)},
  doi={10.1109/AISP61396.2024.10475307}
}