CCF@POLYU
Computational Linguistics Summit in the Era of Large Language Models cum International Symposium on Collaborative Innovations between The Hong Kong Polytechnic University and The China Computer Federation
大模型時代的計算語言學高峰論壇暨香港理工大學與中國計算機學會合作創新國際研討會
Poster and Registration:
If you are interested, please click HERE to register.
Programme Rundown:
Day 1, 22 Aug 2024 ( Thursday )
| Time | Venue | Session |
| --- | --- | --- |
| 13:30 - 14:00 | PQ306 | Welcoming |
| 14:00 - 15:15 | PQ306 | Seminar 1: What is Digital Humanities? Prof. Kam Fai Wong (黃錦輝 教授). Zoom: https://polyu.zoom.us/j/85605655395?pwd=XGbSk8QSO8fp7qa8K1l9bl7thvJb8s.1 |
| 15:30 - 16:45 | PQ306 | Seminar 2 (online): Language Mechanisms from the Perspective of Cognitive Science (talk delivered in Chinese). Prof. Guodong Zhou (周國棟 教授). Zoom: https://polyu.zoom.us/j/85605655395?pwd=XGbSk8QSO8fp7qa8K1l9bl7thvJb8s.1 |
Day 2, 23 Aug 2024 ( Friday )
| Time | Venue | Session |
| --- | --- | --- |
| 09:15 - 10:30 | PQ305 | Seminar 3: Advancing LLM Evaluation: Comprehensive Evaluation on Long-Context, Multi-Turn, and Instruction-Following. Dr. Xingshan Zeng (曾幸山 博士) |
| 10:30 - 10:45 | | Tea Break |
| 10:45 - 12:00 | PQ305 | Seminar 4: Metaphor and Synesthesia Analysis via Computational Linguistic Methods. Dr. Zhongqing Wang (王中卿 博士). Zoom: https://polyu.zoom.us/j/85605655395?pwd=XGbSk8QSO8fp7qa8K1l9bl7thvJb8s.1 |
| 12:00 - 14:00 | | Lunch Break* |
| 14:00 - 15:15 | PQ305 | Seminar 5 (online): Knowledge Retrieval Augmentation: Paradigm and Key Technologies. Prof. Haofen Wang (王昊奮 教授). Zoom: https://polyu.zoom.us/j/85605655395?pwd=XGbSk8QSO8fp7qa8K1l9bl7thvJb8s.1 |
| 15:15 - 16:30 | PQ305 | Seminar 6: On Detection of Machine Generated Text. Prof. Yue Zhang (張岳 教授). Zoom: https://polyu.zoom.us/j/85605655395?pwd=XGbSk8QSO8fp7qa8K1l9bl7thvJb8s.1 |
| 16:30 - 16:45 | | Tea Break |
| 16:45 - 18:00 | PQ305 | Seminar 7: Eliciting Alignments in Foundation Language Models. Dr. Derek F. Wong (黃輝 博士). Zoom: https://polyu.zoom.us/j/85605655395?pwd=XGbSk8QSO8fp7qa8K1l9bl7thvJb8s.1 |
*Conference lunch is not provided.
Keynote Speeches and Abstracts
Keynote Speech 1
What is Digital Humanities?
Prof. Kam Fai Wong (黃錦輝)
The Chinese University of Hong Kong
In recent years, large models such as ChatGPT and GPT-4 have driven significant advancements in artificial intelligence, revolutionizing various research domains. This talk begins with an introduction to the concept of Digital Humanities and discusses the impact of digital technologies on fields across the humanities, including education, language, history, philosophy, and the arts. These models, however, also present challenges such as privacy leakage, their black-box nature, and poor reliability. The talk addresses these issues by introducing methods developed by our research team, focusing on enhancing the forgettability, reliability, adaptability, multiplicity, and explainability (FRAME) of large models.
Keynote Speech 2
Language Mechanisms from the Perspective of Cognitive Science (从认知科学看待语言机制)
Prof. Guodong Zhou (周國棟)
Soochow University
The study of language mechanisms lies at the core of cognitive science. This lecture examines the problem of language mechanisms from multiple theoretical perspectives, including formalism, functionalism, and cognitivism, explores in depth how these different approaches deepen our understanding of language processing in the brain, and assesses their implications for computational models. By integrating insights from cognitive science, we will show how this perspective advances natural language processing (NLP) technology and opens new directions for interdisciplinary research between cognitive science and computational linguistics.
Keynote Speech 3
Advancing LLM Evaluation: Comprehensive Evaluation on Long-Context, Multi-Turn, and Instruction-Following
Dr. Xingshan Zeng (曾幸山)
Huawei Noah's Ark Lab
As Large Language Models (LLMs) are increasingly integrated into real-world applications, the need for comprehensive and systematic evaluation has never been more pressing. Traditional evaluations have largely focused on diverse tasks and broad knowledge domains, often neglecting the specific skills that are essential for practical applications. This talk introduces three cutting-edge benchmarks (M4LE, MT-Eval, and FollowBench) that address this gap by systematically evaluating core skills of LLMs: long-context comprehension, multi-turn conversational capabilities, and fine-grained instruction following.
M4LE introduces a benchmark tailored for assessing LLMs' ability to manage long sequences across diverse tasks and domains, revealing significant challenges in multi-span attention and semantic retrieval. MT-Eval shifts the focus to multi-turn interactions, highlighting how LLMs perform in complex, real-world conversational settings and identifying key factors affecting their multi-turn performance. FollowBench takes a different angle, evaluating LLMs on their ability to adhere to multi-level, fine-grained constraints in instruction following, uncovering critical weaknesses and suggesting areas for improvement. Together, these benchmarks provide a more nuanced understanding of LLM performance and guide future advancements in model evaluation.
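The multi-level constraint idea behind FollowBench can be illustrated with a small sketch. This is not the actual FollowBench implementation; the constraint checks and the level-counting rule below are hypothetical examples of increasingly strict, fine-grained constraints:

```python
def check_constraints(response: str, constraints: list) -> int:
    """Return the deepest constraint level satisfied, counting from level 1.
    Evaluation stops at the first violated constraint: level k is only
    credited if levels 1..k all hold, mirroring multi-level difficulty."""
    level = 0
    for check in constraints:
        if not check(response):
            break
        level += 1
    return level

# Example constraints of increasing strictness (invented for illustration).
constraints = [
    lambda r: len(r.split()) <= 50,       # level 1: at most 50 words
    lambda r: "benchmark" in r.lower(),   # level 2: must mention "benchmark"
    lambda r: r.strip().endswith("."),    # level 3: must end with a period
]

response = "FollowBench is a multi-level benchmark for instruction following."
print(check_constraints(response, constraints))  # satisfies all three levels
```

Reporting the deepest satisfied level, rather than a single pass/fail flag, is what makes the evaluation fine-grained: two models that both fail can still be distinguished by how many constraint levels they clear.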
Keynote Speech 4
Metaphor and Synesthesia Analysis via Computational Linguistic Methods
Dr. Zhongqing Wang (王中卿)
Soochow University
Metaphor, distinct from simile, is a rhetorical device that makes implicit comparisons without explicit figurative words. Textual metaphor detection refers to the automatic identification of metaphorical phenomena in text. Due to the absence of clear trigger words and the complexity of linguistic expressions in metaphors, research in this area is still relatively preliminary. Synesthesia refers to a phenomenon in metaphor where the expressed sensation differs from the original sensation associated with a word. Detecting synesthesia requires not only textual analysis but also an understanding of cognitive science to analyze the relationships between different sensations. This talk will describe the corpora primarily used for metaphor detection, the latest detection methods, and recent research findings on synesthesia detection.
Keynote Speech 5
Knowledge Retrieval Augmentation: Paradigm and Key Technologies
Prof. Haofen Wang (王昊奮)
Tongji University
Knowledge retrieval augmentation technologies provide additional knowledge sources for large language models, effectively alleviating hallucination problems and issues concerning the timeliness of knowledge. These technologies have quickly become pivotal in optimizing large model practices. During technological iterations, various technologies such as Retrieval Augmented Generation (RAG), structural indexing optimization, knowledge graphs, vector databases, large model fine-tuning, and prompt engineering have been deeply integrated. Numerous functional modules have been proposed one after another, presenting a challenge for researchers to comprehensively understand RAG.
This talk aims to thoroughly review and analyze RAG from the perspectives of paradigms, key technologies, and application development, with the goal of grasping the development trends and future directions of the technology from a higher level. Through a comprehensive analysis of the current research status, we propose a research paradigm of modular RAG and RAG Flow. We summarize six major functional modules, comprising more than 50 operators, and distill seven typical RAG Flow design patterns from over a hundred papers, providing guidance for designing RAG systems.
Based on these paradigms, we further advance the open-source work of the OpenRAG series. We have built the OpenRAG Knowledge Base, which comprehensively covers the information required by RAG researchers and developers and offers support for highly customizable multidimensional analysis views. Additionally, we have established the OpenRAG Playground to assist researchers and engineers in quickly building cutting-edge baseline methods and rapidly validating and comparing different RAG Flows on public or custom datasets.
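The retrieve-then-generate paradigm at the heart of RAG can be sketched minimally. This is a toy illustration, not the OpenRAG implementation: the corpus, the bag-of-words similarity, and the prompt template are invented for the example, and a real system would use a vector database and an LLM call in place of the stubs here:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list, k: int = 2) -> list:
    """Rank documents by similarity to the query and return the top k."""
    q = Counter(query.lower().split())
    return sorted(corpus,
                  key=lambda d: cosine(q, Counter(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(query: str, docs: list) -> str:
    """Assemble the augmented prompt that would be sent to the generator LLM."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}"

corpus = [
    "RAG retrieves external documents to ground LLM answers.",
    "Vector databases store embeddings for similarity search.",
    "Prompt engineering shapes model behaviour without retraining.",
]
query = "How does RAG ground answers?"
print(build_prompt(query, retrieve(query, corpus)))
```

The modular framing described above slots in here: retrieval, indexing, and prompt assembly are separate operators, so each stage can be swapped (e.g. a dense retriever for the bag-of-words scorer) without touching the rest of the flow.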
Keynote Speech 6
On Detection of Machine Generated Text
Prof. Yue Zhang (張岳)
Westlake University
With advances in large language models, machine-generated text has been spreading rapidly across the Internet and in business and educational settings. However, such text is not always desirable, and maliciously generated text can cause harm to society. We consider the task of automatically detecting machine-generated text in the open-domain setting, where a detector does not need to know the model that generated the content, the domain of the content, or its language. We discuss both the supervised setting, where the detection system learns from human-labeled data, and the unsupervised setting, where it makes decisions without supervised tuning. Both evaluation settings and detection algorithms are discussed. Our final model fulfills the task with over 96% accuracy on detecting ChatGPT-generated text.
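One common unsupervised detection idea from the literature (a general technique, not necessarily the speaker's specific method) is that machine-generated text tends to receive a higher average token log-probability under a language model than human-written text, so a detector can simply threshold that score. A minimal sketch, assuming synthetic log-probabilities stand in for scores from a real LM:

```python
import statistics

def detect_machine_generated(token_logprobs: list, threshold: float = -2.5) -> bool:
    """Classify as machine-generated when the mean token log-probability
    exceeds the threshold, i.e. the text is unusually predictable to the LM."""
    return statistics.mean(token_logprobs) > threshold

# Synthetic per-token log-probs (invented for illustration).
machine_like = [-1.2, -0.8, -1.5, -1.1]  # high-likelihood, predictable tokens
human_like = [-3.9, -1.0, -4.2, -3.5]    # more surprising token choices
print(detect_machine_generated(machine_like))  # True
print(detect_machine_generated(human_like))    # False
```

In the open-domain setting described above, the appeal of such zero-shot scoring is that it needs no labeled training data, though the threshold and scoring model still strongly affect accuracy across domains and languages.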
Keynote Speech 7
Eliciting Alignments in Foundation Language Models
Dr. Derek F. Wong (黃輝)
The University of Macau
Foundation large language models (LLMs) require supervised fine-tuning (SFT) to develop instruction-following capabilities for downstream tasks. Yet, collecting and annotating data for SFT is often expensive, especially for cross-lingual or non-English tasks. This raises a critical research question: How can we unlock and align the non-English knowledge within foundation models to serve minority groups effectively? In this seminar, we will systematically explore this issue by addressing the following questions:
- Do foundation models possess adequate cross-lingual knowledge?
- How do foundation LLMs compare to general-purpose SFT LLMs in handling cross-lingual tasks?
- What unsupervised training strategies can be employed to enhance the cross-lingual capabilities of foundation LLMs?
Our discussion aims to provide insights into the development of multilingual LLMs and promote their application in low-resource settings.