GB/T 45288.2-2025   Artificial intelligence―Large-scale model―Part 2: Testing and evaluation for metrics and methods (English Version)
Standard No.: GB/T 45288.2-2025
English Name: Artificial intelligence―Large-scale model―Part 2: Testing and evaluation for metrics and methods
Chinese Name: 人工智能大模型 第2部分:评测指标与方法
Professional Classification: GB (National Standard)
Source Content Issued by: SAMR; SAC
Issued on: 2025-02-28
Implemented on: 2025-02-28
Status: valid
Target Language: English
File Format: PDF
Word Count: 19000 words
Translation Price(USD): 570.0
Delivery: via email in 1 to 5 business days

This is a draft translation for reference among interested stakeholders. The finalized translation (which passes through draft translation, self-check, revision, and verification) will be delivered upon order.

ICS 35.240
CCS L 70

National Standard of the People's Republic of China
GB/T 45288.2-2025
Artificial intelligence - Large-scale model - Part 2: Testing and evaluation for metrics and methods
人工智能 大模型 第2部分:评测指标与方法
(English Translation)

Issue date: 2025-02-28
Implementation date: 2025-02-28
Issued by the State Administration for Market Regulation and the Standardization Administration of the People's Republic of China

Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Abbreviations
5 Evaluation indicators
6 Evaluation methods
Annex A (Informative) Calculation methods for evaluation indicators
Bibliography

Artificial intelligence - Large-scale model - Part 2: Testing and evaluation for metrics and methods

1 Scope
This document establishes the evaluation indicators for large artificial intelligence models and describes the methods for evaluating them.
This document is applicable to model providers, application service providers, application consumers, etc., for evaluating and testing the capabilities of large models, and is also applicable to guiding the design, development, and application of large models.

2 Normative references
The following documents contain requirements which, through reference in this text, constitute provisions of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.
GB/T 42755-2023 Artificial intelligence - Code of practice for data labeling of machine learning
GB/T 45288.1 Artificial intelligence - Large-scale model - Part 1: General requirements

3 Terms and definitions
The terms and definitions given in GB/T 45288.1 apply to this document.

4 Abbreviations
The following abbreviations apply to this document.
API: Application Programming Interface
BLEU: Bilingual Evaluation Understudy

5 Evaluation indicators

5.1 Comprehension ability evaluation indicators

5.1.1 Overview
The evaluation of a large model's comprehension ability is mainly divided into unimodal and multimodal dimensions. The unimodal dimension mainly includes three secondary dimensions: text, image, and audio. The multimodal dimension mainly includes four secondary dimensions: image-text, text-audio, image-audio, and image-text-audio. The evaluation dimensions and typical tasks of comprehension ability are shown in Table 1.

5.1.2 Text classification
Evaluate the large model's ability to conduct an overall analysis of input text content, including but not limited to the following capabilities:
a) Classification task: Ability to map input text to specific categories, where users only need to provide the text to be classified without concerning themselves with the specific implementation. This mainly includes single-label and multi-label classification tasks.
b) Sentence segmentation: Ability to split a sentence sequence into a word sequence.
c) Part-of-speech tagging: Ability to assign a part of speech to each word in natural language text, where the part-of-speech categories may include nouns, verbs, adjectives, or others.
d) Sentiment analysis: Ability to determine the emotional tendency contained in the text, such as positive, negative, or neutral.
e) Semantic role labeling: Ability to assign corresponding semantic roles to predicates and arguments in sentences.
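
The calculation methods of Annex A are not reproduced in this excerpt, so the following is only a minimal sketch of how the single-label and multi-label classification tasks in 5.1.2 a) are commonly scored. It assumes accuracy for the single-label case and micro-averaged F1 for the multi-label case; the function names `accuracy` and `micro_f1` are hypothetical and do not come from the standard.

```python
from typing import Sequence, Set


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Single-label case: fraction of samples whose predicted label matches the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def micro_f1(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> float:
    """Multi-label case: F1 computed from label counts pooled over all samples."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels predicted and correct
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # labels predicted but wrong
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Assumed toy data: model outputs already parsed into label strings.
    print(accuracy(["pos", "neg", "neu", "pos"], ["pos", "neg", "pos", "pos"]))
    print(micro_f1([{"a", "b"}, {"c"}], [{"a"}, {"c", "d"}]))
```

Micro-averaging pools counts across samples, so frequent labels dominate the score; a macro average over labels would weight rare categories more heavily and may suit imbalanced test sets better.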

5.1.3 Information extraction
Evaluate the large model's ability to automatically identify and extract key information from complex text content, including but not limited to:
a) Keyword extraction: Ability to identify core words and phrases from text, which are crucial for understanding the overall text content;
b) Fact extraction: Ability to extract specific factual information from text, such as dates, locations, figures, and related events;
c) Argument extraction: Ability to identify and extract viewpoints and arguments in text, including supporting and opposing arguments, which is particularly important for analyzing commentative and argumentative text;
d) Relation extraction: Ability to extract semantic relationships between entities from text. In text, entities may include people, locations, organizations, events, etc., while semantic relationships refer to various relationships between entities, such as subject-verb relationships, verb-object relationships, hyponymy relationships, synonymy relationships, etc.;
e) Coreference resolution: Ability to clearly identify and determine the specific referent of pronouns or noun phrases in a sentence.

5.1.4 Mathematical reasoning
Evaluate the large model's ability to understand problems, identify the mathematical operations implicit in them, and solve them using mathematical concepts and principles, including but not limited to:
a) Arithmetic operations: Ability to perform basic addition, subtraction, multiplication, and division operations;
b) Algebraic problems: Ability to solve algebraic problems such as equation solving, inequality problems, and simplification of algebraic expressions;
c) Geometric problem-solving: Ability to solve problems involving calculations of geometric figure properties, area, perimeter, etc.;
d) Mathematical application problems: Ability to solve daily life mathematical problems, such as time calculation, distance calculation, proportion problems, etc.;
e) Statistical problems: Ability to interpret probability calculations, statistical charts, etc.

5.1.5 Causal reasoning
Evaluate the large model's ability to analyze causal relationships in input text content, including but not limited to:
a) Causal relationship identification: Ability to identify causal relationships from natural language text, such as the "because... so..." structure, including direct and indirect causal relationships;
b) Causal chain construction: Ability to construct a complete causal chain based on information in the text, such as identifying and linking the cause and effect of each event from a series of events;
c) Hypothetical conditional reasoning: Ability to perform logical reasoning on sentences containing hypothetical conditions (such as "if... then...") and accurately identify the relationship between conditions and results;