GB/T 26237.13-2023 Information technology - Biometric data interchange formats-Part 13: Voice data
1 Scope
This document specifies a data interchange format that can be used for storing, recording, and transmitting digitized acoustic human voice data (speech) assumed to be from a single speaker recorded in a single session. This format is designed specifically to support a wide variety of Speaker Identification and Verification (SIV) applications, both text-dependent and text-independent, with minimal assumptions made regarding the voice data capture conditions or the collection environment. Other uses for the data encapsulated in this format, such as automated speech recognition (ASR), may be possible, but are not addressed in this document. This document also does not address handling of data that has been processed to the feature or voice model levels. No application-specific requirements, equipment, or features are addressed in this document. This document supports the optional inclusion of non-standardized extended data. This document allows both the originally captured voice data and digitally-processed (enhanced) voice data to be exchanged. A description of any processing of the original source input is intended to be included in the metadata associated with the voice representations (VRs). This document does not address data streaming.
Provisions that stored and transmitted biometric data be time-stamped and that cryptographic techniques be used to protect their authenticity, integrity and confidentiality are out of the scope of this document.
Information formatted in accordance with this document can be recorded on machine-readable media or can be transmitted by data communication between systems.
A general content-oriented subclause describing the voice data interchange format is followed by a subclause addressing an XML schema definition.
This document includes vocabulary in common use by the speech and speaker recognition community, as well as terminology from other ISO standards.
2 Normative references
The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.
ISO 8601 Data elements and interchange formats - Information interchange - Representation of dates and times
Note: GB/T 7408-2005 Data elements and interchange formats - Information interchange - Representation of dates and times (ISO 8601:2000, IDT)
ISO/IEC 2382-37 Information technology - Vocabulary - Part 37: Biometrics
Note: GB/T 5271.37-2021 Information technology - Vocabulary - Part 37: Biometrics (ISO/IEC 2382-37:2017, MOD)
ISO/IEC 19785-1 Information technology - Common Biometric Exchange Formats Framework - Part 1: Data element specification
Note: GB/T 28826.1-2012 Information technology - Common biometric exchange formats framework - Part 1: Data element specification (ISO/IEC 19785-1:2006, MOD)
ISO/IEC 19794-1 Information technology - Biometric data interchange formats - Part 1: Framework
Note: GB/T 26237.1-2022 Information technology - Biometric data interchange formats - Part 1: Framework (ISO/IEC 19794-1:2011, MOD)
3 Terms and definitions
For the purposes of this document, the terms and definitions in ISO/IEC 19794-1 and the following apply.
3.1
analog-to-digital converter (ADC) resolution
exponent of the base 2 representation (the number of bits) of the number of discrete amplitudes that the analog-to-digital converter is capable of producing
Note: Common values for ADC resolution for sound-cards are: 8, 16, 20 and 24.
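The relationship in 3.1 between ADC resolution and the number of discrete amplitudes can be illustrated with a short Python sketch (an informal illustration only, not part of the standard; the helper names are ours):

import math

def amplitude_levels(adc_resolution_bits: int) -> int:
    # Number of discrete amplitudes an ADC with the given resolution can produce
    return 2 ** adc_resolution_bits

def adc_resolution(levels: int) -> int:
    # ADC resolution as defined in 3.1: the base-2 exponent of the number of discrete amplitudes
    return int(math.log2(levels))

# Common sound-card resolutions from the Note above
for bits in (8, 16, 20, 24):
    print(bits, "bits ->", amplitude_levels(bits), "discrete amplitudes")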
3.2
audio duration
duration of the complete audio containing all voice representation utterances, e.g. whole call recordings
3.3
audio encoding
encoding used by the data capture subsystem, e.g. a cellphone
Note 1: The voice signal is encoded before being transmitted over a channel. There are many formats in use today and the number is likely to continue to change as telephones and transmission channels evolve. Formats include PCM (ITU-T G.711) and ADPCM (ITU-T G.726) for wave encoding and ACELP (ITU-T G.723.1) and CS-ACELP (ITU-T G.729 Annex A) for AbS encoding. A-law PCM and mu-law PCM are included in ITU-T G.711.
Note 2: A comprehensive overview list is provided in 7.4.3.2.
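As an illustration of one encoding named in Note 1, the following Python sketch applies the ITU-T G.711 mu-law compression characteristic (mu = 255) to a normalized sample; it is an informal sketch only and omits the quantization to 8-bit codewords performed by an actual G.711 encoder:

import math

MU = 255  # mu value used by ITU-T G.711 mu-law PCM

def mu_law_compress(x: float) -> float:
    # Map a normalized sample x in [-1, 1] through the mu-law compression curve
    sign = 1.0 if x >= 0 else -1.0
    return sign * math.log(1 + MU * abs(x)) / math.log(1 + MU)

# Low-level samples are boosted relative to high-level ones, which is the point of companding
print(mu_law_compress(0.01), mu_law_compress(0.5), mu_law_compress(1.0))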
3.4
compression
process that reduces the size of a digital file and, accordingly, the data rate required for transmission
Note: Some audio encodings include compression and some do not. Compression is almost always “lossy” and, therefore, has an impact on the speech signal.
3.5
cut-off frequency (lower/upper)
frequency (below/above) which the acoustic energy drops 3 dB below the average energy in the pass band
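Since 10 lg(0.5) is approximately -3.01 dB, the 3 dB criterion in 3.5 corresponds roughly to the half-power point. A minimal Python sketch of this reading of the definition (the helper name and the sampled-spectrum representation are assumptions, not part of the standard):

import math

def cutoff_frequencies(freqs, power, passband):
    # Outermost frequencies at which the power remains within 3 dB of the
    # average power in the nominal pass band (f_lo, f_hi)
    in_band = [p for f, p in zip(freqs, power) if passband[0] <= f <= passband[1]]
    avg = sum(in_band) / len(in_band)
    threshold = avg * 10 ** (-3 / 10)  # 3 dB below the average, roughly half power
    above = [f for f, p in zip(freqs, power) if p >= threshold]
    return min(above), max(above)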
3.6
far-field
region far enough from the source where the angular field distribution is independent of the distance from the source
3.7
interactive voice response
function of a telephony-based computer that is used to control the flow of telephone calls and to provide voice-based self-service
Note 1: Technology that allows a computer to detect voice and keypad inputs.
Note 2: IVR systems deal with several real-world and constrained-content effects, such as emotional voices, varying environmental noises and recordings of free speech, but also hot words (e.g. yes, no, digits, keywords).
Note 3: IVRs apply ASR for user navigation; in secure applications, e.g. financial transactions via telephone, SIV becomes relevant. IVR systems may combine ASR and SIV to detect audio sample replays and to detect user liveness by presenting on-time generated knowledge that the user shall speak.
3.8
microphone
data capture subsystem that converts the acoustic pressure wave emanating from the voice into an electrical signal
3.9
mid-field
region between the near-field and the far-field which has a combination of the characteristics found in both the near-field and the far-field
3.10
near-field
region in an enclosure in which the direct energy at the microphone from the primary source is greater than the reflected energy from that source
3.11
public switched telephone network
channel-based technology used to switch analogue signals, typically telephone calls, through a network from a source such as a telephone to a destination such as another telephone
Note: Knowledge about the channel where a telephone call originates is useful because, historically, noise and other channel characteristics vary from country to country. The advent and growth of VoIP and other digital telephone networks has attenuated the impact of national telecommunications networks because they are not constrained by national boundaries. For example, a call originating in the United States might traverse Canada before arriving at its destination, which could be within the United States (see Voice over IP).
3.12
representation duration
duration of a single voice representation utterance
3.13
sampling rate
number of samples per second (or per other unit) taken from a continuous signal to make a discrete signal
Note 1: When the rate is per second, the unit is Hertz (Hz).
Note 2: Equal to the sampling frequency.
Note 3: The rate of sampling needs to satisfy the Nyquist criterion.
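The Nyquist criterion mentioned in Note 3 requires the sampling rate to exceed twice the highest frequency component of the signal; a minimal Python sketch (the helper name is ours):

def satisfies_nyquist(sampling_rate_hz: float, max_signal_freq_hz: float) -> bool:
    # True if the sampling rate is high enough to represent the signal without aliasing
    return sampling_rate_hz > 2 * max_signal_freq_hz

# Narrowband telephone speech is typically band-limited to about 3.4 kHz and sampled at 8 kHz
print(satisfies_nyquist(8000, 3400))  # True
print(satisfies_nyquist(8000, 4100))  # False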
3.14
session
single capture process that takes place over a single, continuous time period
Note: In biometric systems a session can be interpreted as the time of recording one or more samples without the subject leaving the scene of the biometric capture device, i.e. passing through a control stage/barrier marks the end of a session, while multiple rejects can occur during one session.
3.15
signal-to-encoding noise ratio
SNR
ratio of the pure signal of interest to the noise component that results from possible electronic noise sources
Note 1: SNR(dB) = 10 lg(Ps/Pn), where Ps is the average signal power and Pn is the average noise power. For digitized signals, Ps = (1/N) Σ_{n=1..N} s(n)^2 and Pn = (1/N) Σ_{n=1..N} e(n)^2, where s(n) denotes the digitized signal samples, e(n) the noise samples and N the total number of digital samples.
Note 2: SNR is usually measured in decibels (dB).
Note 3: For example, in PCM the noise is caused by quantization and is roughly calculated in Furui, Digital Speech Processing, Synthesis, and Recognition (Dekker, 1989) as SNR(dB) = 6B - 7.2, where B is the number of quantization bits.
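A minimal Python sketch of the SNR computation and the PCM quantization estimate given in the notes above (list-based and purely illustrative; the function names are ours):

import math

def snr_db(signal, noise):
    # SNR in dB from digitized signal and noise sample sequences
    ps = sum(s * s for s in signal) / len(signal)  # average signal power
    pn = sum(e * e for e in noise) / len(noise)    # average noise power
    return 10 * math.log10(ps / pn)

def pcm_quantization_snr_db(bits: int) -> float:
    # Rough quantization SNR for B-bit PCM: SNR(dB) = 6B - 7.2
    return 6 * bits - 7.2

print(round(pcm_quantization_snr_db(16), 1))  # 88.8 dB for 16-bit linear PCM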
3.16
speaker identification
form of speaker recognition which compares a voice sample with a set of voice references corresponding to different persons to determine the one who has spoken
3.17
speaker recognition
process of determining whether two speech segments were produced by the vocal mechanism of the same data subject
3.18
speaker verification
speaker authentication
form of speaker recognition for deciding whether a speech sample was spoken by the person whose identity was claimed
Note 1: Speaker verification is used mainly to restrict access to information, facilities or premises.
Note 2: Speaker verification can also be called speaker confirmation. In this document and in practical applications, the terms confirmation and verification can be used interchangeably.
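The distinction between speaker identification (3.16) and speaker verification (3.18) can be sketched as a 1:N search versus a 1:1 threshold decision. The following Python sketch is purely illustrative; the similarity function stands for a hypothetical comparison subsystem and is not defined by this document:

from typing import Callable, Dict

Similarity = Callable[[bytes, bytes], float]  # hypothetical comparison score between two voice samples

def identify(sample: bytes, references: Dict[str, bytes], similarity: Similarity) -> str:
    # Speaker identification: compare one sample against references for different
    # persons and return the identifier of the best-matching speaker
    return max(references, key=lambda person: similarity(sample, references[person]))

def verify(sample: bytes, claimed_reference: bytes, similarity: Similarity, threshold: float) -> bool:
    # Speaker verification: decide whether the sample was spoken by the person whose identity was claimed
    return similarity(sample, claimed_reference) >= threshold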
3.19
speaker identification and verification
process of automatically recognizing individuals through voice characteristics
Note: The data format itself does not depend on the application purpose (active/passive SIV).
3.20
voice
speech
sound produced by the vocal apparatus whilst speaking
Note 1: Normally defined by phoneticians as the sound that emanates from the lips and nostrils, which comprises "voiced" and "unvoiced" sound produced by the vibration of the vocal folds and from constrictions within the vocal tract, modified by the time-varying acoustic transfer characteristic of the vocal tract.
Note 2: For the purposes of this document, speech and voice are used interchangeably.