A review on state-of-the-art Automatic Speaker Verification system from spoofing and anti-spoofing perspective

Background/Objectives: Anti-spoofing measures are being developed rapidly with the aim of protecting Automatic Speaker Verification (ASV) systems from spoofing attacks. This review brings together the possible attack types, the datasets required, the established feature representation techniques, the machine learning based modeling algorithms, and the score normalization techniques. Method/Findings: Existing datasets are analysed in detail based on the total number of speaker samples, the number of speakers, and the source of availability (open or licensed). This may help in choosing the right dataset for building anti-spoofing frameworks. Further, the feature extraction schemes are elaborated with the intention of covering the wide span of features present in various parts of raw speech for obtaining speaker-specific traits. Machine learning algorithms ranging from discriminative to generative to mixed forms are then explored to find the right algorithm for specific attack conditions. On the whole, these analyses of existing features and machine learning algorithms together contribute to classifying unknown test samples as genuine or spoofed. Score normalization techniques are also considered in this review, as they help avoid misclassifications and ultimately reduce the False Acceptance Ratio. The performance of any anti-spoofing speaker verification system may be evaluated using standard objective measures such as Equal Error Rate, false positive ratios, and graphical plots. These measures are briefly explained in this review. Overall, a critical analysis of the individual methods (feature extraction, machine learning, score normalization) and of all the anti-spoofing datasets is presented to give a kick-start to any researcher beginning to explore this direction. The shortcomings and risks involved in building an enhanced speaker verification system that is robust to almost all attack types are listed in this article. The review of studies conducted so far has led to vital future directions, which are listed in the concluding remarks of the article.


Introduction
The task of admitting pre-enrolled speakers while rejecting unknown ones is called Speaker Verification, and a system that performs it is an Automatic Speaker Verification (ASV) system (1). The ASV is a crucial part of a Speaker Recognition platform after the Speaker Identification (SI) mechanism (2). The SI system selects the most likely speaker among a presented list of speakers, while ASV investigates whether a claimed identity is true or false. Depending upon the input sequence, the ASV may be text-dependent or text-independent, the former being preferred in authentication scenarios as these need higher accuracy (3). Furthermore, an ASV is inevitably susceptible to spoofing attacks, as any imposter could mimic or synthesize the target's voice to get through the intended system. Hence, this study focuses on spoofing and anti-spoofing measures from the system analysis and decision algorithm perspective.
The developments in reducing channel and noise interference have led to ASV systems being employed in security-based scenarios such as phone banking (4). However, the primary concern in using ASV is its susceptibility to imposters. According to the studies in (5,6), there are two basic types of attacks: direct or Physical Access (PA) attacks and indirect or Logical Access (LA) attacks. PA attacks occur at the sensor level, while LA attacks are observed after the sensor, that is, at the feature representation stage or the modelling stage. Another way of categorizing spoofing attacks is by the type of imposter speech, which may be impersonated, replayed, voice converted, or synthetic. Impersonation or mimicry is the earliest attack on speaker verification systems because of its ease of production. The only criterion is that the target's and imposter's voices must have a similar fundamental frequency, which is usually the case for twins. Mimicry is typically the product of a professional artist who is trained for the sole purpose of copying. The imposter or mimicry artist usually mimics the prosody and timbre of the target speaker (7). Replayed speech is easy for the attacker to produce, as it involves playing back pre-recorded speech that is captured without the permission of the target. It applies mainly to the text-dependent scenario, where a fixed phrase is used for verifying the speaker's traits. Capturing real-time or modified speech of the intended target is difficult, yet imposters attempt it regardless, since reproducing the recording requires no human intervention.
Synthetic speech may be a by-product of a Text-to-Speech (TTS) system, i.e. Speech Synthesis (SS), or of Voice Conversion (VC), also called voice transformation. Presently, TTS produces quite intelligible speech as it imitates the human speech production mechanism (8). In contrast, VC speech may be generated from any or all of the human speech production, perception, and prosodic models (9). On the whole, the ASV operates in two phases. The first is the training phase, in which appropriate features are extracted from a known speaker's voice and a statistical model is trained on them through a suitable machine learning algorithm. The saved model is then used during the testing phase to decide whether an unknown speech sample belongs to the known speaker or not. The internal blocks for each mode of operation are shown in Figure 1.
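The two operating phases can be illustrated with a deliberately simplified sketch. It does not use any of the statistical models reviewed later: the "model" is assumed to be just the mean of the enrolment feature vectors, the score is assumed to be a cosine similarity, and the function names and the 0.9 threshold are hypothetical choices made only for illustration.

```python
import math


def enroll(feature_vectors):
    """Training phase: average the speaker's feature vectors into one template."""
    count = len(feature_vectors)
    dim = len(feature_vectors[0])
    return [sum(vec[d] for vec in feature_vectors) / count for d in range(dim)]


def cosine_score(model, test_vector):
    """Similarity between the stored template and a test feature vector."""
    dot = sum(a * b for a, b in zip(model, test_vector))
    norm = math.sqrt(sum(a * a for a in model)) * math.sqrt(
        sum(b * b for b in test_vector)
    )
    return dot / norm if norm > 0.0 else 0.0


def verify(model, test_vector, threshold=0.9):
    """Testing phase: accept the claimed identity if the score reaches the threshold."""
    return cosine_score(model, test_vector) >= threshold


# Toy three-dimensional "features" (made-up numbers, not real speech features).
enrolled_model = enroll([[1.0, 2.0, 3.0], [1.2, 1.8, 3.1]])
accepted = verify(enrolled_model, [1.1, 1.9, 3.0])
rejected = verify(enrolled_model, [3.0, -1.0, 0.2])
```

In a real system the template would be replaced by a trained statistical model and the threshold would be tuned on development data.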
https://www.indjst.org/
The literature offers several significant reviews of developments in the ASV domain and anti-spoofing measures. The work in (10) describes the basics of biometrics along with spoofing attacks, while (11,12) survey the techniques used in speaker verification systems from the ASV Spoof challenge perspective, including protocols, databases, and future directions. Another review covers the detection of speech synthesis and replay attacks (13). A review of ASV based on short utterances is presented in (14), which explains the challenges and trends in this field. In this article, on the other hand, a thorough analysis of state-of-the-art features and training algorithms, along with popular databases and evaluation metrics, is presented. Unlike past reviews, this work concentrates on popular methods and addresses in depth the function of the blocks belonging to the ASV system. To the best of the authors' knowledge, such an intensive review of the feature extraction and machine learning algorithms employed in anti-spoofing frameworks has not been conducted before. This article also presents a critical evaluation of these internal blocks to provide a base for developing anti-spoofing frameworks. Thus, the main objectives of this article are three-fold:
1. Investigating the available datasets for developing anti-spoofing measures in order to analyse the nature and types of attacks. This provides criteria for selecting the right kind of dataset.
2. Investigating the feature representation techniques that help reduce the redundancies in raw data and represent spoofed and genuine speakers efficiently. This gives a kick-start to researchers looking for existing feature extraction techniques along with their pros and cons.
3. Exploring existing machine learning and score normalization algorithms for accurate categorization of an input test sample as genuine or spoofed. This keeps track of the evolution of machine learning algorithms from discriminative models to generative models to artificial neural networks.
The article is structured as follows: Section 2 describes the ASV system for spoofed speech, while Section 3 identifies popular databases employed in studying ASV. Section 4 covers the extensive literature related to feature representation techniques, Section 5 describes the machine learning algorithms, and Section 6 explains the score normalization techniques. Furthermore, Section 7 elaborates on evaluation metrics and, lastly, Section 8 summarizes the review with hints for future work.

ASV system for spoofing attack
The mode of operation of an ASV that handles imposter speech is nearly identical to that of the standard ASV, with the additional requirement of detecting mimicked or synthetic speech. In the testing mode, the test sample is matched against the trained model to obtain a score signifying whether the speaker belongs to a known or unknown class, as depicted in Figure 2.
To understand how spoofing attacks on the ASV work and to take the necessary actions to prevent them, the characteristics of natural speech as against artificial or mimicked speech must be investigated. Human behavioural traits such as huskiness, breathlessness, and speaking rate are practically impossible to mimic or synthesize individually (11). Furthermore, high-level features like pitch and duration are not consistent when inter-speaker and intra-speaker variations are considered. Timbre is also a potential feature for distinguishing natural from artificial speech.

Speech-based Spoofing Attacks
The individual steps of training and testing are vulnerable, as are the links between components, such as microphone to input feature representation, features to classifiers, and classifiers to decision algorithm (15). These attacks may be sub-divided into two basic types. Direct, or general, spoofing attacks occur at the microphone and during transmission to the first input component (the feature extraction block). Indirect attacks occur inside the ASV itself and usually need access to the system, say at the feature level, the classifier, or the decision logic, where the attacker would replace or modify the contents of these components. Direct attacks are considered the greater risk since, unlike indirect attacks, they do not require system-level access. Additionally, the four broad categories of presentation attacks are natural impersonation, speech synthesis, VC, and replay speech, as described below:

Natural Impersonation
Impersonation is performed by a professional mimicry artist or imposter who is able to produce similar voice traits and behaviour, or even by a twin with near-identical spectral characteristics. Studies indicate that the imposter does not depend on prior knowledge of the machine for copying the target speaker. All the imposter needs is a sample of the target speaker's voice, and a nearly similar spectral pattern would authorize the imposter (16). In fact, the impersonator tries to reproduce the prosodic parameters of the target (17). Along with this, the imitator adapts to the target speaker's accent, pronunciation, lexicon, and various other high-level features. The voice produced by impersonation can thereby deceive human perception. However, the practicality of this attack is extremely low, as most anti-spoofing ASV systems use spectral parameters as the base features.

Speech Synthesis
Speech Synthesis (SS) refers to the conventional TTS system, but producing target-specific speech that sounds intelligible and natural even though it is machine-generated from prompted text. Some common applications of SS among the younger generation are audiobooks, in-car navigation, speech translation, etc. (18,19). Speech synthesis involves two basic steps: analysing the text (front-end) and generating the waveform or speech (back-end). When analysing the text, the words and sentences are broken into less complex linguistic units called phonemes. The speech generation step then utilizes these linguistic parameters to build up a waveform. The literature suggests that four prime waveform generation techniques have evolved to date. The first is based on acoustic features, specifically formants, representing every phoneme (20). The second approach is rooted in diphones, each of which spans the second half of one phoneme to the first half of the following phoneme. These diphones were further represented using a linear prediction algorithm. The third technique, termed the unit selection approach, is based on selecting the right speech units and then concatenating them into a single speech sample (21). Lastly, statistical parametric synthesis techniques such as the Hidden Markov Model (HMM) have shown promising results in the domain of SS (22,23). Additionally, DNNs have also been proposed for SS (23,24).

Voice Conversion
In voice conversion, the speech signal from the source speaker is modified statistically or acoustically to sound identical to the target speaker's speech. This is the basic difference between speech synthesis and voice conversion algorithms. To modify the source speaker's speech, the spectral characteristics and prosody of the imposter are mapped onto those of the target so that no audible difference remains. The voice timbre and prosodic parameters like pitch and intonation are thus altered to reflect the target's characteristics. The spectral modifications are performed using statistical parameters, frequency warping, and unit selection algorithms (25). Spectral modification techniques like Vector Quantization (26), GMM (27), Restricted Boltzmann Machines (RBM) (28) and Deep Belief Networks (DBN) (29) have been explored for producing VC speech. Frequency warping techniques map the source speaker's frequency axis to that of the target speaker. These modification techniques preserve the spectral content, producing natural-sounding target speech (29,30). Furthermore, the unit selection technique gave promising results, producing converted speech similar to the target speaker's voice.
Along with spectral parameters, prosody modification contributes to closer and more natural synthetic speech. Pitch and duration are the parameters considered when referring to a speaker's prosody in a voice conversion framework (31). The threat to ASV systems has increased over time due to improvements in the quality of converted speech.

Replay Speech
Pre-recorded speech fed into the ASV poses a potential risk to the system, as it may grant unauthorized entry to the adversary. Such spoofed speech could be procured at any time without the consent of the victim. The speech samples may be concatenated or even clipped to obtain the desired utterance. Such attacks work well against text-dependent ASV that uses a fixed text phrase for granting access to the system. Spoofing attacks using replayed speech have become common practice due to the availability of affordable, good-quality recording devices like mobile phones and laptops. Therefore, these attacks occur at the microphone level more often than at the transmission level. The spectra of natural and replayed speech turn out to be quite close; it is thus reasonable to conclude that spectral features are susceptible to replay-based attacks (32). In terms of objective scores, the False Acceptance Ratio (FAR) increases under these attacks.

Detection of Spoofed Speech
Spoofed speech is a by-product of human mimicry or of a machine's effort to mimic natural speech traits. The former is a low-risk attack and hence not widely studied in the anti-spoofing community, while the latter involves synthetic speech produced by a TTS or VC framework or replayed through a recording device. Spoofed speech needs to be distinguished from natural speech in order to prevent unauthenticated access to crucial information. Developments in the VC field are more established than those in TTS due to early work that began at the start of the 1990s. This implies that more breaches were uncovered using VC speech than SS systems. The preliminary VC attacks consisted of Harmonic and Noise Model (HNM) and HMM-based synthesis (33). As opposed to VC, SS gained momentum only after developments in HMM-based synthesis (34). The study of countermeasures began with prior knowledge of the attack, which yielded biased results yet gave a kick-start to algorithm development. The work was initiated with f0-based contours along with time stability to distinguish between genuine and spoofed speech (35). The algorithm generalized poorly, as there was little variation in the number of speakers. Visual cues such as images from video were used to represent speech through Mean Pitch Stability (MPS) and MPS range (MPSr) along with jitter in (36). In another approach, the Cosine Normalization Phase Spectrum (CosPhase) along with the Modified Group Delay Function (MGDF) was proposed. MFCC-based features ignore the phase-related information during feature extraction and hence resulted in an increased EER. Furthermore, Magnitude Modulation (MM) and Phase Modulation (PM) super-vectors each improved the EER when fused with MGDF features (37,38).
These short-term features produce a lower EER, which leads to another inference: long-term and short-term features carry related yet complementary content when fusion is performed. That said, the short-term features exhibit artefacts due to framing, which leaves scope for improvement for speech researchers. Table 1 summarises various features and associated attack types, while Table 2 lists the past five years of research in the spoof detection area with regard to features along with the detection results. Countermeasures developed for known attacks are not sufficient for real-time scenarios, which opens the door to building algorithms that can detect spoofed speech irrespective of the attack type. One such line of study started with the development of the SAS dataset and the ASV Spoof 2015 challenge (11,39). Following this, Local Binary Patterns (LBP)-DCT for computing the MGDF and CosPhase (40), magnitude features such as LMS and RLMS, and phase features like GD, MGD, Instantaneous Frequency (IF), Baseband Phase Difference (BPD), and Pitch Synchronous Phase (PSP) (41) were also found to perform well in the spoof detection task. Apart from phase-based features, the Linear Prediction Coefficients (LPC) and their residual (LPR) have also been considered for detecting known attacks (42). Some distinct feature sets like Cross-Teager Cepstral Coefficients (TECC) (43), Energy Separation (44) and time-frequency based LFCC (45) are amongst the novel representation techniques.
The heterogeneity of feature sets poses challenges for researchers, and these features must further be trained and tested through appropriate speaker modelling techniques. The i-vector approach was a breakthrough in speaker verification; hence it has been proposed for spoof detection with filter-bank based features and Deep Neural Networks (DNN) (46,47). Furthermore, a comparative study using the Mel Wavelet Packet Coefficients (MWPC) was conducted to investigate Support Vector Machines (SVM) and Deep Belief Networks (DBN), in which the SVM performed better than the DBN (48). Various ensemble-based approaches have also been proposed in the literature, paving the way for research in the Deep Learning area (49-52).

Database for ASV
To begin developing spoofing and anti-spoofing measures in ASV, a dataset must first be chosen based on the goals and specifications of the work. The following section describes various datasets used in ASV system development. The corpora are labelled as licensed or open source for ease of understanding, as summarised in Table 3 and Table 4 respectively.

Licensed / Proprietary Datasets
The YOHO dataset has a large number of utterances recorded in an office environment but lacks variation in vocabulary (53). The dataset offers a relatively large number of speakers, with 106 male and 32 female voices. There were 24 utterances in each of 4 sessions. The utterances are low-pass filtered at 3.4 kHz and up-sampled to 8 kHz, comprising 5500 utterances in all. In contrast, WSJ is a multi-speaker dataset that, unlike YOHO, was not created for ASV. As its speaker variability and size are large, it may be used for producing new synthetic speech while treating the original corpus as genuine speech samples. The work in (36) used the SI-284 set of WSJ (WSJ0 and WSJ1) for synthetic speech generation, which comprises 81 hours of recordings from 284 speakers. NIST-SRE is a speaker recognition corpus developed through a collaboration between NIST and LDC. It contains multiple speakers with conversational telephone speech (54). BioCPqD-PA is a proprietary database with 222 speakers recorded in the Portuguese language. The dataset is known to be versatile as a result of variations in recording environments. It comprises 27,253 samples in all, of which 7,941 are evaluation samples, along with 3,91,678 spoofed samples, of which 1,14,111 are evaluation samples (55).

Open Source datasets
The SAS (Speaker Verification and Anti-spoofing) dataset is built from the VCTK corpus, with 106 speakers sampled at 16 kHz. The corpus contains 22,831 natural speech samples as opposed to 2,03,592 spoofed speech samples. VC and SS are the sources of the spoofed speech (39).
The RSR 2015 corpus is a text-dependent Speaker Recognition database with 151 hours and 30 minutes of speech recorded in English. There are nearly 300 speakers and 1,96,844 utterances, split for development and evaluation purposes (56,57).
Furthermore, the first dataset built for the sole purpose of boosting anti-spoofing development was ASV Spoof 2015. The dataset has 1,93,404 spoofed samples generated using VC and SS (known attacks, LA) and 9,404 genuine samples. Additionally, the corpus has spoofed samples based on unknown attacks for developing attack-independent algorithms (28). The AV Spoof corpus was developed as a major part of the BTAS 2016 Challenge (58). The dataset includes various presentation attacks, in particular VC-based and SS-based ones, with 20,060 LA spoofed samples and 43,320 PA spoofed samples. There are 5,578 natural speech samples.
The Voice Presentation Attack (VoicePA) corpus takes only its genuine samples from the AV Spoof corpus. The VoicePA dataset contains replay speech recorded using a laptop, where the speech is replayed using internal and external speakers. Moreover, there are replay speech samples from iPhone and Samsung phone devices (internal speakers). There is a broad range of spoofed utterances, comprising 3,91,678 samples of which 1,14,111 are reserved for evaluation. In contrast, there are 27,253 natural speech samples, of which 7,941 are again reserved for evaluation (59).
RedDots is a text-dependent replay speech corpus (60) that contains 62 native and non-native English speakers, with short phrases recorded on multiple devices. The corpus contains 16,067 replayed spoofed samples and 2,346 natural samples. Since the dataset was designed with crowdsourced replaying and recording, it was selected for inclusion in the ASV Spoof 2017 challenge.
After successfully conducting the ASV Spoof 2015 challenge, the organizers launched a new dataset, ASV Spoof 2017, which was adapted from the RedDots replay dataset. There are 24 speakers, 1,298 genuine samples, and 12,008 spoofed samples (61). ReMASC (Realistic replay attack Microphone Array Speech Corpus) is a replay speech dataset designed with voice-controlled devices in mind. There are 45,472 spoofed and 9,240 natural speech samples from 55 speakers. The recording environments include outdoors, inside the home, and inside vehicles (62).
The ASV Spoof 2019 corpus belongs to the third consecutive challenge for developing anti-spoofing measures, broadened to cover synthetic (VC, SS) and replay speech. There are 20 speakers, 71,747 LA samples, and 1,37,457 PA samples, with the tandem-DCF introduced in addition to the other evaluation measures for the challenge (5,64).

Feature Representation for Spoof Detection
The voice signal, when sampled and stored in digitized form, contains a lot of application-independent content that may not be required for the dedicated task of ASV. Thus, the speech signal is represented using appropriate features through framing, windowing, and conversion. The conversion operation may be temporal, spectral, cepstral, or some form of filtering that reduces the redundant content. Taking this into consideration, speech features may be categorized as low-level (or short-term), long-term (or prosodic), and high-level features. Additionally, there are deep features, which are obtained by using a DNN as the feature extraction scheme. The low-level or short-term features are linked with a speaker's timbre. The aim here is to capture local information within a frame of 20-40 ms. Spectral representation parameters like MFCC, LPCC, and the Cochlear Model contributing to glottal parameters may thus be categorised as short-term features when extracted over the defined frame duration. These kinds of features are simple to synthesize and hence are more susceptible to spoofing attacks (65,66). The long-term or prosody-based parameters are linked to human-like auditory traits including the pitch and duration of the speaker, intonation, and Constant Q-transform Cepstral Coefficients (CQCC) (45,52,67). The prosodic parameters are obtained from long segments of speech like words and syllables, and represent speaking rate, style, and intonation levels. These features are less susceptible to channel distortions, yet the training process requires a larger amount of data (68). Furthermore, pitch extraction algorithms do not perform well in noisy scenarios. Likewise, the high-level parameters are obtained from lexicons to characterize the behaviour of the speaker and lexical cues. Phonemes and idiolect are also high-level features. These parameters are preferred over short-term and long-term parameters because they are less affected by noise and channel distortions. Yet there remains scope for further exploration, as researchers hesitate to apply these high-level features in standalone ASV because they need a heavy front-end such as a speech recognition framework (63,64).
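The framing and windowing step behind the short-term features can be sketched as follows. This is a minimal illustration under stated assumptions: a 25 ms frame (inside the 20-40 ms range mentioned above), a 10 ms hop, and a Hamming window are common choices but are not prescribed by the reviewed works, and the function names are hypothetical.

```python
import math


def hamming(num_samples):
    """Hamming window of the given length."""
    if num_samples == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * i / (num_samples - 1))
        for i in range(num_samples)
    ]


def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a signal into overlapping, windowed frames.

    frame_ms and hop_ms are assumed values; trailing samples that do not
    fill a whole frame are dropped.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if frame_len < 1 or hop_len < 1:
        raise ValueError("frame and hop must span at least one sample")
    window = hamming(frame_len)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        segment = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(segment, window)])
        start += hop_len
    return frames


# One second of a constant dummy signal at 16 kHz.
sample_rate = 16000
signal = [1.0] * sample_rate
frames = frame_signal(signal, sample_rate)
```

Changing `frame_ms` to a few milliseconds or a few hundred milliseconds gives frames at other time scales with the same routine.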
The features may also be segregated based on the duration of the segment as sub-segmental, segmental, and supra-segmental. For sub-segmental parameters, the speaker utterance is framed using a frame duration of 3 ms to 5 ms. Research indicates that source excitation parameters are extracted through sub-segmental features (65,66). The segmental features involve the same framing as sub-segmental features, except that the frame duration is increased to 10 ms to 30 ms, with overlapping frame shifts to maintain continuity and avoid loss of information due to the sharp edges at frame boundaries. Speech is generally non-stationary in nature, yet it is observed to be stationary over durations of 10 ms to 30 ms, which justifies the segmental frame duration. The vocal tract parameters are obtained via segmental parameters. Lastly, the supra-segmental parameters are extracted with a frame size of 100 ms to 300 ms. These features represent behavioural traits of the speaker like accent, word duration, speaking style, etc. (65,67). The popular features used in ASV and countermeasures are described below:

Mel Frequency Cepstral Coefficients (MFCC)
The MFCC-based perceptual features are preferred in most speech processing frameworks such as automatic speaker recognition (68,69). The frequencies are transformed to the Mel scale through the standard short-term procedure of framing, windowing, and spectral transformation, as portrayed in Figure 3. The spectrum is processed through a triangular filter bank of order T. The triangular filtering averages the energies around each centre frequency. The Mel scale is known to have linear spacing at lower frequencies and logarithmic spacing at higher frequencies.
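The mel frequency $f_{MEL}$ is computed as

$$f_{MEL} = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

where $f$ is the frequency in Hz. Thus, the MFCC coefficient $C_h$ is given as

$$C_h = \sum_{t=1}^{T} \log(E_t) \cos\left[h\left(t - \frac{1}{2}\right)\frac{\pi}{T}\right]$$

where $h$ is the cepstral coefficient index and $E_t$ is the energy at the output of the $t$-th triangular filter. The MFCC coefficients along with their first and second derivatives are usually included in the feature set. Another form of representation uses an inverse mel filter bank to obtain Inverse MFCC (IMFCC) coefficients, which represent high-frequency regions efficiently (51).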
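The mel mapping and the final cepstral step can be sketched as below. The sketch assumes the widely used mapping mel = 2595 log10(1 + f/700) and a type-II DCT over the log filter-bank energies; the function names are hypothetical, and the triangular filter bank that produces the energies is taken as given rather than implemented.

```python
import math


def hz_to_mel(freq_hz):
    """Map a frequency in Hz to the mel scale (assumed 2595/700 form)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mfcc_from_filterbank(energies, num_coeffs):
    """DCT-II of the log filter-bank energies for one frame.

    energies: outputs E_1..E_T of the T triangular filters (assumed given).
    A small floor avoids taking the log of zero.
    """
    num_filters = len(energies)
    log_e = [math.log(max(e, 1e-12)) for e in energies]
    return [
        sum(
            log_e[t] * math.cos(h * (t + 0.5) * math.pi / num_filters)
            for t in range(num_filters)
        )
        for h in range(num_coeffs)
    ]


# Centre frequencies equally spaced on the mel axis between 0 Hz and 8 kHz.
num_filters = 8
top_mel = hz_to_mel(8000.0)
centres_hz = [
    mel_to_hz(top_mel * (i + 1) / (num_filters + 1)) for i in range(num_filters)
]

# Made-up filter-bank energies for a single frame.
coeffs = mfcc_from_filterbank([2.0, 3.5, 1.2, 0.8, 0.5, 0.4, 0.3, 0.2], 5)
```

The centre frequencies crowd together at the low end and spread out at the high end, which is the near-linear then logarithmic spacing described above.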

Mel-warped Overlap Block Transformation (MOBT)
The MOBT parameters are efficient in capturing the discriminative information provided by the formants. The filter-bank energies (MFLE) are segregated into overlapping and non-overlapping blocks. To compute the cepstral parameters, the filter-bank energies are block transformed, and the DCT of every block yields the MOBT parameters, as seen in Figure 4. Similar to IMFCC, the inverse mel scale may be used in place of the Mel scale in MOBT to extract IMOBT parameters (70,71).
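A block transform of this kind can be sketched as follows. The block size, the hop between blocks, and the choice to keep every DCT coefficient of each block are assumptions made for illustration, since the text does not fix them; the function names are hypothetical.

```python
import math


def dct_ii(values):
    """Unnormalised type-II DCT of a short list of values."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * h * (i + 0.5) / n) for i, v in enumerate(values))
        for h in range(n)
    ]


def block_transform(log_energies, block_size, hop):
    """Apply the DCT to successive blocks of log filter-bank energies.

    hop < block_size gives overlapping blocks; hop == block_size gives
    non-overlapping blocks. The per-block coefficients are concatenated.
    """
    coeffs = []
    start = 0
    while start + block_size <= len(log_energies):
        coeffs.extend(dct_ii(log_energies[start:start + block_size]))
        start += hop
    return coeffs


# Made-up log energies from an 8-filter bank for one frame.
log_energies = [1.2, 1.5, 0.9, 0.4, 0.1, -0.2, -0.5, -0.9]
overlapped = block_transform(log_energies, block_size=4, hop=2)
non_overlapped = block_transform(log_energies, block_size=4, hop=4)
```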

Speech-Signal-Based Frequency Cepstral Coefficient (SFCC)
To investigate the significant role of frequency content in the speech production model, frequency warping is performed, and SFCC features serve the same purpose (71). The input time-domain speech signal $v(t)$ is first passed through the STFT, and the power spectral density of every frame is then computed as

$$P(i, w) = \frac{1}{N}\left|\sum_{t=0}^{N-1} v_i(t)\, e^{-jwt}\right|^2$$

Here, $N$ is the total number of samples in one window and $v_i(t)$ is the $i$-th windowed frame. When $P(i, w)$ is averaged over the entire speech data, the ensemble energy $P(w)$ is obtained; its logarithm is taken and divided into intervals as

$$A_j = \int_{w_j}^{w_{j+1}} \log P(w)\, dw$$

where $A_j$ is the $j$-th area interval, $w_j$ is the lower cut-off frequency and $w_{j+1}$ is the higher cut-off frequency. A $P$-point speech-based frequency warping function $F(w)$ is then defined from these intervals; $F(w)$ becomes a continuous function as $P$ approaches infinity, and its values lie within 1.
The frequency warping is used to place the triangular filters when converting from the spectral to the cepstral domain. The ISFCC is the product of the inverse warping operation, as against SFCC (71). Furthermore, the SOBT parameters are produced by combining MOBT and SFCC, while the inverse yields ISOBT (71), as seen in Figure 4.

Linear Prediction Cepstral Coefficients (LPCC)
LPC is simple to compute and has a low computational cost (72). The speech production model is represented using an Autoregressive Moving Average (ARMA) model. The LPC model employs all-pole filters, predicting the $k$-th speech sample as a linear combination of the previous $j$ samples:

$$\hat{s}(k) = \sum_{i=1}^{j} a_i\, s(k-i)$$

where $a_1, a_2, \ldots, a_j$ are the LPC parameters for every individual frame. The prediction error is thus computed as

$$e(k) = s(k) - \hat{s}(k)$$

where $s(k)$ is the original speech sample and $\hat{s}(k)$ is the predicted sample. Further, the squared error is calculated to obtain unique coefficients,

$$E = \sum_{k=1}^{p} e^2(k) = \sum_{k=1}^{p}\left[s(k) - \sum_{i=1}^{j} a_i\, s(k-i)\right]^2$$

where $p$ stands for the total number of samples in one analysis frame. To compute the LPCC features, the squared error is differentiated with respect to the LPC coefficients (filter weights) and set to zero, as shown in Figure 5.
Therefore, the cepstral coefficients are

$$c_n = a_n + \sum_{m=1}^{n-1} \frac{m}{n}\, c_m\, a_{n-m}, \quad 1 \le n \le j$$
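The LPC-to-LPCC chain can be sketched with the autocorrelation method. The sketch assumes the predictor convention in which the current sample is estimated as a weighted sum of past samples, solves the resulting normal equations with the Levinson-Durbin recursion, and then applies the standard LPC-to-cepstrum recursion. The function names are hypothetical, and the input frame is assumed to be non-silent so that its energy is not zero.

```python
def autocorrelation(frame, max_lag):
    """r[lag] = sum over k of frame[k] * frame[k - lag]."""
    return [
        sum(frame[k] * frame[k - lag] for k in range(lag, len(frame)))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns (a_1..a_order, residual energy)."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]               # assumed non-zero (non-silent frame)
    for i in range(1, order + 1):
        acc = r[i] - sum(a[m] * r[i - m] for m in range(1, i))
        k = acc / err        # reflection coefficient
        updated = a[:]
        updated[i] = k
        for m in range(1, i):
            updated[m] = a[m] - k * a[i - m]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


def lpc_to_cepstrum(a, num_ceps):
    """Recursion c_n = a_n + sum_m (m/n) c_m a_(n-m), with a_n = 0 for n > order."""
    order = len(a)
    c = [0.0] * (num_ceps + 1)  # c[0] is unused
    for n in range(1, num_ceps + 1):
        acc = a[n - 1] if n <= order else 0.0
        for m in range(max(1, n - order), n):
            acc += (m / n) * c[m] * a[n - m - 1]
        c[n] = acc
    return c[1:]


# Decaying exponential: behaves like a one-pole (AR(1)) signal with pole 0.9.
frame = [0.9 ** k for k in range(200)]
lpc, residual = levinson_durbin(autocorrelation(frame, 2), 2)
lpcc = lpc_to_cepstrum(lpc, 4)
```

For this toy frame the first predictor coefficient comes out close to 0.9 and the second close to zero, as expected for a single-pole signal.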

Cochlear Filter Cepstral Coefficients (CFCC)
The conventional feature extraction techniques involve pre-processing such as windowing, framing, and low-pass filtering. In contrast, human speech production does not operate on these pre-processing principles. Moreover, the framing and windowing operations introduce artefacts and discontinuities into the processed speech that are absent from the raw speech. Hence, distinguishing natural speech from pre-processed synthetic speech becomes easier, as the effects of these distortions are seen in the spectral response. One such feature is CFCC (73), which builds on auditory cepstral coefficients; its block schematic is depicted in Figure 6.
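The CFCC is built on the auditory transform, a wavelet transform of the speech signal x(t) with respect to the cochlear filter ψ(t):
$$T(m, n) = \int_{-\infty}^{\infty} x(t)\, \psi_{m,n}(t)\, dt, \qquad \psi_{m,n}(t) = \frac{1}{\sqrt{|m|}}\, \psi\!\left(\frac{t-n}{m}\right)$$
Here x(t) and ψ(t) belong to the Hilbert space. The factor m is the scaling parameter and n is the time shift parameter, and the energy remains the same for all values of m and n:
$$\int_{-\infty}^{\infty} |\psi_{m,n}(t)|^{2}\, dt = \int_{-\infty}^{\infty} |\psi(t)|^{2}\, dt$$
The cochlear filter is presented as
$$\psi_{m,n}(t) = \frac{1}{\sqrt{|m|}} \left(\frac{t-n}{m}\right)^{\alpha} \exp\!\left[-2\pi f_L \beta \left(\frac{t-n}{m}\right)\right] \cos\!\left[2\pi f_L \left(\frac{t-n}{m}\right) + \theta\right] u(t-n)$$
where α and β define the shape and width of the cochlear filter respectively, and u(.) is the unit step function. The value of θ must satisfy the admissibility property of the mother wavelet ψ(t), that is, $\int_{-\infty}^{\infty} \psi(t)\, dt = 0$, so that the admissibility constant $C_\psi$ is finite. The scale $m = f_L / f_C$ selects a band-pass cochlear filter with the lowest frequency $f_L$ and center frequency $f_C$.
For the k-th sub-band filter, the value of m must be known in advance for the specific center frequency of the cochlear filter, with k ∈ (1,28]. After the cochlear filters separate the frequencies, the hair cells act as transducers that respond to the vibration of the basilar membrane (BM). The hair cells respond to displacement in the positive direction only, hence
$$h(m, n) = T(m, n)^{2}$$
The output of the hair cells is transformed into a nerve spike density representation, which is given as
$$S(k, l) = \frac{1}{g} \sum_{n=l}^{l+g-1} h(k, n), \qquad l = 1, Q, 2Q, \ldots$$
where g is the window length and Q is the window shift duration. The output of this function is passed through a cube-root non-linearity, followed by the DCT.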

Cochlear Filter Cepstral Coefficients with Instantaneous Frequency (CFCCIF)
The CFCC-IF features were first applied to ASV in the ASV Spoof 2015 Challenge (74). The CFCC combined with the Instantaneous Frequency (IF) builds up the CFCC-IF coefficients. The CFCC features are based on a wavelet transform and use the auditory transform (AT), hair cell, and nerve spike density computations. The product of the nerve spike density and the IF is differentiated and followed by a non-linear log operation. In the end, the DCT is applied to de-correlate the parameters, producing the CFCC-IF parameters as shown in Figure 7.

Constant-Q Cepstral Coefficients (CQCC)
The CQCC was successfully utilized in anti-spoofing by (75). Like the STFT, the constant-Q transform (CQT) underlying the CQCC yields a joint time-frequency representation. The important highlight of the CQCC is its high frequency resolution at low frequencies and high time resolution at high frequencies. The spectrum obtained from the CQT is passed through a non-linear logarithm, and its geometrically spaced frequency scale is then resampled to a linear scale. Finally, the DCT is applied to produce the CQCC parameters, as portrayed in Figure 8.

Magnitude based features
The time-domain speech signal is difficult to process and gives no visual clue of its frequency content. The STFT representation of speech, in contrast, yields the magnitude as well as the phase content, and this spectral content helps to process the data better. Thus, the magnitude-based features carry considerable weight when detecting spoofed speech. The STFT of a speech utterance is given as
$$S(t, w) = |S(t, w)|\, e^{j\theta(t, w)}$$
Here, $|S(t, w)|$ signifies the magnitude-related content while $\theta(t, w)$ holds the phase content. The Log-Magnitude Spectrum (LMS) and Residual LMS (R-LMS) are useful for detecting spoofing attacks. The LMS parameters are derived by taking the logarithm of the magnitude spectrum obtained from the STFT:
$$\mathrm{LMS}(t, w) = \log |S(t, w)|$$
The LMS features therefore hold crucial magnitude content such as formants, pitch, and specifically the harmonics present in the vowel spectrum. The logarithmic operation also limits the dynamic range of the speech spectrum (76). The R-LMS features are well established in speech recognition frameworks but are still not much explored in anti-spoofing. Synthetic speech from VC or TTS algorithms represents formants quite well, so using formants to separate genuine from synthetic speech is tough. The LPC technique is popular for representing formants well, whereas the residual part contains no formants. The R-LMS parameters are obtained by computing the LMS of the LP-residual (LPR) signal (76).
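The LMS computation for a single frame can be sketched as below. This is an illustrative example only, assuming a Hann window, a direct DFT for clarity instead of an FFT, and a small floor value `eps` to avoid taking the logarithm of zero; none of these choices come from the cited works.

```python
import cmath
import math

def log_magnitude_spectrum(frame, eps=1e-10):
    """Log-magnitude spectrum of one frame (bins 0 .. N/2)."""
    n = len(frame)
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / (n - 1)) for t in range(n)]
    x = [s * w for s, w in zip(frame, window)]
    lms = []
    for k in range(n // 2 + 1):
        # Direct DFT of bin k; an FFT would be used in practice.
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        lms.append(math.log(abs(coeff) + eps))
    return lms

# Toy frame: a single sinusoid centred on DFT bin 8.
frame = [math.cos(2 * math.pi * 8 * t / 64) for t in range(64)]
lms = log_magnitude_spectrum(frame)
peak_bin = max(range(len(lms)), key=lms.__getitem__)
```

Applying the same function to the LP-residual of a frame instead of the frame itself would give an R-LMS style feature.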

Phase based features
The STFT representation of speech produces both magnitude and phase spectra. The phase content of speech is not perceptually distinguishable. Still, when differentiating between speakers, even the smallest parameter counts; hence phase too is a potential choice for spoof detection. Phase-associated features including Group Delay (GD), Modified GD (MGD), Instantaneous Frequency (IF), Pitch Synchronous Phase (PSP), and Baseband Phase-Difference (BPD) are therefore used by researchers for anti-spoofing (77,78). The raw phase spectrum is considered unstable, and phase wrapping makes pattern matching tedious, so phase spectra are modified further before they are used in an anti-spoofing scenario. The GD function is obtained as the derivative of the phase spectrum with respect to frequency, where the princ(.) function maps the phase to (−π, π). Despite its ability to extract pitch and formants efficiently from speech, the standard GD function fails to capture short-time spectral content when zeros in the z-plane lie close to the unit circle. Hence, the MGD function was introduced to overcome this shortcoming of the GD function (77). In the MGD computation, the parameters α and γ fine-tune the function, Z(w) is the complex speech spectrum, and M(w) is the smoothed Z(w). While the derivative of the phase spectrum with respect to frequency yields the GD parameters, the derivative along the time axis produces a different parameter called Instantaneous Frequency (IF) (79). It may therefore be inferred that GD and IF provide complementary content that may be useful for spoof detection. Furthermore, the BPD parameters are a more stable form of time-derivative phase parameters and are also computed to support spoof detection studies (70).
In the BPD computation, P is the frame shift expressed in samples, Ω is the frequency, which is constant and equal to 2πp/L, and L is the FFT length. The PSP is another choice when computing phase parameters. The speech signal consists of periodic and non-periodic parts. The periodic part is usually computed from a fixed frame size, whereas for the PSP parameters it is extracted at pitch instants. The Glottal Closure Instants (GCI) are important for deciding the start and end of the pitch period (77). The algorithm begins with one preset pitch period and keeps updating it from consecutive pitch periods. Another phase parameter is the Cosine Normalization Function, more commonly called Cosine Normalized Phase (CosPhase) (21). The steps for computing the CosPhase parameters are:
1. Unwrap the phase spectrum.
2. Apply the cosine function to the spectrum obtained in step 1. This normalizes the values to the range −1 to +1.
3. Apply the DCT to the normalized function from step 2, and keep the first eighteen parameters together with their Δ and Δ² coefficients.
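The three CosPhase steps listed above can be sketched for a single frame as follows. This is a hedged illustration: the direct DFT, the DCT-II variant, and the helper names are assumptions made for the example, and the delta and delta-delta coefficients are omitted.

```python
import cmath
import math

def unwrap(phases):
    """Remove 2*pi jumps so that consecutive phase values differ by at most pi."""
    out = [phases[0]]
    for p in phases[1:]:
        d = p - out[-1]
        d -= 2 * math.pi * round(d / (2 * math.pi))
        out.append(out[-1] + d)
    return out

def dct_ii(x):
    """Unnormalized DCT-II of a sequence."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * k * (t + 0.5) / n) for t in range(n))
            for k in range(n)]

def cosphase(frame, n_coeffs=18):
    """CosPhase sketch: unwrap the phase, take its cosine, apply the DCT."""
    n = len(frame)
    spectrum = [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n // 2 + 1)]
    phase = unwrap([cmath.phase(c) for c in spectrum])   # step 1
    normalized = [math.cos(p) for p in phase]            # step 2, values in [-1, 1]
    return dct_ii(normalized)[:n_coeffs]                 # step 3, first coefficients

# Toy frame made of two sinusoids.
frame = [math.sin(0.3 * t) + 0.5 * math.sin(1.1 * t) for t in range(64)]
features = cosphase(frame)
```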

Miscellaneous features
Besides the parameters that fall into fixed categories such as phase, magnitude, human speech production, or perceptual models, there exist other features that do not fit these categories but are specifically developed for spoof detection. They are handcrafted for dedicated tasks and may be fused so that their combination compensates for their individual shortcomings. These parameters are Perceptual Linear Prediction (PLP), Rectangular Filter Cepstral Coefficients (RFCC), Spectral Centroid Magnitude Coefficients (SCMC), Sub-band Spectral Flux Coefficients (SSFC), and Variable length Teager energy operator energy separation algorithm-instantaneous frequency cosine coefficients (VESA-IFCC) (86,87).

Critical Evaluation of Feature Representation in ASV
The features are the key to any speech application because the manner in which raw speech frames are represented affects the performance of that application. Therefore, knowing the salient qualities of a feature set helps in choosing the right feature.
MFCC-based features are popular across the speech and audio community because they model the human auditory response well. Despite that, these features do not consider phase when extracting parameters from speech (80). Synthetic speech production algorithms (TTS or VC) also usually ignore the phase because it is imperceptible. This led to research on phase-based features in addition to magnitude features, covering LMS, Residual-LMS, GD, MGD, IFD, BPD, and PSP. Furthermore, VC speech that uses mel-frequency warping emphasizes the lower frequency regions over the high-frequency regions, although the high-frequency components hold vital speaker traits that help to identify synthetic speech. Long-term features are also found to be more effective than short-term features (81), as are CQCC and CFCCIF. Sub-band-based features, including but not limited to LFCC and ESA-IFCC, gained importance since the artefacts in synthetic speech are spread across various sub-bands. Temporal features such as IF and magnitude envelope capture these artefacts. Additionally, wavelet filter banks have been explored to represent scalograms. The conventional LP features have also been revisited lately, as they represent spectral peaks more accurately than valleys (82).
The prosodic features represent the accent and speaker cues, yet they are easy to reproduce and hence susceptible to attacks. Extracting prosody from speech also demands a larger amount of training data. Furthermore, pitch extraction techniques do not perform well in noisy environments (11). On the contrary, high-level features are more reliable since they are less sensitive to variations in channel and noise, unlike prosodic and spectral parameters (70,90).

Machine Learning techniques
The Machine learning algorithms either govern the pattern classification or learning of features. After feature representation, the statistical models train using significant features to further prepare for testing. The test sample may belong to a known/ unknown attack or is simply a genuine speaker. The task is complex but made solvable using efficient machine learning schemes for speaker modeling, and also decision-making tasks.
The speaker models may be subdivided into generative, discriminative, and mixed/fused approaches. The generative models comprise Gaussian Mixture Models (GMM) (83,84), i-vectors (36), Vector Quantization (VQ) (21), and Hidden Markov Models (HMM) (93), while the discriminative models include Support Vector Machines (SVM) (94), deep learning, and neural networks (85,86). The growing use of DNNs in speaker verification is owed to their accurate results and their ability to discriminate between speakers (46). Consider an unknown test utterance T that is claimed to come from speaker A; verification then amounts to testing the hypothesis that T was spoken by A against the hypothesis that it was not.
The efficiency of speaker verification depends greatly on the model building; thus, an appropriate model choice will further improve the results. The sub-sections below describe commonly used modeling algorithms that lead to a reduction in EER, implying better anti-spoofing techniques (51).

Vector Quantization (VQ) technique
The VQ-based codebook mapping technique is suitable for text-dependent scenarios. During the training phase, the codebook is built through clustering techniques (87). The clustering averages out the temporal information in the training data, so no temporal alignment is required. Each input vector is compared against every codeword in the codebook, and the codeword with the least distance is selected as the match (37). A related approximation technique is the nearest neighbor algorithm, which is reported to perform better than Dynamic Time Warping (DTW) and VQ (87). As opposed to VQ, the nearest neighbor algorithm considers temporal content. Each input frame is compared to past frames, forming an inter-frame distance matrix. The nearest neighbor is the past frame with the lowest distance to the input frame. The match score is the average of these distances over the input frames, and the match scores together approximate the log-likelihood ratio.
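The codebook matching step can be illustrated with a short sketch. It assumes a codebook that has already been trained (for example by k-means), Euclidean distance as the distortion measure, and an average over frames as the utterance-level score; the function name and the toy vectors are hypothetical.

```python
import math

def vq_distortion(frames, codebook):
    """Average distance from each frame to its closest codeword (lower is a better match)."""
    total = 0.0
    for frame in frames:
        # Compare the input vector against every codeword and keep the smallest distance.
        total += min(math.dist(frame, codeword) for codeword in codebook)
    return total / len(frames)

# Toy two-dimensional feature vectors and a two-entry codebook.
frames = [[0.0, 0.0], [1.0, 1.0]]
codebook = [[0.0, 0.0], [1.0, 0.0]]
score = vq_distortion(frames, codebook)
```

A claimed identity would be accepted when the distortion against that speaker's codebook falls below a threshold chosen on development data.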

Gaussian Mixture Models (GMM)
GMMs are popular for their generalization ability and assume that the input data is Gaussian in nature (70). Each individual Gaussian has an associated mean and standard deviation, and the distribution of the feature vectors is modeled by several such Gaussians, called a mixture of Gaussians. The GMM is represented by the output probability function for a feature vector X,
$$p(X \mid \lambda_k) = \sum_{m=1}^{G} w_m\, d_m(X)$$
where the model $\lambda_k$ of the k-th speaker is a weighted sum of G components, $w_m$ is the mixture weight with $\sum_{m=1}^{G} w_m = 1$, and $d_m$ is the density of the individual component. A K-variate Gaussian component is given as
$$d_m(X) = \frac{1}{(2\pi)^{K/2}\, |\Sigma_m|^{1/2}} \exp\!\left(-\frac{1}{2} (X - \mu_m)^{T} \Sigma_m^{-1} (X - \mu_m)\right)$$
where $\mu_m$ is a mean vector of dimension K and $\Sigma_m$ is the covariance matrix of dimension K × K. The GMM for a speaker k is given as
$$\lambda_k = \{w_m, \mu_m, \Sigma_m\}, \qquad m = 1, \ldots, G$$
The number of mixtures depends on the language; English, for example, has 45 phones. The total number of mixtures must be higher than the number of phones, so 64 is an appropriate choice. The means, deviations, and mixture weights are obtained through parameter estimation, including Expectation-Maximization (EM), Maximum Likelihood (ML), and Maximum A-Posteriori (MAP) criteria (3). Out of these, the ML algorithm is quite common. For a given utterance $T = t_1, t_2, \ldots, t_N$, target model $\lambda_{target}$, and imposter model $\lambda_{imposter}$, the likelihood ratio is given as
$$\frac{p(T \mid \lambda_{target})}{p(T \mid \lambda_{imposter})}$$
Thus, on a logarithmic scale, Bayes' rule gives
$$\Lambda(T) = \log p(T \mid \lambda_{target}) - \log p(T \mid \lambda_{imposter})$$
The prior probabilities are ignored as they are constant. The likelihood ratio is compared to a threshold φ, and the claim is accepted when $\Lambda(T) \ge \varphi$.
The likelihood ratio is a score obtained by comparing the target model to the imposter model. The value of the threshold is updated regularly depending on these scores to avoid false positives and false negatives. Assuming independent frames, the likelihood of the sample belonging to the target is given as
$$\log p(T \mid \lambda_{target}) = \sum_{n=1}^{N} \log p(t_n \mid \lambda_{target})$$
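The scoring rule described above can be sketched as follows. This is a minimal example under stated assumptions: diagonal covariance matrices, frames treated as independent, and the per-frame log-likelihood ratio averaged over the utterance. Each model is passed as a hypothetical tuple `(weights, means, variances)`.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def gmm_loglik(x, weights, means, variances):
    """Log of the weighted sum of component densities, computed with log-sum-exp."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    return top + math.log(sum(math.exp(value - top) for value in logs))

def llr_score(frames, target, impostor):
    """Average per-frame log-likelihood ratio between target and impostor models."""
    total = sum(gmm_loglik(x, *target) - gmm_loglik(x, *impostor) for x in frames)
    return total / len(frames)

# Toy one-dimensional models with a single component each.
target = ([1.0], [[0.0]], [[1.0]])
impostor = ([1.0], [[2.0]], [[1.0]])
score = llr_score([[0.0]], target, impostor)
accept = score >= 0.0   # the threshold value here is an arbitrary placeholder
```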

GMM and Universal Background Models (UBM)
The ASV system accepts the test sample as known when the score obtained is equal to or greater than the set threshold. The imposter model is another GMM, also called the UBM. The UBM is built to represent the speaker-independent distribution against which any claimed speaker's identity is compared. Training a UBM requires a large amount of data, which contributes to better parameter estimation. Consequently, the number of components also increases (for example above 256, as against a single speaker GMM). The motive for building a UBM is to reduce speaker dependency, i.e., to model the speaker-independent distribution of features over many speakers' data (3,84,88). Besides serving as the imposter model, the UBM removes the need to retrain all the GMMs when new data is added: only the UBM is trained and the individual speaker models are adapted from it. Assume a UBM with C components and a feature sequence $F = F_1, F_2, \ldots, F_n$ from a specific speaker. For the i-th feature vector and the c-th component, the probabilistic alignment is given as
$$\Pr(c \mid F_i) = \frac{w_c\, p_c(F_i)}{\sum_{j=1}^{C} w_j\, p_j(F_i)}$$
Here, $p_c(F_i)$ is the probability density function of the c-th component for the i-th feature vector and $w_c$ is the mixture weight. The statistics for the weight, mean, and variance are then calculated as
$$n_c = \sum_{i=1}^{n} \Pr(c \mid F_i), \qquad E_c(F) = \frac{1}{n_c} \sum_{i=1}^{n} \Pr(c \mid F_i)\, F_i, \qquad E_c(F^2) = \frac{1}{n_c} \sum_{i=1}^{n} \Pr(c \mid F_i)\, F_i^2$$
These statistics are used to update the previous UBM values for the c-th component:
$$\hat{w}_c = \left[\alpha_c^{w}\, \frac{n_c}{n} + (1 - \alpha_c^{w})\, w_c\right] \gamma, \qquad \hat{\mu}_c = \alpha_c^{m}\, E_c(F) + (1 - \alpha_c^{m})\, \mu_c, \qquad \hat{\sigma}_c^{2} = \alpha_c^{v}\, E_c(F^2) + (1 - \alpha_c^{v})(\sigma_c^{2} + \mu_c^{2}) - \hat{\mu}_c^{2}$$
Here, γ is a scaling parameter that keeps the weights summing to unity, and $(\alpha_c^{w}, \alpha_c^{m}, \alpha_c^{v})$ are adaptation parameters that balance the past and the currently estimated parameters. For mean-only adaptation, the weight and variance adaptation parameters are set to zero.
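Mean-only adaptation of a UBM towards one speaker's data can be sketched as below. The example assumes diagonal covariances and a data-dependent adaptation coefficient of the form n_c / (n_c + r), where the relevance factor r = 16 is an assumed, commonly used value that does not appear in the text above; all function names are hypothetical.

```python
import math

def gauss_diag(x, mean, var):
    """Density of a diagonal-covariance Gaussian at vector x."""
    log_p = -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                       for xi, m, v in zip(x, mean, var))
    return math.exp(log_p)

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Return UBM means adapted towards the given frames (weights and variances kept)."""
    n_comp, dim = len(weights), len(means[0])
    counts = [0.0] * n_comp                        # soft counts n_c
    sums = [[0.0] * dim for _ in range(n_comp)]    # sum of posterior * frame
    for x in frames:
        joint = [w * gauss_diag(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        for c in range(n_comp):
            post = joint[c] / total                # probabilistic alignment Pr(c | x)
            counts[c] += post
            for d in range(dim):
                sums[c][d] += post * x[d]
    adapted = []
    for c in range(n_comp):
        alpha = counts[c] / (counts[c] + relevance)
        # Components that saw no data keep their prior mean.
        expect = [sums[c][d] / counts[c] if counts[c] > 0 else means[c][d]
                  for d in range(dim)]
        adapted.append([alpha * expect[d] + (1 - alpha) * means[c][d]
                        for d in range(dim)])
    return adapted

# Toy UBM with two one-dimensional components and 16 identical frames.
new_means = map_adapt_means([[1.0]] * 16, [0.5, 0.5],
                            [[0.0], [100.0]], [[1.0], [1.0]])
```

Components that receive many frames move close to the speaker's data, while rarely visited components stay near the UBM, which is the behaviour the adaptation parameters are meant to provide.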

Support Vector Machines (SVM)
Combining generative and discriminative models is an interesting alliance in ASV research (89). The GMM-UBM framework produces a speaker template that is fed to the SVM, and the SVM, being a natural solution to the ASV problem, discriminates between speakers efficiently (89). This framework works because the SVM is an appropriate choice for binary classification tasks.
The GMM-UBM framework assumes a diagonal covariance matrix, and MAP is employed for training. Stacking the adapted mean vectors of all K mixtures gives a high-dimensional vector called the super vector. The GMM-based super vector is thus a mapping between a speech utterance and a higher-dimensional space. These super vectors are treated as SVM features, and the unknown sample is classified by
$$u(x) = \sum_{l=1}^{M} \alpha_l\, t_l\, K(x, v_l) + b$$
where $t_l$ is the required output, equal to positive one for acceptance and negative one for an imposter. The support vector is $v_l$, b is the learning constant, $\sum_{l=1}^{M} \alpha_l t_l$ is ideally zero, and $\alpha_l > 0$. The kernel is given as
$$K(x, y) = m(x)^{T}\, m(y)$$
Here, the mapping function m(.) converts input features to super vectors, the distance measure u(x) gives the position relative to the separating hyperplane, and its polarity shows the category of the unknown sample. For a super vector x, a predicted label of zero points to the negative class while one signifies the positive class.

Joint Factor Analysis (JFA)
Two factors limit the efficiency of the GMM-UBM alliance: speaker variations and session variability, i.e., the mismatch between training and testing (90). JFA addresses both by building separate models of the individual speakers and of the channel distortions. The algorithm decomposes the GMM super vector V into a speaker-dependent super vector i and a channel-dependent super vector c,
$$V = i + c, \qquad i = j + Kf + Dg, \qquad c = Nh$$
where j is the speaker- and session-independent super vector, K is a low-rank speaker variability matrix, D is a diagonal matrix that models the residual variability not captured by the speaker subspace, and f and g are the speaker factors and residuals respectively. N is the low-rank channel variability matrix and h is the channel factor vector. During the training stage, the GMM super vectors are obtained by JFA-based training and channel-dependent information is not considered. In the testing stage, on the contrary, channel-dependent information is acquired from the test utterance and the obtained super vector is scored using a linear dot product (87).

I-vectors
JFA degrades system performance because channel-dependent content is ignored during the training stage (91). To overcome this problem, i-vectors were introduced (4,16). The i-vector approach uses a single variability space for the GMM super vectors, which are represented as
$$V = j + Bw$$
where j is the same super vector used in JFA, B is a low-rank variability matrix estimated on the entire training data, and w is the total variability factor, i.e., the i-vector. For classification, the cosine similarity score (CSS) measures the angular difference between the test i-vector $w_{Test}$ and the target i-vector $w_{Target}$:
$$\mathrm{CSS}(w_{Test}, w_{Target}) = \frac{w_{Test}^{T}\, w_{Target}}{\lVert w_{Test} \rVert\, \lVert w_{Target} \rVert}$$
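Cosine similarity scoring between two i-vectors can be sketched in a few lines. The vectors below are toy values, and the decision threshold is a hypothetical placeholder that would normally be tuned on development data.

```python
import math

def cosine_similarity(w_test, w_target):
    """Cosine of the angle between a test i-vector and a target i-vector."""
    dot = sum(a * b for a, b in zip(w_test, w_target))
    norm_test = math.sqrt(sum(a * a for a in w_test))
    norm_target = math.sqrt(sum(b * b for b in w_target))
    return dot / (norm_test * norm_target)

# Toy three-dimensional i-vectors; real i-vectors have hundreds of dimensions.
score = cosine_similarity([0.2, -0.1, 0.7], [0.25, -0.05, 0.6])
accept = score >= 0.5   # placeholder threshold
```

Because the score depends only on the angle, it is insensitive to the magnitude of the i-vectors.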

Hidden Markov Models (HMM)
An HMM comprises a hidden stochastic process that can be observed only through an observation sequence (35). The states and arcs form a Markov chain, where the arcs carry the transition probabilities that connect one state to another. The HMM differs from a plain Markov chain in that its states are hidden, whereas in a Markov chain the states and transition probabilities are directly known (92). The conventional GMM is an obvious choice when building a state-of-the-art ASV model, but it does not take into account the temporal information present in the features. Along with temporal content, linguistic information is also ignored when building a phone-based GMM (93). The HMM is therefore considered for processing temporal information. The HMM performs well in text-dependent scenarios, while the GMM takes the lead in text-independent tasks (3).

Multi-layer Perceptron (MLP)
The basic Feedforward Neural Network (FNN) is also termed a Multi-layer Perceptron (MLP) that uses back-propagation to train its weights. Usually, they perform binary classification when applied to ASV for clear distinction amongst known to unknown speakers. The construction of MLP is simple with nodes in every layer and multiple layers with interconnected nodes. The input to the nodes is utilized to calculate the weighted sum while the transfer function gives results of output nodes. The gradient descent algorithm is utilized to determine weights using back propagation. For an ASV, the MLP discriminates between the imposter and genuine speaker by computing a score from every frame of test utterance (94) .

Deep Neural Networks (DNN) for ASV
The DNN is the new hope not only for decision-making and speaker modeling but also as a source of bottleneck features. In other words, in end-to-end ASV the DNN itself acts as the feature extractor. The speaker representation is influenced by the speaker model, the representation level, and the loss function used during training. The DNN bottleneck features first capture speaker-specific information at the frame level and then at the utterance level. The DNN outputs, i.e., the DNN features, are converted into i-vectors, and PLDA is finally employed to determine the verification score (95). DNN features may therefore be used alone as a feature representation or combined with existing features such as MFCC. The commendable performance of DNNs is attributed to the fact that the GMM does not model the phonetic content in text-dependent scenarios (96). More studies based on DNNs can be found in (25,51,97).

Convolution Neural Network (CNN)
The conventional Feed-forward NN (FNN) has a layout similar to that of the CNN. The components, such as the weights, biases, and non-linear activation functions, are identical; the difference lies in local connectivity (51). In the FNN, every node is connected to all nodes of the adjacent layer, whereas the CNN uses small filters that slide over the entire input and sum the local products at each position. This is the basis of the convolution operation (70,98). Thus, the layers in a CNN are a stack of convolution, max pooling, and fully connected layers.

Recurrent Neural Network (RNN)
The RNN is a sequence-based NN that estimates its weights across time steps (77). When the RNN is unfolded, a DNN is obtained in which each layer corresponds to one time step. With weight matrices $W_x$ (where x can be input, hidden, or output) and biases $b_y$ (where y is input and/or output), the output for an input sequence $r_1, r_2, \ldots, r_L$ is given as
$$h_l = \sigma\!\left(W_{input}\, r_l + W_{hidden}\, h_{l-1} + b_{input}\right), \qquad o_l = W_{output}\, h_l + b_{output}$$
where $h_l$ is the hidden state at step l and σ(.) is a non-linear activation. The conventional RNN captures temporal content in a single direction only; hence the Bidirectional RNN (99) was proposed.
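A forward pass through a simple recurrent layer can be sketched as below. This is an illustrative example assuming an Elman-style recurrence with a tanh activation and a zero initial hidden state; the parameter names (`w_in`, `w_hid`, `w_out`, `b_hid`, `b_out`) are hypothetical and the weights are toy values rather than trained ones.

```python
import math

def rnn_forward(inputs, w_in, w_hid, w_out, b_hid, b_out):
    """Run a sequence through a simple RNN and return the output at every time step."""
    hidden = [0.0] * len(b_hid)          # initial hidden state
    outputs = []
    for r in inputs:
        # New hidden state depends on the current input and the previous hidden state.
        hidden = [math.tanh(sum(w_in[i][j] * r[j] for j in range(len(r)))
                            + sum(w_hid[i][j] * hidden[j] for j in range(len(hidden)))
                            + b_hid[i])
                  for i in range(len(b_hid))]
        outputs.append([sum(w_out[k][i] * hidden[i] for i in range(len(hidden))) + b_out[k]
                        for k in range(len(b_out))])
    return outputs

# One input unit, one hidden unit, one output unit.
outputs = rnn_forward([[1.0], [0.0]], [[1.0]], [[1.0]], [[1.0]], [0.0], [0.0])
```

The second output is non-zero although the second input is zero, which shows how the hidden state carries information forward in time. A bidirectional variant would run a second pass over the reversed sequence and combine both hidden states.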

Critical Evaluation of Machine Learning algorithms in ASV
GMMs have been widely chosen on account of their commendable performance in the ASV task. At the same time, they require large training datasets of moderate to high quality, which poses difficulty in noisy environments. The need for a large dataset can be eased by using a diagonal covariance matrix, which also reduces the computational complexity.
Another concern is unknown data: owing to its generative nature, the GMM is unable to capture non-linearities, which contributes to a low classification score. This concern may be addressed by partitioning the data into training, development, and testing sets. The GMM-UBM framework is preferred over an individual GMM because it copes better with unknown data: the UBM is trained on a larger dataset, and a gradual increase in the number of mixtures makes it more robust to unknown data. Furthermore, the GMM-SVM alliance combines the advantages of generative and discriminative models, leading to high accuracy scores. The MLP, in turn, needs more data for optimized performance and a longer training time to reach it. Above all, DNNs are found to outperform the other networks, with the ability to serve as feature extractors and to learn from unknown data, in contrast to generative models. The CNN is used where variability is observed in time, while the RNN is preferred in the case of temporal data (100).

Score Normalization
Unwanted variations in the score are stabilized using score normalization techniques. Normalization is equivalent to thresholding in speaker-dependent scenarios. The normalization techniques found in the literature share the assumption that the imposter scores follow a Gaussian distribution. Thus, the mean $\mu_G$ and standard deviation $\sigma_G$ of that distribution are used to normalize a given score s as
$$\tilde{s} = \frac{s - \mu_G}{\sigma_G}$$
Zero Normalization (ZNorm) estimates these statistics for the target speaker's model (101,102): the target model is scored against a set of imposter utterances, and the resulting imposter similarity scores are used to compute $\mu_G$ and $\sigma_G$. The advantage of ZNorm is that its parameters can be estimated offline. Test Normalization (TNorm), in contrast, uses the test sample to compute $\mu_G$ and $\sigma_G$ (103): during the testing stage, a set of imposter models is scored against the specific test utterance to obtain the imposter similarity scores. The improvement due to TNorm is most significant in the region of low false positives. In contrast to ZNorm, TNorm has to be conducted online, that is, during the testing stage. ZNorm variants such as Handset Norm (HNorm) and Channel Norm (CNorm) may reduce channel and microphone effects (104). HNorm and CNorm are conducted for every individual speaker model that may be affected by handset or channel variation. During the testing stage, prior knowledge of the effect (either handset or channel) selects the respective parameters for score normalization. The main drawback of ZNorm and TNorm is that information about the imposters must be known in advance, which is not feasible in practice.
To address this issue, DNorm generates pseudo-imposter information from the background model through appropriate algorithms such as Monte-Carlo methods (105).
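ZNorm and TNorm apply the same arithmetic and differ only in where the imposter (cohort) scores come from, which the following sketch illustrates. The score values are made up for the example, and the population standard deviation is an assumed choice.

```python
import statistics

def normalize_score(raw_score, cohort_scores):
    """Shift and scale a raw score by the mean and standard deviation of cohort scores."""
    mu = statistics.mean(cohort_scores)
    sigma = statistics.pstdev(cohort_scores)
    return (raw_score - mu) / sigma

# ZNorm: cohort scores come from imposter utterances scored against the target model,
# so they can be computed offline, once per enrolled speaker.
znorm_cohort = [0.0, 2.0]
z_score = normalize_score(3.0, znorm_cohort)

# TNorm: cohort scores come from the test utterance scored against imposter models,
# so they have to be computed online, once per test utterance.
tnorm_cohort = [-1.0, 0.5, 1.5, 2.0]
t_score = normalize_score(3.0, tnorm_cohort)
```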

Limitations of Score Normalization technique
ZNorm can estimate its parameters offline, while TNorm needs online estimation. TNorm outperforms cohort normalization by employing the variance to approximate the distribution of the cohort population more efficiently. Because this estimation is based on the same test utterance that is scored against the target speaker, acoustic mismatches are avoided. However, TNorm has the major drawback of depending on the speaker's language (102).

Evaluation Metrics for ASV
An ASV system built with experimental features and classifiers needs to perform close to, or even better than, state-of-the-art techniques. This can be verified through the standard evaluation measures used in ASV scenarios.
Let t be a verification trial linked to two speech samples, $t = (s_1, s_2)$. When the trial is supervised, it carries a label from $\{A_1, A_2\}$ that indicates whether the two samples belong to the same speaker ($A_1$) or to different speakers ($A_2$). Consider a test set T of same-speaker and different-speaker trials labelled $A_1$ or $A_2$. Based on these labels, two errors are possible: Miss Detection (MD), also called False Negative (FN) or False Rejection Ratio (FRR), and False Positive (FP), also called False Alarm (FA) or False Acceptance Ratio (FAR).
The error probabilities for a specific test set are given as
$$P_{MD}(d) = \frac{|\{t \in T_{A_1} : \mathrm{score}(t) < d\}|}{|T_{A_1}|}, \qquad P_{FA}(d) = \frac{|\{t \in T_{A_2} : \mathrm{score}(t) \ge d\}|}{|T_{A_2}|}$$
where $T_{A_1}$ and $T_{A_2}$ are the subsets of T labelled $A_1$ and $A_2$, and d is the threshold that controls both errors, giving the user the freedom to choose a preferred operating point. The Detection Cost Function (DCF), first presented in the NIST-SRE challenge, determines the overall cost of both errors (11). It is the weighted aggregate of the probabilities of FP and FN, given as
$$C_{dcf}(d) = C_{MD}\, p(A_1)\, P_{MD}(d) + C_{FA}\, p(A_2)\, P_{FA}(d)$$
where $p(A_2) = 1 - p(A_1)$. Here, $C_{MD}$ is the relative cost of MD, $C_{FA}$ is the relative cost of FA, and $p(A_1)$ and $p(A_2)$ are the prior probabilities. Normalizing $C_{dcf}$ by the a-priori cost $C_{default}$ yields a more specific metric, $C_{norm} = C_{dcf} / C_{default}$. $C_{default}$ is the cost of deciding that all trials are same-speaker or that all trials are different-speaker, whichever is less: $C_{default} = \min\{C_{MD}\, p(A_1),\; C_{FA}\, p(A_2)\}$.
The cost is calculated from hard decisions at a given threshold; to obtain a measure that does not depend on the threshold, the threshold that minimizes the cost is chosen, which gives the min DCF (15). Thus, the min DCF is computed as
$$C_{min} = \min_{d}\, C_{dcf}(d)$$
Another popular metric, the EER, is the operating point at which FP and FN are equal, and it should ideally be as low as possible (8). Visualization also makes decision-making easier, with plots such as the Detection Error Tradeoff (DET) curve, which is an alternative to the ROC curve. The minimum DCF and the EER can be marked on the DET curve (15). Furthermore, constellation plots are a better choice when similar scores appear overlapping or close together on the DET graph (106). A constellation plot shows chosen pairs of operating points in a 2D plane.
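The EER and the minimum DCF can be estimated from two lists of scores by sweeping the threshold over the observed score values. This is a minimal sketch: trials are accepted when the score is at or above the threshold, the EER is approximated at the threshold where the two error rates are closest, and the prior and cost values are assumed defaults rather than values from a specific evaluation plan.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Miss rate (FRR) and false alarm rate (FAR) at a given threshold."""
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return frr, far

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: error rate at the threshold where FRR and FAR are closest."""
    best_gap, best_eer = None, None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        frr, far = error_rates(target_scores, nontarget_scores, thr)
        gap = abs(frr - far)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (frr + far) / 2
    return best_eer

def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum of the (unnormalized) detection cost over all candidate thresholds."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores)) + [float("inf")]
    costs = []
    for thr in thresholds:
        frr, far = error_rates(target_scores, nontarget_scores, thr)
        costs.append(c_miss * p_target * frr + c_fa * (1 - p_target) * far)
    return min(costs)

# Toy scores: genuine (target) trials and spoofed or imposter (non-target) trials.
genuine = [0.9, 0.8, 0.3]
spoofed = [0.1, 0.2, 0.7]
eer = equal_error_rate(genuine, spoofed)
cost = min_dcf(genuine, spoofed, p_target=0.5)
```

With large score lists, sorting the scores once and accumulating the error counts would be more efficient than recomputing the rates for every threshold.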

Recent Trends and Future Perspectives
Currently, there is an immediate demand for transparent methods to evaluate spoofing attacks. This task is not as easy as it may seem, as several parameters need to be considered while developing anti-spoofing techniques. Addressing only a single anti-spoofing method is not enough, as the severity of the attack type and the data acquisition environment also influence the performance. A second countermeasure that takes over when the first one fails might therefore provide two-step authentication for ASV systems. With advancements in speaker recognition and verification, algorithms are being developed in which the conventional features are no longer required and are sometimes replaced with deep features. End-to-end DNN systems have gained momentum recently due to their low EER in contrast to the state-of-the-art GMM techniques. There have also been instances where fusion-based algorithms have been reported to reach 0% EER. To conclude, it is certainly time-consuming and difficult to bring out novelty that lowers the EER, yet with a proper literature review the level of complication of the task may be reduced. Moreover, developments in building stronger ASV systems extend protection to current speech biometric systems.

Acoustic conditions of the speaker
The acoustic environment, such as background noise, reverberated speech, and the effects of windowing and overlapping speech frames, influences the outcome of the ASV framework. These conditions need to be considered before developing an ASV system in order to avoid false rejections caused by contaminated speech, which may come from the natural environment. Hence, either speech filtering must be performed, or noise-resistant feature sets and decision-making algorithms need to be engineered to foster a better ASV system.

Traits of the speaker
Selecting a dataset does not involve listening to every speaker's voice; rather, a significant number of samples with variation in attack types is enough for building a generic anti-spoofing ASV framework. Yet, sometimes the machine learning system is unable to capture the speaker-specific information and performs badly by not reaching convergence. This happens when speaking rate, speaking style, and accent are not taken into account as factors that could hamper the ASV's performance. To the best of the authors' knowledge, not much work has been carried out on developing systems that adapt to the speaker's traits.

Demand for unsupervised data and training
The requirement for unknown test data, and hence for unsupervised training, is both challenging and demanding. Since the ASV Spoof 2017 challenge, test samples for unknown attacks have been considered. Hence, some form of unsupervised learning mechanism that captures the generality of the unlabelled data and also detects unknown attacks needs to be explored. Ultimately, the aim of researchers must be to make quick amendments to the already trained model (if necessary) and to let the algorithms learn continuously from their ideal role models, "the humans".

Language independent anti-spoofing ASV
The demand for making the system capable is unending, as we desire the machines to be exactly like us but without the physiological aspect. Similarly, the speaker's language should no longer be an obstacle to verifying his identity. Imagine a scenario where a speaker needs access to his bank account (through speech-based authentication, of course) but suspects that an attacker is watching him, so he might change his language to get access without the attacker being able to impersonate him, record him, or take some other action. This amounts to a three-level authentication scenario.

Lack of exploring human-based features and Universal anti-spoofing techniques
The studies conducted to date consider single features or several dissimilar features along with their fused versions. There is a need to develop features that are interdependent and influenced by the human speech production mechanism. A chain of features that closely mirrors human speech production may thus act as a mediating trait in feature engineering. Additionally, while developing anti-spoofing systems, the focus should be on building algorithms that consider all attacks along with noisy real environments to obtain a universal measure. Such diversity is needed to reach standardization and avoid bias.

Lack of grading in database development
The databases available so far do not combine variation in environments, such as noisy places, with variation in attack types. For instance, the ASV Spoof 2015 challenge contains only synthetic speech from TTS and VC frameworks, while the ASV Spoof 2019 dataset has replay speech in addition to VC and TTS speech. Moreover, if datasets are recorded in clean, noiseless scenarios, the system is very likely to fail for unknown attacks with unknown noise in the test speech.

Short utterance testing and Text-dependent systems
Real-time use involves short utterances, and a system that fixes the length of samples for authentication becomes vulnerable in addition to facing performance degradation. A user's voice must remain unique and distinguishable even in a short sample, a requirement that fixed-length systems fail to meet. Furthermore, ongoing research is largely based on text-independent systems, which are a good fit for surveillance applications. Conversely, text-dependent ASV systems, which are used in authentication scenarios, are yet to reach an established stage.

Acknowledgement
This research was funded by Taylor's University, Lakeside Campus, Selangor, Malaysia. The authors are deeply grateful to the University and the School of Computer Science and Engineering for their encouragement and support.

Conclusion
This article gives a broad view of speech-based spoofing attacks and anti-spoofing measures. The most recent, state-of-the-art feature representation and pattern matching techniques employed for building countermeasures are discussed along with their critical analysis. The score normalization techniques and evaluation metrics used to judge the performance of an anti-spoofing system are covered as well. To support anti-spoofing system development, the dedicated spoofing datasets, their traits, and their limitations are also described.
Studies evaluating ASV systems exposed to spoofing attacks have increased lately. However, it has been difficult to reconstruct unbiased and genuine attack scenarios for building spoofing datasets. Moreover, spoofing attack samples are generated in controlled environments, so it is rare or impossible to assemble datasets with diverse characteristics. Additionally, under real attack conditions the type of attack is unknown, so there is a need to develop systems that work without any constraints. Hence, a few open-ended questions remain in this decade-old anti-spoofing research: what are the problems in the countermeasures developed so far; what future directions would contribute to improving anti-spoofing frameworks; and, lastly, where should one begin in order to make a difference in this domain?
The most obvious place to begin is evaluating speech-based attacks. This is not a trivial task; it is demanding in terms of acquiring knowledge about the newer irregularities introduced as spoofing techniques are built. Furthermore, it may be fair to say that no single algorithm can be considered the best; rather, a fusion of the best algorithms may be capable of performing equally well across all spoofing attacks. Thus, the development of alternative countermeasures to those already proposed is the need of the hour. Last but not least, the development of refined spoof detection schemes has opened deeper research possibilities, as the need to prevent such attacks is now even more crucial. This creates opportunities for fostering reliable spoof detection algorithms.