EEG-Based User Reaction Time Estimation Using Riemannian Geometry Features

Riemannian geometry has been successfully used in many brain–computer interface (BCI) classification problems and demonstrated superior performance. In this paper, for the first time, it is applied to BCI regression problems, an important category of BCI applications. More specifically, we propose a new feature extraction approach for electroencephalogram (EEG)-based BCI regression problems: a spatial filter is first used to increase the signal quality of the EEG trials and also to reduce the dimensionality of the covariance matrices, and then Riemannian tangent space features are extracted. We validate the performance of the proposed approach in reaction time estimation from EEG signals measured in a large-scale sustained-attention psychomotor vigilance task, and show that compared with the traditional powerband features, the tangent space features can reduce the root mean square estimation error by 4.30%–8.30%, and increase the estimation correlation coefficient by 6.59%–11.13%.


I. INTRODUCTION
Brain-computer interfaces (BCIs) can use brain signals such as the scalp electroencephalogram (EEG) to enable people to communicate or control external devices [23], [36]. Thus, they can help people with devastating neuromuscular disorders such as amyotrophic lateral sclerosis, brainstem stroke, cerebral palsy, and spinal cord injury [53]. However, there are still many challenges in their transition from laboratory settings to real-life applications, including the reliability and convenience of the sensing hardware [29], and the availability of highperformance and robust algorithms for signal analysis and interpretation [24], [33], [34], [47]. This paper focuses on the latter, particularly, feature extraction for EEG-based BCIs.
For example, Li, Wong, and de Bruin [28] used RG of the EEG power spectral density matrices for sleep pattern classification. They also proposed a closed-form weighting matrix for the power spectral density matrices to minimize the distance between similar features and to maximize the distance between dissimilar features, and demonstrated better performance than the Euclidian distance and the Kullback-Leibler distance. Barachant et al. [5] proposed two RG approaches for motor imagery classification. The first uses the spatial covariance matrices of the EEG signal as features and RG to directly classify them in the manifold of symmetric and positive definite (SPD) matrices. The second maps the covariance matrices onto the Riemannian tangent space, which is a Euclidean space, and then performs variable selection and classification. They achieved comparable or better performance than a multiclass Common Spatial Pattern (CSP) plus Linear Discriminant Analysis (LDA) approach. In [14], Congedo, Barachant, and Andreev further used RG to build calibrationless BCI systems for applications based on eventrelated potentials, sensorimotor (mu) rhythms, and steadystate evoked potential. It outperformed several state-of-theart approaches, including xDAWN, stepwise LDA, CSP+LDA, and blind source separation plus logistic regression. Barachant [7] also proposed a spatial filter to increase the signal to signalplus-noise ratio of magnetoencephalography (MEG) signals before constructing a special form of a covariance matrix for RG feature extraction, and a k-means clustering like unsupervised learning algorithm in the Riemannian manifold to improve the offline classification performance. This approach outperformed 266 other approaches and won the Kaggle "DecMeg2014 -Decoding the Human Brain" competition 1 , which aimed to predict visual stimuli from MEG recordings of human brain activity. Kalunga et al. [25] proposed an online classification approach in the Riemannian space and showed that it outperformed Canonical Correlation Analysis in Steady-State Visually Evoked Potential classification. Yger, Lotte, and Sugiyama [59] empirically compared several covariance matrix averaging methods for EEG signal classification. They showed that RG for averaging covariance matrices improved performances for small dimensional problems, but as the dimensionality of the covariance matrix increased, RG became less efficient. Lotte [33] also proposed a framework to combine transfer learning, ensemble learning, and RG for calibration time reduction, which outperformed CSP+LDA. The Riemannian distance was used in regularization to emphasize auxiliary users whose covariance matrices are close to the target user. Navarro-Sune et al. [39] proposed a BCI to automatically detect patient-ventilator disharmony from EEG signals. RG of EEG covariance matrices was used in semi-supervised learning for effective classification of respiratory state, and it outperformed the Euclidean distance. Waytowich et al. [48] proposed an approach to integrate RG with transfer learning and spectral meta-learner [40], an offline ensemble fusion approach, for user-independent BCI, and demonstrated in single-trial event-related potential classification that it can significantly outperform existing calibration-free techniques and traditional within-subject calibration techniques when limited data is available.
All above approaches focused on EEG classification problems in BCI, whereas BCI regression problems have been largely overlooked. In theory a regression problem is equivalent to a classification problem with infinitely many classes, and hence the output has much finer granularity than a traditional two-class or multi-class classification problem, which provides richer information in decision making. There are at least two types of BCI regression problems in the literature and practice. The first type is behavioral or cognitive status prediction, e.g., estimating the continuous value of a driver's drowsiness from the EEG [30]- [32], [49], [54], [56]- [58], and estimating a subject's response speed in a psychomotor vigilance task (PVT) from the EEG [55]. The second type is direct control applications, e.g., controlling the movement of a mouse cursor using BCI [13], [21], [35], [51], [52], and controlling the continuous movement of a hand in the 3D space using EEG [12].
In this paper, we apply RG and tangent space features to supervised BCI regression problems. To overcome the limitation pointed out by Yger, Lotte, and Sugiyama [59], i.e., RG is less efficient when the dimensionality of the covariance matrix is large, we adopt an approach similar to what Barachant used in [7]: we first use a spatial filter proposed in [55] to reduce the dimensionality of the covariance matrices and also to increase the EEG signal quality, and then extract the RG features in the Riemannian tangent space. We validate the performance of the proposed approach in reaction time (RT) estimation from EEG signals measured in a large-scale sustained-attention PVT [16], which collected 143 sessions of data from 17 subjects in a 5month period. To our knowledge, this is the first time that RG has been used in BCI regression problems.
The remainder of this paper is organized as follows: Section II describes the spatial filter we proposed earlier for supervised BCI regression problems. Section III introduces RG and the tangent space features for BCI regression problems. Section IV describes the experimental setup, RT and EEG data preprocessing techniques, and the procedure to evaluate the performances of different feature extraction methods. Section V presents the results of the comparative studies. Section VI provides parameter sensitivity analysis and additional discussions. Finally, Section VII draws conclusions and outlines a future research direction.

II. SPATIAL FILTERING FOR SUPERVISED BCI
REGRESSION PROBLEMS Recently we [55] proposed two spatial filters for supervised BCI regression problems, which were extended from the common spatial pattern (CSP) algorithm for supervised classification problems. They have similar performance and computational cost. One of them, CSP for regression -one versus the rest (CSPR-OVR), is briefly introduced in this section, as the RG features are better extracted from the spatially filtered EEG data than the raw EEG data.
Let X n ∈ R C×S (n = 1, ..., N ) denote the nth EEG trial in the training data, where C is the number of channels and S the number of time samples. We assume that the mean of each channel measurement has been removed, which is usually performed by band-pass filtering. Let y n ∈ R be the corresponding RT of the nth trial. CSPR-OVR first constructs K fuzzy sets [60], which partition the training samples into K fuzzy classes. To do that, it partitions the interval [0, 100] into K + 1 equal intervals, and denotes the partition points as {p k } k=1,...,K . It is easy to obtain that For each p k , CSPR-OVR then finds the corresponding p k percentile value of all training y n and denotes it as P k . Next we define K fuzzy classes from them, as shown in Fig. 1. Then, for each fuzzy class, CSPR-OVR computes its mean spatial covariance matrix as: where µ k (y n ) is the membership degree of y n in Fuzzy Class k.
Next CSPR-OVR designs a spatial filtering matrix W * k ∈ R C×F , where F is the number of individual vector filters, to maximize the variance difference between Fuzzy Class k and the rest, i.e., where Tr(·) is the trace of a matrix. (3) is a generalized Rayleigh quotient [22], and the solution W * k is the concatenation of the F eigenvectors associated with the F largest eigenvalues of the matrix and the spatially filtered trial for X n is: In summary, the complete CSPR-OVR algorithm for supervised BCI regression problems is shown in Algorithm 1.
Input: EEG training examples (X n , y n ), where X n ∈ R C×S , n = 1, ..., N ; K, the number of fuzzy classes for y n ; F , the number of spatial filters for each fuzzy class. Output: Spatially filtered EEG trials X ′ n ∈ R KF ×S . Band-pass filter each X n to remove the mean of each channel; Compute {p k } k=1,...,K in (1); Compute the corresponding percentile values {P k } k=1,...,K for y n ; Construct the K fuzzy classes as shown in Fig. 1; ComputeΣ k by (2); Compute W * k by (3); Construct W * by (4); Return X ′ n by (5)

III. RG AND THE TANGENT SPACE FEATURES
This section introduces the basics of RG, and an approach to extract the Riemannian tangent space features.

A. Riemannian Geometry
The RG approach for BCI works on the covariance matrices of EEG trials, which are symmetric positive-definite and form a differentiable Riemannian manifold M [20] with dimensionality R(R + 1)/2, where R is the number of rows (columns) of the covariance matrices. As a result, we need to use Riemannian metrics, instead of the traditional Euclidean metrics, which are more appropriate for flat spaces of vectors. Particularly, we are interested in the distance measure between two covariance matrices, as many machine learning methods rely on such distances.
The Riemannian distance δ(Σ, Σ n ) between two covariance matricesΣ ∈ R R×R and Σ n ∈ R R×R , called the geodesic, is the minimum length of a curve connecting them on the manifold M. It can be computed as [4], [37]: where the subscript F denotes the Frobenius norm, and λ r , r = 1, ..., R, are the real eigenvalues ofΣ −1 Σ n .
AtΣ ∈ M, a scalar product can be defined in the associated tangent space TΣM. This tangent space is Euclidean and locally homomorphic to the manifold. So, Riemannian distance computations in the manifold can be approximated by Euclidean distance computations in the tangent space [6].
The logarithmic map projects locally a Σ n ∈ M onto the tangent space TΣM ofΣ by: where logm(·) denotes the logarithm of a matrix [10]. The logarithm of a diagonalizable matrix The exponential map projects an elementΣ n on the tangent space TΣM back to the manifold M by: where expm(·) denotes the exponential of a matrix [10]. Fig. 2 illustrates a Riemannian manifold M, the tangent space TΣM atΣ, the geodesic betweenΣ and Σ n , and the corresponding logarithmic and exponential maps.
The Riemannian distance δ(Σ, Σ n ) between two covariance matricesΣ and Σ n on the manifold M can also be computed by a Euclidean distance in the tangent space aroundΣ, i.e. [5], where the upper(·) operator keeps the upper triangular part of a symmetric matrix and vectorizes it by applying weight 1 for the diagonal elements and weight √ 2 for the out-of-diagonal elements [45]. The RG mean [42], or the intrinsic mean [19], of N covariance matrices is defined as the matrix minimizing the sum of the squared Riemannian distances, i.e., There is no closed-form expression for the RG mean, but an iterative gradient descent algorithm (see Algorithm 2 [19]) can be used to find the solution. Note that Algorithm 2 makes heavy use of the logarithmic and exponential maps. In this paper we used the implementation in the Matlab Covariance Toolbox 2 .

B. Tangent Space Features for BCI Regression Problems
To use the tangent space features for BCI regression problems, we first spatially filter each X n to obtain X ′ n in (5), and then estimate its spatial covariance matrix Σ n ∈ R KF ×KF (note that each row of X ′ n has zero mean): and weight √ 2 to the out-of-diagonal elements so that their Euclidean norm is equal to the Riemannian distance betweenΣ and Σ k . The weights do not have an effect when regression methods like LASSO are used, but are very important for distance based regression methods like kNN regression.
The complete tangent space feature extraction procedure for BCI regression problems is summarized in Algorithm 3.

Algorithm 3:
The Riemannian tangent space feature extraction procedure for BCI regression problems.

IV. EXPERIMENTS AND THE PERFORMANCE EVALUATION PROCESS
This section introduces a PVT experiment that was used to evaluate the performances of the proposed tangent space feature extraction method, and the corresponding RT and EEG data preprocessing procedures.

A. Experiment Setup
Seventeen university students (13 males; average age 22.4, standard deviation 1.6) from National Chiao Tung University (NCTU) in Taiwan volunteered to support the data-collection efforts over a 5-month period to study EEG correlates of attention and performance changes under specific conditions of real-world fatigue [26], as determined by the percent effectiveness score of Readiband [43]. The Institutional Review Board of NCTU approved the experimental protocol.
The customer-designed daily sampling system consists of a smartphone, actigraph, sleep diary, subjective scales of fatigue and stress, and software for recording, storing, transmitting, and analyzing data acquired from individuals in their natural environments on a daily basis. Each participant was provided a wrist-worn actigraph (Fatigue Science Readiband, Vancouver, BC), and was instructed to complete several subjective report scales and enter the percent effectiveness score from the actigraph approximately 30-60 minutes upon awakening each morning and to be available for experiment testing approximately once every 1-3 weeks over a 5-month period for a total of nine repeated sessions. Data recorded by the daily sampling system included electronically-adapted visual analog scales of fatigue and stress, the Karolinska Sleepiness Scale [1], and the Pittsburgh Sleep Diary [38]. The daily sampling data were automatically uploaded from the smartphone to a designated secure server at NCTU on a daily basis. In this way we could track and identify periods when the participants were currently exhibiting low, normal, or high levels of fatigue based on the percent effectiveness score values (>90%, 70 − 90%, <70%, respectively). The goal was to examine the participants during experiment sessions three times within each of the three fatigue levels. Most participants finished all nine sessions.
When the participants reported to the laboratory, we measured their fatigue level on site again right before the experiment to make sure it was close to the fatigue state reported via the smartphone. Upon completion of the related questionnaires and the informed consent form, subjects performed a PVT, a dynamic attention-shifting task, a lane-keeping task, and selected surveys preceding each condition. EEG data were recorded at 1000 Hz using a 64-channel NeuroScan Quik-Cap system (62 EEG channels and 1 electrocardiogram channel). The ground was between FPZ and FZ, and the reference channels were A1 and A2 at the mastoids.
In this paper we focus on the PVT [15], which is a sustained-attention task that uses RT to measure the speed with which a subject responds to a visual stimulus. It is widely used, particularly by NASA, for its ease of scoring, simple metrics, convergent validity, and free of learning effects. In our experiment, the PVT was presented on a smartphone with each trial initiated as an empty solid white circle centered on the touchscreen that began to fill in red displayed as a clockwise sweeping motion like the hand of a clock. The sweeping motion was programmed to turn solid red in one second or terminate upon a response by the participants, which required them to tap the touchscreen with the thumb of their dominant hand. The RT was computed as the elapsed time between the appearance of the empty solid white circle and the participant's response. Following completion of each trial, the circle went back to solid white until the next trial. Inter-trial intervals consisted of random intervals between 2-10 seconds.
143 sessions of PVT data were collected from the 17 subjects, and each session lasted 10 minutes. Our goal is to predict the RT using a short EEG trial immediately before it.

B. Performance Evaluation Process
The following procedure was used to evaluate the performances of different feature extraction methods: 1) RT data preprocessing to remove outliers. The number of trials and the mean RTs for the 17 subjects are shown in Table I. Subject 17 may have data recording issues, because many of his RTs were longer than 5 seconds, which are highly unlikely in practice, and his mean RT was more than two times larger than the largest mean RT from other subjects. So we excluded him from consideration in this paper, and only used Subjects 1-16. The RTs were very noisy, and there were obvious outliers. It is very important to suppress the outliers and noise so that the performances of different algorithms can be more accurately compared. We employed the following 2-step procedure for RT data preprocessing: a) Outlier removal, which aimed to remove abnormally large RTs. First, a threshold θ = m y + 3σ y was computed for each subject, where m y is the mean RT from all sessions of that subject, and σ y is the corresponding standard deviation. Then, all RTs larger than θ were removed. Note that the threshold was different for different subjects. b) Moving average smoothing, which replaced each RT by the average RT during a 60 seconds moving window centered at the onset of the corresponding PVT to suppress the noise.
2) EEG data preprocessing to remove or suppress artifacts and noise. Generally raw EEG data recorded from the scalp contain many artifacts (e.g., head motion, blinks, eye movements, etc.) and noise (e.g., power-line noise, noise caused by changes in electrode impedances, etc.) [11], [46], so it is very important to remove or suppress them to increase the signal-to-noise ratio before a machine learning algorithm is applied. This paper used the standardized early-stage EEG processing pipeline (PREP) [11], which consists of three steps: a) remove line-noise, b) determine and remove a robust reference signal, and, c) interpolate the bad channels (channels with a low recording signal-to-noise ratio). The preprocessed EEG signals coming out of PREP were downsampled to 250 Hz. They were then epoched to 5-second trials according to the onset of the PVTs: if a PVT started at t, then the 62-channel EEG trial in [t − 5, t] seconds was used to predict the RT, i.e., X n ∈ R 62×1250 . Each trial was then individually filtered by a [1,20] Hz finite impulse response band-pass filter to make each channel zero-mean and to remove nonrelevant high frequency components.

3) 5-fold cross-validation to compute the regression performance for each combination of feature set and regression method.
We first randomly partitioned the trials into five folds; then, used four folds for supervised spatial filtering and regression model training, and the remaining fold for testing. We repeated this five times so that every fold was used in testing. Finally we computed the regression performances in terms of root mean square error (RMSE) and correlation coefficient (CC). We extracted the following three different feature sets for each preprocessed EEG trial: • Feature Set 1 (FS1): Theta and Alpha powerband features from the band-pass filtered EEG trials. We computed the average power spectral density in the Theta band (4-8 Hz) and Alpha band (8-13 Hz) for each channel using Welch's method [50], and converted these 62 × 2 = 124 band powers to dBs as our features. • Feature Set 2 (FS2): Theta and Alpha powerband features from EEG trials filtered by Algorithm 1. This procedure was almost identical to the above one, except that the band-pass filtered EEG trials That is, we first band-pass filtered the raw EEG signals, then spatially filtered them by Algorithm 1 (K = 10 and F = 3), and further applied Algorithm 3 to extract the tangent space features, which had 30 × 31/2 = 465 dimensions. Two regression methods were used on each feature set: LASSO [44], and kNN regression [2]. For labeled training data {x n , y n } n=1,...,N , LASSO solves the following minimization problem to find a sparse linear regression model: where λ > 0 is an adjustable parameter, which was optimized by an inner 5-fold cross-validation on the training dataset in this paper. Once β 0 and β are identified, the final LASSO regression model is: We used k = 5 in kNN. Once the five nearest neighbors {x i , y i } i=1,...,5 to the new trial x n are identified, the regression output is computed as a weighted average: where the weights are the inverses of the feature distances:

4) Repeat
Step 3 10 times and compute the average regression performance.

V. EXPERIMENTAL RESULTS
This section compares the informativeness of the features in FS1, FS2 and FS3, and presents the regression performances.

A. Informativeness of the Features
Before studying the regression performance, it is important to check if the extracted features in FS1, FS2 and FS3 are indeed meaningful.
In this first study, we computed the CC between the RT and powerband features in FS1 at different channel locations for each of the 16 subjects, and then averaged them. The corresponding topoplot is shown in Fig. 3. Both theta and alpha band powers show higher correlation at the central and central-frontal regions of the brain; however, generally the CC is small. This indicates that FS1 features are not very informative.

Fp1
Fpz  In the second study, we picked a typical subject, partitioned his data randomly into 50% training and 50% testing, and extracted the powerband features FS1. We then designed the spatial filters using Algorithm 1 on the training data, and extracted the corresponding powerband features FS2, and the Riemannian tangent space features FS3 using Algorithm 3. For each feature set, we identified the top three features that had the maximum CCs with the RT using the training data, and also computed the corresponding CCs for the testing data. The results are shown in Fig. 4, where in each panel the data on the left of the black dotted line were used for training, and the right for testing. The top thick curve is the RT, and the bottom three curves are the maximally correlated features identified from the training data. The training and testing CCs are shown on the left and right of the corresponding feature, respectively. For FS1, we also show the corresponding channel labels and powerband names. For FS2, we only show the powerband names of the top three features, as a channel here does not have a specific label (each channel in FS2 is a weighted combination of all 62 physical electrodes). Fig. 4 shows that FS2 gave much smoother features than FS1, and also achieved much larger CCs to the RT, both in training and testing, suggesting that spatial filtering by Algorithm 1 can indeed increase the signal quality. FS3 further achieved larger training and testing CCs to the RT than FS2, suggesting that the tangent space features are more informative than the powerband features.

B. Estimation Performance Comparison
The RMSEs and CCs of LASSO and kNN using three different feature sets are shown in Fig. 5 for the 16 subjects. Recall that for each subject the feature extraction methods were run 10 times, each with randomly partitioned training and testing data, and the average regression performances are shown here. The average RMSEs and CCs across all subjects are also shown in the last group of each panel.   5 shows that regardless of which regression method was used, generally FS2 resulted in smaller RMSEs and larger CCs than FS1, suggesting that the spatial filtering approach can indeed improve the regression performance. Fig. 5 also shows that FS3 further achieved better RMSEs and CCs than FS2, suggesting that the tangent space features were more effective than the powerband features. Finally, LASSO had better performance than kNN on FS1, but kNN became better on FS2 and FS3. The RMSEs for Subjects 4, 9 and 11 in Fig. 5 are much larger than others, because, as shown in Table I, these three subjects have much larger RTs than others.
To illustrate the performance differences among the three feature extraction methods from another viewpoint, Fig. 6 shows the corresponding percentage performance improvements of LASSO and kNN using the three feature sets, where the legend "LASSO,FS2/FS1" means the percentage performance improvement of LASSO on FS2 over LASSO on FS1, and other legends should be understood in a similar manner. For LASSO, on average FS3 had 4.30% smaller RMSE than FS2, and 6.59% larger CC. For kNN, on average FS3 had 8.30% smaller RMSE than FS2, and 11.13% larger CC. These results again demonstrated that the tangent space features are more effective than the traditional powerband features. We also performed a two-way Analysis of Variance (ANOVA) for different regression algorithms to check if the raw RMSE and CC differences among the three feature sets (FS1, FS2, and FS3) were statistically significant, by setting the subjects as a random effect. The results are shown in Table II as "p for raw values." Study results showed that there were statistically significant differences (at 5% level) in raw CCs among different feature sets for both LASSO and kNN, but not for raw RMSEs.
However, because the RTs from different subjects had significantly different magnitudes, an ANOVA on the raw RMSEs and CCs may be unfair for those subjects with small RTs. So, we also performed a two-way ANOVA for different algorithms and feature sets on the ratios. For example, to compute the RMSE ratios for LASSO, we replaced all RMSEs for FS1 by 1, the RMSEs for FS2 by the ratios of the corresponding RMSEs from FS2 over those from FS1, and the RMSEs for FS3 by the ratios of the corresponding RMSEs from FS3 over those from FS1. In this way the RMSEs were normalized, and hence different subjects were treated equally. The corresponding ANOVA test results are shown in Table II as "p for ratios." Observe that there were statistically significant differences (at 5% level) in both RMSE ratios and CC ratios among different feature sets for both LASSO and kNN. Then, non-parametric multiple comparison tests based on Dunn's procedure [17], [18] were used to determine if the difference between any pair of algorithms was statistically significant, with a p-value correction using the False Discovery Rate method [9]. The p-values for the raw values are shown in Table III, and the p-values for the ratios are shown in Table IV, where the statistically significant ones are marked in bold. Table III shows that the raw RMSE difference between FS3 and FS1 was statistically significant when kNN was used. Furthermore, the raw CC differences between all pairs of feature sets were statistically significant. Table IV shows that the ratio differences between all pairs of feature sets were statistically significant, for both LASSO and kNN.

VI. DISCUSSIONS
This section provides parameter sensitivity analysis and additional discussions.

A. Parameter Sensitivity Analysis
Tangent space feature extraction relies on the spatial filter in Algorithm 1, which has two adjustable parameters: K, the number of fuzzy classes for the RTs, and F , the number of spatial filters for each fuzzy class. The filtering performance is robust to K but changes noticeably when F changes [55]. As a result, the performance of the tangent space features also varies as F changes. In this subsection we study the sensitivity of the regression performance to F . The regression performances for F = {5, 10, 15, 20} (K was fixed to be 3) are shown in Fig. 7. Algorithms 1 and 3 were repeated five times, each time with a random partition of training and testing data, and the average regression results are shown. Note that F cannot be too large because of three constraints: 1) F cannot exceed the number of channels (C) in the original EEG data, becauseΣ kΣ −1 ∈ R C×C in (3) has at most C eigenvectors; 2) the tangent space features have dimensionality KF (KF + 1)/2, which increases rapidly with F ; so, a large F can easily result in over-fitting; and, 3) there may be numerical difficulties in computing the RG mean when F is large, e.g., for Subjects 5,8 and 15 in Fig. 7 when F = 20. Fig. 7 shows that the regression performance increased when F increased from 5 to 15, but decreased when F further increased to 20. For the PVT experiment, F ∈ [10,15] seemed to achieve a good compromise between performance and computational cost. Additionally, in the previous subsection we used 5-second EEG trials to estimate the corresponding RT, and it is also interesting to study how the estimation performance changes with different trial lengths. The results are shown in Fig. 8 for trial lengths of {1, 3, 5, 7, 9} seconds. In general, as trial length increased, the estimation performance improved. However, a longer trial means heavier computational cost and larger delay in estimation. Furthermore, a trial cannot be arbitrary long, as then it cannot capture the up-to-date RT. These effects should be taken into consideration when choosing the right trial length.

B. Regression Performance versus the Number of Features
Recall from Section IV-B that FS1 has 124 features, FS2 has 60 features, and FS3 has 465 features, i.e., FS3 has much more features than FS1 and FS2. So, FS3's superior performance may be due to its increased number of features. In this subsection we investigate the relationship between the regression performance and the number of useful features.
Because LASSO automatically selects the most useful features, whereas kNN always uses all the features, in this study we focus only on LASSO. For each subject and each feature set, we used all data in LASSO training, and recorded the number of selected features, as well as the corresponding training RMSEs and CCs. The results are shown in Fig. 9. On average LASSO selected 58.6 features from FS1, 30.6 features from FS2, and 69.1 features from FS3. Although the selected FS2 subset was only about half the size of the selected FS1 subset, they resulted in similar overall training RMSEs and CCs. Connecting this observation with that in the previous subsection, i.e., FS2 had much better testing RMSEs and CCs than FS1, we can conclude that the CSPR-OVR spatial filter can aggregate the most useful information into just a small number of features, which reduces overfitting and improves the generalization performance. Fig. 9 also shows that the selected FS3 subset was slightly larger than the selected FS1 subset, but the FS3 subset resulted in much better training performance, and also much better testing performance, as presented in the previous subsection. These observations together suggest that the Riemannian geometry approach can indeed extract some novel informative features, which improve both the training and the testing performances.

C. Computational cost
The training of our feature extraction method (FS3) consists of three steps: 1) design the CSPR-OVR filter by Algorithm 1; 2) compute the RG meanΣ by Algorithm 2; and, 3) map the spatially filtered EEG trials to the Riemannian tangent space by Algorithm 3. Once the training is done, feature extraction for a testing trial can be performed very efficiently: a matrix multiplication (5) is first used to spatially filter it, and then another matrix multiplication (11) is used to compute its spatial covariance matrix Σ n ; finally, compute logm Σ− 1 2 Σ nΣ can be pre-computed, and henceΣ − 1 2 Σ nΣ − 1 2 is also a simple matrix multiplication. So, in this subsection we focus on the training computational cost only. Let N be the number of training samples. Then, the actual training time increased linearly with N , as shown in Fig. 10. The platform was a Dell XPS15 laptop (Intel i7-6700HQ CPU @2.60GHz, 16 GB memory) running Windows 10 Pro 64-bit and Matlab 2016b. A least squares curve fit shows that the training time is 0.0261 + 0.0030N seconds, which should not be a problem for a practical N .

D. RT versus Fatigue State
We also studied the relationship between the RT and the fatigue state. Our conjecture is that as the fatigue level goes up, the RT should be larger. Boxplots of the RT in different sessions for two typical subjects are shown in Fig. 11, where "L", "N" and "H" mean low, normal, and high fatigue, respectively. Fig. 11 shows that the mean RT of a high fatigue sessions is generally larger than that of a low or normal fatigue session, and the former also has more extreme values and a larger variance. The difference between a low fatigue session and a normal fatigue session is not obvious. These observations suggest that although the fatigue state contains some useful information, it may be too coarse for accurate RT prediction. That's why it was not used in this paper.

VII. CONCLUSIONS AND FUTURE RESEARCH
In this paper, we have proposed a new feature extraction approach for EEG-based BCI regression problems: a spatial filter is first used to increase the EEG trial signal quality and also to reduce the dimensionality of the covariance matrix, and then Riemannian tangent space features are extracted. We validated the performance of the proposed approach in RT estimation from EEG signals measured in a large-scale sustainedattention PVT experiment, and showed that compared with the traditional powerband features, the tangent space features can reduce the estimation RMSE by 4.30-8.30%, and increase the estimation CC by 6.59-11.13%. To our knowledge, this is the first time that RG has been used in BCI regression problems.
Our future research will focus on reducing the dimensionality of the tangent space features. As shown in Algorithm 3, the tangent space features have dimensionality KF (KF + 1)/2, where K is the number of fuzzy classes for the RTs, and F is the number of spatial filters for each fuzzy class. So, the feature dimensionality increases quadratically with respect to both K and F , which quickly results in overwhelming computational cost, overfitting, and numerical problems. We will investigate effective dimensionality reduction approaches for the tangent space features to reduce the computational cost while maintaining or even improving the regression performance.