Sun Haoqi, Jia Jian, Goparaju Balaji, Huang Guang-Bin, Sourina Olga, Bianchi Matt Travis, Westover M Brandon
Energy Research Institute @ NTU, Interdisciplinary Graduate School, Nanyang Technological University, 639798, Singapore.
Fraunhofer IDM @ NTU, Nanyang Technological University, 639798, Singapore.
Sleep. 2017 Oct 1;40(10). doi: 10.1093/sleep/zsx139.
Automated sleep staging has been previously limited by a combination of clinical and physiological heterogeneity. Both factors are in principle addressable with large data sets that enable robust calibration. However, the impact of sample size remains uncertain. This study had two objectives: to determine how closely machine learning methods can approximate the performance of human scorers when supplied with sufficient training cases, and to characterize how staging performance depends on the number of training patients, contextual information, model complexity, and imbalance between sleep stage proportions.
A total of 102 features were extracted from six electroencephalography (EEG) channels in routine polysomnography. Two thousand nights were partitioned into equal (n = 1000) training and testing sets for validation. We used epoch-by-epoch Cohen's kappa statistics to measure the agreement between classifier output and human scorer according to American Academy of Sleep Medicine scoring criteria.
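As an illustration of the agreement measure named above, the following is a minimal sketch of epoch-by-epoch Cohen's kappa between two stage sequences. The function name `cohen_kappa` and the example stage labels are assumptions made for this illustration; they are not taken from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance from the
    two raters' marginal label frequencies.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    if n == 0:
        raise ValueError("label sequences must not be empty")

    # Fraction of epochs on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from the product of marginal frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Hypothetical 8-epoch example: human scorer versus classifier output.
human = ["W", "N1", "N2", "N2", "N3", "R", "R", "W"]
auto = ["W", "N2", "N2", "N2", "N3", "R", "W", "W"]
kappa = cohen_kappa(human, auto)
```

In this toy example the raters agree on 6 of 8 epochs (0.75), chance agreement is 15/64, and kappa is therefore 33/49, or about 0.67. Unlike raw accuracy, kappa discounts the agreement that would arise from stage prevalence alone.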
Epoch-by-epoch Cohen's kappa improved as the number of training EEG recordings increased, saturating at approximately 300 recordings. The kappa value was further improved by accounting for contextual (temporal) information, increasing model complexity, and adjusting the model training procedure to account for the imbalance of stage proportions. The final kappa on the testing set was 0.68. Testing on more EEG recordings led to kappa estimates with lower variance.
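The abstract does not state how the training procedure was adjusted for stage imbalance. One common approach, shown here only as an assumed example, is to weight each epoch inversely to the frequency of its stage so that rare stages contribute as much to the training loss as common ones. The function name `balanced_class_weights` and the stage counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weight per class: n / (k * count_c).

    n is the number of samples, k the number of distinct classes,
    and count_c the number of samples in class c. This is the same
    heuristic scikit-learn applies for class_weight="balanced".
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return {stage: n / (k * c) for stage, c in counts.items()}


# Hypothetical stage distribution over 100 epochs (not from the study).
epochs = ["N2"] * 50 + ["W"] * 20 + ["R"] * 15 + ["N3"] * 10 + ["N1"] * 5
weights = balanced_class_weights(epochs)
```

With these assumed counts, the most common stage (N2) receives weight 0.4 and the rarest (N1) receives weight 4.0, so the weighted total for every stage is the same (20). Such weights would typically be passed to a classifier's loss function or sample-weight argument.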
Training with a large data set enables automated sleep staging that compares favorably with human scorers. Because testing was performed on a large and heterogeneous data set, the performance estimate has low variance and is likely to generalize broadly.