Abstract
Data quality issues are widely recognized in IoT data and hinder downstream applications. Improving IoT data quality, however, is particularly challenging given the distinctive features of IoT data, such as pervasive noise, unaligned timestamps, consecutive errors, misplaced columns, and correlated errors. In this tutorial, we review state-of-the-art techniques for IoT data quality management. In particular, we discuss how dedicated approaches improve various data quality dimensions, including validity, completeness, and consistency. We further highlight recent advances in deep learning techniques for IoT data quality. Finally, we identify open problems in IoT data quality management, such as benchmarking and the interpretation of data quality issues.
Authors
- Shaoxu Song, Tsinghua University
- Aoqian Zhang, University of Waterloo
Downloads
- Slides, TBD
References
Validity
Constraint Validity
- Jun Rao, Sangeeta Doraiswamy, Hetal Thakkar, Latha S. Colby: A Deferred Cleansing Method for RFID Data Analytics. VLDB 2006: 175-186
- Ziawasch Abedjan, Cuneyt Gurcan Akcora, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker: Temporal Rules Discovery for Web Data Cleaning. Proc. VLDB Endow. 9(4): 336-347 (2015)
- Lukasz Golab, Howard J. Karloff, Flip Korn, Avishek Saha, Divesh Srivastava: Sequential Dependencies. Proc. VLDB Endow. 2(1): 574-585 (2009)
- Shaoxu Song, Aoqian Zhang, Jianmin Wang, Philip S. Yu: SCREEN: Stream Data Cleaning under Speed Constraints. SIGMOD Conference 2015: 827-841
- Bettina Fazzinga, Sergio Flesca, Filippo Furfaro, Francesco Parisi: Cleaning trajectory data of RFID-monitored objects through conditioning under integrity constraints. EDBT 2014: 379-390
- Shaoxu Song, Yue Cao, Jianmin Wang: Cleaning Timestamps with Temporal Constraints. Proc. VLDB Endow. 9(10): 708-719 (2016)
- Jianmin Wang, Shaoxu Song, Xuemin Lin, Xiaochen Zhu, Jian Pei: Cleaning structured event logs: A graph repair approach. ICDE 2015: 30-41
Statistical Validity
- Wush Chi-Hsuan Wu, Mi-Yen Yeh, Jian Pei: Random Error Reduction in Similarity Search on Time Series: A Statistical Approach. ICDE 2012: 858-869
- Aoqian Zhang, Shaoxu Song, Jianmin Wang: Sequential Data Cleaning: A Statistical Approach. SIGMOD Conference 2016: 909-924
- Tamraparni Dasu, Ji Meng Loh: Statistical Distortion: Consequences of Data Cleaning. Proc. VLDB Endow. 5(11): 1674-1683 (2012)
- Chris Mayfield, Jennifer Neville, Sunil Prabhakar: ERACER: a database approach for statistical inference and data cleaning. SIGMOD Conference 2010: 75-86
- Asif Iqbal Baba, Manfred Jaeger, Hua Lu, Torben Bach Pedersen, Wei-Shinn Ku, Xike Xie: Learning-Based Cleansing for Indoor RFID Data. SIGMOD Conference 2016: 925-936
Completeness
Constraint-based Imputation
- Ruilin Liu, Guan Wang, Wendy Hui Wang, Flip Korn: iCoDA: Interactive and exploratory data completeness analysis. ICDE 2014: 1226-1229
- Jianmin Wang, Shaoxu Song, Xiaochen Zhu, Xuemin Lin: Efficient Recovery of Missing Events. Proc. VLDB Endow. 6(10): 841-852 (2013)
- Jianmin Wang, Shaoxu Song, Xiaochen Zhu, Xuemin Lin, Jiaguang Sun: Efficient Recovery of Missing Events. IEEE Trans. Knowl. Data Eng. 28(11): 2943-2957 (2016)
Statistical Model
- Lei Li, James McCann, Nancy S. Pollard, Christos Faloutsos: DynaMMo: mining and summarization of coevolving sequences with missing values. KDD 2009: 507-516
- Yongjie Cai, Hanghang Tong, Wei Fan, Ping Ji, Qing He: Facets: Fast Comprehensive Mining of Coevolving High-order Time Series. KDD 2015: 79-88
- Shawn R. Jeffery, Minos N. Garofalakis, Michael J. Franklin: Adaptive Cleaning for RFID Data Streams. VLDB 2006: 163-174
- Thanh T. L. Tran, Charles Sutton, Richard Cocci, Yanming Nie, Yanlei Diao, Prashant J. Shenoy: Probabilistic Inference over RFID Streams in Mobile Environments. ICDE 2009: 1096-1107
- Haiquan Chen, Wei-Shinn Ku, Haixun Wang, Min-Te Sun: Leveraging spatio-temporal redundancy for RFID data cleansing. SIGMOD Conference 2010: 51-62
- Zhou Zhao, Wilfred Ng: A model-based approach for RFID data stream cleansing. CIKM 2012: 862-871
Deep Learning-based Imputation
- Wei Cao, Dong Wang, Jian Li, Hao Zhou, Lei Li, Yitan Li: BRITS: Bidirectional Recurrent Imputation for Time Series. NeurIPS 2018: 6776-6786
- Reza Asadi, Amelia Regan: A convolution recurrent autoencoder for spatio-temporal missing data imputation. CoRR abs/1904.12413 (2019)
- Hongyuan Mei, Guanghui Qin, Jason Eisner: Imputing Missing Events in Continuous-Time Event Streams. ICML 2019: 4475-4485
- Vincent Fortuin, Gunnar Rätsch, Stephan Mandt: Multivariate Time Series Imputation with Variational Autoencoders. CoRR abs/1907.04155 (2019)
- Yonghong Luo, Xiangrui Cai, Ying Zhang, Jun Xu, Xiaojie Yuan: Multivariate Time Series Imputation with Generative Adversarial Networks. NeurIPS 2018: 1603-1614
- Yonghong Luo, Ying Zhang, Xiangrui Cai, Xiaojie Yuan: E²GAN: End-to-End Generative Adversarial Network for Multivariate Time Series Imputation. IJCAI 2019: 3094-3100
- Yukai Liu, Rose Yu, Stephan Zheng, Eric Zhan, Yisong Yue: NAOMI: Non-Autoregressive Multiresolution Sequence Imputation. NeurIPS 2019: 11236-11246
Consistency
Pattern-based Detection
- Lei Cao, Yizhou Yan, Samuel Madden, Elke A. Rundensteiner, Mathan Gopalsamy: Efficient Discovery of Sequence Outlier Patterns. Proc. VLDB Endow. 12(8): 920-932 (2019)
- Laure Berti-Équille, Tamraparni Dasu, Divesh Srivastava: Discovery of complex glitch patterns: A novel approach to Quantitative Data Cleaning. ICDE 2011: 733-744
- Pavel Senin, Jessica Lin, Xing Wang, Tim Oates, Sunil Gandhi, Arnold P. Boedihardjo, Crystal Chen, Susan Frankenstein: Time series anomaly discovery with grammar-based compression. EDBT 2015: 481-492
Statistical Model
- Kexin Rong, Peter Bailis: ASAP: Prioritizing Attention via Time Series Smoothing. Proc. VLDB Endow. 10(11): 1358-1369 (2017)
- Christos Faloutsos, Jan Gasthaus, Tim Januschowski, Yuyang Wang: Forecasting Big Time Series: Old and New. Proc. VLDB Endow. 11(12): 2102-2105 (2018)
- Aoqian Zhang, Shaoxu Song, Jianmin Wang, Philip S. Yu: Time Series Data Cleaning: From Anomaly Detection to Anomaly Repairing. Proc. VLDB Endow. 10(10): 1046-1057 (2017)
- Nikolay Laptev, Saeed Amizadeh, Ian Flint: Generic and Scalable Framework for Automated Time-series Anomaly Detection. KDD 2015: 1939-1947
- Sharmila Subramaniam, Themis Palpanas, Dimitris Papadopoulos, Vana Kalogeraki, Dimitrios Gunopulos: Online Outlier Detection in Sensor Data Using Non-Parametric Models. VLDB 2006: 187-198
Deep Learning-based Detection
- Pankaj Malhotra, Lovekesh Vig, Gautam M. Shroff, Puneet Agarwal: Long Short Term Memory Networks for Anomaly Detection in Time Series. ESANN 2015
- Pankaj Malhotra, Anusha Ramakrishnan, Gaurangi Anand, Lovekesh Vig, Puneet Agarwal, Gautam M. Shroff: LSTM-based Encoder-Decoder for Multi-sensor Anomaly Detection. CoRR abs/1607.00148 (2016)
- Dan Li, Dacheng Chen, Baihong Jin, Lei Shi, Jonathan Goh, See-Kiong Ng: MAD-GAN: Multivariate Anomaly Detection for Time Series Data with Generative Adversarial Networks. ICANN (4) 2019: 703-716
- Fiete Lüer, Dominik Mautz, Christian Böhm: Anomaly Detection in Time Series using Generative Adversarial Networks. ICDM Workshops 2019: 1047-1048