Challenges and Approaches in Connected Vehicles Data Wrangling 2017-01-0069
This manuscript compares window-based data imputation approaches for data collected from connected vehicles during actual driving using on-board data acquisition devices. Three distinct window-based approaches were used for cleansing and imputing the missing values in different CAN-bus (Controller Area Network) signals. The window lengths used for imputation in the three approaches were: 1) the entire time course for each vehicle ID, 2) a day, and 3) a trip (defined as the duration between the vehicle's ignition ON and ignition OFF statuses). An algorithm for identifying ignition ON and OFF events is also presented, since this signal was not explicitly captured during the data acquisition phase. As a case study, these imputation techniques were applied to data from a driver behavior classification experiment. Forty-four connected vehicles provided data on various signals, viz. engine speed, vehicle speed, engine torque, brake, clutch, acceleration pedal, and gear. Distribution plots for all variables were similar across the three methods; in particular, the shapes of the histograms were the same for all methods. However, the vehicle ID-wise and day-wise imputed datasets were each around 37% larger than the trip-wise imputed dataset. K-Means clustering did not show significant differences between the vehicle ID-wise and day-wise imputed datasets, but around 16% of the vehicles were assigned to different clusters when the trip-wise imputed data was used. The trip window was judged superior to the other two window sizes because it provides a means to remove noisy records from the connected vehicle data, thus increasing the robustness of any analytical model built on top of it, in keeping with the garbage-in, garbage-out rule. Given the scale of the data, big data tools such as Hive and Spark were used on the Hadoop platform to process and impute the dataset.
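To make the three window choices concrete, the following is a minimal sketch in plain Python, not the implementation used in the study. The abstract does not state the imputation rule or the ignition-detection algorithm, so two assumptions are swapped in here: missing values are forward-filled within a window (rows with no earlier value in the same window are dropped), and a trip boundary is inferred whenever a vehicle is silent for longer than a hypothetical threshold `TRIP_GAP_SECONDS`. The record fields (`vehicle_id`, `ts`, `engine_speed`) and all function names are likewise illustrative.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical threshold: a silence longer than this is treated as
# ignition OFF followed by ignition ON (assumption, not from the paper).
TRIP_GAP_SECONDS = 300


def assign_trip_ids(records, gap_seconds=TRIP_GAP_SECONDS):
    """Sort records by (vehicle, time) and label each with an inferred trip number."""
    last_seen = {}              # vehicle_id -> timestamp of the previous record
    trip_no = defaultdict(int)  # vehicle_id -> current trip number
    labelled = []
    for rec in sorted(records, key=lambda r: (r["vehicle_id"], r["ts"])):
        vid = rec["vehicle_id"]
        if vid in last_seen and rec["ts"] - last_seen[vid] > gap_seconds:
            trip_no[vid] += 1   # long gap: assume a new trip has started
        last_seen[vid] = rec["ts"]
        labelled.append({**rec, "trip": trip_no[vid]})
    return labelled


def window_key(rec, window):
    """Return the grouping key for one of the three window sizes."""
    vid = rec["vehicle_id"]
    if window == "vehicle":
        return (vid,)
    if window == "day":
        day = datetime.fromtimestamp(rec["ts"], tz=timezone.utc).date()
        return (vid, day)
    if window == "trip":
        return (vid, rec["trip"])
    raise ValueError(f"unknown window: {window}")


def impute(records, signal, window):
    """Forward-fill `signal` within each window; drop rows that stay missing.

    `records` must already be sorted by (vehicle_id, ts), as returned by
    assign_trip_ids.
    """
    last_value = {}  # window key -> most recent observed or filled value
    out = []
    for rec in records:
        key = window_key(rec, window)
        value = rec[signal]
        if value is None:
            value = last_value.get(key)
        if value is None:
            continue  # nothing earlier in this window to fill from
        last_value[key] = value
        out.append({**rec, signal: value})
    return out


# Toy input: one vehicle, two bursts of records separated by a long silence.
rows = [
    {"vehicle_id": "V1", "ts": 0, "engine_speed": 800.0},
    {"vehicle_id": "V1", "ts": 1, "engine_speed": None},
    {"vehicle_id": "V1", "ts": 1000, "engine_speed": None},
    {"vehicle_id": "V1", "ts": 1001, "engine_speed": 900.0},
]
labelled = assign_trip_ids(rows)
by_vehicle = impute(labelled, "engine_speed", "vehicle")
by_day = impute(labelled, "engine_speed", "day")
by_trip = impute(labelled, "engine_speed", "trip")
```

On this toy input the vehicle and day windows keep all four rows, carrying the last value of the first burst across the silence, while the trip window keeps three rows because the record at `ts=1000` has no earlier value within its own trip. Under these assumptions, a narrower window declines to fill across a trip boundary instead of reusing a stale value.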