Document Type

Artificial Intelligence

Publication Date

Summer 2019


Nowadays, streaming data overflows from various sources and technologies such as Internet of Things (IoT), making conventional data analytics methods unsuitable to manage the latency of data processing relative to the growing demand for high processing speed and algorithmically scalability [1]. Real-time streaming data analytics, which processes data while it is in motion, is required to allow many organizations to analyze streaming data effectively and efficiently for being more active in their strategies. To analyze real time “Big” streaming data, parallel and distributed computing over a cloud of computers has become a mainstream solution to allow scalability, resiliency to failure, and fast processing of massive data sets. Several open source data analytics frameworks have been proposed and developed for streaming data analytics successfully. Apache Spark is one such framework being developed at the University of California, Berkley and gains lots of attentions due to reducing IO by storing data in a memory and a unique data executing model. In Computer & Information Sciences (CISC) at Harrisburg University (HU), we have been working on building a private Cloud Computing for future research and planning to involve industry collaboration where high volumes of real time streaming data are used to develop solutions to practical problems in industry. By developing a HU Cloud based environment for Apache Spark applications for streaming data analytics with batch processing on Hadoop Distributed File System (HDFS), we can prepare future big data era that can turn big data into beneficial actions for industry needs. This research aims to develop Spark applications supporting an entire streaming data analytics workflow, which consists of data ingestion, data analytics, data visualization and data storing. In particular, we will focus on a real time stock recommender system based on state-of-the-art Machine Learning (ML)/Deep Learning (DL) frameworks such as mllib, TensorFlow, Apache mxnet and pytorch. The plan is to gather real time stock market data from Google/Yahoo finance data streams to build a model to predict a future stock market trend. The proposed Spark applications on the HU cloud-based architecture will give emphasis to finding time-series forcating module for a specific period, typically based on selected attributes. In addition, we will test scale-out architecture, efficient parallel processing and fault tolerance of Spark applications on the HU Cloud based HDFS. We believe that this research will bring the CISC program at HU significant competitive advantages globally.