i4Q Services for Data Analytics

General Description

The i4Q Services for Data Analytics (i4QDA) solution aims to provide a Data Analytics experimentation environment that allows data analytics workflows to be created dynamically. The solution enables the creation of AI workflows in a simple, intuitive and code-free manner: workflows are built in a visual programming environment by placing components through a drag-and-drop interface. The i4QDA solution is supported by Apache Airflow, a workflow orchestrator used to create, schedule and monitor workflows, which handles the execution, orchestration, management and storage of these workflows. It also integrates a number of operators and technologies used in workflow creation, such as databases, machine learning libraries and platforms, among many others.
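
Under the hood, each workflow corresponds to an Airflow DAG (directed acyclic graph) of tasks. As a rough, hypothetical sketch only (the DAG id, task names and callables below are illustrative, not the ones generated by i4QDA), a workflow chaining a connector, a transformer and an analytics step could be expressed in standard Airflow code as follows:

# Hypothetical sketch of the kind of Airflow DAG a visually built workflow
# maps to; the DAG id, task ids and callables are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_data(**context):
    """Connector step: read data from a source (e.g. a CSV file)."""

def clean_data(**context):
    """Transformer step: filter, normalise or otherwise prepare the data."""

def train_model(**context):
    """Analytics step: train an ML model on the prepared data."""

with DAG(
    dag_id="example_i4qda_workflow",   # hypothetical id
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,            # triggered on demand
    catchup=False,
) as dag:
    connector = PythonOperator(task_id="load_data", python_callable=load_data)
    transformer = PythonOperator(task_id="clean_data", python_callable=clean_data)
    analytics = PythonOperator(task_id="train_model", python_callable=train_model)

    # Dependencies mirror the arrows drawn between operators in the UI.
    connector >> transformer >> analytics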

Features

The i4QDA solution comprises four major components. The first is the Apache Airflow platform, responsible for the orchestration of the workflows, i.e. their scheduling, execution, monitoring and storage. The second component is the set of supporting technologies, which encompasses everything needed for workflow creation, as well as AI, Data Analytics and machine learning algorithms. The third component is the set of operators (connectors, transformers and analytics) used to create workflows. Finally, the last component is the i4QDA User Interface (UI), which guides users through the dynamic definition of workflows: it enables the creation of workflows that connect to data sources (data source configuration, connection and storage), perform data preparation and pre-processing (data filtering, aggregation, harmonisation and semantic enrichment) and apply AI methods for model training, updating and serving and Data Analytics (selection and configuration of AI methods). This last component offers the following specific features, which allow the dynamic creation of workflows:

  1. Creation of Data Analytics Workflows: Visual programming of AI workflows, using the drag-and-drop functionality to add operators and to set up dependencies between them.

  2. Save and Load Workflows: Allows workflows to be saved in a database so that they can be loaded and executed at any time, and allows the user to make changes to previously created workflows.

  3. Creation and integration of new operators: Creation of user-specific operators, with the help of guidelines, and their integration into the workflow creation tool.

  4. Configuration of external tools and technologies: Full integration with tools and technologies such as Keras, TensorFlow and scikit-learn, providing an all-in-one environment.

  5. Execution and orchestration of workflows: Easy plug-and-play of user-created workflows directly in the Airflow interface, allowing the execution and orchestration of multiple workflows.

Screenshots

i4Q Data analytics User Interface main screen

Commercial Information

Authors

  • UNINOVA: http://www.uninova.pt

  • KNOWLEDGEBIZ: https://www.knowledgebiz.pt

  • IKERLAN: http://www.ikerlan.es

  • ENGINEERING: http://www.eng.it

  • CERTH: http://www.iti.gr

License

Apache License 2.0

Pricing

  • Payment Model: One-off

  • Price: 0 €

Associated i4Q Solutions

Required

  1. No strict dependencies

Optional

  1. i4Q Data Integration and Transformation

  2. i4Q Data Repository

  3. i4Q Analytics Dashboard

System Requirements

  1. Docker Engine

  2. Docker-compose

  3. RAM >= 16GB

  4. Processor: Intel(R) Core(TM) i5-6500 or greater / AMD Ryzen 5 3600 or greater

Installation Guidelines

  • Last release (v.1.0.0): Link

Installation

This section provides all the necessary steps for the installation of the i4QDA tool.

The first step is the installation of Docker Engine and docker-compose, which are needed to run the containers.

The second step is to clone the repository of the i4QDA solution.

The final steps are building and initializing the Docker image. To do that, go to the project directory and execute the following commands:

$ cd ./{cloned_directory}
$ docker-compose build

When the build is finished, run the following command to initialize the containers:

$ docker-compose up

To stop the containers, use the following command:

$ docker-compose stop

User Manual

User creation

To create a new user to access the workflow creator, the user first needs to go to the home page at localhost:3000, where they will be greeted by the following screen:

i4Q Data analytics login screen

After that, the user needs to press the “Register” option, which will redirect them to the user creation page (see image below).

i4Q Data analytics create new account screen

Fill in the fields with the necessary information and, when done, press the “Register Account” option. An email will be sent to the user stating that the account has been created and activated (this may take some time, since the account needs to be activated by an admin).

Workflow Creation

To start creating a workflow, the user needs to open a web browser of their choice and navigate to localhost:3000. After logging in, the user will be redirected to the following page:

i4Q Data analytics User Interface main screen

To start a new workflow, the user must first click on the “New Workflow” option, which will prompt the user to insert a name for the workflow, as shown in the following window:

Create new workflow modal view

After filling in the box with a valid workflow name and clicking on the “create” button, a new tab is added to the header component and the workspace shows a blank canvas where operators can be added. On the sidebar, three dropdown menus appear, containing all the available operators divided by category. This can be seen in the figure below.

Workflow canvas

To start the workflow definition, operators can be dragged and dropped onto the canvas, where a similar prompt will ask the user to insert a name for the operator (see next figure).

Inserting a new operator in the workflow

Each operator is colour coded according to its category: connectors are blue, transformers are green, and analytics are red.

When multiple operators have been added to the canvas, the user can start editing their properties and connecting them to each other, in order to create the flow of data from data collection/ingestion to model training/serving. To edit the properties of an operator, the user clicks on the cog wheel symbol on the operator box, which opens a side menu on the right with all the fields that the operator needs filled in. This can be seen in the next figure.

Edit operator properties

After saving, the user can start connecting the operators by clicking on the green circle on the right of an operator and dragging the connection to a red circle of another operator (see next figure).

Creation of dependencies between operators

Finally, when all the steps above are completed, the user can click on the green play button at the bottom of the window, which will initialize the workflow and prepare it to be executed in Airflow.

Saving and initialization of a workflow

Saving/Loading Workflow

To save a workflow, the user only needs to click on the blue button with a disk icon, at any stage of the workflow creation. After saving the workflow, the user can return to the “start” page by navigating through the tabs shown in the header component, and then select the “Load Workflow” option. If the user has any workflows that were previously stored, they will be shown in a list (see figure below), where the user selects which workflow to open for editing.

Adding custom operators

To add a custom operator, the user selects the “Integrate New Operator” option in the “start” menu, which opens the following tab.

Integrate new operator

The user then needs to provide a name for the operator to be added, alongside the Python script that corresponds to the task to be integrated and the JSON containing the metadata for this operator.
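
As an illustration only (the function name, signature and metadata fields below are hypothetical and do not represent a documented i4QDA schema), a minimal Python task script and its accompanying metadata might look like this:

# Hypothetical example of a custom operator; the exact function signature and
# metadata schema expected by i4QDA may differ.
import json

def multiply_column(df, column_name, factor):
    """Example task: multiply every value of a column by a constant factor."""
    df[column_name] = df[column_name] * factor
    return df

# Illustrative metadata describing the operator's name, category and the
# parameters the UI should ask the user to fill in.
operator_metadata = {
    "name": "Multiply Column",
    "category": "transformer",
    "parameters": [
        {"name": "column_name", "type": "string"},
        {"name": "factor", "type": "float"},
    ],
}

print(json.dumps(operator_metadata, indent=2))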

Operators

The tool has three types of operators available for the creation of workflows. These are:

  • Connectors

  • Transformers

  • Analytics

The connectors are a set of operators that access data sources, such as CSV files or RDBMS databases. The transformers are operators that contain data manipulation processes, such as removing null values or filtering by some parameter. The last group are the analytics operators, with algorithms for training AI/ML models and for applying previously trained models. The next sections contain a detailed description of how to use all available operators.

Connectors

Below are all the available connector operators, each with a description and the input parameters that the user needs to provide; a short generic code sketch follows the list.

RDBMS connector: Loads the data from an RDBMS database into a Spark DataFrame.

  • Database Driver: List of available database drivers

  • URL: URL of the database

  • Port: Port of the database

  • Name: Name of the database

  • Table: Name of the table to read

  • Username: Username to access the database

  • Password: Password to access the database

  • Mode: Read/Write (reading from or writing to the database)

File connector: Connector to a file. Currently only supports the CSV and JSON file types.

  • File Type: Extension for the type of file

  • File Path: Path of the file

  • Provider (Optional): Name of the data provider

CSV Folder connector: Connector to a folder containing multiple CSV files.

  • Path: Path to the folder containing all the files to read
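
For reference, the parameters of the RDBMS connector map naturally onto a standard Spark JDBC read. The snippet below is a generic PySpark sketch of such a read (driver, URL, table and credentials are placeholders), not the connector's actual implementation:

# Generic PySpark JDBC read; the driver, URL, table and credentials below are
# placeholders for the values entered in the RDBMS connector form.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdbms_connector_sketch").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("driver", "org.postgresql.Driver")                 # Database Driver
    .option("url", "jdbc:postgresql://localhost:5432/mydb")    # URL, Port, Name
    .option("dbtable", "my_table")                             # Table
    .option("user", "my_user")                                 # Username
    .option("password", "my_password")                         # Password
    .load()                                                    # Mode: Read
)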

Transformers

Below are all the available transformer operators, each with a description and the input parameters that the user needs to provide; a short generic code sketch follows the list. The names of the operators in this list are abbreviated from the ones that appear in the user interface, where the full name corresponds to “Custom/Public {name of the operator} Transformer”.

Preprocess Change Column Values: Changes the values in a column to other user-chosen values. The input from the user needs to be in {key: value} format, where key is the value to be changed and value is the new value.

  • Values: Key-value pairs to change

  • Column Name: Column to apply the change to

Rename Column: Changes the name of a column.

  • Existing Column Name: Name of the column to rename

  • New Column Name: New name for the column

Convert Types: Converts the type of the values in a column into another type.

  • Column: Column to apply the change to

  • Variable type: New variable type for the values in the selected column (int, float, string)

Remove Columns: Removes a set of columns from the dataset.

  • Removed Columns: List of columns to remove from the dataset (separated by commas)

Select Columns: Selects a set of columns from the dataset.

  • Selected Columns: List of columns to select from the dataset (separated by commas)

Filter Dataset: Filters columns based on a condition defined by the user.

  • Filter Option: Filter parameter (under, equals, above)

  • Condition Value: Value of the condition to apply the filter

  • Selected Columns: List of columns in the dataset to apply the filter to (separated by commas)

Filter Missing Values: Filters out entries with missing values.

  • Filter Option:

  • Selected Columns:

Normalize Columns: Normalizes the values in a set of columns.

  • Selected Columns: List of columns in the dataset to apply the normalization to (separated by commas)

Remove Duplicates: Removes duplicate rows present in the dataset (no parameters).

Replace Missing Values: Replaces missing values with a value chosen by the user.

  • Value: Value to replace the missing values with

  • Selected Column: Column where the transformation will be applied

Aggregate Group By: Aggregates data based on the entries of another column.

  • Aggregation Column: Column name containing the aggregation fields

  • Type of operation: List of the types of aggregation methods

  • Target Column: Column name with the values to be aggregated

Label Encoder: Encodes the categorical values of a column as integer labels.

One Hot Encoder: Encodes the categorical values of a column as one-hot (binary) columns.

Replace Infinite Values: Replaces infinite values present in the dataset.

Simple Query: Allows the application of simple queries over the data.

  • SQL Query: Query to apply over the data
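
To give a feel for what these transformers do to the data, the following generic PySpark sketch (not the operators' actual code) chains a few equivalent DataFrame operations, assuming a DataFrame df has already been loaded by a connector and that the column names are placeholders:

# Generic PySpark equivalents of a few transformer operators; `df` is a
# previously loaded DataFrame and the column names are placeholders.
from pyspark.sql import functions as F

transformed = (
    df.select("sensor_id", "temperature", "timestamp")       # Select Columns
      .filter(F.col("temperature") > 0)                      # Filter Dataset (above)
      .dropDuplicates()                                      # Remove Duplicates
      .fillna({"temperature": 0.0})                          # Replace Missing Values
      .withColumnRenamed("temperature", "temp_celsius")      # Rename Column
      .groupBy("sensor_id")                                  # Aggregate Group By
      .agg(F.avg("temp_celsius").alias("avg_temp"))
)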

Analytics

Below are all the available analytics operators, each with a description and the input parameters that the user needs to provide; a short generic code sketch follows the list. The names of the operators in this list are abbreviated from the ones that appear in the user interface, where the full name corresponds to “{Technology} {name of the operator} Analytics”.

Random Forest Train: Trains a Random Forest classification model.

  • Target column: Column with the data for the model training

  • Model name: Name of the model to be trained

  • Analytics Library: Library used (currently only supports “sklearn”)

Random Forest Predict: Performs predictions using a trained Random Forest model.

  • Target column: Column with the data for the model training

Gaussian Naive Bayes Train: Trains a Gaussian Naive Bayes model.

  • Target column: Column with the data for the model training

  • Model name: Name of the model to be trained

Gradient Boosting Classifier Train: Trains a Gradient Boosting Classifier model.

  • Target column: Column with the data for the model training

  • Model name: Name of the model to be trained

Decision Tree Classifier Train: Trains a Decision Tree Classifier model.

  • Target column: Column with the data for the model training

  • Model name: Name of the model to be trained

XGB Classifier Train: Trains an XGB Classifier model.

  • Target column: Column with the data for the model training

  • Model name: Name of the model to be trained

LGB Classifier Train: Trains an LGB Classifier model.

  • Target column: Column with the data for the model training

  • Model name: Name of the model to be trained

Classifier Model Predictor: Loads a previously trained classifier-type model.

  • Target Column: Column to apply the classification model to

  • Model name: Name of the model to load

XGB Regressor Train: Trains an XGB Regressor model.

  • Target column: Column with the data for the model training

  • Model name: Name of the model to be trained

LGBM Regressor Train: Trains an LGBM Regressor model.

  • Target column: Column with the data for the model training

  • Model name: Name of the model to be trained

Gradient Boosting Regressor Train: Trains a Gradient Boosting Regressor model.

  • Target column: Column with the data for the model training

  • Model name: Name of the model to be trained

Decision Tree Regressor Train: Trains a Decision Tree Regressor model.

  • Target column: Column with the data for the model training

  • Model name: Name of the model to be trained

Linear Regression Train: Trains a Linear Regression model.

  • Target column: Column with the data for the model training

  • Model name: Name of the model to be trained

Ridge Regression: Trains a Ridge Regression model.

  • Target column: Column with the data for the model training

  • Model name: Name of the model to be trained

Lasso Regression: Trains a Lasso Regression model.

  • Target column: Column with the data for the model training

  • Model name: Name of the model to be trained

Regression model Predictor: Loads a previously trained regression-type model.

  • Target Column: Column to apply the regression model to

  • Model name: Name of the model to load
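
As a point of reference, the train and predict operators essentially wrap standard machine learning library calls. The snippet below is a generic scikit-learn sketch of the train/predict cycle (file paths, column names and the stored model name are placeholders, not i4QDA internals):

# Generic scikit-learn train/predict cycle; the paths and column names are
# placeholders, not the operators' actual implementation.
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("training_data.csv")           # data prepared by earlier operators
X = df.drop(columns=["quality_label"])          # feature columns
y = df["quality_label"]                         # Target column

model = RandomForestClassifier()
model.fit(X, y)                                 # Random Forest Train
joblib.dump(model, "random_forest.joblib")      # stored under the chosen Model name

# Later, a predictor operator loads the stored model and applies it to new data.
loaded = joblib.load("random_forest.joblib")
predictions = loaded.predict(X)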