This is a comprehensive data integration and ETL platform designed to serve as a backend pipeline for a Business Intelligence (BI) platform. The system automates the process of ingesting, cleaning, and loading large volumes of data from various sources into a format optimized for analytics consumption.
A key component of this platform is a custom-built Task Manager, essentially a visual cron scheduler tailored for Laravel. It lets users define and manage tasks that execute Laravel Artisan commands, Python scripts, or shell commands. For each run, the Task Manager records execution metadata such as run status, execution time, and output logs. It also sends real-time Slack notifications that tell users whether a scheduled job succeeded or failed.
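As a rough illustration of the metadata the Task Manager records, the following Python sketch runs a shell command, captures its status, duration, and output, and builds a Slack message body. It is a minimal sketch, not the platform's Laravel implementation: the names `TaskRun`, `run_task`, `slack_payload`, and `notify` are hypothetical, and the webhook URL is assumed to be a Slack incoming webhook supplied by the caller.

```python
import json
import subprocess
import time
import urllib.request
from dataclasses import dataclass


@dataclass
class TaskRun:
    """Execution metadata for one run (hypothetical structure)."""
    command: str
    status: str              # "success" or "failed"
    exit_code: int
    duration_seconds: float
    output: str


def run_task(command: str, timeout: int = 3600) -> TaskRun:
    """Run a shell command and record status, duration, and combined output."""
    started = time.monotonic()
    try:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
        exit_code, output = proc.returncode, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired as exc:
        # Treat a timeout as a failed run with a sentinel exit code.
        exit_code, output = -1, f"timed out after {exc.timeout}s"
    duration = round(time.monotonic() - started, 3)
    status = "success" if exit_code == 0 else "failed"
    return TaskRun(command, status, exit_code, duration, output)


def slack_payload(run: TaskRun) -> str:
    """Build the JSON body for a Slack incoming-webhook message."""
    text = (
        f"Task `{run.command}` {run.status} "
        f"in {run.duration_seconds}s (exit code {run.exit_code})"
    )
    return json.dumps({"text": text})


def notify(webhook_url: str, run: TaskRun) -> None:
    """POST the message to a Slack incoming webhook (URL supplied by the caller)."""
    request = urllib.request.Request(
        webhook_url,
        data=slack_payload(run).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)
```

Calling `run_task("php artisan some:command")` and passing the result to `notify` would mirror the succeed-or-fail alert described above; the Artisan command name here is only a placeholder.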
ETL Workflow Overview
- Data Extraction: Commands or scripts are written to fetch or scrape raw data from external sources (websites, APIs, FTP servers, etc.).
- Task Scheduling: These scripts are then run through the Task Manager, either manually or automatically at predefined intervals.
- Staging in Object Storage: Upon successful execution, the extracted data is uploaded to a cloud-based object storage service such as Amazon S3.
- Data Cleaning & Transformation: The upload triggers a big data processing job on a platform such as AWS Glue or Azure Synapse Studio, where the data is cleaned, normalized, and formatted according to business requirements.
- Loading into Database: Once processed, the clean dataset is moved to a PostgreSQL database, where it becomes available to the BI platform for visualization and reporting.
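The cleaning and loading steps can be sketched locally in Python. This is a simplified stand-in under stated assumptions: the real pipeline runs these steps in a Spark-based service and loads into PostgreSQL, whereas here the standard-library `sqlite3` module stands in for the database, and the sample data, column names, and `sales` table are invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract with stray whitespace, a missing value, and a duplicate.
RAW_CSV = "id,name,amount\n1, Alice ,10.5\n2,Bob,\n2,Bob,\n3,carol,7\n"


def clean_rows(raw_text):
    """Trim whitespace, normalize names, drop rows without an amount, de-duplicate by id."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        row_id = int(row["id"])
        if row_id in seen:
            continue  # skip duplicates
        seen.add(row_id)
        cleaned.append((row_id, row["name"].strip().title(), float(amount)))
    return cleaned


def load_rows(conn, rows):
    """Load cleaned rows into the target table and return the resulting row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(load_rows(connection, clean_rows(RAW_CSV)))
```

Keeping the cleaning logic separate from the load step mirrors the workflow above, where transformation and database loading are distinct stages.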
Additional Feature: Big Data File Browser
To enhance data accessibility, a Data Browser module was developed. It allows users to explore and preview CSV files that are typically too large to fit into memory. Because PHP alone could not handle data at this scale, we developed a custom Laravel-compatible database driver for DuckDB, a high-performance analytical database engine (github.com/harish81/laravel-duckdb). This integration enables efficient browsing and querying of large datasets directly within the platform, without ingesting them first or loading them fully into memory.
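The core idea, querying a CSV file in place and fetching only one page of rows, can be sketched in Python. This does not show the Laravel driver's API; the function names `preview_query` and `preview` are hypothetical, and the sketch assumes the third-party `duckdb` Python package is installed when `preview` is called. `read_csv_auto` is DuckDB's CSV-reading table function.

```python
def preview_query(csv_path, page=1, page_size=50):
    """Build a DuckDB query that reads one page of rows straight from a CSV file."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    safe_path = csv_path.replace("'", "''")  # escape single quotes in the path literal
    offset = (page - 1) * page_size
    return (
        f"SELECT * FROM read_csv_auto('{safe_path}') "
        f"LIMIT {page_size} OFFSET {offset}"
    )


def preview(csv_path, page=1, page_size=50):
    """Run the page query with DuckDB and return the rows as a list of tuples."""
    import duckdb  # third-party dependency, imported lazily: pip install duckdb

    connection = duckdb.connect()  # in-memory database; the CSV is not imported first
    try:
        return connection.execute(preview_query(csv_path, page, page_size)).fetchall()
    finally:
        connection.close()
```

Because of the `LIMIT` and `OFFSET` clauses, only the requested page is returned to the caller, which keeps a file browser responsive even for very large files.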
Technologies used
- Laravel
- Laravel Jobs
- Laravel Horizon
- FilamentPHP
- PostgreSQL
- Python
- SQL
- Spark (PySpark)
- Scraping
- ETL
- AWS Glue
- Azure Synapse Studio
Client Review
