Using Practicus AI Studio with Airflow
You can use Practicus AI Studio for the following tasks when building Airflow workflows.
Practicus AI Studio functionality for Airflow
- Explore data sources such as Data Lakes, Data Warehouses and Databases
- Transform data
- Join data from different data sources
- Export the result to any data source
- Perform these tasks on individual Workers or on a distributed Spark cluster
- Generate data processing steps as Python code
- Auto-detect dependencies between tasks
- Generate the DAG code
- Export data connection files separately so you can change them later
Sample scenario
- Load some_table from Database A
- Make changes
- Save to Database B
- Load some_other_table from a Data Lake C
- Make changes
- Save to Data Warehouse D
- Load final_table from Database E
- Join to some_table
- Join to some_other_table
- Make other changes
- Save to Data Lake F
- Export everything to Airflow
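As a rough sketch of what one of these steps could look like as a generated task (illustrative only; the connection strings, table names, and transformation logic are assumptions, not the code Practicus AI Studio actually produces), the "load some_table from Database A, make changes, save to Database B" step might resemble:

```python
# Illustrative sketch only: connection strings, table names, and the transform
# logic are placeholders, not the code Practicus AI Studio actually generates.
import pandas as pd
from sqlalchemy import create_engine

def process_some_table():
    # Read some_table from Database A (hypothetical connection string)
    source = create_engine("postgresql://user:pass@database-a:5432/analytics")
    df = pd.read_sql_table("some_table", source)

    # "Make changes": example transformation steps
    df = df.dropna(subset=["ID"])
    df["amount"] = df["amount"].round(2)

    # Save the result to Database B (hypothetical connection string)
    target = create_engine("postgresql://user:pass@database-b:5432/reporting")
    df.to_sql("some_table", target, if_exists="replace", index=False)

if __name__ == "__main__":
    process_some_table()
```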
Let's take a quick look at the experience.
Joining data sources
- Left join final_table on column ID with some_other_table on column ID
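Expressed in pandas terms (purely illustrative; the generated code may use a different engine, and the sample data below is made up), the left join behaves like this:

```python
import pandas as pd

# Made-up frames standing in for the two tables in the scenario
final_table = pd.DataFrame({"ID": [1, 2, 3], "value": [10, 20, 30]})
some_other_table = pd.DataFrame({"ID": [1, 3], "extra": ["a", "b"]})

# Left join on the ID column: every row of final_table is kept,
# and matching rows from some_other_table contribute the "extra" column
joined = final_table.merge(some_other_table, how="left", on="ID")
print(joined)
```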
Exporting to Airflow
- Practicus AI automatically detects the dependency:
- Operations on some_table and some_other_table can execute in parallel since they do not depend on each other
- If both succeed, operations on final_table, including the joins, can run (see the sketch below)
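A minimal sketch of the dependency structure the exported DAG expresses, assuming PythonOperator tasks with placeholder callables (the actual generated file will differ in operators, naming, and configuration):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for the exported task scripts
def process_some_table(): ...
def process_some_other_table(): ...
def process_final_table(): ...

with DAG(
    dag_id="sample_scenario",        # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # example schedule (Airflow 2.4+ style)
    catchup=False,
):
    some_table = PythonOperator(task_id="some_table", python_callable=process_some_table)
    some_other_table = PythonOperator(task_id="some_other_table", python_callable=process_some_other_table)
    final_table = PythonOperator(task_id="final_table", python_callable=process_final_table)

    # some_table and some_other_table have no dependency on each other, so they
    # run in parallel; final_table (with its joins) only runs if both succeed
    [some_table, some_other_table] >> final_table
```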
Viewing the exported code
- After the code export is completed, you can update the following types of files:

.py files:
Each is a task that includes the data processing steps, SQL, etc.

.._worker.json files:
Define the Worker that each task will run on: the container image to use, Worker capacity (CPU, GPU, RAM), etc.

.._conn.json files:
Define how to read data for each task. Note: data source credentials can be stored in the Practicus AI data catalog.

.._save_conn.json files:
Define how to write data for each task. Note: data source credentials can be stored in the Practicus AI data catalog.

.._join_.._conn.json files:
Define how each join operation will work: how to read the data and where to join.

.._dag.py file:
The DAG file that brings everything together.
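For illustration, the exported configuration files are plain JSON and can be edited by hand or from a script before deployment. The field names below are assumptions for the sake of the example, not the actual Practicus AI schema:

```python
import json

# Hypothetical contents of a .._worker.json file; the real schema may differ.
worker_config = {
    "worker_image": "practicusai/worker:latest",  # container image (placeholder)
    "cpu": 2,                                     # example capacity settings
    "memory_gb": 8,
    "gpu": 0,
}

# Write the edited config back before deploying the DAG to Airflow
with open("process_some_table_worker.json", "w") as f:
    json.dump(worker_config, f, indent=2)
```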
Sample view from the embedded Jupyter notebook inside Practicus AI Studio.
Airflow deployment options
You have two options for deploying to Airflow from Practicus AI Studio.
Self-service
- Select the schedule and deploy directly to an Airflow add-on service that an admin has given you access to.
- This will instantly start the Airflow schedule.
- You can then view your DAGs using Practicus AI and monitor the state of your workflows.
- You can also manually trigger DAGs.
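Triggering and monitoring happen inside Practicus AI Studio; for reference, the same manual trigger can also be issued against Airflow's stable REST API, assuming the API is enabled and you have credentials (the URL, DAG id, and auth below are placeholders):

```python
import requests

# Placeholders: adjust the Airflow base URL, credentials, and DAG id
airflow_url = "https://airflow.example.com/api/v1"
dag_id = "sample_scenario"

# POST /dags/{dag_id}/dagRuns creates (i.e. triggers) a new DAG run
response = requests.post(
    f"{airflow_url}/dags/{dag_id}/dagRuns",
    json={"conf": {}},
    auth=("airflow_user", "airflow_password"),
)
response.raise_for_status()
print(response.json()["state"])  # e.g. "queued"
```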
Working with a Data Engineer (recommended for sensitive data)
- Export the code and share it with a Data Engineer so they can:
- Validate your steps (.py files)
- Update data sources for production databases (conn.json files)
- Select appropriate Worker capacity (worker.json files)
- Select appropriate Worker user credentials (worker.json files)
- Deploy to Airflow
- Define the necessary monitoring steps with automation (e.g. with Practicus AI observability)