Skip to content

CIAT-DAPA/pnn_monitoring_etl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PNN Monitoring

The main goal of this project is to design and develop a system for monitoring the activities and resources coming from different actors to finance the implementation of the Sustainability Action Plans (PAS) of the Regional Systems of Protected Areas (SIRAP). The fundamental purpose of this system is to provide authorities and stakeholders with a tool to evaluate and track the level of investment and financial support for the conservation of Colombia's protected areas.

PNN MONITORING ETL

This component plays a key role in extracting data from external sources, such as Excel files and other sources of financial information. Its main function is to adjust, validate and transform the data so that it aligns precisely with the format and structure defined in the database PNN Monitoring DB.

Important notes

These Scripts must be used in conjunction with the models that was developed for the project, which you can find in this repository.

Features

  • Built using
  • Supports Python 3.x

Getting Started

To use etl you must have an instance of a postgresql database.

Prerequisites

Project Structure

  • config/: Folder where the configuration data for the import process will be saved.
  • workspace/log/: Folder where all errors will be saved, (it is not necessary to create it, it is created automatically).
  • workspace/import/: This folder contains the Excel files used for data import.
  • workspace/outputs/ : Folder where all outputs will be saved, (it is not necessary to create it, it is created automatically).
  • src/: Folder to store the source code of the project.

Instalation

To use ETL we must install a set of requirements, which are in a text file, for this process we recommend to create a virtual environment, this in order not to install these requirements in the entire operating system.

  1. Clone the repository
git clone https://github.com/CIAT-DAPA/pnn_monitoring_etl.git
  1. Create a virtual environment
python -m venv env
  1. Activate the virtual environment
  • Linux
source env/bin/activate
  • windows
env\Scripts\activate.bat
  1. Install the required packages
pip install -r requirements.txt

ETL execution

This run_main.py script is a key component of the PNN monitoring ETL. It allows to perform two types of processes: Data Import and Rollback.

Configuration

The project now uses environment variables to configure database connection information and a configuration file for other parameters.

Required Environment Variables

To run the project, you need to set the following environment variables:

  • DB_USER: This is the name of the user that will be used to access the database.
  • DB_PASSWORD: This is the secret key used to authenticate and allow access to the specified user.
  • DB_HOST: This is the address of the server where the database is hosted.
  • DB_PORT: This is the port number to which the application will connect to access the database on the server.
  • DB_NAME: This is the specific name of the database to be accessed.

Config file

The project uses a configuration file config_file.csv to adjust certain parameters

  • matriz_name: Name of the excel sheet where the information to be imported is located.
  • matriz_skiprows: Number of rows to omit at the beginning of the Excel spreadsheet.

Parameters

The script uses the following parameters:

  • -prc or --process: Type of process to perform.

    • 1` for Data Import.
    • 2` for Rollback.
  • -path or --path: Path where the data needed for the process is located.

    • This path must point to the folder containing the workspace folder with the data to be imported.
  • -fid or --fid: (Required only for the Rollback process) Specific ID of the rollback to be performed.

Usage examples

Data import:

python run_main.py -prc 1 -path <PATH_WHERE_DATA_ARE_LOCATED>

Replace <PATH_WHERE_DATA_ARE_LOCATED> with the actual path where the data you want to import is located.

Note

During the data import a number of important files are generated:

  • In the log folder the error files will be saved. If any error occurs or any data is not imported correctly, a folder with the format "yyyymmdd_hhmmss" representing the run will be created. Inside this folder CSV files identified with the name "sirap_column_name_error" will be generated.

  • In the outputs folder, a folder with the same format representing the run will be created. Inside this folder, CSV files will be saved with the information of the imported data, including the IDs with which they were saved in the database. These files will have the name "sirap_table_name_output". This information will be used during the rollback process.

Rollback:

python run_main.py -prc 2 -path <PATH_WHERE_DATA_ARE_LOCATED> -fid <ROLLBACK_ID>

Replace <PPATH_WHERE_DATA_ARE_LOCATED> with the path where the required data is located and <ROLLBACK_ID> with the specific ID of the rollback you want to run, is the identifier of the folder generated during the import and located in outputs.

About

ETL scripts for the tool Finance monitoring PNN

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages