andgineer/spark-aws-rdkit

Docker image with Apache Spark / Hadoop 3 (compatible with AWS services such as S3) and with RDKit installed in an Anaconda environment.

Apache Spark compatible with Amazon services, with a PySpark conda environment for data science and cheminformatics

This is a fully functional Spark Standalone cluster compatible with AWS services such as S3. A Python conda environment is also installed, with pyspark, pandas, RDKit, and other packages.

You can launch it locally with docker-compose or in the Amazon cloud on AWS ECS.
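As an illustration of the S3 compatibility, a PySpark session on this cluster can read objects through Hadoop 3's s3a connector. This is a minimal sketch, not code from the repository; the bucket path and credential settings are placeholders for your own setup.

from pyspark.sql import SparkSession

# Connect to the Standalone master started by docker-compose (see below).
spark = (
    SparkSession.builder
    .master("spark://localhost:7077")
    .appName("s3-read-sketch")
    # Placeholder credentials; in practice they usually come from the
    # environment or an instance profile rather than being hard-coded.
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .getOrCreate()
)

# The s3a:// scheme lets Spark read S3 objects like ordinary files.
df = spark.read.csv("s3a://your-bucket/path/data.csv", header=True)
df.show()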

PySpark example

A separate container, submit, waits for the Spark cluster to become available and then runs a PySpark example. The example shows how to submit Spark jobs to the cluster; for details see src/.
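The actual example lives in src/. The sketch below only illustrates the general shape of such a job, combining PySpark with RDKit from the conda environment; the molecule list and the compute_mw helper are hypothetical, not taken from the repository.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from rdkit import Chem
from rdkit.Chem import Descriptors

# The master URL is normally supplied by spark-submit, so it is not set here.
spark = SparkSession.builder.appName("rdkit-sketch").getOrCreate()

def compute_mw(smiles):
    # Hypothetical helper: molecular weight from a SMILES string via RDKit.
    mol = Chem.MolFromSmiles(smiles)
    return Descriptors.MolWt(mol) if mol is not None else None

mw_udf = udf(compute_mw, DoubleType())

df = spark.createDataFrame(
    [("ethanol", "CCO"), ("benzene", "c1ccccc1")],
    ["name", "smiles"],
)
df.withColumn("mol_weight", mw_udf("smiles")).show()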

Docker-compose

./compose.sh up --build

This starts the Spark master, two workers, and the example in the submit container.

The Spark Web UI will be available at http://localhost:8080.

The Spark master is available at spark://localhost:7077 (PySpark: setMaster('spark://localhost:7077')).

The current settings are for Docker on macOS. If you are on Linux, change docker.for.mac.localhost in .env to localhost.
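A minimal sketch of connecting from the host machine with the master URL mentioned above. Setting spark.driver.host to the value from .env is an assumption about a typical macOS setup, not something prescribed by this repository.

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("host-connection-sketch")
    .setMaster("spark://localhost:7077")
    # On macOS the workers reach the driver through this hostname;
    # on Linux plain "localhost" is usually enough (see the .env note above).
    .set("spark.driver.host", "docker.for.mac.localhost")
)
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).sum())
sc.stop()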

Docker images

AWS ECS

These Apache Spark containers have also been tested on AWS ECS (Amazon Elastic Container Service).

See the scripts and README.md in ecs/.

Fill in the configuration in config.sh, and after that you can create a Spark cluster in AWS ECS completely automatically.
