Integrated Version Control for AWS Glue

AWS Glue is a fully managed data integration service from AWS. It lets users prepare data for use cases such as machine learning, analytics, and general application development, and it supports users at many different skill levels. Glue Studio provides a no-code environment for visually creating ETL jobs, and users who prefer to code their jobs directly can write them in PySpark or Python.

 

AWS recently released integrated version control for Glue using either GitHub or AWS CodeCommit. This allows Glue jobs to be versioned in a central repository for easy rollback and recovery. Having this feature integrated directly in Glue makes managing job versions simpler and more streamlined. Enabling version control is straightforward whether you already have jobs created or are just starting out. Let’s look at the setup process and then check out the versioning in action.

 

To start, navigate to Glue Studio and open the Jobs page.

Glue Studio Jobs page

Select a job that needs to be versioned. For this example, we will use the gosales_product_s3_load job. Across the top of the job editor are several tabs. Select the Version Control tab to configure the version control repository for the job.

Jobs example to be versioned

Once the tab is selected, a drop-down menu lets you choose either GitHub or AWS CodeCommit as the preferred Git service.

Git service selection

If GitHub is selected, you will need to provide a personal access token and the name of the repository owner. Once that information is entered, you can select the repository to be used along with the branch.

GitHub personal access token

AWS CodeCommit is more tightly integrated since it is also an AWS service. Selecting this option will allow you to see all the repositories in your AWS account. Select the repository to be used along with the branch. Once selected, click Save in the upper right of the page.
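The repository and branch lists shown in the console can also be looked up from a script. The following is a minimal sketch, assuming the boto3 CodeCommit client and its list_repositories and list_branches calls. The helper names are our own, and the client is passed in so the functions can be tried against a stub before touching a real account.

```python
from typing import Any, Dict, List


def list_repositories(codecommit: Any) -> List[str]:
    """Return the names of all CodeCommit repositories visible to the client."""
    names: List[str] = []
    kwargs: Dict[str, Any] = {}
    while True:
        page = codecommit.list_repositories(**kwargs)
        names.extend(repo["repositoryName"] for repo in page.get("repositories", []))
        token = page.get("nextToken")
        if not token:
            return names
        # More results are available; request the next page.
        kwargs["nextToken"] = token


def list_branches(codecommit: Any, repository: str) -> List[str]:
    """Return the branch names of one CodeCommit repository."""
    names: List[str] = []
    kwargs: Dict[str, Any] = {"repositoryName": repository}
    while True:
        page = codecommit.list_branches(**kwargs)
        names.extend(page.get("branches", []))
        token = page.get("nextToken")
        if not token:
            return names
        kwargs["nextToken"] = token


# Example usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   client = boto3.client("codecommit")
#   for name in list_repositories(client):
#       print(name, list_branches(client, name))
```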

AWS CodeCommit

To push a version to the repository, click the Actions button and select Push to Repository. You will be asked to confirm that you want to commit a version to the selected repository and branch. Click Confirm to continue.
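The same push can be scripted. The sketch below assumes the boto3 Glue client's update_source_control_from_job call, which pushes a job's current definition to the configured repository. The helper names, the repository name glue-jobs, and the branch main are hypothetical, and the GitHub check simply mirrors the console's requirement for an owner and a personal access token.

```python
from typing import Any, Dict, Optional

# Providers covered in this article; the Glue API may accept others.
PROVIDERS = {"GITHUB", "AWS_CODE_COMMIT"}


def build_push_request(
    job_name: str,
    provider: str,
    repository: str,
    branch: str,
    owner: Optional[str] = None,
    auth_token: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the keyword arguments for update_source_control_from_job."""
    if provider not in PROVIDERS:
        raise ValueError(f"unsupported provider: {provider}")
    if provider == "GITHUB" and not (owner and auth_token):
        raise ValueError("GitHub requires a repository owner and an access token")
    request: Dict[str, Any] = {
        "JobName": job_name,
        "Provider": provider,
        "RepositoryName": repository,
        "BranchName": branch,
    }
    if owner:
        request["RepositoryOwner"] = owner
    if auth_token:
        request["AuthStrategy"] = "PERSONAL_ACCESS_TOKEN"
        request["AuthToken"] = auth_token
    return request


def push_job(glue: Any, **settings: Any) -> Dict[str, Any]:
    """Push the job's current definition to its repository."""
    return glue.update_source_control_from_job(**build_push_request(**settings))


# Example usage (assumes boto3 is installed and AWS credentials are configured;
# "glue-jobs" and "main" are placeholder names):
#   import boto3
#   push_job(boto3.client("glue"), job_name="gosales_product_s3_load",
#            provider="AWS_CODE_COMMIT", repository="glue-jobs", branch="main")
```

The Glue client also exposes update_job_from_source_control for the opposite direction, which is the call to look at when restoring a job from a previously pushed version.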

 

Once the commit is complete, navigate to CodeCommit in the AWS console. Select the repository and branch used in the job configuration. You should be able to see a new folder for the job.
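The same check can be made without opening the console. This is a rough sketch, assuming the boto3 CodeCommit client's get_folder call and assuming that the folder is named after the job. The helper name is our own. If the folder does not exist, the underlying call raises an error rather than returning an empty list.

```python
from typing import Any, List


def list_job_files(codecommit: Any, repository: str, branch: str, job_name: str) -> List[str]:
    """Return the paths of the files stored in the job's folder on a branch.

    Assumes the folder created by the push is named after the job.
    """
    response = codecommit.get_folder(
        repositoryName=repository,
        commitSpecifier=branch,  # a branch name, tag, or commit ID
        folderPath=job_name,
    )
    return sorted(entry["absolutePath"] for entry in response.get("files", []))


# Example usage (assumes boto3 is installed and AWS credentials are configured;
# "glue-jobs" and "main" are placeholder names):
#   import boto3
#   print(list_job_files(boto3.client("codecommit"), "glue-jobs", "main",
#                        "gosales_product_s3_load"))
```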

AWS CodeCommit

Opening the folder will show the job’s JSON file. You can inspect this to see the raw JSON used by Glue.
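For a quick look at the file outside the console, a few lines of Python are enough. The sketch below makes no assumption about which fields Glue writes. It only loads the JSON text and reports each top-level key with the type of its value. The function name and the sample keys in the usage comment are hypothetical.

```python
import json
from typing import Dict


def summarize_job_json(text: str) -> Dict[str, str]:
    """Map each top-level key of a job definition to the type of its value."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return {key: type(value).__name__ for key, value in sorted(data.items())}


# Example usage with a made-up document (not Glue's actual schema):
#   summarize_job_json('{"name": "demo", "retries": 2, "arguments": {}}')
# To summarize a real file, read it first:
#   with open("path/to/job.json", encoding="utf-8") as handle:
#       print(summarize_job_json(handle.read()))
```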

raw JSON file opened by Glue

That’s it! Version control is now set up for this Glue job. If any further changes are made to the job, be sure to push the new version to the repository to capture that change.

 

The addition of integrated version control lets teams build their Glue pipelines with software development best practices by keeping all code properly versioned in a central repository. If you currently have jobs running in Glue, take the time to enable version control for them so that no work, or time, is lost.

Conclusion

We hope that you found this article informative and helpful. If you have any questions about AWS Glue, reach out to us! Also, be sure to subscribe to our newsletter for more PMsquare articles, updates, and insights delivered straight to your inbox.