Accessing deep learning and machine learning packages

Andrew Laidlaw
6 min read · Dec 8, 2020


One of the primary use cases for GPU-accelerated Power Systems servers like the AC922 and IC922 is running deep learning and machine learning workloads. Both training and inference jobs can be accelerated by the NVIDIA GPUs to deliver higher performance.

The IBM Power Systems AC922 with NVIDIA Tesla V100 GPUs is an ideal system for training deep learning models using the software packages discussed here.

There is a range of ways to access the common deep learning and machine learning frameworks on these Power Systems servers, from IBM as well as open source project locations. These are commonly provided as Python packages to ensure they integrate with other data science tools.

Watson Machine Learning Community Edition (WML CE)

This is an IBM-provided repository of compiled and optimised binaries for the common deep learning and machine learning frameworks in use today.

https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/#/

IBM provides the Watson Machine Learning Community Edition packages through a conda channel

This was set up as a public-facing Anaconda repository to allow easy access to those binaries for internal and external customers alike. Through agreements with our partner NVIDIA, we were also able to include their CUDA code and associated packages within this library, covered under the license terms in place for access to the repository. Due to the nature of the Anaconda repository system, the IBM development team were also able to include other proprietary tools, such as SnapML, under that same license.

Each release of WML CE included updated versions of the common frameworks, built to run on a consistent CUDA release and set of underlying packages, which ensured that multiple frameworks could run effectively within the same conda environment. The latest version (1.7.0) was built entirely on CUDA 10.2, which is available for Red Hat Enterprise Linux 7.6-alt on the POWER9-based IBM Power Systems AC922 and IC922 servers.
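
For reference, pulling packages from this channel follows the standard conda workflow. The commands below are a minimal sketch based on the channel URL above; the package names (for example the powerai meta-package that pulls in the whole stack) and supported Python versions should be checked against the WML CE documentation for the release being installed.

    # Register the WML CE channel ahead of the default conda channels
    conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/

    # Create and activate a dedicated environment for the frameworks
    conda create --name wmlce-170 python=3.7
    conda activate wmlce-170

    # Install the full stack via the meta-package, or pick individual frameworks
    conda install powerai
    # conda install pytorch tensorflow   # alternative: install frameworks individually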

The WML CE conda repository is also complemented by a couple of other public conda channels:

The Early Access channel provides packages that are newer than those in the main channel, but they may not have gone through the same extensive testing.

The Anaconda PowerAI channel provides open access to a number of other packages that are commonly used in data science, compiled to run on the ppc64le architecture. These are largely open source, and do not come with formal support.
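
Those channels can be added alongside the main channel in the usual way. As a rough sketch (the early access URL shown follows the naming used on the main download site, and the supplementary packages are published on anaconda.org under the powerai channel, so confirm both before relying on them):

    # Append the early access channel at a lower priority than the main channel
    conda config --append channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access/

    # Append the supplementary open source packages published on anaconda.org
    conda config --append channels powerai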

The frameworks made available within WML CE were never intended to track the very latest upstream open source releases. Instead, each framework release was chosen and thoroughly tested by the IBM development team, with relevant patches included to ensure high levels of resiliency and performance. This also allowed IBM to offer a formal support agreement for these packages to customers that require enterprise-grade support for deep learning and machine learning workloads running in production environments.

However, most of the user base for the WML CE repository did not require this level of support; in fact, most users were from research institutions and academic organisations with no need for formal support. Instead, the request from much of the user community has been for IBM to provide newer framework versions and deliver packages as close to the upstream releases as possible.

IBM has worked with those academic and research users as well as the Open Source Community to deliver an alternative option for customers looking to run the latest deep learning and machine learning frameworks on their systems. This allows for greater flexibility and choice, as well as delivering the latest releases more rapidly to researchers and developers.

Open-CE on GitHub

This Open Source project has been created to encapsulate the experience of the WML CE development team and make that available to the wider community. It includes feedstocks for the common frameworks, making them available for customers to build their own packages, and host their own local Anaconda repository for their user community.

https://github.com/open-ce/

The Open-CE project is hosted on GitHub, with contributions welcomed from the community

The main benefit for the research community is that this method will give them faster access to the latest releases from the upstream projects for these frameworks. Users will be able to build new packages as often as they like to get access to the latest functionality and features. Sites will also be able to host their own Anaconda repositories for any built packages, making them available to the wider user community quickly and efficiently. As this is based around an Open Source approach, the worldwide user community will also be able to benefit from experiences and expertise from across the globe, and not be reliant on the IBM development team to incorporate the latest patches.

This is expected to give a faster rollout of deep learning and machine learning framework releases than we have been able to offer with WML CE, as well as greater flexibility for sites.

As this is a fully open source community, proprietary code such as the CUDA runtime and associated tools from NVIDIA will not be available from this Github repository. Instead, these would need to be sourced directly from NVIDIA to ensure acceptance of their Terms and Conditions. For many HPC environments these would likely be included in the base OS image for the servers.

Other IBM proprietary applications that were included in WML CE, such as SnapML, are not available in Open-CE. The latest release of SnapML is still available through the WML CE repository, and the development team is still evaluating how to provide it to customers in the future.

Alongside the feedstocks to build each of the frameworks, the Open-CE GitHub repository includes documentation and instructions on how to build either individual frameworks or a whole release. This includes the steps to build a local Anaconda repository, as well as guidance on how to build container images incorporating these frameworks. Build scripts are provided so that these packages can be included in a CI/CD pipeline or other automated build processes, allowing sites to maintain their own up-to-date repositories for researchers to use.
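
As a rough sketch of that workflow, the steps below clone the project, build the packages defined in a release environment file, and then serve the results as a local conda channel. The entry point, environment file names, and output directory have varied between Open-CE releases, so treat this as illustrative and follow the documentation in the repository itself.

    # Clone the Open-CE tooling and feedstock definitions
    git clone https://github.com/open-ce/open-ce.git
    cd open-ce

    # Build every package defined in a release environment file
    # (the entry point and environment file names vary between releases)
    ./open-ce/open-ce build env envs/opence-env.yaml

    # The build output directory is itself a conda channel; index it (requires
    # conda-build) and install from it locally or publish it for other users
    conda index ./condabuild
    conda install -c file://$(pwd)/condabuild tensorflow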

The current release (v1.0.1) is set up to run on CUDA 10.2 to maintain consistency with the latest WML CE release. However, future releases will use CUDA 11.x, which is supported on Red Hat Enterprise Linux 8 on the IBM Power Systems AC922 and IC922 servers.
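
When moving between CUDA levels or mixing channels, it is worth confirming which CUDA release the installed frameworks were actually built against and that the GPUs are visible. A quick check along these lines works for most recent framework releases (the TensorFlow build-info API is only available in newer versions):

    # Report the CUDA version PyTorch was built against and whether GPUs are visible
    python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"

    # Newer TensorFlow releases expose their build configuration, including the CUDA version
    python -c "import tensorflow as tf; print(tf.sysconfig.get_build_info().get('cuda_version'))"

    # Confirm the driver level and that the GPUs are visible on the host
    nvidia-smi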

Open-CE Public Repositories

The Open-CE project has been set up to deliver the build recipes and instructions for the common deep learning and machine learning frameworks; it does not include a public Anaconda repository in the way that WML CE does. Instead, several institutions intend to provide their own public repositories for these packages, making them available to the wider community. The first of these is the Open Source Lab at Oregon State University, which already has an Open-CE conda channel up and running.

https://osuosl.org/services/powerdev/opence/

The Oregon State University Open Source Lab provides a public repository of packages

Having community-provided repositories offers greater flexibility, with some sites likely to host frequent updates to keep pace with the open source communities whilst others maintain fixed releases to ensure consistency and reduce risk. Customers will then be able to choose which of these repositories they want to pull from for each project. For instance, a researcher could use the latest nightly build of TensorFlow to make use of the newest features for training, but deploy that model on a more resilient and tested package from a different source for inferencing against live experimental data.
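
As a sketch of that kind of per-project choice, the commands below create two environments pulling from a community-hosted Open-CE channel. The channel URL and the version pin are illustrative assumptions; use the channel address published on the OSU OSL page above and the framework versions it actually carries.

    # Community-hosted Open-CE channel (illustrative URL; confirm on the OSU OSL page)
    OPENCE_CHANNEL=https://ftp.osuosl.org/pub/open-ce/current/

    # Environment tracking the latest available builds for experimentation
    conda create --name research-latest -c ${OPENCE_CHANNEL} python=3.7 tensorflow

    # Separate environment pinned to a tested release for inference against live data
    conda create --name inference-stable -c ${OPENCE_CHANNEL} python=3.7 "tensorflow=2.3"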

The experience of pulling packages from these public repositories is the same as with the WML CE repository today, so existing workflows can remain unchanged. This provides consistency for users already working with the packages from WML CE whilst offering greater flexibility to new researchers coming on to the platform.

There are also plans to deliver the frameworks within Open-CE as OCI compliant container images for sites that are making use of technologies like Docker / Singularity / Kubernetes / Red Hat OpenShift to manage workloads on their clusters. These will be similar to the container images already publicly available for the WML CE provided packages.

https://hub.docker.com/r/ibmcom/powerai/
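
As an illustration of how those images are used today, the commands below pull a WML CE image and run a quick GPU check inside it. The tag is an assumption for this example, so check the Docker Hub page above for the tags that are actually published, and note that some images require accepting the license terms at run time.

    # Pull a WML CE container image (tag is illustrative; see the Docker Hub page above)
    docker pull ibmcom/powerai:1.7.0-all-ubuntu18.04-py37

    # Run it with access to the host GPUs (requires the NVIDIA container runtime;
    # older Docker releases use --runtime nvidia instead of --gpus all)
    docker run --rm -it --gpus all ibmcom/powerai:1.7.0-all-ubuntu18.04-py37 \
        python -c "import torch; print(torch.cuda.is_available())"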



Written by Andrew Laidlaw

IBM Systems AI team member, focusing on the best infrastructure and software for Deep Learning and Machine Learning workloads.
