A key part of the overall AI Booster environment is Datahub, the Vodafone data lake and one of our most valuable data vaults. Built on BigQuery, Datahub is split by local market and is the source of all data used in AI/ML use cases. Access to each data segment is vetted by that segment’s owner, and each local market uses a separate project to enforce additional isolation. Vodafone’s data science teams have read-only access to the production data they need for their specific use cases.

AI Booster itself consists of five projects for each use case, each with a distinct purpose. Three of these projects are fairly typical environments:

LAB: A development environment where data scientists can create and iterate on models. Scratch storage is read/write and is purged every 14 days for compliance reasons.

Staging: A pre-production environment. It has the same storage settings as the LAB environment.

PROD: Production environment with R/W access to the production Datahub.

Additionally, there are two auxiliary Google Cloud projects used for build purposes:

BuildSTAGING: Cloud Build, promoting from LAB to Staging

BuildPROD: Cloud Build, promoting from Staging to PROD
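
To make the five-project layout concrete, here is a minimal sketch in Python. It only models the structure described above. The project ID naming scheme, the role names, and the "market-a" label are assumptions made for illustration, not Vodafone's actual conventions.

```python
from dataclasses import dataclass

# Hypothetical role names; the article only names the five project types.
ENVIRONMENTS = ("lab", "staging", "prod")
BUILD_PROJECTS = {
    "build-staging": ("lab", "staging"),  # Cloud Build: LAB -> Staging
    "build-prod": ("staging", "prod"),    # Cloud Build: Staging -> PROD
}


@dataclass(frozen=True)
class UseCaseProjects:
    use_case: str
    market: str

    def project_ids(self) -> list[str]:
        """Return one illustrative project ID per role (naming scheme is assumed)."""
        roles = list(ENVIRONMENTS) + list(BUILD_PROJECTS)
        return [f"{self.use_case}-{self.market}-{role}" for role in roles]

    def promotion_path(self) -> list[tuple[str, str, str]]:
        """Return (build project, source env, target env) for each promotion step."""
        return [(build, src, dst) for build, (src, dst) in BUILD_PROJECTS.items()]


churn = UseCaseProjects(use_case="churn", market="market-a")
print(churn.project_ids())
print(churn.promotion_path())
```

Keeping the two build projects separate from the three environments means a promotion step can be granted access only to the environments it moves code between.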

The entire setup, including all environments, is created on request by Vodafone’s workflow tool. This workflow integrates privacy and responsible AI governance processes, requiring approval from each stakeholder in the use case or project before any resources are deployed. Once approved, the workflow tool triggers the build process using Terraform and Google Cloud Build, creating all Google Cloud projects, service accounts, VPC/networking, etc. According to Omotayo Aina, solution architect at Vodafone Group Technology, the architecture has built-in security using VPC-SC, CMEK, Cloud NAT, etc., and has no access to the public internet.

Central mirror and repository

When Vodafone data scientists and ML engineers need access to libraries, packages and containers, they get them from a central mirror. The central mirror is a key component of the AI Booster architecture, serving vetted artifacts (from Artifact Registry) and making them quick and easy to find. The central mirror consists of:

PyPI mirror: A local mirror of Python packages

Docker Hub mirror: A mirror of hub.docker.com that pulls from public Docker repositories (such as Mongo)

Container store: Containers that can be used to launch notebooks or Vertex AI pipelines

For added security, all artifacts are scanned, and compromised versions are flagged centrally to prevent developers from retrieving vulnerable artifacts. Projects that require a flagged artifact should look for an alternative version that is not vulnerable. If the artifact cannot be replaced, a residual risk review is performed and the risk is mitigated before approval is given. A NAT/firewall configuration prevents developers from reaching the original repositories outside the Vodafone network, ensuring that only mirrored artifacts are used.
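
The flagged-artifact policy above can be sketched as a small decision function. This is only an illustration of the three outcomes described (use as requested, replace with a clean version, or escalate to a residual risk review). The package names, versions, and the way flags are stored are all hypothetical; the article does not describe how the central mirror implements this.

```python
# Hypothetical data: the article does not describe how flags are stored.
FLAGGED = {
    ("examplepkg", "1.0.0"),
    ("examplepkg", "1.0.1"),
    ("legacypkg", "0.9.0"),
}
# Versions available in the mirror, assumed to be listed oldest to newest.
AVAILABLE = {
    "examplepkg": ["1.0.0", "1.0.1", "1.0.2"],
    "otherpkg": ["2.3.0"],
    "legacypkg": ["0.9.0"],
}


def resolve(package: str, requested: str) -> tuple[str, str]:
    """Return (version, status) following the policy described in the text."""
    if (package, requested) not in FLAGGED:
        return requested, "ok"
    # Prefer an alternative version that has not been flagged.
    alternatives = [
        v for v in AVAILABLE.get(package, []) if (package, v) not in FLAGGED
    ]
    if alternatives:
        return alternatives[-1], "replaced"
    # No clean version exists: a residual risk review is needed before approval.
    return requested, "needs-risk-review"


for pkg, version in [("otherpkg", "2.3.0"), ("examplepkg", "1.0.0"), ("legacypkg", "0.9.0")]:
    print(pkg, version, "->", resolve(pkg, version))
```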

Leveraging AI Booster: Predicting Churn in Two Different Markets

Vodafone operates in 17 countries, mainly in Europe. The Vodafone Group’s Big Data and AI team was recently tasked with building customer churn prediction models for some of these countries, but it proved to be a complex undertaking.

Each local market has its own set of requirements, and different data dimensions are available for each country. But, as Vodafone machine learning engineer Ari Kabkel said, “We want to replicate our work quickly across markets.” AI Booster makes this much easier.

To better understand the AI Booster architecture and how it helped improve our business, let’s walk through a real-world use case: predicting customer churn in one local market and then replicating the work in another.

1. LAB Environment: Development

A data scientist (DS) starts in the LAB environment. Here they can access:

A local Datahub (BigQuery) with access to customer analytics records (CAR). CAR is a wide table containing invoices, calls, data usage, and other customer data specific to the local market.

Private GitHub repositories for each use case

Many Google Cloud tools within AI Booster (Vertex AI, BigQuery, BigQuery ML, etc.)

Data scientists spend time on data preparation, preprocessing, feature selection, and model building, using frameworks such as BQML, Spark ML (serverless), XGBoost, LightGBM, scikit-learn, TensorFlow, and Google AutoML. As a best practice, all code is regularly checked into a GitHub repository.

To speed things up, there is a central template repository containing Cloud Build config files, Kubeflow Pipelines (KFP), pre-built (reusable) KFP components, build triggers, unit tests, payload files, queries, and dependency requirement files. A sample repository of this kind is published by Google Cloud.

As use cases evolve, the ability to run multiple parallel experiments with data becomes a priority. To do this, data scientists and machine learning engineers need to be able to quickly create training pipelines in Vertex Pipelines by leveraging the assets provided in our template repository.
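
As a rough illustration of what running multiple parallel experiments involves, the sketch below expands a small search space into one parameter set per experiment. In practice each set would be passed as parameters to a separate pipeline run. The search space, parameter names, and values are hypothetical, and the pipeline submission itself is deliberately left out because the article does not show it.

```python
from itertools import product

# Hypothetical search space; parameter names and values are illustrative only.
SEARCH_SPACE = {
    "framework": ["xgboost", "lightgbm"],
    "max_depth": [4, 8],
    "learning_rate": [0.05, 0.1],
}


def experiment_configs(space: dict[str, list]) -> list[dict]:
    """Expand a search space into one parameter dict per experiment."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


configs = experiment_configs(SEARCH_SPACE)
for i, params in enumerate(configs):
    # Each dict stands in for the parameters of one independent pipeline run.
    print(f"experiment-{i}: {params}")
```

Because every configuration is an independent dictionary, the runs do not share state and can be submitted side by side.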

