



Create your first computer vision project using label detection, object detection, facial expression detection, text detection, and dominant color detection.

Photo by Kevin Ku on Unsplash

Computer vision can be used to extract useful information from images and video. It enables computers to see and understand the information contained in visual input. Once that information is extracted, you can use it to decide the next steps to take.

The Google Vision API is a Google Cloud service that extracts valuable information from image input using computer vision. Beginners can use this service to gain meaningful insight into their images. The following diagram shows how the Google Vision API works.

Source: https://cloud.google.com/architecture/creative-analysis-at-scale-with-google-cloud-and-machine-learning

The image above demonstrates the functionality of the Google Vision API on an ad image: it can recognize facial expressions, text, and dominant colors. Face detection reports a joyful expression, text detection reads the words “LEARN MORE”, and dominant color detection returns the top 10 dominant colors in the image.

You can get a lot of insight from your images by leveraging the Google Vision API. For example, suppose you want to know which elements of an ad image cause customers to click and view your ad. The Google Vision API can help you discover this.

This article focuses on how to extract insights from images and what insights can be obtained from a given image. I won’t use real ad images because they are a trade secret and I can’t share them. Instead, I use product images from a Kaggle dataset for the analysis.

The images in this project come from Kaggle’s stylized product image dataset. Since the dataset contains a large number of product images from e-commerce sites, I use only a small portion of the images for the creative analysis. The dataset license allows you to copy, modify, distribute, and work on it.

Before you begin, you must configure the Vision API service in your Google Cloud account. See here for detailed instructions. To keep things simple, I’ll walk you through how to set up the Vision API in Google Cloud.

(Note: You must configure the Vision API from your own Google Cloud account. This tutorial does not provide a file containing a confidential Google Cloud key.)

Step 1: Log in to your Google Cloud project and, from the home page, choose “Go to APIs overview”.

(Image from author)

Step 2: Choose “Enable APIs and Services”, search for Cloud Vision API, and enable it.

(Images from author)

Step 3: Go to “Credentials”, click “Create Credentials”, and choose “Service account”.

(Image from author)

Step 4: Enter your service account information (you can skip the optional parts) and click “Done”.

(Image from author)

Step 5: Go to the service account you created. Open its Keys section and add a key by creating a new key.

(Images from author)

Step 6: Select JSON as the key type, download the JSON file, and place it in the working directory of your Python script.

(Image from author)

Before we can start the computer vision analysis, we need to install the necessary libraries. The first is google-cloud-vision, the Python client library used to run the Vision API detections. You can use this library once you have access to the Google Cloud Vision API.

The next library is webcolors. It is useful when you need to convert a hexadecimal color code returned by color detection to the closest known color name.

After installing the required libraries, import them into your script. Import vision from the Google Cloud library to run the detections. Other libraries, such as IPython, io, and pandas, are used for data preprocessing.

webcolors is used to convert hexadecimal color codes into familiar color names. KDTree is used to find the closest color match in the CSS3 color list. A KDTree indexes a set of k-dimensional points so that the point closest to a query can be found quickly.

After placing the JSON file in the directory, you need to activate the Google Cloud Vision API service in your Python script by authenticating with that key.
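A minimal sketch of this activation step, assuming the key from Step 6 was saved as service_account_key.json (a placeholder name; use the name of your own file). Google client libraries read the key path from the GOOGLE_APPLICATION_CREDENTIALS environment variable. The helper names set_credentials and make_client are illustrative and not taken from the original script.

```python
import os


def set_credentials(key_path="service_account_key.json"):
    """Point Google client libraries at the service account key file."""
    # "service_account_key.json" is a placeholder name (assumption).
    absolute_path = os.path.abspath(key_path)
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = absolute_path
    return absolute_path


def make_client():
    """Create the Vision client; requires google-cloud-vision to be installed."""
    # Imported lazily so set_credentials() works even without the library.
    from google.cloud import vision

    return vision.ImageAnnotatorClient()
```

Call set_credentials() once near the top of the script, then create the client with make_client() and reuse it for every detection.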

Label detection finds the labels that describe an image. Label annotations can identify common objects, places, activities, products, and other things in images. The code below extracts the label information from the images in the product image dataset.
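As a hedged sketch of what label detection can look like with the google-cloud-vision client (not necessarily the author's exact code): detect_labels sends one image to the API, and summarize_labels turns the returned label annotations into (description, score) pairs. Both function names and the sorting step are assumptions made for illustration.

```python
def summarize_labels(label_annotations):
    """Return (description, score) pairs, most confident label first."""
    pairs = [(label.description, round(label.score, 2)) for label in label_annotations]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def detect_labels(client, image_path):
    """Run label detection on one image file and summarize the result."""
    from google.cloud import vision  # requires google-cloud-vision

    with open(image_path, "rb") as image_file:
        image = vision.Image(content=image_file.read())
    response = client.label_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    return summarize_labels(response.label_annotations)
```

Passing the result to pandas.DataFrame(..., columns=["label", "score"]) is one convenient way to view it as a table.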

From this image you can see that the Google Vision API detected some common labels such as:

Facial expression (smile)
Human body (face, joints, skin, arms, shoulders, legs, human body, sleeves)
Objects (shoes)

Although the Vision API identified many labels, some common objects were misidentified and others were not mentioned at all. It mistook the sandals for shoes, and it did not recognize the clothes, houseplant, mug, and chair in the image above.

Object detection finds individual objects in an image. Unlike label detection, object detection is primarily concerned with the confidence level of each detection. It scans an image for multiple objects and returns a LocalizedObjectAnnotation for each one, giving the object’s position as a rectangular bounding box.
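The following sketch shows one way to read the object detection result, assuming the google-cloud-vision client. summarize_objects reduces each localized object to its name, a confidence percentage, and a bounding box in normalized coordinates (0 to 1). The function names and the optional min_confidence threshold are illustrative assumptions, not part of the original script.

```python
def summarize_objects(object_annotations, min_confidence=0.0):
    """Return name, confidence (%) and bounding box for each detected object."""
    rows = []
    for obj in object_annotations:
        if obj.score < min_confidence:
            continue  # skip detections below the chosen threshold
        xs = [vertex.x for vertex in obj.bounding_poly.normalized_vertices]
        ys = [vertex.y for vertex in obj.bounding_poly.normalized_vertices]
        rows.append({
            "name": obj.name,
            "confidence_pct": round(obj.score * 100),
            # (left, top, right, bottom) as fractions of image width/height
            "box": (min(xs), min(ys), max(xs), max(ys)),
        })
    return rows


def detect_objects(client, image_path, min_confidence=0.0):
    """Run object localization on one image file and summarize the result."""
    from google.cloud import vision  # requires google-cloud-vision

    with open(image_path, "rb") as image_file:
        image = vision.Image(content=image_file.read())
    response = client.object_localization(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    return summarize_objects(response.localized_object_annotations, min_confidence)
```

Raising min_confidence is a simple way to drop weak detections before interpreting the results.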

From this image you can see that the Google Vision API detected some objects as follows:

Sunglasses (Confidence: 90%)
Necklace 1 (Confidence: 83%)
Necklace 2 (Confidence: 77%)
Miniskirt (Confidence: 76%)
Shirt (Confidence: 75%)
Clothing (Confidence: 70%)
Necklace 3 (Confidence: 51%)

In the image above, we can see that the Vision API mostly identifies clothing objects. Because of their high confidence, several objects are clearly identified, including the sunglasses, necklace 1, necklace 2, the miniskirt, the shirt, and the clothing. Necklace 3 has the lowest confidence: the Vision API treats the item in the lower right corner as a necklace, but it looks more like a bracelet, so its confidence is lower than that of the other objects.

Face detection finds human faces in an image and rates their emotions. Each FaceAnnotation gives the position of a face in the image, together with a likelihood rating for several facial expressions.

From the image above, you can see that the Google Vision API rated the expressions on the face as follows:
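A hedged sketch of reading the emotion ratings, assuming the google-cloud-vision client. The API reports each emotion as a likelihood enum whose integer values 0 to 5 correspond to the names in LIKELIHOOD_NAMES. The API field is called sorrow_likelihood, which is shown here under the label Sadness to match this article. The function names are illustrative assumptions.

```python
# Likelihood enum values 0..5 as defined by the Vision API.
LIKELIHOOD_NAMES = (
    "UNKNOWN", "VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY",
)


def summarize_face(face):
    """Map one face annotation to readable emotion likelihoods."""
    return {
        "Joy": LIKELIHOOD_NAMES[face.joy_likelihood],
        "Sadness": LIKELIHOOD_NAMES[face.sorrow_likelihood],
        "Anger": LIKELIHOOD_NAMES[face.anger_likelihood],
        "Surprise": LIKELIHOOD_NAMES[face.surprise_likelihood],
    }


def detect_faces(client, image_path):
    """Run face detection on one image file; returns one dict per face."""
    from google.cloud import vision  # requires google-cloud-vision

    with open(image_path, "rb") as image_file:
        image = vision.Image(content=image_file.read())
    response = client.face_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    return [summarize_face(face) for face in response.face_annotations]
```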

Joy: VERY_LIKELY
Sadness: VERY_UNLIKELY
Anger: VERY_UNLIKELY
Surprise: VERY_UNLIKELY

In the image above, the person is smiling, which the Vision API recognizes as an expression of joy. Sadness, anger, and surprise do not match the picture because the person is not expressing those emotions. Accordingly, joy is rated very likely and the others are rated very unlikely.

Text detection finds and extracts text from images. The extracted text includes individual words and sentences, each located within a rectangular bounding box.
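A minimal sketch of reading the text detection result, assuming the google-cloud-vision client. In the API response, the first text annotation holds the entire detected text and the remaining annotations are the individual words. The function names are illustrative assumptions.

```python
def summarize_text(text_annotations):
    """Split the annotations into the full text and the individual words."""
    annotations = list(text_annotations)
    if not annotations:
        return {"full_text": "", "words": []}
    # The first annotation is the whole text; the rest are single words.
    full, words = annotations[0], annotations[1:]
    return {
        "full_text": full.description.strip(),
        "words": [word.description for word in words],
    }


def detect_text(client, image_path):
    """Run text detection on one image file and summarize the result."""
    from google.cloud import vision  # requires google-cloud-vision

    with open(image_path, "rb") as image_file:
        image = vision.Image(content=image_file.read())
    response = client.text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    return summarize_text(response.text_annotations)
```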

From this image you can see that the Google Vision API detected various texts such as:

this looks awesome

For some reason, the Vision API also returned what appears to be a Japanese word, even though the image contains none. It may have interpreted part of the image as Japanese characters.

The result shows that both uppercase and lowercase words are detected. I also found the word THISIS in the detected text; it should be THIS IS. This points to a limitation of the Vision API: it read THIS IS as THISIS because the gap between the two words is too narrow.

Dominant color detection is a feature of the image properties annotation. It finds the top 10 dominant colors in an image and the percentage of the image that each color covers.
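The sketch below shows one way to turn the dominant color result into hexadecimal codes, assuming the google-cloud-vision client. Each returned color carries red, green, and blue components in the 0 to 255 range plus a pixel_fraction. Using pixel_fraction as the reported percentage is an assumption, and the function names are illustrative.

```python
def rgb_to_hex(red, green, blue):
    """Format 0-255 color components as a six-digit hexadecimal code."""
    return "{:02X}{:02X}{:02X}".format(int(red), int(green), int(blue))


def summarize_colors(color_infos):
    """Return the hex code and pixel percentage of each dominant color."""
    return [
        {
            "hex": rgb_to_hex(info.color.red, info.color.green, info.color.blue),
            "pixel_pct": round(info.pixel_fraction * 100, 1),
        }
        for info in color_infos
    ]


def detect_dominant_colors(client, image_path):
    """Run image properties detection on one image file."""
    from google.cloud import vision  # requires google-cloud-vision

    with open(image_path, "rb") as image_file:
        image = vision.Image(content=image_file.read())
    response = client.image_properties(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    colors = response.image_properties_annotation.dominant_colors.colors
    return summarize_colors(colors)
```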

As shown in the image above, the Google Vision API detected the top 10 dominant colors, shown here in hexadecimal format. To get an actual color name, convert each hexadecimal code to a name using the CSS3 color list from webcolors. Since most detected colors do not match a CSS3 color exactly, use a KDTree to find the closest named color.

I used the hexadecimal color A41B24, the second dominant color, as an example. Using this nearest-color lookup, I found that the closest color in the CSS3 list is firebrick. This matches the reddish color of the sneakers in the image above.
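To illustrate the lookup without extra dependencies, the sketch below swaps the KDTree for a plain linear scan, which finds the same nearest neighbor under Euclidean distance in RGB space. It also uses a small hand-picked subset of CSS3 named colors instead of the full webcolors list, so results for other inputs may differ from the full lookup. The names CSS3_SUBSET, hex_to_rgb, and closest_color_name are illustrative assumptions.

```python
# A small subset of CSS3 named colors (assumption: the full list has many more).
CSS3_SUBSET = {
    "firebrick": (178, 34, 34),
    "brown": (165, 42, 42),
    "darkred": (139, 0, 0),
    "crimson": (220, 20, 60),
    "navy": (0, 0, 128),
    "gray": (128, 128, 128),
    "white": (255, 255, 255),
    "black": (0, 0, 0),
}


def hex_to_rgb(hex_code):
    """Convert 'A41B24' or '#A41B24' to an (r, g, b) tuple."""
    hex_code = hex_code.lstrip("#")
    return tuple(int(hex_code[i:i + 2], 16) for i in (0, 2, 4))


def closest_color_name(hex_code, palette=CSS3_SUBSET):
    """Return the palette name with the smallest squared RGB distance."""
    target = hex_to_rgb(hex_code)

    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(target, palette[name]))

    return min(palette, key=squared_distance)
```

With this palette, closest_color_name("A41B24") returns "firebrick". For the full CSS3 list, a KDTree avoids scanning every color on each query.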

We have now run the creative analysis above using label, object, facial expression, text, and dominant color detection. After running the Vision API, each type of detection still shows limitations.

Label detection: Many common objects and facial expressions can be detected from photos, but some objects are misidentified (for example, sandals were identified as shoes in our analysis).

Object detection: The quality of a detection can be judged from its confidence level, but some objects are still misidentified (for example, the analysis identified a bracelet as another necklace).

Face detection: When an image shows a facial expression that the model does not cover, every expression is rated very unlikely, because the model cannot determine which expression it is.

Text detection: Text in an image can be detected, but unwanted text can appear in the result (for example, our analysis returned Japanese words even though there are none in the photo).

Dominant color detection: Multiple dominant colors can be detected, but currently the result can only be expressed in RGB or hexadecimal color formats. To get a familiar color name, we need to add a function that converts the hexadecimal code to a color name.

If you want to know more about the code used in this analysis, check out my GitHub repository.

(Note: Google Cloud keys are not included in the repository. You will need to create your own by following the steps above.)

GitHub: https://github.com/nugrahazikry

LinkedIn: https://www.linkedin.com/in/zikry-adjie-nugraha/

Sources:
1. https://Google.com/
2. https://towardsdatascience.com/simple-computer-vision-image-creative-analysis-using-google-vision-api-50cc42737a00

