Date: 2021-02-15



This winter, I discovered that Wellesley College, where I am currently studying media art and science, has an archive of over 100 years of course catalogs, admission guidelines, and annual reports. I was immediately excited by the interesting data that could be extracted from these documents, but the first step is to convert them to text, since there are not many analyses you can run on scans of old, browning PDFs.

So I started looking for a quick and effective way to run OCR on a large number of PDF files while preserving as much formatting and accuracy as possible. After trying several methods, I found that the Google Cloud Vision API gave the best results of any publicly available OCR tool I tried.

I couldn’t find a single comprehensive guide to running a simple OCR application with this great tool, so I wrote this one so that even people with little programming knowledge can use it.

Before you start, you need Python 3 and pip installed on your computer, the ability to run a Python program on your computer, and a text editor for editing code, such as Visual Studio Code. You also need a payment method to enter into your Google Cloud account, but you won’t need to spend any money to complete this tutorial. A debit card, credit card, or Google Wallet account will suffice.

To perform optical character recognition using Google Cloud Vision, you first need a Google account. This will allow you to log in to the Google dashboard for cloud services. One of the many services accessible from this dashboard is the file storage used to host PDF files that you want to convert to text.

The advanced machine learning algorithms behind the Cloud Vision API run in the cloud, so you need to upload your PDF to a file bucket hosted by Google, where the API can access it.

By the end of this tutorial, you will have written a text file containing all the text in the PDF to a location on your computer.

1. If you are not logged in to your Google account, visit google.com to log in or create an account. I don’t think my readers need any further guidance on this step.

2. After logging in to your Google account, follow this link to access your Google Cloud dashboard. If asked, agree to the terms of use. You should see a page similar to this.

Google Cloud Platform dashboard

3. Click the drop-down menu just to the right of the Google Cloud Platform logo. Mine says OCR test, which is the name of my currently open project, but yours will say something else. A window pops up with a list of recent projects and a [New Project] button in the upper right corner. Click the button to create a new project. Give the project a name that will help you remember its purpose. You don’t have to worry about the other fields. Click [Create]. After creating the project, reopen the window and select it from the list of recent projects.

4. You will see the project information, APIs, and other information panels for the newly created project, as shown in the screenshot above. On the [Getting Started] panel at the bottom left, click [Explore and enable APIs]. This lets you choose which Google APIs you want to use in this project.

APIs and services

5. In the menu bar at the top of the screen, click [Enable APIs and Services]. This will take you to the API library. Search for and select the Cloud Vision API.

6. Click [Enable] to make the API available in your project. This will take you to the Cloud Vision API overview page. In the upper right corner of the screen, click [Create credentials].

7. Under "Which API are you using?", select the Cloud Vision API from the drop-down menu. Under "Are you planning to use this API with App Engine or Compute Engine?", choose [No, I’m not using them]. Then click the blue [What credentials do I need?] button.

8. Now you can create a key that lets you authenticate yourself when you connect to the Cloud Vision API. Choose a service account name that is easy to remember and set the role to Owner. Set the key type to JSON. Click [Continue]. You can now download the JSON file that contains your credentials.

Your project is now created on Google Cloud Platform and you can use the Cloud Vision API. The next step is to upload the PDF document and save it in the cloud. You can then write a script to convert it to text.

9. If it’s not already open, click the navigation menu on the left side of Google Cloud Platform and scroll down until you see [Storage]. Click on it to open a drop-down menu, and choose [Browser] from that menu. At this point, if you haven’t already enabled billing, you need to. If you have Google Pay, you can use it here. Otherwise, you will need to enter your external payment information. I won’t walk through that here, since it depends on your payment method. When you are done, you will be presented with a dialog with the option to create a bucket.

10. Give the bucket a unique name. This is the storage repository inside the project you created earlier. Set the location for storing data to multi-region and choose the default storage class for your data. Click [Create].

The bucket is now set up, and any file you upload to it can be accessed from any API enabled in your current project. Upload the PDF file you want to transcribe by dragging and dropping it from wherever it is on your computer.

You are now ready to write a program that can access both this file and the Cloud Vision API by connecting to your Google Cloud services and providing the key you downloaded earlier.

Now that everything is set up on Google Cloud, you can install the necessary tools on your computer and use them to extract the text from the PDF file.

First, you may need to install a few things. Open a terminal and go to the folder where you want to save the Python script you are about to write. Enter the following commands:

pip install google-cloud-vision

pip install google-cloud-storage

These commands use pip to install two Python libraries, which provide tools for interacting with the Google Cloud Vision and Cloud Storage APIs. Then run

pip freeze

This lets you confirm that everything you need is installed. Yours may be newer versions, but you need the following:

google-api-core==1.14.3
google-api-python-client==1.7.11
google-auth==1.6.3
google-auth-httplib2==0.0.3
google-cloud==0.34.0
google-cloud-core==1.0.3
google-cloud-storage==1.20.0
google-cloud-vision==0.39.0
google-resumable-media==0.4.1
googleapis-common-protos==1.6.0

If you don’t have them, use pip to install the missing ones.
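If you would rather check from Python than read through the pip freeze output, a minimal sketch like the following reports which packages are missing. It relies only on the standard library (importlib.metadata, available from Python 3.8). The function name missing_packages and the REQUIRED list are illustrative choices for this sketch, not part of any Google tooling.

```python
from importlib.metadata import PackageNotFoundError, version

# The two libraries this tutorial installs directly with pip.
REQUIRED = ["google-cloud-vision", "google-cloud-storage"]


def missing_packages(names):
    """Return the names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            version(name)  # raises PackageNotFoundError if absent
        except PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    gaps = missing_packages(REQUIRED)
    print("Missing:", ", ".join(gaps) if gaps else "nothing")
```

Anything it lists can be installed with pip in the same way as the two commands above.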

Finally, you need to set up Google Application Credentials. This means registering the location of the JSON key you downloaded earlier, so that your computer can authenticate itself to your Google account when you run a program that uses Google Cloud services.

Here you can find great instructions for doing this on any platform. Once this is done, you can run programs that use Google Cloud services from the command line.
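Google's client libraries look for the key through the GOOGLE_APPLICATION_CREDENTIALS environment variable, which should hold the full path to the JSON file. As a rough sanity check before going further, a sketch like this confirms that the variable is set and points at something that looks like a service account key. The function name check_credentials is hypothetical, and the check is deliberately shallow: it does not contact Google or prove that the key is still valid.

```python
import json
import os


def check_credentials(environ=os.environ):
    """Return (ok, message) describing the local credentials setup."""
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return False, "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return False, f"No file found at {path}"
    try:
        with open(path, encoding="utf-8") as handle:
            key = json.load(handle)
    except (OSError, ValueError) as error:
        return False, f"Could not read the key file: {error}"
    # Service account key files carry a "type" field with this value.
    if not isinstance(key, dict) or key.get("type") != "service_account":
        return False, "The file does not look like a service account key"
    return True, f"Found a key for {key.get('client_email', 'an unnamed account')}"


if __name__ == "__main__":
    print(check_credentials()[1])
```

If the message says the variable is not set, go back to the platform-specific instructions before running anything that talks to Google Cloud.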

Now comes the fun part: writing a script that actually performs optical character recognition on the selected PDF. Create a new Python file and open it in your favorite code editor. I’ll explain each part of the script I used so that you understand it well enough to swap in your own information. The entire script can also be found here on my GitHub, but follow each step before downloading it to tinker with.

1. The first step is to import the required libraries.

You need to import json so that you can handle the output of Cloud Vision. re is a library that lets you use regular expressions to match a particular pattern in a string.

vision and storage from google.cloud let you use the Google Cloud Vision and Google Cloud Storage APIs.

2. The next step is to write a function that uses the Google Cloud Vision API to find every place in the PDF file where there is readable text. Be sure to read the comments in this function to understand what each step is doing.

A function that annotates where the text is in a PDF file
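For reference, a minimal sketch of such a function, modeled on Google's published sample for the google-cloud-vision client library, might look like the following. It assumes version 2.0 or later of the library (older releases such as 0.39.0 expose the same classes under vision.types and vision.enums instead), and the batch_size and timeout values are illustrative rather than required.

```python
def async_detect_document(gcs_source_uri, gcs_destination_uri):
    """Run OCR on a PDF in Cloud Storage and write JSON annotations back to it."""
    # Imported here so the file still loads if the library is not installed yet.
    from google.cloud import vision

    # The API accepts "application/pdf" and "image/tiff" for file input.
    mime_type = "application/pdf"
    # Number of pages grouped into each JSON output file (illustrative value).
    batch_size = 2

    client = vision.ImageAnnotatorClient()
    feature = vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)

    # Where to read the PDF from.
    input_config = vision.InputConfig(
        gcs_source=vision.GcsSource(uri=gcs_source_uri), mime_type=mime_type
    )
    # Where to write the annotation files.
    output_config = vision.OutputConfig(
        gcs_destination=vision.GcsDestination(uri=gcs_destination_uri),
        batch_size=batch_size,
    )

    request = vision.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config, output_config=output_config
    )
    operation = client.async_batch_annotate_files(requests=[request])

    print("Waiting for the operation to finish.")
    operation.result(timeout=420)  # seconds; raise this for very long PDFs
```

The recognition itself happens on Google's servers. The final line simply blocks until the operation is done or the timeout passes.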

In addition to the comments that describe this function, there are a few things to keep in mind. When you run Google’s amazing OCR tool on a document, you might expect it to return a text file. In fact, this function only outputs a JSON file, or several depending on the size of the PDF, containing information about where the text is in the file. Getting the actual text in a form we can read is the next step.
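To make the shape of that output concrete, here is a small sketch that walks a hand-written, heavily trimmed stand-in for one annotation file and rebuilds each word from its symbols. The sample data is invented for illustration and real files carry many more fields. The field names (responses, fullTextAnnotation, pages, blocks, paragraphs, words, symbols, boundingBox) follow the Cloud Vision JSON output as I understand it, so treat them as an assumption to verify against your own files.

```python
import json

# Invented, trimmed-down stand-in for one annotation file (one page, one word).
SAMPLE = json.dumps({
    "responses": [{
        "fullTextAnnotation": {
            "text": "Hi\n",
            "pages": [{
                "blocks": [{
                    "paragraphs": [{
                        "words": [{
                            "boundingBox": {"normalizedVertices": [
                                {"x": 0.1, "y": 0.1}, {"x": 0.2, "y": 0.1},
                                {"x": 0.2, "y": 0.15}, {"x": 0.1, "y": 0.15},
                            ]},
                            "symbols": [{"text": "H"}, {"text": "i"}],
                        }]
                    }]
                }]
            }],
        },
        "context": {"pageNumber": 1},
    }]
})


def words_with_boxes(json_string):
    """Yield (page_number, word, vertices) for every word in an annotation file."""
    for response in json.loads(json_string).get("responses", []):
        page_number = response.get("context", {}).get("pageNumber")
        annotation = response.get("fullTextAnnotation", {})
        for page in annotation.get("pages", []):
            for block in page.get("blocks", []):
                for paragraph in block.get("paragraphs", []):
                    for word in paragraph.get("words", []):
                        symbols = word.get("symbols", [])
                        text = "".join(s.get("text", "") for s in symbols)
                        box = word.get("boundingBox", {})
                        yield page_number, text, box.get("normalizedVertices", [])


for page_number, text, vertices in words_with_boxes(SAMPLE):
    print(page_number, text, vertices[0] if vertices else None)
```

The nesting is why the annotation files are so much larger than the text they describe: every word, and every character within it, comes with its own position on the page.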

This function takes two inputs. The first, gcs_source_uri, is the location of the PDF file in Google Cloud Storage. The second, gcs_destination_uri, is the location in Google Cloud Storage where the JSON files containing the annotations will be written.

URI is a term that describes the location of a file in Google Cloud Storage. You can think of it as a URL within Google Cloud Storage, or as a path on your computer. It shows where in the hierarchy of files you keep in Google Cloud a particular file can be found. To find the URI of a file, double-click the file to see its details and copy the URI from the table of data that opens.
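A Cloud Storage URI has the form gs://bucket-name/path/to/file, so it can be split into its bucket and its path with a regular expression. The sketch below shows one way to do that. The helper name split_gcs_uri and the example bucket are made up for illustration.

```python
import re

# gs://<bucket>/<object path>; the object path may be empty for a bare bucket.
GCS_URI = re.compile(r"^gs://([^/]+)/?(.*)$")


def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/file' into ('bucket', 'path/to/file')."""
    match = GCS_URI.match(uri)
    if match is None:
        raise ValueError(f"Not a Cloud Storage URI: {uri!r}")
    return match.group(1), match.group(2)


bucket_name, object_path = split_gcs_uri("gs://my-example-bucket/catalogs/1889.pdf")
print(bucket_name)
print(object_path)
```

Anything that does not start with gs:// raises a ValueError, which catches the common mistake of pasting the file's https link instead of its URI.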

To generate the annotations, write a line at the bottom of your Python file that calls the async_detect_document function with your two URIs.

The first URI is the path to the PDF document stored in your Google Cloud Storage bucket, which is where it will be read from. The second points to the folder that will store all the document annotations.

3. Now that you’ve annotated the PDF, you can finally go through the annotations Cloud Vision produced and write the text to a text file. Here is my code to do this. Again, be sure to read the comments.

This function takes only one argument, the URI of the location where you saved the annotations. The resulting transcription is written to a text file in the currently active directory, in addition to being printed in the terminal.

Call it with the same annotation directory you used as the destination before.
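As a rough sketch of what such a function can look like, the code below lists the annotation files under the given URI, sorts them by page, and writes their text to a local file. The names write_to_text, first_page, and text_from_annotations are hypothetical. It also assumes that the output files are named like output-1-to-2.json and that a recent google-cloud-storage release is installed (one that provides download_as_bytes).

```python
import json
import re


def first_page(blob_name):
    """Sort key: the starting page number in names like 'output-11-to-12.json'."""
    match = re.search(r"output-(\d+)-to-\d+\.json$", blob_name)
    return int(match.group(1)) if match else 0


def text_from_annotations(json_string):
    """Join the full text of every page response in one annotation file."""
    responses = json.loads(json_string).get("responses", [])
    # Pages with no readable text have no fullTextAnnotation entry.
    return "".join(r.get("fullTextAnnotation", {}).get("text", "") for r in responses)


def write_to_text(gcs_destination_uri, output_path="transcription.txt"):
    """Download every annotation file under the URI and save its text locally."""
    # Imported here so the helpers above work without the library installed.
    from google.cloud import storage

    match = re.match(r"gs://([^/]+)/(.+)", gcs_destination_uri)
    if match is None:
        raise ValueError(f"Expected gs://bucket/folder, got {gcs_destination_uri!r}")
    bucket_name, prefix = match.group(1), match.group(2)

    bucket = storage.Client().get_bucket(bucket_name)
    blobs = [b for b in bucket.list_blobs(prefix=prefix) if b.name.endswith(".json")]

    with open(output_path, "w", encoding="utf-8") as out:
        # Sort numerically so that page 11 does not land before page 3.
        for blob in sorted(blobs, key=lambda b: first_page(b.name)):
            text = text_from_annotations(blob.download_as_bytes().decode("utf-8"))
            print(text)
            out.write(text)


# Example call with a placeholder URI:
# write_to_text("gs://my-example-bucket/ocr-output/")
```

Sorting by the page number in the file name matters for longer documents, because a plain alphabetical listing would put output-11-to-12.json ahead of output-3-to-4.json.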

Congrats! If all goes well, you should have a text file that contains line-by-line transcriptions of all machine-readable text in the PDF. You may be surprised at how much it can read, handwriting included.

Here are some side-by-side comparisons of my results. This is a page from a course catalog in the Wellesley College archives, dating back to 1889. Despite using a completely unprepared PDF as the input file for this test, the results are very accurate, even for names and foreign-language words.

