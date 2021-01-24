



Achieving a 4x Pre-Training Speedup over a Strongly Tuned T5-XXL Baseline (Google’s Top Transformer)

Photo: Arthur Osipyan on Unsplash

Transformers have been very successful in machine learning, especially NLP (and now image processing). They are one of the hottest topics right now, and it makes sense for Google to try to improve them.

A few days ago, Google released a huge new paper proposing a way to significantly increase the number of parameters while keeping the number of floating-point operations (FLOPs) per example constant (FLOPs being the standard metric for ML computation cost).

It is well known that increasing the number of parameters increases the capacity and learning ability of a model (up to a certain point, of course). And as expected, the gains are large: a 4x pre-training speedup over T5-XXL and up to a 7x speedup over T5-Base and T5-Large.

Their paper is much longer than most papers (about 30 pages), so I’ll highlight only the most important details to keep this article concise and to the point.

Prerequisites

Photo: Science in HD on Unsplash

1. Mixture of Experts (MoE) algorithm

Mixture of experts refers to a machine learning technique where multiple experts (learners) are used to divide the problem space into homogeneous regions. If the output is conditioned on multiple levels of probabilistic gating functions, the mixture is called a hierarchical mixture of experts.

Source: Wikipedia

One of the most interesting things about MoE is that a typical deep learning network uses the same parameters for all inputs, while an MoE network uses different parameters for different inputs. This makes the network much more versatile.

And if you think about it, this means that the model has a huge number of (sparse) parameters, but not all of them are used at the same time. That is the essence of this paper.
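To make the "huge but sparse" idea concrete, here is a small back-of-the-envelope sketch. The layer sizes (d_model, d_ff, num_experts) are hypothetical values I picked for illustration, not figures from the paper, and I assume each token visits exactly one expert.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights and biases of one two-layer feed-forward block."""
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)


def moe_layer_params(d_model: int, d_ff: int, num_experts: int):
    """Return (total parameters, parameters touched by a single token)."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts          # one routing logit per expert
    total = num_experts * per_expert + router
    active = per_expert + router            # assumption: one expert per token
    return total, active


# Hypothetical sizes, chosen only for illustration.
total, active = moe_layer_params(d_model=512, d_ff=2048, num_experts=8)
print(f"total parameters:  {total:,}")
print(f"active per token:  {active:,}")
```

Adding experts grows the total almost linearly, while the work done for any single token stays roughly the same.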

MoE was introduced long before this paper, but it suffered from training instability and high computational cost, and those are the issues this paper addresses.

2. Distillation

Distillation is essentially “teaching” in ML. A smaller network mimics a larger, better network and learns from it without going through a lengthy training process of its own. This is a bit different from transfer learning, and it can improve performance considerably.
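As a rough illustration of what “mimicking” means, here is a minimal sketch of one common form of distillation loss: the student is trained against a blend of the teacher’s output probabilities and the true label. The function names and the mixing weight alpha are my own assumptions for the example, not something taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, alpha=0.5):
    """Cross-entropy of the student against a blended target.

    alpha is a hypothetical mixing weight: alpha = 1 uses only the teacher's
    soft probabilities, alpha = 0 uses only the one-hot ground truth.
    """
    teacher_probs = softmax(teacher_logits)
    student_probs = softmax(student_logits)
    target = [
        alpha * t + (1.0 - alpha) * (1.0 if i == label else 0.0)
        for i, t in enumerate(teacher_probs)
    ]
    return -sum(t * math.log(p) for t, p in zip(target, student_probs))


teacher = [2.0, 0.5, -1.0]
loss_agree = distillation_loss([2.0, 0.5, -1.0], teacher, label=0)
loss_disagree = distillation_loss([-1.0, 0.5, 2.0], teacher, label=0)
print(loss_agree, loss_disagree)
```

A student that agrees with the teacher gets a much lower loss than one that disagrees, which is the signal that pulls the small network towards the large one.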

In fact, Facebook’s recent state-of-the-art DeiT paper, which classifies images with zero convolutions and less than 1% of the data that state-of-the-art models use, relies on distillation. If you want to know more about the distillation token trick, check out my article here:

3. Model and data sharding (parallel processing)

If your model and data are very large, you have to start splitting them across multiple cores. This is usually challenging, but it is easier here because the model is mostly sparse (not all parameters are used at the same time). In fact, simple and efficient sparsity is the paper’s main selling point!
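To give a feel for what splitting means in practice, here is a toy sketch: experts are placed on cores round-robin (model sharding) and the batch is cut into one slice per core (data sharding). Both helper functions are hypothetical and far simpler than the layouts the paper actually uses.

```python
def place_experts(num_experts: int, num_cores: int) -> dict:
    """Round-robin placement: expert e lives on core e % num_cores."""
    placement = {core: [] for core in range(num_cores)}
    for expert in range(num_experts):
        placement[expert % num_cores].append(expert)
    return placement


def shard_batch(batch: list, num_cores: int) -> list:
    """Data parallelism: give every core a contiguous slice of the batch."""
    per_core = -(-len(batch) // num_cores)  # ceiling division
    return [batch[i * per_core:(i + 1) * per_core] for i in range(num_cores)]


placement = place_experts(num_experts=8, num_cores=4)
shards = shard_batch(list(range(10)), num_cores=4)
print(placement)
print(shards)
```

Because each token only needs one expert, a core only has to hold (and run) the experts assigned to it, which is why sparsity makes this kind of split comparatively painless.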

Paper highlights (how it works, briefly)

Photo: Garett Mizunaka on Unsplash

One of the most enjoyable things about this paper is its engineering mindset. You have to be smart when dealing with this much computational power and this many model parameters. That is why the first highlight is how tokens are routed to the right expert (MoE) after the attention block.

The model starts with the classic self-attention block (the essence of Transformers). The attention part aggregates information and relates the individual items in the sequence to each other, producing a new sequence in which every token can collect information from all other tokens [2].

Then comes a feed-forward network (FFN), which processes every token separately. The FFN’s job is to determine the best representation of each token for the next layer. So basically, attention relates tokens to each other, and the feed-forward layers relate layers to each other [2]. You can think of the FFN as an intermediary translating between two parties: on one side the tokens that need to be processed, on the other a group of experts.

Here is the interesting part: everything described so far is an MoE Transformer. We haven’t reached the Switch Transformer yet.

Routing trick

The Switch Transformer introduces a Switch layer in front of the FFN layers. This essentially turns each FFN into an expert, and the Switch layer matches every token to the right FFN/expert. In short, it routes each token to the expert best suited to it.

For example, one FFN may specialize in processing nouns, while another may specialize in verbs, punctuation, and so on [2].

They call this concept switch routing, and it is essentially an upgrade to mixture of experts. Earlier MoE authors assumed that routing each token to a single expert would not work and that every token had to be sent to at least two experts. Google came up with a rather novel way around this.

Instead, we use a simplified strategy where we route to only a single expert. We show this simplification preserves model quality, reduces routing computation and performs better. This k = 1 routing strategy is later referred to as a Switch layer. The benefits for the Switch layer are three-fold:

(1) The router computation is reduced as we are only routing a token to a single expert.

(2) The batch size (expert capacity) of each expert can be at least halved since each token is only being routed to a single expert.

(3) The routing implementation is simplified and communication costs are reduced.

Source: Switch Transformers Paper
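The k = 1 routing described above can be sketched in a few lines. This is a minimal illustration, not the authors’ implementation: the router is a plain matrix of weights, the experts are hypothetical Python functions standing in for FFNs, and the chosen expert’s output is scaled by its routing probability.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(token, router_weights):
    """Return (expert index, gate value) for one token.

    router_weights[e] is the weight vector producing the logit for expert e.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)  # top-1 only
    return expert, probs[expert]


def switch_layer(tokens, router_weights, experts):
    """Send each token to exactly one expert and scale the result by its gate."""
    outputs = []
    for token in tokens:
        expert, gate = switch_route(token, router_weights)
        outputs.append([gate * y for y in experts[expert](token)])
    return outputs


# Two toy "experts" standing in for feed-forward networks (hypothetical).
experts = [
    lambda t: [2.0 * x for x in t],   # expert 0 doubles its input
    lambda t: [-x for x in t],        # expert 1 negates its input
]
router_weights = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 2.0]]

choices = [switch_route(t, router_weights) for t in tokens]
outputs = switch_layer(tokens, router_weights, experts)
print(choices)
print(outputs)
```

Only one expert runs per token, so the router does a single argmax instead of combining several expert outputs.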

Everyone knows that ML relies on floating-point arithmetic. And if you deploy a large distributed model, you have to send a lot of floating-point numbers around. Floats come in two main sizes, 16-bit and 32-bit. If you only use 16-bit numbers, you cannot reliably perform some standard ML computations, and you cannot send 32-bit numbers (which switch routing needs) because of computational constraints.

So what did they do? They introduce a selective precision technique: the model receives 16-bit floats, the floats needed for the sensitive operations are selectively upcast to 32-bit, and the results are then cast back down to 16-bit. A simple solution to a hard problem!
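Here is a small sketch of that idea using only the Python standard library. I simulate reduced precision by rounding through struct: the "e" format is IEEE half precision, which I use as a stand-in for the 16-bit format (the paper works with bfloat16), and "f" is 32-bit. The helper names are my own.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round to the nearest IEEE half-precision value (stand-in for 16-bit)."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def softmax_rounded(logits, round_fn):
    """Softmax in which every intermediate result is rounded by round_fn."""
    m = max(logits)
    exps = [round_fn(math.exp(round_fn(x - m))) for x in logits]
    total = 0.0
    for e in exps:
        total = round_fn(total + e)
    return [round_fn(e / total) for e in exps]


def selective_precision_router(logits16):
    """Upcast 16-bit logits, run the softmax in 32-bit, downcast the result."""
    logits32 = [to_fp32(x) for x in logits16]       # upcast
    probs32 = softmax_rounded(logits32, to_fp32)    # sensitive op in 32-bit
    return [to_fp16(p) for p in probs32]            # downcast before sending


logits16 = [to_fp16(v) for v in [1.5, -0.3, 2.7, 0.1]]
reference = softmax_rounded(logits16, float)        # full double precision
err16 = max(abs(a - b) for a, b in zip(softmax_rounded(logits16, to_fp16), reference))
err32 = max(abs(a - b) for a, b in zip(softmax_rounded(logits16, to_fp32), reference))
probs = selective_precision_router(logits16)
print(err16, err32, probs)
```

Only the router’s softmax pays for the wider format; everything that leaves the router is 16-bit again, so the amount of data sent between devices does not grow.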

They also optimize these processes through the concept of a capacity factor, where each expert processes only a fixed number of tokens.
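A quick sketch of how a capacity factor could work: each expert’s capacity is the even share of tokens per expert multiplied by the capacity factor, and tokens that arrive after an expert is full overflow. Rounding the capacity up with ceil and the dispatch helper are my own assumptions for the example.

```python
import math


def expert_capacity(tokens_per_batch: int, num_experts: int, capacity_factor: float) -> int:
    """Even share of tokens per expert, scaled by the capacity factor."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)


def dispatch(assignments, num_experts: int, capacity: int):
    """Fill per-expert buffers in order; tokens beyond capacity overflow.

    assignments[i] is the expert chosen for token i. Overflowed tokens are
    returned separately (they skip the expert in this sketch).
    """
    buffers = [[] for _ in range(num_experts)]
    overflow = []
    for token_idx, expert in enumerate(assignments):
        if len(buffers[expert]) < capacity:
            buffers[expert].append(token_idx)
        else:
            overflow.append(token_idx)
    return buffers, overflow


assignments = [0, 0, 0, 0, 1, 2, 2, 3]   # hypothetical router decisions

tight = expert_capacity(len(assignments), num_experts=4, capacity_factor=1.0)
loose = expert_capacity(len(assignments), num_experts=4, capacity_factor=1.5)

buffers_tight, overflow_tight = dispatch(assignments, 4, tight)
buffers_loose, overflow_loose = dispatch(assignments, 4, loose)
print(tight, buffers_tight, overflow_tight)
print(loose, buffers_loose, overflow_loose)
```

A larger capacity factor means fewer overflowed tokens when routing is unbalanced, at the price of more memory and computation reserved per expert.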

In addition, the authors use distillation techniques based on the BERT model to alleviate some of the deployment issues (because these models are huge).

Source: arXiv (table reproduced in LaTeX)

As a result, the distilled model keeps roughly 30% of the large sparse model’s quality gain without any increase in its number of parameters. This proves the magic of distillation!

Final thoughts

I have covered a lot of the work in this paper, but this article is not exhaustive. The paper has more to say about distributing models and data, but my aim was only to give the highlights. It’s always great to see innovation. I think the best part is the use of engineering and computer systems techniques to solve ML problems (routing, selective precision, etc.). This shows that ML is not only mathematics and statistics, but also computer science.

If you’re interested in reading about other novel papers, check out my articles here:

References:

[1] William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. 2021. arXiv.

[2] Yannic Kilcher. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. YouTube.
