Sunday, 21 July 2019

Benchmarking process for TF-TRT, and a workaround for the Coral USB Accelerator

A couple of days ago I published some benchmarking results running a TF-TRT model on the Pi and Jetson Nano. I said I'd write up the benchmarking process. You'll find the details below. The code I used is on GitHub.

I've also managed to get a Coral USB Accelerator running with a Raspberry Pi 4. I encountered a minor problem, and I have explained my simple but very hacky workaround at the end of the post.

TensorFlow and TF-TRT benchmarks

Setup


The process was based on this excellent article, written by Chengwei Zhang.

On my workstation


I started by following Chengwei Zhang's recipe: I trained the model on my workstation, then copied the resulting trt_graph.pb to the Pi 4.

On the Raspberry Pi 4


I used a virtual environment created with pipenv, and installed jupyter and pillow.

I downloaded and installed this unofficial wheel.

I tried to run step2.ipynb but encountered an import error. This turned out to be an old TensorFlow bug resurfacing.

The maintainer of the wheel will fix the problem when time permits, but I used a simple workaround.

I ran cd `pipenv --venv` to move to the virtual environment's root, then cd lib/python3.7/site-packages/tensorflow/contrib/ to reach the directory containing the offending __init__.py file.

The problem lines are

if os.name != "nt" and platform.machine() != "s390x":
     from tensorflow.contrib import cloud
These try to import cloud from tensorflow.contrib, which isn't there and fortunately isn't needed :)

I replaced the second line with pass using

sed -i '/from tensorflow.contrib import cloud/s/^/ pass # /' __init__.py

and captured the timings.
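Equivalently, the same one-line patch can be applied from Python. This is a sketch of mine, not part of the original recipe; the helper name and path handling are illustrative:

```python
from pathlib import Path

BAD_IMPORT = "from tensorflow.contrib import cloud"

def patch_contrib_init(path):
    """Comment out the offending import in tensorflow/contrib/__init__.py.

    A bare `pass` is left in its place so the enclosing `if` block
    keeps a valid, non-empty body.
    """
    text = Path(path).read_text()
    Path(path).write_text(text.replace(BAD_IMPORT, "pass  # " + BAD_IMPORT))
```

From the venv root that would be something like patch_contrib_init("lib/python3.7/site-packages/tensorflow/contrib/__init__.py").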

Later, I ran raw-mobile-netv2.ipynb to see how long it took to run the training session, and to save the model and frozen graph on the Pi.

On the Jetson Nano


I used the Nano that I had configured for my series on Getting Started with the Jetson Nano; it had NVIDIA's TensorFlow, pillow and jupyter lab installed.

On the Nano, though, I found that I could not load the trt_graph.pb file I'd copied over from my workstation.

Since running the original training stage on the Pi had not taken as long as I'd expected, I ran step1.ipynb on the Nano and used the locally created trt_graph.pb file, which loaded OK.

Then I ran step2.ipynb and captured the timings which I published.

Using the Coral USB Accelerator with the Raspberry Pi 4


The Coral USB Accelerator comes with very clear installation instructions, but these do not currently work on the Pi 4.

A quick check of the install script revealed a hard-coded check for the Pi 3B or 3B+. Since I don't normally use the Pi 3B, I changed that entry to accept a Pi 4.

When I ran the modified script I found a couple of further issues.

The wheel installed in the last step of the script expects you to be using python3.5 rather than the Raspbian Buster default of python3.7.

As a result, I had to (cough)

cd /usr/local/lib/python3.7/dist-packages/edgetpu/swig/
cp _edgetpu_cpp_wrapper.cpython-35m-arm-linux-gnueabihf.so _edgetpu_cpp_wrapper.cpython-37m-arm-linux-gnueabihf.so



and change all references in the demo paths from python3.5 to python3.7.
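The rename works because CPython only imports compiled extension modules whose filename carries its own ABI tag; the copied file is still a 3.5 build, which is why this counts as a hack rather than a fix. You can see the tag your interpreter expects with a couple of lines of Python:

```python
import sysconfig

# CPython only loads extension modules whose filename ends with this
# ABI tag, e.g. '.cpython-37m-arm-linux-gnueabihf.so' under Raspbian
# Buster's python3.7. Renaming the 3.5 wrapper to match lets the
# import succeed, even though the binary itself is unchanged.
print(sysconfig.get_config_var("EXT_SUFFIX"))
```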

With these changes, the USB accelerator works very well. There are plenty of demonstrations provided. In this image it has correctly identified two faces in a photo from a workshop I ran at Pi Towers a couple of years ago.

It's an impressive piece of hardware. I am particularly interested in the imprinting technique, which allows you to add new image-recognition capability without retraining the whole of a compiled model.

Imprinting is a specialised form of transfer learning. It was introduced in this paper, and it appears to have a lot of potential. Watch this space!

Friday, 19 July 2019

Benchmarking TF-TRT on the Raspberry Pi and Jetson Nano


Trying to choose between the Pi 4B and the Jetson Nano for a Deep Learning project?

I recently posted some results from benchmarks I ran training and running TensorFlow networks on the Raspberry Pi 4 and Jetson Nano. They generated a lot of interest, but some readers questioned their relevance: they weren't interested in training networks on edge devices.

Most people expect to train on higher-powered hardware and then deploy the trained networks on the Pi and Nano. If they use TensorFlow for training, they have several choices for deployment:

  1. Standard TensorFlow
  2. TensorFlow Lite
  3. TF-TRT (a TensorFlow wrapper around NVIDIA's TensorRT, or TRT)
  4. Raw TensorRT
In this post I'll focus on timing standard TensorFlow and TF-TRT. In a later post I plan to cover TensorFlow Lite on the Pi, with and without accelerators like the Coral Edge TPU coprocessor and the Intel Compute Stick.

I've run a number of benchmarks, and the results have been much as I expected.
I did encounter one surprising result, though, which I'll talk about at the end of the post. It's a pitfall that could easily have invalidated the figures I'm about to share.

Benchmarking MobileNet V2


The results I'll report are based on running MobileNetV2 pre-trained with ImageNet data. I've adapted the code from the excellent DLology blog which covers deployment to the Nano. I've also deployed the model on the Pi using a hacked community build of TensorFlow, obtained from here. That has a wheel containing TF-TRT for python3.7, which is the default for Raspbian Buster.

(The wheel seems to have a minor bug. This weekend I'll set up a GitHub repo with the sed script I used to work around that, along with the notebooks I used to run the tests.)

The Pi 4B has 4GB of RAM and was running a freshly updated version of Raspbian Buster.
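For reference, the notebooks time inference along these lines. This is a sketch rather than the exact notebook code; `predict` stands in for a single-image inference call, and the warm-up and run counts are illustrative:

```python
import time

def benchmark(predict, image, warmup=5, runs=50):
    """Time a single-image inference callable.

    A few warm-up calls are made first so that one-off graph and
    memory initialisation doesn't distort the average. Returns
    (seconds_per_image, frames_per_second).
    """
    for _ in range(warmup):
        predict(image)
    start = time.perf_counter()
    for _ in range(runs):
        predict(image)
    per_image = (time.perf_counter() - start) / runs
    return per_image, 1.0 / per_image
```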

So here are the results you've been waiting for, expressed in seconds per image and frames per second (FPS).

Platform        Software    Seconds/image   FPS
Raspberry Pi    TF          0.226            4.42
Raspberry Pi    TF-TRT      0.20             5.13
Jetson Nano     TF          0.082           12.2
Jetson Nano     TF-TRT      0.04            25.63

According to these figures, the Nano is three to five times faster than the Pi, and TF-TRT is about twice as fast as raw TensorFlow on the Nano.
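Those ratios come straight from the FPS column of the table:

```python
# FPS figures from the table above.
pi_tf, pi_trt = 4.42, 5.13
nano_tf, nano_trt = 12.2, 25.63

print(round(nano_tf / pi_tf, 1))     # Nano vs Pi, plain TF
print(round(nano_trt / pi_trt, 1))   # Nano vs Pi, TF-TRT
print(round(nano_trt / nano_tf, 1))  # TF-TRT vs plain TF on the Nano
```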

TF-TRT is only slightly faster than raw TensorFlow on the Pi. I'm not sure why this should be, but the timings are pretty consistent. At some stage I'll run some other models, but those will have to do for now.

A benchmarking pitfall


I mentioned one pitfall. When I re-ran the tests for this blog post I got much slower performance from the Nano using TF-TRT - around 5 fps.

Fortunately Raffaello Bonghi's excellent jtop package saved the day. jtop is an enhanced version of top for Jetsons which shows real-time GPU and memory usage.

Looking at its output, I realised that an earlier session on the Nano was still taking up memory. Once I'd closed that session down, a re-run gave me the 25 fps which I and others had seen before.

I continue to be impressed by the Pi 4 and the Nano.

While the Nano's GPU offers significantly faster performance on Deep Learning tasks, it costs almost twice as much. Both represent excellent value for money, and your choice will depend on the requirements of your project.

Sunday, 14 July 2019

Training ANNs on the Raspberry Pi 4 and Jetson Nano


There have been several benchmarks published comparing performance of the Raspberry Pi and Jetson Nano. They suggest there is little to choose between them when running Deep Learning tasks.

I'm sure the results have been accurately reported, but I found them surprising.

Don't get me wrong. I love the Pi 4, and am very happy with the two I've been using. The Pi 4 is significantly faster than its predecessors, but...

The Jetson Nano has a powerful GPU that's optimised for many of the operations used by Artificial Neural Networks (ANNs).

I'd expect the Nano to significantly outperform the Pi running ANNs.


How can this be?


I think I've identified reasons for the surprising results.

At least one benchmark appears to have tested the Nano in 5 W power mode. I'm not 100% certain, as the author has not responded to several enquiries, but the article talks about the difficulty of finding a 4 A USB supply. That suggests the author is not entirely familiar with the Nano and its documentation.

The docs make it clear that the Nano will only run at full speed if 10 W mode is selected, and 10 W mode requires a PSU with a barrel jack rather than a USB connector. You can see the barrel jack in the picture of the Nano above.

Another common factor is that most of the benchmarks test inference with a pre-trained network rather than training speed. While many applications will use pre-trained nets, training speed still matters; some applications will need training from scratch, and others will require transfer learning.

I've done a couple of quick tests of relative speed with a 4GB Pi 4 and a Nano running training tasks on the MNIST digits and fashion data sets.

The Nano was in 10 W max-power mode. The Pi was cooled using the Pimoroni Fan SHIM, and it did not get throttled by overheating.

Training times using MNIST digits


The first network uses MNIST digits data and looks like this:
model = tf.keras.models.Sequential([
  tf.keras.layers.Conv2D(32, (3,3), activation='relu', input_shape=(28, 28, 1)),
  tf.keras.layers.MaxPooling2D(2, 2),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10, activation='softmax')
])
While training, that takes about 50 seconds per epoch on the Nano, and about 140 seconds per epoch on the Pi 4.

MNIST Fashion training times with a second, larger CNN layer


The second (borrowed from one of the deeplearning.ai course notebooks) uses the MNIST fashion data set and looks like this:

model = tf.keras.models.Sequential([
  tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1)),
  tf.keras.layers.MaxPooling2D(2, 2),
  tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
  tf.keras.layers.MaxPooling2D(2,2),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10, activation='softmax')
])
While training, that takes about 59 seconds per epoch on the Nano and about 336 seconds per epoch on the Pi.

Conclusion


These figures suggest that the Nano is between 2.8 and 5.7 times faster than the Pi 4 when training.
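The range quoted comes straight from the two per-epoch timings:

```python
# Seconds per epoch from the two training runs above.
nano = [50, 59]    # Jetson Nano: digits model, fashion model
pi = [140, 336]    # Raspberry Pi 4: digits model, fashion model

speedups = [round(p / n, 1) for p, n in zip(pi, nano)]
print(speedups)  # [2.8, 5.7]
```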

That leaves me curious about relative performance when running pre-trained networks. I'll report when I have some more data.