Digit App

HELP

Get started

This app demonstrates digit recognition using deep learning. Draw a digit 0-9 (or something else) in the drawing area and press Predict. You get two predictions: which digit 0-9 the symbol most resembles, and whether the symbol is a digit at all.

How does it work?

The pixels of the drawn image (784 values, each 0-255) are the input to a model that transforms them into 10 values (the predictions), which can be interpreted as the probabilities that the drawn symbol is each of the digits 0-9. The model contains almost 600,000 parameters and several layers that process and transform the input pixels. The values of the parameters are learned through a process called training, using a large number of labeled images. Labeled means that each input image is associated with a number 0-9 stating which digit the image shows. Training is very slow, so this app only predicts, using a saved model.

The prediction is just a mapping of 784 values into 10 values. So there is no magic, even if it is deep learning.

The other prediction (is it a digit or not) is made by another model trained in a different way (see Model: digit or not). Here the 784 pixels are transformed into two values: digit or not digit.

So the first prediction tells which digit 0-9 is most similar to the drawn symbol and the second prediction tells if the symbol is a digit or not.

Input and output

Drawing area: Draw within the red frame.
Predict button: To get predictions.
Clear button: To clear the drawing area.
Digit ?: Shows the predicted digit.
Accepted/Rejected: Shows if the drawn symbol is accepted as a digit.
Predictions: 10 values showing the probability of each digit 0-9. The digit with the highest value is put into the output field Digit. The predictions sum to 1.
Bar chart: Bar chart of the predictions 0-9.
Digit/not digit bar: Red shows the value of the non-digit prediction and green the digit prediction. If non-digit >= 0.5 (50%) the symbol is rejected as a digit. If the bar is yellow the app has detected more than one symbol in the drawing area.
Arrow buttons: Move the drawn symbol.
-/+: Zoom the drawing area out or in.
Rotate buttons: Rotate the drawn symbol 10 degrees. Rotation only works as expected if the symbol is centered.
360 button: Rotates the drawn symbol through 360 degrees, making a prediction after every 10 degrees. Press Stop to stop. Press the Predict button once first so that the symbol is normalized.
Toggle normalization button (Actions): Selects whether normalization is done before each prediction (green Predict button) or not (blue Predict button). When normalization is on (green button), the digit or not digit prediction is also made.
Toggle pen/eraser button (Actions): Use pen or eraser.
Normalize button (Actions): Normalizes the drawn symbol. The symbol is resized to the size our models expect (i.e. the size they were trained on) and is also centered.
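
The rules for the Digit field and for Accepted/Rejected can be sketched in a few lines. This is only an illustration of the description above, not the app's actual code; the function name and the shape of the returned object are assumptions.

```javascript
// Hypothetical helper: the name and return shape are assumptions.
// digitProbs: the 10 predictions for digits 0-9.
// nonDigitProb: the "not a digit" value from the second model.
function interpretPredictions(digitProbs, nonDigitProb) {
  // The digit with the highest prediction goes into the Digit field.
  let digit = 0;
  for (let i = 1; i < digitProbs.length; i++) {
    if (digitProbs[i] > digitProbs[digit]) digit = i;
  }
  // The symbol is rejected when the non-digit value is 0.5 or more.
  const accepted = nonDigitProb < 0.5;
  return { digit, accepted };
}

const exampleProbs = [0.01, 0.02, 0.05, 0.8, 0.02, 0.03, 0.02, 0.02, 0.02, 0.01];
const exampleResult = interpretPredictions(exampleProbs, 0.1);
console.log(exampleResult); // { digit: 3, accepted: true }
```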

MORE INFO

The MNIST dataset

Our model is trained using the famous MNIST dataset, a dataset of handwritten digits. The MNIST training set contains 60,000 images and the test set 10,000 images. The test set is not used for training but for evaluation, to see how good the model is. The pictures in the MNIST dataset are centered and normalized to the same size. This is why the position and size of the drawn symbol matter.
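
To give a feeling for what "centered" means, the sketch below computes how far a symbol must be moved so that its center of mass ends up in the middle of the image. It is an assumption that centering by center of mass is a reasonable stand-in here; the app's own normalization code may work differently, and the function names are made up for this example.

```javascript
// Sketch only: not the app's normalization code.
// image: 2D array of pixel values, 0 = background, higher = more ink.
function centerOfMass(image) {
  let total = 0, rowSum = 0, colSum = 0;
  for (let r = 0; r < image.length; r++) {
    for (let c = 0; c < image[r].length; c++) {
      total += image[r][c];
      rowSum += r * image[r][c];
      colSum += c * image[r][c];
    }
  }
  return total === 0 ? null : { row: rowSum / total, col: colSum / total };
}

// Number of pixels to move the symbol so that it becomes centered.
function centeringShift(image) {
  const com = centerOfMass(image);
  if (com === null) return { dRow: 0, dCol: 0 }; // nothing drawn
  return {
    dRow: Math.round((image.length - 1) / 2 - com.row),
    dCol: Math.round((image[0].length - 1) / 2 - com.col),
  };
}

// A 6x6 image with a 2x2 blob in the top-left corner.
const smallImage = Array.from({ length: 6 }, () => new Array(6).fill(0));
smallImage[0][0] = smallImage[0][1] = smallImage[1][0] = smallImage[1][1] = 255;
const shift = centeringShift(smallImage);
console.log(shift); // { dRow: 2, dCol: 2 }
```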

Model: digit 0-9

____________________________________________
Layer (type)                 Output shape   
============================================
conv2d_Conv2D1_input (InputL [null,28,28,1] 
____________________________________________
conv2d_Conv2D1 (Conv2D)      [null,26,26,32]
____________________________________________
conv2d_Conv2D2 (Conv2D)      [null,24,24,32]
____________________________________________
max_pooling2d_MaxPooling2D1  [null,12,12,32]
____________________________________________
conv2d_Conv2D3 (Conv2D)      [null,10,10,64]
____________________________________________
conv2d_Conv2D4 (Conv2D)      [null,8,8,64]  
____________________________________________
max_pooling2d_MaxPooling2D2  [null,4,4,64]  
____________________________________________
flatten_Flatten1 (Flatten)   [null,1024]    
____________________________________________
dropout_Dropout1 (Dropout)   [null,1024]    
____________________________________________
dense_Dense1 (Dense)         [null,512]     
____________________________________________
dropout_Dropout2 (Dropout)   [null,512]    
____________________________________________
dense_Dense2 (Dense)         [null,10]      
============================================
Total params: 594922
Trainable params: 594922
Non-trainable params: 0
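
The total can be checked by hand. The summary only lists output shapes, so the 3x3 filter size used below is an assumption, but it fits both the shrinking shapes (28 to 26, 26 to 24 and so on) and the reported total. The helper names are made up for this example.

```javascript
// Assumption: all Conv2D layers use 3x3 filters (not stated in the summary).
// Parameters = weights + one bias per output channel or unit.
const convParams = (k, inChannels, outChannels) => k * k * inChannels * outChannels + outChannels;
const denseParams = (inUnits, outUnits) => inUnits * outUnits + outUnits;

const layerParams = {
  conv2d_Conv2D1: convParams(3, 1, 32),   // 320
  conv2d_Conv2D2: convParams(3, 32, 32),  // 9248
  conv2d_Conv2D3: convParams(3, 32, 64),  // 18496
  conv2d_Conv2D4: convParams(3, 64, 64),  // 36928
  dense_Dense1: denseParams(1024, 512),   // 524800
  dense_Dense2: denseParams(512, 10),     // 5130
};

// MaxPooling2D, Flatten and Dropout have no parameters.
const totalParams = Object.values(layerParams).reduce((a, b) => a + b, 0);
console.log(totalParams); // 594922
```

Almost 90% of the parameters sit in the first dense layer.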

At prediction time the pixel-values of our image are the input to the first Conv2D-layer. The output of one layer is the input to the next. In the end we have a layer with just 10 values as output. Those 10 values are transformed into our 10 prediction values (the probabilities).
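
The text does not name the last transformation. A common choice for this kind of model is softmax, so the sketch below assumes that; the input values are invented for the example.

```javascript
// Assumption: the 10 output values are turned into probabilities with softmax.
function softmax(values) {
  const max = Math.max(...values);                   // subtract the max for numerical stability
  const exps = values.map(v => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);                     // all positive, sum is 1
}

// Invented raw output values, one per digit 0-9.
const rawOutputs = [1.2, -0.3, 0.5, 6.0, 0.1, -1.0, 0.0, 2.2, 0.4, -0.7];
const probabilities = softmax(rawOutputs);
const bestDigit = probabilities.indexOf(Math.max(...probabilities));
console.log(bestDigit); // 3
```

Softmax keeps the order of the values, so the largest raw value always becomes the largest probability.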

The layers have different roles:

Conv2D: Performs an image-to-image transform using "sliding filters". You can see that the dimensions of the "image" change from layer to layer. This layer is complicated and hard to describe (at least for me).
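
A heavily simplified sketch of one sliding filter may help. It handles a single channel and a single filter and leaves out the bias and the activation function, so it is far from a complete Conv2D layer. The 3x3 filter size is an assumption that matches the shapes in the summary.

```javascript
// Slide a k x k filter over a 2D image without padding.
// Each output value is the sum of (pixel * filter weight) under the filter.
function slideFilter(image, filter) {
  const k = filter.length;
  const outRows = image.length - k + 1;
  const outCols = image[0].length - k + 1;
  const out = [];
  for (let r = 0; r < outRows; r++) {
    const row = [];
    for (let c = 0; c < outCols; c++) {
      let sum = 0;
      for (let i = 0; i < k; i++) {
        for (let j = 0; j < k; j++) {
          sum += image[r + i][c + j] * filter[i][j];
        }
      }
      row.push(sum);
    }
    out.push(row);
  }
  return out;
}

// A 28x28 image of ones and a 3x3 filter of ones.
const onesImage = Array.from({ length: 28 }, () => new Array(28).fill(1));
const onesFilter = [[1, 1, 1], [1, 1, 1], [1, 1, 1]];
const filtered = slideFilter(onesImage, onesFilter);
console.log(filtered.length, filtered[0].length); // 26 26
```

The filter cannot hang over the edge, which is why a 3x3 filter turns 28x28 into 26x26.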

MaxPooling2D: Reduces the size of the tensor by keeping only the max value within a certain area (also uses "sliding filters").

Flatten: Flattens a multi-dimensional tensor into a single dimension. Here 4*4*64=1,024.
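
As a small sketch, assuming a 2x2 area that moves two pixels at a time (which fits the halving from 24 to 12 and from 8 to 4 in the summary):

```javascript
// Keep only the largest value in each 2x2 block of a 2D image.
function maxPool2x2(image) {
  const out = [];
  for (let r = 0; r + 1 < image.length; r += 2) {
    const row = [];
    for (let c = 0; c + 1 < image[r].length; c += 2) {
      row.push(Math.max(
        image[r][c], image[r][c + 1],
        image[r + 1][c], image[r + 1][c + 1]
      ));
    }
    out.push(row);
  }
  return out;
}

const pooled = maxPool2x2([
  [1, 3, 2, 0],
  [4, 2, 1, 1],
  [0, 0, 5, 6],
  [1, 2, 7, 8],
]);
console.log(pooled); // [ [ 4, 2 ], [ 2, 8 ] ]
```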

Dropout: During training, the dropout layer randomly sets some of its input values to zero. The other values are passed on to the output (TensorFlow scales them up by 1/(1 - rate) so that the expected sum stays the same). This reduces the risk of over-fitting. Over-fitting means that the parameters are adapted too closely to the training data: the accuracy on the training data may be close to 100% but lower on the test data. Over-fitting is a common problem. Dropout is only active during training.
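
A minimal sketch of the idea, following the usual "inverted dropout" scheme where the kept values are scaled by 1/(1 - rate). The function name and the injectable random source are assumptions made so that the example can be run with predictable numbers.

```javascript
// rate: fraction of values to drop (0 <= rate < 1).
// random: function returning a number in [0, 1); defaults to Math.random.
function applyDropout(values, rate, training, random = Math.random) {
  if (!training) return values.slice();   // not active when predicting
  const scale = 1 / (1 - rate);           // keeps the expected sum unchanged
  return values.map(v => (random() < rate ? 0 : v * scale));
}

// A fixed sequence instead of real randomness, for a repeatable example.
const fixedNumbers = [0.1, 0.9, 0.3, 0.7];
let nextIndex = 0;
const fakeRandom = () => fixedNumbers[nextIndex++ % fixedNumbers.length];

const duringTraining = applyDropout([1, 2, 3, 4], 0.5, true, fakeRandom);
const duringPrediction = applyDropout([1, 2, 3, 4], 0.5, false);
console.log(duringTraining);   // [ 0, 4, 0, 8 ]
console.log(duringPrediction); // [ 1, 2, 3, 4 ]
```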

Dense: The dense layers refine the output from the Conv2D layers. You can build a model with only dense layers, but the result will not be as good as with this model. In fact, before neural networks, other machine learning techniques were used with good results.
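
In a dense layer every output value is a weighted sum of all input values plus a bias. The sketch below shows only that sum; a real layer usually also applies an activation function, which is left out here. The numbers are invented.

```javascript
// output[j] = bias[j] + sum over i of input[i] * weights[i][j]
function denseLayer(input, weights, bias) {
  return bias.map((b, j) =>
    input.reduce((sum, x, i) => sum + x * weights[i][j], b)
  );
}

// 3 input values, 2 output values.
const denseOutput = denseLayer(
  [1, 2, 3],
  [[1, 0], [0, 1], [1, 1]],
  [0.5, -1]
);
console.log(denseOutput); // [ 4.5, 4 ]
```

With 1,024 inputs and 512 outputs the same formula needs 1,024*512 weights and 512 biases.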

Training

Before any prediction is possible, the 594,922 parameters of the model must have proper values, i.e. the model must be trained. The training process tries to minimize the loss value. The process is complicated but is handled by TensorFlow.
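
The text does not say which loss is used. For a model that outputs 10 probabilities, cross-entropy is the usual choice, so the sketch below assumes it. It shows the loss for one image; the reported loss would then be an average over many images.

```javascript
// Assumption: cross-entropy loss. The loss for one image is the negative
// logarithm of the probability the model gave to the correct digit.
function crossEntropy(probabilities, trueDigit) {
  return -Math.log(probabilities[trueDigit]);
}

// Two invented predictions for an image whose label is 1 (only two classes shown).
const goodLoss = crossEntropy([0.1, 0.9], 1); // model is confident and right
const badLoss = crossEntropy([0.9, 0.1], 1);  // model is confident and wrong
console.log(goodLoss < badLoss); // true
```

A loss close to zero means the model gives almost all probability to the correct digit.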

We can see that we get an accuracy on the test set of 99.4%.

Epoch 1 / 15
eta=0.6 =========================================> 
acc=0.964 loss=0.134
237005ms 4647us/step - 
acc=0.921 loss=0.245 val_acc=0.979 val_loss=0.0704
Epoch 2 / 15
eta=0.7 =========================================> 
acc=0.964 loss=0.0486
271554ms 5325us/step - 
acc=0.978 loss=0.0676 val_acc=0.989 val_loss=0.0384
...................................................
Epoch 14 / 15
eta=0.7 ==========================================> 
acc=1.00 loss=0.000358
560396ms 5494us/step - 
acc=0.996 loss=0.0114 val_acc=0.994 val_loss=0.0275
Epoch 15 / 15
eta=0.7 ==========================================> 
acc=1.00 loss=0.0000813
580674ms 5693us/step - 
acc=0.997 loss=0.0107 val_acc=0.994 val_loss=0.0277

Evaluation result:
  Loss = 0.022; Accuracy = 0.994

Model: digit or not

This model is built on top of the trained "digit 0-9" model. This means that the new model already knows a lot about what digits look like before its own training starts. The new model is created from the "digit 0-9" model by freezing all layers except the last two dense layers. Freezing means that the parameters are not changed during training. The last layer (the output layer) is modified to give two values: a prediction of digit or not digit. This method is called transfer learning.

The training data is labeled digit or not digit. It is produced by running a slightly modified version of this app. The pixel values of the drawn symbols are written to the local storage of the browser and then exported for training. The number of symbols for training is now around 1500. If the model makes a bad prediction for a certain symbol, you have to add that symbol with a label to the training set and retrain the model to get a better model. That is quite a lot of work...
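
The effect of freezing can be illustrated with plain objects standing in for the real layers. This is not the training code: the per-layer parameter counts assume 3x3 filters (not stated in the summary), the layers without parameters are left out, and the name of the new output layer is made up.

```javascript
// Plain objects stand in for real layers; counts assume 3x3 filters.
const digitModelLayers = [
  { name: "conv2d_Conv2D1", params: 320 },
  { name: "conv2d_Conv2D2", params: 9248 },
  { name: "conv2d_Conv2D3", params: 18496 },
  { name: "conv2d_Conv2D4", params: 36928 },
  { name: "dense_Dense1", params: 524800 },
  { name: "dense_Dense2", params: 5130 },
];

function buildDigitOrNot(layers) {
  // Freeze everything except the dense layers.
  const copy = layers.map(l => ({ ...l, trainable: l.name.startsWith("dense") }));
  // Replace the 10-value output with a 2-value output: 512*2 weights + 2 biases.
  copy[copy.length - 1] = { name: "digit_or_not_output", params: 512 * 2 + 2, trainable: true };
  return copy;
}

const digitOrNotLayers = buildDigitOrNot(digitModelLayers);
const countParams = (layers, trainable) =>
  layers.filter(l => l.trainable === trainable).reduce((sum, l) => sum + l.params, 0);

console.log(countParams(digitOrNotLayers, true));  // 525826 parameters still trained
console.log(countParams(digitOrNotLayers, false)); // 64992 parameters frozen
```

The frozen Conv2D layers keep what they learned from MNIST, and only the dense layers are adapted to the new task.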

In order to limit the number of symbols needed for training, the app checks whether there is more than one symbol in the drawing area (only one symbol is allowed). This is checked by flood filling the symbol with a fill color (black). If any non-black symbol pixels remain in the drawing area after the flood fill, the drawing is interpreted as containing more than one symbol.
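
The check can be sketched on a small grid of 0s and 1s instead of real canvas pixels. This is an illustration of the idea, not the app's code: the function name, the grid representation and the choice to treat diagonal neighbors as connected are all assumptions.

```javascript
// grid: 2D array, 1 = drawn pixel, 0 = background.
// Returns true if the drawn pixels form more than one connected symbol.
function hasMoreThanOneSymbol(grid) {
  const rows = grid.length, cols = grid[0].length;
  const work = grid.map(row => row.slice()); // leave the caller's grid untouched

  // Find the first drawn pixel to start the fill from.
  let start = null;
  for (let r = 0; r < rows && !start; r++) {
    for (let c = 0; c < cols && !start; c++) {
      if (work[r][c] === 1) start = [r, c];
    }
  }
  if (!start) return false; // empty drawing area

  // Flood fill: mark every drawn pixel reachable from the start with 2.
  const stack = [start];
  work[start[0]][start[1]] = 2;
  while (stack.length > 0) {
    const [r, c] = stack.pop();
    for (let dr = -1; dr <= 1; dr++) {
      for (let dc = -1; dc <= 1; dc++) {
        const nr = r + dr, nc = c + dc;
        if (nr >= 0 && nr < rows && nc >= 0 && nc < cols && work[nr][nc] === 1) {
          work[nr][nc] = 2;
          stack.push([nr, nc]);
        }
      }
    }
  }

  // Any drawn pixel that was not reached belongs to another symbol.
  return work.some(row => row.includes(1));
}

const oneStroke = [[0, 1, 0], [0, 1, 0], [0, 1, 0]];
const twoStrokes = [[1, 0, 0, 0], [1, 0, 0, 1], [1, 0, 0, 1]];
console.log(hasMoreThanOneSymbol(oneStroke));  // false
console.log(hasMoreThanOneSymbol(twoStrokes)); // true
```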

Implementation

The app is written in JavaScript, using TensorFlow.js for the machine learning and Bootstrap for the UI. All code related to the app runs in the browser. The web server only hosts the files of the app. The files (including the model) are downloaded to the browser, so the predictions are made in the browser. The training, however, is done outside the browser because the models are rather large.

The source code of this app is available on GitHub.

				
Copyright 2019 GubboIT

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Acknowledgment

This app was inspired by the excellent book "Deep Learning with JavaScript" from Manning Publications.
