Oracle 23 will introduce many new solutions, including a number of features for working with text data in JSON notation. The most important are:
Storing the JSON format as a native data type (OSON format) for databases regardless of where they are run. Previously this was only available for Autonomous JSON Database in OCI. I recommend this post for details.
Clearer syntax
Oracle API for MongoDB – the same data, and you can operate with SQL and/or programmatically
In this post, we will focus on an example of syntax simplification and an example of using JSON aggregation, construction and JSON unnesting operations. The text data is stored in the native format and operations on JSON documents will be performed using SQL. The basic structure and data are taken from Chris Saxon’s great blog, which focuses on Oracle version 19c.
create table departments_json (
department_id integer not null primary key,
department_data json not null
);
insert into departments_json
values ( 100, 'Random junk');
SQL Error: ORA-40441: JSON syntax error
JZN-00078: Invalid JSON keyword 'Random' (line 1, position 1)
40441. 00000 - "JSON syntax error"
*Cause: The provided JavaScript Object Notation (JSON) data had invalid
syntax and could not be parsed.
*Action: Provide JSON data with the correct syntax.
OK, that’s fine. No junk in our backyard. Now it is time for an example SQL statement in which we use all three JSON-related operations together.
select
JSON {
'Year' : EXTRACT( YEAR FROM TO_DATE( substr(hireDate,1,10), 'YYYY-MM-DD' )),
'Total' : COUNT(*)
}
from departments_json NESTED department_data COLUMNS (
department,
NESTED employees[*] COLUMNS (
name, hireDate
))
GROUP BY EXTRACT( YEAR FROM TO_DATE( substr(hireDate,1,10), 'YYYY-MM-DD' ));
We create a new JSON document by executing SQL. We use the unnesting operation on the JSON because employees is an array. Then we group by the hire year extracted from the unnested array data. The result is below and yes, it is a JSON document.
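For readers who think more easily in a programming language, the same three steps (unnest, group, construct) can be sketched in plain Python. This is only an illustration of the logic, not what the database does internally. The sample document, the employee names and the function name hires_per_year are hypothetical; only the keys department, employees, name and hireDate come from the query above. The sort by year is also an addition, since the SQL has no ORDER BY.

```python
import json
from collections import Counter

# Hypothetical document shaped like department_data in the query above.
department_data = json.loads("""
{
  "department": "Accounting",
  "employees": [
    {"name": "Employee A", "hireDate": "2002-06-07T00:00:00"},
    {"name": "Employee B", "hireDate": "2002-06-07T00:00:00"},
    {"name": "Employee C", "hireDate": "2005-01-15T00:00:00"}
  ]
}
""")


def hires_per_year(documents):
    """Unnest employees, group by hire year, build one JSON object per group."""
    counts = Counter(
        int(employee["hireDate"][:4])          # the first 4 characters hold the year
        for document in documents              # one document per department row
        for employee in document["employees"]  # unnest the employees array
    )
    # Construct one {"Year": ..., "Total": ...} object per group, sorted by year.
    return [{"Year": year, "Total": total} for year, total in sorted(counts.items())]


result = hires_per_year([department_data])
print(json.dumps(result))
```

With this sample data, two employees fall into 2002 and one into 2005, so the list holds two objects.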
There are two common problems in machine learning related to learning:
Overfitting
Underfitting
Various methods are used to fight them. For underfitting, model tuning using hyperparameters is used (hyperparameters as opposed to the parameters of neurons in a neural network). Various methods are used to minimize overfitting – penalty functions, increasing the data set, etc. In this post we will look at an example of how you can improve the quality of the model by reducing the phenomenon of overfitting.
While working on the code and filtering out unnecessary log output (yes, the Tensorflow library is very talkative, but in this case that is good, and we can also control what is displayed), I came across one optimization introduced in version 2.15 -> oneDNN operations for x86 CPU. The technical explanation of the improvement is quite complicated, but I really like the detailed description of the changes and the consideration of which areas of existing solutions it may affect, as well as the different categories of users/consumers of Tensorflow library solutions.
Let’s get back to the problem. We continue the topic started in the previous post. Overfitting is the phenomenon where the model has learned to classify the data in the training set well and shows good values for evaluation metrics there, but does worse on data it has not seen. Here we will limit ourselves to one metric, accuracy. There are quite a lot of metrics, and the Tensorflow documentation describes them in a practical way here. Let’s start. The code below imports the necessary modules (the tensorflow-addons module must be installed).
import tensorflow
import tensorflow_datasets as tfds
import tensorflow_addons as tfa
You can check the version like this:
print("Tensorflow version is : {}".format(tensorflow.__version__))
print("Tensorflow dataset version is : {}".format(tfds.__version__))
print("Tensorflow-addons version is : {}".format(tfa.__version__))
My Tensorflow version is 2.15, Tensorflow Datasets is 4.9.4, and Tensorflow-addons is 0.22.0. By the way, I found out that support for Tensorflow-addons ends at the end of May 2024, and that there may be a problem with some functionalities when it is used with Tensorflow 2.15. What we need is not particularly complicated, so let’s take a risk and use this library. Tensorflow has an API, Tensorflow Datasets, which allows you to download various types of data sets. We will use the collection “Horses and humans” described here.
Sets in Tensorflow Datasets can be divided into different subsets. The simplest are the training set and the test set. Here we also divide these sets into smaller batches to train the model batch by batch. (For now) Let’s not get into that. Batch learning generally gives better results. We will use the same batch settings in each model. Let’s define our CNN network.
This is a classification into two categories -> human or horse. For this, one neuron is enough (the last layer with the sigmoid activation function). Convolutional layers (“highlight certain types of lines”) are used together with pooling (“sharpen”). Why were 5 such pairs used and not 3, 7, 9 or 11? I haven’t found a specific analytical explanation for this. The larger the image, the more such operations must be performed. The idea is probably to increase the number of patterns in the images. Compilation of the model with a training phase (with testing/validation) looks as follows.
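As a side note on the question of how many convolution/pooling pairs fit, here is a rough sketch of how the feature map shrinks with each pair. It rests on assumptions that are not stated in this post: 300x300 input images, 3x3 filters without padding, and 2x2 pooling. The function name feature_map_size is hypothetical.

```python
def feature_map_size(image_size, pairs, kernel=3, pool=2):
    """Side length of a square feature map after `pairs` convolution + pooling pairs."""
    size = image_size
    for _ in range(pairs):
        size = size - (kernel - 1)  # an unpadded convolution trims the border
        size = size // pool         # pooling keeps one value per pool x pool window
    return size


# Assumed 300x300 input; compare a few candidate depths.
for pairs in (3, 5, 7):
    print(pairs, "pairs ->", feature_map_size(300, pairs))
```

Under these assumptions, 3 pairs leave a 35x35 map, 5 pairs leave 7x7, and with 7 pairs nothing would be left, which is at least consistent with the remark that larger images allow (and need) more such operations.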
For the learning process, we obtained an accuracy of 94.06%, while the classification accuracy for the test and validation set is 90.62%. Our network has overfitted. One possible fix is to increase the amount of data in the training set. How can you do this for photos? By carrying out various transformations. The Tensorflow library has image transformation routines distributed across various modules, for example tensorflow.image and tensorflow_addons. Let’s define a transformation where we rotate the image by 40 degrees and centralize it by 20% and 70%.
Normalizing the pixel values and casting them to float32 are needed to perform the transformation. To apply the transformation to the training set, a mapping operation must be performed:
train_aug1 = data.map(augmentimages1)
The next step is to merge the original training set with the set of modified images. To avoid problems, additional operations must be performed. The code looks like this.
def cast_data(image, label):
    '''
    :param image: Image to be cast to float32
    :param label: Label of the image
    :return: Image cast to float32 and its unchanged label
    '''
    image = tensorflow.cast(image, tensorflow.float32)
    return image, label
We obtained quite an interesting result. Even though the training accuracy was a rather low 62.54% (down from 94.06%!), the model performed better on the test set: 92.63% (up from 90.62%!). The model certainly did not overfit in this situation. Additionally, we see that the number of steps has increased to 411. That is expected: more data, more work. Someone may ask what such transformations look like and how one can be sure that the modified pixel values are exactly what they are supposed to be. Let’s use a notebook in OCI for this.
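As an aside on the step count mentioned above: the number of steps per epoch is the number of training examples divided by the batch size, rounded up. The sketch below uses assumed numbers that are not given in this post (1027 images in the original training split, a batch size of 10, and three augmented copies merged with the original set). They are just one combination that happens to produce 411 steps.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


original_examples = 1027  # assumption: size of the original training split
batch_size = 10           # assumption: batch size used for training
augmented_copies = 3      # assumption: number of augmented sets merged in

print(steps_per_epoch(original_examples, batch_size))                           # 103
print(steps_per_epoch(original_examples * (1 + augmented_copies), batch_size))  # 411
```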
We have to install additional libraries which are not part of the pre-installed set in notebooks.
Let’s define the transformation functions, apply them, and define a procedure for displaying images using the matplotlib library in Python.
import tensorflow
import tensorflow_datasets as tfds
import tensorflow_addons as tfa
import matplotlib.pyplot as plt
print("Tensorflow version is : {}".format(tensorflow.__version__))
print("Tensorflow dataset version is : {}".format(tfds.__version__))
print("Tensorflow-addons version is : {}".format(tfa.__version__))
And lastly, let’s check whether the original image is still the same.
To sum up, you can fight against overfitting of the neural network by expanding the training set in an artificial way – in the case of CNN, generating new images by transforming existing ones. Oracle Cloud Infrastructure provides multiple services that enable end-to-end support for the lifecycle of a machine learning model and the ability to quickly conduct experiments on notebooks.
Tomasz ZIss
14:30 on 2024-01-09 Tags: #machinelearning, #neuralnetworks, #oci, #oracle ( 2 ), #python
A few days ago I was talking to my friend about neural networks and image category recognition using the MNIST Fashion dataset; you can see a visualization of this dataset here. MNIST Fashion is a collection of grayscale wardrobe images. The training set has 60,000 elements, while the test set has 10,000. I started talking about how you can significantly increase the accuracy of image classification using convolution and pooling operations. My friend, an electronics graduate, immediately realized that a similar operation is performed on signals. I was more on the IT education path, but I did remember a mathematical operation like the one below.
If you are interested in details, you can find them here. In the case of CNN, the convolution operation is the operation of modifying pixels in an image using a filter. Filters can have different sizes – 3×3, 5×5, and so on. The new pixel value is calculated as the sum of the products of the adjacent pixel values and the values in the filter.
Well, my friend asks, how much better is a network using convolution and pooling (CNN – convolutional neural network) than a deep neural network (DNN) without convolution and pooling, that is, one that works on the unmodified pixels of the input image?
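The "sum of products" rule can be written out in a few lines of plain Python. This is a minimal sketch, not how Tensorflow implements it: the filter slides over the image without padding, and, as is usual in CNN libraries, it is not flipped (strictly speaking this is cross-correlation). The function name convolve2d, the tiny sample image and the vertical-edge filter are hypothetical.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the element-wise products."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # New pixel value: neighbouring pixels times the matching filter values.
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4x5 grayscale image: dark on the left, bright on the right.
image = [[0, 0, 0, 10, 10] for _ in range(4)]
# Hypothetical 3x3 filter that responds to vertical edges.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

print(convolve2d(image, kernel))  # [[0, 30, 30], [0, 30, 30]]
```

The flat dark area gives 0, while the windows that straddle the dark-to-bright boundary give 30, which is the "highlighting certain types of lines" effect of a filter.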
DNN – accuracy of 88% after 50 learning iterations
CNN – accuracy of 99% after 20 learning iterations
My friend, as befits a true engineer, goes on and asks how long the cycles in these networks last. I replied that it would take a long time. Well, I haven’t measured it, but you can’t leave an answer like that forever. Of course, my laboratory is not omnipotent, but such time can be measured. For implementation, I used a notebook available at OCI cloud and the Tensorflow library. Here you can find a great series describing another competing library, PyTorch, and how it can be used in DataScience services at OCI. The Tensorflow library offers automatic display of statistics about each iteration of the neural network training:
After carrying out the measurements, I received the following results:
DNN – iteration time in the range of 5-7 seconds
CNN – iteration time in the range of 23-30 seconds
So we have a 4-5 times increase in the case of the CNN network. Well, the CNN network is more complicated – in my implementation it has 7 layers, and the DNN network only 3. My friend and I expected that it might even be an order of magnitude. But I started wondering how many parameters (weights and biases in neurons + values in filters) must be learned in each of these two types of networks. The Tensorflow library has a very cool function that you can use to describe a model -> model.summary()
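The parameter counts that such a summary reports can also be reproduced by hand. The sketch below uses two hypothetical architectures for 28x28 grayscale images, since the actual layer sizes are not listed in this post: a 3-layer DNN (Flatten, Dense 128, Dense 10) and a 7-layer CNN (two pairs of 3x3 convolutions with 64 filters and 2x2 pooling, then Flatten, Dense 128, Dense 10). The helper names are hypothetical as well.

```python
def dense_params(inputs, units):
    """One weight per input for every neuron, plus one bias per neuron."""
    return inputs * units + units


def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """One kernel per (input channel, filter) pair, plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


# Hypothetical DNN: Flatten(28x28) -> Dense(128) -> Dense(10)
dnn_total = dense_params(28 * 28, 128) + dense_params(128, 10)

# Hypothetical CNN: Conv(64, 3x3) -> Pool -> Conv(64, 3x3) -> Pool -> Flatten -> Dense(128) -> Dense(10)
# Spatial size: 28 -> 26 -> 13 -> 11 -> 5, so Flatten yields 5 * 5 * 64 values.
cnn_total = (
    conv2d_params(3, 3, 1, 64)
    + conv2d_params(3, 3, 64, 64)
    + dense_params(5 * 5 * 64, 128)
    + dense_params(128, 10)
)

print(dnn_total)  # 101770
print(cnn_total)  # 243786
```

Pooling and Flatten layers have no trainable parameters, which is why they do not appear in the sums. For these assumed architectures the CNN has roughly 2.4 times as many parameters as the DNN.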
The summary shows that the CNN had to learn twice as many parameters as the DNN to correctly classify the images into the clothing categories. This resulted in a 4-5 times longer iteration time in the learning phase. Well, yes, we expected an increase in time, but we bet on an increase of at least an order of magnitude. Someone might think that 99% on the training set is better than 88%. Well, it depends, because in machine learning there is the phenomenon of overfitting: the model should not learn the data from the training set “too accurately”, because it will then be inflexible to slightly different patterns in the test data. More can be found here. It also reminded me of a similar task from the past – image recognition. The point was to correctly recognize road signs in the photos. The implementation consisted of extracting image features and building rules like: “If a sign is a triangle and has a border in the color …” Playing with neural networks and the MNIST data set also shows how traditional programming differs from machine learning. Traditional programming implements rules, while machine learning finds rules from data.