Tomasz Ziss 10:03 on 2024-01-22

Oracle 23 and JSON – Example SQL and JSON native data type

Oracle 23 will introduce many new solutions, including many new features related to text data in JSON notation. The most important:

Storing JSON as a native data type (OSON format) in databases regardless of where they run. Previously this was available only for Autonomous JSON Database in OCI. I recommend this post for details.
Clearer syntax
Oracle API for MongoDB – the same data can be accessed with SQL and/or programmatically

In this post, we will focus on an example of syntax simplification and an example of using JSON aggregation, construction and JSON unnesting operations. The text data is stored in the native format and operations on JSON documents will be performed using SQL. The basic structure and data are taken from Chris Saxon’s great blog, which focuses on Oracle version 19c.

create table departments_json (
  department_id   integer not null primary key,
  department_data json not null
);



insert into departments_json 
  values ( 110, JSON {
    'department': 'Accounting',
    'employees': [
        {
            'name': 'Higgins, Shelley',
            'job':  'Accounting Manager',
            'hireDate': '2002-06-07T00:00:00'
        },
        {
            'name': 'Gietz, William',
            'job': 'Public Accountant',
            'hireDate': '2002-06-07T00:00:00'
        }
    ]} 
  );

insert into departments_json 
  values ( 112, JSON {
    'department': 'Sales',
    'employees': [
        {
            'name': 'Brown, Katie',
            'job':  'Sales Manager',
            'hireDate': '2001-08-12T00:00:00'
        },
        {
            'name': 'Holder, William',
            'job': 'Sales Presales',
            'hireDate': '2000-01-02T00:00:00'
        }
    ]} 
  );

commit;

What happens if we try to insert junk instead of valid JSON?

insert into departments_json
  values ( 100, 'Random junk');

SQL Error: ORA-40441: JSON syntax error
JZN-00078: Invalid JSON keyword 'Random' (line 1, position 1)
40441. 00000 -  "JSON syntax error"
*Cause:    The provided JavaScript Object Notation (JSON) data had invalid
           syntax and could not be parsed.
*Action:   Provide JSON data with the correct syntax.

OK, that’s fine. No junk in our backyard.
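As a rough analogy outside the database, a client can run the same kind of syntax check before sending a document. The sketch below uses Python's standard json module; the helper name is_valid_json is made up for illustration, and this is not a description of how Oracle validates the column internally.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json('{"department": "Accounting"}'))
print(is_valid_json("Random junk"))
```

The second call fails for the same reason as the insert above: the text does not start with any valid JSON token.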
Now it is time for an example SQL statement in which we use all three JSON-related operations together.

select 
    JSON {
        'Year'  : EXTRACT( YEAR FROM TO_DATE( substr(hireDate,1,10), 'YYYY-MM-DD' )),
        'Total' : COUNT(*)
    }
from departments_json NESTED department_data COLUMNS (
    department,
    NESTED employees[*] COLUMNS (
        name, hireDate
    ))
GROUP BY EXTRACT( YEAR FROM TO_DATE( substr(hireDate,1,10), 'YYYY-MM-DD' ));

We create a new JSON document by executing SQL. We unnest the JSON because employees is an array, and then we group the unnested rows by the hire year extracted from that array.
The result is below and yes, it is a JSON document 🙂

{"Year":2002,"Total":2}

{"Year":2001,"Total":1}

{"Year":2000,"Total":1}
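For readers more at home in Python, the same unnest-and-aggregate logic can be sketched outside the database. This is only an illustration of what the query computes, using the two sample documents from the inserts; the names DOCUMENTS and hires_per_year are hypothetical.

```python
import json
from collections import Counter

# The two sample documents from the inserts, as JSON text.
DOCUMENTS = [
    """{"department": "Accounting", "employees": [
        {"name": "Higgins, Shelley", "job": "Accounting Manager",
         "hireDate": "2002-06-07T00:00:00"},
        {"name": "Gietz, William", "job": "Public Accountant",
         "hireDate": "2002-06-07T00:00:00"}]}""",
    """{"department": "Sales", "employees": [
        {"name": "Brown, Katie", "job": "Sales Manager",
         "hireDate": "2001-08-12T00:00:00"},
        {"name": "Holder, William", "job": "Sales Presales",
         "hireDate": "2000-01-02T00:00:00"}]}""",
]


def hires_per_year(documents):
    """Unnest the employees arrays and count hires per year."""
    counts = Counter()
    for text in documents:
        department = json.loads(text)
        for employee in department["employees"]:  # one "row" per array element
            year = int(employee["hireDate"][:4])  # same idea as EXTRACT(YEAR ...)
            counts[year] += 1
    return counts


totals = hires_per_year(DOCUMENTS)
for year, total in sorted(totals.items(), reverse=True):
    print(json.dumps({"Year": year, "Total": total}))
```

The inner loop plays the role of NESTED employees[*], and the Counter plays the role of GROUP BY with COUNT(*).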

To sum up, we see a much better and clearer syntax when operating on JSON data in Oracle version 23.

Tomasz Ziss 11:39 on 2024-01-17

OCI Data Science – Tensorflow, CNN network and images augmentation

There are two common problems in machine learning related to learning:

Overfitting
Underfitting

Various methods are used to fight them. Underfitting is usually addressed by tuning hyperparameters (settings chosen before training, as opposed to the parameters of neurons, which the network learns). Overfitting is reduced with penalty functions, a larger data set, and so on.
In this post we will look at an example of how you can improve the quality of the model by reducing the phenomenon of overfitting.

While working on the code and filtering out unnecessary log messages (yes, the Tensorflow library is very talkative 🙂, but in this case that is good, and we can control what is displayed), I came across an optimization introduced in version 2.15 -> oneDNN operations for x86 CPU. The technical explanation of the improvement is quite complicated, but I really like the detailed description of the changes, which considers the areas of existing solutions it may affect and the different categories of users of the Tensorflow library.

Let’s get back to the problem. We continue the topic started in the previous post. Overfitting is the phenomenon in which the model has learned to classify the data in the training set well and shows good values for the evaluation metrics there, but does worse on data it has not seen. Here we will limit ourselves to one metric: accuracy. There are quite a lot of metrics, and the Tensorflow documentation describes them in a practical way here.
Let’s start. The code below imports the necessary modules (tensorflow-addons module must be installed).

import tensorflow
import tensorflow_datasets as tfds
import tensorflow_addons as tfa

You can check the version like this:

print("Tensorflow version is : {}".format(tensorflow.__version__))
print("Tensorflow dataset version is : {}".format(tfds.__version__))
print("Tensorflow-addons version is : {}".format(tfa.__version__))

My Tensorflow version is 2.15, Tensorflow Datasets is 4.9.4 and Tensorflow-addons is 0.22.0. By the way, I found out that support for Tensorflow-addons ends at the end of May 2024, and that some functionality may cause problems when used with Tensorflow 2.15. What we need is not particularly complicated, so let’s take the risk and use this library.
Tensorflow has an API, Tensorflow Datasets, which allows you to download various types of data sets. We will use the “Horses or Humans” collection described here.

data = tfds.load('horses_or_humans', split='train', as_supervised=True)

test_data = tfds.load('horses_or_humans', split='test', as_supervised=True)
test_batches = test_data.batch(32)
train_batches = data.shuffle(100).batch(10)

Data sets in Tensorflow Datasets can be divided into subsets. The simplest are the training set and the test set. Here we also divide these sets into smaller batches to train the model. Let’s not get into the details for now; batch learning generally gives better results. We will use the same batch settings for each model.
Let’s define our CNN network.

model = tensorflow.keras.models.Sequential([
    tensorflow.keras.layers.Conv2D(16, (3, 3), activation='relu',
                                   input_shape=(300, 300, 3)),
    tensorflow.keras.layers.MaxPooling2D(2, 2),
    tensorflow.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tensorflow.keras.layers.MaxPooling2D(2, 2),
    tensorflow.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tensorflow.keras.layers.MaxPooling2D(2, 2),
    tensorflow.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tensorflow.keras.layers.MaxPooling2D(2, 2),
    tensorflow.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tensorflow.keras.layers.MaxPooling2D(2, 2),
    tensorflow.keras.layers.Flatten(),
    tensorflow.keras.layers.Dense(512, activation='relu'),
    tensorflow.keras.layers.Dense(1, activation='sigmoid')
])

This is a classification into two categories: human or horse. One neuron is enough for this (the last layer, with the sigmoid activation function). Convolutional layers (“highlight certain types of lines”) are used together with pooling (“sharpen”). Why were 5 such pairs used and not 3, 7, 9 or 11? I haven’t found a specific analytical explanation for this. The larger the image, the more such operations must be performed. The idea is probably to increase the number of patterns found in the images.
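One way to see what the five pairs do to a 300×300 image is to track the feature map size through the network. The sketch below is plain arithmetic, assuming the Keras defaults used in the model (no padding and stride 1 for Conv2D, stride equal to the pool size for MaxPooling2D); the helper names are made up for illustration.

```python
def conv_output(size: int, kernel: int = 3) -> int:
    """Side length after a convolution with no padding and stride 1."""
    return size - kernel + 1


def pool_output(size: int, pool: int = 2) -> int:
    """Side length after max pooling with stride equal to the pool size."""
    return size // pool


side = 300
filters = [16, 32, 64, 64, 64]  # filter counts of the five Conv2D layers
shapes = []
for n in filters:
    side = pool_output(conv_output(side))
    shapes.append((side, side, n))

# Number of values passed from Flatten to the first Dense layer
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]

print(shapes)
print(flattened)
```

Each pair roughly halves the width and height, so after five pairs the 300×300 input is down to a small grid of feature maps before it reaches the dense layers.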
Compilation of the model with a training phase (with testing/validation) looks as follows.

model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_batches, epochs=10,
                    validation_data=test_batches, validation_steps=1)

After 10 epochs of the model learning process, we have the following result.

epoch 10/10 -> 103/103 [==============================] - 13s 126ms/step - loss: 0.1704 - accuracy: 0.9406 - val_loss: 1.2022 - val_accuracy: 0.9062

On the training set we obtained an accuracy of 94.06%, while the classification accuracy on the test set (used here for validation) is 90.62%. Our network has overfitted.
One possible fix is to increase the amount of data in the training set. How can you do this for photos? By carrying out various transformations. The Tensorflow library has image transformation routines distributed across various modules, for example tensorflow.image and tensorflow_addons.
Let’s define three transformations: one that randomly flips the image and rotates it by 40 degrees, and two that crop it to its central 20% and 70%.

import math  # needed for math.radians below

def augmentimages1(image, label):
    '''
        :param image: Image to be augmented
        :param label: Label of augmented image
        :return: Augmented image and its label
    '''
    image = tensorflow.cast(image, tensorflow.float32)
    image = (image/255)
    image = tensorflow.image.random_flip_left_right(image)
    # tfa.image.rotate expects the angle in radians, so convert 40 degrees
    image = tfa.image.rotate(image, math.radians(40), interpolation='NEAREST')
    return image, label

def augmentimages2(image, label):
    '''
        :param image: Image to be augmented
        :param label: Label of augmented image
        :return: Augmented image
    '''
    image = tensorflow.cast(image, tensorflow.float32)
    image = (image/255)
    image = tensorflow.image.central_crop(image, central_fraction=0.2)
    image = tensorflow.image.resize(image, [300, 300])
    #image = tensorflow.cast(image, tensorflow.uint8)
    return image, label


def augmentimages3(image, label):
    '''
        :param image: Image to be augmented
        :param label: Label of augmented image
        :return: Augmented image
    '''
    image = tensorflow.cast(image, tensorflow.float32)
    image = (image/255)
    image = tensorflow.image.central_crop(image, central_fraction=0.7)
    image = tensorflow.image.resize(image, [300, 300])
    #image = tensorflow.cast(image, tensorflow.uint8)
    return image, label

Casting the pixel values to float32 and normalizing them are needed to perform the transformations. To apply a transformation to the training set, we perform a mapping operation:

train_aug1 = data.map(augmentimages1)
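A note on the rotation angle: according to the Tensorflow-addons documentation, tfa.image.rotate interprets its angle argument in radians, so an angle given in degrees has to be converted first. A small sketch of the conversion using only the standard library (the variable names are illustrative):

```python
import math

angle_deg = 40
angle_rad = math.radians(angle_deg)  # the value to pass for a 40-degree turn
print(round(angle_rad, 4))

# For comparison: what a bare 40 amounts to when it is read as radians,
# reduced to a single turn and expressed in degrees.
wrapped_deg = math.degrees(40 % (2 * math.pi))
print(round(wrapped_deg, 1))
```

The second number shows why the unit matters: 40 radians is several full turns plus a little over 130 degrees, which is a very different picture from a 40-degree tilt.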

The next step is to merge the original training set with the sets of modified images.
To avoid type mismatch problems when concatenating, the original images must be cast to float32 as well. The code looks like this.

def cast_data(image, label):
    '''
        :param image: Image to be cast to float32
        :param label: Label of the image
        :return: Image as float32 and its unchanged label
    '''
    image = tensorflow.cast(image, tensorflow.float32)
    return image, label

def normalize(image, label):
    # Scale pixel values by 255
    image = (image/255)
    return image, label

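train_aug1 = data.map(augmentimages1)
train_aug2 = data.map(augmentimages2)
train_aug3 = data.map(augmentimages3)

# Cast the originals to float32 so that all four sets share one dtype
data = data.map(cast_data)
data = data.concatenate(train_aug1)
data = data.concatenate(train_aug2)
data = data.concatenate(train_aug3)

# Scale pixel values. Note that the augmented images were already divided
# by 255 inside augmentimages1-3, so this step divides them a second time.
data = data.map(normalize)
train_batches = data.shuffle(100).batch(10)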

If we now train our model, we get the following final result.

Epoch 10/10
 411/411 [==============================] - 60s 147ms/step - loss: 0.5426 - accuracy: 0.6254 - val_loss: 206.1765 - val_accuracy: 0.9263

We obtained quite an interesting result. Even though the network reached a fairly low training accuracy of 62.54% (down from 94.06%!!!), the model performed better on the test set: 92.63% (up from 90.62%!!! 🙂 ). The model certainly did not overfit in this situation. Additionally, we see that the number of steps per epoch has increased to 411. That is expected: more data, more work.
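The step counts in the two logs can be checked with a little arithmetic. The sketch below assumes the training split holds 1,027 images (the figure listed in the Tensorflow Datasets catalog for horses_or_humans; it is not stated in this post) and uses the batch size of 10 from the code. It also computes the gap between training and validation accuracy for both runs.

```python
import math

TRAIN_IMAGES = 1027  # assumed size of the horses_or_humans training split
BATCH_SIZE = 10      # batch size used for train_batches


def steps_per_epoch(examples: int, batch_size: int) -> int:
    """Number of batches needed to go through all examples once."""
    return math.ceil(examples / batch_size)


original_steps = steps_per_epoch(TRAIN_IMAGES, BATCH_SIZE)
# Originals plus three augmented copies of the training set
augmented_steps = steps_per_epoch(4 * TRAIN_IMAGES, BATCH_SIZE)
print(original_steps, augmented_steps)

# Training accuracy minus validation accuracy, from the two logs
gap_before = 0.9406 - 0.9062
gap_after = 0.6254 - 0.9263
print(f"{gap_before:+.4f} {gap_after:+.4f}")
```

A positive gap means the model does better on data it has seen than on data it has not; after augmentation the sign flips.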
Someone may ask what such transformations look like and how one can be sure that the modified pixel values are exactly what they are supposed to be. Let’s use a notebook in OCI for this.

We have to install additional libraries that are not part of the pre-installed set in notebooks.

Let’s define the transformation functions, apply them, and define a procedure for displaying images using the matplotlib library in Python.

import tensorflow
import tensorflow_datasets as tfds
import tensorflow_addons as tfa
import matplotlib.pyplot as plt

print("Tensorflow version is : {}".format(tensorflow.__version__))
print("Tensorflow dataset version is : {}".format(tfds.__version__))
print("Tensorflow-addons version is : {}".format(tfa.__version__))



def visualize(original, augmented):
  fig = plt.figure()
  plt.subplot(1,2,1)
  plt.title('Original image')
  plt.imshow(original)

  plt.subplot(1,2,2)
  plt.title('Augmented image')
  plt.imshow(augmented)



import math  # needed for math.radians below

def augmentimages1(image, label):
  '''
      :param image: Image to be augmented
      :param label: Label of augmented image
      :return: Augmented image and its label
  '''
  image = tensorflow.cast(image, tensorflow.float32)
  image = (image/255)
  image = tensorflow.image.random_flip_left_right(image)
  # tfa.image.rotate expects the angle in radians, so convert 40 degrees
  image = tfa.image.rotate(image, math.radians(40), interpolation='NEAREST')
  #image = tensorflow.cast(image, tensorflow.uint8)
  return image, label



def augmentimages2(image, label):
    '''
        :param image: Image to be augmented
        :param label: Label of augmented image
        :return: Augmented image
    '''
    image = tensorflow.cast(image, tensorflow.float32)
    image = (image/255)
    image = tensorflow.image.central_crop(image, central_fraction=0.2)
    image = tensorflow.image.resize(image, [300, 300])
    #image = tensorflow.cast(image, tensorflow.uint8)
    return image, label


def augmentimages3(image, label):
    '''
        :param image: Image to be augmented
        :param label: Label of augmented image
        :return: Augmented image
    '''
    image = tensorflow.cast(image, tensorflow.float32)
    image = (image/255)
    image = tensorflow.image.central_crop(image, central_fraction=0.7)
    image = tensorflow.image.resize(image, [300, 300])
    #image = tensorflow.cast(image, tensorflow.uint8)
    return image, label



def cast_data(image, label):
    '''
        :param image: Image to be cast to float32
        :param label: Label of the image
        :return: Image as float32 and its unchanged label
    '''
    image = tensorflow.cast(image, tensorflow.float32)
    return image, label


def normalize(image, label):
    # Scale pixel values by 255
    image = (image/255)
    return image, label



data = tfds.load('horses_or_humans', split='train', as_supervised=True)
data_aug1 = data.map(augmentimages1)
data_aug2 = data.map(augmentimages2)
data_aug3 = data.map(augmentimages3)

first_element = data.take(1)
image_first, label_first = list(first_element)[0]

second_element = data_aug1.take(1)
image_second, label_second = list(second_element)[0]

third_element = data_aug2.take(1)
image_third, label_third = list(third_element)[0]

fourth_element = data_aug3.take(1)
image_fourth, label_fourth = list(fourth_element)[0]

visualize(image_first,image_second)
visualize(image_first,image_third)
visualize(image_first,image_fourth)



# Let's combine them
data = data.map(cast_data)
new_data = data.concatenate(data_aug1)
print(new_data.cardinality())

new_data = new_data.map(normalize)
first_element = new_data.take(1)
image_first, label_first = list(first_element)[0]
image_second, label_second = image_first, label_first

visualize(image_first,image_second)

This is the rotation transformation in action.

20% centralization

70% centralization

And last, let’s check that the original image is still the same.

To sum up, you can fight against overfitting of the neural network by expanding the training set in an artificial way – in the case of CNN, generating new images by transforming existing ones. Oracle Cloud Infrastructure provides multiple services that enable end-to-end support for the lifecycle of a machine learning model and the ability to quickly conduct experiments on notebooks.

Tomasz Ziss 14:30 on 2024-01-09
Tags: #machinelearning, #neuralnetworks, #oci, #oracle, #python

OCI Data Science – CNN vs DNN learning iterations elapsed time

A few days ago I was talking to my friend about neural networks and image category recognition using the MNIST Fashion dataset. You can see a visualization of this dataset here.
MNIST Fashion is a collection of grayscale images of clothing. The training set has 60,000 elements, while the test set has 10,000. I started talking about how you can significantly increase the accuracy of image classification using convolution and pooling operations.
My friend, an electronics graduate, immediately realized that a similar operation is performed on signals. My own education was more on the IT side, but the fact is that the same mathematical operation, convolution, is behind both 🙂

If you are interested in details, you can find them here. In the case of CNN, the convolution operation is the operation of modifying pixels in an image using a filter. Filters can have different sizes – 3×3, 5×5, and so on. The new pixel value is calculated as the sum of the products of the adjacent pixel values and the values in the filter.
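The "sum of the products" rule is easy to try by hand. Below is a minimal pure-Python sketch of applying a 3×3 filter to a tiny grayscale image, with no padding and a stride of 1; as CNN libraries do, the filter is applied without flipping it. The function name, the sample image and the filter values are made up for illustration.

```python
def convolve2d(image, kernel):
    """Slide kernel over image (no padding, stride 1) and return the result."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    output = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # New pixel value: sum of products of the neighbourhood and the filter
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# A 4x4 image that is dark on the left and bright on the right
image = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
]
# A filter that responds to vertical edges
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

result = convolve2d(image, vertical_edge)
print(result)
```

The output is smaller than the input (2×2 instead of 4×4) because the filter has to fit entirely inside the image, and every value is large because each window straddles the dark-to-bright edge.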

Well, my friend asks, how much better is a network that uses convolution and pooling (CNN, convolutional neural network) than a deep neural network (DNN) without convolution and pooling, which works on the unmodified pixels of the input image?

DNN – accuracy of 88% after 50 learning iterations
CNN – accuracy of 99% after 20 learning iterations

My friend, as befits a true engineer, goes on and asks how long the training iterations in these networks take.
I replied that they take a long time. Well, I hadn’t measured it, but you can’t leave an answer like that forever. Of course, my laboratory is not all-powerful, but such times can be measured. For the implementation, I used a notebook available in the OCI cloud and the Tensorflow library. Here you can find a great series describing a competing library, PyTorch, and how it can be used in Data Science services in OCI.
The Tensorflow library automatically displays statistics about each iteration of the neural network training:

1875/1875 [==============================] - 27s 13ms/step - loss: 0.4399 - accuracy: 0.8413
Epoch 2/50

After carrying out the measurements, I got the following results:

DNN – iteration time in the range of 5-7 seconds
CNN – iteration time in the range of 23-30 seconds

So we have a 4-5 times increase in the case of the CNN network.
Well, the CNN network is more complicated – in my implementation it has 7 layers, and the DNN network only 3.
My friend and I expected that it might even be an order of magnitude. But I started wondering how many parameters (weights and biases in neurons + values in filters) must be learned in each of these two types of networks. The Tensorflow library has a very cool function that you can use to describe a model -> model.summary()

For my CNN network:

Model: "sequential"
=================================================================
 Layer (type)                Output Shape              Param #
=================================================================
 conv2d (Conv2D)             (None, 26, 26, 64)        640

 max_pooling2d (MaxPooling2  (None, 13, 13, 64)        0
 D)

 conv2d_1 (Conv2D)           (None, 11, 11, 64)        36928

 max_pooling2d_1 (MaxPoolin  (None, 5, 5, 64)          0
 g2D)

 flatten (Flatten)           (None, 1600)              0

 dense (Dense)               (None, 128)               204928

 dense_1 (Dense)             (None, 10)                1290

=================================================================
Total params: 243786 (952.29 KB)
Trainable params: 243786 (952.29 KB)
Non-trainable params: 0 (0.00 Byte)
=================================================================

For DNN network:

Model: "sequential"
=================================================================
 Layer (type)                Output Shape              Param #
=================================================================
 flatten (Flatten)           (32, 784)                 0

 dense (Dense)               (32, 128)                 100480

 dense_1 (Dense)             (32, 10)                  1290

=================================================================
Total params: 101770 (397.54 KB)
Trainable params: 101770 (397.54 KB)
Non-trainable params: 0 (0.00 Byte)
=================================================================

This means the CNN had to learn roughly 2.4 times as many parameters as the DNN (243,786 versus 101,770) to correctly classify the images into clothing categories. This resulted in a 4-5 times longer iteration time in the learning phase. Well, yes, we expected an increase in time, but we bet on an increase of at least an order of magnitude.
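The parameter counts reported by model.summary() can be reproduced by hand: a convolutional layer learns one weight per filter cell and input channel plus one bias per filter, and a dense layer learns one weight per input plus one bias per unit. The sketch below assumes 3×3 filters and a single-channel 28×28 input, which is consistent with the output shapes in the two summaries; the helper names are made up for illustration.

```python
def conv2d_params(kernel: int, in_channels: int, filters: int) -> int:
    """Weights of every filter plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs: int, units: int) -> int:
    """One weight per input plus one bias, for every unit."""
    return (inputs + 1) * units


cnn = [
    conv2d_params(3, 1, 64),        # conv2d
    conv2d_params(3, 64, 64),       # conv2d_1
    dense_params(5 * 5 * 64, 128),  # dense, fed by the 1600 flattened values
    dense_params(128, 10),          # dense_1
]
dnn = [
    dense_params(28 * 28, 128),     # dense, fed by the 784 flattened pixels
    dense_params(128, 10),          # dense_1
]

print(cnn, sum(cnn))
print(dnn, sum(dnn))
print(round(sum(cnn) / sum(dnn), 2))
```

Most of the CNN's parameters sit in the first dense layer rather than in the filters, so the parameter count alone does not explain the longer iterations; the convolutions also have to be evaluated at every position of the image.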
Someone might think that 99% on the training set is better than 88%. Well, it depends 🙂. In machine learning there is the phenomenon of overfitting: the model should not learn the data from the training set “too accurately”, because it then becomes inflexible to slightly different patterns in the test data. More can be found here.
It also reminded me of a similar task from the past: image recognition. The point was to correctly recognize road signs in photos. The implementation consisted of extracting image features and building rules like: “If a sign is a triangle and has a border in the color …”. Playing with neural networks and the MNIST data set also shows how traditional programming differs from machine learning. Traditional programming implements rules, while machine learning finds rules from data 🙂


Tomasz Ziss Oracle DBA blog

Monthly Archives: January 2024
