Personality Prediction System Based on Graphology using Machine Learning

Hyeon Gu Kim
Jan 5, 2022

Team Members: Lucy Hwang, Yashaswini Kalva, Hyeon Gu Kim, Kaushik Kumaran, Archit Patel

Abstract

Graphology is a method of identifying, evaluating and understanding human personality traits through the strokes and patterns revealed by handwriting. Handwriting reveals the true personality including emotional outlay, fears, honesty, defenses and many others. Professional handwriting examiners called graphologists often identify the writer with a piece of handwriting. Accuracy of handwriting analysis depends on how skilled the analyst is. Although human intervention in handwriting analysis has been effective, it is costly and prone to error. Hence the proposed methodology focuses on developing a system that can predict personality traits with the aid of machine learning without human intervention. To make this happen, we considered seven handwriting features: (i) size of letters, (ii) slant of the writing, (iii) baseline, (iv) pen pressure, (v) spacing between letters, (vi) spacing between words and (vii) top margin in a document to predict eight personality traits of a writer as shown in Figure 1.0.

Figure 1.0: Handwriting attributes and respective personality behavior

After extracting all these features from the images containing the handwriting we applied a Random Forest classifier for each personality trait of the writer. We also built ANN and CNN models on the raw image data.

Introduction

Graphology is defined as the analysis of the physical characteristics and patterns of an individual’s handwriting to understand his or her psychological state at the time of writing. Handwriting is a kind of projective test where the unconscious comes to the fore and expresses itself in the conscious [1]. A graphologist can roughly interpret an individual’s character and personality traits by analysing the handwriting. We can therefore use graphology to determine the personality and character profile of a person.

Objective

The objective of this project is to develop a system that takes an image document containing the handwriting of a person and outputs a few of his/her personality traits based on some selected handwriting features. Carefully analysing all the significant characteristics of handwriting manually is not only time consuming but prone to errors as well. Automating the analysis of a few selected characteristics of handwriting will speed up the process and reduce the errors.

Motivation
Handwriting analysis is one of several methods for understanding the psychology of a person. Graphology is applied in the following two areas:

  • Psychological analysis: Graphology is used clinically by counsellors and psychotherapists.
  • Employment profiling: Companies use handwriting analysis for recruitment. A graphological report is meant to be used in conjunction with other tools, such as comprehensive background checks, practical demonstration or record of work skills.

Computer-based handwriting analysis is fast and consistent, and it can pick up patterns that are easy to miss in visual inspection. Machine-learning-assisted analysis is also efficient and less exposed to human error.

Literature Review

The project focuses on developing a system that predicts some psychological traits of a person by analyzing his or her handwriting using machine learning. Several researchers have done similar work on computer-aided graphology.

Shitala Prasad, Vivek Kumar Singh and Akshay Sapre of the Department of Information Technology, Indian Institute of Information Technology Allahabad, India predicted human personality from handwriting using support vector machines [4]. Navin Karanth, Vijay Desai and S. M. Kulkarni of the Mechanical Engineering Department, National Institute of Technology Karnataka, India predicted a writer’s personality through graphology without any machine learning [5]. Champa H N, Assistant Professor in the Department of Computer Science and Engg., University Visvesvaraya College of Engineering, Karnataka, India, and Dr. K R Ananda Kumar, Professor in the Department of Computer Science and Engg., SJB Institute of Technology, Karnataka, India, worked on computer-aided graphology using artificial neural networks [6]. These works differ fundamentally in their selection of handwriting features, extraction methods, classification and output.

Problem Statement

We propose a system that automates the basic handwriting analysis tasks of graphology to determine a few important personality traits. Seven handwriting features are extracted from a sample handwriting image. Each of the seven resulting raw values is then assigned to the corresponding category of its feature. From these categories, the classifiers predict the personality traits of the writer. An overview is shown below:

Figure 1.1: The proposed system — A handwriting sample is taken and the personality traits are predicted.

Data Acquisition
We obtained the data from the IAM Handwriting Database of the Research Group on Computer Vision and Artificial Intelligence INF, University of Bern, Switzerland. The data is readily available for download for non-profit research purposes. The database contains 1539 pages of scanned text, for which 657 writers contributed samples of their handwriting. Each handwriting sample was labelled with the corresponding psychological traits by manually studying the document.

Pre Processing

The handwriting images we obtained contain unwanted noise, printed text and lines. The aim of pre-processing is to make the image data suitable for feature extraction, for which we adopted the following methods:

1. Image resizing

The images were cropped and saved as PNG files with an automated action script. The width of all images is now 850 pixels, and the height depends on the amount of handwriting in the image. PNG is used instead of JPEG because it is a lossless format and is better suited to storing images of text, printed or handwritten.

Figure 1.21: Original image data sample with 850px width
Figure 1.22: Cropped and normalized image data from the IAM Handwriting Database.

2. Noise Removal

Image noise is defined as random variation of brightness or color information in images, and is usually an aspect of electronic noise.

We removed the noise with a bilateral filter. The two images below show that the filter preserves the edges of the subjects in the image.

Figure 1.31: Noisy image before any filter is applied.
Figure 1.32: Noiseless image after bilateral filter is applied.
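To illustrate why a bilateral filter keeps edges sharp, here is a minimal NumPy-only sketch. It is not the code used in the project (the article does not show it). The function name `bilateral_filter`, the parameter values and the synthetic test image are all assumptions chosen for demonstration. Each output pixel is a weighted average of its neighbours, where the weight shrinks both with spatial distance and with difference in intensity, so pixels on the other side of a stroke edge contribute almost nothing.

```python
import numpy as np

def bilateral_filter(img, radius=2, sigma_space=2.0, sigma_color=25.0):
    """Naive bilateral filter for a 2-D grayscale image (returns floats)."""
    img = img.astype(np.float64)
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    weighted_sum = np.zeros_like(img)
    weight_total = np.zeros_like(img)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Neighbour image shifted by (dy, dx)
            shifted = padded[radius + dy: radius + dy + h,
                             radius + dx: radius + dx + w]
            # Weight falls off with spatial distance ...
            spatial = np.exp(-(dy * dy + dx * dx) / (2 * sigma_space ** 2))
            # ... and with intensity difference, which protects edges
            tonal = np.exp(-((shifted - img) ** 2) / (2 * sigma_color ** 2))
            weight = spatial * tonal
            weighted_sum += weight * shifted
            weight_total += weight
    return weighted_sum / weight_total

# Synthetic image (an assumption): dark left half, bright right half, plus noise
rng = np.random.default_rng(0)
clean = np.zeros((20, 20))
clean[:, 10:] = 200.0
noisy = clean + rng.normal(0, 5, clean.shape)
smoothed = bilateral_filter(noisy)

print("mean error before:", np.abs(noisy - clean).mean())
print("mean error after: ", np.abs(smoothed - clean).mean())
```

In practice a library routine such as OpenCV's `cv2.bilateralFilter` would be much faster. The loop version above is only meant to make the weighting explicit.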

3. Grayscale and Binarization

The image instances were converted to grayscale and binarized using inverted global thresholding. An example is given in Figure 1.4.

Figure 1.4: A binarized version of the image
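The grayscale conversion and inverted global thresholding can be sketched in a few lines of NumPy. This is an illustrative sketch, not the project's code: the helper names, the luminance weights and the threshold of 127 are assumptions. "Inverted" means dark ink becomes white foreground (255) and light paper becomes black background (0).

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 uint8 image to grayscale with common luminance weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return (rgb[..., :3] @ weights).round().astype(np.uint8)

def binarize_inverted(gray, threshold=127):
    """Inverted global threshold: pixels above the threshold -> 0, others -> 255."""
    return np.where(gray > threshold, 0, 255).astype(np.uint8)

# Tiny synthetic "page" (an assumption): white paper with a dark ink patch
page = np.full((4, 6, 3), 255, dtype=np.uint8)
page[1:3, 2:4] = 30

binary = binarize_inverted(to_grayscale(page))
print(binary)
```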

4. Contour and Warp Affine Transformation

After the noise was removed and the image was converted to grayscale and inversely binarized, the lines of handwriting were straightened using the dilation, contour and warp affine transformation functions of the OpenCV library.

Figure 1.5: The sample image after applying dilation with a 5x100 kernel. The foreground pixels are spread horizontally.

5. Horizontal and Vertical Projections

In the context of this project, the horizontal projection of an image is a Python list holding the sum of the pixel values of each row of the image, while the vertical projection is a Python list holding the sum of the pixel values of each column. Both operations are performed on grayscale images.
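As a small sketch of these two projections (the function names and the 3x3 sample image are assumptions for illustration):

```python
import numpy as np

def horizontal_projection(img):
    """Sum of pixel values of each row, as a Python list."""
    return img.sum(axis=1, dtype=np.int64).tolist()

def vertical_projection(img):
    """Sum of pixel values of each column, as a Python list."""
    return img.sum(axis=0, dtype=np.int64).tolist()

# 255 marks a foreground (ink) pixel in an inversely binarized image
img = np.array([[0, 255, 255],
                [0,   0,   0],
                [255, 0,   0]], dtype=np.uint8)

rows = horizontal_projection(img)
cols = vertical_projection(img)
print(rows)  # [510, 0, 255]
print(cols)  # [255, 255, 255]
```

A row whose sum is zero contains no ink, which is one way such projections can be used to locate gaps between lines of text.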

Feature Extraction

The features used for building the Random Forest models are: Baseline; Line; Letter Size; Line Spacing; Word Spacing; Top Margin; Pen Pressure; Slant of Letters.

Classification Labels

  1. Openness
  2. Conscientiousness
  3. Agreeableness
  4. Neuroticism

Random Forest

Random forest is widely used for predictive modeling and behavior analysis. We chose it because it does not require feature scaling and is less affected by noise.

Given below are the steps followed for predicting personality traits using Random Forest:

Figure 1.6: Steps for predicting personality traits using Random Forest

For predicting each personality trait a separate random forest classifier was built. Given below is a snippet of the input data fed into the models:

Figure 1.7: Input data fed into the models
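A minimal sketch of the "one classifier per trait" setup, assuming scikit-learn. The column names, the synthetic category codes and the synthetic labels below are all made up for demonstration and are not the project's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature and trait names
FEATURES = ["baseline", "letter_size", "line_spacing", "word_spacing",
            "top_margin", "pen_pressure", "slant"]
TRAITS = ["openness", "conscientiousness", "agreeableness", "neuroticism"]

rng = np.random.default_rng(42)
# Each feature is a category code (0, 1 or 2), as after feature categorization
X = rng.integers(0, 3, size=(200, len(FEATURES)))
# Synthetic binary labels: trait i depends only on feature i (demonstration only)
y = {trait: (X[:, i] > 0).astype(int) for i, trait in enumerate(TRAITS)}

# Train one independent classifier per trait on the first 150 samples
models = {}
for trait in TRAITS:
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[:150], y[trait][:150])
    models[trait] = clf

# Evaluate each classifier on the remaining 50 samples
scores = {t: models[t].score(X[150:], y[t][150:]) for t in TRAITS}
print(scores)
```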

Hyperparameter Tuning

We used randomized search to find the best hyperparameters for the RandomForest classifier. The following hyperparameters were tuned:

  • n_estimators
  • max_features
  • max_depth
  • min_samples_split
  • ccp_alpha
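The search over the hyperparameters listed above could look like the following sketch, assuming scikit-learn's `RandomizedSearchCV`. The candidate values, the number of sampled configurations (`n_iter`), the number of folds and the synthetic data are assumptions. The article does not state which ranges were actually searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data: 120 samples, 7 features
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 7))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Candidate values are illustrative assumptions
param_distributions = {
    "n_estimators": [4, 10, 50, 100],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": [2, 5, 10],
    "ccp_alpha": [0.0, 0.001, 0.01],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=8,            # number of random configurations to try
    cv=3,                # 3-fold cross-validation
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Randomized search samples a fixed number of configurations instead of trying every combination, which keeps the cost bounded when several hyperparameters are tuned at once.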

Feature Importance

Using the Random Forest models, we were able to assess the importance of the features extracted in the pre-processing step, because each model assigns every feature an importance score aggregated over all of its trees (in scikit-learn, the impurity reduction contributed by splits on that feature).

Below is the summary of feature importance:

Figure 1.8: Feature importance
Figure 1.9: The most important features for each personality type
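As a hedged sketch of how such a ranking can be read off a fitted model, assuming scikit-learn's `feature_importances_` attribute: the feature names and the synthetic data are assumptions, and the label is deliberately constructed to depend on a single feature so that the expected ranking is known in advance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names
feature_names = ["baseline", "letter_size", "line_spacing", "word_spacing",
                 "top_margin", "pen_pressure", "slant"]

rng = np.random.default_rng(1)
X = rng.normal(size=(300, len(feature_names)))
# Synthetic label that depends only on "pen_pressure" (column 5)
y = (X[:, 5] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances are normalized to sum to 1; sort from most to least important
ranking = sorted(zip(feature_names, clf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name:>13}: {score:.3f}")
```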

Results

Below are the results obtained from Random Forest:

  • Test Accuracy = 97.06%
  • Test Recall Score = 93.70%
  • Test Precision Score = 100%

The accuracy achieved by the random forest classifier with 4 trees is 97.06%. Changing the number of estimators, max features, depth, ccp_alpha and min_samples_split for this data didn’t significantly improve the results.

ANN

Before we dive into the art of neural networks, we first need to understand what an ANN is. In short, an Artificial Neural Network (ANN) is a machine learning algorithm that mimics the processing of the brain. In other words, an ANN enables machines to process data in a way similar to the human brain. The figure below shows how a biological neuron and an ANN process data similarly:

Figure 2.0: Biological Neuron vs ANN

This is the simplest form of ANN, which consists of inputs (x1, x2, …, xn), weights (w1, w2, …, wn) and an activation function. Similar to how a neuron in the human brain takes inputs through its dendrites, processes them from the nucleus to the axon and outputs the results at the axon terminals, an ANN takes input data, gives a weight to each input, passes the weighted inputs through an activation function and outputs the result.
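The single neuron described above fits in a few lines. This is a minimal sketch: the sigmoid activation, the optional bias term `b` and the example numbers are assumptions not taken from the article.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b=0.0, activation=sigmoid):
    """Weighted sum of the inputs followed by an activation function."""
    return activation(np.dot(w, x) + b)

x = np.array([1.0, 2.0, 3.0])    # inputs x1..x3
w = np.array([0.5, -1.0, 0.5])   # weights w1..w3

# Weighted sum: 0.5*1 - 1.0*2 + 0.5*3 = 0.0, and sigmoid(0) = 0.5
out = neuron(x, w)
print(out)  # 0.5
```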

Because of the vast amount of complex data from the preprocessing steps, the simplest form of ANN above is not enough. For this reason, we decided to include two hidden layers, which distill redundant data and make the process more efficient and faster. This is called a Multi Layer Perceptron (MLP), which consists of an input layer, one or more hidden layers and an output layer (Figure 2.1).

Figure 2.1: Multi Layer Perceptron
Figure 2.2: Feedforward & Backpropagation

In addition to processing from the input layer to the output layer, which is called a feedforward network, what makes an ANN even more powerful is the opposite notion, the backpropagation algorithm (Figure 2.2). With backpropagation, an ANN can learn from its errors and improve the model further.

Now that we have a better understanding of ANNs, let’s see how we implemented one for predicting personality from handwriting. The overall process is quite simple: we convert the pre-processed data into arrays of pixels and feed the arrays into the ANN. The figure below shows a high-level view of the ANN process in this project.

Figure 2.3: High-level view of the ANN process: With datasets of handwriting images, we converted them into arrays of pixels and put them into ANN model

Although the data had already been preprocessed, we still needed a data transformation step in which we encoded the categorical variable (the personality labels, which are our target variable), reshaped the data matrices for the ANN, and split the data into train, validation and test sets (70%, 15% and 15%, respectively). Then we used Keras from TensorFlow for the ANN:

Figure 2.4 : Code snippet of ANN model
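The label encoding, reshaping and 70/15/15 split described above can be sketched with scikit-learn as follows. The placeholder images, their 32x32 size, the number of samples and the use of stratified splitting are assumptions for illustration, not details from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Placeholder data: 100 blank 32x32 "images" with one label each
labels = np.array(["openness", "conscientiousness",
                   "agreeableness", "neuroticism"] * 25)
images = np.zeros((100, 32, 32), dtype=np.uint8)

# Encode the string labels as integers (classes are sorted alphabetically)
y = LabelEncoder().fit_transform(labels)
# Flatten each image into one row of pixel values
X = images.reshape(len(images), -1)

# First split off 30%, then halve it into validation and test sets
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```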

The ANN is constructed as follows:

  • Rescaling & flattening
  • An input layer — 113 nodes, activation function=ReLU
  • Two hidden layers — 128, 64 nodes, activation function=ReLU
  • An output layer — 4 nodes, activation function=Softmax
  • Two regularized (“Dropout”) layers between each layer to prevent overfitting
  • Sparse Categorical Cross entropy loss function
  • RMSprop optimizer
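To make the activation and loss functions listed above concrete, here is a small NumPy sketch of ReLU, Softmax over 4 output nodes and the sparse categorical cross-entropy loss. It is a from-scratch illustration, not the Keras implementation, and the example logits and labels are assumptions.

```python
import numpy as np

def relu(z):
    """ReLU: pass positive values through, clamp negatives to zero."""
    return np.maximum(0.0, z)

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sparse_categorical_crossentropy(y_true, probs):
    """Mean negative log-probability of the true class (integer labels)."""
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.log(picked).mean())

# Two samples, four classes (one output node per personality label)
logits = np.array([[2.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0]])
probs = softmax(logits)
loss = sparse_categorical_crossentropy(np.array([0, 3]), probs)

print(probs.round(3))
print(round(loss, 4))
```

"Sparse" here refers to the label format: the true class is given as a single integer per sample rather than as a one-hot vector.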

Hyperparameter Tuning

  • Epochs: 60
  • Batch size: tried batch sizes of 16, 32, and 64

We chose the ReLU activation function because it mitigates the vanishing gradient problem (its gradient does not saturate for positive inputs) and is computationally light and fast. We chose the Softmax activation function for the output layer since it calculates the relative probability of each class, which suits multiclass classification problems like this project. Similarly, the Sparse Categorical Cross entropy loss function was used because this is a multiclass classification problem with integer-encoded labels. Lastly, we used the RMSprop optimizer because it adapts its step size as it moves toward a minimum, which can make training faster than with a fixed learning rate. We also tried the Adam optimizer, but the accuracy turned out to be a little lower than with RMSprop. The figures below show the results of the ANN:

Figure 2.5: Train accuracy vs Test accuracy

We can see from the train and test accuracy curves in Figure 2.5 that the ANN is performing well. One interesting fact is that the test accuracy starts to outperform the train accuracy after the 34th epoch. Next, let’s see the relationship between accuracy and loss.

Figure 2.6: Relationship between the test loss and accuracy

Similarly, in Figure 2.6, we can observe the equilibrium between the accuracy and the loss at the 34th epoch and the accuracy continues to increase as the loss continues to decrease.

Figure 2.7: Train loss vs Test loss

The above graph compares the train loss and the test loss. Interestingly, the test loss diverges from the train loss around epoch 20.

Figure 2.8: Classification Report

The above figures show a multiclass confusion matrix and a classification report (Figure 2.8) from our ANN. We can observe that the model has successfully classified the data into our four personality labels: agreeableness, conscientiousness, neuroticism and openness. One thing to note is that the model has the lowest F1 score on conscientiousness and the highest on openness. This could be due to the amount of training data for each class: openness has the most training samples, while conscientiousness has the fewest.
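For readers unfamiliar with how per-class F1 scores are derived from a confusion matrix, here is a minimal sketch. The counts in `cm` are made up for illustration and are not the results of this project. Rows are assumed to be true classes and columns predicted classes.

```python
import numpy as np

def per_class_f1(cm):
    """Per-class F1 from a confusion matrix (rows = true, columns = predicted)."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)   # correct / everything predicted as the class
    recall = tp / cm.sum(axis=1)      # correct / everything truly in the class
    return 2 * precision * recall / (precision + recall)

classes = ["agreeableness", "conscientiousness", "neuroticism", "openness"]
# Made-up counts, for illustration only
cm = np.array([[8, 1, 1, 0],
               [2, 5, 2, 1],
               [0, 1, 9, 0],
               [0, 0, 1, 9]])

f1 = per_class_f1(cm)
for name, score in zip(classes, f1):
    print(f"{name:>17}: {score:.3f}")
```

Because F1 combines precision and recall, a class that is often confused with others scores low even when overall accuracy looks high.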

We implemented an ANN because of three key advantages:

  • Can learn and model non-linear and complex relationships
  • Doesn’t impose fixed constraints on the input variables
  • Robust to data with heteroskedasticity (highly volatile, non-constant variance)

However, an ANN is not an all-mighty algorithm. Recall that our objective is to predict personality from handwriting, and the data are images. Unfortunately, an ANN cannot take the image data as it is; the images have to be converted to flat arrays of numbers, which could lead to the loss of important information. Furthermore, the high test accuracy could point to an overfitting problem in the future. Therefore, we decided to try another popular neural network model: the Convolutional Neural Network (CNN).

CNN

Inspired by human visual perception, a CNN follows a hierarchical model that builds a network shaped like a funnel and ends in a fully-connected layer, where all the neurons are connected to each other and the output is processed. The input image is fed into the CNN layers, which are trained to extract relevant features from the image. A CNN convolves learned features with the input data and uses 2D convolutional layers, making this architecture well suited to processing 2D data such as images.

Figure 2.9: How CNN classifies handwritten digits

CNN Methodology

Data Preprocessing

As a first step, we separated the data into training, validation and test sets in the ratio of 70%, 15% and 15% respectively.

Since the training set had only 657 images, Data Augmentation was used in an effort to increase the number of samples.

Model Building

Since the number of available images was limited even after augmentation, we used Transfer Learning so that the model takes its lower-level features from a pre-trained network. The base model was Inception-ResNet-v2 with weights pre-trained on the ImageNet dataset.

The base model was followed by these layers:

  • Max Pooling layer: helps in extracting sharp and smooth features
  • Dropout layer: used to prevent the overfitting that was initially observed
  • Batch Normalization layer: used to scale the inputs and thereby make the network more stable
  • Fully connected layer: 50 units

The ReLU activation function was used for all the hidden layers and the Softmax activation was used for the output layer. The Adam optimizer was used for gradient descent.

Model checkpoints were incorporated to store the best weights of the model.

Hyperparameter Tuning

  • Epochs: 30
  • Batch size: tried 16, 32 and 64; 16 performed best
  • Learning rate: 0.001

These values were selected after running several iterations.

CNN Results

  • Accuracy on the training set — 78.9%
  • Accuracy on the validation set — 67.3%
  • Accuracy on the test set — 65.5%
  • Precision : 66.3%
  • Recall : 62.5%
Figure 3.0: Train accuracy vs Validation accuracy
Figure 3.1: Code snippet of CNN

CNN Next Steps

The following can be tried as next steps to improve the accuracy of the model:

  • Unfreeze certain layers and retrain those layers on our dataset
  • Try other architectures that could potentially outperform Inception-ResNet-v2 on the given dataset
  • Augment the data further for the imbalanced classes
  • Tune parameters such as the optimizer and the number of layers

Conclusion

We used machine learning to automate the graphology process and determine important personality traits with different classifiers: Random Forest, ANN and CNN. Features were extracted after image preprocessing. The feature importance we obtained for each trait from the classifiers was similar to the importance graphologists give those features when determining personality traits. Random Forest performed better than the CNN and ANN, likely because subject knowledge was incorporated into the pre-processing phase.

However, we are aware that additional resources are available to better understand human personality. The samples were not standardized for pen type or ink color. With standardized pen, paper and margins, as well as guiding personality questions, we could further enhance our automated handwriting analysis to produce more accurate results.

References

[1] D. J. Antony. Personality Profile Through Handwriting Analysis. Anugraha Publications, 2008.

[2] Karen Amend and Mary S. Ruiz. Handwriting Analysis The Complete Basic Book. New Page Books, 1980.

[3] Alessandro Vinciarelli, Juergen Luettin. A new normalization technique for cursive handwritten words. Pattern Recognition Letters 22 (2001) 1043–1050 IDIAP Switzerland, 26 February 2001.

[4] Shitala Prasad, Vivek Kumar Singh, Akshay Sapre. Handwriting Analysis based on Segmentation Method for Prediction of Human Personality using Support Vector Machine. International Journal of Computer Applications (0975–8887) Volume 8, No. 12, October 2010.

[5] Vikram Kamath, Nikhil Ramaswamy, P. Navin Karanth, Vijay Desai and S. M. Kulkarni. Development of an Automated Handwriting Analysis System. ARPN Journal of Engineering and Applied Sciences Vol. 6, No. 9, September 2011.

[6] Champa H N, K R AnandaKumar. Artificial Neural Network for Human Behavior Prediction through Handwriting Analysis. International Journal of Computer Applications (0975–8887) Volume 2, No. 2, May 2010.
