Skip to content

Latest commit

 

History

History
98 lines (62 loc) · 5.92 KB

README.md

File metadata and controls

98 lines (62 loc) · 5.92 KB

Optical Character Recognition (OCR) Project

This project is a deep learning-based Optical Character Recognition (OCR) system designed to extract and transcribe text from images, with a particular emphasis on supporting Arabic script transcription. It leverages convolutional neural networks (CNNs) for image feature extraction and sequence prediction techniques for text recognition.

Table of Contents

Overview

The OCR project aims to build a robust system capable of recognizing and transcribing text from images, particularly focusing on Arabic language support. This system could be used in various applications, from digitizing printed documents to enabling text recognition in real-time camera feeds.

Project Objectives

  • Text Extraction: Develop a model that can accurately extract text from images.
  • Arabic Script Support: Provide accurate transcription of recognized text into Arabic script.
  • Real-Time Prediction: Enable the system to make predictions in real-time through a user-friendly interface.
  • Model Generalization: Ensure the model performs well across different fonts, sizes, and styles of text.

Technical Approach

Data Preprocessing

The input images undergo several preprocessing steps to prepare them for training and prediction:

  • Resizing: Images are resized to a standard size to ensure consistency in the input shape.
  • Normalization: Pixel values are normalized to fall within a specific range, typically between 0 and 1, to stabilize the training process.
  • Data Augmentation: Techniques such as rotation, translation, and scaling are applied to augment the training data, improving the model's robustness.

Model Architecture

The OCR model is designed to balance complexity and efficiency. The architecture consists of the following components:

  • Convolutional Layers (CNN): These layers are responsible for extracting high-level features from the input images, such as edges, textures, and character shapes.
  • Recurrent Layers (RNN/LSTM/GRU): Following the CNN layers, recurrent layers are employed to capture the sequential nature of the text, handling the varying lengths of text sequences.
  • CTC (Connectionist Temporal Classification) Layer: The final output layer uses CTC loss, which is ideal for sequence prediction tasks where the alignment between input and output is unknown.
Screenshot 2024-09-07 113111 Screenshot 2024-09-07 113141

Data Preparation

IFN/ENIT-database

Screenshot 2024-09-07 114423 Screenshot 2024-09-07 113750

for more information check out the dataset pdf or visit this link : http://www.ifnenit.com/

Training Process

The training process involves:

  1. Batch Processing: Training data is processed in batches to efficiently utilize computational resources.
  2. Loss Calculation: The CTC loss function is used, which allows the model to predict sequences of characters without needing a predefined alignment between input images and output text.
  3. Optimization: An optimizer like Adam or RMSprop is employed to minimize the loss, adjusting the weights of the network through backpropagation.
Screenshot 2024-09-07 115353 Screenshot 2024-09-07 115447

Evaluation

The model's performance is evaluated based on its ability to correctly predict the text in the test images. Metrics such as accuracy, character error rate (CER), and word error rate (WER) are used to quantify performance.

Screenshot 2024-09-07 120420 Screenshot 2024-09-07 120441

GUI Implementation

A graphical user interface (GUI) is developed using Tkinter to allow users to interact with the OCR model:

  • Image Upload: Users can upload images containing text.
  • Prediction Display: The predicted text and its Arabic transcription are displayed.
  • Real-Time Feedback: The system provides immediate feedback on the uploaded image, showing the recognized text.

Challenges and Solutions

  • Character Overlap: One of the challenges was dealing with characters that overlap or touch in cursive fonts, particularly in Arabic. This was mitigated by refining the preprocessing steps and tuning the CNN architecture.
  • Sequence Length Variation: The model needed to handle varying lengths of text sequences, which was addressed using the CTC loss function, allowing the model to flexibly predict sequences without requiring fixed alignment.

Future Directions

  • Multi-Language Support: Expanding the OCR system to support multiple languages beyond Arabic.
  • Improved Accuracy: Experimenting with more complex architectures, such as Transformer-based models, to improve recognition accuracy.
  • Deployment: Packaging the model and GUI into a deployable application for wider use.