Based on the Im2Tex project, available on GitHub.

Im2Tex

Im2Tex is a tool that converts images of mathematical formulas (in formats such as PNG and JPEG) into LaTeX code. Our system combines a Convolutional Neural Network (CNN) backbone with a Transformer decoder to recognize and interpret complex mathematical symbols, expressions, and equations within the images. The tool is designed to be versatile and user-friendly while producing precise LaTeX output.

How it was built

Our starting point was the Image-to-Markup project developed by the Harvard NLP group, which served as a valuable template. We then customized and reworked its data generation tools based on our needs and findings, which led to our own data generation project, Printed-Latex-Data-Generation. This project enables users to parse ArXiv papers, extract LaTeX code, and generate the corresponding .png images and LaTeX code labels. The tools allow the image resolution to be customized, because we found that higher-resolution training data significantly improves model performance. Our generated dataset demonstrates what these tools can do. Subsequently, we turned to the preprint by S. Singh (2018), Image-to-Markup Generation with Coarse-to-Fine Attention, as a foundation for our image-to-LaTeX model development. Our final model employs a ResNet34 CNN backbone and a Transformer decoder. We put extensive work into data preprocessing so that the model generalizes beyond the dataset, and this is where higher resolution and our custom data generation tools proved essential. For a more in-depth exploration, please visit the GitHub page of our project.
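To make the "extract LaTeX code" step more concrete, here is a minimal sketch of pulling formula labels out of a paper's TeX source. It is not the code used in Printed-Latex-Data-Generation: the function names, the list of environments, and the cleanup rules are all assumptions chosen for illustration, and a real pipeline would also need to handle inline math, macros, and rendering.

```python
import re

# Hypothetical pattern: equation-like environments, starred or not.
# The backreferences \1\2 force \end{...} to match the opening name.
EQUATION_ENV = re.compile(
    r"\\begin\{(equation|align|gather|multline)(\*?)\}(.*?)\\end\{\1\2\}",
    re.DOTALL,
)


def strip_comments(tex: str) -> str:
    """Remove LaTeX comments: an unescaped % through the end of the line.

    Simplification: a literal backslash directly before % (written \\\\%)
    is not handled.
    """
    return re.sub(r"(?<!\\)%.*", "", tex)


def extract_formulas(tex: str) -> list[str]:
    """Return the body of each equation-like environment as a one-line label."""
    formulas = []
    for match in EQUATION_ENV.finditer(strip_comments(tex)):
        body = match.group(3)
        # Cross-reference labels carry no visual content, so drop them.
        body = re.sub(r"\\label\{[^}]*\}", "", body)
        # Collapse all whitespace so each formula fits on one line.
        body = " ".join(body.split())
        if body:
            formulas.append(body)
    return formulas
```

Each returned string could then be rendered to a .png at the chosen resolution and stored alongside the image as its label.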

Technical Challenges and Innovations

Models for complex tasks often generalize poorly outside their training dataset. To address this, our team has implemented robust image preprocessing methods. We limit inputs to strips of 128x1024 pixels, invert colors so that images can be padded efficiently, and use the Albumentations package for dynamic transformations. This approach helps improve training efficiency and convergence.
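The inversion-and-padding step can be sketched as follows. This is an illustrative sketch under stated assumptions, not our production code: it assumes grayscale uint8 input with dark ink on a light background, reads 128x1024 as height by width, and the helper name `to_strip`, the nearest-neighbour downscaling, and the left-aligned, vertically centered placement are hypothetical choices. The Albumentations transformations are not shown.

```python
import numpy as np

STRIP_HEIGHT, STRIP_WIDTH = 128, 1024  # assumed to mean height x width


def to_strip(image: np.ndarray) -> np.ndarray:
    """Invert a grayscale uint8 image and zero-pad it to a 128x1024 strip.

    After inversion the background is near 0, so padding with zeros
    blends in with it and no fill colour has to be chosen.
    """
    if image.ndim != 2 or image.dtype != np.uint8:
        raise ValueError("expected a 2-D uint8 grayscale image")
    height, width = image.shape

    # Downscale (never upscale) so the image fits inside the strip.
    scale = min(1.0, STRIP_HEIGHT / height, STRIP_WIDTH / width)
    if scale < 1.0:
        new_h = max(1, int(height * scale))
        new_w = max(1, int(width * scale))
        # Nearest-neighbour sampling keeps this sketch dependency-free;
        # a real pipeline would use a proper resampling filter.
        rows = np.arange(new_h) * height // new_h
        cols = np.arange(new_w) * width // new_w
        image = image[rows][:, cols]
        height, width = new_h, new_w

    inverted = 255 - image  # dark ink becomes bright, background becomes 0
    strip = np.zeros((STRIP_HEIGHT, STRIP_WIDTH), dtype=np.uint8)
    top = (STRIP_HEIGHT - height) // 2  # hypothetical: center vertically
    strip[top:top + height, :width] = inverted
    return strip
```

Because the padded region and the inverted background are both zero, every input ends up with the same fixed shape without introducing a visible border.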

Our current model excels with single-line formulas and small matrices. To expand its capabilities, we are developing models for more complex structures such as large tables and diagrams. A key lesson comes from the Image-to-Markup project, where low-resolution training data impaired model performance. Our revised strategy is therefore to train on a high-resolution dataset, apply random transformations to prevent overfitting, and optimize image processing to preserve clarity and detail, with the goal of keeping our OCR technology accurate and adaptable.

Inspiration

The inspiration for the Im2Tex project came from a collaborative project led by Dr. Jan Reimann at Penn State University, which aimed to develop a platform for sharing open-source content using Jupyter Notebooks. We created infrastructure that supports diverse content formats such as e-books, interactive notebooks, and quizzes for Canvas integration. The platform also offers cost-effective cloud-hosting solutions, which makes Jupyter Notebooks more usable in educational settings. In Spring 2022, we used this platform to revise the Math 110 Techniques of Calculus course. During this initiative, we identified a need for an open-source OCR tool capable of converting images and PDFs containing mathematical formulas and text into LaTeX code. This realization led us to develop comprehensive OCR software, which can now process mathematical formulas; work is ongoing to extend it to full-page content and handwritten notes.