About

Im2Tex

Im2Tex is a tool that converts images of mathematical formulas (in formats such as PNG and JPEG) into LaTeX code. The system combines a Convolutional Neural Network (CNN) backbone with a Transformer decoder to recognize and interpret complex mathematical symbols, expressions, and equations within an image. The tool is designed to be versatile and easy to use while producing accurate LaTeX output.

How it was built

Our starting point was the Image-to-Markup project developed by the Harvard NLP group, which served as a valuable template. We then customized and reworked its data generation tools based on our needs and findings, which led to our own data generation project, Printed-Latex-Data-Generation. This project lets users parse ArXiv papers, extract LaTeX code, and generate matching .png images and LaTeX code labels. The tools allow the output resolution to be customized, because we found that higher-resolution training data significantly affects model performance. Our generated dataset demonstrates the capabilities of these tools.

For the image-to-LaTeX model itself, we built on the preprint by S. Singh (2018), Image-to-Markup Generation with Coarse-to-Fine Attention. Our final model uses a ResNet34 CNN backbone and a Transformer decoder. Much of our effort went into preprocessing the data so that the model generalizes beyond the dataset, which is where higher resolution and our custom data generation tools proved essential. For a more in-depth exploration, please visit the GitHub page of our project.
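As a rough illustration of the "extract LaTeX code" step in the data generation described above, the sketch below pulls display-math bodies out of a .tex source with a regular expression. It is not the code used in Printed-Latex-Data-Generation: the function name extract_formulas, the list of environments, and the label-stripping rule are all assumptions made for this example, and a real pipeline would also need to handle comments, macros, and inline math.

```python
import re

# Hypothetical pattern: matches a few common display-math environments,
# with or without a trailing star. The environment list is an assumption.
EQUATION_ENV = re.compile(
    r"\\begin\{(equation|align|gather)(\*?)\}(.*?)\\end\{\1\2\}",
    re.DOTALL,
)


def extract_formulas(tex_source: str) -> list[str]:
    """Return the bodies of display-math environments, one string per formula."""
    formulas = []
    for match in EQUATION_ENV.finditer(tex_source):
        body = match.group(3)
        # Drop \label{...} commands, which do not change the rendered formula.
        body = re.sub(r"\\label\{[^}]*\}", "", body)
        # Collapse all whitespace so each label fits on a single line.
        body = " ".join(body.split())
        if body:
            formulas.append(body)
    return formulas
```

Each returned string could then be rendered to an image at the chosen resolution and stored next to it as the training label.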

Technical Challenges and Innovations

Models for tasks like this often generalize poorly to inputs outside the training dataset. To improve generalization, our team implemented robust image preprocessing: we limit inputs to strips of 128x1024 pixels, invert colors so that images can be padded with zeros, and use the Albumentations package to apply random transformations during training. This approach improves training efficiency and convergence.

Our current model performs best on single-line formulas and small matrices. To extend its capabilities, we are developing models for more complex structures such as large tables and diagrams. We train on high-resolution data to avoid the problem encountered with the Image-to-Markup project, where low-resolution training data impaired model performance. Our revised strategy combines a high-resolution training dataset, random transformations to prevent overfitting, and image processing that preserves clarity and detail, all aimed at improving the accuracy and adaptability of our OCR technology.

Inspiration

The inspiration for the Im2Tex project came from a collaborative project led by Dr. Jan Reimann at Penn State University, which aimed to develop a platform for sharing open-source content using Jupyter Notebooks. We created infrastructure that supports diverse content formats such as e-books, interactive notebooks, and quizzes for Canvas integration. It also offers cost-effective cloud-hosting solutions, making Jupyter Notebooks easier to use in educational settings. In Spring 2022, we used this platform to revise the Math 110 Techniques of Calculus course. During this work, we identified a need for an open-source OCR tool capable of converting images and PDFs containing mathematical formulas and text into LaTeX code. This led us to develop our OCR software, which can now process mathematical formulas; support for full-page content and handwritten notes is in progress.

Challenges encountered

One of the most significant challenges encountered by others developing similar models is poor generalization to instances outside the dataset. To address this issue, we focused on implementing robust preprocessing steps for images. Our model's input size during training is limited to strips with a height of 128 pixels and a width of 1024 pixels, maintaining a maximum aspect ratio of 8. We then invert colors and apply various transformations using the Albumentations package. Inverting colors enables us to pad with a value of zero, which has been demonstrated to facilitate faster model convergence. We also observed that higher resolution is advantageous, given that LaTeX code often contains multiple symbols in a relatively small area of the image (such as nested subscripts and superscripts).
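The link between color inversion and zero padding can be sketched as follows. This is a minimal illustration, not the project's preprocessing code: it assumes a grayscale uint8 image with dark symbols on a white background that already fits inside the strip, the function name invert_and_pad and the top-left placement are choices made for this example, and the random Albumentations transformations mentioned above are not shown.

```python
import numpy as np


def invert_and_pad(image: np.ndarray, height: int = 128, width: int = 1024) -> np.ndarray:
    """Invert a grayscale uint8 image and zero-pad it to (height, width)."""
    if image.ndim != 2 or image.dtype != np.uint8:
        raise ValueError("expected a 2-D uint8 grayscale image")
    if image.shape[0] > height or image.shape[1] > width:
        raise ValueError("image is larger than the target strip")
    # After inversion the white background (255) becomes 0, so the
    # zero padding below is indistinguishable from real background.
    inverted = 255 - image
    strip = np.zeros((height, width), dtype=np.uint8)
    # Assumption: place the formula in the top-left corner of the strip.
    strip[: image.shape[0], : image.shape[1]] = inverted
    return strip
```

Without the inversion, padding a white-background image with zeros would add a black border that looks like ink rather than empty space.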

Our current model performs well with single lines of formulas or matrices of up to four rows. We are developing a separate model for diagrams, tables, and matrices with more than four rows, all of which will be incorporated into our comprehensive paragraph recognition model.

The original dataset used in the Image-to-Markup project is of low resolution, and training on it results in suboptimal model generalization, as noted by others working with similar datasets. Our solution was to rebuild the dataset at a significantly higher resolution, which we then randomly compress through transformations during training. We rescale each image to a height of 128 pixels, preserving its aspect ratio as long as that ratio does not exceed 8; images with a larger aspect ratio are truncated to an aspect ratio of 8, so that no input is wider than 1024 pixels. We then apply transformations that introduce random scaling, shifting, and other modifications, and pad the images to the required strip size of 128 by 1024. These steps significantly improve the model's adaptability, accuracy, and overall performance.
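The resizing arithmetic described above can be worked through in a few lines. This sketch assumes that "truncating" an over-wide image means cutting its scaled width off at 1024 pixels rather than squeezing it; the source does not say which, and the function name strip_geometry is made up for this example.

```python
TARGET_HEIGHT = 128
TARGET_WIDTH = 1024
MAX_ASPECT_RATIO = TARGET_WIDTH // TARGET_HEIGHT  # 8


def strip_geometry(width: int, height: int) -> tuple[int, int, int]:
    """Return (resized_width, resized_height, padding) for an input image.

    The image is scaled to TARGET_HEIGHT with its aspect ratio preserved.
    Assumption: if the scaled width exceeds TARGET_WIDTH (aspect ratio
    above 8), the excess is cut off instead of being squeezed.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scaled_width = max(1, round(width * TARGET_HEIGHT / height))
    resized_width = min(scaled_width, TARGET_WIDTH)
    # Horizontal padding still needed to fill the 128 by 1024 strip.
    padding = TARGET_WIDTH - resized_width
    return resized_width, TARGET_HEIGHT, padding
```

For example, a 300 by 100 pixel formula scales to 384 by 128 and needs 640 columns of padding, while a 600 by 64 pixel formula would scale to 1200 by 128 and is therefore cut to 1024 by 128 with no padding.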
