Tools to generate a synthetic dataset of TeX formulas and their corresponding SVG/PNG images from a collection of TeX files.
A pip-installable module is in the works and will be released soon.
Python and JS tools for generating a Printed LaTeX Dataset (images of TeX formulas with labels) by parsing Cornell's KDD Cup dataset. Also see the KDD Cup paper.
Note: parsing for arXiv, Wikipedia, and Stack Exchange sources is coming.
Note: You can parse any .tar files that contain LaTeX formulas; you need to add them to the folder manually.
The easiest way to generate data is via the Jupyter notebook Data generation.ipynb, located in the folder Jupyter Notebooks/. See the section Generate using Jupyter Notebook Example for step-by-step instructions.
Final outputs are located in the Data folder.
Final outputs:
generated_png_images - containing PNG images
corresponding_png_images.txt - each line contains a PNG image filename from the folder generated_png_images
final_png_formulas.txt - each line contains the corresponding LaTeX formula
raw_data - containing raw downloaded data
temporary_data - containing formulas from various stages of processing and SVG images generated along the way

Generate using Jupyter Notebook Example
Navigate to the Jupyter Notebooks/ directory and open the provided notebook. Execute all cells except for the function:
Generate_Printed_Tex(download_tex_dataset=False,
generate_tex_formulas=False,
number_tex_formulas_to_generate=1,
generate_svg_images_from_tex=False,
generate_png_from_svg=False)
We will invoke this function in subsequent steps with different flags.
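Since each step turns on exactly one flag, it can help to build the keyword arguments programmatically. The following is a minimal sketch: `stage_kwargs` and the short stage names are hypothetical helpers introduced here for illustration, while the flag names are copied from the call above.

```python
# Flag names copied from the Generate_Printed_Tex call shown above.
# The short stage names (the dictionary keys) are hypothetical.
STAGE_FLAGS = {
    "download": "download_tex_dataset",
    "formulas": "generate_tex_formulas",
    "svg": "generate_svg_images_from_tex",
    "png": "generate_png_from_svg",
}


def stage_kwargs(stage, number_tex_formulas_to_generate=1):
    """Return keyword arguments with only the flag for `stage` set to True."""
    if stage not in STAGE_FLAGS:
        raise ValueError(f"unknown stage: {stage!r}")
    # Start with every flag off, then switch on the one requested.
    kwargs = {flag: False for flag in STAGE_FLAGS.values()}
    kwargs[STAGE_FLAGS[stage]] = True
    kwargs["number_tex_formulas_to_generate"] = number_tex_formulas_to_generate
    return kwargs
```

In the notebook, where `Generate_Printed_Tex` is available, a step could then be run as `Generate_Printed_Tex(**stage_kwargs("download"))`.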
Use the Generate_Printed_Tex function to download the LaTeX dataset. Currently, the default is the KDD Cup dataset. However, you can specify URLs to any LaTeX-containing .tar files in configs.py.
Set only the download_tex_dataset=True flag, leaving the others set to False.
With the dataset in place, process and extract LaTeX formulas:
Set only the generate_tex_formulas=True flag and ensure all other flags are set to False.
Note: If number_tex_formulas_to_generate is less than 1001, only one .tar file will be parsed. For values greater than or equal to 1001, all downloaded .tar files will be processed.
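The threshold in the note above can be read as the following selection rule. This is only a sketch of the described behavior, not the project's code: `select_tars` is a hypothetical helper, and picking the first archive when a single one is parsed is an assumption, since the note does not say which file is used.

```python
SINGLE_TAR_LIMIT = 1001  # threshold quoted in the note above


def select_tars(number_tex_formulas_to_generate, tar_files):
    """Return the archives that would be parsed for a requested formula count.

    Assumption: when only one archive is parsed, it is the first in the list.
    """
    tar_files = list(tar_files)
    if number_tex_formulas_to_generate < SINGLE_TAR_LIMIT:
        return tar_files[:1]
    return tar_files
```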
To convert preprocessed LaTeX formulas into SVG format:
1. Open the tex_to_svg.py file and adjust the parameters:
   MAX_NUMBER_TO_RENDER = 500*1000 (determines the maximum number of SVG LaTeX formulas to render)
   THREADS = 8 (set to the number of CPU cores to use; ensure it's less than the total available cores on your system)
2. Run the Generate_Printed_Tex function with the generate_svg_images_from_tex=True flag.
Finally, transform the SVG images into PNG format:
1. On macOS, make sure Inkscape is installed and accessible via the command line. On Linux, the process will use librsvg2.
2. Open the svg_to_png.py file and adjust the parameters:
   THREADS = 7 (set this to a value less than your available CPU cores)
   PNG_WIDTH = 512
   PNG_HEIGHT = 64
3. Run the Generate_Printed_Tex function with the generate_png_from_svg=True flag to start the conversion.
Running it will output all the data in the Data folder.
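The THREADS guidance in the steps above (stay below the number of available cores) and the platform-dependent converter can be sketched as follows. `pick_threads` and `pick_converter` are hypothetical helpers, not functions from the project; the converter names are the tools this README mentions (Inkscape on macOS, rsvg-convert from librsvg2 elsewhere).

```python
import os
import platform


def pick_threads(available=None):
    """Return a worker count one below the available CPU cores, at least 1."""
    if available is None:
        available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, available - 1)


def pick_converter(system=None):
    """Return the SVG-to-PNG command-line tool name for the given OS name."""
    if system is None:
        system = platform.system()  # "Darwin" on macOS, "Linux" on Linux
    return "inkscape" if system == "Darwin" else "rsvg-convert"
```

The value returned by `pick_threads()` could be copied into THREADS in tex_to_svg.py and svg_to_png.py by hand.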
You can download a prebuilt dataset 230k from here.
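Whether the data is generated locally or downloaded, image/formula pairs can be read back with a few lines of Python. This is a minimal sketch, assuming the files are UTF-8 text and that line i of corresponding_png_images.txt matches line i of final_png_formulas.txt; `load_pairs` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def _read_lines(path):
    """Read a text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_pairs(data_dir):
    """Return a list of (png_path, latex_formula) tuples from the Data folder."""
    data_dir = Path(data_dir)
    names = _read_lines(data_dir / "corresponding_png_images.txt")
    formulas = _read_lines(data_dir / "final_png_formulas.txt")
    if len(names) != len(formulas):
        raise ValueError("image list and formula list differ in length")
    image_dir = data_dir / "generated_png_images"
    return [(image_dir / name.strip(), formula)
            for name, formula in zip(names, formulas)]
```

The length check is there because a silent mismatch between the two files would shift every label by one or more lines.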
Some characteristics of the im2latex 230k dataset:
comes with a vocabulary, 230k.json, of size 579, which was generated on a bigger dataset of around 330k
[Figure: sample image]
Note: This code is very ad-hoc and requires tinkering with the source
pip install opencv-python
pip install smart_open
sudo apt install nodejs npm
sudo npm install --global mathjax-node-cli
sudo apt install librsvg2-bin
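Before running the pipeline, it may be worth checking that the tools installed above are visible from Python. The sketch below is not part of the project: `missing_dependencies` is a hypothetical helper, and the names it checks are assumptions about what these packages provide (`cv2` for opencv-python, `smart_open`, and the `node`, `npm`, and `rsvg-convert` executables; on some systems the Node.js binary may be named differently).

```python
import importlib.util
import shutil

PYTHON_MODULES = ("cv2", "smart_open")         # opencv-python, smart_open
EXECUTABLES = ("node", "npm", "rsvg-convert")  # nodejs, npm, librsvg2-bin


def missing_dependencies(modules=PYTHON_MODULES, executables=EXECUTABLES):
    """Return the names of modules and executables that cannot be found."""
    # find_spec returns None when a top-level module is not importable.
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when no executable is found on PATH.
    missing += [e for e in executables if shutil.which(e) is None]
    return missing
```

An empty list from `missing_dependencies()` suggests the basics are in place; it does not verify versions or the mathjax-node-cli install.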
Printed_Tex.py - Main module
download_data_utils.py - Contains tools for downloading TeX tars and for unpacking and parsing them.
configs.py - Contains paths and command-line script commands.
third_party/ - Contains KaTeX for parsing LaTeX formulas
preprocess_formulas.py and preprocess_formulas.js - Collection of tools for handling and parsing LaTeX formulas
tex_to_svg.py - Functions to convert LaTeX formulas to SVG images using MathJax
svg_to_png.py - Functions to convert SVG formula images to PNG images using inkscape on macOS (Darwin) and rsvg-convert on all other systems.
Data/ - Contains the generated_png_images/ folder, corresponding_png_images.txt, and final_png_formulas.txt. Also contains the temporary folder temporary_data (formulas from various stages of processing and generated SVG images) and raw_data, where raw data is downloaded.
Jupyter Notebooks - Contains examples of generating data using Jupyter notebooks

Idea is based on https://github.com/Miffyli/im2latex-dataset