Marker PDF Parser

GraphAI

A local PDF parsing solution towards Graph RAG.

Published

December 7, 2024

Marker by Datalab is an open source Python library for extracting structured data from PDFs. We listed various PDF parsing solutions in the lengthy Graph RAG article, and Marker stands out because you can run it locally and thus avoid sending data to a service. Services like LlamaParse are indeed very powerful, but they require sending your data to a remote server. Marker is a good alternative if you want to keep your data local.

Setting Marker up is as easy as pip install marker-pdf, and to parse an article to markdown you can use:

marker_single /my-article.pdf

You can also output JSON or HTML, and a more detailed command could be:

marker_single --output_dir /tmp/output/ --page_range 0-2 --output_format json /my-article.pdf
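If you want to drive the same command from a Python pipeline, a thin wrapper around the CLI is one option. The sketch below is only an illustration: build_marker_command and run_marker are hypothetical helper names (they are not part of Marker), and the only flags used are the ones shown in the command above. It assumes marker_single is on your PATH after installing marker-pdf.

```python
import shlex
import subprocess
from typing import List, Optional


def build_marker_command(
    pdf_path: str,
    output_dir: Optional[str] = None,
    page_range: Optional[str] = None,
    output_format: Optional[str] = None,
) -> List[str]:
    """Assemble a marker_single invocation (hypothetical helper).

    Only the flags that appear in the article are supported here.
    """
    cmd = ["marker_single"]
    if output_dir is not None:
        cmd += ["--output_dir", output_dir]
    if page_range is not None:
        cmd += ["--page_range", page_range]
    if output_format is not None:
        cmd += ["--output_format", output_format]
    cmd.append(pdf_path)
    return cmd


def run_marker(pdf_path: str, **options: Optional[str]) -> None:
    """Run marker_single as a subprocess; raises CalledProcessError on failure.

    Assumes the marker_single executable is installed and on PATH.
    """
    subprocess.run(build_marker_command(pdf_path, **options), check=True)


# Build (but do not run) the detailed command from the article.
command = build_marker_command(
    "/my-article.pdf",
    output_dir="/tmp/output/",
    page_range="0-2",
    output_format="json",
)
print(shlex.join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces. Calling run_marker("/my-article.pdf", output_format="json") would then execute the conversion.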

Some notable features of Marker are:

it extracts images from PDFs
it recognizes formulas and can extract them as LaTeX
it extracts tables and outputs them as markdown tables
it preserves footnotes and references!

Marker is free to use for companies under $5M in TTM (trailing twelve months) revenue, which is really generous. If you are a larger company, you can contact Datalab for a quote.

One aspect of particular interest in domains like legal is the heavy use of references and footnotes. Marker extracts them correctly, and in the snippet below you can see the original PDF with the markdown preview next to it.

At the markdown level, the footnotes are extracted as:

<sup>2</sup> David M. Scobey, Empire City: Politics, Culture, and Urbanism in Gilded-Age New York (New Haven: Yale University, 1989), 25, 26, 188.
<sup>3</sup> Scobey, 29, 334.
<sup>4</sup> James Miller, Miller's New York as it Is (New York, James Miller Press, 1872), 23.
<sup>5</sup> Thomas Bender, New York Intellect: A History of Intellectual Life in New York City from 1750 to the Beginnings of Our Own Time (Baltimore: Johns Hopkins University Press, 1987), 171.
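Because each footnote starts with a numbered sup tag, the output is easy to post-process, for instance to turn footnotes into nodes or properties in a graph. The following is a minimal sketch, assuming every footnote sits on a single line that begins with the marker, as in the sample above. parse_footnotes is a hypothetical helper, not something Marker provides, and real documents may need a more forgiving pattern.

```python
import re
from typing import Dict

# Matches a line such as "<sup>3</sup> Scobey, 29, 334."
# Assumption: one footnote per line, with the marker at the start of the line.
FOOTNOTE_RE = re.compile(r"^<sup>(\d+)</sup>\s*(.+)$")


def parse_footnotes(markdown: str) -> Dict[int, str]:
    """Map footnote numbers to their text (hypothetical helper)."""
    footnotes: Dict[int, str] = {}
    for line in markdown.splitlines():
        match = FOOTNOTE_RE.match(line.strip())
        if match:
            footnotes[int(match.group(1))] = match.group(2).strip()
    return footnotes


sample = """\
Some body text that is not a footnote.
<sup>3</sup> Scobey, 29, 334.
<sup>4</sup> James Miller, Miller's New York as it Is (New York, James Miller Press, 1872), 23.
"""

notes = parse_footnotes(sample)
for number, text in sorted(notes.items()):
    print(number, text)
```

Keying the result by footnote number makes it straightforward to link each note back to the superscript reference in the body text later on.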

Note that the citations inside the footnotes are neatly converted as well. All in all, Marker is quite a catch.