Marker PDF Parser
Marker by Datalab is an open source Python library that can be used to extract structured data from PDFs. We listed various Pdf parsing solution in the lengthy Graph RAG article and Marker is stands out because you can run it locally, thus avoiding sending data to a service. Services like LlamaParse are indeed very powerful but they require sending your data to a remote server. Marker is a good alternative if you want to keep your data local.
Setting Marker up is as easy as pip install marker-pdf
and if to parse an article to markdown you can use:
marker_single /my-article.pdf
You can also output JSON or HTML and a more detailed prompt could be
marker_single --output_dir /tmp/output/ --page_range 0-2 --output_format json /my-article.pdf
Some notable features of Marker are:
- extracts images from PDFs
- it recognizes formulas and can extract them as LaTeX
- it can extract tables and output them as markdown tables
- footnotes and references!
Marker is free to use under $5M in TTM revenue, which is really generous. If you are a larger company, you can contact Datalab for a quote.
One aspect that is of particular interest in some domains like legal is the heavy use of references and footnotes. Marker correctly extracts them and in the snippet below you can see the original pdf with the markdown preview next to it.
On a markdown level the footnotes are extracted as:
<sup>2</sup> David M. Scobey, Empire City: Politics, Culture, and Urbanism in Gilded-Age New York (New Haven: Yale University, 1989), 25, 26, 188.
<sup>3</sup> Scobey, 29, 334.
<sup>4</sup> James Miller, Miller's New York as it Is (New York, James Miller Press, 1872), 23.
<sup>5</sup> Thomas Bender, New York Intellect: A History of Intellectual Life in New York City from 1750 to the Beginnings of Our Own Time (Baltimore: Johns Hopkins University Press, 1987), 171.
Note that the citation is neatly converted as well. All in all, Marker is quite a catch.