Perfectly Extract Text from PDFs


Introduction

In the constantly evolving landscape of data processing, extracting structured information from PDFs remains a formidable challenge, even in 2024. While numerous models excel at question answering, the real complexity lies in transforming unstructured PDF content into organized, shareable data. Let's explore this challenge and see how Indexify and PaddleOCR can be the tools we need to extract text from PDFs without problems.

Spoiler: We actually solved it! Hit Cmd/Ctrl+F and search for the term Spotlight to see how!

PDF extraction is crucial in various domains. Let’s look at some common use cases:

  • Invoices and receipts: These documents vary widely in format and often contain complex layouts, tables, and sometimes handwritten notes. Accurate parsing is essential for automating accounting processes.
  • Academic papers and theses: They often include a mixture of text, graphics, tables, and formulas. The challenge lies in correctly converting not only text but also mathematical equations and scientific notation.
  • Legal documents: Contracts and court filings are usually dense with nuanced formatting. Preserving the integrity of the original formatting while extracting text is crucial for legal review and compliance.
  • Historical and manuscript archives: These pose unique challenges due to paper degradation, variations in historical handwriting, and archaic language. OCR technology must handle these variations for effective research and archival purposes.
  • Medical records and prescriptions: They often contain critical handwritten notes and medical terminology. Capturing this information accurately is vital for patient care and medical research.


Indexify is an open-source data framework that addresses the complexities of unstructured data extraction from any source, as shown in Figure 1. Its architecture supports:

  • Ingestion of millions of unstructured data points.
  • Real-time extraction and indexing pipelines.
  • Horizontal scaling to accommodate growing data volumes.
  • Fast extraction times (within seconds of ingestion).
  • Flexible deployment on various hardware platforms (GPUs, TPUs, and CPUs).
Fig 1: Indexify makes ingestion and extraction effortless at scale (Source: Indexify)

If you are interested in reading more about Indexify and how you can set it up for extraction, check out our two-minute 'Getting Started' guide.

Indexify Extractors: The Building Blocks
Fig 2: Architecture of an Indexify extractor (Source: Indexify architecture documentation)

At the heart of Indexify are its extractors (as shown in Fig 2) – compute functions that transform unstructured data or extract information from it. These extractors can be deployed to run on any hardware, with a single Indexify deployment supporting tens of thousands of extractors in a cluster.
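Conceptually, an extractor is just a function from one piece of content to zero or more derived pieces. Here is a minimal, library-free sketch of that idea (the `Content` class and `word_count_extractor` below are illustrative stand-ins, not the SDK's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Content:
    """Illustrative stand-in for a unit of ingested data."""
    content_type: str
    data: bytes
    features: dict = field(default_factory=dict)

def word_count_extractor(content: Content) -> list[Content]:
    """A toy extractor: annotate plain text with a word-count feature."""
    text = content.data.decode("utf-8")
    return [Content("text/plain", content.data,
                    {"word_count": len(text.split())})]

derived = word_count_extractor(Content("text/plain", b"hello indexify world"))
```

Real extractors follow the same shape: content in, enriched or transformed content out, which is what lets Indexify schedule them on any hardware.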

Fig 3: Indexify extractors with multiple modalities (Source: internal image)

Indexify supports multiple extractors across multiple modalities (as shown in Fig 3). The complete list of Indexify extractors, along with their use cases, can be found in the documentation.

The PaddleOCR PDF extractor, based on the PaddleOCR library, is a powerful tool in the Indexify ecosystem. It integrates several OCR algorithms for text detection (DB, EAST, SAST) and recognition (CRNN, RARE, StarNet, Rosetta, SRN).

Let's walk through setting up and using the PaddleOCR extractor.

Here is an example of creating a pipeline that extracts text, tables, and images from a PDF document.

You will need three different terminals open to complete this tutorial:

  • Terminal 1 to download and run the Indexify server.
  • Terminal 2 to run our Indexify extractors, which handle structured extraction, chunking, and embedding of ingested pages.
  • Terminal 3 to run our Python scripts for uploading and retrieving data from the Indexify server.

Step 1: Start the Indexify server

Let's start by downloading the Indexify server and running it.

Terminal 1

curl https://getindexify.ai | sh
./indexify server -d

Step 2: Download and run the extractors

Next, create a new virtual environment and install the necessary packages into it.

Terminal 2

python3 -m venv venv
source venv/bin/activate
pip3 install indexify-extractor-sdk indexify

Then we can download the PaddleOCR extractor and join it to the server using the commands below.

indexify-extractor download tensorlake/paddleocr_extractor
indexify-extractor join-server

Terminal 3

python3 -m venv venv
source venv/bin/activate

Step 3: Define the extraction graph

Create a Python script that defines the extraction graph and execute it. Steps 3-5 in this sub-section should be part of the same Python file, executed after activating the venv in Terminal 3.

from indexify import IndexifyClient, ExtractionGraph
client = IndexifyClient()
extraction_graph_spec = """
name: 'pdflearner'
extraction_policies:
   - extractor: 'tensorlake/paddleocr_extractor'
     name: 'pdf_to_text'
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)

This code configures an extraction graph called "pdflearner" that uses the PaddleOCR extractor to convert PDFs into text.

Step 4: Upload a PDF from your application

content_id = client.upload_file("pdflearner", "/path/to/pdf.file")
client.wait_for_extraction(content_id)

extracted_content = client.get_extracted_content(content_id=content_id, graph_name="pdflearner", policy_name="pdf_to_text")
print(extracted_content)

This snippet uploads a PDF, waits for the extraction to complete, and then retrieves and prints the extracted content.
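Each returned item carries the recognized text as raw bytes in its `data` field. A small, hypothetical helper for turning a list of such items into one decoded string (the `_Item` class below is a stand-in for the SDK's Content object, assuming only that each item exposes `.data` as UTF-8 bytes):

```python
class _Item:
    """Stand-in for the SDK's Content object (only the field we use)."""
    def __init__(self, data: bytes):
        self.data = data

def contents_to_text(extracted_content) -> str:
    """Decode each item's UTF-8 `data` bytes and join with newlines."""
    return "\n".join(item.data.decode("utf-8") for item in extracted_content)

demo = contents_to_text([_Item(b"Form 1040"), _Item(b"Wages: 143,433.")])
```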

We could hardly believe that such a simple, few-step process could extract all the textual information meaningfully. So we tried it with a real-world tax form (as shown in Figure 4).

Fig 4: A real-world tax form with a complex layout (Source: internal image)
[Content(content_type="text/plain", data=b"Form 1040\nForms W-2 & W-2G
Summary\n2023\nKeep for your records\n Name(s) Shown on Return\nSocial
Security Number\nJohn H & Jane K Doe\n321-12-3456\nEmployer\nSP\nFederal 
Tax\nState Wages\nState Tax\nForm W-2\nWages\nAcmeware 
Employer\n143,433.\n143,433.\n1,000.\nTotals.\n143,433.\n143,433.\n1,000.\nFo
rm W-2 Summary\nBox No.\nDescription\nTaxpayer\nSpouse\nTotal\nTotal wages, 
tips and compensation:\n1\na\nW2 box 1 statutory wages reported on Sch C\nW2 
box 1 inmate or halfway house wages .\n6\nc\nAll other W2 box 1 
wages\n143,433.\n143,433.\nd\nForeign wages included in total 
wages\ne\n0.\n0.\n2\nTotal federal tax withheld\n 3 & 7 Total social security 
wages/tips .\n143,566.\n143,566.\n4\nTotal social security tax 
withheld\n8,901.\n8,901.\n5\nTotal Medicare wages and 
tips\n143,566.\n143,566.\n6\nTotal Medicare tax withheld . 
:\n2,082.\n2,082.\n8\nTotal allocated tips .\n9\nNot used\n10 a\nTotal 
dependent care benefits\nb\nOffsite dependent care benefits\nc\nOnsite 
dependent care benefits\n11\n Total distributions from nonqualified plans\n12 
a\nTotal from Box 12\n3,732.\n3,732.\nElective deferrals to qualified 
plans\n133.\n133.\nc\nRoth contrib. to 401(k), 403(b), 457(b) plans .\n.\n1 
Elective deferrals to government 457 plans\n2 Non-elective deferrals to gov't 
457 plans .\ne\nDeferrals to non-government 457 plans\nf\nDeferrals 409A 
nonqual deferred comp plan .\n6\nIncome 409A nonqual deferred comp plan 
.\nh\nUncollected Medicare tax :\nUncollected social security and RRTA tier 
1\nj\nUncollected RRTA tier 2 . . .\nk\nIncome from nonstatutory stock 
options\nNon-taxable combat pay\nm\nQSEHRA benefits\nTotal other items from 
box 12 .\nn\n3,599.\n3,599.\n14 a\n Total deductible mandatory state tax 
.\nb\nTotal deductible charitable contributions\nc\nTotal state deductible 
employee expenses .\nd\n Total RR Compensation .\ne\nTotal RR Tier 1 tax 
.\nf\nTotal RR Tier 2 tax . -\nTotal RR Medicare tax .\ng\nh\nTotal RR 
Additional Medicare tax .\ni\nTotal RRTA tips. : :\nj\nTotal other items from 
box 14\nk\nTotal sick leave subject to $511 limit\nTotal sick leave subject 
to $200 limit\nm\nTotal emergency family leave wages\n16\nTotal state wages 
and tips .\n143,433.\n143,433.\n17\nTotal state tax 
withheld\n1,000.\n1,000.\n19\nTotal local tax withheld .", features=
[Feature(feature_type="metadata", name="metadata", value={'type': 'text'}, 
comment=None)], labels={})]
Fig 5: Structured inference with an extractor chain (Source: internal image)

While extracting text is useful, we often need to parse this text into structured data. Here's how you can use Indexify to extract specific fields from your PDF (the full workflow is shown in Figure 5).

from indexify import IndexifyClient, ExtractionGraph
client = IndexifyClient()
schema = {
    'properties': {
        'invoice_number': {'title': 'Invoice Number', 'type': 'string'},
        'date': {'title': 'Date', 'type': 'string'},
        'account_number': {'title': 'Account Number', 'type': 'string'},
        'owner': {'title': 'Owner', 'type': 'string'},
        'address': {'title': 'Address', 'type': 'string'},
        'last_month_balance': {'title': 'Last Month Balance', 'type': 'string'},
        'current_amount_due': {'title': 'Current Amount Due', 'type': 'string'},
        'registration_key': {'title': 'Registration Key', 'type': 'string'},
        'due_date': {'title': 'Due Date', 'type': 'string'}
    },
    'required': ['invoice_number', 'date', 'account_number', 'owner', 'address', 'last_month_balance', 'current_amount_due', 'registration_key', 'due_date'],
    'title': 'User',
    'type': 'object'
}
examples = str([
    {
        "type": "object",
        "properties": {
            "employer_name": {"type": "string", "title": "Employer Name"},
            "employee_name": {"type": "string", "title": "Employee Name"},
            "wages": {"type": "number", "title": "Wages"},
            "federal_tax_withheld": {"type": "number", "title": "Federal Tax Withheld"},
            "state_wages": {"type": "number", "title": "State Wages"},
            "state_tax": {"type": "number", "title": "State Tax"}
        },
        "required": ["employer_name", "employee_name", "wages", "federal_tax_withheld", "state_wages", "state_tax"]
    },
    {
        "type": "object",
        "properties": {
            "booking_reference": {"type": "string", "title": "Booking Reference"},
            "passenger_name": {"type": "string", "title": "Passenger Name"},
            "flight_number": {"type": "string", "title": "Flight Number"},
            "departure_airport": {"type": "string", "title": "Departure Airport"},
            "arrival_airport": {"type": "string", "title": "Arrival Airport"},
            "departure_time": {"type": "string", "title": "Departure Time"},
            "arrival_time": {"type": "string", "title": "Arrival Time"}
        },
        "required": ["booking_reference", "passenger_name", "flight_number", "departure_airport", "arrival_airport", "departure_time", "arrival_time"]
    }
])
extraction_graph_spec = f"""
name: 'invoice-learner'
extraction_policies:
  - extractor: 'tensorlake/paddleocr_extractor'
    name: 'pdf-extraction'
  - extractor: 'schema_extractor'
    name: 'text_to_json'
    input_params:
      service: 'openai'
      example_text: {examples}
    content_source: 'pdf-extraction'
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
content_id = client.upload_file("invoice-learner", "/path/to/pdf.pdf")
print(content_id)
client.wait_for_extraction(content_id)
extracted_content = client.get_extracted_content(content_id=content_id, graph_name="invoice-learner", policy_name="text_to_json")
print(extracted_content)

This advanced example shows how to chain multiple extractors: PaddleOCR first extracts text from the PDF, then a schema extractor parses that text into structured JSON data based on the defined schema.

The schema extractor is interesting because it lets you either provide a schema explicitly or have your chosen language model infer the schema through few-shot learning.

We do this by passing a few examples of what the schema should look like via the example_text parameter. The cleaner and more descriptive the examples are, the better the inferred schema.
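Before trusting the inferred output downstream, it is worth checking it against the schema's `required` list. A dependency-free sketch of that check (real deployments might use the `jsonschema` package instead; the payload below is a made-up example):

```python
import json

def check_required(payload: str, required: list[str]) -> list[str]:
    """Return the required keys missing from a JSON object string."""
    record = json.loads(payload)
    return [key for key in required if key not in record]

missing = check_required(
    '{"invoice_number": "INV-1", "date": "2024-07-01"}',
    ["invoice_number", "date", "owner"],
)
```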

Let's inspect the output of this pipeline:

[Content(content_type="text/plain", data=b'{"Form":"1040","Forms W-2 & W-2G Summary":{"Year":2023,"Keep for your records":true,"Name(s) Shown on Return":"John H & Jane K Doe","Social Security Number":"321-12-3456","Employer":{"Name":"Acmeware Employer","Federal Tax":"SP","State Wages":143433,"State Tax":1000},"Totals":{"Wages":143433,"State Wages":143433,"State Tax":1000}},"Form W-2 Summary":{"Box No.":{"Description":{"Taxpayer":"John H Doe","Spouse":"Jane K Doe","Total":"John H & Jane K Doe"}},"Total wages, tips and compensation":{"W2 box 1 statutory wages reported on Sch C":143433,"W2 box 1 inmate or halfway house wages":0,"All other W2 box 1 wages":143433,"Foreign wages included in total wages":0},"Total federal tax withheld":0,"Total social security wages/tips":143566,"Total social security tax withheld":8901,"Total Medicare wages and tips":143566,"Total Medicare tax withheld":2082,"Total allocated tips":0,"Total dependent care benefits":{"Offsite dependent care benefits":0,"Onsite dependent care benefits":0},"Total distributions from nonqualified plans":0,"Total from Box 12":{"Elective deferrals to qualified plans":3732,"Roth contrib. 
to 401(k), 403(b), 457(b) plans":133,"Elective deferrals to government 457 plans":0,"Non-elective deferrals to gov't 457 plans":0,"Deferrals to non-government 457 plans":0,"Deferrals 409A nonqual deferred comp plan":0,"Income 409A nonqual deferred comp plan":0,"Uncollected Medicare tax":0,"Uncollected social security and RRTA tier 1":0,"Uncollected RRTA tier 2":0,"Income from nonstatutory stock options":0,"Non-taxable combat pay":0,"QSEHRA benefits":0,"Total other items from box 12":3599},"Total deductible mandatory state tax":0,"Total deductible charitable contributions":0,"Total state deductible employee expenses":0,"Total RR Compensation":0,"Total RR Tier 1 tax":0,"Total RR Tier 2 tax":0,"Total RR Medicare tax":0,"Total RR Additional Medicare tax":0,"Total RRTA tips":0,"Total other items from box 14":0,"Total sick leave subject to $511 limit":0,"Total sick leave subject to $200 limit":0,"Total emergency family leave wages":0,"Total state wages and tips":143433,"Total state tax withheld":1000,"Total local tax withheld":0}}, features=[Feature(feature_type="metadata", name="text", value={'model': 'gpt-3.5-turbo-0125', 'completion_tokens': 204, 'prompt_tokens': 692}, comment=None)], labels={})]

Yes, that's hard to read, so we've laid it out for you in Figure 6.

Fig 6: Clean extraction after chaining extractors in Indexify (Source: internal image)

This means that after this step, our textual data has been successfully extracted into a structured JSON format. The data may be complex, unevenly spaced, horizontally oriented, vertically oriented, diagonally oriented, in big fonts or small fonts; no matter the layout, it just works!

Well, that's it; this solves the problem we initially set out to tackle. Finally, we can shout "mission accomplished", Tom Cruise style!

Although the PaddleOCR extractor is powerful for PDF text extraction, Indexify's true strength lies in its ability to chain multiple extractors together, creating sophisticated data processing pipelines. Let's delve into why you may want to use additional extractors and how Indexify makes this process seamless and efficient.

Indexify extraction graphs allow you to apply a sequence of extractors to ingested content as it streams in. Each step in an extraction graph is known as an extraction policy. This approach offers several advantages:

  • Modularity: Break down complex extraction tasks into smaller, manageable steps.
  • Flexibility: Easily modify or replace individual extractors without affecting the entire pipeline.
  • Efficiency: Process data in a streaming fashion, reducing latency and resource usage.
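The chaining idea can be pictured as plain function composition: each policy consumes the previous one's output. A toy sketch with two stand-in stages (`fake_ocr` and `fake_schema` are illustrative, not Indexify APIs):

```python
def fake_ocr(pdf_bytes: bytes) -> str:
    """Stage 1 stand-in: pretend to OCR a PDF into plain text."""
    return pdf_bytes.decode("utf-8")

def fake_schema(text: str) -> dict:
    """Stage 2 stand-in: parse 'key: value' lines into a dict."""
    return dict(line.split(": ", 1) for line in text.splitlines())

def run_pipeline(pdf_bytes: bytes) -> dict:
    """Compose the stages, mirroring an extraction graph's policies."""
    return fake_schema(fake_ocr(pdf_bytes))

result = run_pipeline(b"invoice_number: INV-7\nowner: Jane Doe")
```

Swapping `fake_schema` for a different parser would not touch `fake_ocr`, which is exactly the modularity an extraction graph gives you.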

Lineage tracking

Indexify tracks the lineage of transformed content and extracted features back to the source. This capability is crucial for:

  • Data governance: Understand how your data has been processed and transformed.
  • Debugging: Easily trace problems back to their source.
  • Compliance: Meet regulatory requirements by maintaining a clear audit trail of data transformations.

While PaddleOCR stands out in text extraction, other extractors can add significant value to your data processing pipeline.

Why choose Indexify?

Indexify shines in scenarios where:

  • You are dealing with a large volume of documents (thousands or more).
  • Your volume of data grows over time.
  • You need reliable, highly available pipelines.
  • You are working with multimodal data or combining multiple models in a single pipeline.
  • Your application's user experience depends on up-to-date data.

Conclusion

Extracting structured data from PDFs does not have to be a headache. With Indexify and a set of powerful extractors such as PaddleOCR, you can streamline your workflow, manage large volumes of documents, and extract meaningful, structured data with ease. Whether you are processing invoices, academic papers, or any other type of PDF document, Indexify provides the tools you need to convert unstructured data into valuable insights.

Ready to speed up your PDF extraction process? Try Indexify and experience the ease of intelligent, scalable data extraction.
