Space after capital letter #373

Open
gaehlerm opened this issue Jun 18, 2024 · 0 comments

gaehlerm commented Jun 18, 2024

I tried to search a PDF file created by Markdown PDF using pypdf, but it didn't work as expected: in the extracted text, pypdf frequently returned a whitespace after a capital letter.
I think this is a problem with Markdown PDF, as I couldn't reproduce the error with PDF files from other sources, though I did not check extensively.

Here is the pdf file I created with Markdown PDF:
testfile.pdf

Here is the Python script that reproduces the bug (I used pypdf version 4.2.0):

import pypdf

PDF_FILE = "testfile.pdf"

def get_all_text():
    """Extract the text of every page and write it to all_text.txt."""
    all_text = ""

    reader = pypdf.PdfReader(PDF_FILE)
    for page in reader.pages:
        all_text += page.extract_text()

    # Write with an explicit encoding so the output is the same on every platform.
    with open("all_text.txt", "w", encoding="utf-8") as file:
        file.write(all_text)

if __name__ == "__main__":
    get_all_text()

Here is the output; note the stray spaces after some of the capital letters. The output seems to be reproducible.

testfile.md 2024-06-18
1 / 1A Lot Of Capitalized W ords Like S witzerland For Example. Where Is R obert?
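To make the stray spaces easier to spot in longer documents, the extracted text could be scanned for a lone capital letter followed by a space and a lowercase run. This is only a rough sketch: the function name `find_split_capitals` and the regular expression are my own assumptions, not part of pypdf or Markdown PDF, and the heuristic will also flag legitimate one-letter words such as "I am" or "A dog".

```python
import re

# Heuristic: a single capital letter standing alone, then a space, then
# lowercase letters, e.g. "W ords". This pattern is an assumption made for
# illustration; it also matches real one-letter words like "I am".
SPLIT_CAPITAL = re.compile(r"\b[A-Z] [a-z]+")

# The second line of the extracted output shown above.
SAMPLE = (
    "1 / 1A Lot Of Capitalized W ords Like S witzerland "
    "For Example. Where Is R obert?"
)

def find_split_capitals(text):
    """Return every substring that looks like a word split after its capital."""
    return SPLIT_CAPITAL.findall(text)

if __name__ == "__main__":
    for hit in find_split_capitals(SAMPLE):
        print(repr(hit))
```

Running the same check on text extracted from PDFs produced by other tools would give a quick way to compare how often the pattern appears.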
