Quickly Search Strings inside PDF files
Michael B
on 11 Apr 2016
Commented: Walter Roberson
on 4 Jan 2023
I have ~25,000 PDF files that I want to classify based on the presence of keywords in their text. I know there's a PDF Toolbox that gives MATLAB an interface for reading PDF text, but because it comes from Sourceforge it is difficult to obtain (this is for work), and its reliance on Java seems likely to make the process very slow, especially when searching so many files. Is there a simpler, faster way to parse these documents if all I want to do is basically run strfind on the text to check for keywords?
7 comments
Walter Roberson
on 4 Jan 2023
For batch extraction, I see the commercial product https://www.qoppa.com/files/pdfstudio/guide/batch-extract-text-from-pdf.htm (which I have never used).
I also see instructions at https://kenbenoit.net/how-to-batch-convert-pdf-files-to-text/ for a free converter. Since those instructions basically involve preparing a file of names and then running a shell script, building the file name list inside MATLAB would not be difficult. Running the converter would be simple on macOS or Linux; on Windows it would take more work.
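A minimal sketch of building that file name list in MATLAB might look like the following. The folder name pdfs and the output file pdf_list.txt are hypothetical, chosen only for illustration, and the format expected by the linked shell script is assumed to be one path per line.

```matlab
% Sketch: write the full path of every PDF in a folder to a list file.
% pdfDir and listFile are hypothetical names, not taken from the thread.
pdfDir   = fullfile(pwd, 'pdfs');
listFile = fullfile(pwd, 'pdf_list.txt');

info = dir(fullfile(pdfDir, '*.pdf'));      % one struct element per PDF
if isempty(info)
    error('No PDF files found in %s.', pdfDir);
end
names = fullfile(pdfDir, {info.name});      % cell array of full paths

fid = fopen(listFile, 'w');
if fid == -1
    error('Could not open %s for writing.', listFile);
end
fprintf(fid, '%s\n', names{:});             % one path per line
fclose(fid);
```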
Accepted Answer
Jan
on 11 Apr 2016
PDFs are designed to guarantee identical output on different machines, whereas you want to build a catalogue of the strings they contain. These two goals do not match well.
What about converting the PDFs with one of the many pdf2text tools and working on the resulting text files? E.g. http://www.foolabs.com/xpdf, http://www.codeproject.com/Articles/14170/Extract-Text-from-PDF-in-C-NET
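As a rough sketch of that workflow, the loop below converts each PDF to text and then checks the text for keywords with strfind. It assumes the pdftotext command-line tool (shipped with Xpdf and Poppler) is installed and on the system path; the folder name and the keywords are hypothetical placeholders.

```matlab
% Sketch: convert each PDF with pdftotext, then test the text for keywords.
% Assumption: pdftotext is installed and reachable from the system path.
pdfDir   = fullfile(pwd, 'pdfs');            % hypothetical folder
keywords = {'invoice', 'contract'};          % hypothetical keywords

info = dir(fullfile(pdfDir, '*.pdf'));
hits = false(numel(info), numel(keywords));  % hits(i,j): file i has keyword j

for i = 1:numel(info)
    pdfFile = fullfile(pdfDir, info(i).name);
    txtFile = [tempname '.txt'];             % temporary output file

    % Quote both paths so names containing spaces survive the shell
    cmd = sprintf('pdftotext "%s" "%s"', pdfFile, txtFile);
    status = system(cmd);
    if status ~= 0 || ~exist(txtFile, 'file')
        warning('Conversion failed for %s', pdfFile);
        continue
    end

    txt = lower(fileread(txtFile));          % case-insensitive comparison
    delete(txtFile);

    for j = 1:numel(keywords)
        hits(i, j) = ~isempty(strfind(txt, lower(keywords{j})));
    end
end
```

Afterwards, {info(any(hits, 2)).name} lists the files that matched at least one keyword. Scanned PDFs without a text layer will convert to empty text and never match, so they would need OCR first.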
More Answers (1)
Sarah Palfreyman
on 30 Apr 2018
1 comment
Benjamin Ehrlich
on 4 Jan 2023
Is there ANY way to effectively speed up textanalytics.internal.pdfparser.extractText?
A single page can take up to 20 seconds... I just want to extract a small section of text.
-Ben