How to extract data from pdf file in matlab?

Question

azizullah khan on 19 Sep 2014

https://la.mathworks.com/matlabcentral/answers/155500-how-to-extract-data-from-pdf-file-in-matlab

Commented: Yue Zhao on 30 Jun 2019

I am looking for an algorithm that will extract data from a PDF file. For example, the PDF file contains the text "Account# 29", and I want to extract 29 from it. I have tried pdftotext without success. I also tried fopen(), but that failed as well. If it is possible to extract the data with fopen(), I would prefer that, so please share how. Please share your experience with me. Thanks.

6 comments

José-Luis on 19 Sep 2014

Yes, I have seen it and it doesn't work. In principle, it might work for trivial purposes like changing the font type, but I have no idea what kind of data you are trying to extract.

Writing a robust algorithm is a tall order.

azizullah khan on 20 Sep 2014

Edited: azizullah khan on 20 Sep 2014

Sir, just give me a clue about how this is possible. Suppose the PDF file contains:

Account# 345

I want to capture 345 from it. For example, I could use regexp() to extract only the numbers. Please help me. I have spent a lot of time on this, almost a month, without success. Thanks.
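
As a minimal sketch of the regexp() idea, assuming the text has already been extracted from the PDF into a character vector (the variable names are illustrative only):

```matlab
% Text as it might look after extraction from the PDF (assumed input).
txt = 'Account# 345';

% Capture the digits that follow "Account#", allowing optional whitespace.
tok = regexp(txt, 'Account#\s*(\d+)', 'tokens', 'once');

if isempty(tok)
    accountNumber = NaN;                  % pattern not found
else
    accountNumber = str2double(tok{1});   % convert the captured digits to a number
end
```

regexp() only helps once the text is available as a string. Getting that string out of the PDF is the difficult part.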


Answer 1

Jan on 21 Sep 2014

https://la.mathworks.com/matlabcentral/answers/155500-how-to-extract-data-from-pdf-file-in-matlab#answer_152421

Assume you have a PDF file that, when displayed, contains the string "Account# 345". Several details can impede the extraction of this string:

The contents can be compressed and/or encrypted, such that the string cannot be found in clear text inside the file.
Even without encryption or compression, the text need not be stored contiguously: in a valid PDF, each character can be stored together with its position on the page, so the order in the file does not matter.

Consequently, searching for a string in a PDF is not reliable. For this reason, OCR software is frequently applied to add a layer that contains the contents as searchable strings. But as long as you do not specify any details about your PDFs, we cannot guess whether they contain such strings.
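
A quick way to see these obstacles in practice is to read the raw bytes with fopen() and search them. This is only a sketch: the file name is hypothetical, and the /FlateDecode check is a rough heuristic for compressed streams, not a PDF parser.

```matlab
pdfFile = 'statement.pdf';      % hypothetical file name
found = false;                  % becomes true only if the plain string is present
compressed = false;             % becomes true if a Flate-compressed stream is marked

if exist(pdfFile, 'file') == 2
    fid = fopen(pdfFile, 'r');
    raw = fread(fid, Inf, 'uint8=>char')';   % raw bytes as a row of characters
    fclose(fid);

    % Streams compressed with the Flate filter carry this marker in the file.
    compressed = ~isempty(strfind(raw, '/FlateDecode'));

    % Plain-text search of the raw bytes.
    found = ~isempty(strfind(raw, 'Account# 345'));
end

fprintf('Compressed streams: %d, string found in raw bytes: %d\n', compressed, found);
```

If found stays false although the viewer displays the text, the string is compressed, encrypted, split into positioned pieces, or part of a scanned image, and fopen() alone will not recover it.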

Please note that your problem is not well defined, so suggesting solutions is still guesswork, although you have posted several related questions in this forum. In the end, the main problem is that somebody decided to store the data in PDF files, a format that is not suited to the later extraction of strings. Creating a large and complicated workaround afterwards is inefficient. It would be more stable and faster to obtain the data in a more suitable format, such as a text file.

5 comments

José-Luis on 25 Sep 2014

@Azizullah: I am sorry that you feel the responses have been negative. Let me just say that I think you have grasped neither the magnitude nor the complexity of what you are asking.

Let me try to convey the main difficulties of what you are asking:

PDFs are not text files. They can be images. The individual characters might not be stored contiguously. In your case, you have a scan, so you might not find any strings there at all.
OCR is not a problem that can be solved easily in this forum. There might be some attempts in the File Exchange. Bear in mind that there are entire companies that hire small armies of programmers to do OCR and, to my knowledge, there are no 100% fool-proof solutions.

The easiest solution, by far, is to get the original file from which the PDF was generated, as Jan suggests. Alternatively, even if you have a few hundred documents, it will be faster to type the data in manually than to try to come up with a robust algorithm.

I can only think of one scenario where you could extract the text: when the text you are interested in is in a consistent position. This is what I would do:

Extract portions of your PDF as an image.
Use the ocr function from the Computer Vision System Toolbox to transform that image into a string.

That will work only if the text you are interested in is always in the same position.
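
The two steps above might look like the following sketch. It assumes the page has already been exported to an image file with an external tool (imread does not read PDF files), that the Computer Vision System Toolbox is installed, and that the file name and region coordinates, which are made up here, match your documents.

```matlab
imgFile = 'page1.png';          % hypothetical image exported from the PDF page
roi = [100 50 300 40];          % hypothetical region [x y width height] in pixels;
                                % it must lie inside the image
accountNumber = NaN;            % stays NaN if nothing is recognized

if exist(imgFile, 'file') == 2
    pageImage = imread(imgFile);
    result = ocr(pageImage, roi);            % recognize text inside the region only

    % Take the first run of digits from the recognized text.
    digits = regexp(result.Text, '\d+', 'match', 'once');
    if ~isempty(digits)
        accountNumber = str2double(digits);
    end
end
```

Restricting the OCR to a small region keeps unrelated text out of the result, which is why the consistent position matters.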

Noam Greenboim on 25 May 2015

A possible workaround is to convert the PDF to an Excel file and then import that XLS file into MATLAB. This is a relatively good solution for PDFs that contain tables of data.

If you first take a look at the Excel file, you might find ideas for how to access the data you're interested in.
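
Once the conversion has been done with an external tool, the import could be sketched as follows. The file name is hypothetical, and the layout of the resulting table depends entirely on the converter.

```matlab
xlsFile = 'statement.xlsx';     % hypothetical output of an external PDF-to-Excel converter
T = table();                    % stays empty if the file is missing

if exist(xlsFile, 'file') == 2
    T = readtable(xlsFile);                 % import the first sheet as a table
    disp(T(1:min(5, height(T)), :));        % preview up to five rows
end
```

Looking at the preview shows which column and row hold the value of interest before writing any further extraction code.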

Yue Zhao on 30 Jun 2019

We will get a matrix if we use imread for a picture. How do we get the matrix of the PDF?


Answer 2

mizuki on 25 Apr 2018

https://la.mathworks.com/matlabcentral/answers/155500-how-to-extract-data-from-pdf-file-in-matlab#answer_317017

As of R2017b, the extractFileText function (Text Analytics Toolbox) reads text data from PDF files.
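
A minimal sketch of how this could be applied to the question, assuming the Text Analytics Toolbox is available and the PDF contains a real text layer (not just a scanned image). The file name is hypothetical.

```matlab
pdfFile = 'statement.pdf';      % hypothetical file name
accountNumbers = [];            % stays empty if the file is missing or nothing matches

if exist(pdfFile, 'file') == 2
    % Read the whole document and convert the string to a character vector.
    txt = char(extractFileText(pdfFile));

    % Collect every number that follows "Account#" in the document.
    tokens = regexp(txt, 'Account#\s*(\d+)', 'tokens');
    accountNumbers = cellfun(@(t) str2double(t{1}), tokens);
end
```

For a scanned PDF without a text layer, the extracted text will not contain the string, and an OCR step is still needed.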


Answer 3

Walter Roberson on 25 May 2015

https://la.mathworks.com/matlabcentral/answers/155500-how-to-extract-data-from-pdf-file-in-matlab#answer_180334

Have you looked at http://www.mathworks.com/matlabcentral/answers/151092-how-to-read-pdf-file-in-matlab ?
