Why does OCR separate Text into Words?
Hi all,
I am trying to retrieve specific text from scanned documents that report tables of numbers. Since the number of columns in the table can change, I use the following approach:
1 - detection of the units of measure with the OCR function,
2 - from the units I need (for example, kg/kW.h), calculation of a suitable region of interest in which the OCR function is used to retrieve the needed numbers
This works reasonably well, but the OCR function does not behave consistently. In some cases all the units are separated into individual words, while in others several units are grouped together in a single word. The code below, which works with the attached data sample, shows the issue: the 16th element of txt1.Words reports the units '(kg/kW.h)(kW.h/)' rather than two Words (one for '(kg/kW.h)' and the other for '(kW.h/)'), each with its own row in WordBoundingBoxes. I do not understand why the units end up as separate Words in some cases and are merged into a single Word in others. Is it possible to control how the OCR function generates Words?
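clear all
load('test.mat')   % attached data sample, expected to provide the image I
figure
imshow(I)
roi=[250.5 526 1300 142];   % [x y width height] of the row holding the units
Iocr=insertShape(I,'rectangle',roi,'ShapeColor','blue');
hold on
imshow(Iocr)
txt1=ocr(I,roi,CharacterSet=".()kWrpmlhgh/");%,LayoutAnalysis='word');
% Text found between parentheses, one cell of matches per recognized word
UnitString=regexp(txt1.Words,'(?<=\()[\w\.\/]*(?=\))','match');
% Build the mask before deleting entries so it still lines up with WordBoundingBoxes
hasUnit=not(cellfun(@isempty,UnitString));
UnitString=UnitString(hasUnit);
UnitBox=txt1.WordBoundingBoxes(hasUnit,:);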
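For a merged word such as '(kg/kW.h)(kW.h/)', the regexp in the snippet above already returns two matches, but both share a single row of WordBoundingBoxes. As a possible workaround (not a way to control how ocr forms Words), the following is a minimal sketch that gives each unit its own approximate box by splitting the word box horizontally. It assumes that all characters in a word are roughly the same width, which is only an approximation for proportional fonts. The variables words and wordBoxes are made-up stand-ins for txt1.Words and txt1.WordBoundingBoxes, and the names units and unitBoxes are hypothetical.

```matlab
% Made-up stand-ins for txt1.Words and txt1.WordBoundingBoxes ([x y width height]).
words     = {'(kg/kW.h)(kW.h/)'; '(rpm)'; 'kW'};
wordBoxes = [100 20 160 30; 300 20 50 30; 400 20 20 30];

units     = {};            % one entry per parenthesized unit
unitBoxes = zeros(0,4);    % one approximate box per unit

for k = 1:numel(words)
    w = words{k};
    % Token = text between parentheses; s and e index the parentheses themselves.
    [tok, s, e] = regexp(w, '\(([\w./]*)\)', 'tokens', 'start', 'end');
    n = numel(w);
    for j = 1:numel(tok)
        units{end+1,1} = tok{j}{1};
        % ASSUMPTION: equal character widths, so a unit covering characters
        % s(j)..e(j) of an n-character word gets that fraction of the box width.
        x0 = wordBoxes(k,1) + wordBoxes(k,3)*(s(j)-1)/n;
        wd = wordBoxes(k,3)*(e(j)-s(j)+1)/n;
        unitBoxes(end+1,:) = [x0, wordBoxes(k,2), wd, wordBoxes(k,4)];
    end
end
```

With real data, words and wordBoxes would be replaced by txt1.Words and txt1.WordBoundingBoxes. Words without parentheses (such as 'kW' here) produce no entries, so units and unitBoxes stay aligned row by row.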
3 comments
dpb
about 3 hours ago
Edited: dpb
about 1 hour ago
I think, again, you would have to provide both a "good" and a "bad" image for folks to have any chance whatever, as the cause is going to be whatever differs between the two in the particular region of interest.
See the Tips section of the ocr documentation for some hints about changing unexpected or unwanted behavior. Probably the only way to learn much more about the internals is through the references; it looks as though MathWorks is using the open source implementation as their engine. I don't have the toolbox, so I can't do anything here locally, but one last time: without the two images that behave differently, there is nothing anybody who comes by will be able to do.
BTW, it's been a long time since I've looked at the Nebraska tests, but they're a lot of fun to look at and very informative in a decision-making process if looking at a new (to the owner, not just brand new) equipment purchase. We're in the SW corner of KS and such data is a very big deal here, although a final purchase decision may end up relying more upon the quality of the nearby dealerships and who they are, and less on the test data themselves.
Answers (0)