Transaction Data in HTML file

Hello, I have a large number of HTML files in a directory, named transactionData1.html, transactionData2.html, and so on. The transaction information is buried in these files, and the fields of interest look like this:
<b>Customer Name: </b>Michael Henesi<br />
... (some other stuff)
<b>Transaction ID:</b> 21987335670
The transaction ID varies in length and is sometimes missing (the field is empty). Sometimes there are multiple transactions. Sometimes the transaction ID is written as:
<b>Transaction ID: </b>21987335670
that is, the space in front of the ID number moves to just after the colon.
In some HTML files, both the Customer Name and the Transaction ID are missing.
The objective is to collect all the Transaction IDs, along with the Customer Names, from all the files in the directory into one text file. How can this be done?

2 Comments

per isakson on 15 May 2020
You increase your chance to get a useful answer if you upload a few html-files that represent the cases.
  • In some HTML files, both, the Customer Name and the Transaction ID information is missing.
  • The transaction ID has varying length and is sometimes not available
  • Sometimes there are multiple transactions.
  • etc.
v k on 15 May 2020
Sure. I have attached one such file (in text format).
In the transaction ID field, sometimes it is just blank.


Accepted Answer

per isakson on 15 May 2020
Edited: per isakson on 15 May 2020
This is a start
%%
sad = dir( 'd:\m\cssm\transData*.txt' );    % list the input files
len = length( sad );
out = cell( len, 2 );                       % one row per file: {name, ID}
for jj = 1 : len
    chr = fileread( fullfile( sad(jj).folder, sad(jj).name ) );
    % two capture groups: the customer name and the transaction ID
    xpr = '<b>Customer Name: <\/b>([^<]+).+<b>Transaction ID:<\/b>\x20*(\d+)';
    cac = regexp( chr, xpr, 'tokens' );
    if not( isempty( cac{1} ) )
        out(jj,:) = cac{:};
    end
end
out
It outputs
out =
1×2 cell array
{'Wee Lu'} {'8299045'}
>>
In response to comment
%%
sad = dir( 'd:\m\cssm\transData*.txt' );
len = length( sad );
out = cell( len, 2 );
for jj = 1 : len
    chr = fileread( fullfile( sad(jj).folder, sad(jj).name ) );
    % * instead of + lets either captured field be empty
    xpr = '<b>Customer Name: <\/b>([^<]*).*<b>Transaction ID:<\/b>\x20*(\d*)';
    cac = regexp( chr, xpr, 'tokens' );
    if not(isempty( cac{1}{1} )) && not(isempty( cac{1}{2} ))
        out(jj,:) = cac{1};          % both fields present
    elseif not(isempty( cac{1}{1} ))
        out(jj,1) = cac{1}(1);
        out(jj,2) = {'-99'};         % dummy value for a missing ID
    elseif not(isempty( cac{1}{2} ))
        out(jj,1) = {'---'};         % dummy value for a missing name
        out(jj,2) = cac{1}(2);
    else
        out(jj,1) = {'---'};
        out(jj,2) = {'-99'};
    end
end
out
outputs
out =
2×2 cell array
{'Wee Lu' } {'8299045'}
{'Lam Soon'} {'-99' }
>>
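The question also mentions files where both labels are absent, the alternative spacing `<b>Transaction ID: </b>`, several transactions per file, and a single text file as the final output. The following is only a sketch of one way to extend the loop above to those cases, not a tested solution. It assumes at most one customer per file; the output name `transactions.txt`, the tab-separated layout, and the variable names are placeholders chosen for illustration. The dummy values `---` and `-99` are the ones used above.

```matlab
% Sketch only: folder, file pattern, and output file name are placeholders
folder = 'd:\m\cssm';
sad    = dir( fullfile( folder, 'transData*.txt' ) );
fid    = fopen( fullfile( folder, 'transactions.txt' ), 'w' );
if fid < 0
    error( 'Could not open the output file.' );
end

% \x20* on both sides of </b> accepts either spacing variant
namexpr = '<b>Customer Name:\x20*<\/b>\x20*([^<]*)';
idxpr   = '<b>Transaction ID:\x20*<\/b>\x20*(\d*)';

for jj = 1 : numel( sad )
    chr = fileread( fullfile( sad(jj).folder, sad(jj).name ) );

    % With 'once', regexp returns an empty cell when the label is absent,
    % so test for that before indexing into the result
    tok = regexp( chr, namexpr, 'tokens', 'once' );
    if isempty( tok ) || isempty( strtrim( tok{1} ) )
        name = '---';
    else
        name = strtrim( tok{1} );
    end

    % Without 'once', each Transaction ID label yields one token cell
    toks = regexp( chr, idxpr, 'tokens' );
    if isempty( toks )
        ids = {'-99'};                               % label missing
    else
        ids = cellfun( @(c) c{1}, toks, 'UniformOutput', false );
        ids( cellfun( @isempty, ids ) ) = {'-99'};   % label present, no digits
    end

    % One line per transaction: file name, customer name, transaction ID
    for kk = 1 : numel( ids )
        fprintf( fid, '%s\t%s\t%s\n', sad(jj).name, name, ids{kk} );
    end
end
fclose( fid );
```

Searching for the name and the IDs separately avoids the greedy `.*` between the two labels, which would otherwise skip ahead to the last Transaction ID in a file.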

6 Comments

v k on 15 May 2020
Thank you for the quick response. However, when a field has no data, for example when the Transaction ID field or the Customer Name field is empty,
<b>Transaction ID: </b>
the loop stops with the error below. How can I skip over this error so that the loop keeps going and stores dummy 0s in the Transaction ID field?
Index exceeds the number of array elements (0).
Error in transactionFields (line 11)
if not( isempty( cac{1} ) )
v k on 15 May 2020
Attaching one such case where the Transaction ID field is empty. Thanks.
per isakson on 15 May 2020
See my answer
v k on 16 May 2020
This is awesome. When you wrote "This is a start", I thought it had stumbled at an important hurdle right after starting. But it has now won the race. Just a couple of questions:
1) What do \x20* and (\d*) do?
2) What does ([^<]*) achieve?
I know that these questions are trivial for you, but they are important for me.
Truly, truly, thanks.
per isakson on 16 May 2020
Edited: per isakson on 16 May 2020
Regular expressions are not trivial. They require concentration, attention to detail, and careful reading of the documentation. See regexp, Match regular expression (case sensitive). There is a lot on regular expressions on the Internet, e.g. RegEx101. There are many flavors of regular expressions, and MATLAB has its own, which is nonetheless fairly close to PCRE.
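Answers
  • \x20 stands for a space: the character with hexadecimal value 20.
  • * matches the previous "item" zero or more times.
  • \d matches a digit, so (\d*) captures a run of zero or more digits.
  • () captures tokens: whatever matches inside the parentheses is kept for the output.
  • [^<] matches any character except <, so ([^<]*) captures everything up to the next tag.
The MATLAB documentation explains this much better than I do!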
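A small experiment at the command line may make these pieces easier to see. The strings below are modelled on the snippets in the question and are only illustrative; the variable names are arbitrary.

```matlab
% Sample strings modelled on the question (illustrative only)
s1 = '<b>Customer Name: </b>Michael Henesi<br />';
s2 = '<b>Transaction ID:</b> 21987335670';
s3 = '<b>Transaction ID:</b> ';                 % empty ID field

% ([^<]*) captures everything up to, but not including, the next '<'
name = regexp( s1, '<b>Customer Name: <\/b>([^<]*)', 'tokens', 'once' );
disp( name{1} )            % Michael Henesi

% \x20* skips any spaces, (\d*) captures the digits that follow (possibly none)
xpr = '<b>Transaction ID:<\/b>\x20*(\d*)';
id2 = regexp( s2, xpr, 'tokens', 'once' );
id3 = regexp( s3, xpr, 'tokens', 'once' );
disp( id2{1} )             % 21987335670
disp( isempty( id3{1} ) )  % true: the pattern matched, but the token is empty
```

The last line shows why `*` was needed in place of `+` in the second version of the answer: with `(\d+)` the pattern would not match `s3` at all.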
v k on 16 May 2020
Much appreciation.


Asked: v k on 15 May 2020
