MATLAB Answers

How do I use regexp to extract text between numbers?

Asked by Ean Hendrickson on 9 Nov 2019
Latest activity: edited by per isakson on 9 Nov 2019
I have a string that I extracted from a PDF:
str = "↵↵↵1. Receptacles, general purpose. ↵2. Receptacles with integral GFCI. ↵3. USB Charger receptacles. ↵4. AFCI receptacles. ↵5. Twist-locking receptacles. ↵6. Isolated-ground receptacles. ↵7. Tamper-resistant receptacles. ↵8. Weather-resistant receptacles. ↵9. Pendant cord-connector devices. ↵10. Cord and plug sets. ↵11. Wall box dimmers. ↵12. Wall box dimmer/sensors. ↵13. Wall box occupancy/vacancy sensors. ↵14. Toggle Switches. ↵15. Floor service outlets. ↵16. Associated device plates. ↵↵"
How can I use regexp to extract each numbered description and put them into a 16x1 string array? The end result should look like this:
  1. Receptacles, general purpose.
  2. Receptacles with integral GFCI.
  3. USB Charger receptacles.
  4. AFCI receptacles.
  5. Twist-locking receptacles.
  6. Isolated-ground receptacles.
  7. Tamper-resistant receptacles.
  8. Weather-resistant receptacles.
  9. Pendant cord-connector devices.
  10. Cord and plug sets.
  11. Wall box dimmers.
  12. Wall box dimmer/sensors.
  13. Wall box occupancy/vacancy sensors.
  14. Toggle Switches.
  15. Floor service outlets.
  16. Associated device plates.
I also have this line of code:
parts = regexp(str,'^\d*+.*$','dotexceptnewline','lineanchors');
which finds the index of each number in the string. I think I could then use those index values in a for loop to extract the text between the numbers.
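
The index-based idea described above could be sketched roughly as follows. This is only an illustration under assumptions: the short sample text `txt` and the names `starts`, `stops` and `items` are made up here, and the marker pattern `\d+\. ` would also fire on digits followed by a period and a space inside a description.

```matlab
% Hypothetical short sample; char(8629) is the U+21B5 character.
txt = ['1. Alpha. ' char(8629) '2. Beta. ' char(8629) '10. Gamma. ' char(8629)];

% With no output option, regexp returns the start index of every match.
starts = regexp(txt, '\d+\. ');              % start of each "N. " marker
stops  = [starts(2:end) - 1, numel(txt)];    % each item ends just before the next marker

items = strings(numel(starts), 1);           % preallocate a string column
for k = 1:numel(starts)
    piece    = string(txt(starts(k):stops(k)));   % text of item k, delimiter included
    items(k) = strip(erase(piece, char(8629)));   % remove the delimiter and outer blanks
end
```

The loop works, but asking regexp for the matched text directly usually avoids the index bookkeeping.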

  4 Comments

I tried that, but I could not get it to work.
Is this the exact text of your char array? Or are there actually some char(10) in there?
This is the exact text I extracted from a PDF. There should be no char(10) in there. I used extractFileText, strfind and extractBetween to get the above text.


2 Answers

Answer by per isakson on 9 Nov 2019
Edited by per isakson on 9 Nov 2019

"So the end product I want will be a 16x1 string that looks like" I'm not sure exactly how to understand your requirement.
The problem is the delimiter, which looks a bit like the symbol on my ENTER key (↵). After copying and pasting from your question, the code point of that character is U+21B5, written \x21B5 in a regular expression.
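
Which character actually separates the items can be checked by looking at the numeric codes of the text. The snippet below is a minimal sketch on a made-up three-character sample (`sample`, `codes`, `hasLF` and `hasArrow` are illustrative names). It may be worth running on the real text, because MATLAB's display of a string can show a line feed, char(10), as ↵, so text copied from the Command Window may not match what is stored in the variable.

```matlab
% Hypothetical sample; replace it with the text extracted from the PDF.
sample = "A" + char(8629) + "B";            % char(8629) is U+21B5

codes = double(char(sample));               % numeric code of every character
disp(dec2hex(codes))                        % hex codes, one row per character

hasLF    = any(codes == 10);                % true if a line feed, char(10), is present
hasArrow = any(codes == hex2dec('21B5'));   % true if U+21B5 is present
```

If `hasLF` turns out to be true for the real text, the delimiter to split on is char(10) rather than U+21B5.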
Try
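%% Split on runs of the delimiter and discard empty pieces
z = regexp( str, "\x21B5+", 'split' );           % split at one or more U+21B5 characters
z = strtrim( z );                                % trim leading and trailing whitespace
z( isstring(z) & strlength(z)==0 ) = [];         % delete the empty strings left at both ends
%% Optional: keep only the descriptions
% z = regexp( z, "(?<=\d+\.\x20).+$", 'match', 'once' ); % removes the numbers
out = reshape( z, [],1 );                        % column vector
%% Print one item per line
fprintf( 1, '%s\n', out );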
This prints the following in the Command Window:
1. Receptacles, general purpose.
2. Receptacles with integral GFCI.
3. USB Charger receptacles.
4. AFCI receptacles.
5. Twist-locking receptacles.
6. Isolated-ground receptacles.
....
and
>> out(1:4)
ans =
4×1 string array
"1. Receptacles, general purpose."
"2. Receptacles with integral GFCI."
"3. USB Charger receptacles."
"4. AFCI receptacles."

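The same splitting can also be done without regexp. The sketch below is one possible variant, not part of the answer above: it assumes the items may be separated by either U+21B5 or a real line feed, uses a made-up three-item `str`, and strips the numbering with regexprep at the end (`arrow` and `desc` are illustrative names).

```matlab
arrow = string(char(8629));                 % U+21B5, the pasted delimiter

% Hypothetical sample; one item is separated by a real line feed instead.
str = arrow + arrow + "1. Alpha item. " + arrow + ...
      "2. Beta item. " + newline + "3. Gamma item. " + arrow;

z = split(str, [arrow, string(newline)]);   % split on either delimiter
z = strip(z);                               % trim surrounding whitespace
z(z == "") = [];                            % drop the empty pieces

desc = regexprep(z, '^\d+\.\s*', '');       % descriptions without the numbering
```

Here `z` keeps the numbered items as a string column and `desc` holds only the text after each number.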


Answer by JESUS DAVID ARIZA ROYETH on 9 Nov 2019

str = "↵↵↵1. Receptacles, general purpose. ↵2. Receptacles with integral GFCI. ↵3. USB Charger receptacles. ↵4. AFCI receptacles. ↵5. Twist-locking receptacles. ↵6. Isolated-ground receptacles. ↵7. Tamper-resistant receptacles. ↵8. Weather-resistant receptacles. ↵9. Pendant cord-connector devices. ↵10. Cord and plug sets. ↵11. Wall box dimmers. ↵12. Wall box dimmer/sensors. ↵13. Wall box occupancy/vacancy sensors. ↵14. Toggle Switches. ↵15. Floor service outlets. ↵16. Associated device plates. ↵↵"
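% Match "N. " followed by a run of word characters, whitespace and punctuation that ends in a period.
% Inside the brackets, ",-/" is a range that covers the characters , - . and /
parts = regexp(str,'\d+\. +[.\w,-/\s]+\.','match')'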
parts =
16×1 string array
"1. Receptacles, general purpose."
"2. Receptacles with integral GFCI."
"3. USB Charger receptacles."
"4. AFCI receptacles."
"5. Twist-locking receptacles."
"6. Isolated-ground receptacles."
"7. Tamper-resistant receptacles."
"8. Weather-resistant receptacles."
"9. Pendant cord-connector devices."
"10. Cord and plug sets."
"11. Wall box dimmers."
"12. Wall box dimmer/sensors."
"13. Wall box occupancy/vacancy sensors."
"14. Toggle Switches."
"15. Floor service outlets."
"16. Associated device plates."

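If the item numbers and the descriptions are wanted separately, capture groups with the 'tokens' option are one way to get them. The sketch below is a variant of the pattern above, under assumptions: `txt` is a made-up sample, and the character class uses a literal space instead of `\s`. The reason for that change is that `\s` also matches a line feed, so if the separators were real line feeds rather than U+21B5 the greedy class could run across several items.

```matlab
% Hypothetical sample that mixes both kinds of separators.
txt = ['1. Alpha, one. ' char(8629) '2. Beta-two/three. ' newline '10. Gamma. '];

% Group 1 captures the number, group 2 the description up to its last period.
% The hyphen is placed last in the class so that it is taken literally.
pat = '(\d+)\. +([\w .,/-]+\.)';

tok = regexp(txt, pat, 'tokens');           % cell array with one 1x2 cell per item
tok = vertcat(tok{:});                      % N-by-2 cell array of char vectors

nums = str2double(tok(:,1));                % item numbers as doubles
desc = string(tok(:,2));                    % descriptions as a string column
```

The sample is a char vector, so the tokens come back as nested cells. For the string used in this thread, `char(str)` would give the same kind of input.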