Read csv strings, keep or create surrounding whitespace

I have a list of stop words that currently exists as a comma-separated list in a .txt file. The goal is to use that list to remove those words from some target text, but only when a given word (e.g. "and") appears by itself - remove "and", but don't make "sand" into "s". To that end, I tried manually putting spaces around all the words in the list, so "a,able,about" became " a , able , about ". However, the textscan function stripped the spaces out. Is there a way to prevent it from doing that? Alternatively, if I use the original form of the list, can I tell textscan to surround each string with spaces?

1 Comment

Cedric on 20 Jun 2014
Edited: Cedric on 20 Jun 2014
Could you give an example, like a sample file, and indicate precisely what you want to achieve? This seems to be a task for REGEXPREP.

Accepted Answer

Cedric on 20 Jun 2014
Edited: Cedric on 20 Jun 2014
Here is an example that I can refine if you provide more information. It rewrites some keywords in upper case:
key = {'lobster', 'and'} ;
str = 'Lobster anatomy includes the cephalothorax which fuses the head and the thorax, both of which are covered by a chitinous carapace, and the abdomen. The lobster''s head bears antennae, antennules, mandibles, the first and second maxillae, and the first, second, and third maxillipeds. Because lobsters live in a murky environment at the bottom of the ocean, they mostly use their antennae as sensors.' ;
for kId = 1 : length( key )
    pat = sprintf( '(?<=\\W?)%s(?=(s |\\W))', key{kId} ) ;
    str = regexprep( str, pat, upper( key{kId} ), 'ignorecase' ) ;
end
Running this, you get
>> str
str =
LOBSTER anatomy includes the cephalothorax which fuses the head AND the thorax, both of which are covered by a chitinous carapace, AND the abdomen. The LOBSTER's head bears antennae, antennules, mandibles, the first AND second maxillae, AND the first, second, AND third maxillipeds. Because LOBSTERs live in a murky environment at the bottom of the ocean, they mostly use their antennae as sensors.
The REGEXP-based approach makes it possible to encode conditions such as:
  • only if framed by non-alphanumeric characters (e.g. a comma),
  • unless the following character is an 's',
  • unless at the beginning of the string.

21 Comments

Ben on 20 Jun 2014
Could you elaborate on your sprintf formatting? I've done some work in C, but nothing with sprintf, and even then, just basic stuff like %i, %x.
Cedric on 21 Jun 2014
Edited: Cedric on 21 Jun 2014
Most of what is in the formatSpec of SPRINTF is the pattern for REGEXP. I could have implemented it as follows:
pat = ['(?<=\W?)', key{kId}, '(?=(s |\W))'] ;
which is a simple concatenation of two parts of a pattern around the relevant key. Now REGEXP patterns are not trivial to build, so I would need to know more about what you are doing to refine the pattern.
In short, when the key is 'and' this pattern is
'(?<=\W?)and(?=(s |\W))'
and it tells REGEXP to match any 'and' which is preceded ( '(?<= )' is a positive look behind) by a non-alphanumeric character or nothing (which happens at the beginning of the string) and followed ( '(?=)' is a positive look ahead) by either 's ' or a non-alphanumeric character.
We can simplify it if you don't need that much, or make it more complex if you need to be more specific.
Ben on 21 Jun 2014
I'm writing a function that removes a set of stop words from the given text. The text comes from a cell array of structures, each of which has a text field which is one string. The stop words will be an optional input, but the default will come from a list of common stop words I found online (here), which has the format word,word,word and is also stored in a cell array.
As an example, if the function sees this string and was given the above list of stop words, it should end up looking like this (minus any words I missed). So, for example, it would remove 'the', but leave 'therefore' alone.
Cedric on 21 Jun 2014
Edited: Cedric on 21 Jun 2014
Is it expected that the 'they' in 'they're' is not cleaned?
Ben on 22 Jun 2014
My stop words file doesn't have a 'they're' in it, so it shouldn't be cleaned, no. But I'll probably end up adding words to the file once I get the function working correctly anyway.
Cedric on 22 Jun 2014
Edited: Cedric on 22 Jun 2014
Ok, the following would be a "good enough" starting point I guess:
content = fileread( 'common-english-words.txt' ) ;
keywords = strsplit( content, ',' ) ;
original = fileread( 'ben_text.txt' ) ;
cleaned = original ;
for kId = 1 : length( keywords )
    pattern = ['(?<=(^|\W))', keywords{kId}, '\s*(?![\w''])'] ;
    cleaned = regexprep( cleaned, pattern, '', 'ignorecase' ) ;
end
Applied to the content that you provided, this gives:
>> original
original =
Deep. Man moved earth. Beginning the the itself greater be great creepeth they're creeping. Lesser divided give likeness seas fish set. Unto for and forth day thing. Let Them abundantly creature were replenish male brought. Made said spirit under made fly every light Let very day unto fill that second living image you'll Him given you. Brought he subdue fruit one sixth signs, us one you female female bearing creeping fish replenish midst very. Forth dominion made earth thing, made his fly called you're appear man gathering divide. Gathering place the him may itself image after man land had own.
>> cleaned
cleaned =
Deep. Man moved earth. Beginning itself greater great creepeth they're creeping. Lesser divided give likeness seas fish set. Unto forth day thing. abundantly creature replenish male brought. Made spirit under made fly light very day unto fill second living image you'll given . Brought subdue fruit one sixth signs, one female female bearing creeping fish replenish midst very. Forth dominion made earth thing, made fly called you're appear man gathering divide. Gathering place itself image man land .
The pattern matches keywords which are
  • Preceded by the beginning of the string or a non-"word character".
  • Followed by an optional white-space (so the white-space directly after the keyword is removed as well) and followed by a character which is neither a "word character" nor an apostrophe.
Ben
Something isn't working right. First, I'm using Matlab R2010b, and it doesn't appear to have strsplit ("Undefined function or method 'strsplit' for input arguments of type 'char'."). So I went back to my previous method of reading in the stop words:
fid = fopen('stopWords.txt'); % keep the returned file identifier instead of hard-coding 3
stops = textscan(fid, '%s', 'delimiter', ',');
fclose(fid);
stops = stops{1,1}; % because textscan returns a 1x1 cell array which contains a cell array of the words as its only cell.
Now, when I run my function, the text isn't cleaned. I did some debugging, and after the first time 'pattern' is created, it looks like this:
(?<=(^|\W))a \s*(?![\w'])
The first stop word is a, and it looks like it's getting into the pattern fine, but I notice near the end that there's only one apostrophe following the w, even though in the script there are two there. Is this causing the regexprep to not do anything? If so, why? If not, then what is going wrong?
Cedric on 22 Jun 2014
Edited: Cedric on 22 Jun 2014
Your method is fine. Note that TEXTSCAN can work on the string stored in content
content = fileread( 'common-english-words.txt' ) ;
so you don't need to FOPEN/FCLOSE the file. The pattern
pattern = ['(?<=(^|\W))', keywords{kId}, '\s*(?![\w''])'] ;
contains a double-apostrophe because the apostrophe is the delimiter for strings in MATLAB. Using a double apostrophe is the way to code apostrophes within strings. This is why, when you display the pattern, a single apostrophe is displayed.
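The doubled-apostrophe rule can be checked in isolation. This is only a minimal sketch; the variable names tail and n are illustrative and do not come from the thread:

```matlab
% Inside a single-quoted MATLAB string, '' stands for one literal apostrophe.
tail = '\s*(?![\w''])' ;    % as typed in the script (two apostrophes)
disp( tail )                % displays the stored text, with a single apostrophe
n = sum( tail == '''' ) ;   % number of apostrophes actually stored in the string
```

Here n is 1, which is why the debugger shows only one apostrophe in the pattern.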
The only strange thing that I see in your pattern is the presence of a white space after the a, which shouldn't be there if you are using the text file of keywords that you gave me and the code above involving TEXTSCAN. What do you get when you evaluate
>> fprintf( '-%s-\n', stops{1} ) ;
(using stops or keywords, whichever you chose to keep) just -a- with no space or is there a space as in -a - ? If you can't figure out what is happening, the next step is that we work with one key word and a simple string that we define in the code (no file processing), and we check little by little what is working or not.
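For reference, the TEXTSCAN-on-a-string route can be tried without any file. This is a minimal sketch that assumes a made-up three-word list in place of the real file content:

```matlab
content = 'a,able,about' ;                             % stand-in for the output of fileread
stops = textscan( content, '%s', 'delimiter', ',' ) ;  % parse the string directly, no FOPEN needed
stops = stops{1} ;                                     % unwrap the outer 1x1 cell
fprintf( '-%s-\n', stops{:} ) ;                        % the dashes make any stray space visible
```

Each word should print tightly framed by dashes if no white space was kept.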
Ben on 22 Jun 2014
I just got '-a-'. When I changed the second parameter of regexprep to 'i', it removed all the i's in the input text, regardless of whether they were on their own or contained within some other word. So... that much works.
Cedric
If you run the following
text = 'they they''re athey They THEY they, theyb' ;
pattern = '(?<=(^|\W))they\s*(?![\w''])' ;
regexprep( text, pattern, '', 'ignorecase' )
do you get
ans =
they're athey , theyb
in the command window?
Ben on 22 Jun 2014
I do.
Cedric on 22 Jun 2014
Edited: Cedric on 22 Jun 2014
Ok, so there is no apparent difference between our regexp engines. Now if you copy-paste the following in your command window as it is, is it working?
%content = fileread( 'common-english-words.txt' ) ;
content = 'a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your' ;
keywords = regexp( content, ',', 'split' ) ;
%original = fileread( 'ben_text.txt' ) ;
original = 'Deep. Man moved earth. Beginning the the itself greater be great creepeth they''re creeping. Lesser divided give likeness seas fish set. Unto for and forth day thing. Let Them abundantly creature were replenish male brought. Made said spirit under made fly every light Let very day unto fill that second living image you''ll Him given you. Brought he subdue fruit one sixth signs, us one you female female bearing creeping fish replenish midst very. Forth dominion made earth thing, made his fly called you''re appear man gathering divide. Gathering place the him may itself image after man land had own.' ;
cleaned = original ;
for kId = 1 : length( keywords )
    pattern = ['(?<=(^|\W))', keywords{kId}, '\s*(?![\w''])'] ;
    cleaned = regexprep( cleaned, pattern, '', 'ignorecase' ) ;
end
original
cleaned
Ben on 22 Jun 2014
That does work. It looks like it leaves spaces in, which doesn't matter to me.
Cedric on 22 Jun 2014
Edited: Cedric on 22 Jun 2014
So there is a problem with your previous attempts. Compare your code with mine, and if you can't find where the problem is, copy-paste your code here.
Spaces are unavoidable at this stage without building a quite complex pattern. If you want to remove them, I would advise you to perform an extra "pass" over the output, replacing each series of spaces with a single space, and spaces before a comma/colon/semicolon/period with an empty string. You can do this with REGEXPREP again, after the loop. If it's important to you, try to implement it and let me know how it goes.
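One possible shape for that extra pass is sketched below. It is an assumption about what such a cleanup could look like, not code from the thread, and the sample string is made up:

```matlab
cleaned = 'given . Brought  subdue fruit , one land .' ;  % made-up sample with leftover spaces
cleaned = regexprep( cleaned, ' +', ' ' ) ;               % series of spaces -> single space
cleaned = regexprep( cleaned, ' +([,.;:])', '$1' ) ;      % drop spaces before , . ; :
cleaned = strtrim( cleaned ) ;                            % drop leading/trailing white space
```

On this sample the result is 'given. Brought subdue fruit, one land.'.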
Ben on 22 Jun 2014
Unfortunately, I can't exactly copy and paste, since I'm at home on a Windows machine, tunneling in to a Linux box at school. However, I copied the pattern definition from previously within Matlab and pasted it into my script. Now, when the debugger hits the breakpoint after that definition, it says pattern is (?<=(^|\W))a\s*(?![\w']), which I believe is correct. But, it's not removing the stop words from the text I feed it.
Cedric on 22 Jun 2014
Edited: Cedric on 23 Jun 2014
If my code works but not yours, we already established that it's not your system. Now I cannot debug your code without seeing it. There is no isolated 'a' in the text that you gave me; if it doesn't work with e.g. 'the', I would have to see at least the two lines of code where you define the pattern and where you call REGEXPREP.
Ben on 23 Jun 2014
Ok, I've attached a screen shot of the script editor. Let me know if you need more.
Cedric
Ok, the variable that stores the output of REGEXPREP should be the same variable that you pass as its first argument, so that the text is cleaned a little more (by one word) at each iteration.
You should have something like the following (whatever name you finally choose for the variable that contains the text):
for ...
    currentReview = ...
    textToReview = lower( ...
    for ...
        pattern = ...
        textToReview = regexprep( textToReview , ...
    end
    cleanedText{1,i} = textToReview ;
end
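The difference can be reproduced on a toy input. This is a minimal sketch with a made-up two-word stop list; the names buggy and fixed are illustrative, and the real script is only visible in the screenshot:

```matlab
stopWords = {'the', 'and'} ;          % made-up stop list
text = 'the cat and the dog' ;        % made-up input

% Problem: every iteration starts again from the untouched input,
% so only the last stop word ends up removed.
for k = 1 : length( stopWords )
    pattern = ['(?<=(^|\W))', stopWords{k}, '\s*(?![\w''])'] ;
    buggy = regexprep( text, pattern, '', 'ignorecase' ) ;
end

% Fix: feed the result of one iteration into the next one.
fixed = text ;
for k = 1 : length( stopWords )
    pattern = ['(?<=(^|\W))', stopWords{k}, '\s*(?![\w''])'] ;
    fixed = regexprep( fixed, pattern, '', 'ignorecase' ) ;
end

buggyWords = regexp( buggy, '\w+', 'match' ) ;   % 'the' is still present
fixedWords = regexp( fixed, '\w+', 'match' ) ;   % only 'cat' and 'dog' remain
```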
Cedric on 23 Jun 2014
Edited: Cedric on 23 Jun 2014
Note that if you really have to clean punctuation, white spaces, and any other special characters (as it seems to be indicated in the function help of your screenshot), the following may be faster (if not much faster) than the current, iterative version (but you'd have to profile both to check):
words = regexp( textToReview, '[-\w'']+', 'match' ) ;
ism = ismember( words, stopWords ) ;
cleanedText{1,i} = sprintf( '%s ', words{~ism} ) ;
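The same three steps can be tried on their own with concrete data. This is a minimal sketch; the input text and the two-word stop list are made up, and strtrim is added here only to drop the trailing space that the '%s ' format leaves behind:

```matlab
stopWords = {'the', 'and'} ;                                     % made-up stop list
textToReview = lower( 'The cat and the dog''s bowl, well-made.' ) ;
words = regexp( textToReview, '[-\w'']+', 'match' ) ;            % tokens: letters, digits, _, - and '
ism = ismember( words, stopWords ) ;                             % true where a token is a stop word
result = strtrim( sprintf( '%s ', words{~ism} ) ) ;              % rejoin the surviving tokens
```

Note that this route discards punctuation as well, so it only fits if that is acceptable for the cleaned text.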
Ben on 23 Jun 2014
Ah, I hadn't realized that regexp functions don't do their work all at once, as strrep does. That should do it. Thank you so much!
Cedric on 23 Jun 2014
Edited: Cedric on 23 Jun 2014
You're welcome! Note that it could do its job all at once if you passed a pattern that contains all keywords in an OR operation. Yet it's often more efficient to apply a simple pattern several times than to pass an extra-long/complex one once. That could/should be profiled for your specific case, though, if you wanted to optimize.
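A single-pass variant along those lines might look like the sketch below. It is an assumption about how the OR pattern could be built, using a made-up keyword list and sentence; SPRINTF is used for joining because STRSPLIT was reported missing in R2010b:

```matlab
keywords = {'a', 'and', 'the'} ;                  % made-up short list
original = 'The cat and a dog sat on the sand.' ; % made-up input
alternation = sprintf( '%s|', keywords{:} ) ;     % builds 'a|and|the|'
alternation(end) = [] ;                           % drop the trailing '|'
pattern = ['(?<=(^|\W))(', alternation, ')\s*(?![\w''])'] ;
cleaned = regexprep( original, pattern, '', 'ignorecase' ) ;
remaining = regexp( cleaned, '\w+', 'match' ) ;   % words that survived the single pass
```

Keywords that contain regexp metacharacters would need escaping first, for example with regexptranslate( 'escape', ... ). Whether this is faster than the loop would have to be profiled.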

Asked: Ben on 20 Jun 2014
Edited: 23 Jun 2014
