Read csv strings, keep or create surrounding whitespace

Question


I have a list of stop words that currently exists as a comma-separated list in a .txt file. The goal is to use that list to remove those words from some target text, but only when a given word (e.g. "and") appears by itself - remove "and", but don't make "sand" into "s". To that end, I tried manually putting spaces around all the words in the list, so "a,able,about" became " a , able , about ". However, the textscan function stripped the spaces out. Is there a way to prevent it from doing that? Alternatively, if I use the original form of the list, can I tell textscan to surround each string with spaces?


Cedric on 20 Jun 2014

Edited: Cedric on 20 Jun 2014

Could you give an example, like a sample file, and indicate precisely what you want to achieve? This seems to be a task for REGEXPREP.


Answer 1

Cedric on 20 Jun 2014

Edited: Cedric on 20 Jun 2014


Here is an example that I can refine if you provide more information. It writes some keywords in upper case.

 key = {'lobster', 'and'} ;
 str = 'Lobster anatomy includes the cephalothorax which fuses the head and the thorax, both of which are covered by a chitinous carapace, and the abdomen. The lobster''s head bears antennae, antennules, mandibles, the first and second maxillae, and the first, second, and third maxillipeds. Because lobsters live in a murky environment at the bottom of the ocean, they mostly use their antennae as sensors.' ;
 for kId = 1 : length( key )
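    % Match the keyword only after the start of the string or a non-word
    % character, and only before 's ' (plural) or a non-word character.
    pat = sprintf( '(?<=(^|\\W))%s(?=(s |\\W))', key{kId} ) ;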
    str = regexprep( str, pat, upper( key{kId} ), 'ignorecase' )  ;
 end

Running this, you get

 >> str
 str =
 LOBSTER anatomy includes the cephalothorax which fuses the head AND the thorax, both of which are covered by a chitinous carapace, AND the abdomen. The LOBSTER's head bears antennae, antennules, mandibles, the first AND second maxillae, AND the first, second, AND third maxillipeds. Because LOBSTERs live in a murky environment at the bottom of the ocean, they mostly use their antennae as sensors.

The REGEXP-based approach makes it possible to code for conditions such as replacing a keyword

only if it is framed by non-alphanumeric characters (e.g. a space or a comma),
unless it is followed by an 's' and a space (e.g. a plural),
unless it is at the beginning of the string.
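
To see these conditions at work on words that merely contain a keyword, here is a small sketch. The test sentence is made up for illustration, and the look-behind is written as `(?<=(^|\W))`, the form used later in this thread, so treat it as one possible way to write the pattern rather than the only one.

```matlab
% Hypothetical test sentence: 'Sand' and 'android' both contain 'and'.
str = 'Sand and android: lobsters and a lobster.' ;
key = {'lobster', 'and'} ;
for kId = 1 : length( key )
    % Look-behind: start of string or a non-word character.
    % Look-ahead: 's ' (plural) or a non-word character.
    pat = ['(?<=(^|\W))', key{kId}, '(?=(s |\W))'] ;
    str = regexprep( str, pat, upper( key{kId} ), 'ignorecase' ) ;
end
disp( str )   % Sand AND android: LOBSTERs AND a LOBSTER.
```

'Sand' and 'android' are left alone because the 'and' inside them is either preceded by a letter or followed by one.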


Cedric on 22 Jun 2014

Edited: Cedric on 22 Jun 2014

Ok, the following would be a "good enough" starting point I guess:

 content  = fileread( 'common-english-words.txt' ) ;
 keywords = strsplit( content, ',' ) ;
 original = fileread( 'ben_text.txt' ) ;
 cleaned  = original  ;
 for kId = 1 : length( keywords )
     pattern = ['(?<=(^|\W))', keywords{kId}, '\s*(?![\w''])'] ;
     cleaned = regexprep( cleaned, pattern, '', 'ignorecase' ) ;
 end

Applied to the content that you provided, this gives:

 >> original
 original =
    Deep. Man moved earth. Beginning the the itself greater be great creepeth they're creeping. Lesser divided give likeness seas fish set. Unto for and forth day thing. Let Them abundantly creature were replenish male brought. Made said spirit under made fly every light Let very day unto fill that second living image you'll Him given you. Brought he subdue fruit one sixth signs, us one you female female bearing creeping fish replenish midst very. Forth dominion made earth thing, made his fly called you're appear man gathering divide. Gathering place the him may itself image after man land had own.
 >> cleaned
 cleaned =
    Deep. Man moved earth. Beginning   itself greater  great creepeth they're creeping. Lesser divided give likeness seas fish set. Unto  forth day thing.   abundantly creature  replenish male brought. Made  spirit under made fly  light  very day unto fill  second living image you'll  given . Brought  subdue fruit one sixth signs,  one  female female bearing creeping fish replenish midst very. Forth dominion made earth thing, made  fly called you're appear man gathering divide. Gathering place  itself image  man land  .

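The pattern matches keywords which are

Preceded by the beginning of the string or a non-"word character".
Followed by optional white-space, and then not followed by a "word character" or an apostrophe (the end of the string also qualifies). Because of backtracking, the white-space after the keyword is removed only when what comes after it is not a word character or an apostrophe; otherwise it is left in place, which is why runs of spaces remain in the output above.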
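
A quick way to check these two rules is to run the same pattern on a short string defined in the code. The sentence and the four-word stop list below are made up for this sketch; only the pattern comes from the code above.

```matlab
% Hypothetical mini stop list and test sentence (not from the real files).
keywords = {'the', 'and', 'he', 'they'} ;
original = 'The sand and the band. They''re the best, and he knows.' ;
cleaned  = original ;
for kId = 1 : length( keywords )
    pattern = ['(?<=(^|\W))', keywords{kId}, '\s*(?![\w''])'] ;
    cleaned = regexprep( cleaned, pattern, '', 'ignorecase' ) ;
end
% 'sand' and 'band' survive (the 'and' in them is preceded by a letter),
% and 'They''re' survives (the keyword is followed by an apostrophe).
disp( cleaned )
```

The result still contains the spaces that surrounded the removed words; collapsing them, e.g. with `regexprep( cleaned, '\s+', ' ' )`, would be a separate step.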

Cedric on 22 Jun 2014

Edited: Cedric on 22 Jun 2014

Your method is fine. Note that TEXTSCAN can work on the string stored in content

content = fileread( 'common-english-words.txt' ) ;

so you don't need to FOPEN/FCLOSE the file. The pattern

pattern = ['(?<=(^|\W))', keywords{kId}, '\s*(?![\w''])'] ;

contains a double-apostrophe because the apostrophe is the delimiter for strings in MATLAB. Using a double apostrophe is the way to code apostrophes within strings. This is why, when you display the pattern, a single apostrophe is displayed.
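
As a small illustration of the doubling rule (the variable names here are only for this sketch), the doubled apostrophe in the source becomes a single character in the stored string:

```matlab
s = 'you''re' ;            % typed with a doubled apostrophe
disp( s )                  % displays: you're
n = length( s ) ;          % 6 characters, the apostrophe counts once
classPart = '[\w'']' ;     % the character class from the pattern
disp( classPart )          % displays: [\w']
```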

The only strange thing that I see in your pattern is the presence of a white-space after the a, which shouldn't be there if you are using the text file of keywords that you gave me and the code above involving TEXTSCAN. What do you get when you evaluate

>> fprintf( '-%s-\n', stops{1} ) ;

(using stops or keywords, whichever you chose to keep)? Just -a- with no space, or is there a space as in -a - ? If you can't figure out what is happening, the next step is that we work with one keyword and a simple string that we define in the code (no file processing), and we check little by little what is working or not.
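
Such a minimal check could look like the sketch below. The short `content` string stands in for the output of FILEREAD, and the test sentence is made up; TEXTSCAN is called directly on the string, with no FOPEN/FCLOSE.

```matlab
% Hypothetical stand-in for: content = fileread( 'common-english-words.txt' ) ;
content = 'a,able,about' ;
C       = textscan( content, '%s', 'Delimiter', ',' ) ;
stops   = C{1} ;                    % cell array of keywords
fprintf( '-%s-\n', stops{1} ) ;     % -a- if there is no stray white-space

% One keyword, one simple string defined in the code.
pattern = ['(?<=(^|\W))', stops{1}, '\s*(?![\w''])'] ;
cleaned = regexprep( 'A cat ate a salad.', pattern, '', 'ignorecase' ) ;
disp( cleaned )   % only the standalone 'A' and 'a' are removed
```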

Cedric on 22 Jun 2014

Edited: Cedric on 22 Jun 2014

Ok, so there is no apparent difference between our regexp engines. Now if you copy-paste the following in your command window as it is, is it working?

 %content  = fileread( 'common-english-words.txt' ) ;
 content = 'a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your' ;
 keywords = regexp( content, ',', 'split' ) ;
 %original = fileread( 'ben_text.txt' ) ;
 original = 'Deep. Man moved earth. Beginning the the itself greater be great creepeth they''re creeping. Lesser divided give likeness seas fish set. Unto for and forth day thing. Let Them abundantly creature were replenish male brought. Made said spirit under made fly every light Let very day unto fill that second living image you''ll Him given you. Brought he subdue fruit one sixth signs, us one you female female bearing creeping fish replenish midst very. Forth dominion made earth thing, made his fly called you''re appear man gathering divide. Gathering place the him may itself image after man land had own.' ;
 cleaned  = original ;
 for kId = 1 : length( keywords )
    pattern = ['(?<=(^|\W))', keywords{kId}, '\s*(?![\w''])'] ;
    cleaned = regexprep( cleaned, pattern, '', 'ignorecase' ) ;
 end
 original
 cleaned

Ben on 23 Jun 2014

Ah, I hadn't realized that regexp functions don't do their work all at once, as strrep does. That should do it. Thank you so much!

Cedric on 23 Jun 2014

Edited: Cedric on 23 Jun 2014

You're welcome! Note that it could do its job all at once if you passed a pattern which contains all keywords in an OR operation. Yet, it's often more efficient to apply a simple pattern several times than to pass an extra-long/complex one once. That could/should be profiled for your specific case though if you wanted to optimize.
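
For reference, a single-pass version could be sketched as follows. The three-word list and the test sentence are made up, and STRJOIN is assumed to be available (it is not in older releases); whether this is faster than the loop is exactly what would need profiling.

```matlab
% Hypothetical short list; the real one would come from the keywords file.
keywords    = {'a', 'and', 'the'} ;
alternation = strjoin( keywords, '|' ) ;          % a|and|the
% Same look-behind/look-ahead as before, wrapped around one OR group.
pattern = ['(?<=(^|\W))(', alternation, ')\s*(?![\w''])'] ;
cleaned = regexprep( 'The cat and a dog.', pattern, '', 'ignorecase' ) ;
disp( cleaned )   % 'The', 'and' and 'a' are removed in one call
```

If a keyword list ever contained characters that are special in regular expressions, each keyword would need escaping first (e.g. with REGEXPTRANSLATE); the plain-letter list used here does not.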
