
Regular Expression to detect spaces in a string

Deepak on 14 Oct 2013
Commented: Cedric on 14 Oct 2013
Hello all, I have a string, for example:
string='<abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit not yet decided abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit not yet decided>'
I want to use regexp to get all the white spaces that occur between " and </a>. I have been trying to figure out how to do this with regexp but have not yet found an elegant solution. For example, regexp(string,'(?<="\S*)\s') returns only 2 spaces and not all of them. Could someone help me out?
Thanks a lot
2 Comments
Cedric on 14 Oct 2013
Edited: Cedric on 14 Oct 2013
What do you mean by "spaces"? Is it just white spaces or all characters? If you really meant white spaces, is it their position that you want? If you want characters, what is the purpose? REGEXP can parse the whole tag and extract whatever part you want.
Jan on 14 Oct 2013
There are two " characters in this string. Which one do you mean? Please post the wanted result by editing the question (not as comment or pseudo-answer).


Accepted Answer

Deepak on 14 Oct 2013
Hi Cedric, thanks for the really detailed answer. It really helped. I actually wanted to get the position of white spaces, so the second part of the answer addresses my query. I was hoping to get the white spaces with one regexp without using any other commands like isspace, but I guess that would be complicated. I am not really familiar with tokens. So once again, thanks for your detailed answer.
1 Comment
Cedric on 14 Oct 2013
Edited: Cedric on 14 Oct 2013
Hi Deepak, The issue with counting spaces using regexp is that it's not possible to do it using a simple query. The call to regexp (possibly regexprep) that we would have to use would be much more complicated than doing the whole operation using one call to regexp with a simple pattern and a few additional operations.
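For the narrow case where the anchor text contains no nested tags and no double quotes, a look-ahead alone may be enough to pick out the white spaces in one call. This is a sketch under those assumptions, not part of the original answer, and the string html below is a hypothetical stand-in for the one in the question.

```matlab
% Hypothetical anchor tag (an assumption, not the string from the question).
html = '<a href="dir/Limit not yet decided">dir/Limit not yet decided</a>';

% Match a white-space character only if nothing but non-quote, non-'<'
% characters separates it from the closing </a>. Spaces inside the href
% value are rejected because a " lies between them and </a>.
spacePos = regexp(html, '\s(?=[^"<]*</a>)', 'start');
disp(spacePos)
```

The pattern stops being reliable as soon as the content can hold another tag or a quote, which is where the mask-based approach in the answer below is easier to adapt.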


More Answers (1)

Cedric on 14 Oct 2013
Edited: Cedric on 14 Oct 2013
Here is an example assuming that you want characters between " and </a> and not only white spaces:
>> s = regexp( html, '(?<=")[^"]+(?=</a>)', 'match' )
s =
'>Mathworks' '>Google'
Look-arounds are treacherous in this type of situation, where the expression in the look-behind can appear multiple times before the expression in the look-ahead is found. The following example illustrates it:
>> s = regexp( html, '(?<=").+?(?=</a>)', 'match' )
s =
'http://www.mathworks.com">Mathworks' 'http://www.google.com">Google'
where we see that the "smallest possible match" fails despite the lazy .+?. Let me know if you want to understand why, or see the example/discussion between Per and me here.
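One way to see why the lazy quantifier does not help is to compare where the two patterns start matching. The sketch below assumes a single-anchor html string, since the thread does not show the one actually used. Laziness only shortens the right-hand end of a match; the left-hand end is fixed by the first position at which the whole pattern can succeed.

```matlab
% Hypothetical input with a single anchor (the thread does not show html).
html = '<a href="http://www.mathworks.com">Mathworks</a>';

% The engine tries start positions from left to right. The first position
% preceded by a " is the start of the URL, and .+? can reach </a> from
% there because . also matches ", so the match begins at the URL.
[lazyMatch, lazyStart] = regexp(html, '(?<=").+?(?=</a>)', 'match', 'start');

% [^"]+ cannot cross the second ", so the attempt at the URL fails and the
% engine moves on to the position just after the second ".
[negMatch, negStart] = regexp(html, '(?<=")[^"]+(?=</a>)', 'match', 'start');

fprintf('lazy:    %s (start %d)\n', lazyMatch{1}, lazyStart);
fprintf('negated: %s (start %d)\n', negMatch{1}, negStart);
```

With this input the lazy pattern should start at position 10 (the h of http) and the negated class at position 35 (the > after the closing quote).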
Note that using tokens is generally more efficient than using look-arounds:
>> s = regexp( html, '"([^"]+)</a>', 'tokens' ) ;
>> celldisp(s)
s{1}{1} =
>Mathworks
s{2}{1} =
>Google
Back to the initial question, the pattern could be more specific though if you wanted to extract the content or the value of the href parameter, e.g.
>> s = regexp( html, '[^>]+(?=</a>)', 'match' )
s =
'Mathworks' 'Google'
Or
>> s = regexp( html, 'href.+?"([^"]*)', 'tokens' ) ;
>> celldisp(s)
s{1}{1} =
http://www.mathworks.com
s{2}{1} =
http://www.google.com
Or
>> s = regexp( html, 'href.+?"(?<href>[^"]*).*?>(?<content>.*?)</a>', 'names' )
s =
1x2 struct array with fields:
href
content
>> s(1)
ans =
href: 'http://www.mathworks.com'
content: 'Mathworks'
>> s(2)
ans =
href: 'http://www.google.com'
content: 'Google'
All these approaches can be fine-tuned/complex-ified for managing a broader set of cases, e.g. when there is a tag in the content of the anchor tag.
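As a rough illustration of the nested-tag case, the sketch below reuses the named-token pattern on a hypothetical anchor whose content holds a <b> tag, then strips the inner markup with regexprep. The input string and the variable plainText are assumptions made for the example.

```matlab
% Hypothetical anchor whose content contains another tag.
html = '<a href="http://www.mathworks.com"><b>Math</b>works</a>';

% Same named-token pattern as above: content is captured lazily up to </a>,
% so it still contains the inner <b>...</b> markup.
s = regexp(html, 'href.+?"(?<href>[^"]*).*?>(?<content>.*?)</a>', 'names');

% Drop anything that looks like a tag from the captured content.
plainText = regexprep(s.content, '<[^>]*>', '');
disp(plainText)
```

This only removes markup of the form <...>. It would not cope with a nested anchor, where the first </a> closes the inner tag rather than the outer one.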
EDIT: if you really want to get the position of white spaces, your expression does work but not as you thought. It actually matches
'"abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit '
and
'">abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit '
which both start with a " followed by non-whitespace characters until after the t of Limit. One thing that you could do, if you wanted to keep the pattern simple, is to get the starting and ending positions of the relevant sub-string:
>> [mStart, mStop] = regexp( html, '(?<=")[^"]+(?=</a>)', 'start', 'end' )
mStart =
76
mStop =
132
and use them to mask a logical index of the white-space positions:
>> isSpace = html == ' ' ;
>> isSpace(1:mStart-1) = false ;
>> isSpace(mStop+1:end) = false ;
>> find( isSpace )
ans =
117 121 125
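The same masking idea can be written so that it also copes with several matches in one string. The version below is a sketch on a hypothetical input; it swaps in isspace for the comparison with ' ' so that tabs and newlines are counted as well, which is a choice made here rather than something stated in the thread.

```matlab
% Hypothetical input (an assumption; the original html is not shown).
html = '<a href="dir/Limit not yet decided">dir/Limit not yet decided</a>';

% Start and end of every stretch between a " and </a>.
[mStart, mStop] = regexp(html, '(?<=")[^"]+(?=</a>)', 'start', 'end');

% Mark the positions covered by any match.
inMatch = false(size(html));
for k = 1:numel(mStart)
    inMatch(mStart(k):mStop(k)) = true;
end

% Keep only the white spaces that fall inside a match.
spacePos = find(isspace(html) & inMatch);
disp(spacePos)
```

Building the mask from the match ranges, instead of clearing everything outside one range, keeps the code unchanged when the string holds more than one anchor.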
