tokenize a string keeping delimiters in Python

Is there any equivalent to str.split in Python that also returns the delimiters?

I need to preserve the whitespace layout for my output after processing some of the tokens.

Example:

>>> s = "\tthis is an  example"
>>> print(s.split())
['this', 'is', 'an', 'example']

>>> print(what_I_want(s))
['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

Thanks!

Best answer

How about

import re
splitter = re.compile(r'(\s+|\S+)')
splitter.findall(s)
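
A minimal check of this approach on the question's input, assuming the pattern is meant to match alternating runs of whitespace (\s+) and non-whitespace (\S+). The names TOKEN_RE and tokenize_keep_whitespace are made up for illustration.

```python
import re

# Matches either a run of whitespace or a run of non-whitespace characters.
TOKEN_RE = re.compile(r'\s+|\S+')

def tokenize_keep_whitespace(text):
    """Return alternating whitespace and non-whitespace runs, in order."""
    return TOKEN_RE.findall(text)

s = "\tthis is an  example"
tokens = tokenize_keep_whitespace(s)
print(tokens)  # ['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

# Every character lands in exactly one token, so joining restores the input.
assert "".join(tokens) == s
```

Because nothing is dropped, you can process the word tokens and join the list again to get the original whitespace layout back.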

Answers

>>> re.compile(r'(\s+)').split("\tthis is an  example")
['', '\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

The re module provides this functionality:

>>> import re
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']

(quoted from the Python documentation).

For your example (split on whitespace), use re.split(r'(\s+)', 'This is an example').

The key is to enclose the regex on which to split in capturing parentheses. That way, the delimiters are added to the list of results.

Edit: As pointed out, any leading or trailing delimiters are also added to the list, each with an empty string next to it at that end. To avoid that, call .strip() on the input string first.
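
Here is a small sketch of the difference the capturing parentheses make, and of one alternative to .strip(): filtering out the empty strings, which keeps leading and trailing whitespace as tokens. The helper split_keep_delims is a made-up name, and it assumes the pattern passed in has no capturing groups of its own.

```python
import re

def split_keep_delims(text, pattern=r'\s+'):
    """Split text on pattern, keeping delimiters and dropping empty strings."""
    # Wrapping the pattern in parentheses makes re.split return the delimiters too.
    parts = re.split('(' + pattern + ')', text)
    # A leading or trailing delimiter leaves '' at that end; filter it out.
    return [p for p in parts if p]

text = "\tthis is an  example "
print(re.split(r'\s+', text))    # ['', 'this', 'is', 'an', 'example', '']
print(re.split(r'(\s+)', text))  # delimiters kept, '' at both ends
print(split_keep_delims(text))   # delimiters kept, no empty strings
```

Unlike .strip(), the filter does not throw away the leading tab, which matters if the goal is to rebuild the original layout.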

Have you looked at pyparsing? Example borrowed from the pyparsing wiki (Python 2 syntax):

>>> from pyparsing import Word, alphas
>>> greet = Word(alphas) + "," + Word(alphas) + "!"
>>> hello1 = 'Hello, World!'
>>> hello2 = 'Greetings, Earthlings!'
>>> for hello in hello1, hello2:
...     print (u'%s \u2192 %r' % (hello, greet.parseString(hello))).encode('utf-8')
...
Hello, World! → (['Hello', ',', 'World', '!'], {})
Greetings, Earthlings! → (['Greetings', ',', 'Earthlings', '!'], {})

Thanks, guys, for pointing out the re module. I'm still trying to decide between that and using my own function that returns a sequence...

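def split_keep_delimiters(s, delims="\t\n\r "):
    # Yield alternating runs of delimiter and non-delimiter characters.
    if not s:
        return  # an empty string has no tokens (and s[0] would fail)
    delim_group = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        if delim_group != (char in delims):
            # The character class changed: emit the run that just ended.
            delim_group = not delim_group
            yield s[start:index]
            start = index
    # Emit the final run, which extends to the end of the string.
    yield s[start:]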

If I had time I'd benchmark them xD
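
For anyone who does have the time, a rough sketch of such a benchmark with timeit might look like the following. The function names, the sample text and the repeat count are all arbitrary choices for illustration, and with_generator is an adaptation of the function above. Note that \s can match more characters than the four in delims, so the approaches only agree on input like this sample.

```python
import re
import timeit

_TOKEN_RE = re.compile(r'\s+|\S+')
_SPLIT_RE = re.compile(r'(\s+)')

def with_findall(s):
    return _TOKEN_RE.findall(s)

def with_split(s):
    # Drop the empty strings re.split adds at the ends.
    return [t for t in _SPLIT_RE.split(s) if t]

def with_generator(s, delims="\t\n\r "):
    # Hand-written scan that yields a token whenever the character class flips.
    if not s:
        return
    in_delims = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        if (char in delims) != in_delims:
            in_delims = not in_delims
            yield s[start:index]
            start = index
    yield s[start:]

def benchmark(sample, repeat=20):
    """Return {name: total seconds} for `repeat` runs of each approach."""
    candidates = {
        "findall": lambda: with_findall(sample),
        "split": lambda: with_split(sample),
        "generator": lambda: list(with_generator(sample)),
    }
    return {name: timeit.timeit(fn, number=repeat)
            for name, fn in candidates.items()}

if __name__ == "__main__":
    sample = "\tthis is an  example\n" * 1000
    # Check that all three agree before comparing their speed.
    assert with_findall(sample) == with_split(sample) == list(with_generator(sample))
    for name, seconds in sorted(benchmark(sample).items()):
        print("%-10s %.4f s" % (name, seconds))
```

Timings will depend on the Python version and on the input, so it is worth running this against text that resembles your real data.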




