Question

The bounty expires in 7 days. Answers to this question are eligible for a +250 reputation bounty. Karl Knechtel wants to draw more attention to this question:

I suspect, but do not know, that an approach using .partition and .rpartition methods would be faster than anything offered so far (and avoid conditional logic). I'm looking for an answer that implements and explains this, and/or an answer showing timing results for the various approaches suggested.

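I have a Python script that, for various reasons, has a variable that is a fairly large string, say 10 MB long. This string contains multiple lines.

What is the fastest way to remove the first and last lines of this string? Due to the size of the string, the faster the operation, the better; there is an emphasis on speed. The program returns a slightly smaller string, sans the first and last lines.

'\n'.join(s.split('\n')[1:-1]) is the easiest way to do this, but it's extremely slow because the split() function copies the object in memory, and the join() copies it again.

Example string: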

*** START OF DATA ***
data
data
data
*** END OF DATA ***

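Extra credit: have this program not choke if there is no data in between. This is optional, since for my case there shouldn't be a string with no data in between.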

Answer 1

First split at '\n' once, then check whether the string at the last index contains '\n'. If it does, str.rsplit at '\n' once and pick the item at index 0; otherwise return an empty string:

>>> def solve(s):
...     s = s.split('\n', 1)[-1]
...     if s.find('\n') == -1:
...         return ''
...     return s.rsplit('\n', 1)[0]
...
>>> s = '''*** START OF DATA ***
... data
... data
... data
... *** END OF DATA ***'''
>>> solve(s)
'data\ndata\ndata'
>>> s = '''*** START OF DATA ***
... *** END OF DATA ***'''
>>> solve(s)
''
>>> s = '\n'.join(['a'*100]*10**5)
>>> %timeit solve(s)
100 loops, best of 3: 4.49 ms per loop

Or don't split at all: find the index of '\n' from either end and slice the string:

>>> def solve_fast(s):
...     ind1 = s.find('\n')
...     ind2 = s.rfind('\n')
...     return s[ind1+1:ind2]
...
>>> s = '''*** START OF DATA ***
... data
... data
... data
... *** END OF DATA ***'''
>>> solve_fast(s)
'data\ndata\ndata'
>>> s = '''*** START OF DATA ***
... *** END OF DATA ***'''
>>> solve_fast(s)
''
>>> s = '\n'.join(['a'*100]*10**5)
>>> %timeit solve_fast(s)
100 loops, best of 3: 2.65 ms per loop
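
One caveat with solve_fast: when the input contains no newline at all, both find and rfind return -1, so the slice becomes s[0:-1] and drops the last character instead of returning an empty string. A minimal sketch of a guarded variant follows; the name strip_first_last is made up for this illustration and is not part of the answer above.

```python
def strip_first_last(s):
    """Return s without its first and last lines ('' if nothing is in between)."""
    first_nl = s.find('\n')
    if first_nl == -1:
        # No newline at all: the only line is both the first and the last one.
        return ''
    last_nl = s.rfind('\n')
    # If first_nl == last_nl the slice below is empty, which is the desired result.
    return s[first_nl + 1:last_nl]


print(repr(strip_first_last('*** START ***\ndata\ndata\n*** END ***')))  # 'data\ndata'
print(repr(strip_first_last('*** START ***\n*** END ***')))              # ''
print(repr(strip_first_last('single line')))                             # ''
```

The extra comparison costs next to nothing on a large string, since the single slice still dominates the running time.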

Answer 2

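Consider a string s that looks like this:

s = "line1\nline2\nline3\nline4\nline5"

The following code...

s[s.find('\n')+1:s.rfind('\n')]

...produces the output:

'line2\nline3\nline4'

And thus, it is the shortest code to remove the first and the last line of a string. I do not think that the .find and .rfind methods do anything but search for a given string. Try out the speed!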
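
For anyone who wants to try the speed, here is a rough sketch using the standard timeit module. The input (about 10 MB of 80-character lines) and the function names find_rfind and split_join are assumptions made for this illustration; the numbers will vary by machine and Python version.

```python
import timeit

# Synthetic input: roughly 10 MB made of 80-character lines (an assumption).
s = '\n'.join(['x' * 80] * (10_000_000 // 81))


def find_rfind(s):
    # One slice between the first and the last newline.
    return s[s.find('\n') + 1:s.rfind('\n')]


def split_join(s):
    # Split into a list of lines, drop both ends, and join again.
    return '\n'.join(s.split('\n')[1:-1])


# Both approaches must agree before their speed is compared.
assert find_rfind(s) == split_join(s)

for func in (find_rfind, split_join):
    # Best of 3 repeats of 5 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(lambda: func(s), number=5, repeat=3)) / 5
    print(f'{func.__name__}: {best * 1000:.2f} ms per call')
```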

Answer 3

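Another approach is to split the data at newlines and then rejoin everything except the first and last lines:

>>> s = '*** START OF DATA *** \n\
... data\n\
... data\n\
... data\n\
... *** END OF DATA ***'
>>> '\n'.join(s.split('\n')[1:-1])
'data\ndata\ndata'

This works fine with no data:

>>> s = '*** START OF DATA *** \n\
... *** END OF DATA ***'
>>> '\n'.join(s.split('\n')[1:-1])
''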

Answer 4

You can just slice off the first and last lines after splitting. Simple, pythonic.

mydata = '''
data
data
data
'''

for data in mydata.split('\n')[1:-1]:
    print(data)

Answer 5

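At the request of Karl Knechtel (https://stackoverflow.com/users/523612/karl-knechtel), who started the bounty and suggested quickly testing an approach that uses str.partition and str.rpartition against the solutions from the existing answers, I am running the benchmark below.

Note that @papirrin's answer does not return a string and @mindal's answer is a duplicate of @jon's, so neither is included in the test: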
def AshwiniChaudhary_split_rsplit(s):
    s = s.split('\n', 1)[-1]
    if s.find('\n') == -1:
        return ''
    return s.rsplit('\n', 1)[0]

def BenjaminSpiegl_find_rfind(s):
    return s[s.find('\n')+1:s.rfind('\n')]

def jon_split_slice_join(s):
    return '\n'.join(s.split('\n')[1:-1])

def Knechtel_partition_rpartition(s):
    return s.partition('\n')[2].rpartition('\n')[0]

funcs = [
    AshwiniChaudhary_split_rsplit,
    BenjaminSpiegl_find_rfind,
    jon_split_slice_join,
    Knechtel_partition_rpartition
]

s = '\n'.join(['x' * 80] * (10_000_000 // 80))

# Correctness
for n in range(1, 15):
    expect = None
    for f in funcs:
        result = f(s)
        if expect is None:
            expect = result
        else:
            assert result == expect, (n, f.__name__)

# Speed
from time import perf_counter_ns
from statistics import mean, stdev

ts = {f: [] for f in funcs}
for _ in range(10):
    for f in funcs:
        t0 = perf_counter_ns()
        f(s)
        ts[f].append(perf_counter_ns() - t0)
for f in funcs:
    print(f'{f.__name__} {mean(ts[f]) / 1000:.0f}µs ± {stdev(ts[f]) / 1000:.0f}µs')

This outputs, on ATO, the following result:
AshwiniChaudhary_split_rsplit 4304µs ± 764µs
BenjaminSpiegl_find_rfind 1862µs ± 178µs
jon_split_slice_join 31340µs ± 1827µs
Knechtel_partition_rpartition 4270µs ± 166µs

The results show that:

The performance of str.partition and str.rpartition is about equivalent to that of str.split with maxsplit=1, but the approach using str.partition and str.rpartition has a slight edge because these methods guarantee the number of items in the returned sequence and so do not require the if statement that the str.split approach needs to handle the edge case of a single-line input.
Slicing the string at the indices identified by str.find and str.rfind is more than twice as fast as the two approaches above because it copies the large string only once, during the slice, while the other two approaches copy the bulk of the string twice, on top of having to create additional sequence and string objects.
Splitting the string into a large list of small strings is extremely costly due to the number of objects that need to be created.
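
The copy-count argument can be checked, roughly, with the standard tracemalloc module. The sketch below is only an illustration under assumptions: the helper peak_bytes and the shortened function names are invented here, the input is smaller than the 10 MB used in the benchmark, and the exact figures depend on the Python build. If the reasoning holds, the slice should peak near one copy of the string and the partition chain near two.

```python
import tracemalloc


def find_rfind(s):
    return s[s.find('\n') + 1:s.rfind('\n')]


def partition_rpartition(s):
    return s.partition('\n')[2].rpartition('\n')[0]


def peak_bytes(func, s):
    """Peak memory allocated while func(s) runs, as seen by tracemalloc."""
    tracemalloc.start()
    result = func(s)  # keep the result alive until the peak has been read
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


# Smaller than the 10 MB in the benchmark so that the sketch runs quickly.
s = '\n'.join(['x' * 80] * 10_000)

for func in (find_rfind, partition_rpartition):
    print(f'{func.__name__}: peak {peak_bytes(func, s) / len(s):.2f} x len(s)')
```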

Answer 6

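Depending on how your use case will consume the string, the fastest way to remove those lines may be to not remove them at all.

If you plan to access the lines in the string sequentially, you can build a generator that skips the first and last lines while yielding each line as it is consumed, rather than building a new set of copies of all the lines at once.

An ad-hoc way to skip the first and last lines while iterating over the string without generating unnecessary copies is to keep track of three subsequent lines and return only the second one. That way the iteration finishes before it reaches the last line, without needing to know the position of the last line break.

The following function should give you the desired output: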

def split_generator(s):
  # Keep track of start/end positions for three lines
  start_prev = end_prev = 0
  start = end = 0
  start_next = end_next = 0

  nr_lines = 0

  for idx, c in enumerate(s):
    if c == '\n':
      nr_lines += 1

      # Shift the three-line window forward by one line
      start_prev = start
      end_prev = end
      start = start_next
      end = end_next
      start_next = end_next
      end_next = idx

      if nr_lines >= 3:
        yield s[(start + 1) : end]

  # Handle the case when input string does not finish on "\n"
  # (nr_lines is checked first so that an empty string does not raise IndexError)
  if nr_lines >= 2 and s[-1] != '\n':
    yield s[(start_next+1):end_next]

You can test it with:

print("1st example")
for filtered_strs in split_generator('first\nsecond\nthird'):
  print(filtered_strs)

print("2nd example")
for filtered_strs in split_generator('first\nsecond\nthird\n'):
  print(filtered_strs)

print("3rd example")
for filtered_strs in split_generator('first\nsecond\nthird\nfourth'):
  print(filtered_strs)

print("4th example")
for filtered_strs in split_generator('first\nsecond\nthird\nfourth\n'):
  print(filtered_strs)

print("5th example")
for filtered_strs in split_generator('first\nsecond\nthird\nfourth\nfifth'):
  print(filtered_strs)

This generates the output:

1st example
second
2nd example
second
3rd example
second
third
4th example
second
third
5th example
second
third
fourth

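Note that the biggest advantage of this approach is that it creates only one new line at a time and takes virtually no time to produce the first line of output (rather than waiting for all lines to be found before proceeding). Again, this may or may not be useful depending on your use case.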
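
split_generator inspects the string one character at a time in Python code. A variation, sketched here under the assumption that the same line semantics are wanted (a trailing newline ends the last line rather than starting an empty one), lets str.find locate each line break instead. The name iter_middle_lines is invented for this sketch and does not come from the answer.

```python
def iter_middle_lines(s):
    """Lazily yield every line of s except the first and the last."""
    # Position just after the first newline, i.e. the start of the second line.
    start = s.find('\n') + 1
    if start == 0:
        return  # no newline at all: there is only one line
    last_index = len(s) - 1
    while True:
        end = s.find('\n', start)
        # No further newline, or only the terminating one: the current
        # line is the last line of the string, so it is skipped.
        if end == -1 or end == last_index:
            return
        yield s[start:end]
        start = end + 1


for line in iter_middle_lines('first\nsecond\nthird\nfourth\n'):
    print(line)
```

Each yielded line is still a new small string, but the scan for line breaks happens inside str.find rather than in a Python-level loop over characters.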

Answer 7


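This answer addresses the bounty question from Karl Knechtel (https://stackoverflow.com/users/523612/karl-knechtel). I'm reproducing the bounty text here because it will disappear (https://meta.stackoverflow.com/q/379247/674039) after the bounty period ends:

I suspect, but do not know, that an approach using .partition and .rpartition methods would be faster than anything offered so far (and avoid conditional logic). I'm looking for an answer that implements and explains this, and/or an answer showing timing results for the various approaches suggested.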

Setup:

import random
import string

size = 10 * 1024 * 1024  # 10 mb
population = string.ascii_lowercase + "\n"
middle = "".join(random.choices(population, k=size))

# sample data
s = f"""\
*** START OF DATA ***
{middle}
*** END OF DATA ***"""


def naive_split_join(s):
    """https://stackoverflow.com/q/28134319"""
    return "\n".join(s.split("\n")[1:-1])


def solve(s):
    """https://stackoverflow.com/a/28134394"""
    s = s.split('\n', 1)[-1]
    if s.find('\n') == -1:
        return ''
    return s.rsplit('\n', 1)[0]


def solve_fast(s):
    """https://stackoverflow.com/a/28134394"""
    ind1 = s.find("\n")
    ind2 = s.rfind("\n")
    return s[ind1 + 1 : ind2]


def partition(s):
    head, sep, s = s.partition("\n")
    s, sep, tail = s.rpartition("\n")
    return s


# correctness test
for f in naive_split_join, solve, solve_fast, partition:
    assert f(s) == middle, f"{f} is incorrect"

Timing:

I'm using Python 3.2.3.

>>> timeit naive_split_join(s)
26.1 ms ± 85.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> timeit solve(s)
931 µs ± 10 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> timeit solve_fast(s)
578 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> timeit partition(s)
925 µs ± 6.95 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

The solve_fast from the earlier answer is much faster than this partition-based approach.
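
Speed aside, the bounty text also asks about avoiding conditional logic. The short sketch below illustrates why the partition-based version needs no if statement: str.partition and str.rpartition always return a 3-tuple, even when the separator is missing. The name strip_partition and the sample strings are made up for this illustration.

```python
def strip_partition(s):
    # partition/rpartition always return a 3-tuple, so indexing is safe
    # even when the string contains no newline.
    return s.partition('\n')[2].rpartition('\n')[0]


print('one line'.partition('\n'))    # ('one line', '', '')
print('last line'.rpartition('\n'))  # ('', '', 'last line')

for text in ('one line', 'first\nlast', 'first\nmiddle\nlast'):
    print(repr(text), '->', repr(strip_partition(text)))
```

All three inputs go through the same code path; the first two simply produce an empty string.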
