Question

I'm working with UTF-8 strings. I need to get a slice using byte-based indexes, not char-based ones.

I found references on the web to String#subseq, which is supposed to be like String#[], but for bytes. Alas, it seems not to have made it to 1.9.1.

Now, why would I want to do that? There's a chance I'll end up with an invalid string should I slice in the middle of a multi-byte char. This sounds like a terrible idea.

Well, I'm working with StringScanner, and it turns out its internal pointers are byte-based. I'm open to other options here.

Here's what I'm working with right now, but it's rather verbose:

s.dup.force_encoding("ASCII-8BIT")[ix...pos].force_encoding("UTF-8")

Both ix and pos come from StringScanner, so are byte-based.
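
To make the mismatch concrete, here is a minimal sketch of where the byte offsets come from. The sample string and the two regexes are made up for illustration; only StringScanner#pos and the force_encoding round trip come from the question.

```ruby
require "strscan"

s = "héllo wörld"            # "é" and "ö" each take two bytes in UTF-8
scanner = StringScanner.new(s)

scanner.scan(/\S+/)          # consume "héllo" (5 chars, 6 bytes)
ix = scanner.pos             # byte offset, not a character offset
scanner.scan(/\s+\S+/)       # consume " wörld"
pos = scanner.pos

# Relabel a copy as binary, slice by bytes, then relabel the result as UTF-8.
piece = s.dup.force_encoding("ASCII-8BIT")[ix...pos].force_encoding("UTF-8")

# Plain String#[] treats the same numbers as character offsets.
by_chars = s[ix...pos]
```

With this input, piece includes the leading space while by_chars does not, because the two-byte "é" shifts every later byte offset by one relative to the character offset.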

Answer 1

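You can do this too: s.bytes.to_a[ix...pos].pack("C*").force_encoding("UTF-8"), but that looks even more esoteric to me. (Note that pack is needed rather than join(""): the bytes are integers, so join would concatenate their decimal digits.)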

If you're calling the line several times, a nicer way to do it could be this:

class String
  def byteslice(*args)
    self.dup.force_encoding("ASCII-8BIT").slice(*args).force_encoding("UTF-8")
  end
end

s.byteslice(ix...pos)

Answer 2

Doesn't String#bytes do what you want? It returns an enumerator over the bytes in a string (as numbers, since they might not be valid characters, as you pointed out).

str.bytes.to_a.slice(...)
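
One thing to watch with this approach: the slice is an array of integers, so it still has to be turned back into a string. A small sketch, using a made-up sample string and a made-up byte range:

```ruby
str = "naïve"                   # "ï" is two bytes in UTF-8
bytes = str.bytes.to_a          # array of Integers, one per byte
slice = bytes.slice(2...4)      # the two bytes that encode "ï"

joined = slice.join("")         # decimal digits glued together, not text

# pack("C*") turns the integers back into raw bytes (binary encoding),
# so the result still needs to be relabeled as UTF-8.
rebuilt = slice.pack("C*").force_encoding("UTF-8")
```

Here joined comes out as a string of digits, while rebuilt is the original character again.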

Answer 3

Use this monkeypatch until String#byteslice() is added to Ruby 1.9.

class String
  unless method_defined? :byteslice
    ##
    # Does the same thing as String#slice but
    # operates on bytes instead of characters.
    #
    def byteslice(*args)
      unpack("C*").slice(*args).pack("C*")
    end
  end
end
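
For what it's worth, here is a short sketch of the risk the question mentions, slicing through the middle of a multi-byte character. It assumes a Ruby where String#byteslice is built in (1.9.3 and later), so the guard above skips the monkeypatch; the sample string and ranges are made up for illustration.

```ruby
s = "日本語"                   # three characters, nine bytes in UTF-8

whole  = s.byteslice(3...6)    # exactly the bytes of the second character
broken = s.byteslice(0...4)    # first character plus one stray byte

whole_ok  = whole.valid_encoding?
broken_ok = broken.valid_encoding?
```

The built-in keeps the receiver's encoding, so whole_ok is true and broken_ok is false. Checking valid_encoding? on the result is a cheap way to catch offsets that do not land on a character boundary.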
