How do I modify PDF without a library using C# and stream it back to client in ASP.NET?

I'm having an issue where I'm corrupting a PDF, and I'm not sure of a proper solution. I've seen several posts from people trying to do a basic stream or trying to modify the file with a third-party library. This is how my situation differs...

I have all the web pieces in place to get me the PDF streamed back and it works fine until I try to modify it with C#.

  1. I've modified the PDF manually in a text editor to remove the <> entries and tested that the PDF functions properly after that.

  2. I've then programmatically read the PDF in as a byte[] from the database, converted it to a string, and used a RegEx to find and remove the same content I removed manually.

  3. THE PROBLEM! When I try to convert the modified PDF string contents back into a byte[] to stream back, the PDF encoding no longer seems to be correct. What is the correct encoding?

Does anyone know the best way to do something like this? I'm trying to keep my solution as light as possible because our site is geared towards PDF document access, so heavy or complex APIs are not preferable unless no other options are available. Also, because this situation only arises when our users view the file in an iframe for "preview", I can't permanently modify the PDF.

Thanks for your help in advance!

Best answer

Try using the following BinaryEncoding class as the encoding. It simply casts every byte to a char (and back), so only the ASCII data can be processed meaningfully as a string, but the rest of the data is kept unchanged. Nothing is lost as long as you don't introduce any Unicode characters above 0x00FF into the string, so it should work just fine for your round trip.

using System;
using System.Text;

public class BinaryEncoding : Encoding {
    private static readonly BinaryEncoding @default = new BinaryEncoding();

    public static new BinaryEncoding Default {
        get {
            return @default;
        }
    }

    public override int GetByteCount(char[] chars, int index, int count) {
        if (chars == null) {
            throw new ArgumentNullException("chars");
        }
        return count;
    }

    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex) {
        if (chars == null) {
            throw new ArgumentNullException("chars");
        }
        if (bytes == null) {
            throw new ArgumentNullException("bytes");
        }
        if (charCount < 0) {
            throw new ArgumentOutOfRangeException("charCount");
        }
        unchecked {
            for (int i = 0; i < charCount; i++) {
                bytes[byteIndex+i] = (byte)chars[charIndex+i];
            }
        }
        return charCount;
    }

    public override int GetCharCount(byte[] bytes, int index, int count) {
        if (bytes == null) {
            throw new ArgumentNullException("bytes");
        }
        return count;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex) {
        if (bytes == null) {
            throw new ArgumentNullException("bytes");
        }
        if (chars == null) {
            throw new ArgumentNullException("chars");
        }
        if (byteCount < 0) {
            throw new ArgumentOutOfRangeException("byteCount");
        }
        unchecked {
            for (int i = 0; i < byteCount; i++) {
                chars[charIndex+i] = (char)bytes[byteIndex+i];
            }
        }
        return byteCount;
    }

    public override int GetMaxByteCount(int charCount) {
        return charCount;
    }

    public override int GetMaxCharCount(int byteCount) {
        return byteCount;
    }
}
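As a rough illustration of why the choice of encoding matters for this round trip, the sketch below compares two built-in encodings instead of the custom class above. It assumes the built-in ISO-8859-1 encoding, which also maps each byte 0x00-0xFF to the char with the same code point. The `EncodingRoundTrip` class and the sample bytes are hypothetical and not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text;

public static class EncodingRoundTrip
{
    // Decodes the bytes to a string, then encodes that string back to bytes.
    public static byte[] RoundTrip(byte[] data, Encoding encoding)
    {
        return encoding.GetBytes(encoding.GetString(data));
    }

    public static void Main()
    {
        // Illustrative data only: "%PDF" followed by four bytes above 0x7F,
        // standing in for the binary content a real PDF contains.
        byte[] sample = { 0x25, 0x50, 0x44, 0x46, 0xE2, 0xE3, 0xCF, 0xD3 };

        // Assumption: ISO-8859-1 is used here as a built-in one-to-one
        // byte <-> char mapping, in place of the BinaryEncoding class.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        // Every byte survives the trip through a string.
        Console.WriteLine(RoundTrip(sample, latin1).SequenceEqual(sample));        // True

        // UTF-8 replaces the invalid byte sequences, so the bytes change.
        Console.WriteLine(RoundTrip(sample, Encoding.UTF8).SequenceEqual(sample)); // False
    }
}
```

Whichever one-to-one encoding is used, the same instance has to be used for both directions, and the edit itself must not insert characters above 0x00FF.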
Answers

You seem to be discovering that...

the PDF format is not trivial!

While it may be OK (yet kludgey) to patch a few "text" bytes in situ (i.e. keeping size and structure unchanged), "messing" with PDF files much more than that typically ends up breaking them. Regular expressions certainly seem to be a blunt tool for the job.
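One way to read "keeping size and structure unchanged" is to overwrite the unwanted text with the same number of spaces instead of deleting it, so every byte offset after the patch stays where it was. The sketch below is only an illustration under that assumption: the `InPlacePatch` class, the `BlankOut` helper, the pattern, and the sample data are all hypothetical, and whether spaces are acceptable at a given position depends on the PDF structure around it. A pattern can also match inside binary stream data by accident, so it needs to be as specific as possible.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class InPlacePatch
{
    // One-to-one byte <-> char mapping so binary content survives untouched.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Hypothetical helper: overwrites each regex match with spaces of the same
    // length, so the total size and all later byte offsets are unchanged.
    public static byte[] BlankOut(byte[] data, string pattern)
    {
        string text = Latin1.GetString(data);
        string patched = Regex.Replace(text, pattern, m => new string(' ', m.Length));
        return Latin1.GetBytes(patched);
    }

    public static void Main()
    {
        // Illustrative data only, not a valid PDF.
        byte[] input = Latin1.GetBytes("1 0 obj /Demo (secret) endobj");
        byte[] output = BlankOut(input, @"\(secret\)");

        Console.WriteLine(output.Length == input.Length); // True
        Console.WriteLine(Latin1.GetString(output));
    }
}
```

Removing the text outright, as described in the question, shifts everything after the match, which is the kind of structural change this answer warns about.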

A PDF file needs to be parsed and seen as a hierarchical collection of objects (and then some...), and that's why we need libraries that encapsulate the knowledge about the format.

If you need convincing, you may peruse the specification for the PDF format (version 1.7), now an ISO standard, which is available for free on Adobe's web site. BTW, these 750 pages cover only the latest version; while there's much overlap, previous versions introduce yet another layer of details to contend with...

Edit:
That said, on re-reading the question and Lucero's remark, the changes indicated do seem small/safe enough that a "snip and tuck" approach may work.
Beware that this type of approach may lead to issues over time (when the format encountered is a different version, older or newer, or when the file content somehow causes different structures to be exposed, or...) or with some specific uses (for example, it may prevent users from using some features of PDF documents such as forms or security). Maybe a compromise is to learn enough about the format(s) at hand to confirm that the changes are indeed safe.

Also... while the PDF format is a relatively complicated affair, the libraries that deal with it are not necessarily heavy, and they are typically easy to use.

In short, you'll need to weigh the benefits and drawbacks of both approaches and pick accordingly ;-) (how was that for a "non-answer"?).

Look into iText. There is a reason why things like the Apache Commons libraries exist.




