cross-posted from: https://lemmy.sdf.org/post/24645301

They emailed me a PDF. It opened fine with evince and looked like a simple doc at first. Then I clicked on a field in the form. Strangely, instead of simply populating the field with my text, a PDF note window popped up so my text entry went into a PDF note, which many viewers present as a sticky note icon.

If I were to fax this PDF, the PDF comments would just get lost. So to fill out the form I fed it to LaTeX and used the overpic pkg to write text wherever I choose. LaTeX rejected the file… could not handle this PDF. Then I used the file command to see what I am dealing with:

$ file signature_page.pdf
signature_page.pdf: Java serialization data, version 5

WTF is that? I know PDF supports JavaScript (shitty indeed). Is that what this is? “Java” is not JavaScript, so I’m baffled. Why is java in a PDF? (edit: explainer on java serialization, and some analysis)

My workaround was to use evince to print the PDF to PDF (using a PDF-building printer driver or whatever evince uses), then feed that into LaTeX. That worked.

My question is, how common is this? Is it going to become a mechanism to embed a tracking pixel like corporate assholes do with HTML email?

I probably need to change my habits. I know PDF docs can serve as carriers of copious malware anyway. Some people go to the extreme of creating a one-time use virtual machine with PDF viewer which then prints a PDF to a PDF before destroying the VM which is assumed to be compromised.

My temptation is to take a less tedious approach. E.g. something like:

$ firejail --net=none evince untrusted.pdf

I should be able to improve on that by doing something non-interactive. My first guess:

$ firejail --net=none gs -sDEVICE=pdfwrite -q -dFIXEDMEDIA -dSCALE=1 -o is_this_output_safe.pdf -- /usr/share/ghostscript/*/lib/viewpbm.ps untrusted_input.pdf

output:

Error: /invalidfileaccess in --file--
Operand stack:
   (untrusted_input.pdf)   (r)
Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   false   1   %stopped_push   1990   1   3   %oparray_pop   1989   1   3   %oparray_pop   1977   1   3   %oparray_pop   1833   1   3   %oparray_pop   --nostringval--   %errorexec_pop   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   %array_continue   --nostringval--
Dictionary stack:
   --dict:769/1123(ro)(G)--   --dict:0/20(G)--   --dict:87/200(L)--   --dict:0/20(L)--
Current allocation mode is local
Last OS error: Permission denied
Current file position is 10479
GPL Ghostscript 10.00.0: Unrecoverable error, exit code 1

What’s my problem? Better ideas? I would love it if attempts to reach the cloud could be trapped and recorded to a log file in the course of neutering the PDF.

(note: I also wonder what happens when Firefox opens this PDF considering Mozilla is happy to blindly execute whatever code it receives no matter the context.)
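A guess at the cause of the Ghostscript error, not a confirmed diagnosis: viewpbm.ps is a PostScript helper for viewing PBM images, and it opens the file named after `--` by itself. Ghostscript 9.50 and later run in -dSAFER mode by default, which refuses reads of files that were not passed as regular input, which would explain /invalidfileaccess and "Permission denied". Passing the PDF directly as the input file sidesteps both issues. The sketch below only builds the command line; the function name and file names are placeholders, and it assumes firejail and gs are on PATH.

```python
import shlex


def build_sanitize_cmd(src, dst):
    """Build a firejail + Ghostscript command that re-renders src into dst.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    return [
        "firejail", "--net=none",   # no network inside the sandbox
        "gs",
        "-dSAFER",                  # explicit, although it is the default since 9.50
        "-q",
        "-sDEVICE=pdfwrite",
        "-o", dst,                  # -o also implies -dBATCH -dNOPAUSE
        src,                        # the PDF is the input file, no viewpbm.ps
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected before running it by hand.
    print(shlex.join(build_sanitize_cmd("untrusted_input.pdf", "rewritten.pdf")))
```

Re-rendering through pdfwrite drops a lot of interactive content, but that is not a guarantee that the output is safe.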

  • MajorHavoc@programming.dev
    20 days ago

    For many years malicious PDF files had the shameful honor of being the number one way people’s PCs got infected, and it’s because of bullshit like this.

    “Surprise, here’s some Java code to execute on your personal computer without asking!” isn’t being done by anyone who is actually your ally.

    We’re just discussing how shitty a shitty person has been toward you, at this point. There’s no good pro-social reason to deliver you an app while calling it a document.

    Do we think it’s a virus? Probably not, but maybe. Do we think there’s a tracker? Certainly. The average organization shitty enough to build or use this technology layer has over 500 separate relationships with companies that track you.

    Someone tried to put a tracker in this PDF.

    Whether people like me made it too hard for them is up for analysis.

    I guarantee you that someone tried.

    They’re not good enough at hiding this stuff yet, to feel confident lying about it, so it likely is disclosed in the fine print somewhere, if you’re feeling patient enough to read all of it.

  • sylver_dragon@lemmy.world
    20 days ago

    This could just be a really stupid format, put out by a specific application for creating PDFs, because the original authors didn’t want to pay Adobe (never attribute to malice, that which can be sufficiently explained with stupidity).

    Does pdfinfo give any indication of the application used to create the document? If it chokes on the Java bit up front, can you extract just the PDF from the file and look at that? You might also dig through the PDF a bit using Didier Stevens’s tools, looking for JavaScript or other indicators of PDF fuckery.

    Does the file contain any other Java bytecode? If so, can you pass that through a decompiler?
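As a rough first pass before reaching for dedicated tools, the raw bytes can be scanned for PDF name tokens that tend to accompany scripts, automatic actions and outbound links. This is a minimal sketch, not a replacement for Didier Stevens's pdfid: the token list is illustrative, and a plain byte scan will miss names hidden inside compressed object streams or obfuscated with #xx hex escapes.

```python
import re

# Name tokens that often deserve a closer look; the list is illustrative.
SUSPICIOUS_NAMES = [
    b"/JavaScript", b"/JS", b"/OpenAction", b"/AA",
    b"/Launch", b"/URI", b"/SubmitForm", b"/EmbeddedFile",
]


def count_pdf_names(data):
    """Count raw occurrences of each name token in the file bytes."""
    counts = {}
    for name in SUSPICIOUS_NAMES:
        # Require a non-alphanumeric character (or end of data) after the
        # token, so that /JS does not also match a longer name like /JSON.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode()] = len(re.findall(pattern, data))
    return counts
```

Usage would be along the lines of `count_pdf_names(open("signature_page.pdf", "rb").read())`; any non-zero count is a reason to look closer, not proof of anything.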

    would love it if attempts to reach the cloud could be trapped and recorded to a log file in the course of neutering the PDF.

    This is possible, but it takes a bit of setup. In my own lab, I have PolarProxy running in one Virtual Machine (VM), using QEMU/KVM. That acts as a gateway between an isolated network and a network with internet access. It runs transparent TLS break and inspect on port 443/tcp and tcpdump capturing port 80/tcp. It also serves DNS using Bind.

    There is then the “victim” VM which is running bog standard Windows 10. The PolarProxy root cert has been added to the Trusted Roots certificate store. The Default Gateway and DNS servers are hard coded to the PolarProxy VM. Suspicious stuff is tested on this system and all network traffic is recorded on the PolarProxy system in standard pcap format for analysis.

    • evenwicht@lemmy.sdf.org (OP)
      20 days ago

      Does pdfinfo give any indication of the application used to create the document?

      Oracle Documaker PDF Driver
      PDF version: 1.3

      If it chokes on the Java bit up front, can you extract just the PDF from the file and look at that?

      Not sure how to do that but I did just try pdfimages -all which was not useful since it’s a vector PDF. pdfdetach -list shows 0 attachments. It just occurred to me that pdftocairo could be useful as far as a CLI way to neuter the doc and make it useable, but that’s a kind of a lossy meat-grinder option that doesn’t help with analysis.

      You might also dig through the PDF a bit using Didier Stevens’s tools,

      Thanks for the tip. I might have to look into that. No readme… I guess this is a /use the source, Luke/ scenario. (edit: found this).

      I appreciate all the tips. I might be tempted to dig into some of those options.

  • GetOffMyLan@programming.dev
    20 days ago

    The file is a serialised java array that contains a pdf file. I’ve seen a few things online about this. Some pdf readers accept it, some don’t.
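For anyone wondering what file(1) is keying on: a Java serialization stream starts with the two magic bytes AC ED followed by a two-byte version number, and a serialized array begins with type code 0x75. A small sketch of reading those leading bytes follows; the function name is made up for illustration, and it looks only at the first five bytes rather than parsing the whole stream.

```python
import struct

STREAM_MAGIC = 0xACED   # first two bytes of a Java serialization stream
STREAM_VERSION = 5      # the "version 5" that file(1) reports
TC_ARRAY = 0x75         # type code that introduces a serialized array


def describe_java_stream(data):
    """Return (is_java_stream, version, first_type_code) for the leading bytes."""
    if len(data) < 5:
        return (False, None, None)
    # Big-endian: unsigned short magic, unsigned short version, one type byte.
    magic, version, type_code = struct.unpack(">HHB", data[:5])
    if magic != STREAM_MAGIC:
        return (False, None, None)
    return (True, version, type_code)
```

A result of `(True, 5, 0x75)` would be consistent with a serialized array wrapper; it says nothing about what the wrapped bytes contain.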

    And I’m not sure why an application would output a pdf this way. But there’s nothing harmful going on.

    You’re kind of freaking out about nothing.

    • Hirom@beehaw.org
      20 days ago

      It’s a fair question. There’s precedent where malware is embedded in PDFs.

      • GetOffMyLan@programming.dev
        20 days ago

        Indeed. But the pdf file itself isn’t the issue here. They very clearly don’t know what serialisation is.

        And while there are risks with java serialisation it isn’t being deserialized here.

    • evenwicht@lemmy.sdf.org (OP)
      20 days ago

      You’re kind of freaking out about nothing.

      I highly recommend Youtube video l6eaiBIQH8k, if you can track it down. You seem to have no general idea about PDF security problems.

      And I’m not sure why an application would output a pdf this way. But there’s nothing harmful going on.

      If you can’t explain it, then you don’t understand it. Thus you don’t have answers.

      It’s a bad practice to just open a PDF you did not produce without safeguards. Shame on me for doing it… I got sloppy but it won’t happen again.

      • GetOffMyLan@programming.dev
        20 days ago

        It’s literally just the format of the file here. If you skip the java serialisation header it’s a normal pdf file. I said nothing about the pdf file itself.

        I did explain what it is. I just don’t know why certain programs encode it this way. It’s supported by multiple pdf readers so it must be semi common but I can’t find a reason for it to be encoded this way.
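Skipping the header can be sketched in a few lines. This assumes the wrapper simply prepends bytes to an otherwise complete PDF, so everything from the first `%PDF-` marker to the last `%%EOF` marker is the document; it also assumes the cross-reference offsets inside the PDF were computed before wrapping. The function names are hypothetical.

```python
JAVA_MAGIC = b"\xac\xed\x00\x05"  # Java serialization magic plus version 5


def is_java_wrapped(data):
    """True if the bytes start with a Java serialization stream header."""
    return data.startswith(JAVA_MAGIC)


def unwrap_pdf(data):
    """Return the bytes from the first %PDF- marker to the last %%EOF marker.

    Raises ValueError if no PDF header is present at all.
    """
    start = data.find(b"%PDF-")
    if start == -1:
        raise ValueError("no %PDF- header found")
    end = data.rfind(b"%%EOF")
    # If no %%EOF follows the header, keep everything up to the end of the data.
    stop = end + len(b"%%EOF") if end > start else len(data)
    return data[start:stop]
```

The result is only the unwrapped file, not a sanitized one; anything embedded in the PDF itself is still there.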

        I’m trying to help you out; there’s no need to be a dick.

      • Troy
        20 days ago

        Wow, your paranoia is dialed up to 11.

        • evenwicht@lemmy.sdf.org (OP)
          20 days ago

          Don’t Canadian insurance companies want to know where their customers are? Or are the Canadian privacy safeguards good on this?

          In the US, Europe (despite the GDPR), and other places, banks and insurance companies snoop on their customers to track their whereabouts as a normal common way of doing business. They insert surreptitious tracker pixels in email to not only track the fact that you read their msg but also when you read the msg and your IP (which gives whereabouts). If they suspect you are not where they expect you to be, they take action. They modify your policy. It’s perfectly legal in the US to use sneaky underhanded tracking techniques rather than the transparent mechanism described in RFC 2298. If your suppliers are using RFC 2298 and not involuntary tracking mechanisms, lucky you.
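For contrast with a hidden pixel, the RFC 2298 mechanism is a visible header, Disposition-Notification-To, which the recipient's mail client can honor, decline, or ask about. A minimal sketch using Python's standard email library; the addresses, subject, and function name are placeholders, not anything an actual insurer sends.

```python
from email.message import EmailMessage


def build_message_with_receipt_request(sender, recipient, body):
    """Build a message that openly asks for a read receipt (RFC 2298 MDN)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Policy update"  # placeholder subject
    # The request sits in plain sight in the headers; the recipient's
    # client decides whether a disposition notification is ever sent.
    msg["Disposition-Notification-To"] = sender
    msg.set_content(body)
    return msg
```

The difference from a tracker pixel is that nothing is fetched from a remote server when the message is displayed, so no IP address leaks unless the reader agrees to send the notification.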

          • Troy
            20 days ago

            Your assertion that the document is malicious without any evidence is what I’m concerned about.

            At some point you have to decide to trust someone. The comment above gave you reason to trust that the document was in a standard, non-malicious format. But you outright rejected their advice in a hostile tone. You base your hostility on a youtube video.

            You should read the essay “on trusting trust” and then make a decision on whether you are going to participate in digital society or live under a bridge with a tinfoil hat.

            In Canada, and elsewhere, insurance companies know everything about you before you even apply, and it’s likely true elsewhere too. Even if they don’t have personally identifiable information, you’ll be in a data bucket with your neighbours, with risk profiles based on neighbourhood, items being insured, claim rates for people with similar profiles, etc. Very likely every interaction you have with them has been going into a LLM even prior to the advent of ChatGPT, and they will have scored those interactions against a model.

            The personally identifiable information has largely been anonymized in these models. In Canada, for example, there are regulatory bodies like OSFI that they have to report to, and get audited by, to ensure the data is being used in compliance with regulations. Each company will have a compliance department tasked with making sure they’re adhering.

            But what you will end up doing instead is triggering fraudulent behaviour flags. There’s something called “address fraud”, where people go out of their way to disguise their location, because some lower risk address has better rates or whatever. When you do everything you can to scrub your location, this itself is a signal that you are operating as a highly paranoid individual and that might put you in a bucket. If you want to be the most invisible to them, you want to act like you’re in the median of all categories. Because any outlying behaviours further fingerprint you.

            Source: I have a direct connection to advanced analytics within insurance industry (one degree of separation).

            • BearOfaTime@lemm.ee
              20 days ago

              The personally identifiable information has largely been anonymized in these models

              This tells me all we need to know about you.

              You’re an apologist for these companies. It’s been repeatedly demonstrated such anonymization can be pretty easily reversed.

            • evenwicht@lemmy.sdf.org (OP)
              20 days ago

              Your assertion that the document is malicious without any evidence is what I’m concerned about.

              I did not assert malice. I asked questions. I’m open to evidence proving or disproving malice.

              At some point you have to decide to trust someone. The comment above gave you reason to trust that the document was in a standard, non-malicious format. But you outright rejected their advice in a hostile tone. You base your hostility on a youtube video.

              There was too much uncertainty there to inspire trust. Getoffmylan had no idea why the data was organised as serialised java.

              You should read the essay “on trusting trust” and then make a decision on whether you are going to participate in digital society or live under a bridge with a tinfoil hat.

              I’ll need a more direct reference because that phrase gives copious references. Do you mean this study? Judging from the abstract:

              To what extent should one trust a statement that a program is free of Trojan horses? Perhaps it is more important to trust the people who wrote the software.

              I seem to have received software pretending to be a document. Trust would naturally not be a sensible reaction to that. In the infosec discipline we would be incompetent fools to loosely trust whatever comes at us. We make it a point to avoid trust, and when trust cannot be avoided we seek justification for it. We have a zero-trust principle. We also have the rule of least privilege, which means not extending trust or permissions where they are not necessary for the mission. Why would I trust a PDF when I can take steps to access the PDF in a way that does not need excessive trust?

              The masses (security naive folks) operate in reverse: they trust by default and look for reasons to distrust. That’s not wise.

              In Canada, and elsewhere, insurance companies know everything about you before you even apply, and it’s likely true elsewhere too.

              When you move, how do they find out if you don’t tell them? Tracking would be one way.

              Privacy is about control. When you call it paranoia, the concept of agency has escaped you. If you have privacy, you can choose what you disclose. What would be good rationale for giving up control?

              Even if they don’t have personally identifiable information, you’ll be in a data bucket with your neighbours, with risk profiles based on neighbourhood, items being insured, claim rates for people with similar profiles, etc. Very likely every interaction you have with them has been going into a LLM even prior to the advent of ChatGPT, and they will have scored those interactions against a model.

              If we assume that’s true, what do you gain by giving them more solid data to reinforce surreptitious snooping? You can’t control everything, but it’s not in your interest to sacrifice control for nothing.

              But what you will end up doing instead is triggering fraudulent behaviour flags. There’s something called “address fraud”, where people go out of their way to disguise their location, because some lower risk address has better rates or whatever.

              Indeed, for some types of insurance policies the insurer has a legitimate need to know where you reside. But that’s the insurer’s problem. This does not rationalize a consumer who recklessly feeds surreptitious surveillance. Street wise consumers protect themselves from surveillance. Of course they can (and should) disclose their new address via proper channels if they move.

              Why? Because someone might take a vacation somewhere and interact from another state. How long is a vacation? It’s for the consumer to declare where they intend to live, e.g. via “declaration of domicile”. Insurance companies will harass people if their intel has an inconsistency. Where is that trust you were talking about? There is no reciprocity here.

              When you do everything you can to scrub your location, this itself is a signal that you are operating as a highly paranoid individual and that might put you in a bucket.

              Sure, you could end up in that bucket if you are in a strong minority of street wise consumers. If the insurer wants to waste their time chasing false positives, the time waste is on them. I would rather laugh at that than join the street unwise club that makes the street wise consumers stand out more.