I want to rip the contents of a paid website, but I have to log in through a web page to get access.

Does anyone have any good tools for Windows for that?

I’m guessing that any such tool must have a built-in browser, or be a browser plugin, for it to work.

  • marsara9@lemmy.world
    1 year ago

    Unless you have an account, there’s no easy way to get access to the content on the page. Once you have an account, there’s technically nothing stopping you from just saving the HTML file to your computer.

    Something else you can try though, assuming you don’t have an account, is to just turn off JavaScript. If the site lets you partially load the content and then asks you to create an account to read more, they usually just block the content by having JavaScript add an opaque overlay. With JavaScript disabled, obviously it’s not there to add the overlay and you’re able to keep reading.

    • zabadoh@lemmy.ml (OP)
      1 year ago

      I have an account, so that’s not a problem. The problem is how to automate going into every little content page and downloading the content, including the hi-res files.
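The crawling half of that job can be sketched with Python's standard library alone. This is a minimal sketch, not a finished ripper: `LinkCollector`, `extract_links`, and the `site.example` URLs are made-up names for illustration, and it assumes the content pages are plain HTML with ordinary `<a>` and `<img>` tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect link and media targets from one HTML page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pages = set()  # <a href> targets on the same host as base_url
        self.files = set()  # src targets of <img>, <source>, and <video> tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            url = urljoin(self.base_url, attrs["href"])
            # Stay on the same host so the crawl does not wander off-site.
            if urlparse(url).netloc == urlparse(self.base_url).netloc:
                self.pages.add(url.split("#")[0])
        elif tag in ("img", "source", "video") and attrs.get("src"):
            self.files.add(urljoin(self.base_url, attrs["src"]))


def extract_links(html, base_url):
    """Return (same-site page URLs, media file URLs) found in html."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.pages, parser.files
```

A crawler would call `extract_links` on each fetched page, queue the unseen entries from `pages`, and download everything in `files`. Hi-res files are often reached through an `<a href>` rather than an `<img src>`, so the `pages` set may need filtering by file extension as well.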

        • zabadoh@lemmy.ml (OP)
          1 year ago

          Webcopy looks promising if I can get the crawler part of it to work with this site’s authentication…

          edit: I couldn’t get Webcopy’s spider to authenticate correctly.

          Webcopy uses the deprecated version of Internet Explorer in Windows 10 as a module. I can log into the website using the Capture Forms browser dialog, but the cookies (or whatever else holds the login session) don’t carry over to the spider.

    • m-p{3}A
      1 year ago

      Depending on the website, there might be tools tailored specifically to it that will extract the content you’re looking for. They’re likely going to be command-line based, and you’ll likely have to export your cookies from your browser so that the tools can make requests as if you were logged in to your account.
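As a rough illustration of the cookie approach, here is a minimal Python sketch. It assumes the cookies have already been exported from the browser to a Netscape-format `cookies.txt` file (browser extensions can produce one), and `opener_from_cookie_file` is a made-up helper name, not part of any particular ripping tool.

```python
import http.cookiejar
import urllib.request


def opener_from_cookie_file(path):
    """Build a urllib opener that sends the cookies stored in a Netscape cookies.txt."""
    jar = http.cookiejar.MozillaCookieJar(path)
    # Load session and expired cookies too; login cookies are often session-only.
    jar.load(ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Some sites reject Python's default user agent; this value is only an example.
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    return opener, jar
```

Requests made with `opener.open(url)` then carry the exported cookies. Whether the site accepts them depends on how it ties sessions to the browser, so this may not work everywhere.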

      Is it too much to ask which website?

    • Nioxic@lemmy.world
      1 year ago

      Turning off JavaScript might also block the page content from loading at all…

      I would assume it’s being fetched by a JavaScript script, through an API.

      That is fairly common.
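If the content does come from an API, the responses are usually JSON and show up in the browser's developer tools under the Network tab. A minimal Python sketch for pulling file links out of such a response follows. The JSON shape, the `cdn.example` URL, and the `find_file_urls` helper are all invented for illustration; the real API will look different.

```python
import json


def find_file_urls(payload, extensions=(".jpg", ".jpeg", ".png", ".zip")):
    """Walk a decoded JSON value and return every string that looks like a file URL."""
    found = []
    if isinstance(payload, dict):
        for value in payload.values():
            found.extend(find_file_urls(value, extensions))
    elif isinstance(payload, list):
        for item in payload:
            found.extend(find_file_urls(item, extensions))
    elif isinstance(payload, str):
        # Ignore any query string when checking the file extension.
        path = payload.split("?")[0].lower()
        if payload.startswith(("http://", "https://")) and path.endswith(extensions):
            found.append(payload)
    return found


# Hypothetical API response, only to show the call.
sample = json.loads('{"items": [{"full": "https://cdn.example/a.jpg"}]}')
print(find_file_urls(sample))
```

Walking the whole structure avoids hard-coding field names, which is useful when you don't yet know where the API puts its hi-res links.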