• Em Adespoton
    27 days ago

    Wouldn’t their patch embeddings return different results depending on where the patch boundaries fall in the image? They don’t appear to use overlapping patches for redundancy, which makes it significantly less resource intensive, but surely the chance of losing significant signals in the image-to-text translation must be correspondingly high?
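
To make the tradeoff concrete, here is a rough sketch, not taken from their code. The 224-pixel image, 16-pixel patches, and the two stride values are assumed numbers for illustration, and `patches_per_side` and `fits_in_one_patch` are hypothetical helpers. It compares the patch count with and without overlap, and checks whether a small feature that straddles a patch boundary ever lands whole inside a single patch.

```python
def patches_per_side(image_size: int, patch_size: int, stride: int) -> int:
    """Number of patch positions along one side of a square image (no padding)."""
    if image_size < patch_size:
        return 0
    return (image_size - patch_size) // stride + 1


def fits_in_one_patch(start: int, width: int, image_size: int,
                      patch_size: int, stride: int) -> bool:
    """True if the 1-D span [start, start + width) lies entirely inside some patch."""
    for p in range(0, image_size - patch_size + 1, stride):
        if p <= start and start + width <= p + patch_size:
            return True
    return False


# Assumed example values, not taken from the model under discussion.
IMAGE, PATCH = 224, 16

no_overlap = patches_per_side(IMAGE, PATCH, stride=16) ** 2    # 14 * 14 = 196 patches
half_overlap = patches_per_side(IMAGE, PATCH, stride=8) ** 2   # 27 * 27 = 729 patches

# An 8-pixel-wide feature covering pixels 12..19 crosses the boundary at pixel 16.
split_without_overlap = not fits_in_one_patch(12, 8, IMAGE, PATCH, stride=16)
whole_with_overlap = fits_in_one_patch(12, 8, IMAGE, PATCH, stride=8)

print(no_overlap, half_overlap, split_without_overlap, whole_with_overlap)
```

With these assumed numbers, half-overlapping patches cost roughly 3.7 times as many embeddings, but the boundary-straddling feature is seen whole by at least one patch. Without overlap it is always split across two.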

    • ☆ Yσɠƚԋσʂ ☆@lemmy.mlOP
      27 days ago

      Good question, I’m not sure how they account for that. Maybe there’s a higher-level layer responsible for dealing with the boundaries?