Annotation feature in Obsidian base is not pulling annotations in all cases

venia_forvess · June 13, 2024, 9:39am

Steps to reproduce

Following all the steps assessed, and at the official request of the primary maintainer of PDF++, this bug report is intended to show that not all PDFs in the upstream annotator match the call comment that Obsidian uses in its default annotator. As a result, all plugins that provide PDF functionality inherit an issue where some OCRs for PDFs fail to be captured by the extractor.

(Note: take the above description with a grain of salt. I was asked by the developer to submit this and wrote it as well as I could, given that I’m not a developer.)

Link: [Feature] Better support for annotated text extraction · Issue #224 · RyotaUshio/obsidian-pdf-plus · GitHub

image919×539 32 KB

Did you follow the troubleshooting guide?

Yes.

Did you try the above steps in the sandbox vault?

This was tried and proven to have occurred using the default sandbox as shown here:
Link: Not sure what is different, or if it's Adobe, but something weird is happening... · RyotaUshio/obsidian-pdf-plus · Discussion #223 · GitHub

image930×928 143 KB

Expected result

As stated, the issue only occurs on some PDFs, and as I’m submitting this ticket I’m still honestly not sure how this works. It may be worth reaching out to @ush directly. That said, my personal contribution, which brought this to his attention, is linked here, and the expected result is shown in the image:

Link: Not sure what is different, or if it's Adobe, but something weird is happening... · RyotaUshio/obsidian-pdf-plus · Discussion #223 · GitHub

image895×841 43.4 KB

Actual result

As listed further down in my response, you can see that the 2023 PDF (the one that does not work) returned and extracted a null highlight. I cannot figure out what distinguishes a PDF that works from one that doesn’t, for either the comment pulled or the highlight itself.

My post: Not sure what is different, or if it's Adobe, but something weird is happening... · RyotaUshio/obsidian-pdf-plus · Discussion #223 · GitHub

image897×1194 136 KB

Environment

SYSTEM INFO:
Obsidian version: v1.6.3
Installer version: v1.5.12
Operating system: Windows 10 Pro 10.0.22631
Login status: logged in
Catalyst license: none
Insider build toggle: off
Live preview: on
Base theme: dark
Community theme: Vauxhall v1.0.1
Snippets enabled: 1
Restricted mode: off
Plugins installed: 12
Plugins enabled: 8
1: Auto Link Title v1.5.4
2: TagFolder v0.18.7
3: ePub Reader v1.0.2
4: Tag Wrangler v0.6.1
5: Media Extended v3.1.0
6: PDF++ v0.39.23
7: Dataview v0.5.66
8: Text Extractor v0.5.2

Additional information

For additional information (because I am definitely not the one to talk to about this), please contact @ush.

ush · June 13, 2024, 11:59am

(Well, I didn’t really intend to “ask” her to file a bug report. My intention was to suggest sending a feature request or bug report as one of the possible options. My English might have been ambiguous in the linked GitHub Discussion thread. I’m sorry for the confusion.)

Since the original report does not fully follow the troubleshooting guide and the template, let me re-report this problem.

In some PDFs, the extraction of annotated text fails, leading to the two problems described below. (As I said in the linked GitHub Discussion thread, I’m not sure whether this counts as a bug, although the current behavior is pretty counterintuitive.)

Steps to reproduce

sample_pdf.zip (123.7 KB)

Open the sandbox vault, then download the attached .zip and unzip it in the vault.

Problem 1

Open not_working.pdf, click the highlight annotation “Hi, welcome to Obsidian!”
In the annotation popup, click the “Copy” button. Then paste it to a note.

Problem 2

Right-click on the aforementioned highlight and see what menu items are displayed.

Did you follow the troubleshooting guide? [Y/N]

Y

Expected result

Problem 1

As in the case of working.pdf, not only the link but also the annotated text should be copied, like so:

> Hi, welcome to Obsidian!

[[not_working.pdf#page=1&annotation=28R]]

Problem 2

As in the case of working.pdf, both “Copy link to annotation” and “Copy annotation” should be shown.

Actual result

Problem 1

The annotated text is not extracted, and only the link is copied.

[[not_working.pdf#page=1&annotation=28R]]

Problem 2

Only “Copy link to annotation” is shown, and “Copy annotation” is not shown.
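
Both symptoms can be read as consequences of the same missing text. The following is a minimal sketch of that relationship, not Obsidian's actual code: the function names `buildClipboardText` and `buildMenuItems` are hypothetical, and the menu order is an assumption.

```typescript
// Hypothetical model of the two symptoms; NOT Obsidian's implementation.
// `annotatedText` stands for whatever the text extraction returned
// (null or an empty string when the extraction failed).

function buildClipboardText(link: string, annotatedText: string | null): string {
  const text = annotatedText?.trim();
  // With text: quoted text, blank line, then the link. Without: link only.
  return text ? `> ${text}\n\n${link}` : link;
}

function buildMenuItems(annotatedText: string | null): string[] {
  const items = ["Copy link to annotation"];
  // "Copy annotation" is only offered when there is text to copy.
  if (annotatedText?.trim()) items.push("Copy annotation");
  return items;
}
```

Under this model, working.pdf yields the quoted text plus the link and both menu items, while not_working.pdf yields only the link and only “Copy link to annotation”.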

Environment

SYSTEM INFO:
Obsidian version: v1.6.3
Installer version: v1.6.3
Operating system: Darwin Kernel Version 22.6.0: Mon Feb 19 19:43:41 PST 2024; root:xnu-8796.141.3.704.6~1/RELEASE_ARM64_T8103 22.6.0
Login status: logged in
Catalyst license: insider
Insider build toggle: off
Live preview: on
Base theme: adapt to system
Community theme: none
Snippets enabled: 0
Restricted mode: off
Plugins installed: 0
Plugins enabled: 0

RECOMMENDATIONS:
none

Additional information

The technical detail is described here: [Feature] Better support for annotated text extraction · Issue #224 · RyotaUshio/obsidian-pdf-plus · GitHub
In short, this problem is caused by how the PDFViewerChild.prototype.getTextByRect method is implemented.
Here, PDFViewerChild means the class of view.viewer.child, where view is a PDF view. (I don’t know the true class name; I gave it this name for typing purposes in my plugin.)
It might be worth mentioning that Adobe Acrobat can successfully extract the annotated text in the attached sample PDF (right-click on the highlight > “Copy text”).
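
For readers who want to locate the object in question, here is a minimal sketch of how one might reach `view.viewer.child` and check that it exposes `getTextByRect`. Only the names `viewer`, `child`, and `getTextByRect` come from the post above. The interfaces are hypothetical, the method's signature is not known here (so it is typed loosely), and these are undocumented internals that may change.

```typescript
// Hypothetical typings: the shapes below are assumptions made for illustration.
interface PdfViewerChildLike {
  // Signature unknown; declared loosely on purpose.
  getTextByRect: (...args: unknown[]) => unknown;
}

interface PdfViewLike {
  viewer?: { child?: unknown };
}

// Returns the child object if it exists and has a getTextByRect method,
// otherwise null (for example, when the viewer has not been set up yet).
function getPdfViewerChild(view: PdfViewLike): PdfViewerChildLike | null {
  const child = view.viewer?.child;
  if (typeof child !== "object" || child === null) return null;
  const candidate = child as { getTextByRect?: unknown };
  return typeof candidate.getTextByRect === "function"
    ? (child as PdfViewerChildLike)
    : null;
}
```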

WhiteNoise · June 13, 2024, 12:39pm

Thanks, we will double check this.

venia_forvess · October 1, 2024, 12:21am

Hey @whitenoise!

I am terribly sorry that I haven’t popped in here in months, but I ended up working with @ush on this issue in PDF++, and he ended up solving it by proxy, which is FANTASTIC. (If you came across this issue here, I highly recommend giving Ryota a quick 5 buck coffee, as he worked SUPER HARD on this!)

That said, to learn more about this issue, here are the GitHub issues that we worked on for months - it’s solved now!

github.com/RyotaUshio/obsidian-pdf-plus

[Bug] Exporting Comments in PDF only pulls from loaded view window

opened 01:14AM - 26 Aug 24 UTC

closed 08:41AM - 24 Sep 24 UTC

samanthavenialogan

bug reproduced

### Steps to reproduce

1. Open a PDF document of more than 30-40 pages.
2. …Used example report of 37 pages here: [Report-Content or Context Moderation_Artisanal Community-Reliant and Industrial Approaches-Robyn Caplan-moderation-2018.pdf](https://github.com/user-attachments/files/16742783/Report-Content.or.Context.Moderation_Artisanal.Community-Reliant.and.Industrial.Approaches-Robyn.Caplan-moderation-2018.pdf).
3. Proceed to highlight the document and provide annotated comments for a variety of highlights throughout the full document.
4. Scroll to the middle point of the document at a zoom of 150%.
5. Use the global copy/paste for highlights and comments provided in PDF++. Copy all highlights with comments as displayed in the following discussion thread: https://github.com/RyotaUshio/obsidian-pdf-plus/discussions/132#discussioncomment-9931568
6. After pasting the comments into a new note, or into Notepad++, you will see that comments are only present for the page you are on +/- ~5 pages, corresponding with the viewer's load of the PDF's pages.
7. Zoom out gradually to 125%, 100%, 50%, 20%, and 10% as you repeat steps 3 and 4 for each zoom.
8. You will see that the comments captured by the copy/paste increase in a linear fashion with the view window, because the loaded view window is forced to render more and more comments due to the zoom. More pages loaded in the viewer = more comments loaded.
9. There is a ceiling to how this zooming workaround functions for documents that cannot properly display all of the pages: it will not be able to take all comments for documents over 30 pages at 10%, because you cannot feasibly fit that many pages in the document viewer window.
10. You can further extend this by putting the document into 2 page view, but even this is only a limited workaround for larger documents such as books.

### Expected behavior

The expected behavior is that the document will copy/paste all comments attached to all highlights and annotations regardless of the viewer for the PDF. This works for highlights but it does not work for the annotations attached to those highlights.

### Actual behavior

At the moment the document will take all highlights from the document regardless of whether they are loaded, but attached annotated comments will not copy/paste in full: it will only do so for the targeted document view +/- 5 pages which are loaded in and waiting to be displayed in the window view. The comments seem hindered by the PDF's 'lazy-load-like' feature.

### Screen recordings or screenshots (sandbox vault)

In the below footage the following document: [Report-Content or Context Moderation_Artisanal Community-Reliant and Industrial Approaches-Robyn Caplan-moderation-2018.pdf](https://github.com/user-attachments/files/16742741/Report-Content.or.Context.Moderation_Artisanal.Community-Reliant.and.Industrial.Approaches-Robyn.Caplan-moderation-2018.pdf) was marked up with comments from the start through to the end of the document. The document was copied at 200% from the middle of the document (pg. 20) using the standard copying procedure for global copies. The result was subsequently placed into Notepad++. The Notepad++ output was then checked for whether the comments at the start and the end of the document were properly copied. The comments copied to page 26 on the bottom but failed to capture the comments on the 1st-7th pages.

https://github.com/user-attachments/assets/bb8a04d0-440a-43f8-ba50-b1efb50abde5

First comment on page 2: ![image](https://github.com/user-attachments/assets/d1436dc8-d8b5-44c2-8ac8-da2342d9f017)

Last comment on page 34: ![image](https://github.com/user-attachments/assets/b4e8df05-d432-4df5-b2d7-d3a2f5cba036)

Notepad++ copy/paste results: the first comment recorded relates to a highlight on page 17 and the last on page 34.

![image](https://github.com/user-attachments/assets/fbaeabd6-9836-4ce0-858b-79a2a2338d44) ![image](https://github.com/user-attachments/assets/85968ad5-3337-4322-bdf9-01b8204dfb69)

### Obsidian debug info

irrelevant - issue is not technically an obsidian bug.

### PDF++ debug info

```json
irrelevant - issue would not show for PDF++ bugs because this is not technically a bug. This seems to be a limitation of the current implementation.
```

### Error messages

N/A
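
The linked issue describes comments being collected only from pages the viewer has already loaded. As a rough illustration of the alternative, the sketch below walks every page of a document and reads its annotations directly, independent of what is rendered. It is not a description of how PDF++ resolved the issue. The interfaces are minimal stand-ins modelled on the PDF.js document and page proxies (`numPages`, `getPage`, `getAnnotations`), and the `contents` field used for the comment text is an assumption.

```typescript
// Minimal structural types; only the members used below are declared.
// The `contents` field name is an assumption for illustration.
interface AnnotationLike { subtype?: string; contents?: string }
interface PageLike { getAnnotations(): Promise<AnnotationLike[]> }
interface DocumentLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>; // 1-based page numbers
}

interface PageComment { page: number; comment: string }

// Collects non-empty comments from every page, whether or not the
// viewer has rendered that page.
async function collectAllComments(doc: DocumentLike): Promise<PageComment[]> {
  const comments: PageComment[] = [];
  for (let pageNumber = 1; pageNumber <= doc.numPages; pageNumber++) {
    const page = await doc.getPage(pageNumber);
    const annotations = await page.getAnnotations();
    for (const annotation of annotations) {
      const comment = annotation.contents?.trim();
      if (comment) comments.push({ page: pageNumber, comment });
    }
  }
  return comments;
}
```

Loading pages one at a time keeps memory use predictable on long documents, at the cost of being slower than reading only the rendered pages.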

venia_forvess · October 1, 2024, 12:27am

yuppers - issue resolved.