Document Contents Are Not Searchable

Platform Notice: Data Center Only - This article only applies to Atlassian products on the Data Center platform.

Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product, however they have not been tested. Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.

*Except Fisheye and Crucible

Symptoms

The following errors appear in the Confluence logs:

    2012-06-29 14:41:00,327 WARN [scheduler_Worker-2] [bonnie.search.extractor.BaseAttachmentContentExtractor] addFields Error indexing attachment (Attachment: My_PDF_Examplem.pdf v.2 (8912924) admin)
    com.atlassian.bonnie.search.extractor.ExtractorException: Error getting content of PDF document
        at com.atlassian.bonnie.search.extractor.PdfContentExtractor.extractText(PdfContentExtractor.java:66)
        at com.atlassian.bonnie.search.extractor.BaseAttachmentContentExtractor.addFields(BaseAttachmentContentExtractor.java:40)
        at com.atlassian.confluence.plugin.descriptor.ExtractorModuleDescriptor$BackwardsCompatibleExtractor.addFields(ExtractorModuleDescriptor.java:36)
        at com.atlassian.bonnie.search.BaseDocumentBuilder.getDocument(BaseDocumentBuilder.java:104)
        at com.atlassian.confluence.search.lucene.ConfluenceDocumentBuilder.getDocument(ConfluenceDocumentBuilder.java:97)
        at com.atlassian.confluence.search.lucene.tasks.AddDocumentIndexTask.perform(AddDocumentIndexTask.java:43)
        ...
    Caused by: java.io.IOException: Error: Expected an integer type, actual=''
        at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1310)
        at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:81)
        at org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:449)
        at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1112)
        at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:591)
        at com.atlassian.bonnie.search.extractor.PdfContentExtractor.extractText(PdfContentExtractor.java:45)
        ... 30 more

Cause

Confluence cannot index the contents of some attachments. The affected files may be corrupt, or Confluence may be running out of memory (OOM) during the indexing task.
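
To judge which of the two causes applies, it can help to list the attachments named in the indexing warnings and to check whether the log also contains out-of-memory errors. The following is a minimal sketch, not an Atlassian tool: the function name scan_log is hypothetical, the pattern is modeled only on the WARN line shown under Symptoms (other Confluence versions may format it differently), and the check for java.lang.OutOfMemoryError assumes the JVM error is written to the same log.

```python
import re
from collections import Counter

# Modeled on the WARN line in Symptoms (assumption: your version logs the
# same "(Attachment: <name> v.<version> (<id>) <user>)" layout).
ATTACHMENT_ERROR = re.compile(
    r"Error indexing attachment \(Attachment: (?P<name>.+) v\.(?P<version>\d+) "
    r"\((?P<id>\d+)\) (?P<user>[^)]*)\)"
)


def scan_log(lines):
    """Return (failures, oom_count) for an iterable of log lines.

    failures maps (attachment id, file name, version) to the number of
    indexing warnings seen; oom_count is the number of lines that
    mention java.lang.OutOfMemoryError.
    """
    failures = Counter()
    oom_count = 0
    for line in lines:
        match = ATTACHMENT_ERROR.search(line)
        if match:
            key = (match["id"], match["name"], int(match["version"]))
            failures[key] += 1
        if "java.lang.OutOfMemoryError" in line:
            oom_count += 1
    return failures, oom_count


# Demo on the warning line quoted in this article.
sample_line = (
    "2012-06-29 14:41:00,327 WARN [scheduler_Worker-2] "
    "[bonnie.search.extractor.BaseAttachmentContentExtractor] addFields "
    "Error indexing attachment (Attachment: My_PDF_Examplem.pdf v.2 (8912924) admin)"
)
demo_failures, demo_oom = scan_log([sample_line])
for (attachment_id, name, version), count in demo_failures.items():
    print(f"{attachment_id}  {name} v.{version}  warnings: {count}")
print(f"OutOfMemoryError lines: {demo_oom}")
```

To scan a real log, pass an open file object instead of the sample list, for example scan_log(open(path, errors="replace")), where path points to your Confluence application log. A handful of attachments that fail repeatedly with no out-of-memory lines suggests corrupt files; many out-of-memory lines suggest a memory problem instead.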

Workaround

  1. Disable attachment indexing by following the instructions in How to disable indexing of attachments. This stops Confluence from indexing attachment contents, so those contents will no longer be searchable. Attachment titles, however, are still indexed and searchable.

  2. Once attachment indexing is disabled, rebuild the content indexes from scratch.

Updated on April 8, 2025
