Document Contents Are Not Searchable
Platform Notice: Data Center Only - This article only applies to Atlassian products on the Data Center platform.
Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product, however they have not been tested. Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Symptoms
The following errors appear in the logs:
2012-06-29 14:41:00,327 WARN [scheduler_Worker-2] [bonnie.search.extractor.BaseAttachmentContentExtractor] addFields Error indexing attachment (Attachment: My_PDF_Examplem.pdf v.2 (8912924) admin)
com.atlassian.bonnie.search.extractor.ExtractorException: Error getting content of PDF document
at com.atlassian.bonnie.search.extractor.PdfContentExtractor.extractText(PdfContentExtractor.java:66)
at com.atlassian.bonnie.search.extractor.BaseAttachmentContentExtractor.addFields(BaseAttachmentContentExtractor.java:40)
at com.atlassian.confluence.plugin.descriptor.ExtractorModuleDescriptor$BackwardsCompatibleExtractor.addFields(ExtractorModuleDescriptor.java:36)
at com.atlassian.bonnie.search.BaseDocumentBuilder.getDocument(BaseDocumentBuilder.java:104)
at com.atlassian.confluence.search.lucene.ConfluenceDocumentBuilder.getDocument(ConfluenceDocumentBuilder.java:97)
at com.atlassian.confluence.search.lucene.tasks.AddDocumentIndexTask.perform(AddDocumentIndexTask.java:43)
...
Caused by: java.io.IOException: Error: Expected an integer type, actual=''
at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1310)
at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:81)
at org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:449)
at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1112)
at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:591)
at com.atlassian.bonnie.search.extractor.PdfContentExtractor.extractText(PdfContentExtractor.java:45)
... 30 more
Cause
Confluence cannot index some attachments. The affected files may be corrupt, or Confluence may be running out of memory (OOM) during the indexing task.
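To see which attachments are affected, the warning lines in the log can be scanned for the failing file names and IDs. The following is a minimal sketch in Python, not an Atlassian-provided tool. The regular expression is based only on the WARN line shown in the Symptoms section and may need adjusting for other Confluence versions; the function name find_failed_attachments is hypothetical.

```python
import re

# Pattern derived from the WARN line shown under Symptoms. This is an
# assumption about the message format; other versions may log it differently.
FAILED_ATTACHMENT = re.compile(
    r"Error indexing attachment \(Attachment: (?P<name>.+) "
    r"v\.(?P<version>\d+) \((?P<id>\d+)\) (?P<user>\S+)\)"
)


def find_failed_attachments(log_lines):
    """Return unique (name, version, id) tuples for attachments that failed indexing."""
    seen = set()
    failures = []
    for line in log_lines:
        match = FAILED_ATTACHMENT.search(line)
        if match is None:
            continue  # stack trace lines and unrelated entries are skipped
        key = (
            match.group("name"),
            int(match.group("version")),
            int(match.group("id")),
        )
        if key not in seen:
            seen.add(key)
            failures.append(key)
    return failures


# Sample input taken from the log excerpt in the Symptoms section.
sample = [
    "2012-06-29 14:41:00,327 WARN [scheduler_Worker-2] "
    "[bonnie.search.extractor.BaseAttachmentContentExtractor] addFields "
    "Error indexing attachment (Attachment: My_PDF_Examplem.pdf v.2 (8912924) admin)",
    "com.atlassian.bonnie.search.extractor.ExtractorException: "
    "Error getting content of PDF document",
]

for name, version, attachment_id in find_failed_attachments(sample):
    print(f"{name} (version {version}, id {attachment_id})")
```

To run this against a real log, pass an open file object instead of the sample list, for example the application log in the Confluence home directory (the exact path depends on the installation). If only a handful of files are reported, they are likely corrupt; a long list across many file types may point to the memory problem instead.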
Workaround
Disable attachment indexing by following the instructions in How to disable indexing of attachments. Confluence will then stop indexing the content of attachments, so that content will no longer appear in search results. Attachment titles, however, will still be indexed and searchable.
After disabling attachment indexing, rebuild the content indexes from scratch.