SAXException error when running the Content Anonymizer for Confluence
Platform Notice: Data Center Only - This article only applies to Atlassian products on the Data Center platform.
Note that this KB was created for the Data Center version of the product. Data Center KBs for non-Data-Center-specific features may also work for Server versions of the product, however they have not been tested. Support for Server* products ended on February 15th 2024. If you are running a Server product, you can visit the Atlassian Server end of support announcement to review your migration options.
*Except Fisheye and Crucible
Summary
Atlassian may request an XML backup to troubleshoot bugs in Confluence. To keep customer data from leaking, the Content Anonymizer tool can be used to clean the backup data (entities.xml). However, some special characters may cause a SAXException during cleaning.
For example, a special character (code 55357, from a smiling-face emoji) caused the error below.
$ java -jar confluence-export-cleaner-1.1-jar-with-dependencies.jar entities.xml cleaned.xml
2021-04-14 21:40:12,157 INFO Starting to clean export file 'entities.xml'. This may take a few minutes.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXException: Cannot output character with code 55357 in the encoding UTF-8' within a CDATA section javax.xml.transform.TransformerException: Cannot output character with code 55357 in the encoding UTF-8' within a CDATA section
Cause
The Content Anonymizer tool cannot handle certain special characters (such as a smiling-face emoji) contained in the Confluence backup file (entities.xml).
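To see where the code 55357 in the error message may come from, the following minimal sketch prints the UTF-16 code units of an emoji. It assumes the offending character is U+1F600 (grinning face), which the article does not confirm. The helper names are hypothetical and are not part of any Atlassian tool.

```python
def utf16_code_units(ch):
    """Return the UTF-16 code units (as integers) that encode one character."""
    data = ch.encode("utf-16-be")
    # Each UTF-16 code unit is two bytes; characters above U+FFFF need two units.
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_high_surrogate(unit):
    """True if the code unit lies in the UTF-16 high-surrogate range."""
    return 0xD800 <= unit <= 0xDBFF


# Assumption: U+1F600 stands in for the "smiling face" mentioned in the article.
units = utf16_code_units("\U0001F600")
print(units)                        # [55357, 56832]
print(is_high_surrogate(units[0]))  # True
```

The value 55357 (0xD83D) is the first half of a surrogate pair, not a character on its own. This is consistent with the serializer reporting it as a character it cannot output.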
Solution
If entities.xml is small, the special characters can be removed manually in a text editor.
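For manual editing, it can help to know where such characters are. The sketch below lists every character above U+FFFF (those that need a surrogate pair in UTF-16) with its line and column. It assumes the file is UTF-8 encoded and that these characters are the ones that trip the anonymizer; other problematic characters may exist. The function names are hypothetical.

```python
import sys


def find_supplementary_chars(lines):
    """Yield (line_number, column, code_point) for each character above U+FFFF."""
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0xFFFF:
                yield line_no, col, ord(ch)


def main(path):
    # Assumption: the backup is UTF-8; undecodable bytes are replaced, not fatal.
    with open(path, encoding="utf-8", errors="replace") as f:
        for line_no, col, cp in find_supplementary_chars(f):
            print(f"line {line_no}, column {col}: U+{cp:X}")


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Saved under a name of your choice (for example find_chars.py), it could be run as `python find_chars.py entities.xml`.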
However, if the file is too large to edit directly, the following method can be used.
Download the special-character removal tool: atlassian-xml-cleaner-0.1.jar
Run the tool to remove the special characters:
java -jar atlassian-xml-cleaner-0.1.jar entities.xml > entities-clean.xml
Then run the Content Anonymizer tool on the cleaned file (entities-clean.xml).
Reference
The special-character cleaning tool was originally written for Jira. For details, see: Removing invalid characters from XML backups.