Custom website connector robots.txt validation fails when robots.txt URL redirects
Platform Notice: Cloud Only - This article only applies to Atlassian apps on the cloud platform.
Summary
When configuring the Custom website connector with a URL that includes a context path (for example, https://example.com/documentation), robots.txt validation fails even though a robots.txt file appears to exist on the site.
This occurs when the top-level robots.txt URL (for example, https://example.com/robots.txt) responds with an HTTP redirect (e.g., 301) to a different URL (such as the base URL https://example.com/) instead of returning the actual robots.txt file located at https://example.com/documentation/robots.txt.
The Custom website connector currently only supports a standard, top-level robots.txt served directly at https://<host>/robots.txt. The request to that exact URL must succeed (2xx response with robots.txt content). If it redirects elsewhere, the connector treats this as a failure to retrieve robots.txt and validation fails.
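The validation rule described above can be sketched as a simple decision: only a direct 2xx response with robots.txt content passes, and any redirect counts as a failure. The function name and signature below are illustrative only, not the connector's actual API.

```python
# Hypothetical sketch of the validation rule described above: only a direct
# 2xx response with robots.txt content passes; any redirect counts as failure.
# `validation_passes` is an illustrative name, not the connector's real API.

def validation_passes(status_code: int, body: str) -> bool:
    """Return True only for a direct 2xx response containing content."""
    return 200 <= status_code < 300 and bool(body.strip())

# A 301 to the base URL fails, even though a robots.txt exists under the
# context path; a direct 2xx with directives passes.
print(validation_passes(301, ""))                          # False
print(validation_passes(200, "User-agent: *\nDisallow:"))  # True
```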
Diagnosis
1. Configure the Web (site crawler) connector to crawl https://example.com/documentation.
2. Open a terminal and query the top-level robots.txt URL of the target site. For example:
curl --verbose https://example.com/robots.txt
3. Observe the HTTP response headers. In this case, the response was a redirect:
HTTP/1.1 301 Moved Permanently (or similar redirect status)
Location: https://example.com/
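The curl check above can also be scripted. This is a minimal standard-library sketch (the helper names are ours, not part of the connector or any Atlassian tooling) that fetches the top-level robots.txt without following redirects and classifies what the connector would see:

```python
# Minimal sketch mirroring the curl diagnosis: fetch the top-level robots.txt
# without following redirects and classify the response. Helper names are
# illustrative, not part of the connector or any Atlassian tooling.
import urllib.error
import urllib.request
from typing import Optional

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None makes urllib raise HTTPError on 3xx instead of following.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def classify(status: int, location: Optional[str]) -> str:
    if 300 <= status < 400:
        return f"REDIRECT to {location}"  # connector validation would fail
    if 200 <= status < 300:
        return "OK"  # robots.txt served directly
    return f"ERROR {status}"

def check_robots(host: str) -> str:
    opener = urllib.request.build_opener(NoRedirect())
    try:
        resp = opener.open(f"https://{host}/robots.txt", timeout=10)
        return classify(resp.status, None)
    except urllib.error.HTTPError as err:
        return classify(err.code, err.headers.get("Location"))

# Example (requires network access):
#   check_robots("example.com")
```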
Cause
Instead of returning the contents of robots.txt, the server responds to the request for https://example.com/robots.txt with a redirect to the base URL, so the connector never receives the file and treats validation as failed.
Solution
To resolve the issue, the site must serve a standard, top-level robots.txt file directly at https://<host>/robots.txt. If the crawl rules are maintained under the context path, the top-level URL should return the same directives as https://example.com/documentation/robots.txt rather than redirecting.
Verify the fix from the command line:
curl --verbose https://<host>/robots.txt
Confirm that:
- There is no 3xx redirect.
- The response body contains the expected robots.txt directives.
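Beyond checking the HTTP status, the retrieved body can be sanity-checked with Python's standard-library robots.txt parser. The directives below are a sample, not the affected site's actual rules:

```python
# Sanity-check robots.txt directives with Python's stdlib parser.
# The sample rules below are illustrative, not the affected site's file.
from urllib.robotparser import RobotFileParser

sample = """\
User-agent: *
Disallow: /private/
Allow: /documentation/
"""

parser = RobotFileParser()
parser.parse(sample.splitlines())

# The crawler may fetch pages under /documentation/ but not /private/:
print(parser.can_fetch("*", "https://example.com/documentation/page"))  # True
print(parser.can_fetch("*", "https://example.com/private/secret"))      # False
```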
Re-run connector validation:
In the Web (site crawler) connector configuration, re-run validation for the same site. The connector should now successfully retrieve and validate robots.txt and proceed with crawling (subject to the rules defined in the file).
If issues persist:
- Capture the full HTTP exchange (headers and status) from https://<host>/robots.txt.
- Provide these details to Atlassian Support so we can confirm whether the behavior still matches this known limitation.