Newstex NewsCrawler Bot

The Newstex NewsCrawler ("the bot") is an advanced RSS Feed Reader specifically designed to process RSS feeds from signed and authenticated content producers. It is part of the Newstex content processing system, which aggregates and curates content from various sources for distribution and analysis.

The bot reads RSS feeds from signed bloggers and, if full text is not enabled in a feed, scrapes the full text from the related URLs in that feed. The bot may also be used to identify potential new leads for Newstex; however, it will not store any content from blogs that have not signed an agreement with Newstex.

If you suspect spamming or abuse from this bot, please contact us at support@newstex.com with an access log of the suspicious queries.

The bot typically reads an RSS feed only once every 6 hours; however, this may be adjusted to match an individual publisher's publishing schedule.

Functionality

RSS Feed Reading: The bot reads RSS feeds from signed bloggers.

Content Scraping: It scrapes the full text from related URLs in the feed if full text is not enabled in the RSS feed.

Lead Identification: The bot may identify new potential leads for Newstex but does not store content from blogs without a signed agreement with Newstex.

Access Frequency

The bot typically accesses an RSS feed once every 6 hours. This frequency may be adjusted based on the individual publisher's content update schedule. After reading an RSS feed, the bot may also access linked pages from the <link> tag in the provided RSS feed.

When checking for leads, all websites are cached for up to 24 hours, so you should expect no more than a few dozen requests in any given day (no more than a typical human visitor might make).

Technical Details

The bot identifies itself on every request with the user agent NewsCrawler/<version>. The current version is 2.0, making the current user agent string:

NewsCrawler/2.0
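
As a minimal sketch, a site operator could recognize the bot in request headers as follows. It assumes the header value is exactly of the form NewsCrawler/<version> with a dotted numeric version, as in the string above; the function name newscrawler_version is hypothetical and not part of any Newstex tooling.

```python
import re
from typing import Optional

# Pattern assumed from the documented format "NewsCrawler/<version>",
# with a dotted numeric version such as 2.0.
_NEWSCRAWLER_UA = re.compile(r"^NewsCrawler/(\d+(?:\.\d+)*)$")


def newscrawler_version(user_agent: str) -> Optional[str]:
    """Return the version if the header matches NewsCrawler/<version>, else None."""
    match = _NEWSCRAWLER_UA.match(user_agent.strip())
    return match.group(1) if match else None


if __name__ == "__main__":
    print(newscrawler_version("NewsCrawler/2.0"))
    print(newscrawler_version("Mozilla/5.0"))
```

A user agent header can be set by any client, so a match here is best treated as a hint and checked against the source IP address.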

The NewsCrawler bot will usually use one of the IP addresses identified in the Public IP Address list provided by Newstex. Occasional accesses from alternate developer IP addresses may also be observed, but not at a large scale.

Opting Out

The bot respects explicit blocklists in a robots.txt file. If your site hosts a robots.txt file containing either of the following rules, our bot will not access your site:

Deny All:

User-agent: *
Disallow: /

Explicitly Blocking NewsCrawler:

User-agent: NewsCrawler/*
Disallow: /

Note that your robots.txt file may be cached for up to 30 days. If you would like to block access immediately, you may either explicitly block our user agent in your Web Application Firewall or request a blocklist by contacting us at support@newstex.com.
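
To confirm that a robots.txt file contains one of the two forms above, a rough self-check could look like the sketch below. It is not the bot's own parser, whose exact matching behavior is not documented here; it only looks for a group naming `*` or `NewsCrawler/*` that contains `Disallow: /`. The function name blocks_newscrawler is hypothetical.

```python
def blocks_newscrawler(robots_txt: str) -> bool:
    """True if a group for '*' or 'NewsCrawler/*' contains 'Disallow: /'."""
    targets = {"*", "newscrawler/*"}
    agents = []        # user agents named by the group currently being read
    in_rules = False   # True once the current group has an Allow/Disallow line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and targets & set(agents):
                return True
    return False


if __name__ == "__main__":
    print(blocks_newscrawler("User-agent: NewsCrawler/*\nDisallow: /\n"))
```

Because the file may be cached, a passing check does not mean the block is already in effect.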

Reporting Abuse

If you believe our NewsCrawler system is abusing your site, please contact us at support@newstex.com and provide as many details as you can, including IP access logs and the full request with headers.
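
One way to collect the relevant access log entries for such a report is sketched below. It assumes your server writes the common "combined" log layout, in which the user agent is the last double-quoted field on each line; other log formats would need a different pattern. The function name newscrawler_log_lines is hypothetical.

```python
import re

# Assumes the "combined" access log layout, where the user agent is the
# last double-quoted field on each line.
_LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')


def newscrawler_log_lines(lines):
    """Return the log lines whose user agent field starts with 'NewsCrawler/'."""
    selected = []
    for line in lines:
        match = _LAST_QUOTED.search(line)
        if match and match.group(1).startswith("NewsCrawler/"):
            selected.append(line.rstrip("\n"))
    return selected


if __name__ == "__main__":
    import sys

    # Usage: python script.py < access.log > newscrawler_requests.log
    for entry in newscrawler_log_lines(sys.stdin):
        print(entry)
```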