Website Training

Steps to train a website URL:

  1. Go to the "Website Links" section in the Content tab.
  2. Type or paste a valid website URL and press Enter.
  3. Mention any specific URLs or keywords present on the website to include or exclude (optional; see the filtering sketch after these steps).
  4. Set the website scraping depth.
  5. Click "Save & Train".
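
The include/exclude option in step 3 controls which pages of the site are kept for training. The product's exact matching rules are not documented here, so the sketch below is only a conceptual illustration: the hypothetical `include_keywords` and `exclude_keywords` lists are matched as substrings against each discovered URL.

```python
# Hypothetical illustration of include/exclude filtering; the product's
# actual matching rules may differ.

def should_train(url: str, include_keywords=None, exclude_keywords=None) -> bool:
    """Return True if a discovered URL should be kept for training."""
    include_keywords = include_keywords or []
    exclude_keywords = exclude_keywords or []

    # Exclusions win: skip any URL containing an excluded keyword.
    if any(keyword in url for keyword in exclude_keywords):
        return False

    # If include keywords are given, keep only URLs containing one of them.
    if include_keywords:
        return any(keyword in url for keyword in include_keywords)

    # No filters given: keep everything.
    return True


# Example: keep documentation pages, skip the blog.
print(should_train("https://example.com/docs/setup", ["/docs"], ["/blog"]))   # True
print(should_train("https://example.com/blog/post-1", ["/docs"], ["/blog"]))  # False
```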

Note:

  • Website scraping can be done using three different depth settings (illustrated in the sketch after this note):

a) Default (7 Pages): Scraping will be done up to a depth of 7 pages.

b) Unlimited Depth: All the URLs found inside the added website will be scraped.

c) Only Provided Page: Only the added website URL will be scraped


  • Multiple URLs can be added by using a semicolon as a separator.
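
To make the depth settings concrete, here is a minimal breadth-first crawl sketch, assuming the `requests` and `beautifulsoup4` packages. The `max_depth` parameter stands in for the setting above (`7` for Default, `None` for Unlimited Depth, `0` for Only Provided Page), and the semicolon-separated input is split into individual start URLs. This only illustrates the concept and is not the product's actual crawler.

```python
# Conceptual sketch of depth-limited scraping; not the product's actual crawler.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(start_urls: str, max_depth=7):
    """Breadth-first crawl. max_depth=None ~ Unlimited Depth, 0 ~ Only Provided Page."""
    seen = set()
    # Multiple URLs are separated by semicolons, as described above.
    queue = deque((url.strip(), 0) for url in start_urls.split(";") if url.strip())

    while queue:
        url, depth = queue.popleft()
        if url in seen:
            continue
        seen.add(url)

        response = requests.get(url, timeout=10)
        yield url, response.text

        # Stop following links once the configured depth is reached.
        if max_depth is not None and depth >= max_depth:
            continue

        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"])
            # Stay on the same site as the page that was added.
            if urlparse(next_url).netloc == urlparse(url).netloc:
                queue.append((next_url, depth + 1))


# Example: scrape only the two provided pages, no deeper.
for page_url, html in crawl("https://example.com; https://example.com/docs", max_depth=0):
    print(page_url, len(html))
```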


Website Training Limitations:

Here is a list of restrictions and content-handling issues that users can face while training on website content (a pre-flight check sketch follows the list):

  1. Website Restrictions:
    • Captcha: The site may have CAPTCHA mechanisms to prevent scraping.
    • IP Blocking: If you're making too many requests from a single IP, the site might block it temporarily or permanently.
    • Robots.txt Restrictions: Some websites have a robots.txt file that disallows scraping.
    • Rate Limiting: Websites often limit the number of requests you can make within a specific timeframe.
    • User-Agent Blocking: Some sites block requests with specific user-agent headers, which scraping libraries often use by default.
  2. URL Structure or Invalid URLs:
    • Broken or Invalid URLs: The URLs might be malformed or redirect to a broken link.
    • Redirections: The URL could redirect multiple times, resulting in failures to retrieve content.
    • Query Parameters: Some pages serve different content depending on their query parameters, or load that content dynamically via JavaScript, which can affect what is scraped.
  3. Content Types and Formats:
    • Non-HTML Content: Some URLs may return non-HTML content (PDFs, images, etc.) which cannot be easily scraped unless explicitly handled.
    • Dynamic Content (JavaScript-based): If a site relies heavily on JavaScript to render its content (e.g., single-page applications), traditional scraping methods might fail.
    • Obfuscated or Minified HTML: Sometimes content is purposefully obfuscated to prevent scraping, making it difficult to parse.
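
Many of the failures above can be spotted before training by running a quick pre-flight check against the URL. The sketch below (assuming the `requests` package; the user-agent string is just a placeholder, not the product's real one) checks the site's robots.txt rules, follows redirects, and verifies that the response is HTML rather than a PDF or image. It does not detect CAPTCHAs, rate limiting, or JavaScript-rendered content, which generally only show up once scraping actually starts.

```python
# Quick pre-flight check for a URL before training; illustrative only.
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

import requests

USER_AGENT = "my-trainer-bot/1.0"  # placeholder user agent, not the product's real one


def preflight(url: str) -> list[str]:
    """Return a list of human-readable warnings for the given URL."""
    warnings = []
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return [f"Invalid or malformed URL: {url!r}"]

    # robots.txt restrictions: some sites disallow scraping entirely.
    robots = RobotFileParser()
    robots.set_url(urljoin(f"{parsed.scheme}://{parsed.netloc}", "/robots.txt"))
    try:
        robots.read()
        if not robots.can_fetch(USER_AGENT, url):
            warnings.append("robots.txt disallows fetching this URL")
    except OSError:
        warnings.append("Could not read robots.txt")

    # Redirects, blocking, and non-HTML content types.
    try:
        response = requests.head(
            url, headers={"User-Agent": USER_AGENT}, allow_redirects=True, timeout=10
        )
    except requests.RequestException as exc:
        return warnings + [f"Request failed: {exc}"]

    if response.history:
        warnings.append(f"URL redirected {len(response.history)} time(s) to {response.url}")
    if response.status_code in (401, 403, 429):
        warnings.append(f"Site refused the request (HTTP {response.status_code})")
    content_type = response.headers.get("Content-Type", "")
    if "text/html" not in content_type:
        warnings.append(f"Non-HTML content type: {content_type or 'unknown'}")

    return warnings


print(preflight("https://example.com/brochure.pdf"))
```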