Website Training

Steps to train Website URL:

  1. Go to "Website Links" section in Content Tab
  2. Type/Paste a valid website URL & Press enter.
  3. Mention any specific URL or keywords present in website to include or exclude. (Optional)
  4. Set website scraping depth.
  5. Click on "Save & Train".

Note:

  • Website scraping can be done using 3 different settings

a) Default (7 Pages) - Scraping will be done till 7 pages depth

b) Unlimited Depth: All the URL present inside the added website will be scraped

c) Only Provided Page: Only the added website URL will be scraped


  • Multiple URL's can be added by using semicolon as a separator


Website Training Limitations:

Here is a list of restrictions and content handling Issues that user can face while training the website content

  1. Website Restrictions:
    • Captcha: The site may have CAPTCHA mechanisms to prevent scraping.
    • IP Blocking: If you're making too many requests from a single IP, the site might block it temporarily or permanently.
    • Robots.txt Restrictions: Some websites have a robots.txt file that disallows scraping.
    • Rate Limiting: Websites often limit the number of requests you can make within a specific timeframe.
    • User-Agent Blocking: Some sites block requests with specific user-agent headers, which scraping libraries often use by default.
  2. URL Structure or Invalid URLs:
    • Broken or Invalid URLs: The URLs might be malformed or redirect to a broken link.
    • Redirections: The URL could redirect multiple times, resulting in failures to retrieve content.
    • Query Parameters: Some websites use dynamic content loaded via JavaScript, so query parameters might impact the result.
  3. Content Types and Formats:
    • Non-HTML Content: Some URLs may return non-HTML content (PDFs, images, etc.) which cannot be easily scraped unless explicitly handled.
    • Dynamic Content (JavaScript-based): If a site relies heavily on JavaScript to render its content (e.g., SPA apps), traditional scraping methods might fail.
    • Obfuscated or Minified HTML: Sometimes content is purposefully obfuscated to prevent scraping, making it difficult to parse.

Advanced Website Scraper

Advanced mode introduced for scraping containing JavaScript enabled content and single page application (SPA) websites. Advanced scraping is done using Firecrawl services. Scraper type options (Basic & Advanced) dropdown available in Website Links section of Content Tab.

Value Delivered

  • Increased run time AI accuracy in response generation due to higher content coverage
  • Helps in faster delivery of use cases by reducing manual effort where content scraping required for Javascript enabled content & SPA websites

Basic Vs Advanced Usage Guidelines:

  • Basic: Recommended when there is a simple website. Use it for first time training. If training is failing repeatedly or accuracy is less, then only move to Advanced
  • Advanced: Recommended for website with JavaScript enabled content & single page application websites.

Key Limitation of Advanced Scraper:

  • Max Limit on Website Pages in a single training - 3000
  • Higher training time as compared to basic: ~ 30-40 minutes for 3000 pages
  • Can not perform OCR on images
  • Can not Scrape website hosted contents like pdf, docs, videos etc
  • Can not scrape dynamic content and perform operations like Click, Scroll, Wait while extracting the content