Website Training
Steps to train a website URL:
- Go to the "Website Links" section in the Content Tab.
- Type or paste a valid website URL and press Enter.
- Optionally, specify any URLs or keywords present in the website to include or exclude.
- Set the website scraping depth.
- Click "Save & Train".
Note:
- Website scraping can be done using three different settings:
a) Default (7 Pages): scraping is done up to a depth of 7 pages
b) Unlimited Depth: all the URLs found inside the added website will be scraped
c) Only Provided Page: only the added website URL will be scraped
- Multiple URLs can be added by using a semicolon as a separator
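The three depth settings above amount to a crawl with a configurable page limit. A minimal sketch, assuming a caller-supplied `fetch_links` function (hypothetical — it would return the links found on a page) and a breadth-first traversal:

```python
from collections import deque

def crawl(start_urls, fetch_links, max_pages=7):
    """Breadth-first crawl, visiting at most max_pages pages.

    max_pages=7    -> roughly the "Default (7 Pages)" setting
    max_pages=None -> "Unlimited Depth" (visit every reachable URL)
    max_pages=1    -> "Only Provided Page"
    """
    # Multiple URLs may arrive as one semicolon-separated string.
    if isinstance(start_urls, str):
        start_urls = [u.strip() for u in start_urls.split(";") if u.strip()]

    seen = set(start_urls)
    queue = deque(start_urls)
    visited = []
    while queue and (max_pages is None or len(visited) < max_pages):
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):  # caller decides how links are fetched
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

For example, with a simulated link graph `{"a": ["b", "c"], "b": ["d"]}` and `fetch_links = lambda u: graph.get(u, [])`, `max_pages=1` visits only `"a"`, while `max_pages=None` visits all four pages. The actual product's definition of "depth" may differ (pages visited vs. link distance); this sketch counts pages visited.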
Website Training Limitations:
Here is a list of restrictions and content-handling issues that users can face while training on website content:
- Website Restrictions:
- Captcha: The site may have CAPTCHA mechanisms to prevent scraping.
- IP Blocking: If you're making too many requests from a single IP, the site might block it temporarily or permanently.
- Robots.txt Restrictions: Some websites have a robots.txt file that disallows scraping.
- Rate Limiting: Websites often limit the number of requests you can make within a specific timeframe.
- User-Agent Blocking: Some sites block requests with specific user-agent headers, which scraping libraries often use by default.
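Several of these restrictions can be checked up front. For instance, a scraper can parse the site's robots.txt rules and honor any crawl delay before making requests. A small sketch using Python's standard `urllib.robotparser`, with a made-up sample robots.txt (the real file would be fetched from the target site):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Paths under /private/ are disallowed for all user agents.
print(rp.can_fetch("MyScraper/1.0", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyScraper/1.0", "https://example.com/public/page"))   # True
# Respecting the crawl delay helps avoid rate limiting and IP blocking.
print(rp.crawl_delay("MyScraper/1.0"))  # 10
```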
- URL Structure or Invalid URLs:
- Broken or Invalid URLs: The URLs might be malformed or redirect to a broken link.
- Redirections: The URL could redirect multiple times, resulting in failures to retrieve content.
  - Query Parameters: Some websites serve different content depending on query parameters, or load it dynamically via JavaScript, so the parameters in a URL might change the scraped result.
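Malformed URLs can often be caught before training starts with a basic structural check. A minimal sketch (an illustrative helper, not part of the product) using Python's standard `urllib.parse`:

```python
from urllib.parse import urlparse

def is_valid_url(url: str) -> bool:
    """Basic structural check: scheme must be http(s) and a host must be present."""
    try:
        parts = urlparse(url.strip())
    except ValueError:  # e.g. malformed IPv6 literals raise here
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(is_valid_url("https://example.com/docs?page=2"))  # True
print(is_valid_url("htp:/example"))                     # False (bad scheme)
print(is_valid_url("example.com"))                      # False (missing scheme)
```

Note that a structurally valid URL can still fail at fetch time (broken links, redirect loops), so this check only filters out the obviously malformed inputs.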
- Content Types and Formats:
- Non-HTML Content: Some URLs may return non-HTML content (PDFs, images, etc.) which cannot be easily scraped unless explicitly handled.
  - Dynamic Content (JavaScript-based): If a site relies heavily on JavaScript to render its content (e.g., single-page applications), traditional scraping methods might fail.
- Obfuscated or Minified HTML: Sometimes content is purposefully obfuscated to prevent scraping, making it difficult to parse.
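One common way a scraper handles non-HTML responses is to inspect the `Content-Type` response header and skip anything it cannot parse. A small illustrative sketch (not the product's actual logic):

```python
def is_scrapable_html(content_type: str) -> bool:
    """Return True when a Content-Type header value denotes an HTML/XHTML page."""
    # Strip parameters such as "; charset=utf-8" before comparing the MIME type.
    mime = content_type.split(";")[0].strip().lower()
    return mime in ("text/html", "application/xhtml+xml")

print(is_scrapable_html("text/html; charset=utf-8"))  # True
print(is_scrapable_html("application/pdf"))           # False
print(is_scrapable_html("image/png"))                 # False
```

PDFs, images, and other binary formats would then need their own explicit handlers rather than going through the HTML parser.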