Web Import
Import content from public websites directly into your Clear Ideas Sites. Web Import crawls websites, extracts content, and makes it available for AI-powered features like Public AI Chat and AI Enhanced Search.
Purpose
This document explains how to import content from public websites into Clear Ideas Sites. It covers both immediate imports for one-time content capture and scheduled imports for keeping content up to date automatically.
Scope
Covered
- Immediate web import configuration and execution
- Scheduled web import creation and management
- Import options including URL masking, depth control, and file type selection
- Folder organization for imported content
- Integration with Public AI Chat and other AI features
Not Covered
- API-level web import implementation details
- Content indexing and search configuration (see Content Indexing)
- Site creation and folder management (see Sites)
Prerequisites
- Editor role or higher (owner, admin, or editor) on the target Site
- Paid subscription (required for web import functionality)
- Valid public website URL (must start with http:// or https://)
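The URL requirement can be checked before you submit an import. The following is a minimal sketch, not the check Web Import itself performs; the function name `is_importable_url` is hypothetical, and the code only verifies the scheme and the presence of a host, not that the site is publicly reachable.

```python
from urllib.parse import urlparse


def is_importable_url(url: str) -> bool:
    """Return True if the URL uses http or https and names a host."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_importable_url("https://example.com/docs/"))   # True
print(is_importable_url("ftp://example.com/file.pdf"))  # False
print(is_importable_url("example.com/docs"))            # False
```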
Immediate Web Import
Import content from a single URL immediately. This is useful for one-time content capture or testing import settings before creating a schedule.
How to Import
- Navigate to the Site where you want to import content
- Click the Web Import button (globe pointer icon)
- Enter the URL you want to import
- Configure import options (see Configuration Options below)
- Select the destination folder (defaults to Site root)
- Click Import to start the import immediately
Configuration Options
URL: The starting URL to import. Must be a valid HTTP or HTTPS URL.
URL Mask (Advanced): Optional pattern to limit which pages are imported. Use this to restrict imports to specific sections of a website. For example:
- https://example.com/docs/* - Only import pages under /docs/
- https://example.com/blog/* - Only import blog posts
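To reason about which pages a mask would include, it can help to test it against sample URLs. The sketch below assumes that `*` matches any run of characters and that every other character is literal; the exact matching rules used by Web Import are not documented here, and `matches_mask` is a hypothetical helper.

```python
import re


def matches_mask(url: str, mask: str) -> bool:
    """Check a URL against a mask where '*' matches any run of characters."""
    # Escape everything, then turn the escaped '*' back into a wildcard.
    pattern = re.escape(mask).replace(r"\*", ".*")
    return re.fullmatch(pattern, url) is not None


mask = "https://example.com/docs/*"
print(matches_mask("https://example.com/docs/setup/install", mask))  # True
print(matches_mask("https://example.com/blog/launch", mask))         # False
```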
Follow Link Depth (Advanced): Controls how many levels of links to follow from the starting URL:
- 1 Level: Only the page at the starting URL
- 2 Levels: The starting page plus all pages linked from it
- 3 Levels: The starting page plus two levels of linked pages
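The depth levels can be illustrated with a breadth-first walk over a small link graph. This is a sketch under stated assumptions: the `LINKS` graph and the `pages_for_depth` function are hypothetical, the starting page counts as level 1, and each page is visited once. It is not a description of how the Web Import crawler is implemented.

```python
from collections import deque

# Hypothetical link graph: page -> links found on that page.
LINKS = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/install", "/"],
    "/blog": ["/blog/launch"],
    "/docs/install": ["/docs/install/linux"],
}


def pages_for_depth(start: str, depth: int) -> list[str]:
    """Return the pages reached when the start page counts as level 1."""
    seen = {start}
    order = [start]
    queue = deque([(start, 1)])
    while queue:
        page, level = queue.popleft()
        if level >= depth:
            continue  # do not follow links from the deepest level
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                order.append(link)
                queue.append((link, level + 1))
    return order


print(pages_for_depth("/", 1))  # ['/']
print(pages_for_depth("/", 2))  # ['/', '/docs', '/blog']
print(pages_for_depth("/", 3))  # ['/', '/docs', '/blog', '/docs/install', '/blog/launch']
```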
File Types: Web Import supports only HTML pages and PDF documents:
- HTML: Import HTML pages and content
- PDF: Import PDF files linked from the pages
Destination Folder: Select the folder where imported content will be stored. Defaults to the Site root.
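When planning an import, it can be useful to write the options down in one place and check them before entering them in the dialog. The sketch below is purely illustrative: `ImportOptions`, its field names, and its default depth and file types are assumptions, not part of any Clear Ideas API. Only the constraints themselves (HTTP or HTTPS URL, three depth levels, HTML and PDF, Site root as the default folder) come from the options above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

ALLOWED_FILE_TYPES = {"html", "pdf"}
ALLOWED_DEPTHS = (1, 2, 3)


@dataclass
class ImportOptions:
    """Hypothetical container for the options described above."""
    url: str
    url_mask: Optional[str] = None
    depth: int = 1  # assumed default, chosen for cautious testing
    file_types: Set[str] = field(default_factory=lambda: {"html", "pdf"})
    destination_folder: str = "/"  # "/" stands in for the Site root

    def problems(self) -> List[str]:
        """Return a list of human-readable problems; empty means valid."""
        issues = []
        if not self.url.startswith(("http://", "https://")):
            issues.append("URL must start with http:// or https://")
        if self.depth not in ALLOWED_DEPTHS:
            issues.append("depth must be 1, 2, or 3")
        if not self.file_types:
            issues.append("select at least one file type")
        unsupported = self.file_types - ALLOWED_FILE_TYPES
        if unsupported:
            issues.append("unsupported file types: " + ", ".join(sorted(unsupported)))
        return issues


opts = ImportOptions(url="https://example.com/docs/", depth=5, file_types={"html", "docx"})
print(opts.problems())
# ['depth must be 1, 2, or 3', 'unsupported file types: docx']
```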
Version Management
Web Import maintains a one-to-one relationship between each URL and its corresponding file. When importing a URL that already exists in your Site, Web Import creates a new version of the file, preserving the previous version in the file's version history.
A new version is created only when the content has changed. If the content at a URL is identical to the previous import, no new version is created. However, some websites modify page content on every request, for example by adding timestamps, session identifiers, or other dynamic elements, even when the substantive content is unchanged. In these cases, Web Import may create additional versions although nothing meaningful has changed. This behavior depends on how the source website generates its pages.
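The effect of dynamic page content on versioning can be seen in a small model. The sketch below assumes change detection by comparing a hash of the full page content; how Web Import actually compares content is not documented here, and `VersionStore` is a hypothetical class used only for illustration.

```python
import hashlib


class VersionStore:
    """Toy store that keeps one version history per URL."""

    def __init__(self):
        self.versions = {}  # url -> list of content hashes, oldest first

    def record_import(self, url: str, content: str) -> bool:
        """Store a new version if the content changed; return True if stored."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        history = self.versions.setdefault(url, [])
        if history and history[-1] == digest:
            return False  # identical to the previous import
        history.append(digest)
        return True


store = VersionStore()
url = "https://example.com/docs/guide"
print(store.record_import(url, "<p>Guide</p>"))                          # True
print(store.record_import(url, "<p>Guide</p>"))                          # False
print(store.record_import(url, "<p>Guide</p><!-- generated 10:01 -->"))  # True
```

The third import differs only by a generated timestamp comment, yet under this assumption it still counts as changed content and produces a new version.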
Scheduled Web Import
Create recurring imports that run automatically on a schedule. Scheduled imports keep your content synchronized with external websites without manual intervention.
Creating a Scheduled Import
- Navigate to Site Settings > Web Import Schedules
- Click Create Scheduled Import
- Enter the URL and configure import options (same as immediate import)
- Enable the Schedule option
- Configure the schedule:
  - Frequency: Hourly, Daily, Weekly, or Monthly
  - Time: Specific hour and minute to run
  - Days: Days of the week (for weekly) or days of the month (for monthly)
- Select the destination folder
- Click Create to save the schedule
Schedule Configuration
Frequency Options:
- Hourly: Runs every hour at the specified minute
- Daily: Runs once per day at the specified time
- Weekly: Runs on specified days of the week at the specified time
- Monthly: Runs on specified days of the month at the specified time
Time Format: Hours and minutes in 24-hour format (e.g., 09:00 for 9:00 AM, 14:30 for 2:30 PM)
Days of Week: Select one or more days (Monday through Sunday) for weekly schedules
Days of Month: Specify day numbers (1-31) or expressions like "1st Monday" or "last Friday" for monthly schedules
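Expressions such as "1st Monday" or "last Friday" resolve to a different calendar date each month. The sketch below shows one way to work out that date for planning purposes. The accepted expression syntax beyond the two examples above is an assumption, and `resolve_day` is a hypothetical helper, not the parser Web Import uses.

```python
import calendar
from datetime import date

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
# Assumed ordinal words; mapped to list indexes.
ORDINALS = {"1st": 0, "2nd": 1, "3rd": 2, "4th": 3, "last": -1}


def resolve_day(expression: str, year: int, month: int) -> date:
    """Resolve an expression like 'last Friday' to a date in the given month."""
    ordinal, weekday_name = expression.lower().split()
    weekday = WEEKDAYS.index(weekday_name)  # Monday is 0, as in date.weekday()
    days_in_month = calendar.monthrange(year, month)[1]
    # All days in the month that fall on the requested weekday.
    matches = [d for d in range(1, days_in_month + 1)
               if date(year, month, d).weekday() == weekday]
    return date(year, month, matches[ORDINALS[ordinal]])


print(resolve_day("1st Monday", 2025, 3))   # 2025-03-03
print(resolve_day("last Friday", 2025, 3))  # 2025-03-28
```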
Managing Scheduled Imports
View all scheduled imports in Site Settings > Web Import Schedules. Each schedule shows:
- URL: The website being imported
- Status: Active or Inactive
- Destination Folder: Where content is stored
- Schedule: Human-readable schedule description
- Last Run: When the import last executed
- Next Run: When the import will run next
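To sanity-check the Next Run value shown for a schedule, the next execution time can be estimated from the frequency and time settings. The sketch below is an approximation under stated assumptions: the function names are hypothetical, time zones are ignored, and monthly schedules are left out.

```python
from datetime import datetime, timedelta


def next_hourly_run(now: datetime, minute: int) -> datetime:
    """Next time the clock shows the given minute."""
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(hours=1)
    return candidate


def next_daily_run(now: datetime, hour: int, minute: int) -> datetime:
    """Next occurrence of hour:minute, today or tomorrow."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def next_weekly_run(now: datetime, weekdays: set, hour: int, minute: int) -> datetime:
    """Next occurrence on one of the given weekdays (Monday=0 .. Sunday=6)."""
    for offset in range(8):  # today through the same weekday next week
        day = now + timedelta(days=offset)
        candidate = day.replace(hour=hour, minute=minute, second=0, microsecond=0)
        if candidate > now and candidate.weekday() in weekdays:
            return candidate
    raise ValueError("weekdays must contain at least one day from 0 to 6")


now = datetime(2025, 3, 5, 15, 0)  # a Wednesday afternoon
print(next_hourly_run(now, 15))             # 2025-03-05 15:15:00
print(next_daily_run(now, 9, 0))            # 2025-03-06 09:00:00
print(next_weekly_run(now, {0, 4}, 14, 30)) # 2025-03-07 14:30:00
```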
Actions Available:
- Edit: Modify the schedule configuration
- Activate/Deactivate: Temporarily enable or disable a schedule without deleting it
- Delete: Permanently remove a schedule
Use Cases
Keeping Documentation Up to Date: Schedule regular imports from your documentation website to ensure your Clear Ideas Site always has the latest information.
Powering Public AI Chat: Import content from public websites to make it available through Public AI Chat. Visitors can ask questions about your website content, documentation, or other public resources.
Content Aggregation: Combine content from multiple sources into a single Site for comprehensive search and analysis.
Competitive Intelligence: Regularly import competitor websites or industry resources for analysis and monitoring.
Best Practices
URL Masking: Use URL masks to limit imports to relevant sections. This reduces processing time and ensures you only import content you need.
Depth Selection: Start with depth 1 to test imports. Increase depth gradually to avoid importing too much content unintentionally.
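A rough estimate shows why depth should be increased gradually. The sketch below computes an upper bound on the number of pages fetched, assuming every page links to the same number of new pages; real sites have overlapping links, and a URL mask reduces the count further. The figure of 20 links per page is an arbitrary example, and `max_pages` is a hypothetical helper.

```python
def max_pages(links_per_page: int, depth: int) -> int:
    """Upper bound on pages fetched, counting the starting page as level 1."""
    return sum(links_per_page ** level for level in range(depth))


for depth in (1, 2, 3):
    print(depth, max_pages(20, depth))
# 1 1
# 2 21
# 3 421
```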
Folder Organization: Create dedicated folders for different import sources. This makes it easier to organize and manage imported content.
Schedule Frequency: Match your schedule frequency to how often the source website updates. Weekly or monthly schedules are common for documentation sites.
Monitor Imports: Check the Last Run and Next Run times regularly to ensure schedules are executing as expected.
Test Before Scheduling: Use immediate import to test your configuration before creating a scheduled import. This helps avoid issues with URL masks or depth settings.
Version Management: Websites that generate slightly different content on each request (timestamps, session identifiers, and similar dynamic elements) can produce new versions even when the substantive content hasn't changed. Review version history to see how often each source website actually updates its content.
Related Documentation
- Public AI Chat - Use imported content in public chat
- Content Indexing - Understand how imported content is indexed
- Sites - Learn about Sites and folder organization
- Site AI Settings - Configure AI features for imported content