Content Indexing
The Clear Ideas App includes an integrated indexing system for the text in supported document formats. When indexing is enabled in the site settings, files in a supported format are indexed automatically on upload. Indexing makes document content searchable and available to AI features.
How Content Indexing Works
Content indexing extracts text from uploaded documents and makes it searchable. The indexing process:
- Upload: Document is uploaded to a site
- Text Extraction: Text is extracted from the document (using OCR for images/PDFs if needed)
- Indexing: Extracted text is processed and indexed
- Search Availability: Content becomes available for search and AI features
Automatic Process: Indexing happens automatically when content is uploaded, provided indexing is enabled in site settings.
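The four stages above can be sketched as a small state progression. This is an illustrative model only, not the Clear Ideas implementation: the `Stage`, `Document`, and `run_pipeline` names are hypothetical, and the sketch assumes that a document stops after upload when indexing is disabled.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    """The four stages listed above (names are hypothetical)."""
    UPLOADED = "uploaded"
    TEXT_EXTRACTED = "text_extracted"
    INDEXED = "indexed"
    SEARCHABLE = "searchable"


@dataclass
class Document:
    name: str
    stages: list[Stage] = field(default_factory=list)


def run_pipeline(doc: Document, indexing_enabled: bool) -> Document:
    """Record each stage a document passes through."""
    doc.stages.append(Stage.UPLOADED)  # upload happens regardless of settings
    if not indexing_enabled:
        return doc  # assumption: nothing further happens when indexing is off
    for stage in (Stage.TEXT_EXTRACTED, Stage.INDEXED, Stage.SEARCHABLE):
        doc.stages.append(stage)
    return doc
```

Calling `run_pipeline(Document("report.pdf"), indexing_enabled=True)` records all four stages in order, while the same call with `indexing_enabled=False` records only the upload.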
Supported File Formats
Clear Ideas supports indexing for a wide range of document formats:
- JSON - .json (application/json)
- Word Document - .docx (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
- OpenDocument Text - .odt (application/vnd.oasis.opendocument.text)
- EPUB - .epub (application/epub+zip)
- PowerPoint Presentation - .pptx (application/vnd.openxmlformats-officedocument.presentationml.presentation)
- PowerPoint - .ppt (application/vnd.ms-powerpoint)
- RTF - .rtf (application/rtf)
- Texinfo - .texinfo (application/x-texinfo)
- LaTeX - .latex (application/x-latex)
- PDF - .pdf (application/pdf)
- reStructuredText - .rst (text/x-rst)
- Textile - .textile (text/x-textile)
- MediaWiki - .mediawiki (text/x-mediawiki)
- DocBook - .docbook (application/docbook+xml)
- JATS - .jats (application/jats+xml)
- Org mode - .org (text/x-org)
- Jupyter Notebook - .ipynb (application/x-ipynb+json)
- CSV - .csv (text/csv)
- AsciiDoc - .asciidoc (text/asciidoc)
- CommonMark - .commonmark (text/commonmark)
- Creole - .creole (text/x-creole)
- OPML - .opml (application/x-opml)
- ICML - .icml (application/vnd.adobe.icml)
- Wiki - .wiki (text/x-wiki)
- Jira - .jira (text/x-jira)
- HTML - .html, .htm (text/html)
- Markdown - .md, .markdown (text/markdown)
- Excel - .xls (application/vnd.ms-excel)
- Excel - .xlsx (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
Text-Based Formats: Documents with native text content (for example DOCX, Markdown, and HTML) are indexed directly.
Image-Based Formats: Documents whose text is stored as images (such as scanned PDFs) use OCR for indexing.
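As a rough illustration of how the format list and the text-based/image-based distinction could be checked before upload, the sketch below maps a handful of extensions from the list above to their MIME types and picks an extraction route. The `extraction_route` function and the route names are hypothetical, and the assumption that only PDFs in this subset may need OCR is a simplification, not a statement of how the app decides.

```python
from pathlib import PurePath

# A few entries copied from the supported-formats list above (not exhaustive).
SUPPORTED_MIME_TYPES = {
    ".json": "application/json",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pdf": "application/pdf",
    ".csv": "text/csv",
    ".html": "text/html",
    ".htm": "text/html",
    ".md": "text/markdown",
    ".markdown": "text/markdown",
}

# Assumption for this sketch: of the entries above, only PDFs may need OCR.
MAY_NEED_OCR = {".pdf"}


def extraction_route(filename: str) -> str:
    """Return 'direct', 'ocr_if_needed', or 'unsupported' for a file name."""
    suffix = PurePath(filename).suffix.lower()  # case-insensitive extension
    if suffix not in SUPPORTED_MIME_TYPES:
        return "unsupported"
    return "ocr_if_needed" if suffix in MAY_NEED_OCR else "direct"
```

A check like this only looks at the file extension; the actual content of a file determines whether OCR is really required.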
Optical Character Recognition (OCR)
While many supported formats are inherently text-based, others, such as PDFs, may store their text as page images and require Optical Character Recognition (OCR) before that text can be extracted.
How OCR Works
Image Analysis: OCR analyzes images and PDF pages to identify text characters
Text Extraction: Recognized characters are converted to text
Processing: Extracted text is processed and indexed for search
AI Features: Once extracted, text becomes accessible for AI-enhanced search and AI chat functionalities
OCR Requirements
Site Settings: OCR must be enabled in site settings (see Advanced Search Settings)
Account Settings: OCR must be enabled at the account level
File Types: OCR applies to PDFs and image-based documents
OCR Processing Time
Small Documents: Typically process quickly
Large Documents: May take longer depending on page count and complexity
Background Processing: OCR processing happens in the background
Security and Privacy
All extracted text is kept encrypted throughout its lifecycle in the Clear Ideas App. The sole exception is when the text is used to fulfill an AI Chat request.
Encryption
At Rest: Extracted text is encrypted when stored
In Transit: Text is encrypted when transmitted
AI Processing: Text is decrypted only when needed for AI features, then re-encrypted
For further details on the security of extracted text, refer to our Encryption & Privacy guide.
Content Indexing Status
Administrators can monitor the content indexing status of each file through a dedicated icon displayed next to the file.
Indexing Status Icons
Indexed: File has been successfully indexed and is searchable
Indexing: File is currently being indexed
Not Indexed: File has not been indexed (the format may be unsupported, or indexing may be disabled)
Error: Indexing failed (retry, or check that the file format is supported)
For more information, please visit our Information Icons section.
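The four statuses above can be modeled as an enumeration with a suggested follow-up for each. This is a hypothetical sketch for reasoning about the states, for example in an internal checklist. The `IndexingStatus` type and the wording of each suggestion are assumptions drawn from the descriptions above, not an API exposed by the app.

```python
from enum import Enum


class IndexingStatus(Enum):
    """Status labels as listed above."""
    INDEXED = "Indexed"
    INDEXING = "Indexing"
    NOT_INDEXED = "Not Indexed"
    ERROR = "Error"


# Suggested follow-up per status, paraphrasing the descriptions above.
NEXT_STEP = {
    IndexingStatus.INDEXED: "No action needed; the file is searchable.",
    IndexingStatus.INDEXING: "Wait for background indexing to finish.",
    IndexingStatus.NOT_INDEXED: "Check that the format is supported and indexing is enabled.",
    IndexingStatus.ERROR: "Retry indexing or check the file format.",
}


def is_searchable(status: IndexingStatus) -> bool:
    """Only fully indexed files are searchable."""
    return status is IndexingStatus.INDEXED


def next_step(status: IndexingStatus) -> str:
    """Look up the suggested follow-up for a status."""
    return NEXT_STEP[status]
```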
Checking Indexing Status
File List: Indexing status icons appear next to files in list and tile views
File Details: Indexing status is shown in file information panels
Settings: Site settings show overall indexing status and statistics
Enabling Content Indexing
Account-Level Settings
Content indexing must be enabled at the account level:
- Navigate to Settings > Search
- Enable Full-Text Search and OCR for PDFs as needed
- Save settings
Site-Level Settings
Site-level settings can narrow the account defaults for an individual site:
- Navigate to Site Settings > Search
- Configure indexing settings for the site
- Save settings
Note: Site settings can only restrict features enabled at the account level. See Advanced Search Settings for details.
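One way to read the rule that site settings can only restrict account-level features is as a logical AND of the two levels. The sketch below assumes exactly that. The `SearchSettings` type and its field names are hypothetical, chosen to mirror the Full-Text Search and OCR for PDFs options mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchSettings:
    full_text_search: bool
    ocr_for_pdfs: bool


def effective_settings(account: SearchSettings, site: SearchSettings) -> SearchSettings:
    """A feature is active for a site only if both levels enable it."""
    return SearchSettings(
        full_text_search=account.full_text_search and site.full_text_search,
        ocr_for_pdfs=account.ocr_for_pdfs and site.ocr_for_pdfs,
    )
```

Under this reading, a site that turns on OCR while the account has it turned off still ends up with OCR disabled.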
Indexing and AI Features
AI-Enhanced Search
Indexed content enables AI-enhanced search, which uses semantic understanding to find relevant content even when exact keywords don't match.
Benefits:
- Better search results
- Semantic understanding
- Context-aware results
AI Chat
Indexed content powers AI Chat, allowing AI models to answer questions based on your documents.
Benefits:
- Document-based answers
- Context-aware responses
- Comprehensive understanding
Document Summaries
Indexed content enables AI document summaries, providing quick overviews of document content.
Benefits:
- Quick document understanding
- Time-saving summaries
- Content overviews
Indexing Statistics
Page Equivalents Indexed
Metric: Site statistics show "Page Equivalents Indexed" - an estimate of pages processed for search/AI
Usage: Use this metric to gauge how much content a site has indexed
Tracking: Monitor the figure over time to see how indexing volume changes
Indexing Limits
Plan Limits: Your plan may include limits on indexed pages
Overage: Additional indexed pages beyond plan limits may incur charges
Monitoring: Track indexing usage to stay within plan limits
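The relationship between indexed pages and a plan limit comes down to simple arithmetic, sketched below. The function names and the figures in the usage note are hypothetical; actual limits and any overage charges depend on your plan, and no pricing is modeled here.

```python
def overage_pages(indexed_pages: int, plan_limit: int) -> int:
    """Pages indexed beyond the plan limit (zero when within the limit)."""
    return max(0, indexed_pages - plan_limit)


def remaining_pages(indexed_pages: int, plan_limit: int) -> int:
    """Pages that can still be indexed before reaching the plan limit."""
    return max(0, plan_limit - indexed_pages)
```

With an illustrative limit of 10,000 page equivalents, 12,500 indexed pages gives an overage of 2,500, while 8,000 indexed pages leaves 2,000 remaining and no overage.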
Best Practices
Enable Indexing
When Needed: Enable indexing for sites where search and AI features are important
Selective: To optimize performance, consider disabling indexing for sites that do not need it
Review: Regularly review indexing settings to ensure they match your needs
Monitor Status
Regular Checks: Check indexing status regularly to ensure content is being indexed
Error Resolution: Address indexing errors promptly to ensure content is searchable
Status Icons: Use status icons to quickly identify indexing issues
Optimize Content
File Formats: Use supported formats for best indexing results
Text Quality: Ensure documents have clear, readable text for OCR
Structure: Well-structured documents index more effectively
Troubleshooting
Content Not Indexing
Check Settings: Verify indexing is enabled at account and site levels
File Format: Ensure file format is supported for indexing
Status Icons: Check status icons for indexing errors
Retry: Some indexing failures may resolve on retry
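The checks above can be run in a fixed order, stopping at the first one that explains the problem. The sketch below is a hypothetical checklist helper, not a tool provided by the app. The `diagnose` function, its parameters, and the status strings are assumptions based on the items above.

```python
def diagnose(
    account_enabled: bool,
    site_enabled: bool,
    format_supported: bool,
    status: str,
) -> str:
    """Return the first applicable troubleshooting step, in the order listed above."""
    if not account_enabled:
        return "Enable indexing at the account level."
    if not site_enabled:
        return "Enable indexing in the site settings."
    if not format_supported:
        return "Convert the file to a supported format."
    if status == "Error":
        return "Retry indexing; contact support if the error persists."
    if status == "Indexing":
        return "Indexing is still running in the background; check again later."
    return "No configuration problem found."
```

For example, `diagnose(True, False, True, "Not Indexed")` points at the site settings, because the account-level check passes and the site-level check is the first to fail.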
Slow Indexing
Large Files: Large files may take longer to index
OCR Processing: OCR processing may take time for image-heavy documents
Background Processing: Indexing happens in the background - allow time for completion
Indexing Errors
Format Issues: Some file formats may not index correctly
Corrupted Files: Corrupted files may fail to index
Support: Contact support if indexing errors persist
Related Documentation
- Advanced Search Settings - Configure search and indexing
- Site AI Settings - Enable AI features that use indexed content
- Encryption & Privacy - Learn about security practices
- Site Statistics - View indexing statistics