WebSearchAPI.ai Scraper API Reference
Complete reference for the WebSearchAPI.ai Scraper endpoint with advanced content extraction capabilities
WebSearchAPI.ai offers a powerful web scraping API built for developers who need full control over content extraction. Extract clean, structured content from any URL with browser rendering, CSS selectors, JavaScript injection, and AI-optimized markdown formatting perfect for LLMs.
Introduction
The WebSearchAPI.ai Scraper API allows you to:
- Extract and parse content from any URL with AI-optimized formatting
- Use full browser rendering for JavaScript-heavy websites
- Target specific content with CSS selectors
- Inject custom JavaScript before scraping
- Generate AI-powered alt text for images
- Customize markdown output formatting
- Extract links and images with summaries
- Control privacy and caching behavior
Endpoint Details
POST /scrape
Base URL: https://api.websearchapi.ai
Full Endpoint: https://api.websearchapi.ai/scrape
Authentication
Authentication required: all API requests must include your API key in the Authorization header.
Include your API key in the Authorization header using the Bearer token format:
Authorization: Bearer YOUR_API_KEY
You can obtain an API key by signing up for a WebSearchAPI.ai account. Each account receives 1,000 free API credits monthly.
Request Format
All requests to the Scraper API should be made as HTTP POST requests with a JSON body containing the scraping parameters.
Required Headers
| Header | Value |
|---|---|
| Content-Type | application/json |
| Authorization | Bearer YOUR_API_KEY |
Request Body
The request body should be a JSON object containing your scraping parameters:
{
"url": "https://example.com",
"returnFormat": "markdown",
"engine": "browser",
"targetSelector": "article, main",
"removeSelector": "header, footer, nav, .ads",
"withLinksSummary": true,
"withImagesSummary": true
}
Request Parameters
Example Requests
curl -X POST https://api.websearchapi.ai/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/article",
"returnFormat": "markdown",
"engine": "browser",
"targetSelector": "article, main",
"removeSelector": "header, footer, nav, .ads",
"withLinksSummary": true,
"withImagesSummary": true,
"withGeneratedAlt": true,
"timeout": 15
}'
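The same request can also be made from JavaScript. The sketch below uses the built-in fetch API (Node.js 18+ or a modern browser); YOUR_API_KEY is a placeholder for your own key:
const response = await fetch("https://api.websearchapi.ai/scrape", {
  method: "POST",
  headers: {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    url: "https://example.com/article",
    returnFormat: "markdown",
    engine: "browser",
    targetSelector: "article, main",
    withLinksSummary: true
  })
});

const result = await response.json();
console.log(result.data.title);
console.log(result.data.content);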
Response Format
Success Response (200 OK)
A successful API call returns a JSON object with the following structure:
{
"code": 200,
"data": {
"title": "Example Article Title",
"url": "https://example.com/article",
"content": "# Example Article Title\n\nThis is the extracted content in markdown format...",
"links": {
"https://example.com/page1": "Page 1 Title",
"https://example.com/page2": "Page 2 Title"
},
"images": {
"https://example.com/image1.jpg": "AI-generated alt text for image 1",
"https://example.com/image2.jpg": "AI-generated alt text for image 2"
}
}
}
Response Fields
Error Responses
WebSearchAPI.ai returns standard HTTP status codes with JSON error details:
{
"error": {
"code": "VALIDATION_ERROR",
"message": "Invalid request parameters",
"details": {
"url": "Invalid URL format"
}
}
}
Returned when the request contains invalid or missing parameters.
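When handling failures in code, branch on the HTTP status and read the error object shown above. A minimal sketch, assuming options holds the headers and body from the earlier example and that VALIDATION_ERROR is one of possibly several error codes:
const response = await fetch("https://api.websearchapi.ai/scrape", options);

if (!response.ok) {
  const { error } = await response.json();
  // e.g. "VALIDATION_ERROR: Invalid request parameters"
  console.error(`${error.code}: ${error.message}`);
  if (error.details) console.error(error.details);
} else {
  const { data } = await response.json();
  // Use data.title, data.content, data.links, data.images
}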
Rate Limits and Quotas
WebSearchAPI.ai uses a credit-based system for API usage. Each account receives 1,000 free API credits monthly. Scraper API credit usage varies based on features used:
| Operation | Credits |
|---|---|
| Basic scraping (direct engine) | 1 credit |
| Browser rendering | 2 credits |
| With link/image summaries | +1 credit |
| With AI-generated alt text | +1 credit |
| Screenshot/pageshot | 2 credits |
Pro Tip: Optimize Credit Usage. Use the direct engine for simple, static pages to save credits. Reserve the browser and cf-browser-rendering engines for JavaScript-heavy sites. Enable AI features like withGeneratedAlt only when needed for accessibility or LLM processing.
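To estimate cost before sending a request, a rough helper based on the table above could look like the sketch below. The estimateCredits function and its rules are illustrative, not an official pricing formula:
// Illustrative estimate based on the credits table above
function estimateCredits(params) {
  let credits = params.engine === "direct" ? 1 : 2; // browser engines cost more
  if (params.withLinksSummary || params.withImagesSummary) credits += 1;
  if (params.withGeneratedAlt) credits += 1;
  return credits;
}

console.log(estimateCredits({ engine: "browser", withGeneratedAlt: true })); // 3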
Advanced Features
Browser Rendering Engines
The Scraper API offers three rendering engines optimized for different use cases:
| Engine | Best For | Speed | JavaScript Support |
|---|---|---|---|
| direct | Static HTML pages, simple sites | Fastest | None |
| browser | JavaScript-heavy sites, SPAs | Medium | Full |
| cf-browser-rendering | Complex SPAs, anti-bot protection | Slower | Full + Advanced |
Choosing the Right Engine
Direct Engine - Perfect for blogs, documentation sites, and static content. Fastest response time with minimal credit usage.
Browser Engine - Ideal for modern websites with JavaScript rendering. Handles React, Vue, and Angular applications.
CF Browser Rendering - Advanced rendering for complex SPAs and sites with anti-bot measures. Best for challenging scraping scenarios.
CSS Selectors for Precise Extraction
Use targetSelector and removeSelector to focus on the content you need:
{
"url": "https://example.com/article",
"targetSelector": "article, main, .content",
"removeSelector": "header, footer, nav, .ads, .sidebar, .comments"
}
This approach:
- Reduces noise in extracted content
- Saves tokens for LLM processing
- Improves content quality
- Focuses on relevant information
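A practical pattern is to keep per-site selector presets and merge them into the request body. The siteSelectors map and buildPayload helper below are purely illustrative; the right selectors depend on each site's markup:
// Hypothetical per-site presets; adjust selectors to each site's markup
const siteSelectors = {
  "docs.example.com": { targetSelector: "main, article", removeSelector: "nav, footer, .sidebar" },
  "blog.example.com": { targetSelector: ".post-content", removeSelector: ".comments, .related-posts, .ads" }
};

function buildPayload(url) {
  const host = new URL(url).hostname;
  return { url, returnFormat: "markdown", ...siteSelectors[host] };
}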
JavaScript Injection
Execute custom JavaScript before scraping to manipulate the DOM:
{
"url": "https://example.com",
"engine": "browser",
"injectPageScript": "document.querySelector('.cookie-banner')?.remove(); document.querySelector('.newsletter-popup')?.remove();"
}
Common use cases:
- Remove popups and overlays
- Trigger lazy-loaded content
- Manipulate page elements
- Extract data from JavaScript variables
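For example, to trigger lazy-loaded content before extraction, you could inject a script that scrolls to the bottom of the page. The exact script is site-dependent, so treat this as a sketch:
{
  "url": "https://example.com/gallery",
  "engine": "browser",
  "injectPageScript": "window.scrollTo(0, document.body.scrollHeight);"
}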
AI-Enhanced Features
Generated Alt Text
Enable AI-powered alt text generation for images:
{
"url": "https://example.com",
"withImagesSummary": true,
"withGeneratedAlt": true
}
This feature:
- Generates descriptive alt text for images
- Improves accessibility
- Enhances LLM understanding of visual content
- Useful for RAG applications
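In a RAG pipeline, the returned images map (image URL to generated alt text, as in the response example above) can be folded back into the extracted markdown so the model sees image descriptions inline. A minimal sketch, assuming result is the parsed response from the earlier example:
// Append image descriptions to the markdown before indexing or prompting
const { content, images } = result.data;
const imageNotes = Object.entries(images || {})
  .map(([url, alt]) => `![${alt}](${url})`)
  .join("\n");
const documentForLLM = `${content}\n\nImages:\n${imageNotes}`;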
ReaderLM-v2 Processing
Use advanced AI processing for better content extraction:
{
"url": "https://example.com",
"respondWith": "readerlm-v2"
}
Benefits:
- Superior HTML-to-Markdown conversion
- Better preservation of complex structures
- Enhanced table and list formatting
- Optimized for LLM consumption
Markdown Customization
Customize markdown output to match your preferences:
{
"url": "https://example.com",
"returnFormat": "markdown",
"mdHeadingStyle": "atx",
"mdBulletListMarker": "-",
"mdEmDelimiter": "*",
"mdStrongDelimiter": "**",
"mdLinkStyle": "inline"
}
Perfect for:
- Maintaining consistent documentation styles
- Matching existing markdown conventions
- Optimizing for specific markdown parsers
- Personal formatting preferences
Privacy and Compliance
Control privacy settings for GDPR compliance:
{
"url": "https://example.com",
"dnt": true,
"noCache": true,
"proxy": "eu"
}
Privacy features:
- dnt: Sends the Do Not Track header
- noCache: Bypasses the cache for fresh data
- proxy: Use EU-based proxies for GDPR compliance
- setCookie: Custom cookie settings for authenticated content
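For example, to scrape content behind a login you can pass a session cookie with setCookie. The cookie name, value, and string format below are assumptions; use whatever the target site actually expects:
{
  "url": "https://example.com/account/orders",
  "engine": "browser",
  "setCookie": "session_id=abc123; path=/",
  "noCache": true
}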
Use Cases
RAG Applications
Extract clean, token-optimized content for retrieval-augmented generation systems. Perfect for building knowledge bases and AI assistants.
Knowledge Base Building
Scrape documentation sites and build comprehensive knowledge graphs with link extraction and structured content.
Content Migration
Migrate content from legacy systems to modern platforms with preserved formatting and structure.
News Aggregation
Extract articles from news sites with clean formatting, removing ads and navigation clutter automatically.
Academic Research
Extract citations, references, and structured data from research papers and academic websites.
E-commerce Data
Extract product information, prices, reviews, and inventory status from e-commerce platforms.
Market Intelligence
Monitor competitor websites for pricing changes, product launches, and content updates.
SEO Analysis
Extract page content, metadata, and structure for SEO audits and competitive analysis.
Compliance Monitoring
Track legal and regulatory websites for updates and changes to compliance requirements.
Best Practices
1. Choose the Right Engine
Start with direct engine and upgrade to browser only when needed:
// Try the cheaper, faster direct engine first
let payload = { url: targetUrl, engine: 'direct' };
let result = await scrape(payload); // scrape() is a small wrapper around the /scrape request shown earlier

// If the extracted content is empty or incomplete, retry with full browser rendering
if (!result.data?.content) {
  payload.engine = 'browser';
  result = await scrape(payload);
}
2. Use Selectors Wisely
Target specific content to reduce noise and save tokens:
{
"targetSelector": "article, main, .post-content",
"removeSelector": "header, footer, nav, aside, .ads, .related-posts"
}
3. Optimize Token Budget
Set token limits for LLM optimization:
{
"tokenBudget": 10000,
"retainImages": "none"
}
4. Handle Errors Gracefully
Always implement retry logic with exponential backoff:
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapeWithRetry(url, options, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      // options carries the Authorization header and JSON body shown earlier
      const response = await fetch(url, options);
      if (response.ok) return await response.json();
      if (response.status === 429) {
        // Rate limited: wait with exponential backoff, then retry
        await sleep(Math.pow(2, i) * 1000);
        continue;
      }
      throw new Error(`HTTP ${response.status}`);
    } catch (error) {
      if (i === maxRetries - 1) throw error;
    }
  }
  throw new Error('Rate limited: retries exhausted');
}
5. Monitor Credit Usage
Track credits to avoid unexpected charges:
// Header values are strings (or null), so convert before comparing
const creditsConsumed = Number(response.headers.get('X-Credits-Consumed'));
const creditsRemaining = Number(response.headers.get('X-Credits-Remaining'));
if (creditsRemaining < 100) {
  console.warn('Low credits! Consider upgrading.');
}