AI Web Scraper - Powered by Crawl4AI
A blazing-fast AI web scraper powered by Crawl4AI. Perfect for LLMs, AI agents, AI automation, model training, sentiment analysis, and content generation. Supports deep crawling, multiple extraction strategies and flexible output (Markdown/JSON). Seamlessly integrates with Make.com, n8n, and Zapier.
AI Web Scraper
Do you need reliable data for your AI agents, LLM pipelines, or training workflows? The AI Web Scraper Actor is your key to fast, flexible, and AI-friendly web extraction on Apify. Under the hood, it relies on the open-source Crawl4AI engine to handle anything from simple single-page scrapes to deep multi-link traversals (BFS/DFS/BestFirst). Whether you want clean markdown, JSON extraction, or LLM summarization, just specify your desired strategy via the Actor's input UI, and you're set. This Actor integrates well with Make.com, n8n, and Zapier for AI automation.
Below is an overview of each setting you'll see in the Apify interface and how it affects your crawls.
Quick How-To
- Start with URLs: At a minimum, provide `startUrls` in the UI (or JSON input). These are the pages you want to scrape.
- Pick a Crawler & Extraction Style: Choose between various crawl strategies (e.g., BFS or DFS) and extraction methods (simple markdown, LLM-based, JSON CSS, etc.). You can also enable content filtering or deeper link exploration.
- Review the Output: Once the Actor finishes, your results will appear in the Apify Dataset as structured JSON, markdown, or whichever format you chose.
Input Fields Explained
These fields appear in the Actor's input UI. Customize them to match your use case.
1. startUrls (Required)
List of pages to scrape. For each entry, just provide "url".
Example:
{"startUrls": [{ "url": "https://example.com" }]}
2. browserConfig (Optional)
Configure Playwright's browser behavior: headless mode, custom user agent, viewport size, etc.
- `browser_type`: "chromium", "firefox", or "webkit"
- `headless`: Boolean to run in headless mode
- `verbose_logging`: Extra debug logs
- `ignore_https_errors`: Accept invalid certs
- `user_agent`: e.g. "random" or a custom string
- `proxy`: Proxy server URL
- `viewport_width` / `viewport_height`: Window size
- `accept_downloads`: Whether downloads are allowed
- `extra_headers`: Additional request headers
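For instance, a minimal sketch of an input using `browserConfig` (field names come from the list above; the specific values are illustrative, not tuned defaults):

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "browserConfig": {
    "browser_type": "chromium",
    "headless": true,
    "user_agent": "random",
    "viewport_width": 1280,
    "viewport_height": 800,
    "ignore_https_errors": false
  }
}
```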
3. crawlerConfig (Optional)
Core crawling settings: time limits, caching, JavaScript hooks, or multi-page concurrency.
- `cache_mode`: "BYPASS" (no cache), "ENABLED", etc.
- `page_timeout`: Milliseconds to wait for page loads
- `simulate_user`: Stealth by mimicking user actions
- `remove_overlay_elements`: Attempt to remove popups
- `delay_before_return_html`: Extra wait before final extraction
- `wait_for`: Wait time or wait condition
- `screenshot` / `pdf`: Capture screenshot or PDF
- `enable_rate_limiting`: Rate limit large URL lists
- `memory_threshold_percent`: Pause if memory is too high
- `word_count_threshold`: Discard short text blocks
- `css_selector`, `excluded_tags`, `excluded_selector`: Further refine or skip sections of the DOM
- `only_text`: Keep plain text only
- `prettify`: Attempt to clean up HTML
- `keep_data_attributes`: Keep or drop data-* attributes
- `remove_forms`: Strip `<form>` elements
- `bypass_cache` / `disable_cache` / `no_cache_read` / `no_cache_write`: Fine-grained caching controls
- `wait_until`: e.g. "domcontentloaded" or "networkidle"
- `wait_for_images`: Wait for images to fully load
- `check_robots_txt`: Respect robots.txt?
- `mean_delay`, `max_range`: Introduce a random delay range
- `js_code`: Custom JS to run on each page
- `js_only`: Reuse the same page context without re-navigation
- `ignore_body_visibility`: Include hidden elements
- `scan_full_page`: Scroll from top to bottom for lazy loading
- `scroll_delay`: Delay between scroll steps
- `process_iframes`: Also parse iframes
- `override_navigator`: Additional stealth tweak
- `magic`: Enable multiple advanced anti-bot tricks
- `adjust_viewport_to_content`: Resize viewport to fit content
- `screenshot_wait_for`: Wait time before taking a screenshot
- `screenshot_height_threshold`: Max doc height to screenshot
- `image_description_min_word_threshold`: Filter out images with minimal alt text
- `image_score_threshold`: Remove lower-score images
- `exclude_external_images`: No external images
- `exclude_social_media_domains`, `exclude_domains`: Avoid these domains entirely
- `exclude_external_links`, `exclude_social_media_links`: Strip external or social media links
- `verbose`: Extra logs
- `log_console`: Show browser console logs?
- `stream`: Stream results as they come in, or wait until done
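A hedged sketch of a common combination (all field names from the list above; the timeout and threshold values are illustrative guesses, not recommendations):

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "crawlerConfig": {
    "cache_mode": "BYPASS",
    "page_timeout": 60000,
    "wait_until": "domcontentloaded",
    "remove_overlay_elements": true,
    "scan_full_page": true,
    "word_count_threshold": 10
  }
}
```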
4. deepCrawlConfig (Optional)
When you select BFS, DFS, or BestFirst crawling, this config guides link exploration.
- `max_pages`: Stop after crawling this many pages
- `max_depth`: Depth of link-following
- `include_external`: Follow off-domain links?
- `score_threshold`: Filter out low-score links (BestFirst)
- `filter_chain`: Extra link filter rules
- `url_scorer`: If you want a custom approach to scoring discovered URLs
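For example, a BFS deep crawl sketch limited to two levels and fifty pages (values are illustrative; `BFSDeepCrawlStrategy` is one of the strategies listed under `crawlStrategy` below):

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "crawlStrategy": "BFSDeepCrawlStrategy",
  "deepCrawlConfig": {
    "max_depth": 2,
    "max_pages": 50,
    "include_external": false
  }
}
```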
5. markdownConfig (Optional)
For HTML-to-Markdown conversions.
- `ignore_links`: Skip anchor links
- `ignore_images`: Omit markdown images
- `escape_html`: Escape raw HTML, e.g. turn `<div>` into `&lt;div&gt;`
- `skip_internal_links`: Remove same-page anchors
- `include_sup_sub`: Preserve `<sup>`/`<sub>` text
- `citations`: Put footnotes at bottom of file
- `body_width`: Wrap lines at N chars
- `fit_markdown`: Use advanced "fit" mode if also using a filter
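A minimal sketch for lean, link-free markdown (field names from the list above; the wrap width is an arbitrary example):

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "markdownConfig": {
    "ignore_links": true,
    "ignore_images": true,
    "body_width": 80,
    "citations": true
  }
}
```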
6. contentFilterConfig (Optional)
Prune out nav bars, sidebars, or extra text using "pruning", "bm25", or a second LLM filter.
- `type`: e.g. "pruning", "bm25"
- `threshold`: Score cutoff
- `min_word_threshold`: Minimum words to keep
- `bm25_threshold`: BM25 filter param
- `apply_llm_filter`: If `true`, do a second pass with an LLM
- `semantic_filter`: Keep only text about a certain topic
- `word_count_threshold`: Another word threshold
- `sim_threshold`, `max_dist`, `top_k`, `linkage_method`: For advanced clustering
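For instance, a pruning-filter sketch (the threshold value is illustrative; tune it against your own pages):

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "contentFilterConfig": {
    "type": "pruning",
    "threshold": 0.5,
    "min_word_threshold": 5
  }
}
```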
7. userAgentConfig (Optional)
Rotate or fix your user agent.
- `user_agent_mode`: "random" or "fixed"
- `device_type`: "desktop" or "mobile"
- `browser_type`: e.g. "chrome"
- `num_browsers`: If rotating among multiple agents
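A sketch of rotating desktop user agents (all field names and values taken from the list above):

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "userAgentConfig": {
    "user_agent_mode": "random",
    "device_type": "desktop",
    "browser_type": "chrome"
  }
}
```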
8. llmConfig (Optional)
For LLM-based extraction or filtering.
- `provider`: e.g. "openai/gpt-4", "groq/deepseek-r1-distill-llama-70b"
- `api_token`: Model's API key
- `instruction`: Prompt the LLM about how to parse or summarize
- `base_url`: For custom endpoints
- `chunk_token_threshold`: Token threshold above which big pages get chunked
- `apply_chunking`: Boolean for chunking
- `input_format`: "markdown" or "html"
- `temperature`, `max_tokens`: Standard LLM config
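A hedged sketch of LLM-based summarization (the URL, prompt, and `YOUR_API_KEY` placeholder are hypothetical; the provider string comes from the examples above):

```json
{
  "startUrls": [{ "url": "https://example.com/blog/post" }],
  "extractionStrategy": "LLMExtractionStrategy",
  "llmConfig": {
    "provider": "openai/gpt-4",
    "api_token": "YOUR_API_KEY",
    "instruction": "Summarize the page in three bullet points.",
    "input_format": "markdown",
    "temperature": 0.2
  }
}
```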
9. session_id (Optional)
Provide a session ID to reuse browser context across multiple runs (logins, multi-step flows, etc.).
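For example (the ID string is arbitrary, anything you choose to reuse across runs):

```json
{
  "startUrls": [{ "url": "https://example.com/account" }],
  "session_id": "my-login-session"
}
```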
10. extractionStrategy (Optional)
Pick one:
- `SimpleExtractionStrategy`: Simple HTML-to-Markdown
- `LLMExtractionStrategy`: Let an LLM parse or summarize
- `JsonCssExtractionStrategy` / `JsonXPathExtractionStrategy`: Provide a schema to produce structured JSON
11. crawlStrategy (Optional)
- `SimpleCrawlStrategy`: Just the given start URLs
- `BFSDeepCrawlStrategy`: Breadth-first approach
- `DFSDeepCrawlStrategy`: Depth-first approach
- `BestFirstCrawlingStrategy`: Score links, pick the best first
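A best-first sketch that pairs with `deepCrawlConfig` (the page limit and score cutoff are illustrative; `score_threshold` applies to BestFirst per the config list above):

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "crawlStrategy": "BestFirstCrawlingStrategy",
  "deepCrawlConfig": {
    "max_pages": 25,
    "score_threshold": 0.4
  }
}
```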
12. extractionSchema (Optional)
If using CSS/XPath extraction.
- `name`: Your extraction scheme name
- `baseSelector`: Parent selector for repeated elements
- `fields`: Each with `name`, `selector`, `type`, and optional `attribute`
Example:
"extractionSchema": {"name": "Custom Extraction","baseSelector": "div.article","fields": [{ "name": "title", "selector": "h1", "type": "text" },{ "name": "link", "selector": "a", "type": "attribute", "attribute": "href" }]}
Usage Examples
Minimal
{"startUrls": [{ "url": "https://example.com" }]}
Scrapes a single page in headless mode with standard markdown output.
JSON CSS Extraction
{"startUrls": [{ "url": "https://news.ycombinator.com/" }],"extractionStrategy": "JsonCssExtractionStrategy","extractionSchema": {"name": "HackerNews","baseSelector": "tr.athing","fields": [{"name": "title","selector": ".titleline a","type": "text"},{"name": "link","selector": ".titleline a","type": "attribute","attribute": "href"}]}}
Generates a JSON array, each object containing "title" and "link".
Pro Tips
- Deep crawling: If you want BFS or DFS, set `crawlStrategy` to "BFSDeepCrawlStrategy" or "DFSDeepCrawlStrategy" and configure `deepCrawlConfig`.
- Content filtering: Combine `contentFilterConfig` with `extractionStrategy` for maximum clarity and minimal noise.
- LLM-based: Choose "LLMExtractionStrategy" plus `llmConfig` for advanced summarization or structured data. Great for building AI pipelines (see the combined sketch below).
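Putting those tips together, a hedged sketch of a deep crawl with pruning-based filtering and lean markdown output (all field names from the sections above; the numeric values are illustrative, not tuned defaults):

```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "crawlStrategy": "BFSDeepCrawlStrategy",
  "deepCrawlConfig": { "max_depth": 2, "max_pages": 30 },
  "contentFilterConfig": { "type": "pruning", "threshold": 0.5 },
  "markdownConfig": { "ignore_links": true }
}
```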
Thanks for trying out the AI Web Scraper. Enjoy harnessing rich, clean data for your Apify-based AI solutions!