Debugging Dynamic Content Extraction: waitUntil Options in Browser Rendering Services
When building web scraping solutions, one of the trickiest challenges is handling modern websites that load content dynamically. Today I'll share a production bug I encountered and the simple fix that solved it.
The Setup
I'm working on Zin Flow, a web-to-EPUB converter that extracts article content from web pages. Our backend uses Cloudflare's Browser Rendering service for server-side rendering and content extraction.
The Bug
Symptoms:
- Local development: Full article content extracted ✅
- Production: Only HTML skeleton returned ❌
- Specific websites affected, others worked fine
Initial Investigation:
// Our original implementation
const page = await browser.newPage();
await page.goto(url, {
  waitUntil: 'domcontentloaded'
});
const content = await page.content();
This worked locally but failed in production for sites using lazy loading.
Root Cause Analysis
Modern websites, especially those built with React, Vue, or other SPA frameworks, use patterns like:
// React component that fetches its content after mounting
// (fetchArticleContent, loadDynamicContent, ActualContent, and Skeleton are placeholders)
import React, { useState, useEffect, useRef } from 'react';

function ArticleContent() {
  const [content, setContent] = useState('');
  useEffect(() => {
    // Content loads only after the component mounts
    fetchArticleContent()
      .then(data => setContent(data));
  }, []);
  return <div>{content || 'Loading...'}</div>;
}

// Or React.lazy for code splitting
const Article = React.lazy(() => import('./Article'));

// Intersection Observer: render content only once it scrolls into view
const LazyArticle = () => {
  const [isVisible, setIsVisible] = useState(false);
  const ref = useRef(null);
  useEffect(() => {
    const observer = new IntersectionObserver(([entry]) => {
      if (entry.isIntersecting) {
        setIsVisible(true);
        // Load content when the element becomes visible
        loadDynamicContent();
      }
    });
    if (ref.current) observer.observe(ref.current);
    // Stop observing when the component unmounts
    return () => observer.disconnect();
  }, []);
  return <div ref={ref}>{isVisible ? <ActualContent /> : <Skeleton />}</div>;
};
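The issue: domcontentloaded fires as soon as the initial HTML has been parsed, before these React effects, async fetches, and dynamic imports have finished. Calling page.content() at that point captures only the skeleton.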
waitUntil Options Explained
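Browser rendering services that follow the Puppeteer API offer several waitUntil strategies:
| Option | Behavior | Use Case |
|---|---|---|
| load | Waits for the load event (stylesheets, images, and subframes loaded) | Basic static sites |
| domcontentloaded | Waits for the DOMContentLoaded event (initial HTML parsed) | Pages whose content is already in the initial HTML |
| networkidle0 | No network connections for at least 500 ms | Heavy dynamic content |
| networkidle2 | No more than 2 network connections for at least 500 ms | Balanced approach |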
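To make the two network-idle rules concrete, here is a simplified model of them. It is only a sketch: the request timeline is invented for illustration, and real browsers track connections in more detail than this. The function name networkIdleTime and the { start, end } shape are assumptions, not part of any library.

```javascript
// Simplified model of the network-idle rule: find the first moment at which
// the number of in-flight requests has stayed <= maxInflight for idleMs.
// `requests` is a hypothetical list of { start, end } times in milliseconds.
function networkIdleTime(requests, maxInflight, idleMs = 500) {
  // Turn each request into a +1 event at its start and a -1 event at its end.
  const events = [];
  for (const { start, end } of requests) {
    events.push({ t: start, delta: 1 });
    events.push({ t: end, delta: -1 });
  }
  // Sort by time; at equal times, process request ends before request starts.
  events.sort((a, b) => a.t - b.t || a.delta - b.delta);

  let inflight = 0;
  let quietSince = 0; // navigation begins with nothing in flight
  for (const { t, delta } of events) {
    // If a full quiet window elapsed before this event, idle was reached.
    if (quietSince !== null && t - quietSince >= idleMs) {
      return quietSince + idleMs;
    }
    inflight += delta;
    if (inflight > maxInflight) {
      quietSince = null; // too busy, the quiet window is broken
    } else if (quietSince === null) {
      quietSince = t; // just dropped back under the limit
    }
  }
  // All requests have finished, so the last quiet window runs to completion.
  return quietSince + idleMs;
}

// Invented timeline: document, script, stylesheet, API call, long-lived beacon.
const timeline = [
  { start: 0, end: 200 },
  { start: 100, end: 400 },
  { start: 150, end: 600 },
  { start: 300, end: 900 },
  { start: 500, end: 5000 },
];

console.log('networkidle2 at', networkIdleTime(timeline, 2), 'ms');
console.log('networkidle0 at', networkIdleTime(timeline, 0), 'ms');
```

With this timeline the model reports 1100 ms for networkidle2 and 5500 ms for networkidle0: a single long-lived request holds networkidle0 back but does not block networkidle2.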
The Solution
// Fixed implementation
await page.goto(url, {
  waitUntil: 'networkidle2',
  timeout: 30000 // Explicit 30 s budget for dynamic content (stock Puppeteer's default is also 30 s)
});
Why networkidle2 works:
- Waits for most network requests to settle
- Balances completeness with performance
- Handles lazy loading without excessive waiting
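In a service, the fix usually lives inside a small helper that also cleans up the page. The following is a minimal sketch, assuming a Puppeteer-compatible browser object; the name extractHtml and its option defaults are illustrative, not Zin Flow's actual code.

```javascript
// Minimal sketch: open a page, wait for the network to settle, return the HTML.
// `browser` is assumed to expose the Puppeteer-style newPage() method.
async function extractHtml(browser, url, { waitUntil = 'networkidle2', timeout = 30000 } = {}) {
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil, timeout });
    return await page.content();
  } finally {
    // Always release the page, even when navigation throws.
    await page.close();
  }
}

// Usage (inside an async function, with a real browser instance):
//   const html = await extractHtml(browser, 'https://example.com/article');
```

Closing the page in a finally block matters more with network-aware waiting, because timeouts become a normal failure mode rather than a rare one.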
Performance Considerations
The tradeoff matrix:
// Performance vs Completeness
const strategies = {
  domcontentloaded: { speed: 'fast', completeness: 'low' },
  networkidle2: { speed: 'medium', completeness: 'high' },
  networkidle0: { speed: 'slow', completeness: 'highest' }
};
For content extraction, networkidle2 hits the sweet spot.
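One way to cap the cost of waiting is to treat a navigation timeout as "good enough" and extract whatever has rendered so far. This is a sketch under an assumption: that timeouts surface as an error whose name is 'TimeoutError', as in stock Puppeteer. The helper name gotoWithFallback and the 15 s budget are hypothetical.

```javascript
// Sketch: wait for networkidle2, but tolerate a timeout instead of failing.
async function gotoWithFallback(page, url, { timeout = 15000 } = {}) {
  try {
    await page.goto(url, { waitUntil: 'networkidle2', timeout });
    return { settled: true };
  } catch (err) {
    // Assumption: navigation timeouts are reported with name 'TimeoutError'.
    if (err && err.name === 'TimeoutError') {
      return { settled: false }; // page never went idle, caller may still read it
    }
    throw err; // anything else (DNS failure, crash) is a real error
  }
}

// Usage (inside an async function):
//   const { settled } = await gotoWithFallback(page, url);
//   const html = await page.content(); // may be partial when settled is false
```

Recording the settled flag alongside the extracted content makes it easy to see later how often pages never reach idle.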
Advanced Patterns
For more complex scenarios, combine strategies:
// Wait for specific content indicators
await page.goto(url, { waitUntil: 'domcontentloaded' });
// Then wait for specific elements
await page.waitForSelector('.article-content', {
  timeout: 10000
});
// Or wait for custom conditions
await page.waitForFunction(
  () => document.querySelector('.article-content')?.children.length > 0
);
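These steps can be combined into one helper that passes the selector into the page as an argument rather than hard-coding it. This is a sketch only: the name waitForArticle, the default selector, and the 200-character threshold are assumptions chosen for illustration.

```javascript
// Sketch: navigate quickly, then wait until the article container holds real text.
async function waitForArticle(page, url, selector = '.article-content', minChars = 200) {
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  await page.waitForSelector(selector, { timeout: 10000 });
  // Extra arguments after the options object are forwarded to the page function.
  await page.waitForFunction(
    (sel, min) => {
      const el = document.querySelector(sel);
      return el !== null && el.innerText.trim().length >= min;
    },
    { timeout: 10000 },
    selector,
    minChars
  );
  return page.content();
}

// Usage (inside an async function):
//   const html = await waitForArticle(page, url, 'article', 500);
```

Checking text length instead of child count avoids returning early when the container is populated with placeholder elements.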
Environment Differences
Why local worked but production didn't:
- Network latency: Local requests are faster, so content often finishes loading before extraction runs
- Resource constraints: The production environment may load and render pages more slowly
- Geographic differences: CDN behavior varies by location
- Browser differences: Timing varies slightly between browser builds
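The latency gap can be approximated locally by throttling the network before navigating. This sketch assumes a local Puppeteer build that exposes page.emulateNetworkConditions; the SLOW_NETWORK numbers and the gotoThrottled name are invented for illustration and do not reflect measured production values.

```javascript
// Hypothetical profile, roughly a slow mobile link. Units follow Puppeteer's
// NetworkConditions: bytes per second for throughput, milliseconds for latency.
const SLOW_NETWORK = {
  download: (500 * 1000) / 8, // about 500 kbit/s
  upload: (500 * 1000) / 8,
  latency: 400,
};

// Sketch: apply the throttle, navigate with the given strategy, return the HTML.
async function gotoThrottled(page, url, waitUntil) {
  await page.emulateNetworkConditions(SLOW_NETWORK);
  await page.goto(url, { waitUntil });
  return page.content();
}

// Usage (inside an async function):
//   const html = await gotoThrottled(page, url, 'domcontentloaded');
```

If the skeleton-only result shows up locally under throttling, that points to timing rather than to a difference in the rendering service itself.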
Debugging Tips
// Add logging to understand timing
console.time('page-load');
await page.goto(url, { waitUntil: 'networkidle2' });
console.timeEnd('page-load');
// Check what content is actually loaded
const contentLength = await page.evaluate(
  () => document.body.innerText.length
);
console.log(`Content length: ${contentLength}`);
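The same two measurements can be collected per strategy to see the tradeoff for a specific site. This is a rough sketch assuming a Puppeteer-compatible browser object; compareStrategies is a hypothetical helper name.

```javascript
// Sketch: load the same URL once per strategy and record time and text length.
async function compareStrategies(
  browser,
  url,
  strategies = ['domcontentloaded', 'networkidle2', 'networkidle0']
) {
  const results = [];
  for (const waitUntil of strategies) {
    // Use a fresh page per strategy so earlier loads do not skew later ones.
    const page = await browser.newPage();
    try {
      const started = Date.now();
      await page.goto(url, { waitUntil });
      const elapsedMs = Date.now() - started;
      const textLength = await page.evaluate(() => document.body.innerText.length);
      results.push({ waitUntil, elapsedMs, textLength });
    } finally {
      await page.close();
    }
  }
  return results;
}

// Usage (inside an async function):
//   console.table(await compareStrategies(browser, url));
```

A large jump in textLength between domcontentloaded and networkidle2 is a strong hint that the site renders its article on the client.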
Results
After the fix:
- 📈 Content extraction success rate: 85% → 98%
- ⏱️ Average processing time: +2.3s (acceptable tradeoff)
- 🐛 User-reported empty EPUB issues: Eliminated
Key Takeaways
- Choose waitUntil strategy based on target sites - Modern sites need network-aware waiting
- Test in production-like environments - Local success ≠ production success
- Monitor extraction quality - Track content length and completeness metrics
- Consider timeouts - Dynamic content needs longer timeouts
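Monitoring extraction quality can start with a cheap heuristic that flags skeleton-only results. This is a sketch: the regex-based tag stripping is deliberately crude, and the 500-character threshold and both function names are assumptions to tune per project.

```javascript
// Rough estimate of visible text: drop script/style bodies, then all tags.
function visibleTextLength(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim().length;
}

// Flag extractions that contain too little text to be a real article.
function looksLikeSkeleton(html, minChars = 500) {
  return visibleTextLength(html) < minChars;
}

const shell = '<div id="root"></div><script>window.__DATA__ = {}</script>';
console.log(looksLikeSkeleton(shell)); // true: the shell has no visible text
```

Logging this flag per URL gives a simple success-rate metric without needing to inspect each EPUB by hand.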
Tools for Testing
// Test locally with browser rendering options
// Monitor network activity in DevTools
chrome://inspect/#devices
// For Cloudflare Browser Rendering, test with:
curl -X POST "https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/browser-rendering/content" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://target-site.com", "gotoOptions": {"waitUntil": "networkidle2"}}'
For fellow developers building content extraction tools: don't underestimate the impact of waitUntil options. The right choice can mean the difference between extracting meaningful content and extracting empty shells.
Zin Flow is available on iOS and Mac for converting web articles to EPUB format.
Discussion
What waitUntil strategies have you found most effective for your scraping projects? Any other gotchas with dynamic content extraction?