Yi Wan

Debugging Dynamic Content Extraction: waitUntil Options in Browser Rendering Services

When building web scraping solutions, one of the trickiest challenges is handling modern websites that load content dynamically. Today I'll share a production bug I encountered and the simple fix that solved it.

The Setup

I'm working on Zin Flow, a web-to-EPUB converter that extracts article content from web pages. Our backend uses Cloudflare's Browser Rendering service for server-side rendering and content extraction.

The Bug

Symptoms:

  • Local development: Full article content extracted ✅
  • Production: Only HTML skeleton returned ❌
  • Specific websites affected, others worked fine

Initial Investigation:

// Our original implementation
const page = await browser.newPage();
await page.goto(url, { 
  waitUntil: 'domcontentloaded' 
});
const content = await page.content();

This worked locally but failed in production for sites using lazy loading.

Root Cause Analysis

Modern websites, especially those built with React, Vue, or other SPA frameworks, use patterns like:

import React, { useState, useEffect, useRef } from 'react';

// React component with lazy loading
function ArticleContent() {
  const [content, setContent] = useState('');

  useEffect(() => {
    // Content loads after component mounts
    fetchArticleContent()
      .then(data => setContent(data));
  }, []);

  return <div>{content || 'Loading...'}</div>;
}

// Or React.lazy for code splitting
const Article = React.lazy(() => import('./Article'));

// Intersection Observer: render content only once it scrolls into view
const LazyArticle = () => {
  const [isVisible, setIsVisible] = useState(false);
  const ref = useRef(null);

  useEffect(() => {
    const observer = new IntersectionObserver(([entry]) => {
      if (entry.isIntersecting) {
        setIsVisible(true);
        // Load content when element becomes visible
        loadDynamicContent();
        observer.disconnect(); // Fire once, then stop observing
      }
    });

    if (ref.current) observer.observe(ref.current);
    return () => observer.disconnect(); // Clean up on unmount
  }, []);

  return <div ref={ref}>{isVisible ? <ActualContent /> : <Skeleton />}</div>;
};

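The issue: domcontentloaded fires as soon as the initial HTML has been parsed, before the fetches started by these effects and the dynamic imports have completed, so page.content() captures the loading state instead of the article.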
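One way to catch this failure automatically is to measure how much visible text the returned HTML contains. The sketch below is a rough heuristic, not what Zin Flow actually runs: the helper names and the 200-character threshold are assumptions for illustration, and the regex-based tag stripping is only good enough for a sanity check.

```javascript
// Hypothetical helpers: flag HTML that looks like an unrendered SPA shell.
function visibleTextLength(html) {
  return html
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ') // drop non-visible blocks
    .replace(/<[^>]+>/g, ' ') // drop the remaining tags
    .replace(/\s+/g, ' ')     // collapse whitespace
    .trim().length;
}

// minChars is an arbitrary threshold; tune it for your own content.
function looksLikeSkeleton(html, minChars = 200) {
  return visibleTextLength(html) < minChars;
}
```

Running looksLikeSkeleton(await page.content()) right after extraction gives a cheap signal that the page was captured before its content arrived.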

waitUntil Options Explained

Browser rendering services offer several waitUntil strategies:

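Option            | Navigation is considered finished when              | Use case
------------------|-----------------------------------------------------|--------------------------------
load              | The load event fires (images, stylesheets loaded)   | Basic static sites
domcontentloaded  | DOMContentLoaded fires (initial HTML parsed)        | Server-rendered HTML, fastest
networkidle0      | No network connections for at least 500 ms          | Heavy dynamic content
networkidle2      | At most 2 network connections for at least 500 ms   | Balanced approach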
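To make the difference between networkidle0 and networkidle2 concrete, here is a toy model of the idle rule. It is my own simplification, not Puppeteer's implementation: requests are plain { start, end } timestamps in milliseconds, time 0 is when idle tracking begins, and networkIdleTime is a hypothetical name.

```javascript
// Toy model: the earliest time at which no more than maxInflight requests
// have been open for idleMs in a row. Returns Infinity if that never happens.
function networkIdleTime(requests, maxInflight, idleMs = 500) {
  const events = [];
  for (const { start, end } of requests) {
    events.push({ time: start, delta: +1 });
    events.push({ time: end, delta: -1 });
  }
  // Sort by time; at equal times, process request ends before starts.
  events.sort((a, b) => (a.time - b.time) || (a.delta - b.delta));

  let inflight = 0;
  let quietSince = 0; // nothing is in flight before the first event
  for (const { time, delta } of events) {
    if (quietSince !== null && time - quietSince >= idleMs) {
      return quietSince + idleMs; // the quiet window elapsed before this event
    }
    inflight += delta;
    if (inflight > maxInflight) quietSince = null;
    else if (quietSince === null) quietSince = time;
  }
  return quietSince + idleMs;
}

// Document, JS bundle, then three parallel API calls.
const timeline = [
  { start: 0, end: 100 },
  { start: 100, end: 300 },
  { start: 320, end: 700 },
  { start: 320, end: 800 },
  { start: 320, end: 1500 },
];
console.log(networkIdleTime(timeline, 2)); // 1200
console.log(networkIdleTime(timeline, 0)); // 2000
```

In this timeline networkidle2 is satisfied while the slowest request is still open, which is why it is faster, and also why it can occasionally return before the last piece of content lands.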

The Solution

// Fixed implementation
await page.goto(url, { 
  waitUntil: 'networkidle2',
  timeout: 30000  // Explicit navigation timeout in ms (30 s is also Puppeteer's default; raise it for slow sites)
});

Why networkidle2 works:

  • Waits for most network requests to settle
  • Balances completeness with performance
  • Handles lazy loading without excessive waiting
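Some pages never go quiet, and then networkidle2 ends in a timeout instead of content. A minimal sketch of one way to degrade gracefully, assuming a Puppeteer-style page object whose timeout errors have the name 'TimeoutError'; extractHtml and the settled flag are hypothetical names, not part of our production code.

```javascript
// Try networkidle2; if the network never settles, keep whatever has rendered.
async function extractHtml(page, url, timeoutMs = 30000) {
  try {
    await page.goto(url, { waitUntil: 'networkidle2', timeout: timeoutMs });
    return { html: await page.content(), settled: true };
  } catch (err) {
    // Only swallow navigation timeouts; DNS errors and the like still propagate.
    if (err.name !== 'TimeoutError') throw err;
    return { html: await page.content(), settled: false };
  }
}
```

The settled flag lets the caller decide whether partially loaded HTML is good enough or whether the job should be retried.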

Performance Considerations

The tradeoff matrix:

// Performance vs Completeness
const strategies = {
  'domcontentloaded': { speed: 'fast', completeness: 'low' },
  'networkidle2': { speed: 'medium', completeness: 'high' },
  'networkidle0': { speed: 'slow', completeness: 'highest' }
};

For content extraction, networkidle2 hits the sweet spot.
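If most of your target sites are server-rendered, you can avoid paying the networkidle2 cost on every page by escalating only when the result looks thin. This is a sketch under assumptions: extractWithEscalation and the 500-character threshold are made up for illustration, and re-navigating costs a second page load, so it only pays off when escalation is rare.

```javascript
// Start with the fastest strategy and escalate while the page looks empty.
async function extractWithEscalation(page, url, minChars = 500) {
  const strategies = ['domcontentloaded', 'networkidle2', 'networkidle0'];
  let result;
  for (const waitUntil of strategies) {
    await page.goto(url, { waitUntil });
    const textLength = await page.evaluate(
      () => document.body.innerText.length
    );
    result = { waitUntil, textLength, html: await page.content() };
    if (textLength >= minChars) break; // enough visible text, stop escalating
  }
  return result; // the last attempt, even if it is still below minChars
}
```

Logging result.waitUntil per site also tells you which domains need the slower strategies.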

Advanced Patterns

For more complex scenarios, combine strategies:

// Wait for specific content indicators
await page.goto(url, { waitUntil: 'domcontentloaded' });

// Then wait for specific elements
await page.waitForSelector('.article-content', { 
  timeout: 10000 
});

// Or wait for custom conditions
await page.waitForFunction(
  () => document.querySelector('.article-content')?.children.length > 0
);
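A generic extractor rarely knows the one right selector for every site. A small sketch that waits for whichever of several candidate selectors shows up first; waitForAnySelector and the selector list in the usage note are hypothetical, and the helper relies only on page.waitForSelector plus Promise.any.

```javascript
// Resolve with the first selector that appears, or null if none does in time.
async function waitForAnySelector(page, selectors, timeoutMs = 10000) {
  try {
    return await Promise.any(
      selectors.map(async (selector) => {
        await page.waitForSelector(selector, { timeout: timeoutMs });
        return selector;
      })
    );
  } catch {
    return null; // every wait timed out or failed
  }
}
```

For example, waitForAnySelector(page, ['.article-content', 'article', 'main']) returns the matching selector so the extraction step knows where to look. The waits that lose the race keep running until their own timeout, which is harmless but worth knowing about.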

Environment Differences

Why it worked locally but not in production:

  1. Network latency: Local requests complete faster, so content often arrives before extraction runs
  2. Resource constraints: The production environment may load pages more slowly
  3. Geographic differences: CDN behavior varies by location
  4. Browser differences: Slight variations in timing
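To reproduce the production behavior on a fast local machine, you can slow the network down on purpose. The sketch below assumes a local Puppeteer version that provides page.emulateNetworkConditions; the throughput and latency numbers are arbitrary, and extractUnderSlowNetwork is a hypothetical helper.

```javascript
// Throttle the connection, then extract the same way the buggy version did.
async function extractUnderSlowNetwork(page, url) {
  await page.emulateNetworkConditions({
    download: 50 * 1024, // bytes per second
    upload: 20 * 1024,   // bytes per second
    latency: 400,        // extra round-trip delay in ms
  });
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  // With a slow network this is usually much smaller than on a fast one.
  return page.evaluate(() => document.body.innerText.length);
}
```

If the returned length collapses under throttling, the page depends on requests that finish after domcontentloaded.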

Debugging Tips

// Add logging to understand timing
console.time('page-load');
await page.goto(url, { waitUntil: 'networkidle2' });
console.timeEnd('page-load');

// Check what content is actually loaded
const contentLength = await page.evaluate(
  () => document.body.innerText.length
);
console.log(`Content length: ${contentLength}`);
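When a page is slow to settle, it helps to know which requests are still open. A minimal sketch that tracks them through the page's request, requestfinished, and requestfailed events; trackPendingRequests is a hypothetical helper name.

```javascript
// Keep a live set of in-flight requests so they can be listed on demand.
function trackPendingRequests(page) {
  const pending = new Set();
  const done = (req) => pending.delete(req);
  page.on('request', (req) => pending.add(req));
  page.on('requestfinished', done);
  page.on('requestfailed', done);
  return {
    pendingUrls: () => [...pending].map((req) => req.url()),
  };
}
```

Attach the tracker before calling page.goto and log tracker.pendingUrls() when navigation times out; long-polling and analytics endpoints tend to show up there.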

Results

After the fix:

  • 📈 Content extraction success rate: 85% → 98%
  • ⏱️ Average processing time: +2.3s (acceptable tradeoff)
  • 🐛 User-reported empty EPUB issues: Eliminated

Key Takeaways

  1. Choose waitUntil strategy based on target sites - Modern sites need network-aware waiting
  2. Test in production-like environments - Local success ≠ production success
  3. Monitor extraction quality - Track content length and completeness metrics
  4. Consider timeouts - Dynamic content needs longer timeouts
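For the monitoring takeaway, even a tiny summary over recent extractions makes regressions visible. This is an illustrative sketch, not our actual metrics code: the record shape ({ textLength }), the function name, and the 500-character success threshold are assumptions.

```javascript
// Summarize extraction quality from a batch of { textLength } records.
function summarizeExtractions(records, minChars = 500) {
  const lengths = records.map((r) => r.textLength).sort((a, b) => a - b);
  const total = lengths.length;
  const succeeded = lengths.filter((n) => n >= minChars).length;
  const mid = Math.floor(total / 2);
  let medianLength = 0;
  if (total > 0) {
    medianLength = total % 2 === 1
      ? lengths[mid]
      : (lengths[mid - 1] + lengths[mid]) / 2;
  }
  return {
    total,
    successRate: total === 0 ? 0 : succeeded / total,
    medianLength,
  };
}
```

A drop in successRate or medianLength after a deploy is a hint that a waitUntil change, or a change on the target sites, is cutting content short.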

Tools for Testing

# Test locally with browser rendering options
# Monitor network activity in DevTools:
#   chrome://inspect/#devices

# For Cloudflare Browser Rendering, test with the REST API
# (verify the endpoint path against the current Cloudflare docs):
curl -X POST "https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/browser-rendering/content" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://target-site.com", "gotoOptions": {"waitUntil": "networkidle2"}}'

For fellow developers building content extraction tools: don't underestimate the impact of waitUntil options. The right choice can mean the difference between extracting meaningful content and empty shells.


Zin Flow is available on iOS and Mac for converting web articles to EPUB format.

Discussion

What waitUntil strategies have you found most effective for your scraping projects? Any other gotchas with dynamic content extraction?
