Rate Limiting and Pagination
Managing data fetching in OpenETL often involves handling API rate limits and paginating large datasets. This section explains these concepts and how to configure them effectively.
What is Rate Limiting?
Rate limiting caps the frequency or volume of requests to a data source, preventing overload or bans (e.g., "429 Too Many Requests" from APIs). In OpenETL, it ensures pipelines respect source constraints, retrying when needed.
What is Pagination?
Pagination splits large datasets into smaller chunks, fetched page-by-page (e.g., 100 records at a time). OpenETL uses pagination to handle big data efficiently, avoiding memory overload and timeouts.
Configuring Rate Limiting
Rate limiting is set in the pipeline's rate_limiting option, controlling request pace and retries:
Property | Description |
---|---|
requests_per_second | Max requests per second |
concurrent_requests | Max simultaneous requests |
max_retries_on_rate_limit | Retry attempts on rate limit errors |
rate_limiting: {
  requests_per_second: 10,
  concurrent_requests: 5,
  max_retries_on_rate_limit: 3,
}
This caps requests at 10/sec, allows 5 at once, and retries up to 3 times if throttled.
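To see how the two caps interact, the sketch below plans when each request could start. It is an illustration only, not OpenETL's scheduler: the helper name planStartTimes is hypothetical, requests are assumed to be spaced evenly (1000 / requests_per_second ms apart), and every request is assumed to take the same fixed time.

```typescript
// Mirrors the rate_limiting option described above.
interface RateLimiting {
  requests_per_second: number;
  concurrent_requests: number;
  max_retries_on_rate_limit: number; // not used by this sketch
}

// Hypothetical helper (not part of OpenETL): returns the earliest start
// time in ms for each of `count` requests, assuming every request takes
// `durationMs` and that requests are spaced evenly.
function planStartTimes(
  count: number,
  durationMs: number,
  cfg: RateLimiting,
): number[] {
  const minGapMs = 1000 / cfg.requests_per_second;
  const starts: number[] = [];
  const ends: number[] = [];

  for (let i = 0; i < count; i++) {
    // Pace cap: wait at least minGapMs after the previous start.
    const paced = i === 0 ? 0 : starts[i - 1] + minGapMs;
    // Concurrency cap: request i needs the slot held by request
    // i - concurrent_requests, which frees up when that request ends.
    const slotFree =
      i < cfg.concurrent_requests ? 0 : ends[i - cfg.concurrent_requests];
    const start = Math.max(paced, slotFree);
    starts.push(start);
    ends.push(start + durationMs);
  }
  return starts;
}

const plan = planStartTimes(7, 1000, {
  requests_per_second: 10,
  concurrent_requests: 5,
  max_retries_on_rate_limit: 3,
});
console.log(plan);
```

With the configuration above and requests that each take one second, the first five requests start 100 ms apart (0 to 400 ms). The sixth then waits until 1000 ms for a free slot, so the concurrency cap, not the per-second cap, becomes the bottleneck for slow requests.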
Configuring Pagination
Pagination is configured in a connector's pagination option, defining how data is chunked:
- type: 'offset', 'cursor', or 'page'.
- itemsPerPage?: Items per fetch (default: 100).
- pageOffsetKey?: Starting offset (e.g., '0').
- cursorKey?: Cursor token for cursor-based APIs.
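The options listed above can be summarised as a TypeScript shape. This is a sketch inferred from the list, not the type exported by OpenETL; the interface name, the string types for the two keys, and the helper itemsPerPageOf are assumptions made for illustration.

```typescript
// Assumed shape of the pagination option, inferred from the list above.
interface Pagination {
  type: 'offset' | 'cursor' | 'page';
  itemsPerPage?: number; // items per fetch
  pageOffsetKey?: string; // starting offset, e.g. '0'
  cursorKey?: string; // cursor token for cursor-based APIs
}

// Default stated in the option list.
const DEFAULT_ITEMS_PER_PAGE = 100;

// Hypothetical helper: resolves the effective page size for a config.
function itemsPerPageOf(pagination: Pagination): number {
  return pagination.itemsPerPage ?? DEFAULT_ITEMS_PER_PAGE;
}

const explicitSize = itemsPerPageOf({ type: 'offset', itemsPerPage: 50 });
const defaultSize = itemsPerPageOf({ type: 'cursor' });
console.log(explicitSize, defaultSize);
```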
Offset Example:
pagination: {
  type: 'offset',
  itemsPerPage: 50,
  pageOffsetKey: '0',
}
Cursor Example:
pagination: {
  type: 'cursor',
  itemsPerPage: 100,
}
Offset uses numeric steps; cursor uses tokens (e.g., HubSpot's after).
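The difference between the two styles is easiest to see side by side. The sketch below pages through an in-memory array that stands in for a remote API. All names here (fetchByOffset, fetchByCursor, the "c:" token format) are hypothetical and chosen for illustration; real adapters and real APIs define their own request and response shapes.

```typescript
// In-memory stand-in for a remote dataset.
const records: string[] = Array.from({ length: 7 }, (_, i) => `record-${i}`);

interface OffsetPage { items: string[]; nextOffset: number | undefined }
interface CursorPage { items: string[]; after: string | undefined }

// Offset style: the caller sends a numeric position and a page size.
function fetchByOffset(offset: number, limit: number): OffsetPage {
  const items = records.slice(offset, offset + limit);
  const next = offset + items.length;
  return { items, nextOffset: next < records.length ? next : undefined };
}

// Cursor style: the caller echoes back an opaque token from the last page.
function fetchByCursor(after: string | undefined, limit: number): CursorPage {
  const start = after === undefined ? 0 : Number(after.slice(2));
  const items = records.slice(start, start + limit);
  const next = start + items.length;
  return { items, after: next < records.length ? `c:${next}` : undefined };
}

function readAllByOffset(limit: number): string[] {
  const out: string[] = [];
  let offset: number | undefined = 0;
  while (offset !== undefined) {
    const page: OffsetPage = fetchByOffset(offset, limit);
    out.push(...page.items);
    offset = page.nextOffset; // numeric step to the next page
  }
  return out;
}

function readAllByCursor(limit: number): string[] {
  const out: string[] = [];
  let after: string | undefined = undefined;
  do {
    const page: CursorPage = fetchByCursor(after, limit);
    out.push(...page.items);
    after = page.after; // token is treated as opaque by the caller
  } while (after !== undefined);
  return out;
}

const viaOffset = readAllByOffset(3);
const viaCursor = readAllByCursor(3);
console.log(viaOffset.length, viaCursor.length);
```

Both loops return the same records. The offset loop can compute positions itself, while the cursor loop only knows the token it was handed, so it cannot jump to an arbitrary page.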
Handling Rate Limits and Pagination in Pipelines
Combine both in a pipeline for robust data fetching:
import Orchestrator from 'openetl';
import { hubspot } from '@openetl/hubspot';

const vault = { 'hs-auth': { type: 'oauth2', credentials: { /* ... */ } } };
const orchestrator = Orchestrator(vault, { hubspot });

orchestrator.runPipeline({
  id: 'hs-contacts',
  source: {
    adapter_id: 'hubspot',
    endpoint_id: 'contacts',
    credential_id: 'hs-auth',
    fields: ['firstname'],
    pagination: { type: 'cursor', itemsPerPage: 50 },
  },
  rate_limiting: {
    requests_per_second: 5,
    concurrent_requests: 2,
    max_retries_on_rate_limit: 2,
  },
  logging: event => console.log(event),
});
This fetches HubSpot contacts 50 at a time using cursor pagination, limits requests to 5/sec with 2 concurrent, retries twice on rate limits, and logs progress. Adapters handle pagination responses (e.g., nextOffset), while the Orchestrator enforces rate limits.
Next: Error Handling!