
In this blog post, I will discuss how to work with paginated APIs in Etlworks.
Most (but not all) APIs limit the amount of data that can be returned or uploaded in a single request. They provide a pagination mechanism: the response includes the information needed to access the next "page" (hence the name).
In Etlworks, retrieving data from a paginated API can be implemented using a nested flow with a loop, together with a technique called dynamic URLs.
Common pagination patterns
Let’s look at the most common pagination patterns:
- An API where the amount of data is controlled by parameters in the request, and the loop stops when there is no more data to return. I will be using the Jira API to illustrate this pattern.
- An API where the server controls the amount of data by providing a "next marker" in the response, so the loop stops when there is no next marker. I will be using an API provided by an organization called UKRI Gateway to illustrate this pattern.
API with the page number (JIRA)
The URL for the typical Jira API endpoint looks like below:
https://company.atlassian.net/rest/api/3/search?jql=query
&startAt=pointer&maxResults=10
When called it returns the following JSON:
{"expand": "schema,names", "startAt": 0, "maxResults": 10, "total": 2823, "results": [...]}
As you can see, the request contains the starting pointer and the maximum number of results. The response includes the total number of results.
So, the algorithm to iterate through the pages will look like below:
- Make the first call with startAt=0 and maxResults=1.
- Extract total from the response and calculate the total number of calls that need to be made to the endpoint with maxResults=100 (100 is the maximum allowed number of results).
- Create a loop to iterate through the pages. The loop ends when the number of calls made to the endpoint reaches the total number of calls, calculated in step 2.
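The page arithmetic in steps 1-3 can be sketched in plain JavaScript (an illustration only, not Etlworks code; the function and variable names are mine):

```javascript
// Number of calls needed to fetch totalNumberOfRecords records,
// pageSize records at a time (equivalent to rounding up the division).
function numberOfPages(totalNumberOfRecords, pageSize) {
    if (totalNumberOfRecords === 0) {
        return 0;
    }
    return Math.ceil(totalNumberOfRecords / pageSize);
}

// The startAt value for each call: 0, pageSize, 2 * pageSize, ...
function startAtValues(totalNumberOfRecords, pageSize) {
    var values = [];
    var pages = numberOfPages(totalNumberOfRecords, pageSize);
    for (var page = 0; page < pages; page++) {
        values.push(page * pageSize);
    }
    return values;
}

console.log(numberOfPages(2823, 100)); // 29
console.log(startAtValues(250, 100));  // [0, 100, 200]
```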
Now, let’s put it all together.
In step 1, we will need to create connections for the example.
- HTTP connection for the JIRA API.
- Database connection to the PostgreSQL database.
The URL for the HTTP connection which returns Jira tickets for the project TEST will look like below (we are using the Jira Cloud version of the API, so substitute company in the URL with your actual company name). Note that the Jira API uses preemptive basic authentication.
https://company.atlassian.net/rest/api/3/search?jql=project%20%3D%20TEST
&startAt={startAt}&maxResults=100
Notice that the value of the startAt parameter in the URL includes the token {startAt}. The idea is that in the loop we will be replacing the token {startAt} with the actual values: 0, 100, 200, 300, etc.
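A minimal sketch of the token substitution in plain JavaScript (illustration only; in Etlworks the substitution is performed by the dynamic URL mechanism, and urlForPage is a name I made up):

```javascript
// The HTTP connection URL from above, with the {startAt} token.
var urlTemplate = "https://company.atlassian.net/rest/api/3/search?jql=project%20%3D%20TEST" +
    "&startAt={startAt}&maxResults=100";

// Replace the {startAt} token with the actual value for one page.
function urlForPage(template, startAt) {
    return template.replace("{startAt}", String(startAt));
}

console.log(urlForPage(urlTemplate, 200));
// ...&startAt=200&maxResults=100
```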
For this example, I will be using a database loop, so we will need one more connection, to a temporary database, which will be used to drive the loop.
In step 2, we will be creating the flows.
1. Create a SQL flow which creates a temporary loop table using the SQL below:
create table pages (startAt integer)
2. Create JavaScript flow to populate the loop table:
importPackage(com.toolsverse.etl.core.engine);
importPackage(com.toolsverse.util);
importPackage(com.toolsverse.config);
importPackage(com.toolsverse.etl.sql.util);
importPackage(java.util);

// maximum number of results the API is allowed to return per call
var pageSize = 100;

// get the total number of records from the response to the first call
var params = Extractor.lookup(etlConfig, scenario, "First", "params", "select total");
var totalNumberOfRecords = params.getFieldValue(params.getRecord(0), "total");

// calculate the number of pages (round up when there is a remainder)
var numberOfPages = Math.floor(totalNumberOfRecords / pageSize);
var remainder = totalNumberOfRecords % pageSize;
numberOfPages = totalNumberOfRecords == 0 ? 0 :
    (numberOfPages == 0 ? 1 :
        (remainder > 0 ? numberOfPages + 1 : numberOfPages));

// insert the startAt value for each page into the loop table
var connection = etlConfig.getConnectionFactory().getConnection("Temporary database");
var sql = "insert into pages (startAt) values (?)";
var startAt = 0;
for (var page = 0; page < numberOfPages; page++) {
    SqlUtils.executeSql(connection, sql, startAt);
    startAt = startAt + pageSize;
}
3. Create a flow “Extract data from the Web Service and load into the database” where the source is a web service (Jira endpoint above) and the destination is a PostgreSQL database.
4. Create a nested flow which combines flows 1, 2, and 3.
5. Modify step 3 by configuring a database loop where the connection is a Temporary database and the driving SQL is:
select startAt as "startAt" from pages
Download the JIRA example from etlworks.com. You can import the example into your Etlworks account and try it out. Don’t forget to edit the PostgreSQL and Jira connections before executing the flow.
Download a fully functional example for another paginated API with a page-number query parameter from etlworks.com. You can import the example into your Etlworks account and try it out without changing anything. This flow pulls the data from the paginated API and loads it into a shared Google Sheet.
API with the next marker (UKRI Gateway)
For this example, I will be using UKRI Gateway API endpoint with the following URL:
https://www.ebi.ac.uk/europepmc/webservices/rest/search
?query=italy&cursorMark={NEXT_MARKER}
If the value of the cursorMark query parameter is * (star character), the API returns the first page, which contains data and, possibly, the next value for the cursorMark:
<response>
<nextCursorMark>535543435GUI999==</nextCursorMark>
<data>
...
</data>
</response>
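For illustration, here is one way to pull the nextCursorMark value out of such a response with a regular expression (in the actual flow the value comes from the parsed dataset; extractNextCursorMark is a name I made up):

```javascript
// Return the text of the <nextCursorMark> element, or null when the
// response does not contain one (i.e. this was the last page).
function extractNextCursorMark(xml) {
    var match = /<nextCursorMark>([^<]*)<\/nextCursorMark>/.exec(xml);
    return match ? match[1] : null;
}

var xml = "<response><nextCursorMark>535543435GUI999==</nextCursorMark>" +
    "<data>...</data></response>";
console.log(extractNextCursorMark(xml)); // 535543435GUI999==
```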
So, the algorithm to iterate through the pages will look like below:
- Make the first call with cursorMark=*.
- Get the value for the cursorMark from the response by extracting the value of the nextCursorMark field.
- Create a loop to iterate through the pages. The loop ends when the response doesn’t have the nextCursorMark.
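The loop can be sketched in plain JavaScript, with fetchPage standing in for the actual HTTP call (illustration only; in Etlworks the call is made by the source connection, and the names here are mine):

```javascript
// Keep calling the endpoint, starting with cursorMark=*, until a
// response comes back without a nextCursorMark.
function collectAllPages(fetchPage) {
    var cursorMark = "*";
    var pages = [];
    while (cursorMark) {
        var response = fetchPage(cursorMark);
        pages.push(response.data);
        cursorMark = response.nextCursorMark || null; // null stops the loop
    }
    return pages;
}

// Simulated endpoint returning two pages:
var responses = {
    "*": { data: ["a", "b"], nextCursorMark: "m1" },
    "m1": { data: ["c"] } // no nextCursorMark -> last page
};
console.log(collectAllPages(function (m) { return responses[m]; }));
// two pages: [["a","b"],["c"]]
```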
Now, let’s put it all together.
In step 1, we will need to create connections for the example.
- HTTP connection for the UKRI Gateway API endpoint above.
- Connection to the shared Google Sheet.
Notice that the value of the cursorMark query parameter is the token {NEXT_MARKER}. The idea is to replace the token with the next marker, make the API call, and repeat until there is no next marker anymore.
We will be loading data into the shared Google Sheet by calling the API endpoint multiple times, so we will need a connection to a temporary staging database to dump the data from the API before loading the whole thing into the Google Sheet.
In step 2, we will be creating the flows.
1. Create a JavaScript flow to set the initial value of the {NEXT_MARKER} token to * (star character):
importPackage(com.toolsverse.config);
SystemConfig.instance().getProperties().put("NEXT_MARKER", "*");
2. Create a flow “Extract data from the Web Service and load into the database” where the source is a web service and the destination is a temporary database.
3. Click the “Mapping->Additional Transformations” and add the following JavaScript as a value for the After Extract transformation:
importPackage(com.toolsverse.util);
importPackage(com.toolsverse.config);

// get the value of the nextCursorMark field from the first record of the response
var nextCursorMark = dataSet.getFieldValue(dataSet.getRecord(0), "nextCursorMark");
if (Utils.isNothing(nextCursorMark)) {
    nextCursorMark = "";
}

// store it as the new value of the {NEXT_MARKER} token
SystemConfig.instance().getProperties().put("NEXT_MARKER", nextCursorMark);
4. Create flow “Extract data from database, push into the well known API” where the source is a temporary database and the destination is a shared Google Sheet.
5. Create nested flow which combines flows 1, 2, 3, and 4.
6. Modify step 2 by configuring a JavaScript loop, where the JavaScript for the loop is:
importPackage(com.toolsverse.config);
importPackage(com.toolsverse.util);

// continue the loop while there is a next marker; return null to stop
var nextCursorMark = SystemConfig.instance().getProperties().get("NEXT_MARKER");
value = !Utils.isNothing(nextCursorMark) ? nextCursorMark : null;
Download a fully functional example for a paginated API with a next marker from etlworks.com. You can import the example into your Etlworks account and try it out without changing anything. This flow pulls the data from the paginated API and loads it into the shared Google Sheet.