Talk with ChatGPT

Lucija Šošić

/ 2023-04-28

Introduction

Since the release of ChatGPT it has been the number one topic of conversation in the AI/ML community and the feedback has been overwhelming with good reason.

Lauded as a real threat to Google’s business, its capabilities seem endless. From writing college essays to assisting programmers, it responds to all kinds of queries concisely and informatively.

As it has been shown to resolve any inconveniences in the user’s life so far, we decided to test it on something troubling our accounting department. Is there a way for ChatGPT to give them a helping hand with an employee’s PTO requests?

Specifically, we are interested in employees who have requested PTO for ten or more consecutive days. But to make things spicy, we are only looking for intervals that have ten or more workdays, excluding the weekends and national holidays.

SQL problem “Islands and Gaps”

“Islands and Gaps” are a classic problem in SQL that involves finding ranges of existing values. The gap problem focuses on finding the ranges of missing values, whereas solving the island problem involves finding the ranges of existing values.

In a simple example, let’s see the problem in action to understand it better. Suppose we have a table column with sorted values 1, 2, 5, 6, 7, 10, 11, 12, 14, and 15. Data gaps exist because values are missing: (3, 4), (8, 9), and 13. Data islands occur as data gaps separate data groups. In the example, data islands are (1, 2), (5, 6, 7), (10, 11, 12), and (14, 15).

The problem may also arise when dealing with temporal data when there is a need to draw answers from periods of activity or inactivity. For example, how frequently has someone logged into a website? Or, how often did the server go offline?

In our case, the accounting department is facing a real-life example of the gaps and islands problem, focusing on the latter. We want Chatty, short for ChatGPT, to identify PTO islands - groups of continuous sequences that depict consecutive days a person has requested as time off.

To track Chatty’s progress and check if the returned query makes sense at each step, we set up a PostgreSQL table, “time_off_table”, that tracks the first six months of 2022. To keep things simple, we only have one person, “John Doe”, with a couple of short and long periods of request PTO.

Conversation with Chatty

While researching the chat’s ability to understand us and return a meaningful answer, our own linguistic and descriptive skills are also put to the test. The plan is to start with introductory- high-level questions and make our way to the end goal, a query we can proudly present to our accountant. With each interaction, we aim to steer the conversation in the right direction by gradually introducing restrictions on our initial inquiry.

The paragraph below shows our conversation from start to finish and how we applied the chat's suggestion to our data. As we went through all nine steps, some responses we received worked perfectly. In contrast, others needed corrections or additional modifications to work fully. The final result, which meets all our criteria, is also the last step, but we encourage you to try all steps by yourself.

Firstly, we started with a simple question to check if Chatty is aware of the SQL problem of islands and gaps.

User: Can you explain why the SQL islands and gaps problem occurs?

Chatty: The "SQL islands and gaps" problem occurs when data is missing or split into non-contiguous intervals, resulting in "islands" of data surrounded by "gaps" of missing values. This can happen in cases where the data is collected or recorded in an inconsistent manner, or when data is deleted or lost. In a relational database, this can lead to inaccuracies in results and unexpected behavior when querying the data. To resolve this problem, data cleaning and data normalization techniques can be applied to ensure data is complete and consistent, or a more robust data collection process can be implemented to prevent the issue from occurring in the first place.

Great! We can present our problem of tracking employees’ PTO as detecting islands in our data. We can then politely ask for an SQL query that does that for us.

User: If I give you a description of a table, could you give me a SQL query that detects "islands" in data?

Chatty: Sure! Can you please provide me with a description of the table and the criteria for defining an "island"?

For example, if the table has columns "id", "start_date", and "end_date" and an "island" is defined as a gap in the data of more than 1 day between two consecutive rows with the same "id", a SQL query to detect the "islands" could be:

WITH cte AS (
  SELECT id, start_date, end_date,
         LAG(end_date) OVER (PARTITION BY id ORDER BY start_date) AS prev_end_date
  FROM table_name
)
SELECT id, start_date, end_date
FROM cte
WHERE prev_end_date IS NULL OR start_date > prev_end_date + INTERVAL '1 day'

This query uses a Common Table Expression (CTE) to find the end date of the previous row with the same "id". The final result only includes the rows where the previous end date is missing or the difference between the start date of the current row and the end date of the previous row is more than 1 day, indicating an "island" in the data.

So far, so good. Unfortunately, we have our data in a slightly different format. Our next step is to describe as best as possible the table and see if it can manage to return duration in the same manner as before.

User: Please give me a SQL query that finds the duration of the employee's PTO. For example, assume the table "time_off_table" describes how employees took their time off in the past year. The table has three columns. The first column is called "employee_name" and contains a string value representing the employee's name. The second column is called "day_of_year" and includes a timestamp value representing a day in a year. Lastly, the third column, called "PTO", contains an integer value set to 1 if the person in question was on PTO that day and set to 0 if the person was not on PTO that day. Can you give me a SQL query that returns the start date, end date, and the duration of an interval of consecutive days the person was on PTO?

Chatty: Yes, sure! Here's a SQL query that returns the start date, end date, and the duration of an interval of consecutive days the person was on PTO:

WITH cte AS (
  SELECT 
    employee_name,
    day_of_year,
    PTO,
    SUM(PTO) OVER (PARTITION BY employee_name ORDER BY day_of_year) as grp
  FROM 
    time_off_table
)
SELECT 
  employee_name,
  MIN(day_of_year) as start_date,
  MAX(day_of_year) as end_date,
  COUNT(day_of_year) as duration
FROM 
  cte
WHERE 
  PTO = 1
GROUP BY 
  employee_name, grp
ORDER BY 
  employee_name, start_date;

The query uses a Common Table Expression (CTE) to assign a group number to consecutive PTO days for each employee. The main query then selects the start date, end date, and duration of each group.

center-big

Not really satisfied with the result. Maybe we can suggest modifying the query, so we get consecutive days?

User: This query returns each day as its own interval. Can you give me a query that returns intervals of consecutive days the person was on PTO?

Chatty: Sure, you can modify the query to return intervals of consecutive days by grouping the CTE results by the employee name and the group number assigned to consecutive PTO days:

WITH cte AS (
  SELECT 
    employee_name,
    day_of_year,
    PTO,
    SUM(CASE WHEN PTO = 0 THEN 1 ELSE 0 END) 
      OVER (PARTITION BY employee_name ORDER BY day_of_year) as grp
  FROM 
    time_off_table
)
SELECT 
  employee_name,
  MIN(day_of_year) as start_date,
  MAX(day_of_year) as end_date,
  COUNT(day_of_year) as duration
FROM 
  cte
WHERE 
  PTO = 1
GROUP BY 
  employee_name, grp
ORDER BY 
  employee_name, start_date;

This query will return intervals of consecutive days that an employee was on PTO by grouping the CTE results based on the group number assigned to consecutive PTO days.

center-big

Now we are talking! As mentioned before, our goal is to get the list of employees whose duration is equal to or exceeds 10 days, so that is our following guidance for Chatty.

User: This is great! Thank you. Could you modify the query where the duration is 10 or more days?

Chatty: Sure! You can modify the query to only return intervals with a duration of 10 or more days by adding a HAVING clause to the main query:

WITH cte AS (
  SELECT 
    employee_name,
    day_of_year,
    PTO,
    SUM(CASE WHEN PTO = 0 THEN 1 ELSE 0 END) 
      OVER (PARTITION BY employee_name ORDER BY day_of_year) as grp
  FROM 
    time_off_table
)
SELECT 
  employee_name,
  MIN(day_of_year) as start_date,
  MAX(day_of_year) as end_date,
  COUNT(day_of_year) as duration
FROM 
  cte
WHERE 
  PTO = 1
GROUP BY 
  employee_name, grp
HAVING 
  COUNT(day_of_year) >= 10
ORDER BY 
  employee_name, start_date;

This query will only return intervals of consecutive days with a duration of 10 or more days.

center-big

The catch is not to count weekends as part of someone’s PTO. We are only interested if the requested time off has 10 or more strictly workdays. Let’s see if Chatty can modify the query to fulfill our condition.

User: Super! Thank you. By applying this query, I get employees and their start and end days for PTO lasting 10 days or more, but those periods can contain weekends. Can you modify the query to return employees whose PTO contains 10 or more strictly work days?

Chatty: To return intervals of consecutive work days with a duration of 10 or more, you can first create a table that contains a list of work days. You can then join this table with the time_off_table to only include work days in the calculation of duration.

Here's an example query that implements this logic:

WITH workdays AS (
  SELECT 
    day_of_year
  FROM 
    calendar_table
  WHERE 
    is_weekend = 0
),
cte AS (
  SELECT 
    employee_name,
    day_of_year,
    PTO,
    SUM(CASE WHEN PTO = 0 THEN 1 ELSE 0 END) 
      OVER (PARTITION BY employee_name ORDER BY day_of_year) as grp
  FROM 
    time_off_table t
    JOIN workdays w ON t.day_of_year = w.day_of_year
)
SELECT 
  employee_name,
  MIN(day_of_year) as start_date,
  MAX(day_of_year) as end_date,
  COUNT(day_of_year) as duration
FROM 
  cte
WHERE 
  PTO = 1
GROUP BY 
  employee_name, grp
HAVING 
  COUNT(day_of_year) >= 10
ORDER BY 
  employee_name, start_date;

The workdays CTE is used to create a table of work days based on a calendar_table that you would need to create beforehand. The time_off_table is then joined with this workdays table so that only workdays are considered in the calculation of duration. The query then calculates the duration of PTO intervals as in the previous examples.

It seems a bit messy to introduce an additional table just to track workdays. We would like to find a way to detect if it is a working day or not, straight out of date. Hopefully, Chatty is okay with us trying to reduce our scope of work entirely.

User: I am not able to create a table containing work days. Can you modify the query to check the work days from a date?

Chatty: Sure, you can modify the query to check the work days from a specific date. To do this, you can use a subquery in the WHERE clause of the main query that calculates the number of weekdays between a given date and each day in the time_off_table.

Here's an example:

WITH cte AS (
  SELECT 
    employee_name,
    day_of_year,
    PTO,
    SUM(CASE WHEN PTO = 0 THEN 1 ELSE 0 END) 
      OVER (PARTITION BY employee_name ORDER BY day_of_year) as grp
  FROM 
    time_off_table t
    WHERE 
      (date_part('dow', day_of_year) NOT IN (0, 6) AND 
      day_of_year >= (SELECT '2022-01-01'::date + (10 || ' work days')::interval))
)
SELECT 
  employee_name,
  MIN(day_of_year) as start_date,
  MAX(day_of_year) as end_date,
  COUNT(day_of_year) as duration
FROM 
  cte
WHERE 
  PTO = 1
GROUP BY 
  employee_name, grp
HAVING 
  COUNT(day_of_year) >= 10
ORDER BY 
  employee_name, start_date;

This time, the subquery in the WHERE clause calculates the number of weekdays (excluding weekends) between a given date and each day in the time_off_table. The example uses the date '2022-01-01' and adds 10 weekdays to it to obtain the minimum date that is considered in the calculation of duration. You can modify this date as needed to suit your requirements.

By double-checking the returned query, we can see that it will result in an error, as there is a mashup of integer and interval values. Does “I guess we are all humans” apply to this scenario?

Anyways, we need to let Chatty know.

User: Thank you! Unfortunately, I get an error saying there is an invalid input syntax for type interval '10 work days'. Can you correct the query?

Chatty: Yes, my apologies for that. The syntax is incorrect. Here's the corrected query:

WITH cte AS (
  SELECT 
    employee_name,
    day_of_year,
    PTO,
    SUM(CASE WHEN PTO = 0 THEN 1 ELSE 0 END) 
      OVER (PARTITION BY employee_name ORDER BY day_of_year) as grp
  FROM 
    time_off_table t
    WHERE 
      (date_part('dow', day_of_year) NOT IN (0, 6) AND 
      day_of_year >= ('2022-01-01'::date + (10 * INTERVAL '1 day')))
)
SELECT 
  employee_name,
  MIN(day_of_year) as start_date,
  MAX(day_of_year) as end_date,
  COUNT(day_of_year) as duration
FROM 
  cte
WHERE 
  PTO = 1
GROUP BY 
  employee_name, grp
HAVING 
  COUNT(day_of_year) >= 10
ORDER BY 
  employee_name, start_date;

In this corrected query, the 10 * INTERVAL '1 day' expression is used to calculate 10 days from the starting date '2022-01-01'.

center-big

So, after that little hiccup, it looks like we are coming close to a solution! Just as we do not want to include weekends, we want to skip national holidays, as those days are already non-working days.

User: Great! Thank you so much! Could you modify the query to exclude national holidays from the calculation?

**Chatty: **To exclude national holidays from the calculation, you need to create a table that contains the dates of all national holidays. You can then join this table with the time_off_table and exclude the holiday dates in the calculation.

Here's an example:

WITH cte AS (
  SELECT 
    t.employee_name,
    t.day_of_year,
    t.PTO,
    SUM(CASE WHEN t.PTO = 0 THEN 1 ELSE 0 END) 
      OVER (PARTITION BY t.employee_name ORDER BY t.day_of_year) as grp
  FROM 
    time_off_table t
    LEFT JOIN holidays h
      ON t.day_of_year = h.holiday_date
    WHERE 
      (date_part('dow', t.day_of_year) NOT IN (0, 6) AND 
      t.day_of_year >= ('2022-01-01'::date + (10 * INTERVAL '1 day'))) AND
      h.holiday_date IS NULL
)
SELECT 
  employee_name,
  MIN(day_of_year) as start_date,
  MAX(day_of_year) as end_date,
  COUNT(day_of_year) as duration
FROM 
  cte
WHERE 
  PTO = 1
GROUP BY 
  employee_name, grp
HAVING 
  COUNT(day_of_year) >= 10
ORDER BY 
 	 employee_name, start_date;

In the final query, the holidays table contains the dates of all national holidays. The LEFT JOIN between the time_off_table and the holidays table will exclude the holiday dates in the calculation. The WHERE clause in the subquery cte includes a condition that only includes the dates where the holiday_date is NULL, effectively excluding the national holidays.

center-big

In our case, we created a dramatic, dreamy “holidays” table that treats every day in March as a national holiday. We can see that the remaining requested period was successfully narrowed down to February dates.

Findings

Not to sugarcoat everything - ChatGPT needed a nudge in the right direction to work in the desired way. The first time, it produced durations that overlapped or just did not make sense for longer intervals. The second time, it had trouble interpreting the PTO flag, and everything went downhill.

We realized that the input we provide has a great impact on how our intention is interpreted and how the flow goes on from that moment on. Moreover, while it can understand almost everything we ask, it has a limited attention span. We needed to find an optimal way not to overflow it with new information, so it knows our next query is a continuation of the previous one.

Another setback was directly connected to the fact current wave of massive hype surrounding ChatGPT. The chat’s availability and speed differed during the day. There were moments when we had to wait for the query a long time, ultimately stopping the conversation and continuing sometime later.

On the other hand, it should get the recognition it deserves. The process of testing the chat’s possibilities was indeed fun. It was interesting to interact with it, both in a conversational sense and by seeing its technical capabilities.

It showed a vast knowledge of SQL syntax. For example, it proved it could switch the syntax from MS Server to PostgreSQL with ease. Moreover, it understood was able to handle any terminology thrown at it. It corrected itself every single time there was an error. Not only that, but it showed the capability to correct the query even without us mentioning the cause of the error.

Conclusion

Have you tried communicating with ChatGPT on your own? Even if you’re not looking to generate content for personal use, you will still be blown away by its capabilities. We encourage everyone to give it a try in their spare time. In the meantime, we will surely occupy ourselves by playing around.

If you find this reading inspiring, and you want us to cover another topic, let us know. Already thinking about saving time by automating tasks such as scheduling meetings? Or maybe requesting personalized recommendations based on our preferences? iOLAP’s Machine Learning department is very excited to see how tools like this can be utilized in our everyday lives.

Share This Story, Choose Your Platform!

Share This Story

Drive your business forward!

iOLAP experts are here to assist you