py-src/data_formulator/agents/agent_py_data_rec.py
Lines changed: 36 additions & 5 deletions
@@ -32,6 +32,7 @@
  "recap": "..." // string, a short summary of the user's goal.
  "display_instruction": "..." // string, the even shorter verb phrase describing the users' goal.
  "recommendation": "..." // string, explain why this recommendation is made
+ "input_tables": [...] // string[], describe names of the input tables that will be used in the transformation.
  "output_fields": [...] // string[], describe the desired output fields that the output data should have (i.e., the goal of transformed data), it's a good idea to preseve intermediate fields here
  "chart_type": "" // string, one of "point", "bar", "line", "area", "heatmap", "group_bar", 'boxplot'. "chart_type" should either be inferred from user instruction, or recommend if the user didn't specify any.
  "chart_encodings": {
@@ -65,6 +66,7 @@
  - if you mention column names from the input or the output data, highlight the text in **bold**.
  * the column can either be a column in the input data, or a new column that will be computed in the output data.
  * the mention don't have to be exact match, it can be semantically matching, e.g., if you mentioned "average score" in the text while the column to be computed is "Avg_Score", you should still highlight "**average score**" in the text.
+ - determine "input_tables", the names of a subset of input tables from [CONTEXT] section that will be used to achieve the user's goal.
  - "chart_type" must be one of "point", "bar", "line", "area", "heatmap", "group_bar", "boxplot"
  - "chart_encodings" should specify which fields should be used to create the visualization
  - decide which visual channels should be used to create the visualization appropriate for the chart type.
- if the user provided one table, then it should be `def transform_data(df1)`, if the user provided multiple tables, then it should be `def transform_data(df1, df2, ...)` and you should consider the join between tables to derive the output.
- - **VERY IMPORTANT** the number of arguments in the function must match the number of tables provided, and the order of arguments must match the order of tables provided.
- - you can use intuitive table names to refer to the input dataframes, for example, if the user provided two tables city and weather, you can use `transform_data(df_city, df_weather)` to refer to the two dataframes, as long as the number and order of the arguments match the number and order of the tables provided.
+ - decide the function signature based on the number of tables you decided in the previous step "input_tables":
+ - if you decide there will only be one input table, then function signature should be `def transform_data(df1)`
+ - if you decided there will be k input tables, then function signature should be `def transform_data(df_1, df_2, ..., df_k)`.
+ - instead of using generic names like df1, df2, ..., try to use intuitive table names for function arguments, for example, if you have input_tables: ["City", "Weather"]`, you can use `transform_data(df_city, df_weather)` to refer to the two dataframes.
+ - **VERY IMPORTANT** the number of arguments in the function signature must be the same as the number of tables provided in "input_tables", and the order of arguments must match the order of tables provided in "input_tables".
  - datetime objects handling:
  - if the output field is year, convert it to number, if it is year-month / year-month-day, convert it to string object (e.g., "2020-01" / "2020-01-01").
  - if the output is time only: convert hour to number if it's just the hour (e.g., 10), but convert hour:min or h:m:s to string object (e.g., "10:30", "10:30:45")
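The datetime handling rules in this hunk can be illustrated with a small helper. This is only a sketch and not code from the commit: `normalize_datetime_field` and its `granularity` argument are hypothetical names, and it uses the standard `datetime` module instead of pandas to stay self-contained.

```python
from datetime import datetime


def normalize_datetime_field(value: datetime, granularity: str):
    """Hypothetical helper: convert a datetime following the prompt's rules."""
    if granularity == "year":
        return value.year                      # year becomes a number
    if granularity == "year-month":
        return value.strftime("%Y-%m")         # e.g. "2020-01"
    if granularity == "year-month-day":
        return value.strftime("%Y-%m-%d")      # e.g. "2020-01-01"
    if granularity == "hour":
        return value.hour                      # hour alone becomes a number
    if granularity == "hour:min":
        return value.strftime("%H:%M")         # e.g. "10:30"
    if granularity == "h:m:s":
        return value.strftime("%H:%M:%S")      # e.g. "10:30:45"
    raise ValueError(f"unsupported granularity: {granularity}")


ts = datetime(2020, 1, 1, 10, 30, 45)
year_value = normalize_datetime_field(ts, "year")
month_value = normalize_datetime_field(ts, "year-month")
time_value = normalize_datetime_field(ts, "hour:min")
```

The numeric cases stay numeric and the remaining cases become strings, which matches the examples quoted in the prompt text.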
"display_instruction": "Rank students by average scores",
  "mode": "infer",
  "recommendation": "To rank students based on their average scores, we need to calculate the average score for each student, then sort the data, and finally assign a rank to each student based on their average score.",
py-src/data_formulator/agents/agent_py_data_transform.py
Lines changed: 40 additions & 9 deletions
@@ -34,6 +34,7 @@
  - if you mention column names from the input or the output data, highlight the text in **bold**.
  * the column can either be a column in the input data, or a new column that will be computed in the output data.
  * the mention don't have to be exact match, it can be semantically matching, e.g., if you mentioned "average score" in the text while the column to be computed is "Avg_Score", you should still highlight "**average score**" in the text.
+ - determine "input_tables", the names of a subset of input tables from [CONTEXT] section that will be used to achieve the user's goal.
  - determine "output_fields", the desired fields that the output data should have to achieve the user's goal, it's a good idea to include intermediate fields here.
  - then decide "chart_encodings", which maps visualization channels (x, y, color, size, opacity, facet, etc.) to a subset of "output_fields" that will be visualized,
  - the "chart_encodings" should be created to support the user's "chart_type".
@@ -48,7 +49,7 @@
  - e.g., they may mention "use B metric instead" while A metric is in provided fields, in this case, you should update "chart_encodings" to update A metric with B metric.
  - guide on statistical analysis:
  - when the user asks for forecasting or regression analysis, you should consider the following:
- - the output should be a long format table where actual x, y pairs and predicted x, y pairs are included in the X, Y columns, they are differentiated with a third column "is_predicted" that is a boolean field.
+ - the output should be a long format table where actual x, y pairs and predicted x, y pairs are included in the X, Y columns, they are differentiated with a third column "is_predicted".
  - i.e., if the user ask for forecasting based on two columns T and Y, the output should be three columns: T, Y, is_predicted, where
  - T, Y columns contain BOTH original values from the data and predicted values from the data.
  - is_predicted is a boolean field to indicate whether the x, y pairs are original values from the data or predicted / regression values from the data.
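The long-format output described in this hunk could look like the sketch below. It is not taken from the commit: `forecast_long_format` is a hypothetical name, the ordinary least squares line is just one possible forecasting method, and plain lists of dicts stand in for a dataframe so the example has no dependencies.

```python
def forecast_long_format(t_values, y_values, future_t):
    """Fit y = intercept + slope * t and return actual and predicted rows together."""
    n = len(t_values)
    mean_t = sum(t_values) / n
    mean_y = sum(y_values) / n
    # Ordinary least squares estimates for a single predictor.
    sxx = sum((t - mean_t) ** 2 for t in t_values)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(t_values, y_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t

    # Original observations keep is_predicted = False.
    rows = [{"T": t, "Y": y, "is_predicted": False}
            for t, y in zip(t_values, y_values)]
    # Forecast rows share the same T and Y columns and are flagged True.
    rows += [{"T": t, "Y": intercept + slope * t, "is_predicted": True}
             for t in future_t]
    return rows


rows = forecast_long_format([1, 2, 3, 4], [2.0, 4.0, 6.0, 8.0], [5, 6])
```

Keeping actual and predicted values in the same T and Y columns lets a chart distinguish the two series by mapping `is_predicted` to a single channel such as color.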
@@ -65,6 +66,7 @@
  {
  "detailed_instruction": "..." // string, elaborate user instruction with details if the user
  "display_instruction": "..." // string, the short verb phrase describing the users' goal.
+ "input_tables": [...] // string[], describe names of the input tables that will be used in the transformation.
  "output_fields": [...] // string[], describe the desired output fields that the output data should have based on the user's goal, it's a good idea to preserve intermediate fields here (i.e., the goal of transformed data)
  "chart_encodings": {
  "x": "",
@@ -79,8 +81,8 @@
  }
  ```

- 2. Then, write a python function based on the refined goal, the function input is a dataframe "df" (or multiple dataframes based on tables presented in the [CONTEXT] section) and the output is the transformed dataframe "transformed_df". "transformed_df" should contain all "output_fields" from the refined goal.
- The python function must follow the template provided in [TEMPLATE], do not import any other libraries or modify function name. The function should be as simple as possible and easily readable.
+ 2. Then, write a python function based on the refined goal, the function input is a dataframe "df" (or multiple dataframes based on tables described in "input_tables") and the output is the transformed dataframe "transformed_df". "transformed_df" should contain all "output_fields" from the refined goal.
+ The python function must follow the template provided in [TEMPLATE], only import libraries allowed in the template, do not modify function name. The function should be as simple as possible and easily readable.
  If there is no data transformation needed based on "output_fields", the transformation function can simply "return df".
- if the user provided one table, then it should be `def transform_data(df1)`, if the user provided multiple tables, then it should be `def transform_data(df1, df2, ...)` and you should consider the join between tables to derive the output.
- - **VERY IMPORTANT** the number of arguments in the function must match the number of tables provided, and the order of arguments must match the order of tables provided.
- - try to use intuitive table names to refer to the input dataframes, for example, if the user provided two tables city and weather, you can use `transform_data(df_city, df_weather)` to refer to the two dataframes, as long as the number and order of the arguments match the number and order of the tables provided.
+ - decide the function signature based on the number of tables you decided in the previous step "input_tables":
+ - if you decide there will only be one input table, then function signature should be `def transform_data(df1)`
+ - if you decided there will be k input tables, then function signature should be `def transform_data(df_1, df_2, ..., df_k)`.
+ - instead of using generic names like df1, df2, ..., try to use intuitive table names for function arguments, for example, if you have input_tables: ["City", "Weather"]`, you can use `transform_data(df_city, df_weather)` to refer to the two dataframes.
+ - **VERY IMPORTANT** the number of arguments in the function signature must be the same as the number of tables provided in "input_tables", and the order of arguments must match the order of tables provided in "input_tables".
  - datetime objects handling:
  - if the output field is year, convert it to number, if it is year-month / year-month-day, convert it to string object (e.g., "2020-01" / "2020-01-01").
  - if the output is time only: convert hour to number if it's just the hour (e.g., 10), but convert hour:min or h:m:s to string object (e.g., "10:30", "10:30:45")
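The new signature rule ties the argument count and order of `transform_data` to "input_tables". A caller could verify that constraint along these lines. This is a sketch only: `check_signature` is a hypothetical helper that does not appear in the commit, and the `df_<table name>` naming convention it checks is an assumption extrapolated from the `df_city` / `df_weather` example in the prompt.

```python
import inspect
import re


def check_signature(func, input_tables):
    """Return a list of problems; an empty list means the signature is consistent."""
    params = list(inspect.signature(func).parameters)
    if len(params) != len(input_tables):
        return [f"expected {len(input_tables)} arguments, got {len(params)}"]

    problems = []
    for position, (param, table) in enumerate(zip(params, input_tables)):
        # Generic names (df, df1, df_1) carry no table information, so skip them.
        if re.fullmatch(r"df_?\d*", param):
            continue
        # Assumed convention: "City" -> "df_city".
        expected = "df_" + re.sub(r"\W+", "_", table).strip("_").lower()
        if param != expected:
            problems.append(
                f"argument {position} is {param!r}, expected {expected!r} for table {table!r}"
            )
    return problems


def transform_data(df_city, df_weather):
    # Placeholder body; only the signature matters for this check.
    return df_city


ok = check_signature(transform_data, ["City", "Weather"])
swapped = check_signature(transform_data, ["Weather", "City"])
```

The count check is strict, while the order check is only a heuristic, because argument order can be validated by name only when the generated code follows the naming convention.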
@@ -202,6 +206,7 @@ def transform_data(df):

  {
  "detailed_instruction": "Create a scatter plot to compare Seattle and Atlanta temperatures with Seattle temperatures on the x-axis and Atlanta temperatures on the y-axis. Color the points by which city is warmer.",
"reason": "To compare Seattle and Atlanta temperatures with Seattle temperatures on the x-axis and Atlanta temperatures on the y-axis, and color points by which city is warmer, separate temperature fields for Seattle and Atlanta are required. Additionally, a new field 'Warmer City' is needed to indicate which city is warmer."
@@ -212,7 +217,7 @@ def transform_data(df):
  import collections
  import numpy as np

- def transform_data(df):
+ def transform_data(df_weather_seattle_atlanta):
  # Pivot the dataframe to have separate columns for Seattle and Atlanta temperatures