Commit 0017aad

fixes and improvements: agent context, db table import

1 parent a214973 commit 0017aad

26 files changed: 831 additions & 465 deletions

README.md

Lines changed: 1 addition & 0 deletions

@@ -34,6 +34,7 @@ https://github.com/user-attachments/assets/8ca57b68-4d7a-42cb-bcce-43f8b1681ce2
 ## News 🔥🔥🔥
 [01-25-2025] **Data Formulator 0.6** — Explore live data with real-time insights
 - **Connect to live data**: Connect to URLs and databases with automatic refresh intervals. Visualizations update automatically as your data changes to provide you live insights. [Example: Stock market data from Yahoo Finance](https://github.com/microsoft/data-formulator/pull/200#issue-3635408217)
+- 🎨 **UI Updates**: Directly drag-and-drop fields from the data table to update visualization designs.
 
 [12-08-2025] **Data Formulator 0.5.1** — Connect more, visualize more, move faster
 - 🔌 **Community data loaders**: Google BigQuery, MySQL, Postgres, MongoDB
py-src/data_formulator/agent_routes.py

Lines changed: 0 additions & 21 deletions

@@ -31,7 +31,6 @@
 from data_formulator.agents.agent_data_clean import DataCleanAgent
 from data_formulator.agents.agent_data_clean_stream import DataCleanAgentStream
 from data_formulator.agents.agent_code_explanation import CodeExplanationAgent
-from data_formulator.agents.agent_query_completion import QueryCompletionAgent
 from data_formulator.agents.agent_interactive_explore import InteractiveExploreAgent
 from data_formulator.agents.agent_report_gen import ReportGenAgent
 from data_formulator.agents.client_utils import Client

@@ -614,26 +613,6 @@ def request_code_expl():
     else:
         return jsonify({'error': 'Invalid request format'}), 400
 
-@agent_bp.route('/query-completion', methods=['POST'])
-def query_completion():
-    if request.is_json:
-        logger.info("# request data: ")
-        content = request.get_json()
-
-        client = get_client(content['model'])
-
-        data_source_metadata = content["data_source_metadata"]
-        query = content["query"]
-
-        query_completion_agent = QueryCompletionAgent(client=client)
-        reasoning, query = query_completion_agent.run(data_source_metadata, query)
-        response = flask.jsonify({ "token": "", "status": "ok", "reasoning": reasoning, "query": query })
-    else:
-        response = flask.jsonify({ "token": "", "status": "error", "reasoning": "unable to complete query", "query": "" })
-
-    response.headers.add('Access-Control-Allow-Origin', '*')
-    return response
-
 @agent_bp.route('/get-recommendation-questions', methods=['GET', 'POST'])
 def get_recommendation_questions():
     def generate():

py-src/data_formulator/agents/agent_exploration.py

Lines changed: 2 additions & 6 deletions

@@ -6,7 +6,7 @@
 import base64
 
 from data_formulator.agents.agent_utils import extract_json_objects, generate_data_summary
-from data_formulator.agents.agent_sql_data_transform import get_sql_table_statistics_str, sanitize_table_name
+from data_formulator.agents.agent_sql_data_transform import generate_sql_data_summary
 
 logger = logging.getLogger(__name__)

@@ -151,11 +151,7 @@ def get_chart_message(self, visualization):
 
     def get_data_summary(self, input_tables):
         if self.db_conn:
-            data_summary = ""
-            for table in input_tables:
-                table_name = sanitize_table_name(table['name'])
-                table_summary_str = get_sql_table_statistics_str(self.db_conn, table_name)
-                data_summary += f"[TABLE {table_name}]\n\n{table_summary_str}\n\n"
+            data_summary = generate_sql_data_summary(self.db_conn, input_tables)
         else:
             data_summary = generate_data_summary(input_tables)
         return data_summary
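The hunk above collapses the per-table loop into a single call to `generate_sql_data_summary`, whose body lives in `agent_sql_data_transform.py` and is not shown in this diff. As a rough sketch of what such a helper does (names and format are assumptions from the deleted loop; sqlite3 stands in for the project's actual database connection):

```python
import sqlite3

def sanitize_table_name(name: str) -> str:
    # Keep alphanumerics and underscores so the name is safe to interpolate into SQL.
    return "".join(c if c.isalnum() or c == "_" else "_" for c in name)

def generate_sql_data_summary(db_conn, input_tables, table_name_prefix="Table"):
    """Hypothetical sketch: concatenate a small statistics block per input table,
    mirroring the loop the commit replaces. The real helper's output format
    is not visible in this diff."""
    parts = []
    for i, table in enumerate(input_tables, 1):
        table_name = sanitize_table_name(table["name"])
        # Row count and column names queried directly from the database.
        row_count = db_conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
        cols = [d[1] for d in db_conn.execute(f"PRAGMA table_info({table_name})").fetchall()]
        parts.append(
            f"[{table_name_prefix} {i}] {table_name}\n"
            f"columns: {', '.join(cols)}\n"
            f"row count: {row_count}\n"
        )
    return "\n".join(parts)
```

Centralizing this in one helper lets `agent_exploration.py` and `agent_interactive_explore.py` (below) share identical summary formatting.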

py-src/data_formulator/agents/agent_interactive_explore.py

Lines changed: 39 additions & 34 deletions

@@ -6,31 +6,35 @@
 import pandas as pd
 
 from data_formulator.agents.agent_utils import extract_json_objects, generate_data_summary
-from data_formulator.agents.agent_sql_data_transform import get_sql_table_statistics_str, sanitize_table_name
+from data_formulator.agents.agent_sql_data_transform import generate_sql_data_summary
 
 logger = logging.getLogger(__name__)
 
 SYSTEM_PROMPT = '''You are a data exploration expert who suggests interesting questions to help users explore their datasets.
 
-Given a dataset (or a thread of datasets that have been explored), your task is to suggest 4 exploration questions (unless the user explicitly asks for the number of questions), that users can follow to gain insights from their data.
-* the user may provide you current explorations they have done, including:
-    - a thread of exploration questions they have explored
-    - the latest data sample they are viewing
-    - the current chart they are viewing
-* when the exploration context is provided, make your suggestion based on the context as well as the original dataset; otherwise leverage the original dataset to suggest questions.
+This prompt contains the following sections:
+- [DATASETS] section: available datasets the user is working with.
+- [EXPLORATION THREAD] section (optional): sequence of datasets that have been explored in the order they were created, and what questions are asked to create them. These tables are all created from tables in the [DATASETS] section.
+- [CURRENT DATA] section (optional): latest data sample the user is viewing, and the visualization they are looking at at the moment.
+- [START QUESTION] section (optional): start question from previous exploration steps for context
+
+Your task is to suggest 4 exploration questions (unless the user explicitly asks for the number of questions), that users can follow to gain insights from their data.
+When the exploration context is provided, make your suggestion based on the context as well as the original datasets; otherwise leverage the original datasets to suggest questions.
 
 Guidelines for question suggestions:
-1. Suggest interesting analytical questions that are not obvious that can uncover nontrivial insights
-2. Use a diverse language style to display the questions (can be questions, statements etc)
-3. If there are multiple datasets in a thread, consider relationships between them
+1. Suggest interesting analytical questions that can uncover new insights from the data.
+2. Use a diverse language style to display the questions (can be questions, statements etc).
+3. If there are multiple datasets in a thread, consider relationships between them.
 4. CONCISENESS: the questions should be concise and to the point
-5. QUESTION: the question should be a new question based on the thread of exploration:
-    - either a followup question, or a new question that is related to the thread
+5. QUESTION: the question should be a new question based on the exploration thread:
+    - if no exploration thread is provided, start with a high-level overview question that directly visualizes the data to give the user a sense of the data.
+    - either a followup question, or a new question that is related to the exploration thread
     - if the current data is rich, you can ask a followup question to further explore the dataset;
    - if the current data is already specialized to answer the previous question, you can ask a new question that is related to the thread but not related to the previous question in the thread, leverage earlier exploration data to ask questions that can expand the exploration horizon
    - do not repeat questions that have already been explored in the thread
    - do not suggest questions that are not related to the thread (e.g. questions that are completely unrelated to the exploration direction in the thread)
    - do not naively follow up if the question is already too low-level when previous iterations have already come into a small subset of the data (suggest new related areas related to the metric / attributes etc)
+    - leverage other datasets in the [DATASETS] section to suggest questions that are related to the exploration thread.
 6. DIVERSITY: the questions should be diverse in difficulty (easy / medium / hard) and the four questions should cover different aspects of the data analysis to expand the user's horizon
    - simple questions should be short -- single sentence exploratory questions
    - medium questions can be 1-2 sentences exploratory questions

@@ -59,15 +63,17 @@
 
 SYSTEM_PROMPT_AGENT = '''You are a data exploration expert to help users explore their datasets.
+
+This prompt contains the following sections:
+- [DATASETS] section: available datasets the user is working with.
+- [EXPLORATION THREAD] section (optional): sequence of datasets that have been explored in the order they were created, and what questions are asked to create them. These tables are all created from tables in the [DATASETS] section.
+- [CURRENT DATA] section (optional): latest data sample the user is viewing, and the visualization they are looking at at the moment.
+- [START QUESTION] section (optional): start question from previous exploration steps for context
 
 Given a dataset (or a thread of datasets that have been explored), your task is to suggest 4 exploration questions (unless the user explicitly asks for the number of questions), that users can follow to gain insights from their data.
-* the user may provide you current explorations they have done, including:
-    - a thread of exploration questions they have explored
-    - the latest data sample they are viewing
-    - the current chart they are viewing
-* when the exploration context is provided, make your suggestion based on the context as well as the original dataset; otherwise leverage the original dataset to suggest questions.
+When the exploration context is provided, make your suggestion based on the context as well as the original datasets; otherwise leverage the original datasets to suggest questions.
 
 Guidelines for question suggestions:
-1. Suggest a list of question_groups of interesting analytical questions that are not obvious that can uncover nontrivial insights.
+1. Suggest a list of question_groups of interesting analytical questions that can uncover new insights from the data.
 2. Use a diverse language style to display the questions (can be questions, statements etc)
 3. If there are multiple datasets in a thread, consider relationships between them
 4. CONCISENESS: the questions should be concise and to the point

@@ -80,6 +86,7 @@
    - hard questions should introduce some new analysis concept but still make it concise
    - if suitable, include a group of questions that are related to statistical analysis: forecasting, regression, or clustering.
 6. QUESTIONS WITHIN A QUESTION GROUP:
+    - if the user doesn't provide an exploration thread, start with a high-level overview question that directly visualizes the data to give the user a sense of the data.
    - raise new questions that are related to the user's goal, do not repeat questions that have already been explored in the context provided to you.
    - if the user provides a start question, suggested questions should be related to the start question.
    - the questions should progressively dive deeper into the data, building on top of the previous question.

@@ -113,15 +120,11 @@ def __init__(self, client, agent_exploration_rules="", db_conn=None):
         self.agent_exploration_rules = agent_exploration_rules
         self.db_conn = db_conn
 
-    def get_data_summary(self, input_tables):
+    def get_data_summary(self, input_tables, table_name_prefix="Table"):
         if self.db_conn:
-            data_summary = ""
-            for table in input_tables:
-                table_name = sanitize_table_name(table['name'])
-                table_summary_str = get_sql_table_statistics_str(self.db_conn, table_name)
-                data_summary += f"[TABLE {table_name}]\n\n{table_summary_str}\n\n"
+            data_summary = generate_sql_data_summary(self.db_conn, input_tables, table_name_prefix=table_name_prefix)
         else:
-            data_summary = generate_data_summary(input_tables, include_data_samples=False)
+            data_summary = generate_data_summary(input_tables, include_data_samples=False, table_name_prefix=table_name_prefix)
         return data_summary

@@ -144,19 +147,21 @@ def run(self, input_tables, start_question=None, exploration_thread=None,
         data_summary = self.get_data_summary(input_tables)
 
         # Build context including exploration thread if available
-        context = f"[DATASET]\n\n{data_summary}"
+        context = f"[DATASETS] These are the datasets the user is working with:\n\n{data_summary}"
 
         if exploration_thread:
-            thread_summary = "Tables in this exploration thread:\n"
-            for i, table in enumerate(exploration_thread, 1):
-                table_name = table.get('name', f'Table {i}')
-                data_summary = self.get_data_summary([{'name': table_name, 'rows': table.get('rows', [])}])
-                table_description = table.get('description', 'No description available')
-                thread_summary += f"{i}. {table_name}: {table_description} \n\n{data_summary}\n\n"
-            context += f"\n\n[EXPLORATION THREAD]\n\n{thread_summary}"
+            thread_summary = self.get_data_summary(
+                [{
+                    'name': table.get('name', f'Table {i}'),
+                    'rows': table.get('rows', []),
+                    'attached_metadata': table.get('description', ''),
+                } for i, table in enumerate(exploration_thread, 1)],
+                table_name_prefix="Thread Table"
+            )
+            context += f"\n\n[EXPLORATION THREAD] These are the sequence of tables the user created in this exploration thread, in the order they were created, and what questions are asked to create them:\n\n{thread_summary}"
 
         if current_data_sample:
-            context += f"\n\n[CURRENT DATA SAMPLE]\n\n{pd.DataFrame(current_data_sample).head(10).to_string()}"
+            context += f"\n\n[CURRENT DATA SAMPLE] This is the current data sample the user is viewing, and the visualization they are looking at at the moment is shown below:\n\n{pd.DataFrame(current_data_sample).head(10).to_string()}"
 
         if start_question:
             context += f"\n\n[START QUESTION]\n\n{start_question}"
py-src/data_formulator/agents/agent_py_data_rec.py

Lines changed: 2 additions & 0 deletions

@@ -67,6 +67,8 @@
     * the column can either be a column in the input data, or a new column that will be computed in the output data.
     * the mention don't have to be exact match, it can be semantically matching, e.g., if you mentioned "average score" in the text while the column to be computed is "Avg_Score", you should still highlight "**average score**" in the text.
 - determine "input_tables", the names of a subset of input tables from [CONTEXT] section that will be used to achieve the user's goal.
+    - Note that the first table is the table the user is currently viewing, it should take precedence if the user asks for visualization of the "current table".
+    - At the same time, leverage table information to determine which tables are relevant to the user's goal and should be used.
 - "chart_type" must be one of "point", "bar", "line", "area", "heatmap", "group_bar", "boxplot"
 - "chart_encodings" should specify which fields should be used to create the visualization
     - decide which visual channels should be used to create the visualization appropriate for the chart type.

py-src/data_formulator/agents/agent_py_data_transform.py

Lines changed: 2 additions & 0 deletions

@@ -35,6 +35,8 @@
     * the column can either be a column in the input data, or a new column that will be computed in the output data.
     * the mention don't have to be exact match, it can be semantically matching, e.g., if you mentioned "average score" in the text while the column to be computed is "Avg_Score", you should still highlight "**average score**" in the text.
 - determine "input_tables", the names of a subset of input tables from [CONTEXT] section that will be used to achieve the user's goal.
+    - **IMPORTANT** Note that the Table 1 in [CONTEXT] section is the table the user is currently viewing, it should take precedence if the user refers to insights about the "current table".
+    - At the same time, leverage table information to determine which tables are relevant to the user's goal and should be used.
 - determine "output_fields", the desired fields that the output data should have to achieve the user's goal, it's a good idea to include intermediate fields here.
 - then decide "chart_encodings", which maps visualization channels (x, y, color, size, opacity, facet, etc.) to a subset of "output_fields" that will be visualized,
     - the "chart_encodings" should be created to support the user's "chart_type".
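Both prompt edits above give the first (current) table precedence when the user refers to the "current table". On the application side, resolving the agent's chosen `input_tables` against the available tables might look like this hypothetical helper (the function and the literal "current table" alias are illustrative assumptions, not code from this commit):

```python
def resolve_input_tables(requested, available):
    """Map agent-requested table names to available table dicts, treating the
    first available table as the 'current table' the user is viewing.
    Hypothetical helper illustrating the precedence rule added to the prompts."""
    by_name = {t["name"]: t for t in available}
    resolved = []
    for name in requested:
        if name in ("current table", "current_table") and available:
            # Table 1 in [CONTEXT] is the currently viewed table; it takes precedence.
            resolved.append(available[0])
        elif name in by_name:
            resolved.append(by_name[name])
        # Unknown names are silently dropped rather than guessed.
    return resolved
```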

py-src/data_formulator/agents/agent_query_completion.py

Lines changed: 0 additions & 83 deletions
This file was deleted.
