Exploring the Climate Resilience Data Platform: A Survey of Its Assets

This post explores the Climate Resilience Data Platform, showcasing how it sources climate articles, tracks social conversations, and classifies narratives. A behind-the-scenes look at refining data products to reveal societal insights and advance the mission of RepublicOfData.io.

Over the past months, I’ve iterated quickly to build a platform that captures climate-related conversations on social networks. I shared my journey along the way, turning a blind eye to my accumulated tech debt and not blinking twice when confronted with suspicious data quality.

The last few months might not have led to any breakthrough innovation worth reporting on, but they have been busy: refactoring the asset management system, enforcing stricter typing on datasets, productionizing AI agents, and more.

I think it’s now worth taking a step back and examining the platform's data assets. We won’t dive deep into any story yet (I’ll save that for the next post). Instead, we’ll survey the material we now have.

This will be a deep dive into a lot of metadata, so buckle up!

👨‍💻
At RepublicOfData.io, we openly share our code and our journey. Here’s a list of the repositories that are part of this project:

- Climate Resilience Data Platform: the data platform itself.
- Story notebooks: the notebooks to interact with the data.
- Damn Tool: a tool to interact with the platform’s assets.


🏛️ Design

The platform is built around 4 key data products:

- media: The climate-related articles being sourced.
- social_networks: The conversations that refer to the articles above.
- narratives: The classification of conversations and posts into discourse types and narratives.
- analytics: The dimensional representation of those assets for reporting purposes.

Mesh of data products on the Climate Resilience platform.

Here’s a complete list of assets produced by the platform.

import json
from damn_tool.ls import list_assets

result = list_assets(orchestrator='dagster', configs_dir="../.damn/")
data = json.loads(result)

organized_data = {}
for item in data["ls"]:
    product, key = item.split("/", 1)
    if product not in organized_data:
        organized_data[product] = []
    organized_data[product].append(key)

for product, keys in organized_data.items():
    print(f"{product}:")
    for key in keys:
        print(f" - {key}")

Listing assets using the DAMN tool.

analytics:
 - int__media_articles
 - int__social_network_conversations
 - int__social_network_posts
 - int__social_network_user_profiles
 - media_articles_fct
 - social_network_conversations_dim
 - social_network_posts_fct
 - social_network_user_profiles_dim
 - stg__conversation_classifications
 - stg__conversation_event_summaries
 - stg__nytimes_articles
 - stg__post_narrative_associations
 - stg__user_geolocations
 - stg__x_conversation_posts
 - stg__x_conversations
media:
 - nytimes_articles
narratives:
 - conversation_classifications
 - conversation_event_summary
 - post_narrative_associations
social_networks:
 - user_geolocations
 - x_conversation_posts
 - x_conversations

Listing of assets from the Climate Resilience data platform.


📊 Data Assets

Let’s now dive into each of those assets.

Media Assets

For now, we are only monitoring the New York Times for climate-related articles.

Media data product's assets.
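The sourcing itself is just a scheduled feed pull. Here’s a minimal sketch of what that step might look like, assuming the NYT climate RSS feed URL and the field mapping shown in the sample below; it’s an illustration, not the platform’s actual ingestion code.

import feedparser

# Hypothetical feed URL; the platform's real source configuration may differ.
NYT_CLIMATE_RSS = "https://rss.nytimes.com/services/xml/rss/nyt/Climate.xml"

def fetch_climate_articles(feed_url: str = NYT_CLIMATE_RSS) -> list[dict]:
    """Pull the feed and map entries onto columns similar to media.nytimes_articles."""
    feed = feedparser.parse(feed_url)
    articles = []
    for entry in feed.entries:
        articles.append({
            "media": "nytimes",
            "title": entry.get("title"),
            "link": entry.get("link"),
            "summary": entry.get("summary"),
            "author": entry.get("author"),
            "tags": ",".join(tag["term"] for tag in entry.get("tags", [])),
            "published_ts": entry.get("published"),
        })
    return articles

for article in fetch_climate_articles()[:3]:
    print(article["published_ts"], article["title"])

Hypothetical sketch of sourcing articles from an RSS feed.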

Let’s get a selection of the attributes from the media assets.

import json
from damn_tool.show import show_asset

result = show_asset('media/nytimes_articles', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract and print the desired attributes
asset_details = data["show"]
orchestrator_details = asset_details["From orchestrator"]
warehouse_details = asset_details["From data warehouse"]

# Combine attributes into a single string
output = (
    f"- Description: {orchestrator_details['description']}\n"
    f"- Compute kind: {orchestrator_details['computeKind']}\n"
    f"- Is partitioned: {orchestrator_details['isPartitioned']}\n"
    f"- Table type: {warehouse_details['table_type']}\n"
    f"- Table schema: {warehouse_details['table_schema']}\n"
    f"- Created: {warehouse_details['created']}"
)

# Print the final string
print(output)

Getting attributes from the nytimes_articles asset.

- Description: Media feed for New York Times
- Compute kind: python
- Is partitioned: True
- Table type: BASE TABLE
- Table schema: media
- Created: 2024-12-12T18:44:11.473000+00:00

Attributes from the nytimes_articles asset.

And some metrics:

import json
from damn_tool.metrics import asset_metrics

result = asset_metrics('media/nytimes_articles', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract relevant metrics
orchestrator_metrics = data["metrics"]["From orchestrator"]
dw_metrics = data["metrics"]["From data warehouse"]

# Combine metrics into a single string
output = (
    f"- Row count: {dw_metrics['row_count']}\n"
    f"- Size: {dw_metrics['bytes']}\n"
    f"- Last run id: {orchestrator_metrics['run_id']}\n"
    f"- Last run status: {orchestrator_metrics['status']}\n"
    f"- Last start time: {orchestrator_metrics['start_time']}\n"
    f"- Last end time: {orchestrator_metrics['end_time']}\n"
    f"- Number of partitions: {orchestrator_metrics['num_partitions']}\n"
    f"- Number of partitions materialized: {orchestrator_metrics['num_materialized']}\n"
    f"- Number of failed partitions: {orchestrator_metrics['num_failed']}"
)

# Print the final string
print(output)

Getting metrics from the nytimes_articles asset.

- Row count: 894
- Size: 649.96 KB
- Last run id: e6cf5f89-f14d-4623-9496-82b52efe79d1
- Last run status: SUCCESS
- Last start time: 2024-12-24 10:01:45
- Last end time: 2024-12-24 10:01:54
- Number of partitions: 2275
- Number of partitions materialized: 291
- Number of failed partitions: 0

Metrics from the nytimes_articles asset.

Let’s pull the data from this asset:

import pyarrow.feather as feather
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    select *
    from `phonic-biplane-420020.media.nytimes_articles`
    order by published_ts desc
    limit 5"""

# Make an API request.
query_job = client.query(query)

# Wait for the job to complete.
articles_df = query_job.result().to_dataframe()

# Write the DataFrame to a Feather file
feather_file = "data/articles.feather"
feather.write_feather(articles_df, feather_file)

# Pass the file path to the global namespace
globals()['articles_feather'] = feather_file

Getting a data sample from the nytimes_articles asset.

library(arrow)
library(knitr)

# Get the Feather file path from Python
feather_file <- reticulate::py$articles_feather

# Read the Feather file
articles_df <- read_feather(feather_file)

# Display the table
kable(
    articles_df,
    format = "html",
    escape = TRUE   
)

Displaying data sample from the nytimes_articles asset.

MEDIA ID TITLE LINK SUMMARY AUTHOR TAGS MEDIAS PUBLISHED_TS
nytimes https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html Trump’s Wish to Control Greenland and Panama Canal: Not a Joke This Time https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html In recent days the president-elect has called for asserting U.S. control over the Panama Canal and Greenland, showing that his “America First” philosophy has an expansionist dimension. David E. Sanger and Lisa Friedman United States International Relations,United States Politics and Government,Trump, Donald J,Greenland,Panama Canal and Canal Zone,Presidential Election of 2024,Global Warming,Metals and Minerals,Ships and Shipping,Egede, Mute B,Arctic Regions,Denmark,Panama https://static01.nyt.com/images/2024/12/23/multimedia/23DC-TRUMP-mpcg/23DC-TRUMP-mpcg-mediumSquareAt3X.jpg 2024-12-23 22:08:12
nytimes https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html Trump’s Wish to Control Greenland and Panama Canal: Not a Joke This Time https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html In recent days the president-elect has called for asserting U.S. control over the Panama Canal and Greenland, showing that his “America First” philosophy has an expansionist dimension. David E. Sanger and Lisa Friedman United States International Relations,United States Politics and Government,Trump, Donald J,Greenland,Panama Canal and Canal Zone,Presidential Election of 2024,Global Warming,Metals and Minerals,Ships and Shipping,Egede, Mute B,Arctic Regions,Denmark,Panama https://static01.nyt.com/images/2024/12/23/multimedia/23DC-TRUMP-mpcg/23DC-TRUMP-mpcg-mediumSquareAt3X.jpg 2024-12-23 21:08:14
nytimes https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html Trump’s Wish to Control Greenland and Panama Canal: Not a Joke This Time https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html In recent days the president-elect has called for asserting U.S. control over the Panama Canal and Greenland, showing that his “America First” philosophy has an expansionist dimension. David E. Sanger and Lisa Friedman United States International Relations,United States Politics and Government,Trump, Donald J,Greenland,Panama Canal and Canal Zone,Presidential Election of 2024,Global Warming,Metals and Minerals,Ships and Shipping,Egede, Mute B,Arctic Regions,Denmark,Panama https://static01.nyt.com/images/2024/12/23/multimedia/23DC-TRUMP-mpcg/23DC-TRUMP-mpcg-mediumSquareAt3X.jpg 2024-12-23 18:20:01
nytimes https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html Trump’s Wish to Control Greenland and Panama Canal: Not a Joke This Time https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html In recent days the president-elect has called for asserting U.S. control over the Panama Canal and Greenland, showing that his “America First” philosophy has an expansionist dimension. David E. Sanger and Lisa Friedman United States International Relations,United States Politics and Government,Trump, Donald J,Greenland,Panama Canal and Canal Zone,Presidential Election of 2024,Global Warming,Metals and Minerals,Ships and Shipping,Egede, Mute B,Arctic Regions,Denmark,Panama https://static01.nyt.com/images/2024/12/23/multimedia/23DC-TRUMP-mpcg/23DC-TRUMP-mpcg-mediumSquareAt3X.jpg 2024-12-23 17:10:55
nytimes https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html Trump’s Wish to Control Greenland and Panama Canal: Not a Joke This Time https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html In recent days the president-elect has called for asserting U.S. control over the Panama Canal and Greenland, showing that his “America First” philosophy has an expansionist dimension. David E. Sanger and Lisa Friedman United States International Relations,United States Politics and Government,Trump, Donald J,Greenland,Panama Canal and Canal Zone,Presidential Election of 2024,Global Warming,Metals and Minerals,Ships and Shipping,Egede, Mute B,Arctic Regions,Denmark,Panama https://static01.nyt.com/images/2024/12/23/multimedia/23DC-TRUMP-mpcg/23DC-TRUMP-mpcg-mediumSquareAt3X.jpg 2024-12-23 16:20:18

Articles are the foundation of the platform. Once climate-related articles are pulled, the platform monitors social networks and captures conversations that refer to them.

Social Networks Assets

We then monitor conversations on social networks (for now, only X) that refer to the articles sourced.

Social network data product's assets.
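Under the hood, conversation roots can be found by searching for posts that link to a sourced article. The sketch below is a hypothetical illustration using the X API v2 recent-search endpoint and its url: operator; the field selection mirrors the columns shown later and is an assumption, not the platform’s actual code.

import os
import requests

SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"

def search_posts_for_article(article_url: str, bearer_token: str) -> dict:
    """Search recent X posts that link to a given article (illustrative only)."""
    params = {
        # url: matches posts containing the link; -is:retweet keeps original posts only
        "query": f'url:"{article_url}" -is:retweet',
        "max_results": 100,
        "tweet.fields": "created_at,conversation_id,text,author_id",
        "expansions": "author_id",
        "user.fields": "username,location,description,created_at",
    }
    headers = {"Authorization": f"Bearer {bearer_token}"}
    response = requests.get(SEARCH_URL, headers=headers, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

token = os.environ["X_BEARER_TOKEN"]  # assumed environment variable
results = search_posts_for_article(
    "https://www.nytimes.com/2024/12/20/well/cold-deaths-health.html", token
)
for tweet in results.get("data", []):
    print(tweet["created_at"], tweet["text"][:60])

Hypothetical sketch of finding X conversations that reference an article.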

Let’s now get some details on conversations that occur on X.

import json
from damn_tool.show import show_asset

result = show_asset('social_networks/x_conversations', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract and print the desired attributes
asset_details = data["show"]
orchestrator_details = asset_details["From orchestrator"]
warehouse_details = asset_details["From data warehouse"]

# Combine attributes into a single string
output = (
    f"- Description: {orchestrator_details['description']}\n"
    f"- Compute kind: {orchestrator_details['computeKind']}\n"
    f"- Is partitioned: {orchestrator_details['isPartitioned']}\n"
    f"- Table type: {warehouse_details['table_type']}\n"
    f"- Table schema: {warehouse_details['table_schema']}\n"
    f"- Created: {warehouse_details['created']}"
)

# Print the final string
print(output)

Getting attributes from the x_conversations asset.

- Description: X conversations that mention this partition's article
- Compute kind: python
- Is partitioned: True
- Table type: BASE TABLE
- Table schema: social_networks
- Created: 2024-12-12T18:44:47.015000+00:00

Attributes from the x_conversations asset.

Some metrics:

import json
from damn_tool.metrics import asset_metrics

result = asset_metrics('social_networks/x_conversations', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract relevant metrics
orchestrator_metrics = data["metrics"]["From orchestrator"]
dw_metrics = data["metrics"]["From data warehouse"]

# Combine metrics into a single string
output = (
    f"- Row count: {dw_metrics['row_count']}\n"
    f"- Size: {dw_metrics['bytes']}\n"
    f"- Last run id: {orchestrator_metrics['run_id']}\n"
    f"- Last run status: {orchestrator_metrics['status']}\n"
    f"- Last start time: {orchestrator_metrics['start_time']}\n"
    f"- Last end time: {orchestrator_metrics['end_time']}\n"
    f"- Number of partitions: {orchestrator_metrics['num_partitions']}\n"
    f"- Number of partitions materialized: {orchestrator_metrics['num_materialized']}\n"
    f"- Number of failed partitions: {orchestrator_metrics['num_failed']}"
)

# Print the final string
print(output)

Getting metrics from the x_conversations asset.

- Row count: 8203
- Size: 5.51 MB
- Last run id: 84dd166d-a39c-4574-9b3d-f578b2cb9de3
- Last run status: SUCCESS
- Last start time: 2024-12-24 10:11:44
- Last end time: 2024-12-24 10:11:51
- Number of partitions: 2275
- Number of partitions materialized: 291
- Number of failed partitions: 0

Metrics from the x_conversations asset.

Let’s pull some conversations:

from google.cloud import bigquery
import pyarrow.feather as feather

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    select * except (tweet_public_metrics, author_public_metrics)
    from `social_networks.x_conversations`
    limit 5"""

# Make an API request.
query_job = client.query(query)

# Wait for the job to complete.
conversations_df = query_job.result().to_dataframe()

# Write the DataFrame to a Feather file
feather_file = "data/x_conversations.feather"
feather.write_feather(conversations_df, feather_file)

# Pass the file path to the global namespace
globals()['x_conversations_feather'] = feather_file

Getting a data sample from the x_conversations asset.

library(arrow)
library(dplyr)
library(knitr)

# Get the Feather file path from Python
feather_file <- reticulate::py$x_conversations_feather

# Read the Feather file
conversations_df <- read_feather(feather_file)
conversations_df <- conversations_df %>%
  mutate(
    TWEET_TEXT = substr(TWEET_TEXT, 1, 50)
  )

# Display the table
kable(
    conversations_df,
    format = "html",    # Ensures better rendering in Quarto
    escape = TRUE   
)

Displaying data sample from the x_conversations asset.

ARTICLE_URL TWEET_ID TWEET_CREATED_AT TWEET_CONVERSATION_ID TWEET_TEXT AUTHOR_ID AUTHOR_USERNAME AUTHOR_LOCATION AUTHOR_DESCRIPTION AUTHOR_CREATED_AT PARTITION_HOUR_UTC_TS RECORD_LOADING_TS
https://www.nytimes.com/2024/12/20/well/cold-deaths-health.html 1870194128631541938 2024-12-20 14:46:41 1870194128631541938 "Even though we are in this warming world, cold-re 759094814897999873 SmithBIDMC Boston, MA Official account of the Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology @BIDMChealth @BIDMCCVI. RTs ≠ Endorsements 2016-07-29 14:34:41 2024-12-20 14:00:00 2024-12-20 15:11:47
https://www.nytimes.com/2024/12/20/well/cold-deaths-health.html 1870186321907724544 2024-12-20 14:15:40 1870186321907724544 Check out @nytimes piece by @emilyschmall on our @ 860518198277545985 rkwadhera Boston, MA Associate Professor @HarvardMed | Cardiologist @BIDMCHealth | Associate Director @SmithBIDMC | Health Policy and Outcomes Research | 2017-05-05 11:35:00 2024-12-20 14:00:00 2024-12-20 15:11:47
https://www.nytimes.com/2024/12/20/well/cold-deaths-health.html 1870188986104254654 2024-12-20 14:26:15 1870188986104254654 Try North America! 🌎 https://t.co/JgBZQUSMYR 879378417694670848 handsoitgoes Quebec??=🇨🇦 or 🇫🇷 ?? "🇨🇦/🇺🇸 border kid. Ex-M. SJTO/TJSO (Social Justice Tribunals of Ontario/Tribunaux de justice sociale de l'Ontario), MLS, BA McGill "Anthroapology" 2017-06-26 12:38:47 2024-12-20 14:00:00 2024-12-20 15:11:47
https://www.nytimes.com/2024/12/20/well/cold-deaths-health.html 1870304020801237067 2024-12-20 22:03:21 1870304020801237067 pretty sure it's the homelessness above all https: 1034174877488570370 jhv85 Brooklyn, NY Writer/researcher specializing in American political development, political economy, party systems and ideology, social democracy. Columnist @compactmag_ 2018-08-27 16:24:39 2024-12-20 22:00:00 2024-12-20 23:11:37
https://www.nytimes.com/2024/12/20/well/cold-deaths-health.html 1870328626916528572 2024-12-20 23:41:08 1870328626916528572 #NYT1000本斬り 1428/2000 More People Are Now Dying Fr 842975270168477696 carina38925315 英語を教えて30年以上。今年は日本語教育能力検定試験と日本語教員試験受験。検定試験のみ合格。日本語教師として働く道を模索中。教員試験は来年合格したい。ハングル、イタリア語も勉強中。英検1級、英語通訳案内士、TOEICLR満点SW180/200 https://t.co/k9jjQFhk4o 2017-03-18 01:45:40 2024-12-20 23:00:00 2024-12-21 00:11:44

We then continue monitoring those conversations for 24 hours to capture all their posts. Here are some details on that asset:

import json
from damn_tool.show import show_asset

result = show_asset('social_networks/x_conversation_posts', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract and print the desired attributes
asset_details = data["show"]
orchestrator_details = asset_details["From orchestrator"]
warehouse_details = asset_details["From data warehouse"]

# Combine attributes into a single string
output = (
    f"- Description: {orchestrator_details['description']}\n"
    f"- Compute kind: {orchestrator_details['computeKind']}\n"
    f"- Is partitioned: {orchestrator_details['isPartitioned']}\n"
    f"- Table type: {warehouse_details['table_type']}\n"
    f"- Table schema: {warehouse_details['table_schema']}\n"
    f"- Created: {warehouse_details['created']}"
)

# Print the final string
print(output)

Getting attributes from the x_conversation_posts asset.

- Description: Posts within X conversations
- Compute kind: python
- Is partitioned: True
- Table type: BASE TABLE
- Table schema: social_networks
- Created: 2024-12-12T18:44:28.374000+00:00

Attributes from the x_conversation_posts asset.

Some metrics:

import json
from damn_tool.metrics import asset_metrics

result = asset_metrics('social_networks/x_conversation_posts', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract relevant metrics
orchestrator_metrics = data["metrics"]["From orchestrator"]
dw_metrics = data["metrics"]["From data warehouse"]

# Combine metrics into a single string
output = (
    f"- Row count: {dw_metrics['row_count']}\n"
    f"- Size: {dw_metrics['bytes']}\n"
    f"- Last run id: {orchestrator_metrics['run_id']}\n"
    f"- Last run status: {orchestrator_metrics['status']}\n"
    f"- Last start time: {orchestrator_metrics['start_time']}\n"
    f"- Last end time: {orchestrator_metrics['end_time']}\n"
    f"- Number of partitions: {orchestrator_metrics['num_partitions']}\n"
    f"- Number of partitions materialized: {orchestrator_metrics['num_materialized']}\n"
    f"- Number of failed partitions: {orchestrator_metrics['num_failed']}"
)

# Print the final string
print(output)

Getting metrics from the x_conversation_posts asset.

- Row count: 4415
- Size: 2.38 MB
- Last run id: 63a9fe77-8ed3-4c77-b59c-b74efe609713
- Last run status: SUCCESS
- Last start time: 2024-12-24 10:21:45
- Last end time: 2024-12-24 10:21:57
- Number of partitions: 758
- Number of partitions materialized: 100
- Number of failed partitions: 0

Metrics from the x_conversation_posts asset.

Let’s get some posts:

from google.cloud import bigquery
import pyarrow.feather as feather

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    select * except (article_url, tweet_public_metrics, author_public_metrics)
    from `social_networks.x_conversation_posts`
    limit 5"""

# Make an API request.
query_job = client.query(query)

# Wait for the job to complete.
conversation_posts_df = query_job.result().to_dataframe()

# Write the DataFrame to a Feather file
feather_file = "data/x_conversation_posts.feather"
feather.write_feather(conversation_posts_df, feather_file)

# Pass the file path to the global namespace
globals()['x_conversation_posts_feather'] = feather_file

Getting a data sample from the x_conversation_posts asset.

library(arrow)
library(dplyr)
library(knitr)

# Get the Feather file path from Python
feather_file <- reticulate::py$x_conversation_posts_feather

# Read the Feather file
conversation_posts_df <- read_feather(feather_file)
conversation_posts_df <- conversation_posts_df %>%
  mutate(
    TWEET_TEXT = substr(TWEET_TEXT, 1, 50)
  )

# Display the table
kable(
    conversation_posts_df,
    format = "html",    # Ensures better rendering in Quarto
    escape = TRUE   
)

Displaying the data sample from the x_conversation_posts asset.

TWEET_ID TWEET_CREATED_AT TWEET_CONVERSATION_ID TWEET_TEXT AUTHOR_ID AUTHOR_USERNAME AUTHOR_LOCATION AUTHOR_DESCRIPTION AUTHOR_CREATED_AT PARTITION_HOUR_UTC_TS RECORD_LOADING_TS
1871177534823383120 2024-12-23 07:54:23 1871176675834417432 @Matthuber78 They are just anti-progress. Unless t 199430225 HibbertMatthew Tavistock, Devon Renegade-redhead soixante-sixard. Broadly Liberal but I like boundaries. Blog a bit. 2010-10-06 17:09:24 2024-12-23 07:00:00 2024-12-23 13:21:37
1871178666358501848 2024-12-23 07:58:53 1871176675834417432 @collectifission Regardless of how you model the f 352833079 Matthuber78 Syracuse, NY Geographer, Lifeblood (2013) @UMinnPress, Climate Change as Class War (2022) @VersoBooks https://t.co/OgdpkbYLz3 2011-08-11 00:08:33 2024-12-23 07:00:00 2024-12-23 13:21:37
1871182808481476922 2024-12-23 08:15:21 1871176675834417432 @Matthuber78 Grinding up olivine is possibly the b 1483186552587132932 collectifission All about energy and what that means for people, in relation with the rest of the world. With technology we can all live well, with room for nature to flourish. 2022-01-17 16:17:24 2024-12-23 07:00:00 2024-12-23 13:21:37
1871177492033372210 2024-12-23 07:54:13 1871176675834417432 @Matthuber78 Honestly though: I know the IPCC mode 1483186552587132932 collectifission All about energy and what that means for people, in relation with the rest of the world. With technology we can all live well, with room for nature to flourish. 2022-01-17 16:17:24 2024-12-23 07:00:00 2024-12-23 13:21:37
1871297366130934184 2024-12-23 15:50:33 1871287900257808710 @mzjacobson @nytimes Carbon capture can reduce emi 1574113788344664064 ClimateSageO Amman تحويل سياسة المناخ إلى ممارسة #محارب_المناخ 2022-09-25 15:09:08 2024-12-23 13:00:00 2024-12-23 19:21:43
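These follow-up posts come from re-querying each conversation during its first 24 hours. As a rough, hypothetical sketch, the same recent-search endpoint used above can be filtered with the conversation_id: operator; again, this is an illustration rather than the platform’s actual code.

import os
import requests

SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"

def fetch_conversation_posts(conversation_id: str, bearer_token: str) -> list[dict]:
    """Fetch recent posts belonging to one conversation (illustrative only)."""
    params = {
        "query": f"conversation_id:{conversation_id}",
        "max_results": 100,
        "tweet.fields": "created_at,conversation_id,text,author_id",
    }
    headers = {"Authorization": f"Bearer {bearer_token}"}
    response = requests.get(SEARCH_URL, headers=headers, params=params, timeout=30)
    response.raise_for_status()
    return response.json().get("data", [])

token = os.environ["X_BEARER_TOKEN"]  # assumed environment variable
for post in fetch_conversation_posts("1871176675834417432", token):
    print(post["created_at"], post["text"][:60])

Hypothetical sketch of capturing the posts within a conversation.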

Finally, we geolocate the users who take part in those conversations. Some details:

import json
from damn_tool.show import show_asset

result = show_asset('social_networks/user_geolocations', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract and print the desired attributes
asset_details = data["show"]
orchestrator_details = asset_details["From orchestrator"]
warehouse_details = asset_details["From data warehouse"]

# Combine attributes into a single string
output = (
    f"- Description: {orchestrator_details['description']}\n"
    f"- Compute kind: {orchestrator_details['computeKind']}\n"
    f"- Is partitioned: {orchestrator_details['isPartitioned']}\n"
    f"- Table type: {warehouse_details['table_type']}\n"
    f"- Table schema: {warehouse_details['table_schema']}\n"
    f"- Created: {warehouse_details['created']}"
)

# Print the final string
print(output)

Getting attributes from the `user_geolocations` asset.

- Description: Geolocation of social network user's profile location
- Compute kind: python
- Is partitioned: True
- Table type: BASE TABLE
- Table schema: social_networks
- Created: 2024-12-12T18:46:22.036000+00:00

Attributes from the `user_geolocations` asset.

And some metrics:

import json
from damn_tool.metrics import asset_metrics

result = asset_metrics('social_networks/user_geolocations', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract relevant metrics
orchestrator_metrics = data["metrics"]["From orchestrator"]
dw_metrics = data["metrics"]["From data warehouse"]

# Combine metrics into a single string
output = (
    f"- Row count: {dw_metrics['row_count']}\n"
    f"- Size: {dw_metrics['bytes']}\n"
    f"- Last run id: {orchestrator_metrics['run_id']}\n"
    f"- Last run status: {orchestrator_metrics['status']}\n"
    f"- Last start time: {orchestrator_metrics['start_time']}\n"
    f"- Last end time: {orchestrator_metrics['end_time']}\n"
    f"- Number of partitions: {orchestrator_metrics['num_partitions']}\n"
    f"- Number of partitions materialized: {orchestrator_metrics['num_materialized']}\n"
    f"- Number of failed partitions: {orchestrator_metrics['num_failed']}"
)

# Print the final string
print(output)

Getting metrics from the `user_geolocations` asset.

- Row count: 4853
- Size: 523.74 KB
- Last run id: 63a9fe77-8ed3-4c77-b59c-b74efe609713
- Last run status: SUCCESS
- Last start time: 2024-12-24 10:22:06
- Last end time: 2024-12-24 10:22:21
- Number of partitions: 758
- Number of partitions materialized: 74
- Number of failed partitions: 0

Metrics from the `user_geolocations` asset.

Let’s get some geolocated user records:

from google.cloud import bigquery
import pyarrow.feather as feather

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    select * from `social_networks.user_geolocations`
    order by geolocation_ts desc
    limit 5"""

# Make an API request.
query_job = client.query(query)

# Wait for the job to complete.
user_geolocations_df = query_job.result().to_dataframe()

# Write the DataFrame to a Feather file
feather_file = "data/user_geolocations.feather"
feather.write_feather(user_geolocations_df, feather_file)

# Pass the file path to the global namespace
globals()['user_geolocations_feather'] = feather_file

Getting a data sample from the `user_geolocations` asset.

library(arrow)
library(knitr)

# Get the Feather file path from Python
feather_file <- reticulate::py$user_geolocations_feather

# Read the Feather file
user_geolocations_df <- read_feather(feather_file)

# Display the table
kable(
    user_geolocations_df,
    format = "html",    # Ensures better rendering in Quarto
    escape = TRUE   
)

Displaying the data sample from the `user_geolocations` asset.

SOCIAL_NETWORK_PROFILE_ID SOCIAL_NETWORK_PROFILE_USERNAME LOCATION_ORDER LOCATION COUNTRYNAME COUNTRYCODE ADMINNAME1 ADMINCODE1 LATITUDE LONGITUDE GEOLOCATION_TS PARTITION_HOUR_UTC_TS
1478403674569428998 babsi202 0 Netherlands The Netherlands NL 00 52.25 5.75 2024-12-24 10:22:13 2024-12-24 04:00:00
404992844 TomBauser 0 Frankfurt Germany DE Hesse 05 50.11552 8.68417 2024-12-24 10:22:13 2024-12-24 04:00:00
903288533259083778 DemProud 0 United States -14.60485 -57.65625 2024-12-24 10:22:12 2024-12-24 04:00:00
520658313 CharlesHAllison 0 New York City United States US New York NY 40.71427 -74.00597 2024-12-24 10:22:12 2024-12-24 04:00:00
2786988640 EricLebedel 0 Paris France FR Île-de-France 11 48.85341 2.3488 2024-12-24 10:22:11 2024-12-24 04:00:00
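The countryName, adminName1, and adminCode1 columns resemble GeoNames fields, so here’s a hypothetical sketch of how a free-text profile location could be resolved with the GeoNames search API. The username and field mapping are assumptions, not the platform’s actual geolocation code.

import requests

GEONAMES_SEARCH_URL = "http://api.geonames.org/searchJSON"

def geolocate_profile_location(location_text: str, username: str) -> dict | None:
    """Resolve a free-text profile location to a place via GeoNames (illustrative only)."""
    params = {
        "q": location_text,
        "maxRows": 1,          # keep only the best match
        "username": username,  # a registered GeoNames account name (assumed)
    }
    response = requests.get(GEONAMES_SEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    matches = response.json().get("geonames", [])
    if not matches:
        return None
    top = matches[0]
    return {
        "location": location_text,
        "country_name": top.get("countryName"),
        "country_code": top.get("countryCode"),
        "admin_name_1": top.get("adminName1"),
        "admin_code_1": top.get("adminCode1"),
        "latitude": top.get("lat"),
        "longitude": top.get("lng"),
    }

print(geolocate_profile_location("Boston, MA", username="demo"))

Hypothetical sketch of geolocating a profile location with GeoNames.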

Narratives Assets

Now that we have the conversations and posts, we classify them into narratives and discourse types.

🤖
The platform performs many of its data transformations with the help of AI agents. I’ve covered those extensively elsewhere, so here are a few resources to learn more:

- Blog post on how I’m orchestrating AI agents
- A deep-dive webinar I did on the topic with the good folks at Dagster
- An agent’s definition in the platform’s codebase
- The agent’s invocation in the platform’s codebase
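To give a flavour of what such an agent looks like, here’s a stripped-down, hypothetical sketch of a conversation classifier built with LangChain structured output. The model choice, prompt, and schema are assumptions; the platform’s real agent definitions live in the codebase linked above.

from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

class ConversationClassification(BaseModel):
    """Structured output returned by the classifier."""
    classification: bool = Field(
        description="True if the conversation discusses climate-related topics"
    )

prompt = ChatPromptTemplate.from_messages([
    ("system", "You classify social media conversations as climate-related or not."),
    ("human", "Article: {article_summary}\n\nConversation: {conversation_text}"),
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # assumed model choice
classifier = prompt | llm.with_structured_output(ConversationClassification)

result = classifier.invoke({
    "article_summary": "NYT piece on cold-related deaths in a warming climate.",
    "conversation_text": "Even though we are in this warming world, cold-related deaths keep rising...",
})
print(result.classification)

Hypothetical sketch of a LangChain-based conversation classifier.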

Let’s first have a look at the lineage between those narrative assets:

Narrative data product's assets.

We start by classifying conversations according to whether they discuss climate-related topics. Let’s see some attributes:

import json
from damn_tool.show import show_asset

result = show_asset('narratives/conversation_classifications', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract and print the desired attributes
asset_details = data["show"]
orchestrator_details = asset_details["From orchestrator"]
warehouse_details = asset_details["From data warehouse"]

# Combine attributes into a single string
output = (
    f"- Description: {orchestrator_details['description']}\n"
    f"- Compute kind: {orchestrator_details['computeKind']}\n"
    f"- Is partitioned: {orchestrator_details['isPartitioned']}\n"
    f"- Table type: {warehouse_details['table_type']}\n"
    f"- Table schema: {warehouse_details['table_schema']}\n"
    f"- Created: {warehouse_details['created']}"
)

# Print the final string
print(output)

Getting attributes from the `conversation_classifications` asset.

- Description: Classification of conversations as climate-related or not
- Compute kind: LangChain
- Is partitioned: True
- Table type: BASE TABLE
- Table schema: narratives
- Created: 2024-12-12T18:45:28.708000+00:00

Attributes from the `conversation_classifications` asset.

Let’s also have a look at some of its metrics:

import json
from damn_tool.metrics import asset_metrics

result = asset_metrics('narratives/conversation_classifications', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract relevant metrics
orchestrator_metrics = data["metrics"]["From orchestrator"]
dw_metrics = data["metrics"]["From data warehouse"]

# Combine metrics into a single string
output = (
    f"- Row count: {dw_metrics['row_count']}\n"
    f"- Size: {dw_metrics['bytes']}\n"
    f"- Last run id: {orchestrator_metrics['run_id']}\n"
    f"- Last run status: {orchestrator_metrics['status']}\n"
    f"- Last start time: {orchestrator_metrics['start_time']}\n"
    f"- Last end time: {orchestrator_metrics['end_time']}\n"
    f"- Number of partitions: {orchestrator_metrics['num_partitions']}\n"
    f"- Number of partitions materialized: {orchestrator_metrics['num_materialized']}\n"
    f"- Number of failed partitions: {orchestrator_metrics['num_failed']}"
)

# Print the final string
print(output)

Getting metrics from the `conversation_classifications` asset.

- Row count: 8127
- Size: 280.32 KB
- Last run id: 0346fbc5-54f5-46a1-8e7a-18299ef93004
- Last run status: SUCCESS
- Last start time: 2024-12-24 07:46:36
- Last end time: 2024-12-24 07:47:09
- Number of partitions: 758
- Number of partitions materialized: 98
- Number of failed partitions: 0

Metrics from the `conversation_classifications` asset.

Let’s get a sample of data for those classifications:

from google.cloud import bigquery
import pyarrow.feather as feather

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    select * from `dev_narratives.conversation_classifications`
    order by partition_time desc
    limit 5"""

# Make an API request.
query_job = client.query(query)

# Wait for the job to complete.
conversation_classifications_df = query_job.result().to_dataframe()

# Write the DataFrame to a Feather file
feather_file = "data/conversation_classifications.feather"
feather.write_feather(conversation_classifications_df, feather_file)

# Pass the file path to the global namespace
globals()['conversation_classifications_feather'] = feather_file

Getting a data sample from the `conversation_classifications` asset.

library(arrow)
library(knitr)

# Get the Feather file path from Python
feather_file <- reticulate::py$conversation_classifications_feather

# Read the Feather file
conversation_classifications_df <- read_feather(feather_file)

# Display the table
kable(
    conversation_classifications_df,
    format = "html",    # Ensures better rendering in Quarto
    escape = TRUE   
)

Displaying the data sample from the `conversation_classifications` asset.

CONVERSATION_ID CLASSIFICATION PARTITION_TIME
1867074284352393331 True 2024-12-12 10:00:00
1867050503504318545 False 2024-12-12 10:00:00
1867073426214519260 True 2024-12-12 10:00:00
1867075620628300041 True 2024-12-12 10:00:00
1867073189039276360 True 2024-12-12 10:00:00

Then we summarize the events being discussed in those conversations.

import json
from damn_tool.show import show_asset

result = show_asset('narratives/conversation_event_summary', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract and print the desired attributes
asset_details = data["show"]
orchestrator_details = asset_details["From orchestrator"]
warehouse_details = asset_details["From data warehouse"]

# Combine attributes into a single string
output = (
    f"- Description: {orchestrator_details['description']}\n"
    f"- Compute kind: {orchestrator_details['computeKind']}\n"
    f"- Is partitioned: {orchestrator_details['isPartitioned']}\n"
    f"- Table type: {warehouse_details['table_type']}\n"
    f"- Table schema: {warehouse_details['table_schema']}\n"
    f"- Created: {warehouse_details['created']}"
)

# Print the final string
print(output)

Getting attributes from the `conversation_event_summary` asset.

- Description: Summary of the event discussed in a conversation
- Compute kind: LangChain
- Is partitioned: True
- Table type: BASE TABLE
- Table schema: narratives
- Created: 2024-12-12T18:45:46.163000+00:00

Attributes from the `conversation_event_summary` asset.

And we have the following metrics for this asset:

import json
from damn_tool.metrics import asset_metrics

result = asset_metrics('narratives/conversation_event_summary', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract relevant metrics
orchestrator_metrics = data["metrics"]["From orchestrator"]
dw_metrics = data["metrics"]["From data warehouse"]

# Combine metrics into a single string
output = (
    f"- Row count: {dw_metrics['row_count']}\n"
    f"- Size: {dw_metrics['bytes']}\n"
    f"- Last run id: {orchestrator_metrics['run_id']}\n"
    f"- Last run status: {orchestrator_metrics['status']}\n"
    f"- Last start time: {orchestrator_metrics['start_time']}\n"
    f"- Last end time: {orchestrator_metrics['end_time']}\n"
    f"- Number of partitions: {orchestrator_metrics['num_partitions']}\n"
    f"- Number of partitions materialized: {orchestrator_metrics['num_materialized']}\n"
    f"- Number of failed partitions: {orchestrator_metrics['num_failed']}"
)

# Print the final string
print(output)

Getting metrics from the `conversation_event_summary` asset.

- Row count: 289
- Size: 348.86 KB
- Last run id: 0346fbc5-54f5-46a1-8e7a-18299ef93004
- Last run status: SUCCESS
- Last start time: 2024-12-24 07:46:37
- Last end time: 2024-12-24 07:47:59
- Number of partitions: 758
- Number of partitions materialized: 48
- Number of failed partitions: 0

Metrics from the `conversation_event_summary` asset.

And a sample of data:

from google.cloud import bigquery
import pyarrow.feather as feather

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    select * except (research_cycles)
    from `dev_narratives.conversation_event_summary`
    order by partition_time desc
    limit 5"""

# Make an API request.
query_job = client.query(query)

# Wait for the job to complete.
conversation_event_summaries_df = query_job.result().to_dataframe()

# Write the DataFrame to a Feather file
feather_file = "data/conversation_event_summaries.feather"
feather.write_feather(conversation_event_summaries_df, feather_file)

# Pass the file path to the global namespace
globals()['conversation_event_summaries_feather'] = feather_file

Getting a data sample from the `conversation_event_summary` asset.

library(arrow)
library(dplyr)
library(knitr)

# Get the Feather file path from Python
feather_file <- reticulate::py$conversation_event_summaries_feather

# Read the Feather file
conversation_event_summaries_df <- read_feather(feather_file)
conversation_event_summaries_df <- conversation_event_summaries_df %>%
  mutate(
    EVENT_SUMMARY = substr(EVENT_SUMMARY, 1, 50)
  )

# Display the table
kable(
    conversation_event_summaries_df,
    format = "html",    # Ensures better rendering in Quarto
    escape = TRUE   
)

Displaying the data sample from the `conversation_event_summary` asset.

CONVERSATION_ID EVENT_SUMMARY PARTITION_TIME
1867041079163306152 On December 11, 2024, the Supreme Court issued a p 2024-12-12 07:00:00
1867022812025602097 A recent analysis highlights the significant benef 2024-12-12 07:00:00
1865150335284662433 In response to escalating geopolitical tensions, p 2024-12-07 04:00:00
1865181471918420443 In response to escalating threats from Russia and 2024-12-07 04:00:00
1865124092875067832 In response to escalating geopolitical tensions, p 2024-12-07 01:00:00

Finally, we associate each post with a discourse type and extract a narrative from it. Some attributes:

import json
from damn_tool.show import show_asset

result = show_asset('narratives/post_narrative_associations', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract and print the desired attributes
asset_details = data["show"]
orchestrator_details = asset_details["From orchestrator"]
warehouse_details = asset_details["From data warehouse"]

# Combine attributes into a single string
output = (
    f"- Description: {orchestrator_details['description']}\n"
    f"- Compute kind: {orchestrator_details['computeKind']}\n"
    f"- Is partitioned: {orchestrator_details['isPartitioned']}\n"
    f"- Table type: {warehouse_details['table_type']}\n"
    f"- Table schema: {warehouse_details['table_schema']}\n"
    f"- Created: {warehouse_details['created']}"
)

# Print the final string
print(output)

Getting attributes from the `post_narrative_associations` asset.

- Description: Associations between social network posts and narrative types
- Compute kind: LangChain
- Is partitioned: True
- Table type: BASE TABLE
- Table schema: narratives
- Created: 2024-12-12T18:46:06.324000+00:00

Attributes from the `post_narrative_associations` asset.

And some metrics:

import json
from damn_tool.metrics import asset_metrics

result = asset_metrics('narratives/post_narrative_associations', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract relevant metrics
orchestrator_metrics = data["metrics"]["From orchestrator"]
dw_metrics = data["metrics"]["From data warehouse"]

# Combine metrics into a single string
output = (
    f"- Row count: {dw_metrics['row_count']}\n"
    f"- Size: {dw_metrics['bytes']}\n"
    f"- Last run id: {orchestrator_metrics['run_id']}\n"
    f"- Last run status: {orchestrator_metrics['status']}\n"
    f"- Last start time: {orchestrator_metrics['start_time']}\n"
    f"- Last end time: {orchestrator_metrics['end_time']}\n"
    f"- Number of partitions: {orchestrator_metrics['num_partitions']}\n"
    f"- Number of partitions materialized: {orchestrator_metrics['num_materialized']}\n"
    f"- Number of failed partitions: {orchestrator_metrics['num_failed']}"
)

# Print the final string
print(output)

Getting metrics from the `post_narrative_associations` asset.

- Row count: 688
- Size: 366.30 KB
- Last run id: 0346fbc5-54f5-46a1-8e7a-18299ef93004
- Last run status: SUCCESS
- Last start time: 2024-12-24 07:48:11
- Last end time: 2024-12-24 07:49:58
- Number of partitions: 758
- Number of partitions materialized: 65
- Number of failed partitions: 0

Metrics from the `post_narrative_associations` asset.

Let’s get a sample of that data:

from google.cloud import bigquery
import pyarrow.feather as feather

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    select * from `dev_narratives.post_narrative_associations`
    where discourse_type != "N/A"
    order by partition_time desc
    limit 5"""

# Make an API request.
query_job = client.query(query)

# Wait for the job to complete.
post_narrative_associations_df = query_job.result().to_dataframe()

# Write the DataFrame to a Feather file
feather_file = "data/post_narrative_associations.feather"
feather.write_feather(post_narrative_associations_df, feather_file)

# Pass the file path to the global namespace
globals()['post_narrative_associations_feather'] = feather_file

Getting a data sample from the `post_narrative_associations` asset.

library(arrow)
library(knitr)

# Get the Feather file path from Python
feather_file <- reticulate::py$post_narrative_associations_feather

# Read the Feather file
post_narrative_associations_df <- read_feather(feather_file)

# Display the table
kable(
    post_narrative_associations_df,
    format = "html",    # Ensures better rendering in Quarto
    escape = TRUE   
)

Displaying the data sample from the `post_narrative_associations` asset.

POST_ID DISCOURSE_TYPE NARRATIVE PARTITION_TIME
1867041083688906784 Critical This post discusses a significant legal ruling that reflects the ongoing struggle between environmental regulation and economic interests, particularly related to coal usage. The Supreme Court's decision to reject the Kentucky electric utility's request to block the EPA's efforts illustrates the critical dimension of climate change discourse, where economic and political power structures are challenged in favor of public health and environmental protection. It emphasizes the need for robust regulatory frameworks to manage hazardous materials like coal ash, which are direct byproducts of fossil fuel consumption, and highlights the societal implications of maintaining such harmful practices. 2024-12-12 07:00:00
1867022813896323229 Integrative The post highlights the health and economic benefits of adopting heat pumps, framing climate change as an issue that intertwines environmental and social dimensions. By presenting data on reduced premature deaths, hospital visits, and asthma attacks, the discourse suggests that improving energy efficiency and transitioning to cleaner technologies not only addresses climate change but also significantly enhances public health and economic wellbeing. This integrative perspective encourages a holistic view of climate solutions, emphasizing the need to change societal norms and practices towards sustainable energy use. 2024-12-12 07:00:00
1854325713236611518 Critical The post questions whether China adheres to the climate agenda, which reflects a critical discourse. It implies skepticism about China's commitment to international climate agreements, possibly due to perceived economic and political interests that may not align with aggressive climate action. This aligns with the broader narrative that addressing climate change requires challenging existing economic systems and power structures, as these can lead to uneven and unsustainable patterns of development and energy use. The post's context, following the election of Donald Trump, who has expressed intentions to roll back climate regulations, further underscores the tension between political actions and global climate commitments. 2024-11-07 04:00:00
1854377519069151413 Critical This post highlights the social and economic implications of climate change, criticizing the lack of action to mitigate its effects despite the clear evidence of its future costs. The mention of 'our kids & grandkids' underscores the intergenerational impact of climate inaction. The tone suggests frustration with the current economic and political systems that prioritize short-term gains over long-term sustainability, which aligns with the Critical discourse. This reflects a challenge to power structures that maintain high fossil fuel consumption and neglect the urgency of climate policies. 2024-11-07 04:00:00
1854336568451797187 Critical The post reflects a critical discourse on climate change in the context of Donald Trump's election as President. It implies dissatisfaction and frustration with the electoral outcome, suggesting that the country's leadership is not conducive to addressing climate change. This aligns with the critical discourse type, where climate change is seen as a social problem exacerbated by political structures and leadership decisions that prioritize economic gains from fossil fuels over sustainable development. The post echoes concerns that Trump's policies, which include dismantling climate regulations and promoting fossil fuel production, are at odds with the urgent need for climate action. 2024-11-07 04:00:00

Analytics Assets

Finally, the platform produces a dimensional representation of those assets for reporting purposes. I won’t go into details here, but here’s an overview of the dbt project that generates those assets:

Analytics data product's assets.

📈 Conclusion

No sexy graphics, no groundbreaking insights. But isn’t that 90% of the work for data product builders? Designing the product, putting the pieces together, setting guardrails, and ensuring data quality.

There’s so much left to improve, but we at least now have a foundation to build upon. And most importantly, some data with which to explore new corners of our society. Because that’s the ultimate goal of RepublicOfData.io: to explore the dark corners of our society with data, and to get better at the craft of data product building along the way.