Exploring the Climate Resilience Data Platform: A Survey of Its Assets

This post explores the Climate Resilience Data Platform, showcasing how it sources climate articles, tracks social conversations, and classifies narratives. A behind-the-scenes look at refining data products to reveal societal insights and advance the mission of RepublicOfData.io.

Over the past months, I’ve iterated quickly to build a platform that captures climate-related conversations on social networks. I shared my journey along the way, turning a blind eye to my accumulated tech debt and not blinking twice when confronted with suspicious data quality.

The last few months might not have led to any breakthrough innovation worth reporting on, but they have been busy: refactoring the asset management system, enforcing stricter typing on datasets, productionizing AI agents, and more.

I think it’s now worth taking a step back and examining the platform's data assets. We won’t dive deep into any story yet (I’ll save that for the next post). Instead, we’ll survey the material we now have.

This will be a deep dive into a lot of metadata, so buckle up!

👨‍💻
At RepublicOfData.io, we openly share our code and our journey. Here’s a list of the repositories that are part of this project:

- Climate Resilience Data Platform: the data platform itself.
- Story notebooks: the notebooks to interact with the data.
- Damn Tool: a tool to interact with the platform’s assets.


🏛️ Design

The platform is built around 4 key data products:

- media: The climate-related articles being sourced.
- social_networks: The conversations that refer to the articles above.
- narratives: The classification of conversations and posts into discourse types and narratives.
- analytics: The dimensional representation of those assets for reporting purposes.

Mesh of data products on the Climate Resilience platform.

Here’s a complete list of assets produced by the platform.

import json
from damn_tool.ls import list_assets

result = list_assets(orchestrator='dagster', configs_dir="../.damn/")
data = json.loads(result)

organized_data = {}
for item in data["ls"]:
    product, key = item.split("/", 1)
    if product not in organized_data:
        organized_data[product] = []
    organized_data[product].append(key)

for product, keys in organized_data.items():
    print(f"{product}:")
    for key in keys:
        print(f" - {key}")

Listing assets using the DAMN tool.

analytics:
 - int__media_articles
 - int__social_network_conversations
 - int__social_network_posts
 - int__social_network_user_profiles
 - media_articles_fct
 - social_network_conversations_dim
 - social_network_posts_fct
 - social_network_user_profiles_dim
 - stg__conversation_classifications
 - stg__conversation_event_summaries
 - stg__nytimes_articles
 - stg__post_narrative_associations
 - stg__user_geolocations
 - stg__x_conversation_posts
 - stg__x_conversations
media:
 - nytimes_articles
narratives:
 - conversation_classifications
 - conversation_event_summary
 - post_narrative_associations
social_networks:
 - user_geolocations
 - x_conversation_posts
 - x_conversations

Listing of assets from the Climate Resilience data platform.


📊 Data Assets

Let’s now dive into each of those assets.

Media Assets

For now, we are only monitoring the New York Times for climate-related articles.

Media data product's assets.
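The sourcing itself is just a scheduled feed pull. Here’s a minimal sketch of what that step might look like, assuming the NYT climate RSS feed URL and the field mapping shown in the sample below; it’s an illustration, not the platform’s actual ingestion code.

import feedparser

# Hypothetical feed URL; the platform's real source configuration may differ.
NYT_CLIMATE_RSS = "https://rss.nytimes.com/services/xml/rss/nyt/Climate.xml"

def fetch_climate_articles(feed_url: str = NYT_CLIMATE_RSS) -> list[dict]:
    """Pull the feed and map entries onto columns similar to media.nytimes_articles."""
    feed = feedparser.parse(feed_url)
    articles = []
    for entry in feed.entries:
        articles.append({
            "media": "nytimes",
            "title": entry.get("title"),
            "link": entry.get("link"),
            "summary": entry.get("summary"),
            "author": entry.get("author"),
            "tags": ",".join(tag["term"] for tag in entry.get("tags", [])),
            "published_ts": entry.get("published"),
        })
    return articles

for article in fetch_climate_articles()[:3]:
    print(article["published_ts"], article["title"])

Hypothetical sketch of sourcing articles from an RSS feed.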

Let’s get a selection of the attributes from the media assets.

import json
from damn_tool.show import show_asset

result = show_asset('media/nytimes_articles', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract and print the desired attributes
asset_details = data["show"]
orchestrator_details = asset_details["From orchestrator"]
warehouse_details = asset_details["From data warehouse"]

# Combine attributes into a single string
output = (
    f"- Description: {orchestrator_details['description']}\n"
    f"- Compute kind: {orchestrator_details['computeKind']}\n"
    f"- Is partitioned: {orchestrator_details['isPartitioned']}\n"
    f"- Table type: {warehouse_details['table_type']}\n"
    f"- Table schema: {warehouse_details['table_schema']}\n"
    f"- Created: {warehouse_details['created']}"
)

# Print the final string
print(output)

Getting attributes from the nytimes_articles asset.

- Description: Media feed for New York Times
- Compute kind: python
- Is partitioned: True
- Table type: BASE TABLE
- Table schema: media
- Created: 2024-12-12T18:44:11.473000+00:00

Attributes from the nytimes_articles asset.

And some metrics:

import json
from damn_tool.metrics import asset_metrics

result = asset_metrics('media/nytimes_articles', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract relevant metrics
orchestrator_metrics = data["metrics"]["From orchestrator"]
dw_metrics = data["metrics"]["From data warehouse"]

# Combine metrics into a single string
output = (
    f"- Row count: {dw_metrics['row_count']}\n"
    f"- Size: {dw_metrics['bytes']}\n"
    f"- Last run id: {orchestrator_metrics['run_id']}\n"
    f"- Last run status: {orchestrator_metrics['status']}\n"
    f"- Last start time: {orchestrator_metrics['start_time']}\n"
    f"- Last end time: {orchestrator_metrics['end_time']}\n"
    f"- Number of partitions: {orchestrator_metrics['num_partitions']}\n"
    f"- Number of partitions materialized: {orchestrator_metrics['num_materialized']}\n"
    f"- Number of failed partitions: {orchestrator_metrics['num_failed']}"
)

# Print the final string
print(output)

Getting metrics from the nytimes_articles asset.

- Row count: 894
- Size: 649.96 KB
- Last run id: e6cf5f89-f14d-4623-9496-82b52efe79d1
- Last run status: SUCCESS
- Last start time: 2024-12-24 10:01:45
- Last end time: 2024-12-24 10:01:54
- Number of partitions: 2275
- Number of partitions materialized: 291
- Number of failed partitions: 0

Metrics from the nytimes_articles asset.

Let’s pull the data from this asset:

import pyarrow.feather as feather
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    select *
    from `phonic-biplane-420020.media.nytimes_articles`
    order by published_ts desc
    limit 5"""

# Make an API request.
query_job = client.query(query)

# Wait for the job to complete.
articles_df = query_job.result().to_dataframe()

# Write the DataFrame to a Feather file
feather_file = "data/articles.feather"
feather.write_feather(articles_df, feather_file)

# Pass the file path to the global namespace
globals()['articles_feather'] = feather_file

Getting a data sample from the nytimes_articles asset.

library(arrow)
library(knitr)

# Get the Feather file path from Python
feather_file <- reticulate::py$articles_feather

# Read the Feather file
articles_df <- read_feather(feather_file)

# Display the table
kable(
    articles_df,
    format = "html",
    escape = TRUE   
)

Displaying data sample from the nytimes_articles asset.

MEDIA ID TITLE LINK SUMMARY AUTHOR TAGS MEDIAS PUBLISHED_TS
nytimes https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html Trump’s Wish to Control Greenland and Panama Canal: Not a Joke This Time https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html In recent days the president-elect has called for asserting U.S. control over the Panama Canal and Greenland, showing that his “America First” philosophy has an expansionist dimension. David E. Sanger and Lisa Friedman United States International Relations,United States Politics and Government,Trump, Donald J,Greenland,Panama Canal and Canal Zone,Presidential Election of 2024,Global Warming,Metals and Minerals,Ships and Shipping,Egede, Mute B,Arctic Regions,Denmark,Panama https://static01.nyt.com/images/2024/12/23/multimedia/23DC-TRUMP-mpcg/23DC-TRUMP-mpcg-mediumSquareAt3X.jpg 2024-12-23 22:08:12
nytimes https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html Trump’s Wish to Control Greenland and Panama Canal: Not a Joke This Time https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html In recent days the president-elect has called for asserting U.S. control over the Panama Canal and Greenland, showing that his “America First” philosophy has an expansionist dimension. David E. Sanger and Lisa Friedman United States International Relations,United States Politics and Government,Trump, Donald J,Greenland,Panama Canal and Canal Zone,Presidential Election of 2024,Global Warming,Metals and Minerals,Ships and Shipping,Egede, Mute B,Arctic Regions,Denmark,Panama https://static01.nyt.com/images/2024/12/23/multimedia/23DC-TRUMP-mpcg/23DC-TRUMP-mpcg-mediumSquareAt3X.jpg 2024-12-23 21:08:14
nytimes https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html Trump’s Wish to Control Greenland and Panama Canal: Not a Joke This Time https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html In recent days the president-elect has called for asserting U.S. control over the Panama Canal and Greenland, showing that his “America First” philosophy has an expansionist dimension. David E. Sanger and Lisa Friedman United States International Relations,United States Politics and Government,Trump, Donald J,Greenland,Panama Canal and Canal Zone,Presidential Election of 2024,Global Warming,Metals and Minerals,Ships and Shipping,Egede, Mute B,Arctic Regions,Denmark,Panama https://static01.nyt.com/images/2024/12/23/multimedia/23DC-TRUMP-mpcg/23DC-TRUMP-mpcg-mediumSquareAt3X.jpg 2024-12-23 18:20:01
nytimes https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html Trump’s Wish to Control Greenland and Panama Canal: Not a Joke This Time https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html In recent days the president-elect has called for asserting U.S. control over the Panama Canal and Greenland, showing that his “America First” philosophy has an expansionist dimension. David E. Sanger and Lisa Friedman United States International Relations,United States Politics and Government,Trump, Donald J,Greenland,Panama Canal and Canal Zone,Presidential Election of 2024,Global Warming,Metals and Minerals,Ships and Shipping,Egede, Mute B,Arctic Regions,Denmark,Panama https://static01.nyt.com/images/2024/12/23/multimedia/23DC-TRUMP-mpcg/23DC-TRUMP-mpcg-mediumSquareAt3X.jpg 2024-12-23 17:10:55
nytimes https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html Trump’s Wish to Control Greenland and Panama Canal: Not a Joke This Time https://www.nytimes.com/2024/12/23/us/politics/trump-greenland-panama-canal.html In recent days the president-elect has called for asserting U.S. control over the Panama Canal and Greenland, showing that his “America First” philosophy has an expansionist dimension. David E. Sanger and Lisa Friedman United States International Relations,United States Politics and Government,Trump, Donald J,Greenland,Panama Canal and Canal Zone,Presidential Election of 2024,Global Warming,Metals and Minerals,Ships and Shipping,Egede, Mute B,Arctic Regions,Denmark,Panama https://static01.nyt.com/images/2024/12/23/multimedia/23DC-TRUMP-mpcg/23DC-TRUMP-mpcg-mediumSquareAt3X.jpg 2024-12-23 16:20:18

Articles are the foundation of the platform. Once climate-related articles are pulled, the platform monitors social networks and captures conversations that refer to them.

Social Networks Assets

We then monitor conversations on social networks (for now, only X) that refer to the articles sourced.

Social network data product's assets.
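Under the hood, conversation roots can be found by searching for posts that link to a sourced article. The sketch below is a hypothetical illustration using the X API v2 recent-search endpoint and its url: operator; the field selection mirrors the columns shown later and is an assumption, not the platform’s actual code.

import os
import requests

SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"

def search_posts_for_article(article_url: str, bearer_token: str) -> dict:
    """Search recent X posts that link to a given article (illustrative only)."""
    params = {
        # url: matches posts containing the link; -is:retweet keeps original posts only
        "query": f'url:"{article_url}" -is:retweet',
        "max_results": 100,
        "tweet.fields": "created_at,conversation_id,text,author_id",
        "expansions": "author_id",
        "user.fields": "username,location,description,created_at",
    }
    headers = {"Authorization": f"Bearer {bearer_token}"}
    response = requests.get(SEARCH_URL, headers=headers, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

token = os.environ["X_BEARER_TOKEN"]  # assumed environment variable
results = search_posts_for_article(
    "https://www.nytimes.com/2024/12/20/well/cold-deaths-health.html", token
)
for tweet in results.get("data", []):
    print(tweet["created_at"], tweet["text"][:60])

Hypothetical sketch of finding X conversations that reference an article.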

Let’s now get some details on conversations that occur on X.

import json
from damn_tool.show import show_asset

result = show_asset('social_networks/x_conversations', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract and print the desired attributes
asset_details = data["show"]
orchestrator_details = asset_details["From orchestrator"]
warehouse_details = asset_details["From data warehouse"]

# Combine attributes into a single string
output = (
    f"- Description: {orchestrator_details['description']}\n"
    f"- Compute kind: {orchestrator_details['computeKind']}\n"
    f"- Is partitioned: {orchestrator_details['isPartitioned']}\n"
    f"- Table type: {warehouse_details['table_type']}\n"
    f"- Table schema: {warehouse_details['table_schema']}\n"
    f"- Created: {warehouse_details['created']}"
)

# Print the final string
print(output)

Getting attributes from the x_conversations asset.

- Description: X conversations that mention this partition's article
- Compute kind: python
- Is partitioned: True
- Table type: BASE TABLE
- Table schema: social_networks
- Created: 2024-12-12T18:44:47.015000+00:00

Attributes from the x_conversations asset.

Some metrics:

import json
from damn_tool.metrics import asset_metrics

result = asset_metrics('social_networks/x_conversations', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract relevant metrics
orchestrator_metrics = data["metrics"]["From orchestrator"]
dw_metrics = data["metrics"]["From data warehouse"]

# Combine metrics into a single string
output = (
    f"- Row count: {dw_metrics['row_count']}\n"
    f"- Size: {dw_metrics['bytes']}\n"
    f"- Last run id: {orchestrator_metrics['run_id']}\n"
    f"- Last run status: {orchestrator_metrics['status']}\n"
    f"- Last start time: {orchestrator_metrics['start_time']}\n"
    f"- Last end time: {orchestrator_metrics['end_time']}\n"
    f"- Number of partitions: {orchestrator_metrics['num_partitions']}\n"
    f"- Number of partitions materialized: {orchestrator_metrics['num_materialized']}\n"
    f"- Number of failed partitions: {orchestrator_metrics['num_failed']}"
)

# Print the final string
print(output)

Getting metrics from the x_conversations asset.

- Row count: 8203
- Size: 5.51 MB
- Last run id: 84dd166d-a39c-4574-9b3d-f578b2cb9de3
- Last run status: SUCCESS
- Last start time: 2024-12-24 10:11:44
- Last end time: 2024-12-24 10:11:51
- Number of partitions: 2275
- Number of partitions materialized: 291
- Number of failed partitions: 0

Metrics from the x_conversations asset.

Let’s pull some conversations:

from google.cloud import bigquery
import pyarrow.feather as feather

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    select * except (tweet_public_metrics, author_public_metrics)
    from `social_networks.x_conversations`
    limit 5"""

# Make an API request.
query_job = client.query(query)

# Wait for the job to complete.
conversations_df = query_job.result().to_dataframe()

# Write the DataFrame to a Feather file
feather_file = "data/x_conversations.feather"
feather.write_feather(conversations_df, feather_file)

# Pass the file path to the global namespace
globals()['x_conversations_feather'] = feather_file

Getting a data sample from the x_conversations asset.

library(arrow)
library(dplyr)
library(knitr)

# Get the Feather file path from Python
feather_file <- reticulate::py$x_conversations_feather

# Read the Feather file
conversations_df <- read_feather(feather_file)
conversations_df <- conversations_df %>%
  mutate(
    TWEET_TEXT = substr(TWEET_TEXT, 1, 50)
  )

# Display the table
kable(
    conversations_df,
    format = "html",    # Ensures better rendering in Quarto
    escape = TRUE   
)

Displaying data sample from the x_conversations asset.

ARTICLE_URL TWEET_ID TWEET_CREATED_AT TWEET_CONVERSATION_ID TWEET_TEXT AUTHOR_ID AUTHOR_USERNAME AUTHOR_LOCATION AUTHOR_DESCRIPTION AUTHOR_CREATED_AT PARTITION_HOUR_UTC_TS RECORD_LOADING_TS
https://www.nytimes.com/2024/12/20/well/cold-deaths-health.html 1870194128631541938 2024-12-20 14:46:41 1870194128631541938 "Even though we are in this warming world, cold-re 759094814897999873 SmithBIDMC Boston, MA Official account of the Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology @BIDMChealth @BIDMCCVI. RTs ≠ Endorsements 2016-07-29 14:34:41 2024-12-20 14:00:00 2024-12-20 15:11:47
https://www.nytimes.com/2024/12/20/well/cold-deaths-health.html 1870186321907724544 2024-12-20 14:15:40 1870186321907724544 Check out @nytimes piece by @emilyschmall on our @ 860518198277545985 rkwadhera Boston, MA Associate Professor @HarvardMed | Cardiologist @BIDMCHealth | Associate Director @SmithBIDMC | Health Policy and Outcomes Research | 2017-05-05 11:35:00 2024-12-20 14:00:00 2024-12-20 15:11:47
https://www.nytimes.com/2024/12/20/well/cold-deaths-health.html 1870188986104254654 2024-12-20 14:26:15 1870188986104254654 Try North America! 🌎 https://t.co/JgBZQUSMYR 879378417694670848 handsoitgoes Quebec??=🇨🇦 or 🇫🇷 ?? "🇨🇦/🇺🇸 border kid. Ex-M. SJTO/TJSO (Social Justice Tribunals of Ontario/Tribunaux de justice sociale de l'Ontario), MLS, BA McGill "Anthroapology" 2017-06-26 12:38:47 2024-12-20 14:00:00 2024-12-20 15:11:47
https://www.nytimes.com/2024/12/20/well/cold-deaths-health.html 1870304020801237067 2024-12-20 22:03:21 1870304020801237067 pretty sure it's the homelessness above all https: 1034174877488570370 jhv85 Brooklyn, NY Writer/researcher specializing in American political development, political economy, party systems and ideology, social democracy. Columnist @compactmag_ 2018-08-27 16:24:39 2024-12-20 22:00:00 2024-12-20 23:11:37
https://www.nytimes.com/2024/12/20/well/cold-deaths-health.html 1870328626916528572 2024-12-20 23:41:08 1870328626916528572 #NYT1000本斬り 1428/2000 More People Are Now Dying Fr 842975270168477696 carina38925315 英語を教えて30年以上。今年は日本語教育能力検定試験と日本語教員試験受験。検定試験のみ合格。日本語教師として働く道を模索中。教員試験は来年合格したい。ハングル、イタリア語も勉強中。英検1級、英語通訳案内士、TOEICLR満点SW180/200 https://t.co/k9jjQFhk4o 2017-03-18 01:45:40 2024-12-20 23:00:00 2024-12-21 00:11:44

We then continue monitoring those conversations for 24 hours to capture all their posts. Here are some details on that asset:

import json
from damn_tool.show import show_asset

result = show_asset('social_networks/x_conversation_posts', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract and print the desired attributes
asset_details = data["show"]
orchestrator_details = asset_details["From orchestrator"]
warehouse_details = asset_details["From data warehouse"]

# Combine attributes into a single string
output = (
    f"- Description: {orchestrator_details['description']}\n"
    f"- Compute kind: {orchestrator_details['computeKind']}\n"
    f"- Is partitioned: {orchestrator_details['isPartitioned']}\n"
    f"- Table type: {warehouse_details['table_type']}\n"
    f"- Table schema: {warehouse_details['table_schema']}\n"
    f"- Created: {warehouse_details['created']}"
)

# Print the final string
print(output)

Getting attributes from the x_conversation_posts asset.

- Description: Posts within X conversations
- Compute kind: python
- Is partitioned: True
- Table type: BASE TABLE
- Table schema: social_networks
- Created: 2024-12-12T18:44:28.374000+00:00

Attributes from the x_conversation_posts asset.

Some metrics:

import json
from damn_tool.metrics import asset_metrics

result = asset_metrics('social_networks/x_conversation_posts', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract relevant metrics
orchestrator_metrics = data["metrics"]["From orchestrator"]
dw_metrics = data["metrics"]["From data warehouse"]

# Combine metrics into a single string
output = (
    f"- Row count: {dw_metrics['row_count']}\n"
    f"- Size: {dw_metrics['bytes']}\n"
    f"- Last run id: {orchestrator_metrics['run_id']}\n"
    f"- Last run status: {orchestrator_metrics['status']}\n"
    f"- Last start time: {orchestrator_metrics['start_time']}\n"
    f"- Last end time: {orchestrator_metrics['end_time']}\n"
    f"- Number of partitions: {orchestrator_metrics['num_partitions']}\n"
    f"- Number of partitions materialized: {orchestrator_metrics['num_materialized']}\n"
    f"- Number of failed partitions: {orchestrator_metrics['num_failed']}"
)

# Print the final string
print(output)

Getting metrics from the x_conversation_posts asset.

- Row count: 4415
- Size: 2.38 MB
- Last run id: 63a9fe77-8ed3-4c77-b59c-b74efe609713
- Last run status: SUCCESS
- Last start time: 2024-12-24 10:21:45
- Last end time: 2024-12-24 10:21:57
- Number of partitions: 758
- Number of partitions materialized: 100
- Number of failed partitions: 0

Metrics from the x_conversation_posts asset.

Let’s get some posts:

from google.cloud import bigquery
import pyarrow.feather as feather

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    select * except (article_url, tweet_public_metrics, author_public_metrics)
    from `social_networks.x_conversation_posts`
    limit 5"""

# Make an API request.
query_job = client.query(query)

# Wait for the job to complete.
conversation_posts_df = query_job.result().to_dataframe()

# Write the DataFrame to a Feather file
feather_file = "data/x_conversation_posts.feather"
feather.write_feather(conversation_posts_df, feather_file)

# Pass the file path to the global namespace
globals()['x_conversation_posts_feather'] = feather_file

Getting a data sample from the x_conversation_posts asset.

library(arrow)
library(dplyr)
library(knitr)

# Get the Feather file path from Python
feather_file <- reticulate::py$x_conversation_posts_feather

# Read the Feather file
conversation_posts_df <- read_feather(feather_file)
conversation_posts_df <- conversation_posts_df %>%
  mutate(
    TWEET_TEXT = substr(TWEET_TEXT, 1, 50)
  )

# Display the table
kable(
    conversation_posts_df,
    format = "html",    # Ensures better rendering in Quarto
    escape = TRUE   
)

Displaying the data sample from the x_conversation_posts asset.

TWEET_ID TWEET_CREATED_AT TWEET_CONVERSATION_ID TWEET_TEXT AUTHOR_ID AUTHOR_USERNAME AUTHOR_LOCATION AUTHOR_DESCRIPTION AUTHOR_CREATED_AT PARTITION_HOUR_UTC_TS RECORD_LOADING_TS
1871177534823383120 2024-12-23 07:54:23 1871176675834417432 @Matthuber78 They are just anti-progress. Unless t 199430225 HibbertMatthew Tavistock, Devon Renegade-redhead soixante-sixard. Broadly Liberal but I like boundaries. Blog a bit. 2010-10-06 17:09:24 2024-12-23 07:00:00 2024-12-23 13:21:37
1871178666358501848 2024-12-23 07:58:53 1871176675834417432 @collectifission Regardless of how you model the f 352833079 Matthuber78 Syracuse, NY Geographer, Lifeblood (2013) @UMinnPress, Climate Change as Class War (2022) @VersoBooks https://t.co/OgdpkbYLz3 2011-08-11 00:08:33 2024-12-23 07:00:00 2024-12-23 13:21:37
1871182808481476922 2024-12-23 08:15:21 1871176675834417432 @Matthuber78 Grinding up olivine is possibly the b 1483186552587132932 collectifission All about energy and what that means for people, in relation with the rest of the world. With technology we can all live well, with room for nature to flourish. 2022-01-17 16:17:24 2024-12-23 07:00:00 2024-12-23 13:21:37
1871177492033372210 2024-12-23 07:54:13 1871176675834417432 @Matthuber78 Honestly though: I know the IPCC mode 1483186552587132932 collectifission All about energy and what that means for people, in relation with the rest of the world. With technology we can all live well, with room for nature to flourish. 2022-01-17 16:17:24 2024-12-23 07:00:00 2024-12-23 13:21:37
1871297366130934184 2024-12-23 15:50:33 1871287900257808710 @mzjacobson @nytimes Carbon capture can reduce emi 1574113788344664064 ClimateSageO Amman تحويل سياسة المناخ إلى ممارسة #محارب_المناخ 2022-09-25 15:09:08 2024-12-23 13:00:00 2024-12-23 19:21:43
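These follow-up posts come from re-querying each conversation during its first 24 hours. As a rough, hypothetical sketch, the same recent-search endpoint used above can be filtered with the conversation_id: operator; again, this is an illustration rather than the platform’s actual code.

import os
import requests

SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"

def fetch_conversation_posts(conversation_id: str, bearer_token: str) -> list[dict]:
    """Fetch recent posts belonging to one conversation (illustrative only)."""
    params = {
        "query": f"conversation_id:{conversation_id}",
        "max_results": 100,
        "tweet.fields": "created_at,conversation_id,text,author_id",
    }
    headers = {"Authorization": f"Bearer {bearer_token}"}
    response = requests.get(SEARCH_URL, headers=headers, params=params, timeout=30)
    response.raise_for_status()
    return response.json().get("data", [])

token = os.environ["X_BEARER_TOKEN"]  # assumed environment variable
for post in fetch_conversation_posts("1871176675834417432", token):
    print(post["created_at"], post["text"][:60])

Hypothetical sketch of capturing the posts within a conversation.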

Finally, we geolocate the users who take part in those conversations. Some details:

import json
from damn_tool.show import show_asset

result = show_asset('social_networks/user_geolocations', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract and print the desired attributes
asset_details = data["show"]
orchestrator_details = asset_details["From orchestrator"]
warehouse_details = asset_details["From data warehouse"]

# Combine attributes into a single string
output = (
    f"- Description: {orchestrator_details['description']}\n"
    f"- Compute kind: {orchestrator_details['computeKind']}\n"
    f"- Is partitioned: {orchestrator_details['isPartitioned']}\n"
    f"- Table type: {warehouse_details['table_type']}\n"
    f"- Table schema: {warehouse_details['table_schema']}\n"
    f"- Created: {warehouse_details['created']}"
)

# Print the final string
print(output)

Getting attributes from the `user_geolocations` asset.

- Description: Geolocation of social network user's profile location
- Compute kind: python
- Is partitioned: True
- Table type: BASE TABLE
- Table schema: social_networks
- Created: 2024-12-12T18:46:22.036000+00:00

Attributes from the `user_geolocations` asset.

And some metrics:

import json
from damn_tool.metrics import asset_metrics

result = asset_metrics('social_networks/user_geolocations', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract relevant metrics
orchestrator_metrics = data["metrics"]["From orchestrator"]
dw_metrics = data["metrics"]["From data warehouse"]

# Combine metrics into a single string
output = (
    f"- Row count: {dw_metrics['row_count']}\n"
    f"- Size: {dw_metrics['bytes']}\n"
    f"- Last run id: {orchestrator_metrics['run_id']}\n"
    f"- Last run status: {orchestrator_metrics['status']}\n"
    f"- Last start time: {orchestrator_metrics['start_time']}\n"
    f"- Last end time: {orchestrator_metrics['end_time']}\n"
    f"- Number of partitions: {orchestrator_metrics['num_partitions']}\n"
    f"- Number of partitions materialized: {orchestrator_metrics['num_materialized']}\n"
    f"- Number of failed partitions: {orchestrator_metrics['num_failed']}"
)

# Print the final string
print(output)

Getting metrics from the `user_geolocations` asset.

- Row count: 4853
- Size: 523.74 KB
- Last run id: 63a9fe77-8ed3-4c77-b59c-b74efe609713
- Last run status: SUCCESS
- Last start time: 2024-12-24 10:22:06
- Last end time: 2024-12-24 10:22:21
- Number of partitions: 758
- Number of partitions materialized: 74
- Number of failed partitions: 0

Metrics from the `user_geolocations` asset.

Let’s get some geolocated user records:

from google.cloud import bigquery
import pyarrow.feather as feather

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    select * from `social_networks.user_geolocations`
    order by geolocation_ts desc
    limit 5"""

# Make an API request.
query_job = client.query(query)

# Wait for the job to complete.
user_geolocations_df = query_job.result().to_dataframe()

# Write the DataFrame to a Feather file
feather_file = "data/user_geolocations.feather"
feather.write_feather(user_geolocations_df, feather_file)

# Pass the file path to the global namespace
globals()['user_geolocations_feather'] = feather_file

Getting a data sample from the `user_geolocations` asset.

library(arrow)
library(knitr)

# Get the Feather file path from Python
feather_file <- reticulate::py$user_geolocations_feather

# Read the Feather file
user_geolocations_df <- read_feather(feather_file)

# Display the table
kable(
    user_geolocations_df,
    format = "html",    # Ensures better rendering in Quarto
    escape = TRUE   
)

Displaying the data sample from the `user_geolocations` asset.

SOCIAL_NETWORK_PROFILE_ID SOCIAL_NETWORK_PROFILE_USERNAME LOCATION_ORDER LOCATION COUNTRYNAME COUNTRYCODE ADMINNAME1 ADMINCODE1 LATITUDE LONGITUDE GEOLOCATION_TS PARTITION_HOUR_UTC_TS
1478403674569428998 babsi202 0 Netherlands The Netherlands NL 00 52.25 5.75 2024-12-24 10:22:13 2024-12-24 04:00:00
404992844 TomBauser 0 Frankfurt Germany DE Hesse 05 50.11552 8.68417 2024-12-24 10:22:13 2024-12-24 04:00:00
903288533259083778 DemProud 0 United States -14.60485 -57.65625 2024-12-24 10:22:12 2024-12-24 04:00:00
520658313 CharlesHAllison 0 New York City United States US New York NY 40.71427 -74.00597 2024-12-24 10:22:12 2024-12-24 04:00:00
2786988640 EricLebedel 0 Paris France FR Île-de-France 11 48.85341 2.3488 2024-12-24 10:22:11 2024-12-24 04:00:00
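The countryName, adminName1, and adminCode1 columns resemble GeoNames fields, so here’s a hypothetical sketch of how a free-text profile location could be resolved with the GeoNames search API. The username and field mapping are assumptions, not the platform’s actual geolocation code.

import requests

GEONAMES_SEARCH_URL = "http://api.geonames.org/searchJSON"

def geolocate_profile_location(location_text: str, username: str) -> dict | None:
    """Resolve a free-text profile location to a place via GeoNames (illustrative only)."""
    params = {
        "q": location_text,
        "maxRows": 1,          # keep only the best match
        "username": username,  # a registered GeoNames account name (assumed)
    }
    response = requests.get(GEONAMES_SEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    matches = response.json().get("geonames", [])
    if not matches:
        return None
    top = matches[0]
    return {
        "location": location_text,
        "country_name": top.get("countryName"),
        "country_code": top.get("countryCode"),
        "admin_name_1": top.get("adminName1"),
        "admin_code_1": top.get("adminCode1"),
        "latitude": top.get("lat"),
        "longitude": top.get("lng"),
    }

print(geolocate_profile_location("Boston, MA", username="demo"))

Hypothetical sketch of geolocating a profile location with GeoNames.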

Narratives Assets

Now that we have the conversations and posts, we classify them into narratives and discourse types.

🤖
The platform performs many of its data transformations with the help of AI agents. I’ve covered those extensively elsewhere, so here are a few resources to learn more:

- Blog post on how I’m orchestrating AI agents
- A deep-dive webinar I did on the topic with the good folks at Dagster
- An agent’s definition in the platform’s codebase
- The agent’s invocation in the platform’s codebase
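To give a flavour of what such an agent looks like, here’s a stripped-down, hypothetical sketch of a conversation classifier built with LangChain structured output. The model choice, prompt, and schema are assumptions; the platform’s real agent definitions live in the codebase linked above.

from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

class ConversationClassification(BaseModel):
    """Structured output returned by the classifier."""
    classification: bool = Field(
        description="True if the conversation discusses climate-related topics"
    )

prompt = ChatPromptTemplate.from_messages([
    ("system", "You classify social media conversations as climate-related or not."),
    ("human", "Article: {article_summary}\n\nConversation: {conversation_text}"),
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # assumed model choice
classifier = prompt | llm.with_structured_output(ConversationClassification)

result = classifier.invoke({
    "article_summary": "NYT piece on cold-related deaths in a warming climate.",
    "conversation_text": "Even though we are in this warming world, cold-related deaths keep rising...",
})
print(result.classification)

Hypothetical sketch of a LangChain-based conversation classifier.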

Let’s first have a look at the lineage between those narrative assets:

Narrative data product's assets.

We start by classifying conversations according to whether they discuss climate-related topics. Let’s see some attributes:

import json
from damn_tool.show import show_asset

result = show_asset('narratives/conversation_classifications', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract and print the desired attributes
asset_details = data["show"]
orchestrator_details = asset_details["From orchestrator"]
warehouse_details = asset_details["From data warehouse"]

# Combine attributes into a single string
output = (
    f"- Description: {orchestrator_details['description']}\n"
    f"- Compute kind: {orchestrator_details['computeKind']}\n"
    f"- Is partitioned: {orchestrator_details['isPartitioned']}\n"
    f"- Table type: {warehouse_details['table_type']}\n"
    f"- Table schema: {warehouse_details['table_schema']}\n"
    f"- Created: {warehouse_details['created']}"
)

# Print the final string
print(output)

Getting attributes from the `conversation_classifications` asset.

- Description: Classification of conversations as climate-related or not
- Compute kind: LangChain
- Is partitioned: True
- Table type: BASE TABLE
- Table schema: narratives
- Created: 2024-12-12T18:45:28.708000+00:00

Attributes from the `conversation_classifications` asset.

Let’s also have a look at some of its metrics:

import json
from damn_tool.metrics import asset_metrics

result = asset_metrics('narratives/conversation_classifications', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract relevant metrics
orchestrator_metrics = data["metrics"]["From orchestrator"]
dw_metrics = data["metrics"]["From data warehouse"]

# Combine metrics into a single string
output = (
    f"- Row count: {dw_metrics['row_count']}\n"
    f"- Size: {dw_metrics['bytes']}\n"
    f"- Last run id: {orchestrator_metrics['run_id']}\n"
    f"- Last run status: {orchestrator_metrics['status']}\n"
    f"- Last start time: {orchestrator_metrics['start_time']}\n"
    f"- Last end time: {orchestrator_metrics['end_time']}\n"
    f"- Number of partitions: {orchestrator_metrics['num_partitions']}\n"
    f"- Number of partitions materialized: {orchestrator_metrics['num_materialized']}\n"
    f"- Number of failed partitions: {orchestrator_metrics['num_failed']}"
)

# Print the final string
print(output)

Getting metrics from the `conversation_classifications` asset.

- Row count: 8127
- Size: 280.32 KB
- Last run id: 0346fbc5-54f5-46a1-8e7a-18299ef93004
- Last run status: SUCCESS
- Last start time: 2024-12-24 07:46:36
- Last end time: 2024-12-24 07:47:09
- Number of partitions: 758
- Number of partitions materialized: 98
- Number of failed partitions: 0

Metrics from the `conversation_classifications` asset.

Let’s get a sample of data for those classifications:

from google.cloud import bigquery
import pyarrow.feather as feather

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    select * from `dev_narratives.conversation_classifications`
    order by partition_time desc
    limit 5"""

# Make an API request.
query_job = client.query(query)

# Wait for the job to complete.
conversation_classifications_df = query_job.result().to_dataframe()

# Write the DataFrame to a Feather file
feather_file = "data/conversation_classifications.feather"
feather.write_feather(conversation_classifications_df, feather_file)

# Pass the file path to the global namespace
globals()['conversation_classifications_feather'] = feather_file

Getting a data sample from the `conversation_classifications` asset.

library(arrow)
library(knitr)

# Get the Feather file path from Python
feather_file <- reticulate::py$conversation_classifications_feather

# Read the Feather file
conversation_classifications_df <- read_feather(feather_file)

# Display the table
kable(
    conversation_classifications_df,
    format = "html",    # Ensures better rendering in Quarto
    escape = TRUE   
)

Displaying the data sample from the `conversation_classifications` asset.

CONVERSATION_ID CLASSIFICATION PARTITION_TIME
1867074284352393331 True 2024-12-12 10:00:00
1867050503504318545 False 2024-12-12 10:00:00
1867073426214519260 True 2024-12-12 10:00:00
1867075620628300041 True 2024-12-12 10:00:00
1867073189039276360 True 2024-12-12 10:00:00

Then we summarize the events being discussed in those conversations.

import json
from damn_tool.show import show_asset

result = show_asset('narratives/conversation_event_summary', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract and print the desired attributes
asset_details = data["show"]
orchestrator_details = asset_details["From orchestrator"]
warehouse_details = asset_details["From data warehouse"]

# Combine attributes into a single string
output = (
    f"- Description: {orchestrator_details['description']}\n"
    f"- Compute kind: {orchestrator_details['computeKind']}\n"
    f"- Is partitioned: {orchestrator_details['isPartitioned']}\n"
    f"- Table type: {warehouse_details['table_type']}\n"
    f"- Table schema: {warehouse_details['table_schema']}\n"
    f"- Created: {warehouse_details['created']}"
)

# Print the final string
print(output)

Getting attributes from the `conversation_event_summary` asset.

- Description: Summary of the event discussed in a conversation
- Compute kind: LangChain
- Is partitioned: True
- Table type: BASE TABLE
- Table schema: narratives
- Created: 2024-12-12T18:45:46.163000+00:00

Attributes from the `conversation_event_summary` asset.

And we have the following metrics for this asset:

import json
from damn_tool.metrics import asset_metrics

result = asset_metrics('narratives/conversation_event_summary', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract relevant metrics
orchestrator_metrics = data["metrics"]["From orchestrator"]
dw_metrics = data["metrics"]["From data warehouse"]

# Combine metrics into a single string
output = (
    f"- Row count: {dw_metrics['row_count']}\n"
    f"- Size: {dw_metrics['bytes']}\n"
    f"- Last run id: {orchestrator_metrics['run_id']}\n"
    f"- Last run status: {orchestrator_metrics['status']}\n"
    f"- Last start time: {orchestrator_metrics['start_time']}\n"
    f"- Last end time: {orchestrator_metrics['end_time']}\n"
    f"- Number of partitions: {orchestrator_metrics['num_partitions']}\n"
    f"- Number of partitions materialized: {orchestrator_metrics['num_materialized']}\n"
    f"- Number of failed partitions: {orchestrator_metrics['num_failed']}"
)

# Print the final string
print(output)

Getting metrics from the `conversation_event_summary` asset.

- Row count: 289
- Size: 348.86 KB
- Last run id: 0346fbc5-54f5-46a1-8e7a-18299ef93004
- Last run status: SUCCESS
- Last start time: 2024-12-24 07:46:37
- Last end time: 2024-12-24 07:47:59
- Number of partitions: 758
- Number of partitions materialized: 48
- Number of failed partitions: 0

Metrics from the `conversation_event_summary` asset.

And a sample of data:

from google.cloud import bigquery
import pyarrow.feather as feather

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    select * except (research_cycles)
    from `dev_narratives.conversation_event_summary`
    order by partition_time desc
    limit 5"""

# Make an API request.
query_job = client.query(query)

# Wait for the job to complete.
conversation_event_summaries_df = query_job.result().to_dataframe()

# Write the DataFrame to a Feather file
feather_file = "data/conversation_event_summaries.feather"
feather.write_feather(conversation_event_summaries_df, feather_file)

# Pass the file path to the global namespace
globals()['conversation_event_summaries_feather'] = feather_file

Getting a data sample from the `conversation_event_summary` asset.

library(arrow)
library(dplyr)
library(knitr)

# Get the Feather file path from Python
feather_file <- reticulate::py$conversation_event_summaries_feather

# Read the Feather file
conversation_event_summaries_df <- read_feather(feather_file)
conversation_event_summaries_df <- conversation_event_summaries_df %>%
  mutate(
    EVENT_SUMMARY = substr(EVENT_SUMMARY, 1, 50)
  )

# Display the table
kable(
    conversation_event_summaries_df,
    format = "html",    # Ensures better rendering in Quarto
    escape = TRUE   
)

Displaying the data sample from the `conversation_event_summary` asset.

CONVERSATION_ID EVENT_SUMMARY PARTITION_TIME
1867041079163306152 On December 11, 2024, the Supreme Court issued a p 2024-12-12 07:00:00
1867022812025602097 A recent analysis highlights the significant benef 2024-12-12 07:00:00
1865150335284662433 In response to escalating geopolitical tensions, p 2024-12-07 04:00:00
1865181471918420443 In response to escalating threats from Russia and 2024-12-07 04:00:00
1865124092875067832 In response to escalating geopolitical tensions, p 2024-12-07 01:00:00

Finally, we associate each post with a discourse type and extract a narrative from it. Some attributes:

import json
from damn_tool.show import show_asset

result = show_asset('narratives/post_narrative_associations', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract and print the desired attributes
asset_details = data["show"]
orchestrator_details = asset_details["From orchestrator"]
warehouse_details = asset_details["From data warehouse"]

# Combine attributes into a single string
output = (
    f"- Description: {orchestrator_details['description']}\n"
    f"- Compute kind: {orchestrator_details['computeKind']}\n"
    f"- Is partitioned: {orchestrator_details['isPartitioned']}\n"
    f"- Table type: {warehouse_details['table_type']}\n"
    f"- Table schema: {warehouse_details['table_schema']}\n"
    f"- Created: {warehouse_details['created']}"
)

# Print the final string
print(output)

Getting attributes from the `post_narrative_associations` asset.

- Description: Associations between social network posts and narrative types
- Compute kind: LangChain
- Is partitioned: True
- Table type: BASE TABLE
- Table schema: narratives
- Created: 2024-12-12T18:46:06.324000+00:00

Attributes from the `post_narrative_associations` asset.

And some metrics:

import json
from damn_tool.metrics import asset_metrics

result = asset_metrics('narratives/post_narrative_associations', orchestrator='dagster', data_warehouse='bigquery', configs_dir="../.damn/")
data = json.loads(result)

# Extract relevant metrics
orchestrator_metrics = data["metrics"]["From orchestrator"]
dw_metrics = data["metrics"]["From data warehouse"]

# Combine metrics into a single string
output = (
    f"- Row count: {dw_metrics['row_count']}\n"
    f"- Size: {dw_metrics['bytes']}\n"
    f"- Last run id: {orchestrator_metrics['run_id']}\n"
    f"- Last run status: {orchestrator_metrics['status']}\n"
    f"- Last start time: {orchestrator_metrics['start_time']}\n"
    f"- Last end time: {orchestrator_metrics['end_time']}\n"
    f"- Number of partitions: {orchestrator_metrics['num_partitions']}\n"
    f"- Number of partitions materialized: {orchestrator_metrics['num_materialized']}\n"
    f"- Number of failed partitions: {orchestrator_metrics['num_failed']}"
)

# Print the final string
print(output)

Getting metrics from the `post_narrative_associations` asset.

- Row count: 688
- Size: 366.30 KB
- Last run id: 0346fbc5-54f5-46a1-8e7a-18299ef93004
- Last run status: SUCCESS
- Last start time: 2024-12-24 07:48:11
- Last end time: 2024-12-24 07:49:58
- Number of partitions: 758
- Number of partitions materialized: 65
- Number of failed partitions: 0

Metrics from the `post_narrative_associations` asset.

Let’s get a sample of that data:

from google.cloud import bigquery
import pyarrow.feather as feather

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    select * from `dev_narratives.post_narrative_associations`
    where discourse_type != "N/A"
    order by partition_time desc
    limit 5"""

# Make an API request.
query_job = client.query(query)

# Wait for the job to complete.
post_narrative_associations_df = query_job.result().to_dataframe()

# Write the DataFrame to a Feather file
feather_file = "data/post_narrative_associations.feather"
feather.write_feather(post_narrative_associations_df, feather_file)

# Pass the file path to the global namespace
globals()['post_narrative_associations_feather'] = feather_file

Getting a data sample from the `post_narrative_associations` asset.

library(arrow)
library(knitr)

# Get the Feather file path from Python
feather_file <- reticulate::py$post_narrative_associations_feather

# Read the Feather file
post_narrative_associations_df <- read_feather(feather_file)

# Display the table
kable(
    post_narrative_associations_df,
    format = "html",    # Ensures better rendering in Quarto
    escape = TRUE   
)

Displaying the data sample from the `post_narrative_associations` asset.

POST_ID DISCOURSE_TYPE NARRATIVE PARTITION_TIME
1867041083688906784 Critical This post discusses a significant legal ruling that reflects the ongoing struggle between environmental regulation and economic interests, particularly related to coal usage. The Supreme Court's decision to reject the Kentucky electric utility's request to block the EPA's efforts illustrates the critical dimension of climate change discourse, where economic and political power structures are challenged in favor of public health and environmental protection. It emphasizes the need for robust regulatory frameworks to manage hazardous materials like coal ash, which are direct byproducts of fossil fuel consumption, and highlights the societal implications of maintaining such harmful practices. 2024-12-12 07:00:00
1867022813896323229 Integrative The post highlights the health and economic benefits of adopting heat pumps, framing climate change as an issue that intertwines environmental and social dimensions. By presenting data on reduced premature deaths, hospital visits, and asthma attacks, the discourse suggests that improving energy efficiency and transitioning to cleaner technologies not only addresses climate change but also significantly enhances public health and economic wellbeing. This integrative perspective encourages a holistic view of climate solutions, emphasizing the need to change societal norms and practices towards sustainable energy use. 2024-12-12 07:00:00
1854325713236611518 Critical The post questions whether China adheres to the climate agenda, which reflects a critical discourse. It implies skepticism about China's commitment to international climate agreements, possibly due to perceived economic and political interests that may not align with aggressive climate action. This aligns with the broader narrative that addressing climate change requires challenging existing economic systems and power structures, as these can lead to uneven and unsustainable patterns of development and energy use. The post's context, following the election of Donald Trump, who has expressed intentions to roll back climate regulations, further underscores the tension between political actions and global climate commitments. 2024-11-07 04:00:00
1854377519069151413 Critical This post highlights the social and economic implications of climate change, criticizing the lack of action to mitigate its effects despite the clear evidence of its future costs. The mention of 'our kids & grandkids' underscores the intergenerational impact of climate inaction. The tone suggests frustration with the current economic and political systems that prioritize short-term gains over long-term sustainability, which aligns with the Critical discourse. This reflects a challenge to power structures that maintain high fossil fuel consumption and neglect the urgency of climate policies. 2024-11-07 04:00:00
1854336568451797187 Critical The post reflects a critical discourse on climate change in the context of Donald Trump's election as President. It implies dissatisfaction and frustration with the electoral outcome, suggesting that the country's leadership is not conducive to addressing climate change. This aligns with the critical discourse type, where climate change is seen as a social problem exacerbated by political structures and leadership decisions that prioritize economic gains from fossil fuels over sustainable development. The post echoes concerns that Trump's policies, which include dismantling climate regulations and promoting fossil fuel production, are at odds with the urgent need for climate action. 2024-11-07 04:00:00

Analytics Assets

Finally, the platform produces a dimensional representation of those assets for reporting purposes. I won’t go into details here, but here’s an overview of the dbt project that generates those assets:

Analytics data product's assets.

📈 Conclusion

No sexy graphics, no groundbreaking insights. But isn’t that 90% of the work for data product builders? Designing the product, putting the pieces together, setting guardrails, and ensuring data quality.

There’s so much left to improve, but we at least now have a foundation to build upon. And most importantly, some data with which to explore new corners of our society. Because that’s the ultimate goal of RepublicOfData.io: to explore the dark corners of our society with data, and to get better at the craft of data product building along the way.