Dimensional Emotion Mining in Disaster-related News Headlines

Some studies (e.g., Martínez et al., 2025) suggest GPT models can replace human affective ratings of stimuli like single words and multi-word expressions. I was curious whether this holds for whole sentences, and whether I could use it to test an idea I had been sitting on for a while.

Since AI, at least for now, lacks the complexity of the human brain and has not experienced the pain, suffering, and joy of our emotional lives, I'd rather be a little skeptical about its power to estimate human emotional reactions. So I decided to test this for myself.

First, let's see whether GPT models do a good job of mining Valence, Arousal, and Dominance (VAD). The first question to answer is whether GPT estimates of VAD correlate strongly with human ratings. I used EmoBank (Buechel & Hahn, 2017) to check.

library(httr)
library(jsonlite)
library(tidyverse)

You can get EmoBank from GitHub: https://github.com/JULIELab/EmoBank

file_path <- choose.files()
emobank <- read.csv(file_path)

Now we need some examples of human VAD ratings to show GPT. I picked three examples from EmoBank for few-shot prompting: the lowest- and highest-valence items, plus one that falls somewhere in between.

# lowest valence 
emobank[which.min(emobank$V),]
#row 1147 --> V 1.2 

# highest valence
emobank[which.max(emobank$V),]
# row 9596 --> V 4.6

# middle valence
hist(emobank$V) # roughly normally distributed, so pick the mean
mean(emobank$V) # 2.98
emobank[emobank$V == 3, ]
# row 31 sounds like a good example of a neutral sentence 

We have three examples with the lowest, highest, and average valence:

emobank[c(1147, 9596, 31), "text"]

We can prepare a prompt based on these three examples but change the scale to 1-9 for more fine-grained mining (EmoBank ratings run from 1 to 5).
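For reference, the linear map between the two scales looks like this (a hypothetical helper, shown only for orientation; the correlations reported below are scale-invariant anyway):

# map EmoBank's 1-5 ratings onto the 1-9 scale used in the prompts
rescale_1to9 <- function(v) 1 + (v - 1) * (9 - 1) / (5 - 1)
rescale_1to9(c(1.2, 3.0, 4.6))  # 1.4 5.0 8.2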

Note: I tried a few different prompts, rating scales, and GPT models. The ones below worked best; for example, GPT-4o worked slightly better than GPT-4o-mini.

few_shot_prompt <- '
Rate the emotional valence of the following sentence as if you were a human. 
Use a scale from 1 to 9, where 1 means very negative/unpleasant and 9 means very positive/pleasant. 
Use the examples below as a guide for how humans typically rate valence.

Sentence: "Fuck you"
Valence: 1

Sentence: "What kind of work does Goodwill do?"
Valence: 5

Sentence: "lol Wonderful Simply Superb!"
Valence: 9

Only respond with a single number from 1 to 9.
'

Create a subset of EmoBank for comparing GPT ratings with human ratings on randomly selected items:

emobank_GPT <- emobank[sample(nrow(emobank), 300), ]
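Note that I did not fix a random seed, so this exact subset is not reproducible. To pin it down, one could seed the RNG first (my addition, not done in the original run):

set.seed(123)  # any fixed seed works; 123 is arbitrary
emobank_GPT <- emobank[sample(nrow(emobank), 300), ]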

Now I call GPT to estimate valence based on the prompt above.

# function to call GPT for ratings 

api_key <- "my api key"

# empty column for storing GPT estimates
emobank_GPT$V_GPT <- NA


get_valence <- function(sentence, few_shot_prompt, api_key) {
  # append the target sentence to the few-shot examples
  prompt <- paste0(few_shot_prompt, "\nSentence: \"", sentence, "\"\nValence:")

  response <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", api_key)),
    content_type_json(),
    body = toJSON(list(
      model = "gpt-4o",
      messages = list(list(role = "user", content = prompt)),
      temperature = 0
    ), auto_unbox = TRUE)
  )

  out <- content(response, as = "parsed")
  answer <- out$choices[[1]]$message$content
  # keep only digits and dots, then coerce to numeric
  rating <- as.numeric(gsub("[^0-9.]", "", answer))
  return(rating)
}
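As written, get_valence() assumes every request succeeds; over hundreds of calls, a single transient API error would crash the loop. A minimal retry wrapper might look like this (safe_get_valence is my own addition, not part of the original run):

# retry a failed call a few times and return NA instead of erroring out
safe_get_valence <- function(sentence, few_shot_prompt, api_key, tries = 3) {
  for (attempt in 1:tries) {
    rating <- tryCatch(
      get_valence(sentence, few_shot_prompt, api_key),
      error = function(e) NA_real_
    )
    if (!is.na(rating)) return(rating)
    Sys.sleep(2^attempt)  # back off before retrying
  }
  NA_real_
}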
for (i in 1:nrow(emobank_GPT)) {
  print(paste("row", i))
  sentence <- emobank_GPT$text[i]
  valence <- get_valence(sentence, few_shot_prompt, api_key)
  emobank_GPT$V_GPT[i] <- valence
  Sys.sleep(0.1)  
}

Compare the GPT ratings with human ratings.

cor(emobank_GPT$V, emobank_GPT$V_GPT, method = "pearson")
cor(emobank_GPT$V, emobank_GPT$V_GPT, method = "spearman")
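A quick scatterplot makes the agreement visible before the numbers (my addition; ggplot2 is loaded with tidyverse):

ggplot(emobank_GPT, aes(x = V, y = V_GPT)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Human valence (EmoBank, 1-5)", y = "GPT-4o valence (1-9)")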

GPT-4o's estimates of valence correlate at about .76 with human ratings: acceptable for such a complicated task. Now let's see whether we find the same pattern for arousal and dominance.

# pick three examples for arousal for prompt engineering 

# lowest arousal
emobank[which.min(emobank$A),]
#row 3536 --> A 1.8 

# highest arousal
emobank[which.max(emobank$A),]
# row 7788 --> A 4.4

# middle arousal
hist(emobank$A) # roughly normal
mean(emobank$A) # 3.04
emobank[emobank$A == 3,]
# row 796 sounds like neutral arousal: "The document speaks for itself."

The three examples for arousal:

emobank[c(3536, 7788, 796), "text"]
# prepare the prompt for arousal
few_shot_prompt <- '
Rate the emotional arousal of the following sentence as if you were a human. 
Use a scale from 1 to 9, where 1 means very calm and 9 means very excited. 
Use the examples below as a guide for how humans typically rate arousal.

Sentence: "I was feeling calm and private that night."
Arousal: 1

Sentence: "The document speaks for itself."
Arousal: 5

Sentence: "My God, yes, yes, yes!"
Arousal: 9

Only respond with a single number from 1 to 9.
'
emobank_GPT$A_GPT <- NA

Temperature is set to 0 so GPT doesn't get creative.

get_arousal <- function(sentence, few_shot_prompt, api_key) {
  prompt <- paste0(few_shot_prompt, "\nSentence: \"", sentence, "\"\nArousal:")

  response <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", api_key)),
    content_type_json(),
    body = toJSON(list(
      model = "gpt-4o",
      messages = list(list(role = "user", content = prompt)),
      temperature = 0
    ), auto_unbox = TRUE)
  )

  out <- content(response, as = "parsed")
  answer <- out$choices[[1]]$message$content
  rating <- as.numeric(gsub("[^0-9.]", "", answer))
  return(rating)
}
for (i in 1:nrow(emobank_GPT)) {
  cat("Processing row", i, "\n")  # print progress to console
  sentence <- emobank_GPT$text[i]
  arousal <- get_arousal(sentence, few_shot_prompt, api_key)
  emobank_GPT$A_GPT[i] <- arousal
  Sys.sleep(0.1)
}

The correlation of GPT arousal estimates with human arousal ratings:

cor(emobank_GPT$A, emobank_GPT$A_GPT, method = "pearson")
cor(emobank_GPT$A, emobank_GPT$A_GPT, method = "spearman")

GPT's performance on arousal is not acceptable, with a correlation of only about .51 with human ratings. Let's check dominance.

# lowest dominance
emobank[which.min(emobank$D), ] # row 3908

# highest dominance
emobank[which.max(emobank$D), ] # row 8057 is "NO"; it might not be that clear, so add another example
max(emobank$D)
emobank[emobank$D == 4.1, ] # row 7982

# middle dominance
hist(emobank$D) # roughly normal
mean(emobank$D) # 3.06
emobank[emobank$D == 3, ] # row 667 looks like a good candidate for medium dominance
emobank[c(3908, 8057, 7982), "text"]
# I thought "NO" was not the best example of high dominance for GPT, so I picked another: "I can fix that."

Preparing the prompt:

few_shot_prompt <- '
Rate the emotional dominance of the following sentence as if you were a human. 
Use a scale from 1 to 9, where 1 means very submissive / powerless and 9 means very dominant / powerful. 
Use the examples below as a guide for how humans typically rate dominance.

Sentence: "I shivered as I walked past the pale man’s blank eyes, wondering what they were staring at."
Dominance: 1

Sentence: "Hall is to return to Washington on April 22."
Dominance: 5

Sentence: "I can fix that."
Dominance: 9

Only respond with a single number from 1 to 9.
'
emobank_GPT$D_GPT <- NA
# Function to get dominance
get_dominance <- function(sentence, few_shot_prompt, api_key) {
  prompt <- paste0(few_shot_prompt, "\nSentence: \"", sentence, "\"\nDominance:")

  response <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", api_key)),
    content_type_json(),
    body = toJSON(list(
      model = "gpt-4o",
      messages = list(list(role = "user", content = prompt)),
      temperature = 0
    ), auto_unbox = TRUE)
  )

  out <- content(response, as = "parsed")
  answer <- out$choices[[1]]$message$content
  rating <- as.numeric(gsub("[^0-9.]", "", answer))
  return(rating)
}

# Loop through sentences and store dominance scores
for (i in 1:nrow(emobank_GPT)) {
  cat("Processing row", i, "\n")  # show progress in console
  sentence <- emobank_GPT$text[i]
  dom <- get_dominance(sentence, few_shot_prompt, api_key)
  emobank_GPT$D_GPT[i] <- dom
  Sys.sleep(0.1)  
}
cor(emobank_GPT$D, emobank_GPT$D_GPT, method = "pearson")
cor(emobank_GPT$D, emobank_GPT$D_GPT, method = "spearman")

Dominance estimates are not good either.

In sum, GPT estimates of valence might substitute for human ratings, but for now its performance on arousal and dominance is not acceptable. We will have to rely on human ratings for those dimensions for the foreseeable future.
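A side note on the code: get_valence(), get_arousal(), and get_dominance() are identical except for the prompt and the label appended after the sentence, so they could be collapsed into one helper. A hypothetical sketch (get_rating is my own name, not part of the original run):

get_rating <- function(sentence, few_shot_prompt, label, api_key) {
  # works for any dimension: label is "Valence", "Arousal", or "Dominance"
  prompt <- paste0(few_shot_prompt, "\nSentence: \"", sentence, "\"\n", label, ":")
  response <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", api_key)),
    content_type_json(),
    body = toJSON(list(
      model = "gpt-4o",
      messages = list(list(role = "user", content = prompt)),
      temperature = 0
    ), auto_unbox = TRUE)
  )
  answer <- content(response, as = "parsed")$choices[[1]]$message$content
  as.numeric(gsub("[^0-9.]", "", answer))
}
# e.g., get_rating(sentence, few_shot_prompt, "Valence", api_key)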

write_csv(emobank_GPT, 
          "C:\\Users\\sos523\\Dropbox (Lehigh University)\\Projects\\NLP\\Emotion mining\\emobank_GPT.csv")

The dataset, which includes the 300 examples, their human ratings, and the GPT estimates, can be found here: https://github.com/soheilshapouri/dimensional-emotion-mining-disaster-news/blob/main/emobank_GPT.csv

Since valence was the only dimension GPT-4o estimated appropriately, I will use only that one. I already showed that the GPT estimates correlate well with human ratings, so I will use them to check the valence of disaster-related news.

My previous study (Shapouri et al., 2023) showed that valence ratings of visual stimuli depicting natural and technological disasters differ. Does this generalize to news headlines? We'll see.

First, we need some news headlines. I bought the Standard API plan of thenewsapi.com ($49/month). Using the code below, I mined news related to disasters.

The "All News" endpoint of thenewsapi was used for news harvesting. The EM-DAT database was used to compile a comprehensive list of disasters. Following ChatGPT's suggestions, I also created an impact-terms vector so that only disaster news in which people were killed is retrieved, which helps control for the severity of disasters.

# Disaster terms from EM-DAT
disaster_terms <- c(
  "storm", "flood", "epidemic", "earthquake", "drought", "wildfire", "infestation",
  "chemical spill", "gas leak", "oil spill", "heat wave", "tornado", "cold wave",
  "tsunami", "forest fire", "viral disease", "derecho", "hail", "lightning", 
  "thunderstorms", "severe weather", "land fire", "poisoning", "cyclone",
  "industrial fire", "industrial explosion", "bridge collapse", "building collapse", 
  "industrial accident", "train accident", "boat accident", "ferry accident", 
  "helicopter crash", "plane crash", "car accident", "motorcycle crash", 
  "bike accident", "collision", "landslide", "avalanche", "blizzard",
  "rockfall", "bus accident", "vehicle accident","typhoon", "famine",
  "infectious disease", "lava", "hurricane", "building fire"
)
impact_terms <- c(
  "kill", "dead", "death", "fatalit*", "died", "casualt*"
)
impact_query <- paste(impact_terms, collapse = "|")
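# Illustration (my addition): the search query built in the loop below, for one term
paste0("(", "earthquake", ") + (", impact_query, ")")
# "(earthquake) + (kill|dead|death|fatalit*|died|casualt*)"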
# news mining 
all_articles <- list()
index <- 1
article_count <- 0
backup_count <- 1

# getting the news 
# only "general" category, excluding sports category, ...
for (term in disaster_terms) {
  for (page_num in 1:10) {
    message("Keyword: ", term)
    message("  Page: ", page_num)
    
    search_query <- paste0("(", term, ") + (", impact_query, ")")
    
    response <- GET(
      "https://api.thenewsapi.com/v1/news/all",
      query = list(
        api_token = "my api token",
        language = "en",
        categories = "general",
        search = search_query,
        search_fields = "title, main_text",
        exclude_categories = "sports",
        published_before = "2025-05-01",
        published_after = "2000-01-01",
        page = page_num,
        limit = 100
      )
    )
    
    json_text <- content(response, as = "text", encoding = "UTF-8")
    parsed_data <- fromJSON(json_text, flatten = TRUE)
    
    if (!is.null(parsed_data$data) && length(parsed_data$data) > 0) {
      df <- parsed_data$data
      df$disaster_term <- term
      all_articles[[index]] <- df
      index <- index + 1
      article_count <- article_count + nrow(df)
      
      # Backup every 1000 articles
      if (article_count >= 1000) {
        news_data_temp <- do.call(rbind, lapply(all_articles, function(x) {
          data.frame(lapply(x, function(col) {
            if (is.list(col)) sapply(col, toString) else col
          }), stringsAsFactors = FALSE)
        }))
        write.csv(news_data_temp, paste0("news_backup_", backup_count, ".csv"), row.names = FALSE)
        message("✔ Backup created: news_backup_", backup_count, ".csv")
        backup_count <- backup_count + 1
        article_count <- 0
      }
    } else {
      break  # Stop paging if no more results
    }
    
    Sys.sleep(1)  # Pause to respect API rate limits
  }
}

news_data <- do.call(rbind, all_articles)

news_data_clean <- data.frame(
  lapply(news_data, function(col) {
    if (is.list(col)) sapply(col, toString) else col
  }), 
  stringsAsFactors = FALSE
)
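One caveat: do.call(rbind, ...) errors out if different pages returned slightly different column sets. dplyr::bind_rows() (loaded with tidyverse) is a more forgiving alternative, padding missing columns with NA; a sketch:

# hypothetical alternative to do.call(rbind, ...): tolerates unequal columns
news_data <- bind_rows(all_articles)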

write.csv(news_data_clean, "C:\\Users\\sos523\\Downloads\\news_data_full.csv", row.names = FALSE)

The full dataset before cleaning can be found here: https://github.com/soheilshapouri/dimensional-emotion-mining-disaster-news/blob/main/news_data_full.csv It has 8,777 disaster-related news items along with their headlines, URLs, etc.

From here on, I did a lot of data cleaning and experimenting.

The cleaned dataset with extra columns is here: https://github.com/soheilshapouri/dimensional-emotion-mining-disaster-news/blob/main/news_merged1_v2.csv The v2 suffix means it differs from my own version: I removed the full texts of the articles to make sure copyrights are not violated.

Side note: I scraped the full text of the articles to count the number of people killed in each news piece; it turned out that in 43 out of 50 cases that number is reported in the title.
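For illustration, a crude version of that extraction with stringr (loaded with tidyverse); this naive pattern misses phrasings like "dozens dead", and the real cleaning was more involved:

titles <- c("Earthquake kills 32 in coastal town",
            "Flood leaves 7 dead, hundreds displaced")
# grab a number right after "kills" or right before "dead"
as.numeric(str_extract(titles, "(?i)(?<=kills )\\d+|\\d+(?= dead)"))
# 32 7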

OK. We now have thousands of news headlines with the number of people killed reported in them, and we can use GPT to estimate the valence of these headlines.

file_path <- choose.files()
news_merged1 <- read.csv(file_path)

news_merged1 %>% 
  glimpse()
# Add a column for GPT estimates
news_merged1$Valence_GPT <- NA

The same prompt that worked before:

few_shot_prompt <- '
Rate the emotional valence of the following sentence as if you were a human. 
Use a scale from 1 to 9, where 1 means very negative/unpleasant and 9 means very positive/pleasant. 
Use the examples below as a guide for how humans typically rate valence.

Sentence: "Fuck you"
Valence: 1

Sentence: "What kind of work does Goodwill do?"
Valence: 5

Sentence: "lol Wonderful Simply Superb!"
Valence: 9

Only respond with a single number from 1 to 9.
'
# get_valence() was already defined above
get_valence
for (i in 1:nrow(news_merged1)) {
  cat("Processing row", i, "\n")
  
  sentence <- news_merged1$title[i]  
  valence <- get_valence(sentence, few_shot_prompt, api_key)
  news_merged1$Valence_GPT[i] <- valence
  
  Sys.sleep(0.1)  # to avoid rate limits
  
  # Save a temporary copy every 500 rows or at the end
  if (i %% 500 == 0 || i == nrow(news_merged1)) {
    write_csv(news_merged1, "news_merged1_valence_progress.csv")
  }
}
# took about a second per row 
write_csv(news_merged1, 
          "C:\\Users\\sos523\\Dropbox (Lehigh University)\\Projects\\NLP\\Emotion mining\\news_merged2.csv")

Full dataset can be found here: https://github.com/soheilshapouri/dimensional-emotion-mining-disaster-news/blob/main/news_merged2_v2.csv

We can test whether valence ratings of natural and technological disasters are different.

# natural and technological
natural <- c(
  "lightning", "storm", "typhoon", "earthquake", "avalanche", "landslide", 
  "forest fire", "hurricane", "rockfall", "tornado", "wildfire", "tsunami", 
  "cyclone", "hail", "blizzard", "drought", "cold wave", "epidemic", 
  "heat wave", "derecho", "land fire", "infectious disease", "infestation", 
  "viral disease", "lava", "thunderstorms", "severe weather", "famine"
)

technological <- c(
  "train accident", "collision", "helicopter crash", "poisoning", 
  "building fire", "plane crash", "industrial accident", "bridge collapse", 
  "bus accident", "building collapse", "vehicle accident", "gas leak", 
  "boat accident", "car accident", "motorcycle crash", "ferry accident", 
  "oil spill", "bike accident", "industrial fire", "industrial explosion", 
  "chemical spill"
)
news_merged2 <- news_merged1

news_merged2$disaster_type <- ifelse(news_merged2$disaster_term %in% natural, "natural", "technological")

news_merged2$disaster_type <- as.factor(news_merged2$disaster_type)
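Before testing, a quick sanity check (my addition): any term missing from both vectors would be silently labeled "technological" by the ifelse() above, so it is worth confirming nothing slipped through.

# should return character(0); anything listed here was mislabeled
setdiff(unique(news_merged2$disaster_term), c(natural, technological))
table(news_merged2$disaster_type)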

Comparing the valence of natural and technological disasters:

hist(news_merged2$Valence_GPT[news_merged2$disaster_type == "natural"])
hist(news_merged2$Valence_GPT[news_merged2$disaster_type == "technological"])
t.test(Valence_GPT ~ disaster_type, data = news_merged2, na.action = na.omit, conf.level = .95)

While the difference between natural and technological disasters is statistically significant (p < .001), the 95% confidence interval indicates the difference in mean valence ratings is small (between 0.14 and 0.25). This range is negligible on a 1 to 9 scale, suggesting limited practical significance.
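To put a standardized number on "negligible", Cohen's d can be computed by hand (my addition); it expresses the raw difference in pooled-standard-deviation units, making it comparable across scales:

# pooled-SD Cohen's d for the natural vs. technological difference
v_nat  <- na.omit(news_merged2$Valence_GPT[news_merged2$disaster_type == "natural"])
v_tech <- na.omit(news_merged2$Valence_GPT[news_merged2$disaster_type == "technological"])
pooled_sd <- sqrt(((length(v_nat) - 1) * var(v_nat) + (length(v_tech) - 1) * var(v_tech)) /
                    (length(v_nat) + length(v_tech) - 2))
(mean(v_tech) - mean(v_nat)) / pooled_sd  # Cohen's d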

But we can also control for the number of people killed to see what happens.

reg <- lm(Valence_GPT ~ disaster_type + num_killed, data = news_merged2, na.action = na.omit)
summary(reg)
confint(reg, level =.95)

A linear regression was conducted to examine the effect of disaster type and number of people killed on the emotional valence of news headlines (GPT-estimated). The overall model was statistically significant, F(2, 5559) = 15.94, p < .001, but explained very little variance (R² = .006).

The type of disaster significantly predicted valence. Technological disasters were rated slightly more positive than natural disasters, B = 0.175, SE = 0.031, t(5559) = 5.65, p < .001, 95% CI [0.114, 0.236].

The number of people killed was not a significant predictor of valence, B = 1.16e-06, SE = 1.63e-06, t(5559) = 0.71, p = .479, 95% CI [–2.04e-06, 4.35e-06].
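Death tolls are extremely right-skewed, so the tiny linear coefficient is not surprising: a handful of mass-casualty events get enormous leverage. A log-transformed predictor would be a natural robustness check (my addition, not part of the original analysis):

reg_log <- lm(Valence_GPT ~ disaster_type + log1p(num_killed), data = news_merged2, na.action = na.omit)
summary(reg_log)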

In short: disaster type has a statistically significant but practically negligible effect on the GPT-estimated valence of disaster news headlines, and the death toll has no detectable effect.

References

Martínez, G., Molero, J. D., González, S., Conde, J., Brysbaert, M., & Reviriego, P. (2025). Using large language models to estimate features of multi-word expressions: Concreteness, valence, arousal. Behavior Research Methods, 57(1), 1-11.

Shapouri, S., Martin, L. L., & Arhami, O. (2023). Affective responses to natural and technological disasters: An evolutionary perspective. Adaptive Human Behavior and Physiology, 9(3), 308-322.

Buechel, S., & Hahn, U. (2017). EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), Volume 2, Short Papers (pp. 578-585). http://aclweb.org/anthology/E17-2092

Buechel, S., & Hahn, U. (2017). Readers vs. writers vs. texts: Coping with different perspectives of text understanding in emotion annotation. In Proceedings of the 11th Linguistic Annotation Workshop (LAW XI) @ EACL 2017 (pp. 1-12). https://sigann.github.io/LAW-XI-2017/papers/LAW01.pdf