regex - How to extract only person A's statements in a conversation between two persons A and B
I have a record of conversations between two arbitrary persons, A and B.
    c1 <- "person a: blabla...something person b: blabla else person a: ok blabla"
    c2 <- "person a: again blabla person b: blabla else person a: blabla"
The data frame looks like this:
    df <- data.frame(id = rbind(123, 345), conversation = rbind(c1, c2))
    df
        id                                                           conversation
    c1 123 person a: blabla...something person b: blabla else person a: ok blabla
    c2 345          person a: again blabla person b: blabla else person a: blabla
Now I want to extract only the part of person A and put it in a data frame. The result should be:
       id                     person_a
    1 123 blabla...something ok blabla
    2 345          again blabla blabla
I'm a big fan of solving this sort of problem in a way that gives access to all of the data (that includes person B's discourse as well). I love tidyr's extract for this sort of column splitting. I used to use a do.call(rbind, strsplit()) approach, but I love how clean the extract approach is.
    c1 <- "person a: blabla...something person b: blabla else person a: ok blabla"
    c2 <- "person a: again blabla person b: blabla else person a: blabla"
    c3 <- "person a: again blabla person b: blabla else"
    df <- data.frame(id = rbind(123, 345, 567), conversation = rbind(c1, c2, c3))

    if (!require("pacman")) install.packages("pacman")
    pacman::p_load(dplyr, tidyr)

    # Split each conversation into turns at whitespace followed by "person "
    conv <- strsplit(as.character(df[["conversation"]]), "\\s+(?=person\\s)", perl=TRUE)

    # Repeat each row once per turn, then fill in the individual turns
    df2 <- df[rep(1:nrow(df), sapply(conv, length)), , drop=FALSE]
    rownames(df2) <- NULL
    df2[["conversation"]] <- unlist(conv)

    # Separate the speaker from what was said
    df2 %>%
        extract(conversation, c("person", "conversation"), "([^:]+):\\s+(.+)")

    ##    id   person       conversation
    ## 1 123 person a blabla...something
    ## 2 123 person b        blabla else
    ## 3 123 person a          ok blabla
    ## 4 345 person a       again blabla
    ## 5 345 person b        blabla else
    ## 6 345 person a             blabla
    ## 7 567 person a       again blabla
    ## 8 567 person b        blabla else

    # Keep only person a's turns
    df2 %>%
        extract(conversation, c("person", "conversation"), "([^:]+):\\s+(.+)") %>%
        filter(person == "person a")

    ##    id   person       conversation
    ## 1 123 person a blabla...something
    ## 2 123 person a          ok blabla
    ## 3 345 person a       again blabla
    ## 4 345 person a             blabla
    ## 5 567 person a       again blabla
Or collapse them as you show in the desired output:
    df2 %>%
        extract(conversation, c("person", "conversation"), "([^:]+):\\s+(.+)") %>%
        filter(person == "person a") %>%
        group_by(id) %>%
        select(-person) %>%
        summarise(person_a = paste(conversation, collapse=" "))

    ##    id                     person_a
    ## 1 123 blabla...something ok blabla
    ## 2 345          again blabla blabla
    ## 3 567                 again blabla
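For comparison, the older do.call(rbind, strsplit()) approach mentioned at the top might look roughly like the sketch below. It is base R only and is not taken from the original answer: the intermediate names (turns, parts, long, a_only, result) are made up for illustration, and it assumes that every turn contains exactly one colon followed by whitespace, otherwise rbind would recycle pieces.

```r
c1 <- "person a: blabla...something person b: blabla else person a: ok blabla"
c2 <- "person a: again blabla person b: blabla else person a: blabla"
df <- data.frame(id = c(123, 345), conversation = c(c1, c2),
                 stringsAsFactors = FALSE)

# Split each conversation into turns at whitespace followed by "person "
turns <- strsplit(df$conversation, "\\s+(?=person\\s)", perl = TRUE)

# Split every turn at ": " and bind the pieces into a two-column matrix
# (assumes exactly one ": " per turn)
parts <- do.call(rbind, strsplit(unlist(turns), ":\\s+"))

# One row per turn, with the id repeated for each turn of a conversation
long <- data.frame(id = rep(df$id, lengths(turns)),
                   person = parts[, 1],
                   conversation = parts[, 2],
                   stringsAsFactors = FALSE)

# Keep person a's turns and paste them together per id
a_only <- long[long$person == "person a", ]
result <- aggregate(conversation ~ id, data = a_only, FUN = paste, collapse = " ")
names(result)[2] <- "person_a"
result
```

The long data frame still holds person b's turns, so the same object can be filtered the other way round if needed.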
Edit: In reality I suspect your data has real names like "John Smith" rather than "Person A". If that is the case, this initial regex split will capture a first and last name that uses caps followed by a colon:
    c1 <- "Greg Smith: blabla...something Sue Williams: blabla else Greg Smith: ok blabla"
    c2 <- "Greg Smith: again blabla Sue Williams: blabla else Greg Smith: blabla"
    c3 <- "Greg Smith: again blabla Sue Williams: blabla else"
    df <- data.frame(id = rbind(123, 345, 567), conversation = rbind(c1, c2, c3))

    # Split at whitespace followed by "Firstname Lastname:" (capitalized words)
    conv <- strsplit(as.character(df[["conversation"]]), "\\s+(?=([A-Z][a-z]+\\s+[A-Z][a-z]+:))", perl=TRUE)

    df2 <- df[rep(1:nrow(df), sapply(conv, length)), , drop=FALSE]
    rownames(df2) <- NULL
    df2[["conversation"]] <- unlist(conv)

    df2 %>%
        extract(conversation, c("person", "conversation"), "([^:]+):\\s+(.+)")

    ##    id       person       conversation
    ## 1 123   Greg Smith blabla...something
    ## 2 123 Sue Williams        blabla else
    ## 3 123   Greg Smith          ok blabla
    ## 4 345   Greg Smith       again blabla
    ## 5 345 Sue Williams        blabla else
    ## 6 345   Greg Smith             blabla
    ## 7 567   Greg Smith       again blabla
    ## 8 567 Sue Williams        blabla else
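To get from the named-speaker split to one row per id, the same filter and collapse steps used for "person a" should carry over. The following is a minimal sketch under a few assumptions: names are exactly two capitalized words, the variable speaker and the column name statements are hypothetical, and the data frame is built with c() rather than rbind() only to keep the example short.

```r
library(dplyr)
library(tidyr)

c1 <- "Greg Smith: blabla...something Sue Williams: blabla else Greg Smith: ok blabla"
c2 <- "Greg Smith: again blabla Sue Williams: blabla else Greg Smith: blabla"
c3 <- "Greg Smith: again blabla Sue Williams: blabla else"
df <- data.frame(id = c(123, 345, 567), conversation = c(c1, c2, c3),
                 stringsAsFactors = FALSE)

# Split at whitespace followed by "Firstname Lastname:"
conv <- strsplit(df$conversation, "\\s+(?=([A-Z][a-z]+\\s+[A-Z][a-z]+:))", perl = TRUE)

# One row per turn
df2 <- data.frame(id = rep(df$id, lengths(conv)),
                  conversation = unlist(conv),
                  stringsAsFactors = FALSE)

# Hypothetical: the speaker whose statements we want to keep
speaker <- "Greg Smith"

greg <- df2 %>%
    tidyr::extract(conversation, c("person", "conversation"), "([^:]+):\\s+(.+)") %>%
    filter(person == speaker) %>%
    group_by(id) %>%
    summarise(statements = paste(conversation, collapse = " "))

greg
```

Names that do not fit the two-capitalized-words pattern (middle initials, hyphenated or all-caps names) would need a looser split regex.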