Labelling groups of two columns in SQL (BigQuery SQL if possible) -
given table
name ip = |a 1 | |b 1 | |c 1 | |b 2 | |c 2 | |d 3 | |e 2 |
if 2 names share same ip belong in same group. ip same name belong in same group. if find names ip 1, {a, b, c}, should include ips associated {a,b,c} in same group {1,2} , again names ips aren't include {e} , forth. in particular example, in {a,b,c,e} x {1, 2} in same group. results above table be
name ip group = |a 1 1 | |b 1 1 | |c 1 1 | |b 2 1 | |c 2 1 | |d 3 2 | |e 2 1 |
just clear:
if names a, b, , c ip 1 grouped , should have
a, 1 = group1 b, 1 = group1 c, 1 = group1
if names a, b share ip 2, should not make new group instead should should in same group this:
a, 1 = group1 b, 1 = group1 c, 1 = group1 a, 2 = group1 b, 2 = group1
the goal solve in google bigquery sql.
so far have
select ip, row_number() on () group, group_concat(name,',') names, group ip
which yields of names ip , gives group, doesn't find ips name or find group pairs encompasses names , ips.
note, can use split access names concatenated (in case ',').
update - called transitive closure. if difficult, sufficient show how first iteration of transitive closure (how find ips associated names associated each ip) , label these groups.
here solution first iteration. bit long , might improved, have.
step 1.
select name, nest(ip) ips, group_concat(string(ip)) sip (select 'a' name, 1 ip), (select 'b' name, 1 ip), (select 'c' name, 1 ip), (select 'b' name, 2 ip), (select 'c' name, 2 ip), (select 'd' name, 3 ip), (select 'e' name, 2 ip) group name
store results in temporary table x
step 2.
select a.name name, group_concat(b.name) cluster ( select a.name, b.name ( select a.*, b.* dataset.x cross join dataset.x b ) omit record if every(not b.sip contains string(a.ips)) group 1, 2 order 1, 2) group 1
store results in temporary table y
step 3.
select cluster ( select group_concat(part) cluster ( select name, part ( select a.name name, split(b.cluster) part dataset.y cross join dataset.y b b.cluster contains a.name) group 1, 2 order 1, 2) group name) group cluster
this should produce unique clusters, i.e.
a,b,c,e d