Labelling groups of two columns in SQL (BigQuery SQL if possible) -


given table

     name  ip  = |a     1  |     |b     1  |     |c     1  |     |b     2  |     |c     2  |     |d     3  |     |e     2  | 

if 2 names share same ip belong in same group. ip same name belong in same group. if find names ip 1, {a, b, c}, should include ips associated {a,b,c} in same group {1,2} , again names ips aren't include {e} , forth. in particular example, in {a,b,c,e} x {1, 2} in same group. results above table be

     name  ip  group = |a     1     1    |     |b     1     1    |     |c     1     1    |     |b     2     1    |     |c     2     1    |     |d     3     2    |     |e     2     1    | 

just clear:

if names a, b, , c ip 1 grouped , should have

a, 1 = group1 b, 1 = group1 c, 1 = group1 

if names a, b share ip 2, should not make new group instead should should in same group this:

a, 1 = group1 b, 1 = group1 c, 1 = group1 a, 2 = group1 b, 2 = group1 

the goal solve in google bigquery sql.

so far have

select ip, row_number() on () group, group_concat(name,',') names, group ip 

which yields of names ip , gives group, doesn't find ips name or find group pairs encompasses names , ips.

note, can use split access names concatenated (in case ',').

update - called transitive closure. if difficult, sufficient show how first iteration of transitive closure (how find ips associated names associated each ip) , label these groups.

here solution first iteration. bit long , might improved, have.

step 1.

select name, nest(ip) ips, group_concat(string(ip)) sip  (select 'a' name, 1 ip), (select 'b' name, 1 ip), (select 'c' name, 1 ip), (select 'b' name, 2 ip), (select 'c' name, 2 ip), (select 'd' name, 3 ip), (select 'e' name, 2 ip) group name 

store results in temporary table x

step 2.

select a.name name, group_concat(b.name) cluster ( select a.name, b.name ( select a.*, b.* dataset.x cross join dataset.x b ) omit record if every(not b.sip contains string(a.ips)) group 1, 2 order 1, 2) group 1 

store results in temporary table y

step 3.

select cluster ( select group_concat(part) cluster ( select name, part ( select a.name name, split(b.cluster) part  dataset.y cross join dataset.y b b.cluster contains a.name) group 1, 2 order 1, 2)  group name) group cluster 

this should produce unique clusters, i.e.

a,b,c,e d 

Popular posts from this blog

c# - ODP.NET Oracle.ManagedDataAccess causes ORA-12537 network session end of file -

matlab - Compression and Decompression of ECG Signal using HUFFMAN ALGORITHM -

utf 8 - split utf-8 string into bytes in python -