dplyr - Which R function to use for text auto-correction?


I have a CSV document with two columns: commodity category and commodity name.

Example:

    sl.no. commodity category commodity name
    1      stationary         pencil
    2      stationary         pen
    3      stationary         marker
    4      office utensils    chair
    5      office utensils    drawer
    6      hardware           monitor
    7      hardware           cpu

I also have a CSV file that contains various commodity names.

Example:

    sl.no. commodity name
    1      pancil
    2      pencil-hb 02
    3      pencil-apsara
    4      pancil-nataraj
    5      pen-parker
    6      pen-reynolds
    7      monitor-x001rl

The output should standardise and categorise the commodity names and classify them into their respective commodity categories, as shown below:

    sl.no. commodity name   commodity category
    1      pencil           stationary
    2      pencil           stationary
    3      pencil           stationary
    4      pencil           stationary
    5      pen              stationary
    6      pen              stationary
    7      monitor          hardware

Step 1) First I have to use NLTK (text mining methods) to clean the data and separate "pencil" from "pencil-hb 02".

Step 2) After cleaning, I have to use an approximate string matching technique, i.e. agrep(), to match patterns such as "pencil *" or to correct "pancil" to "pencil".

Step 3) Once the pattern is corrected, I have to categorise it. I have no idea how.

This is what I have thought about. I started with step 2, and I'm stuck in step 2 only. I'm not finding an exact method to code this. Is there any way to get the required output? If yes, please suggest a method I can proceed with.

You can use the stringdist package. The correct function below corrects commodity.name in file2 based on the distance of each item to the different names in cname.

Then a join is used to combine the two tables (inner_join in the first example, left_join in the second).

I notice there are misclassifications if I use the default options of stringdistmatrix. You can try changing the weight argument of stringdistmatrix for a better correction result.

    > library(dplyr)
    > library(stringdist)
    >
    > # stringsAsFactors = TRUE keeps the name columns as factors (the default
    > # before R 4.0.0), which levels() below relies on
    > file1 <- read.csv("/users/randy/desktop/file1.csv", stringsAsFactors = TRUE)
    > file2 <- read.csv("/users/randy/desktop/file2.csv", stringsAsFactors = TRUE)
    >
    > head(file1)
      sl.no. commodity.category commodity.name
    1      1         stationary         pencil
    2      2         stationary            pen
    3      3         stationary         marker
    4      4    office utensils          chair
    5      5    office utensils         drawer
    6      6           hardware        monitor
    > head(file2)
      sl.no. commodity.name
    1      1         pancil
    2      2   pencil-hb 02
    3      3  pencil-apsara
    4      4 pancil-nataraj
    5      5     pen-parker
    6      6   pen-reynolds
    >
    > # reference names to correct towards
    > cname <- levels(file1$commodity.name)
    > # for each input, return the reference name with the smallest distance;
    > # weight order is deletion, insertion, substitution, transposition
    > correct <- function(x){
    +     factor(sapply(x, function(z) cname[which.min(stringdistmatrix(z, cname, weight=c(1,0.1,1,1)))]), cname)
    + }
    >
    > correctedfile2 <- file2 %>%
    + transmute(commodity.name.old = commodity.name, commodity.name = correct(commodity.name))
    >
    > # attach the category (drop the sl.no. column of file1 before joining)
    > correctedfile2 %>%
    + inner_join(file1[,-1], by="commodity.name")
      commodity.name.old commodity.name commodity.category
    1             pancil         pencil         stationary
    2       pencil-hb 02         pencil         stationary
    3      pencil-apsara         pencil         stationary
    4     pancil-nataraj         pencil         stationary
    5         pen-parker            pen         stationary
    6       pen-reynolds            pen         stationary
    7     monitor-x001rl        monitor           hardware
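The question's step 1 (separating "pencil" from "pencil-hb 02") can also be done before matching, which may reduce how much the result depends on the weights. The following is only a rough sketch, not the method used above: it rebuilds the question's two tables inline instead of reading the CSV files, and it assumes the product name is whatever precedes the first hyphen. That rule and the `base_name` and `nearest` variables are illustrative assumptions.

```r
library(stringdist)
library(dplyr)

# Inline copies of the question's two tables, so no CSV files are needed.
file1 <- data.frame(
  commodity.category = c("stationary", "stationary", "stationary",
                         "office utensils", "office utensils",
                         "hardware", "hardware"),
  commodity.name = c("pencil", "pen", "marker", "chair", "drawer",
                     "monitor", "cpu"),
  stringsAsFactors = FALSE
)
file2 <- data.frame(
  commodity.name = c("pancil", "pencil-hb 02", "pencil-apsara",
                     "pancil-nataraj", "pen-parker", "pen-reynolds",
                     "monitor-x001rl"),
  stringsAsFactors = FALSE
)

# Step 1 (assumption): keep only the text before the first hyphen.
base_name <- trimws(tolower(sub("-.*$", "", file2$commodity.name)))

# Step 2: for each cleaned name, pick the reference name with the
# smallest OSA distance, using the default unit weights.
cname <- file1$commodity.name
d <- stringdistmatrix(base_name, cname, method = "osa")
nearest <- cname[apply(d, 1, which.min)]

# Step 3: attach the category by joining on the corrected name.
result <- data.frame(commodity.name.old = file2$commodity.name,
                     commodity.name = nearest,
                     stringsAsFactors = FALSE) %>%
  left_join(file1, by = "commodity.name")
print(result)
```

Real data will probably need a more careful cleaning rule than a single hyphen split, so treat the `sub()` pattern as a starting point.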

If you need an "others" category, you need to play with the weights. I added a row "diesel" to file2. Then compute the score using stringdist with customized weights (you should try varying the values). If the score is larger than 2 (this value is related to how the weights are assigned), it doesn't correct anything.

PS: Since we don't know all the possible labels, we have to use as.character to convert the factor to character.

PS2: I am also using tolower for case-insensitive scoring.

    > head(file2)
      sl.no. commodity.name
    1      1         diesel
    2      2         pancil
    3      3   pencil-hb 02
    4      4  pencil-apsara
    5      5 pancil-nataraj
    6      6     pen-parker
    >
    > cname <- levels(file1$commodity.name)
    > cname.lower <- tolower(cname)
    > # correct a single name; leave it unchanged when no reference name is close
    > correct_1 <- function(x){
    +     scores = stringdistmatrix(tolower(x), cname.lower, weight=c(1,0.001,1,0.5))
    +     if (min(scores)>2) {
    +         return(x)
    +     } else {
    +         return(as.character(cname[which.min(scores)]))
    +     }
    + }
    > # apply correct_1 to every element of a column
    > correct <- function(x) {
    +     sapply(as.character(x), correct_1)
    + }
    >
    > correctedfile2 <- file2 %>%
    + transmute(commodity.name.old = commodity.name, commodity.name = correct(commodity.name))
    >
    > # join on character columns, since the corrected names may not be known labels
    > file1$commodity.name = as.character(file1$commodity.name)
    > correctedfile2 %>%
    + left_join(file1[,-1], by="commodity.name")
      commodity.name.old commodity.name commodity.category
    1             diesel         diesel               <NA>
    2             pancil         pencil         stationary
    3       pencil-hb 02         pencil         stationary
    4      pencil-apsara         pencil         stationary
    5     pancil-nataraj         pencil         stationary
    6         pen-parker            pen         stationary
    7       pen-reynolds            pen         stationary
    8     monitor-x001rl        monitor           hardware
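The stringdist package also provides amatch(), which does the nearest-match lookup and the cutoff in one call: it returns the index of the closest table entry, or NA when nothing is within maxDist. Below is a minimal sketch of that alternative. It assumes the names have already been reduced to their base form (no "-hb 02" style suffix), and the cutoff of 2 edits with unit weights is an illustrative choice, not a recommended value.

```r
library(stringdist)

# Reference names from the question's first table.
cname <- c("pencil", "pen", "marker", "chair", "drawer", "monitor", "cpu")

# Already-cleaned inputs (assumption: suffixes were stripped beforehand).
x <- c("diesel", "pancil", "pen", "monitor")

# Index of the closest reference name, or NA if none is within 2 edits.
idx <- amatch(tolower(x), cname, method = "osa", maxDist = 2)

# Keep the original string where nothing was close enough ("others").
corrected <- ifelse(is.na(idx), x, cname[idx])
print(corrected)
```

Rows left uncorrected can then be picked out with is.na(idx) and assigned to an "others" category after the join.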
