dplyr - Which R function to use for text auto-correction?
I have a CSV document with 2 columns containing the commodity category and the commodity name.
Ex:
sl.no.  commodity category  commodity name
1       stationary          pencil
2       stationary          pen
3       stationary          marker
4       office utensils     chair
5       office utensils     drawer
6       hardware            monitor
7       hardware            cpu
And I have another CSV file that contains various commodity names.
Ex:
sl.no.  commodity name
1       pancil
2       pencil-hb 02
3       pencil-apsara
4       pancil-nataraj
5       pen-parker
6       pen-reynolds
7       monitor-x001rl
The output should standardise the commodity names and classify them into their respective commodity categories, as shown below:
sl.no.  commodity name  commodity category
1       pencil          stationary
2       pencil          stationary
3       pencil          stationary
4       pencil          stationary
5       pen             stationary
6       pen             stationary
7       monitor         hardware
Step 1) First I have to use NLTK (text mining methods) to clean the data, separating "pencil" from "pencil-hb 02".
Step 2) After cleaning, I have to use an approximate string matching technique, i.e. agrep(), to match patterns such as "pencil *" or to correct "pancil" to "pencil".
Step 3) Once the patterns are corrected, I have to categorise them. I have no idea how.
This is what I have thought about. I started with step 2 and I'm stuck there; I can't find an exact method to code this. Is there a way to get the required output? If yes, please suggest a method I can proceed with.
You can use the stringdist package. The correct function below corrects commodity.name in file2 based on the distance of each item to the different values in cname.
Then a dplyr join (inner_join here, left_join in the second example) is used to join the two tables.
I noticed there are some misclassifications if you use the default options of stringdistmatrix. You can try changing the weight argument of stringdistmatrix for a better correction result.
> library(dplyr)
> library(stringdist)
>
> # stringsAsFactors = TRUE keeps commodity.name a factor so that levels() works below
> # (since R 4.0.0, read.csv() no longer creates factors by default).
> file1 <- read.csv("/users/randy/desktop/file1.csv", stringsAsFactors = TRUE)
> file2 <- read.csv("/users/randy/desktop/file2.csv", stringsAsFactors = TRUE)
>
> head(file1)
  sl.no. commodity.category commodity.name
1      1         stationary         pencil
2      2         stationary            pen
3      3         stationary         marker
4      4    office utensils          chair
5      5    office utensils         drawer
6      6           hardware        monitor
> head(file2)
  sl.no. commodity.name
1      1         pancil
2      2   pencil-hb 02
3      3  pencil-apsara
4      4 pancil-nataraj
5      5     pen-parker
6      6   pen-reynolds
>
> # Reference names taken from file1.
> cname <- levels(file1$commodity.name)
> # Replace each name by the reference name with the smallest string distance.
> correct <- function(x){
+   factor(sapply(x, function(z) cname[which.min(stringdistmatrix(z, cname, weight=c(1,0.1,1,1)))]), cname)
+ }
>
> correctedfile2 <- file2 %>%
+   transmute(commodity.name.old = commodity.name, commodity.name = correct(commodity.name))
>
> # Drop the sl.no. column of file1 and attach the category by corrected name.
> correctedfile2 %>%
+   inner_join(file1[,-1], by="commodity.name")
  commodity.name.old commodity.name commodity.category
1             pancil         pencil         stationary
2       pencil-hb 02         pencil         stationary
3      pencil-apsara         pencil         stationary
4     pancil-nataraj         pencil         stationary
5         pen-parker            pen         stationary
6       pen-reynolds            pen         stationary
7     monitor-x001rl        monitor           hardware
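To see what the weight argument changes, it can help to print the distance table under two weightings side by side. The following is a minimal sketch, not part of the original code: the reference names are typed inline from the file1 example, dist_table is a hypothetical helper, and the second weight vector is simply the one used above. Which of the insertion and deletion weights applies to the extra characters depends on the argument order, so check the printed tables on your installed stringdist version rather than assuming.

```r
library(stringdist)

# Reference names as in the file1 example (typed inline so no CSV is needed).
cname <- c("pencil", "pen", "marker", "chair", "drawer", "monitor", "cpu")
# A few raw names as in the file2 example.
x <- c("pancil", "pencil-hb 02", "pen-parker")

# Hypothetical helper: distances from every raw name (rows)
# to every reference name (columns) for a given weight vector.
dist_table <- function(x, cname, weight) {
  d <- stringdistmatrix(x, cname, method = "osa", weight = weight)
  dimnames(d) <- list(x, cname)
  d
}

d_default <- dist_table(x, cname, weight = c(1, 1, 1, 1))    # unit weights
d_custom  <- dist_table(x, cname, weight = c(1, 0.1, 1, 1))  # weights used above

print(d_default)
print(d_custom)

# Reference name with the smallest distance under each weighting.
pick_default <- cname[apply(d_default, 1, which.min)]
pick_custom  <- cname[apply(d_custom, 1, which.min)]
print(data.frame(x, pick_default, pick_custom))
```

Lowering a weight can only keep or reduce a distance, so comparing pick_default with pick_custom shows which names change their nearest match when the weights are tuned.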
If you need an "others" category, you need to play with the weights. I added a row "diesel" to file2. The correct_1 function below computes the scores using stringdist with customized weights (you should try varying the values). If the smallest score is larger than 2 (this value is related to how the weights are assigned), it doesn't correct anything.
PS: Since we don't know all the possible labels, we have to use as.character to convert the factor to character.
PS2: I am also using tolower for case-insensitive scoring.
> head(file2)
  sl.no. commodity.name
1      1         diesel
2      2         pancil
3      3   pencil-hb 02
4      4  pencil-apsara
5      5 pancil-nataraj
6      6     pen-parker
>
> cname <- levels(file1$commodity.name)
> cname.lower <- tolower(cname)
> correct_1 <- function(x){
+   scores = stringdistmatrix(tolower(x), cname.lower, weight=c(1,0.001,1,0.5))
+   if (min(scores)>2) {
+     return(x)
+   } else {
+     return(as.character(cname[which.min(scores)]))
+   }
+ }
> correct <- function(x) {
+   sapply(as.character(x), correct_1)
+ }
>
> correctedfile2 <- file2 %>%
+   transmute(commodity.name.old = commodity.name, commodity.name = correct(commodity.name))
>
> file1$commodity.name = as.character(file1$commodity.name)
> correctedfile2 %>%
+   left_join(file1[,-1], by="commodity.name")
  commodity.name.old commodity.name commodity.category
1             diesel         diesel               <NA>
2             pancil         pencil         stationary
3       pencil-hb 02         pencil         stationary
4      pencil-apsara         pencil         stationary
5     pancil-nataraj         pencil         stationary
6         pen-parker            pen         stationary
7       pen-reynolds            pen         stationary
8     monitor-x001rl        monitor           hardware
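The three steps from the question (clean, approximate match, categorise) can also be sketched in base R only. This is an illustration under stated assumptions rather than the method used above: it swaps stringdist for utils::adist with unit costs, assumes the variant part always follows the first hyphen, and uses a hypothetical cutoff max_dist = 2. The data are typed inline from the examples, and unmatched names are left as NA here instead of keeping the raw name.

```r
# Reference table, typed inline from the file1 example.
file1 <- data.frame(
  commodity.category = c("stationary", "stationary", "stationary",
                         "office utensils", "office utensils",
                         "hardware", "hardware"),
  commodity.name = c("pencil", "pen", "marker", "chair", "drawer",
                     "monitor", "cpu"),
  stringsAsFactors = FALSE
)
# Raw names, typed inline from the file2 example (including "diesel").
raw <- c("diesel", "pancil", "pencil-hb 02", "pencil-apsara",
         "pancil-nataraj", "pen-parker", "pen-reynolds", "monitor-x001rl")

# Step 1: keep only the text before the first hyphen
# (assumption: names look like "name-variant").
stem <- tolower(trimws(sub("-.*$", "", raw)))

# Step 2: unit-cost Levenshtein distance from each stem (rows)
# to each reference name (columns), then pick the nearest one.
d <- adist(stem, tolower(file1$commodity.name))
nearest <- apply(d, 1, which.min)
best <- d[cbind(seq_along(stem), nearest)]

max_dist <- 2  # hypothetical cutoff; larger distances are left unmatched
matched <- ifelse(best <= max_dist, file1$commodity.name[nearest], NA_character_)

# Step 3: look up the category of each matched name.
result <- data.frame(
  commodity.name.old = raw,
  commodity.name = matched,
  commodity.category = file1$commodity.category[match(matched, file1$commodity.name)],
  stringsAsFactors = FALSE
)
print(result)
```

Splitting on the hyphen does most of the work for this sample, so the distance step only has to absorb small typos such as "pancil". Names that carry the variant after a space or in another form would need a different cleaning rule.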