This project implements a sophisticated search algorithm over a plain TSV file of input entries.
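The entries in the TSV file can be loaded into Python dicts, roughly the way `data_parsers.py` does it. This is only a sketch: the column names below are hypothetical, since the README does not show the dataset's real schema.

```python
import csv
import io

# Hypothetical sketch of parsing the input TSV into dicts; the column
# names ("id", "name") are assumptions, not the dataset's real schema.
SAMPLE_TSV = "id\tname\n1159309\tКonos фильтр бумажный для кофеварок №4\n"

def parse_tsv(text):
    """Read TSV text and return one dict per data row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

rows = parse_tsv(SAMPLE_TSV)
print(rows[0]["name"])
```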
Start the REST API server:
python rest-api.py --verbose --w2v /home/dima/models/ArModel100w2v.txt --port 8080 --debug
Parameters:
- `--verbose`: controls how much is printed while processing the query (e.g. the output of tokens and lemmas)
- `--w2v`: path to the `w2v` model used for looking up similar words (if not set, the `w2v` model is not loaded and, as a consequence, the server starts much faster); other embeddings (such as BERT, ELMo, fastText, and so on) may become available in the future
- `--port`: port on which the web server runs
- `--debug`: passed directly to the `app.run` method of the `flask` module
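A minimal sketch of how `rest-api.py` might wire these flags up with `argparse`; the actual option handling in the repository may differ, and the Flask startup is only indicated in a comment:

```python
import argparse

def parse_args(argv=None):
    """Parse the command-line flags accepted by the server (sketch)."""
    parser = argparse.ArgumentParser(description="Search REST API server")
    parser.add_argument("--verbose", action="store_true",
                        help="print tokens and lemmas while processing queries")
    parser.add_argument("--w2v", default=None,
                        help="path to the w2v model; omit to start much faster")
    parser.add_argument("--port", type=int, default=8080,
                        help="port the web server listens on")
    parser.add_argument("--debug", action="store_true",
                        help="forwarded to Flask's app.run()")
    return parser.parse_args(argv)

args = parse_args(["--verbose", "--port", "8080"])
# app.run(port=args.port, debug=args.debug)  # how rest-api.py would start Flask
```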
curl --request POST \
--url http://localhost:8080/search \
--header 'content-type: application/json' \
--data '{
"query": "грачи улятели",
"fix-misspellings": true,
"use-embeddings": false,
"similar-tokens-score-weight": 1.5,
"similar-tokens-quantity": 2,
"products-quantity": 10,
"similar-products-quantity": 5,
"min-word-difference-ratio": 90
}' | python -m json.tool | ascii2uni -a U -q
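The same request can be issued from Python using only the standard library. This assumes the server above is running locally on port 8080; the network call itself is left commented out:

```python
import json
from urllib.request import Request, urlopen

# Same request body as the curl example above.
payload = {
    "query": "грачи улятели",
    "fix-misspellings": True,
    "use-embeddings": False,
    "similar-tokens-score-weight": 1.5,
    "similar-tokens-quantity": 2,
    "products-quantity": 10,
    "similar-products-quantity": 5,
    "min-word-difference-ratio": 90,
}
req = Request(
    "http://localhost:8080/search",
    data=json.dumps(payload).encode("utf-8"),
    headers={"content-type": "application/json"},
    method="POST",
)
# Uncomment when the server is running:
# with urlopen(req) as resp:
#     results = json.load(resp)
#     for url, product in results.items():
#         print(url, product["score"], product["name"])
```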
Parameters:
- `query`: the search query to use
- `fix-misspellings`: whether or not to use the module that fixes spelling mistakes by applying Levenshtein distance
- `min-word-difference-ratio`: the minimal similarity score between a word to fix and a candidate replacement that must be reached for the replacement to make sense
- `use-embeddings`: whether or not to apply embeddings when searching for similar tokens
- `similar-tokens-score-weight`: how much score one match of a lemma or token contributes when searching for similar words
- `similar-tokens-quantity`: how many similar tokens to consider
- `products-quantity`: how many products to return as the result of the direct search
- `similar-products-quantity`: how many products to return as the result of the search by similar words found using embeddings

Response example:
{
"http://tender.hack.spb/1159309": {
"name": "Кonos фильтр бумажный для кофеварок №4, 80 шт (Folie)",
"score": 1,
"keywords": [
"бумажный"
]
},
"http://tender.hack.spb/1100321": {
"name": "Полотенца бумажные 110 шт., KIMBERLY-CLARK Scott, комплект 16 шт., Slimfold, белые, 29,5х19 см, М-fold, диспенсер 601535, АРТ.5856",
"score": 1,
"keywords": [
"бумажный"
]
},
"http://tender.hack.spb/1159308": {
"name": "Кonos фильтр бумажный для кофеварок №4, 100 шт (Folie)",
"score": 1,
"keywords": [
"бумажный"
]
},
"http://tender.hack.spb/1159304": {
"name": "Кonos фильтр бумажный для кофеварки №4, 100 шт",
"score": 1,
"keywords": [
"бумажный"
]
},
"http://tender.hack.spb/1161231": {
"name": "Бумажные фильтр-мешки Karcher 5 шт. 6.904-322.0",
"score": 1,
"keywords": [
"бумажный"
]
}
}

Scripts:

- `make-graph.py`: converts data from TSV format to TTL
- `good_inqury.py`: fixes spelling mistakes
- `data_parsers.py`: parses TSV rows and converts them to dicts
- `calculate-name-embeddings.py`: calculates embedding vectors of product names with ELMo as the base model (requires a lot of computing power)
- `load_elmo_embeddings.py`: tests that the ELMo embeddings were saved properly
- `preprocess-user-input.py`: the main module that processes the user's query: splits it into tokens, lemmatizes them, and connects to the other modules
- `rest-api.py`: code for starting the `flask` HTTP server
- `run-remote-sparql-query.py`: basic example of running a simple SPARQL query against a remote Blazegraph instance
- `run-sparql-query.py`: basic example of running a simple SPARQL query against a local Blazegraph instance
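As an illustration of the idea behind `good_inqury.py` and the `min-word-difference-ratio` parameter, here is a sketch using `difflib` from the standard library. The toy vocabulary, the 0-100 score scaling, and the replacement rule are assumptions for demonstration, not the module's real implementation (which uses Levenshtein distance):

```python
from difflib import SequenceMatcher

# Toy vocabulary for demonstration only; the real module would look up
# words in the product dataset.
VOCABULARY = ["грачи", "улетели", "фильтр", "бумажный"]

def similarity(a, b):
    """Return a 0-100 similarity score between two words (assumption:
    the real module scores with Levenshtein distance, not difflib)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def fix_token(token, min_ratio=90):
    """Replace token with its closest vocabulary word if that word's
    score reaches min_ratio; otherwise keep the token unchanged."""
    best = max(VOCABULARY, key=lambda w: similarity(token, w))
    return best if similarity(token, best) >= min_ratio else token

print(fix_token("улятели", min_ratio=80))  # prints улетели
```

With a stricter threshold (e.g. 90, as in the request example above) the same misspelling would be left as-is, which is the trade-off the `min-word-difference-ratio` parameter controls.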