90 commits
902f7ec
move config to config.py and use it to simplify get_user_ids.py
xaiki Aug 11, 2018
9073397
use config.py in streaming.py and use getopt to pass args
xaiki Aug 11, 2018
a564e32
rename save_to_db to db_mysql and port functions from get_user_ids
xaiki Aug 11, 2018
88067bf
make get_user_ids use getopts
xaiki Aug 11, 2018
6c5f484
add sqlite driver
xaiki Aug 11, 2018
beb3b66
add elasticsearch driver
xaiki Aug 11, 2018
55c0332
Allow following users and hashtags (track)
xaiki Aug 11, 2018
ee9e8f0
add requirements.txt
xaiki Aug 11, 2018
e8eb572
documentation update in README.md
xaiki Aug 11, 2018
12ba35b
move db files to their own directory
xaiki Aug 17, 2018
9c73a88
port streaming to new DB code,
xaiki Aug 17, 2018
5e7b720
detect connection errors and retry
xaiki Aug 17, 2018
cabf63f
-f is for config file
xaiki Aug 17, 2018
ed32799
PYNX: normalize users too
xaiki Aug 18, 2018
d4161e9
move arg parsing to config.py and use it in streaming.py
xaiki Aug 18, 2018
08d2f8b
ignore .pyc files
xaiki Aug 18, 2018
30b2af3
config: unify load_* function's naming: use filename for all
xaiki Aug 18, 2018
a5280db
config: implement load_csv and introduce CSV_FILE config option
xaiki Aug 18, 2018
5e74f4e
config: rename user option to users
xaiki Aug 18, 2018
da0fe99
get_user_ids: use new config code
xaiki Aug 18, 2018
f4c94a5
port monitoring to new config system
xaiki Aug 18, 2018
af8759b
port screenshot code to new config API
xaiki Aug 18, 2018
372320d
fix sqlite driver
xaiki Jan 14, 2019
e3207c7
allow for empty ids or tracks
xaiki Jan 14, 2019
0b87ae4
sqlite fix
xaiki Apr 30, 2019
9a84af6
switch config to argparse
xaiki Apr 30, 2019
713e65d
pep8
xaiki Apr 30, 2019
cd6a1fa
port DB to python3
xaiki Aug 28, 2020
28d5ec3
db:sqlite: fix import
xaiki Aug 28, 2020
8792ef8
db:sqlite: only create table once
xaiki Aug 28, 2020
44a5478
db:sqlite: port to python3
xaiki Aug 28, 2020
a1314bb
db:sqlite: sanitize text input by passing to sqlite direct python obj…
xaiki Aug 28, 2020
e161c65
config: process default config and dbs, add last config term as cmdli…
xaiki Aug 28, 2020
d2746c0
streaming: fix code and add error handling
xaiki Aug 28, 2020
7a71645
2to3
xaiki Aug 28, 2020
63e5673
2to3
xaiki Aug 28, 2020
742c031
add little config tool
xaiki Aug 28, 2020
397ae35
update README
xaiki Aug 28, 2020
06f126b
config: remove debug
xaiki Aug 28, 2020
0757c2e
i never get markdown
xaiki Aug 28, 2020
06f9a69
fix get_user_ids
xaiki Aug 28, 2020
72bad01
make ids nargs too
xaiki Aug 28, 2020
3d57f1d
migrate to logging
xaiki Aug 28, 2020
0c52462
add tsv driver and make it default
xaiki Aug 28, 2020
f1591f7
allow to pass usernames to commandline
xaiki Aug 28, 2020
56aa5b3
get_user_ids is now a bit weird, but will leave it like that
xaiki Aug 28, 2020
c285879
support multiple DB outputs
xaiki Aug 28, 2020
3b22fcd
update pynx DB
xaiki Sep 10, 2020
df570ba
config: much cleaner and generic driver import
xaiki Sep 10, 2020
1544c2a
config: better logging, that actually helps
xaiki Sep 10, 2020
9ce600c
DB/generic: cleaner generic method interception
xaiki Sep 10, 2020
1c152a6
DB/generic: now that all goes through multi, really generic is about …
xaiki Sep 10, 2020
93916d0
DB: s/save/saveTweet to be more consistent and open the way to save o…
xaiki Sep 10, 2020
7167e1f
DB/sqlite: refactor and introduce saveAuthor
xaiki Sep 10, 2020
dbcf310
DB/tsv: actually use filename to write
xaiki Sep 10, 2020
9221b80
DB/tsv: actual fix
xaiki Sep 10, 2020
52ed160
config: dbs actually needed more love to make multi work
xaiki Sep 10, 2020
3338f3b
streaming: fix signal_handler
xaiki Sep 10, 2020
011c68e
use coloredlogs for more readability
xaiki Sep 10, 2020
ba7a9e7
DB/multi: separate getattr and fn call so that we properly catch errors
xaiki Sep 10, 2020
506cbb9
DB/sqlite: fix error reporting in execute
xaiki Sep 10, 2020
211ce28
DB/sqlite: fix saveAuthor upsert syntax
xaiki Sep 10, 2020
8f82dcc
streaming: saveAuthor
xaiki Sep 10, 2020
db13393
config: properly implement debug and coloredlogs
xaiki Sep 10, 2020
19ca438
DB: generic getAuthor should raise an exception
xaiki Sep 10, 2020
f9b7d2b
DB: pynx, split out full text extractor, should probably go to utils
xaiki Sep 10, 2020
b0f3739
DB: pynx, trivial saveAuthor
xaiki Sep 10, 2020
ecf95ec
DB: sqlite, implement getAuthor and simplify saveAuthor
xaiki Sep 10, 2020
9ac3a00
DB: pynx, make add_tags return tags
xaiki Sep 10, 2020
8e2e0bf
DB/config: allow for passing instantiated DBs to config
xaiki Sep 10, 2020
1342574
streaming: add debug option
xaiki Sep 10, 2020
74a718b
get_user_ids: cache ids in SQLITE by default
xaiki Sep 10, 2020
9d9c017
config: remove debug code
xaiki Sep 10, 2020
0288e31
get_user_ids: replace class nonsense with SimpleNamespace
xaiki Sep 10, 2020
6f67a71
DB/generic: back to nothing
xaiki Sep 10, 2020
66602a0
DB/sqlite: properly handle no results in getAuthor
xaiki Sep 10, 2020
ffb8d0d
config: cleaner db loading
xaiki Sep 10, 2020
baa5380
get_user_ids: bugfix, SimpleNamespace actually needs an import
xaiki Sep 10, 2020
e950c64
get_user_ids: force sqlite db if the provided one doesn't support aut…
xaiki Sep 10, 2020
bb1ee66
rewrite: proper caching, and bulk fetch
xaiki Sep 12, 2020
75954c7
DB/sqlite: don't require authors to be NOT NULL
xaiki Sep 12, 2020
318cb95
config: row and csv, better logic and logging
xaiki Sep 12, 2020
0e42e27
get_user_ids: actually process csv argument
xaiki Sep 12, 2020
07fedf1
get_user_ids: all lower and tsv output
xaiki Sep 12, 2020
fa7ba78
add pydictor/get_user_ids example
xaiki Sep 12, 2020
47ec95f
config: FetchUsersAction: support for not having a db in config object
xaiki Sep 12, 2020
388ebe0
config: make USERS add elements to 'ids' and add USERS_NOFETCH to byp…
xaiki Sep 12, 2020
8a6c7f5
introduce utils.twitter_login and use it in streaming and get_user_ids
xaiki Sep 12, 2020
869e1be
implement blocking, supports CSV files
xaiki Sep 12, 2020
514a8ff
Rewrite for testing
xaiki Sep 14, 2020
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
*.pyc
Empty file added DB/__init__.py
Empty file.
11 changes: 11 additions & 0 deletions DB/generic.py
@@ -0,0 +1,11 @@
import logging
import os

class DB:
    def __init__(self):
        self.name = "Generic DB Driver"

    def _WIPE(self):
        self.close()
        os.remove(self.filename)
        self.open()
38 changes: 38 additions & 0 deletions DB/multi.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
from . import generic
import logging

class Driver(generic.DB):
    def __init__(self, databases = []):
        self.name = "Multiple Dispatch DB Driver"

        if isinstance(databases, list):
            self.dbs = databases
        else:
            self.dbs = [databases]

        logging.debug(self.dbs)

    def __getattribute__(self, name):
        try:
            return object.__getattribute__(self, name)
        except AttributeError:
            pass

        def wrapper(*args, **kwargs):
            for d in self.dbs:
                logging.debug(f"{d} -> {name}({args})")
                fn = None
                try:
                    fn = getattr(d, name)
                except AttributeError:
                    logging.warning(f"{d} has no attribute {name}")

                if fn: fn(*args, **kwargs)

        return wrapper

    def getTweets(self):
        return self.dbs[0].getTweets()

    def _WIPE(self):
        return [d._WIPE() for d in self.dbs]
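Review note on DB/multi.py: the driver forwards any method it does not define itself to every wrapped database. The sketch below shows the same fan-out idea in isolation, assuming `__getattr__` (which Python only calls after normal lookup fails) in place of the `__getattribute__` override used in the diff. `FanOut`, `MemoryBackend`, and `ReadOnlyBackend` are hypothetical names made up for illustration; they are not part of this PR.

```python
import logging


class FanOut:
    """Forward unknown method calls to every wrapped backend (hypothetical class)."""

    def __init__(self, backends):
        self.backends = list(backends)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, so real
        # attributes such as self.backends are never intercepted.
        if name.startswith("__"):
            raise AttributeError(name)

        def wrapper(*args, **kwargs):
            results = []
            for backend in self.backends:
                fn = getattr(backend, name, None)
                if fn is None:
                    # Skip backends that do not implement this method.
                    logging.warning("%r has no attribute %s", backend, name)
                    continue
                results.append(fn(*args, **kwargs))
            return results

        return wrapper


class MemoryBackend:
    """Toy backend that keeps saved items in a list (assumption for the demo)."""

    def __init__(self):
        self.saved = []

    def saveTweet(self, status):
        self.saved.append(status)
        return len(self.saved)


class ReadOnlyBackend:
    """Toy backend without saveTweet, to exercise the skip path."""


a, b = MemoryBackend(), MemoryBackend()
multi = FanOut([a, ReadOnlyBackend(), b])
counts = multi.saveTweet("hello")
```

Unlike the driver in the diff, this sketch collects the return values of each backend; whether the real driver should do that is a design choice left to the author.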
94 changes: 94 additions & 0 deletions DB/mysql.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
import MySQLdb
import logging

from . import generic


class MySQLDriver(generic.DB):
    def __init__(self):
        super().__init__()
        self.name = "MySQL DB Driver"
        self.db = MySQLdb.connect(host="", user="", passwd="", db="", charset="utf8")

    def getTweets(self):
        cur = self.db.cursor()
        return cur.execute(
            """SELECT * \
            FROM Tweets \
            WHERE Deleted=0"""
        )

    def writeSuccess(self, path):
        cur = self.db.cursor()
        try:
            cur.execute(
                """UPDATE Tweets \
                SET Screenshot=1 \
                WHERE Tweet_Id=%s""",
                [path],
            )
            self.db.commit()
            logging.info(f"Screenshot OK. Tweet id {path}")
        except MySQLdb.Error as e:
            try:
                logging.error(("MySQL Error [%d]: %s" % (e.args[0], e.args[1])))
            except IndexError:
                logging.error(("MySQL Error: %s" % str(e)))

            logging.error(("Error", e.args[0], e.args[1]))
            logging.warning(("Warning:", path, "not saved to database"))
        return True

    def markDeleted(self, path):
        cur = self.db.cursor()
        try:
            cur.execute(
                """UPDATE Tweets \
                SET Deleted=1 \
                WHERE Tweet_Id=%s""",
                [path],
            )
            self.db.commit()
            logging.info(("Tweet marked as deleted ", path))
        except MySQLdb.Error as e:
            try:
                logging.error(("MySQL Error [%d]: %s" % (e.args[0], e.args[1])))
            except IndexError:
                logging.error(("MySQL Error: %s" % str(e)))

            logging.error(("Error", e.args[0], e.args[1]))
            logging.warning(("Warning:", path, "not saved to database"))
        return True

    def getLogs(self):
        cur = self.db.cursor()
        return cur.execute(
            "SELECT Url, Tweet_Id FROM Tweets WHERE Screenshot=0 AND Deleted=0 "
        )

    def saveTweet(self, url, status):
        (author, text, id_str) = (status.user.screen_name, status.text, status.id_str)
        cur = self.db.cursor()

        cur.execute(
            "CREATE TABLE IF NOT EXISTS Tweets(Id INT PRIMARY KEY AUTO_INCREMENT, \
            Author VARCHAR(255), \
            Text VARCHAR(255), \
            Url VARCHAR(255), \
            Tweet_Id VARCHAR(255), \
            Screenshot INT, \
            Deleted INT)"
        )

        try:
            cur.execute(
                """INSERT INTO Tweets(Author, Text, Url, Tweet_Id, Screenshot, Deleted)
                VALUES (%s, %s, %s, %s, %s, %s)""",
                (author, text, url, id_str, 0, 0),
            )
            self.db.commit()
            logging.info(("Wrote to database:", author, id_str))
        except MySQLdb.Error as e:
            logging.error(("Error", e.args[0], e.args[1]))
            self.db.rollback()
            logging.error("ERROR writing database")
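Review note on DB/mysql.py: `saveTweet` follows a create-table, parameterized-insert, commit-or-rollback pattern. The sketch below reproduces that pattern so it can be tried without a MySQL server. It is an assumption-laden stand-in: it uses the standard-library `sqlite3` module (with `?` placeholders rather than MySQLdb's `%s`), SQLite column types, and a hypothetical `save_tweet` helper that is not part of this PR.

```python
import sqlite3

# SQLite flavour of the Tweets table from the diff (column types adapted).
SCHEMA = """CREATE TABLE IF NOT EXISTS Tweets(
    Id INTEGER PRIMARY KEY AUTOINCREMENT,
    Author TEXT, Text TEXT, Url TEXT, Tweet_Id TEXT,
    Screenshot INTEGER, Deleted INTEGER)"""


def save_tweet(conn, author, text, url, tweet_id):
    """Insert one tweet; return True on success, False after a rollback."""
    conn.execute(SCHEMA)
    try:
        # Values are passed as parameters, never interpolated into the SQL,
        # so quotes in the tweet text cannot break the statement.
        conn.execute(
            "INSERT INTO Tweets(Author, Text, Url, Tweet_Id, Screenshot, Deleted) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (author, text, url, tweet_id, 0, 0),
        )
        conn.commit()
        return True
    except sqlite3.Error:
        conn.rollback()
        return False


conn = sqlite3.connect(":memory:")
ok = save_tweet(conn, "someone", "it's a 'quoted' tweet", "https://example.invalid/1", "1")
rows = conn.execute("SELECT Author, Text FROM Tweets").fetchall()
```

Returning a boolean is a choice made for this sketch; the method in the diff only logs the outcome.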
138 changes: 138 additions & 0 deletions DB/pynx.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,138 @@
import networkx as nx
import unicodedata
import logging
import json
import re
import os

from . import generic
from . import utils

hashre = re.compile(r"(#\w+)")
userre = re.compile(r"(@\w+)")


def normalize(input_str):
    return unicodedata.normalize("NFKD", input_str).encode("ASCII", "ignore").decode("ASCII").lower()


def add_node(G, node, attr={}):
    try:
        G.nodes[node]["weight"] += 1
    except KeyError:
        G.add_node(node, weight=1)


def add_edge(G, n, p):
    try:
        G.edges[n, p]["weight"] += 1
    except KeyError:
        G.add_edge(n, p, weight=1)


def add_tags(G, text):
    tags = hashre.findall(text)
    for i, t in enumerate(tags):
        n = normalize(t)
        add_node(G, n)
        for u in tags[i:]:
            u = normalize(u)
            add_node(G, u)
            add_edge(G, n, u)
    return tags

def add_users(G, text, status):
    users = set(userre.findall(text))
    if status.in_reply_to_screen_name:
        users.add("@%s" % status.in_reply_to_screen_name)
    try:
        users.add("@%s" % status.retweeted_status.user.screen_name)
    except AttributeError:
        pass
    u = normalize("@%s" % status.user.screen_name)
    add_node(G, u)
    for v in users:
        add_edge(G, u, normalize(v))


class Driver(generic.DB):
    def __init__(self, filename="graph.gexf"):
        generic.DB.__init__(self)

        self.name = "NetworkX DB Driver"

        self.type = filename.split(".")[-1] or "gexf"
        if self.type == 'pynx':  # this is for test handling
            self.type = "gexf"
            filename = filename.replace('pynx', 'gexf')

        self.filename = filename

        self.open()

    def open(self):
        self._user_graph = "user-%s" % self.filename
        self._hash_graph = "hash-%s" % self.filename
        self._twit_graph = "twit-%s" % self.filename

        self._write = getattr(nx, "write_%s" % self.type)
        self._read = getattr(nx, "read_%s" % self.type)

        self.U = self._open_graph(self._user_graph)
        self.H = self._open_graph(self._hash_graph)
        self.T = self._open_graph(self._twit_graph)

        logging.info(f"graphs opened {self.U.nodes()} {self.H.nodes()} {self.T.nodes()}")

    def _WIPE(self):
        self.close()

        os.remove(self._user_graph)
        os.remove(self._hash_graph)
        os.remove(self._twit_graph)

        self.open()

    def _open_graph(self, filename):
        try:
            return self._read(filename)
        except IOError:
            return nx.Graph()

    def getTweets(self):
        return [n for n in self.U.nodes()]

    # def getAuthor(self, screen_name):
    #     u = normalize("@%s" % screen_name)
    #     return self.U.neighbors(u)

    def markDeleted(self, id):
        nx.set_node_attributes(self.U, {id: {"deleted": True}})

    def _write_all(self):
        self._write(self.H, self._hash_graph)
        self._write(self.U, self._user_graph)
        self._write(self.T, self._twit_graph)

    def close(self):
        self._write_all()

    def saveTweet(self, status):
        text = utils.extract_text(status)

        add_tags(self.H, text)
        add_users(self.U, text, status)

        logging.info(f"H, {self.H.nodes()}")
        self._write_all()

    def saveAuthor(self, user):
        u = normalize("@%s" % user.screen_name)
        add_node(self.U, u)
        nx.set_node_attributes(self.U, {u: {'id': user.id, 'created_at': user.created_at.isoformat()}})

        self._write_all()

if __name__ == "__main__":
    G = nx.Graph()
    add_users(G, "RT @test blah blah #gnu @other", {})
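Review note on DB/pynx.py: graph node names come from hashtags and mentions that are first folded to lowercase ASCII. The sketch below isolates that step using only the standard library. It assumes the intended result is a `str` (hence the extra `decode`), and `extract_tags` is a hypothetical helper added for illustration; it does not touch networkx.

```python
import re
import unicodedata

HASHTAG_RE = re.compile(r"(#\w+)")


def normalize(text):
    """Fold text to lowercase ASCII, dropping accents and other non-ASCII marks."""
    # NFKD splits accented characters into a base letter plus combining marks;
    # encoding to ASCII with "ignore" then drops the marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()


def extract_tags(text):
    """Return the normalized hashtags found in text, in order of appearance."""
    return [normalize(tag) for tag in HASHTAG_RE.findall(text)]


tags = extract_tags("RT @test blah blah #GNU #Café")
```

With this input, `tags` holds `#gnu` and `#cafe`, so `#Café` and `#cafe` would map to the same graph node.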