The goal of py-crawler is to efficiently gather meaningful information from desired sources so that it can be correlated, analyzed, and queried quickly when needed. By collecting the content of articles from various sources and feeds, py-crawler can use natural language processing (NLP) to find related articles, even when those relations are subtle. This improves information correlation across platforms and widens the general scope of information gathering. Additionally, py-crawler can use previously saved data to answer queries from users looking for specific information: with its index, it can search for queried terms, tell the user where to look for more information, and surface additional related sources using NLP and other methods.
Py-crawler started as a project to help analysts and researchers working on the same topics and sources better correlate their findings, and to automate the grunt work of parsing through articles for information.
The initial crawling phase gathers general information about the articles listed in the feeds defined in feeds.json. From there, the crawler visits each article to get its raw content, as well as any links off that page to potentially related articles. Note that, at this time, the crawler does not go more than one level deep: it only visits and pulls content from articles linked directly from the original RSS feed. Although it collects outbound links from these articles, those links are only stored to give the user quick access to related resources on the topic queried (see the sketch below).
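For a rough idea of how this one-level-deep pass fits together, here is a minimal sketch. It assumes feedparser, requests, and BeautifulSoup, and a simple name-to-URL layout for feeds.json; the paths and variable names are illustrative, not the actual Crawler.py implementation.

    # Sketch of the one-level-deep crawl; the feeds.json path and shape
    # are illustrative assumptions, not the actual Crawler.py implementation.
    import json

    import feedparser
    import requests
    from bs4 import BeautifulSoup

    with open("rss_feed/feeds.json") as f:
        feeds = json.load(f)  # assumed shape: {"feed_name": "feed_url", ...}

    for feed_name, feed_url in feeds.items():
        parsed = feedparser.parse(feed_url)
        for entry in parsed.entries:
            # Visit each article directly off the RSS feed (depth 1 only).
            response = requests.get(entry.link, timeout=10)
            soup = BeautifulSoup(response.text, "html.parser")
            raw_content = soup.get_text(separator=" ", strip=True)

            # Outbound links are collected for the user's reference,
            # but the crawler never follows them to another level.
            outbound_links = [a["href"] for a in soup.find_all("a", href=True)]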
For each article visited, after getting the outbound links, the crawler tokenizes the raw content and adds the respective data to the index (more information on the index can be found in the README of the data folder). Importantly, the crawler also saves the non-tokenized raw content to the data/article_contents/ directory, since the vast majority of NLP models consider the location of a word in a document, not just its frequency, and thus require the raw content with word placement preserved.
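As a minimal illustration of the tokenize-and-index step, the sketch below uses the NLTK stopword list and a Porter stemmer, and assumes a simple inverted index mapping stemmed tokens to article IDs; the real index layout is documented in the data/ README.

    # Illustrative tokenize-and-index step; the index layout here is an assumed
    # simplification (token -> article IDs), not the documented data/ format.
    import re
    from collections import defaultdict

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    STOPWORDS = set(stopwords.words("english"))
    STEMMER = PorterStemmer()

    def tokenize(raw_content: str) -> list[str]:
        # Lowercase, split into words, drop stopwords, then stem.
        words = re.findall(r"[a-z0-9']+", raw_content.lower())
        return [STEMMER.stem(w) for w in words if w not in STOPWORDS]

    index: dict[str, set[str]] = defaultdict(set)

    def add_to_index(article_id: str, raw_content: str) -> None:
        # Index on tokens, but keep the raw text in a separate file so NLP
        # models that depend on word positions still have the full document.
        for token in tokenize(raw_content):
            index[token].add(article_id)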
To improve efficiency at runtime, the crawler stores additional information in the other files in the data/ directory; the README in that directory describes these files and their purposes.
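For example, one plausible way such a file could save work at runtime is to skip articles that have already been crawled. The sketch below assumes seen_article_links.json is a flat JSON list of URLs; the data/ README documents its actual structure.

    # Hypothetical use of seen_article_links.json to avoid re-crawling articles;
    # the file's real contents are described in data/README.md.
    import json
    from pathlib import Path

    SEEN_PATH = Path("data/jsons/seen_article_links.json")

    def load_seen_links() -> set[str]:
        # Assumed format: a JSON list of article URLs.
        if SEEN_PATH.exists():
            return set(json.loads(SEEN_PATH.read_text()))
        return set()

    def save_seen_links(seen: set[str]) -> None:
        SEEN_PATH.write_text(json.dumps(sorted(seen)))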
py-crawler
|-- crawler/
| |-- crawler_exceptions.py
| |-- Crawler.py
|-- config/
| |-- feeds.json
| |-- Paths.py
|-- data/
| |-- article_contents/
| | |-- feed1/
| | | |-- 0.txt
| | | |-- ...
| | |-- ...
| |-- jsons/
| | |-- index.json
| | |-- articles.json
| | |-- seen_article_links.json
| |-- README.md
|-- data_analysis/
| |-- Analyzer.py
| |-- Indexer.py
|-- rss_feed/
| |-- feeds.json
| |-- RSS.py
|-- README.md
|-- requirements.txt
|-- testing.ipynb
The NLTK stopwords corpus is required for the stopword-removal and stemming portion of tokenization. To install:
#!/usr/bin/env python3
import nltk
import ssl

# Work around SSL certificate errors that can block nltk.download() on some systems.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('stopwords')
A complete list of the required Python packages can be found in requirements.txt. These packages can be installed into a virtual environment (some environments handle this automatically), or they can be installed via pip:
pip install -r requirements.txt
OR
pip install [package_name]
OR
python3 -m pip install [package_name]
The pip command is dependent on your environment and may vary.