text-mining-project

Summary

The drug experience website https://erowid.org/experiences/ is mined with Beautiful Soup to investigate whether there are gender disparities in drug use, and whether there are common themes among drug users.

Please note: the final project report can be found in the LaTeX document. This README explains the code and justifies the steps taken; not all of the code will be used in the final report.

Introduction

This initially started as a research project as part of my degree. However, I enjoy working with NLP, so I decided to keep working on it to improve the research and to reacquaint myself with natural language processing.

The research that I wanted to undertake was to look into spiritualism in psychedelic drug use.

Psychoactive, or psychedelic, drugs such as mushrooms and LSD are believed by many users to enhance consciousness, giving practitioners a spiritual, or even religious, experience (Letcher, 2007; Watts, 1968). Entrenched in indigenous culture, the use and popularity of these drugs is growing globally despite prohibition (Letcher, 2007; Rager, 2013). According to Letcher, practitioners' experiences with these drugs are "weighted", meaning that many people encounter similar hallucinations whilst on these drugs. According to my own research, psychedelic drugs are amongst the most popular recreational drugs across both genders (see figure 1). What is it about the experiences people have on these drugs that makes them so popular? Do people have shared or similar experiences? Is there a common goal or experience that these drug users strive for?

In order to answer these questions I will datamine the Erowid experience vaults. This database is built on drug users documenting their own personal experiences, and therefore provides valuable insights into the popularity and the appeal of such substances.

Methods

Each experience was downloaded and processed using Beautiful Soup. Initially a dict was created in the form:

{experience_id: {drug: gender}}

The genders were separated into male and female, and charts were made to compare whether drug use differed between them:

As you can see, the differences are not substantial. Mushrooms and LSD are amongst the most popular drugs used, which highlights the relevance of my research question.
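The counting behind such a comparison can be sketched as follows. This is a minimal illustration, not the notebook code: the sample entries and the helper name `count_drugs_by_gender` are made up, and only the `{experience_id: {drug: gender}}` shape comes from the description above.

```python
from collections import Counter, defaultdict

# Hypothetical sample in the {experience_id: {drug: gender}} shape described above.
experiences = {
    1001: {"Mushrooms": "male"},
    1002: {"LSD": "female"},
    1003: {"Mushrooms": "female"},
    1004: {"Cannabis": "male"},
    1005: {"LSD": "male"},
}


def count_drugs_by_gender(experiences):
    """Return {gender: Counter({drug: number_of_reports})}."""
    counts = defaultdict(Counter)
    for drug_to_gender in experiences.values():
        for drug, gender in drug_to_gender.items():
            counts[gender][drug] += 1
    return dict(counts)


by_gender = count_drugs_by_gender(experiences)
for gender, drug_counts in sorted(by_gender.items()):
    # most_common() lists drugs from most to least reported for this gender
    print(gender, drug_counts.most_common())
```

The per-gender counters can then be passed to any plotting library to draw the comparison charts.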

The next step was to find all the reports relating to mushrooms and LSD and to count the most frequent words. The text was lowercased and stopwords were removed; more text processing could be done here to improve results:

print(word_count(tokenised).most_common(30))

[('like', 115720), ('would', 85580), ('felt', 84980), ('could', 69560), ('one', 65020), ('time', 64560), ('back', 61120), ('around', 52980), ('trip', 47460), ('started', 47440), ('get', 46260), ('really', 44680), ('got', 42820), ('went', 40700), ('feel', 40300), ('looked', 39840), ('thought', 39520), ('still', 38980), ('me.', 38760), ('go', 38640), ('seemed', 38000), ('going', 36560), ('see', 35500), ('began', 35440), ('everything', 35360), ('feeling', 34460), ('much', 32760), ('first', 32560), ('decided', 32080), ('even', 32000)]

As you can see, the majority of the 30 most common words carry little meaning and can be discarded, and some tokens (such as 'me.') still have punctuation attached. If there is time, more work will be done to improve this.
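One possible refinement is to strip punctuation from each token before counting, so that 'me.' and 'me' are treated as the same word. The sketch below is an assumption about how that could look, not the code in the notebooks: the stopword set is a tiny placeholder, and `tokenise` and `word_count` are illustrative re-implementations.

```python
import string
from collections import Counter

# Tiny illustrative stopword set; an assumption, not the list used in the notebooks.
STOPWORDS = {"i", "me", "my", "the", "a", "and", "it", "to", "was", "of", "on"}


def tokenise(text):
    """Lowercase, split on whitespace, strip surrounding punctuation, drop stopwords."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)
        if word and word not in STOPWORDS:
            tokens.append(word)
    return tokens


def word_count(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


sample = "The trip started slowly. It felt like the walls looked back at me."
counts = word_count(tokenise(sample))
print(counts.most_common(5))
```

With punctuation stripped first, a trailing full stop no longer lets a stopword slip past the filter.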

Finally, an (initially small) selection of words associated with spiritualism (taken from Wikipedia) was chosen, and their word frequencies were found and graphed:

spiritual_words = ['god', 'spirit', 'heaven', 'hell', 'universe','magic','atheist', 'creation', 'concious', 'exist']

The drugs were then split into those classed as serotonergic psychedelics and those that were not. This was done according to a list on Wikipedia.

According to Lancaster University, word frequency is often normalised as follows:

Frequency per million words = (frequency ÷ number of words in the text) × 1,000,000

Therefore the count of each of the spiritual words (as defined above) was found for the psychedelic drugs and for the non-psychedelic drugs, and normalised.
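The split and the normalisation can be sketched together as below. Everything here is illustrative: the `PSYCHEDELICS` set is a placeholder rather than the Wikipedia list, the reports are invented, and `normalised_frequencies` is a hypothetical helper that applies the per-million formula above.

```python
from collections import Counter

# Placeholder subset; the project used a list of serotonergic psychedelics from Wikipedia.
PSYCHEDELICS = {"LSD", "Mushrooms"}

# Invented (drug, report text) pairs, for illustration only.
reports = [
    ("LSD", "felt the universe open up"),
    ("Mushrooms", "god was in everything"),
    ("Cannabis", "felt hungry and went home"),
    ("Alcohol", "got home and god knows how"),
]


def normalised_frequencies(words, tokens):
    """Frequency per million words = (frequency / number of words) * 1,000,000."""
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {word: 0.0 for word in words}
    return {word: counts[word] * 1_000_000 / total for word in words}


# Pool the tokens of each group before normalising.
psychedelic_tokens, other_tokens = [], []
for drug, text in reports:
    target = psychedelic_tokens if drug in PSYCHEDELICS else other_tokens
    target.extend(text.lower().split())

spiritual_words = ["god", "universe"]
psychedelic_freq = normalised_frequencies(spiritual_words, psychedelic_tokens)
other_freq = normalised_frequencies(spiritual_words, other_tokens)
print(psychedelic_freq)
print(other_freq)
```

Normalising by the total number of words in each group keeps the comparison fair when one group has far more text than the other.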

A different, longer list of spiritual words from Wikipedia was also tried:

['spirituality', 'afterlife', 'agnosticism', 'ahimsa', 'aikido', 'akashic records', 'ancestor worship', 'asceticism', 'atheism', 'bagua (concept)', "bah\xc3\xa1'\xc3\xad faith", 'blessing', 'chakra', 'chant', 'channelling', 'creation', 'consciousness', 'contemplation', 'cosmogony', 'deism', 'deity', 'dhammapada', 'dharma', 'dhikr', 'emanationism', 'enlightenment', 'entheogen', 'epigenesis', 'epiphany', 'eschatology', 'esotericism', 'eternal return', 'eternity', 'eutheism, dystheism, and maltheism', 'existence', 'exorcism', 'faith healing', 'fasting', 'glossolalia', 'gnosticism', 'god', 'goddess', 'great awakenings', 'guru granth sahib', 'guru', 'hymn', 'i ching', 'iconolatry', 'inner peace', 'integrity', 'involution', 'japa', 'jihad', 'karma', 'koan', 'lataif-e-sitta', 'love', 'mantra', 'meaning of life', 'meditation', 'metaphysics', "mind's eye", 'miracle', 'moksha', 'muraqaba', 'mysticism', 'nasma', 'neopaganism', 'new age', 'nirvana', 'nondualism', 'oneness', 'pandeism', 'panentheism', 'pantheism', 'parapsychology', 'physical universe', 'pilgrimage', 'plane (cosmology)', 'prayer', 'prophecy', 'qi', 'qigong', 'reality', 'reincarnation', 'religion', 'religious ecstasy', 'repentance', 'responsibility assumption', 'revelation', 'revivalism', 'ritual', 'sacrifice', 'sadhana', 'saint', 'salvation', 'satguru', 'sbnr', 'seven virtues', 'shabd', 'shamanism', 'shinto', 'shunyata', 'simran', 'soul', 'spirit', 'spiritism', 'spiritual evolution', 'spiritualism', 'spirituality', 'sufi whirling', 'sufism', 'supplication', 'tao te ching', 'tenrikyo', 'theism', 'theosis', 'tithe', 'torah', 'transcendentalism', 'unitarian universalism', 'veneration', 'vipassana', 'wabi-sabi', 'worship', 'yana (buddhism)', 'yin and yang', 'yoga', 'zazen']

However, many of these words are very specific to certain religions, and are not necessarily the type of words associated with consciousness and self-awareness.

Therefore, the decision was made to go with the short spiritual words list defined above, as these words seemed less tied to religion.

spiritual words and normalised frequency psychedelics
{'heaven': 410.69744298138124, 'magic': 1174.9066090353438, 'universe': 2570.0153733039233, 'concious': 70.55381027709082, 'atheist': 54.586369003854465, 'creation': 275.15986008065414, 'exist': 805.0561088459625, 'god': 3301.9183209678495, 'hell': 3276.667483605522, 'spirit': 1329.7536558013799}
spiritual words and normalised frequency, rest of drugs
{'heaven': 190.43153902649394, 'magic': 548.4651050909255, 'universe': 1178.782362921309, 'concious': 32.85222456889807, 'atheist': 25.056781450854462, 'creation': 126.3975419854214, 'exist': 373.6244523005188, 'god': 1539.600015813613, 'hell': 1522.8954948463768, 'spirit': 610.2718326696998}

As we can see from the initial output, spiritual words occur more frequently in reports about psychedelics.
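To put a number on the difference, the ratio between the two normalised frequencies can be computed per word. The sketch below copies three of the values from the output above; the variable names are illustrative.

```python
# Three of the normalised frequencies (per million words) from the output above.
psychedelics = {
    "god": 3301.9183209678495,
    "universe": 2570.0153733039233,
    "spirit": 1329.7536558013799,
}
rest = {
    "god": 1539.600015813613,
    "universe": 1178.782362921309,
    "spirit": 610.2718326696998,
}

# Ratio > 1 means the word is relatively more frequent in the psychedelic reports.
ratios = {word: psychedelics[word] / rest[word] for word in psychedelics}
for word, ratio in sorted(ratios.items()):
    print(f"{word}: {ratio:.2f}x")
```

For these three words the normalised frequency in the psychedelic reports is a little over twice that in the rest of the data.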

A graph was made in order to visualise the differences better:

Chi Square test

A chi square test was carried out to compare whether spiritual words are over-represented in the psychoactive drugs as compared to the rest of the data.

This was carried out using the normalised data:

scipy.stats.chisquare(normalised_frequency, f_exp=rest_normalised_frequency)

with the output:

Power_divergenceResult(statistic=1296.8442867656056, pvalue=1.4860939588891628e-273)

The conventional significance level is α = 0.05. With a p-value of about 1.49e-273, far below that threshold, we can reject the null hypothesis that the two frequency distributions are the same. This supports the hypothesis that spiritual words occur more frequently in reports about psychedelics. Perhaps spiritually curious people are drawn to psychedelics, or the hallucinogenic nature of the drugs causes users to question the nature of perceived reality.
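For reference, the statistic that scipy.stats.chisquare reports is Pearson's chi-square, the sum of (observed − expected)² ÷ expected over all categories. The sketch below computes it by hand on made-up numbers, not the project data; `pearson_chi_square` is a hypothetical helper. Pearson's test is normally stated for observed counts, and turning the statistic into a p-value still needs the chi-square distribution, which SciPy supplies.

```python
def pearson_chi_square(observed, expected):
    """Return the sum over categories of (observed - expected)^2 / expected."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up frequencies for three categories, for illustration only.
observed = [30.0, 20.0, 10.0]
expected = [20.0, 20.0, 20.0]

statistic = pearson_chi_square(observed, expected)
print(statistic)  # 10.0
```

The larger the statistic, the further the observed frequencies are from the expected ones.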

Conclusion

The initial text processing needs to be refined in order to improve results. However, the word frequencies suggest that there is spiritualism within the modern use of psychoactive drugs.

The project can be improved or expanded upon by adding further drug experience websites and by looking at further spiritual words. The list of psychedelic drugs on Wikipedia may not be comprehensive, and people may experience hallucinations on drugs other than the psychedelics.

There are also problems with the data, for example, misspelled words may be missed from the search.
