License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General Information
Hyperreal Talk (Polish clear web message board) messages data.
Haitao Shi (The University of Edinburgh, UK); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).
The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.
Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).
Data Collection Context
Polish clear web message board called Hyperreal Talk (https://hyperreal.info/talk/).
This dataset was developed within the above-mentioned project. The project examines internet dynamics around disruptive activities, focusing on the online drug trade in Poland. It aims to (1) examine how the open internet, including social media, is used in the drug trade; (2) delineate the role of darknet environments in narcotics distribution; and (3) uncover how drug trade-related content and its meanings flow between the open web and the darknet, and how these meanings are shaped within the so-called drug subculture.
The Hyperreal Talk forum is a pivotal online space on the Polish internet, serving as a hub for discussions and the exchange of knowledge and experiences concerning drug use. It is central to investigating the narratives and discourses that shape the drug subculture and broader societal perceptions of drug consumption. The dataset has been instrumental in the analyses serving the project goals listed above.
The dataset was compiled using the Scrapy framework, a web crawling and scraping library for Python. This tool facilitated systematic content extraction from the targeted message board.
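The scraping code itself is published in the GitHub repositories listed under Accessibility and Usage. As a rough illustration of the general approach only (not the project's actual spider; the spider name, selectors, and field names below are hypothetical), a minimal Scrapy spider for a phpBB-style board might look like this:

```python
import scrapy

class ForumSpider(scrapy.Spider):
    # Hypothetical sketch; the project's actual spiders are linked below.
    name = "forum_posts"
    start_urls = ["https://hyperreal.info/talk/"]

    def parse(self, response):
        # Illustrative placeholder selectors for a phpBB-style layout.
        for post in response.css("div.post"):
            yield {
                "thread": response.css("h2.topic-title a::text").get(),
                "content": " ".join(post.css("div.content ::text").getall()),
            }
        # Follow pagination links, if present.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```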
The data was collected in two periods: September 2023 and November 2023.
Data Content
The dataset comprises all messages posted on the Polish-language Hyperreal Talk message board from its inception until November 2023. These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. The dataset is organized into two directories: “hyperreal” and “hyperreal_hidden.” The “hyperreal” directory contains posts that are accessible without logging in to Hyperreal Talk, while the “hyperreal_hidden” directory holds posts that can only be viewed by logged-in users. For each directory, a .txt file has been prepared detailing the structure of the message board folders from which the posts were extracted. The dataset includes 6,248,842 posts.
The data was cleaned and processed using regular expressions in Python, which were also used to remove all personal information. Identifiers related to instant messaging apps and email addresses were hashed, and all usernames appearing in messages were removed.
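The exact regular expressions are documented in the GitHub repositories linked under Accessibility and Usage. As a simplified sketch of the hashing approach (the e-mail pattern and digest scheme here are illustrative assumptions, not the project's code):

```python
import hashlib
import re

# Illustrative pattern; the project's actual expressions are on GitHub.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def hash_identifier(match: re.Match) -> str:
    # Replace an identifier with a truncated, irreversible SHA-256 digest.
    return "ID_" + hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()[:12]

def anonymize(text: str) -> str:
    return EMAIL_RE.sub(hash_identifier, text)

print(anonymize("Write to john.doe@example.com"))  # -> Write to ID_...
```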
The dataset consists of the following files:
Zipped .txt files (hyperreal.zip) containing messages that are visible without logging into Hyperreal Talk. These files are organized into individual directories that mirror the folder structure found on the Hyperreal Talk message board.
Zipped .txt files (hyperreal_hidden.zip) containing messages that are visible only after logging into Hyperreal Talk. Similar to the first type, these files are organized into directories corresponding to the website’s folder structure.
A .csv file that lists all the messages, including file names and the content of each post.
Accessibility and Usage
The data can be accessed without any restrictions.
Attached are .txt files detailing the tree of folders for “hyperreal.zip” and “hyperreal_hidden.zip.”
Documentation on the Python regular expressions used for scraping, cleaning, processing, and anonymizing the data can be found on GitHub at the following URLs:
https://github.com/LeszekSwieca/Project_2021-43-B-HS6-00710
https://github.com/HaitaoShi/Scrapy_hyperreal
Ethical Considerations
A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:
Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., & Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research. In T. X. Bui (Ed.), Proceedings of the 54th Hawaii International Conference on System Sciences (HICSS 2021), Kauai, Hawaii, USA, 4-8 January 2021 (pp. 4673-4680). Honolulu, HI.
The primary safeguard was the early-stage, irreversible hashing of usernames and identifiers in the messages, performed by automated systems. Recognizing that scraping and automatic name removal might not catch all identifiers, the data underwent manual review to ensure compliance with research ethics and thorough anonymization.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General Information
Dopek.eu (Polish clear web and dark web message board) messages data.
Haitao Shi (The University of Edinburgh, UK); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).
The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.
Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).
Data Collection Context
Clear web and dark web message board called dopek.eu (https://dopek.eu/).
This dataset was developed within the above-mentioned project. The project examines internet dynamics around disruptive activities, focusing on the online drug trade in Poland. It aims to (1) examine how the open internet, including social media, is used in the drug trade; (2) delineate the role of darknet environments in narcotics distribution; and (3) uncover how drug trade-related content and its meanings flow between the open web and the darknet, and how these meanings are shaped within the so-called drug subculture.
The dopek.eu forum is a pivotal online space on the Polish internet, serving as a hub for trading, discussions, and the exchange of knowledge and experiences concerning so-called new psychoactive substances (designer drugs). The dataset has been instrumental in the analyses serving the project goals listed above.
The dataset was compiled using the Scrapy framework, a web crawling and scraping library for Python. This tool facilitated systematic content extraction from the targeted message board.
The data was collected in October 2023.
Data Content
The dataset comprises all messages posted on dopek.eu from its inception until October 2023. These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. A .txt file has been prepared detailing the structure of the message board folders from which the posts were extracted. The dataset includes 171,121 posts.
The data was cleaned and processed using regular expressions in Python, which were also used to remove all personal information. Identifiers related to instant messaging apps and email addresses were hashed, and all usernames appearing in posts were removed.
The dataset consists of the following types of files:
Zipped .txt files (dopek.zip) containing all messages (posts).
A .csv file that lists all the messages, including file names and the content of each post.
Accessibility and Usage
The data can be accessed without any restrictions.
Attached is a .txt file detailing the tree of folders for “dopek.zip”.
Ethical Considerations
A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:
Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., & Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research. In T. X. Bui (Ed.), Proceedings of the 54th Hawaii International Conference on System Sciences (HICSS 2021), Kauai, Hawaii, USA, 4-8 January 2021 (pp. 4673-4680). Honolulu, HI.
The primary safeguard was the early-stage, irreversible hashing of usernames and identifiers in the posts, performed by automated systems. Recognizing that scraping and automatic name removal might not catch all identifiers, the data underwent manual review to ensure compliance with research ethics and thorough anonymization.
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
Overview
Welcome to Kaggle's second annual Machine Learning and Data Science Survey ― and our first-ever survey data challenge.
This year, as last year, we set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for one week in October, and after cleaning the data we finished with 23,859 responses, a 49% increase over last year!
There's a lot to explore here. The results include raw numbers about who is working with data, what’s happening with machine learning in different industries, and the best ways for new data scientists to break into the field. We've published the data in as raw a format as possible without compromising anonymization, which makes it an unusual example of a survey dataset.
Challenge
This year Kaggle is launching the first Data Science Survey Challenge, where we will be awarding a prize pool of $28,000 to kernel authors who tell a rich story about a subset of the data science and machine learning community.
In our second year running this survey, we were once again awed by the global, diverse, and dynamic nature of the data science and machine learning industry. This survey data EDA provides an overview of the industry on an aggregate scale, but it also leaves us wanting to know more about the many specific communities represented within the survey. For that reason, we’re inviting the Kaggle community to dive deep into the survey datasets and help us tell the diverse stories of data scientists from around the world.
The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!
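As a concrete starting point, carving out such a subset is a simple filter in pandas. A minimal sketch (the file and column names below are hypothetical placeholders; check the released survey's actual schema):

```python
import pandas as pd

# Hypothetical file and column names; the released survey schema may differ.
df = pd.read_csv("multipleChoiceResponses.csv", low_memory=False)

python_users = df[df["Q17"] == "Python"]         # e.g., "language used most often"
print(len(python_users), "respondents code mostly in Python")
print(python_users["Q6"].value_counts().head())  # e.g., job-title breakdown
```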
Submissions will be evaluated on the following:
* Composition: Is there a clear narrative thread to the story that’s articulated and supported by data? The subject should be well defined, well researched, and well supported through the use of data and visualizations.
* Originality: Does the reader learn something new through this submission? Or is the reader challenged to think about something in a new way? A great entry will be informative, thought-provoking, and fresh all at the same time.
* Documentation: Are your code, kernel, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high-quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible.

To be valid, a submission must be contained in one kernel and made public on or before the submission deadline. Participants are free to use any datasets in addition to the Kaggle Data Science Survey, but those datasets must also be publicly available on Kaggle by the deadline for a submission to be valid.
While the challenge is running, Kaggle will also give a Weekly Kernel Award of $1,500 to recognize excellent kernels that are public analyses of the survey. Weekly Kernel Awards will be announced every Friday between 11/9 and 11/30.
How to Participate
To make a submission, complete the submission form. Only one submission will be judged per participant, so if you make multiple submissions we will review the last (most recent) entry.
No submission is necessary for the Weekly Kernel Awards. To be eligible, a kernel must be public and use the 2018 Data Science Survey as a data source.
Timeline
All dates are 11:59 PM UTC.
Submission deadline: December 3rd
Winners announced: December 10th
Weekly Kernel Award winners announced: November 9th, 16th, 23rd, and 30th
All kernels are evaluated after the deadline.
Rules
To be eligible to win a prize in either of the above prize tracks, you must be:
* a registered account holder at Kaggle.com;
* the older of 18 years old or the age of majority in your jurisdiction of residence; and
* not a resident of Crimea, Cuba, Iran, Syria, North Korea, or Sudan.

Your kernels will only be eligible to win if they have been made public on kaggle.com by the above deadline. All prizes are awarded at the discretion of Kaggle. Kaggle reserves the right to cancel or modify prize criteria.
Unfortunately employees, interns, contractors, officers and directors of Kaggle Inc., and their parent companies, are not eligible to win any prizes.
Survey Methodology ...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In modern anesthesia, multiple medical devices are used simultaneously to comprehensively monitor real-time vital signs to optimize patient care and improve surgical outcomes. However, interpreting the dynamic changes of time-series biosignals and their correlations is a difficult task even for experienced anesthesiologists. Recent machine learning technologies have shown promising results in biosignal analysis; however, research and development in this area has been relatively slow due to the lack of biosignal datasets for machine learning. VitalDB (Vital Signs DataBase) is an open dataset created specifically to facilitate machine learning studies related to monitoring vital signs in surgical patients. This dataset contains high-resolution multi-parameter data from 6,388 cases, including 486,451 waveform and numeric data tracks of 196 intraoperative monitoring parameters, 73 perioperative clinical parameters, and 34 time-series laboratory result parameters. All data is stored in the public cloud after anonymization. The dataset can be freely accessed and analysed using application programming interfaces and a Python library. The VitalDB public dataset is expected to be a valuable resource for biosignal research and development.
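As an illustration of programmatic access, a minimal sketch assuming the open-source vitaldb Python package and its load_case helper (the case ID and track names below are examples, not a prescribed workflow):

```python
# pip install vitaldb
import vitaldb

# Load two monitoring tracks for case 1, resampled to 100 Hz (interval = 1/100 s).
# Track names follow VitalDB's "<device>/<parameter>" convention.
samples = vitaldb.load_case(1, ["SNUADC/ECG_II", "SNUADC/ART"], 1 / 100)
print(samples.shape)  # numpy array: (n_samples, n_tracks)
```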
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This repository includes the data and codes to generate analysis results and figures.
To ensure participant confidentiality, we replaced all reported concepts and details—including names mentioned in their personal narratives—with alphanumeric codes. This anonymization process did not compromise the integrity of our analysis, and the results align with our published findings.
Python library versions are provided by the requirements.txt file.
We trained the Transformer model in a Docker container, provided as "fast_nlp_tf1.tar.xz" (compressed with xz utils).
To build the environment, follow the commands below (decompression produces "fast_nlp_tf1.tar", which must then be loaded into Docker before the container can be started):
$ xz -d -T {# of threads} fast_nlp_tf1.tar.xz
$ docker load -i fast_nlp_tf1.tar
$ docker run --gpus all --name fast_nlp_env -v {path of this repo}:/Projects -itd fast_nlp:tf1.15.0-py3
Spontaneous thought plays a crucial role in shaping affective traits and mental health. However, its dynamic and unconstrained nature makes it challenging to quantify and model effectively. To address this, we employed the Free Association Semantic Task (FAST) to obtain self-generated spontaneous thoughts, which were then analyzed using network modeling and natural language processing (NLP) to decode key affective content dimensions of self-generated thought, including valence, self-relevance, and time. In two studies (n = 213 and n = 137), we found that degree centrality and semantic distance between consecutive concepts were associated with the overall self-relevance level of self-generated thought. To capture the trial-by-trial dynamics of content dimensions, we developed a Transformer-based model, with which we extracted dynamic features and were able to predict individual differences in general negative affectivity. These findings highlight the potential of computational linguistic and network models to quantify spontaneous thought and predict affective traits, offering a scalable approach for real-time, automated mental health assessments while reducing reliance on retrospective self-reports.
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains 200+ images of DJI Agras crop sprayer drones taken by private drone operators throughout South America and the Caribbean. The images are classified by generation and drone model, as shown in the table below.
| Gen | Model | Arms | Rotors | Nozzle Type | Nozzles | Images |
|---|---|---|---|---|---|---|
| 02 | DJI Agras T16 | 6 | 6 | Pressure (Flat fan) | 8 | 15 |
| 02 | DJI Agras T20 | 6 | 6 | Pressure (Flat fan) | 8 | 44 |
| 03 | DJI Agras T10 | 4 | 4 | Pressure (Flat fan) | 4 | 1 |
| 03 | DJI Agras T30 | 6 | 6 | Pressure (Flat fan) | 16 | 75 |
| 04 | DJI Agras T20P | 4 | 4 | Centrifugal | 2 | 8 |
| 04 | DJI Agras T40 | 4 | 8 | Centrifugal | 2 | 67 |
| 05 | DJI Agras T50 | 4 | 8 | Centrifugal | 2/4 | 17 |
A couple of technical notes:
* The tank size in liters is given in the model name after the letter T, e.g., the T16 has a 16-liter tank, the T30 has a 30-liter tank, and so on. An exception to this rule is the T50, which has a standard tank size of 40 liters and the option to install a 50-liter tank.
* Each rotor is equipped with two propeller blades. Hence, the total number of propeller blades on a drone is twice the number of rotors.
This dataset is obviously too small to train models from scratch, but it is ideal for testing fine-tuning or few-shot learning methods. Here are a few ideas (a fine-tuning sketch follows this list):
* Combine this dataset with one containing camera drones, i.e., small drones used for photography and videography (e.g., DJI Phantom, Mavic, Inspire, Matrice; Autel EVO; etc.). Fine-tune a model to distinguish crop sprayer drones from camera drones.
* Fine-tune a model to classify drones by nozzle type: flat fan pressure nozzles (T16/T20/T10/T30) vs. centrifugal nozzles (T20P/T40/T50).
* Fine-tune a model to classify by number of arms: 6-arm models (T16/T20/T30) vs. 4-arm models (T10/T20P/T40/T50).
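For the ideas above, a minimal transfer-learning sketch with torchvision (assuming you first arrange the images into one folder per class, e.g., data/train/sprayer and data/train/camera; the directory layout and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("data/train", transform=transform)  # assumed layout
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False  # freeze the pretrained backbone
model.fc = nn.Linear(model.fc.in_features, len(dataset.classes))  # new head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:  # one epoch shown for brevity
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```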
The majority of the images in this dataset come from WhatsApp group chat conversations and were taken with various smartphone cameras. A small number of images were taken by me using my own smartphone when I worked as a crop spraying services provider.
To ensure anonymization, all faces and identifying information (e.g., logos, truck license plates) were blurred using Gaussian kernels.
Additionally, during metadata cleaning, all Exif metadata (including ICC color profiles) was removed. However, all images were originally captured in sRGB or close-to-sRGB color spaces. As a result, standard image viewers (e.g., Ubuntu's default viewer) render them without visible changes. You can safely assume sRGB when loading the images.
If you are using Python libraries such as PIL, PyTorch, or Keras, you can ensure consistent color handling by explicitly converting images to RGB mode and treating pixel values as standard 0–255 sRGB values.
Using PIL (standalone)
```python
from PIL import Image
import numpy as np

img = Image.open("path/to/image.jpg").convert("RGB")  # force sRGB interpretation
img_array = np.array(img) / 255.0                     # normalize if needed
```
Using PyTorch with torchvision
```python
import torch
from torchvision import transforms
from PIL import Image

transform = transforms.Compose([
    transforms.ToTensor(),                      # converts to [0, 1] floats and permutes (H, W, C) to (C, H, W)
    transforms.ConvertImageDtype(torch.float),  # safeguard only: ToTensor already yields float32
])

img = Image.open("path/to/image.jpg").convert("RGB")
tensor = transform(img)
```
Using Keras
```python
from tensorflow.keras.preprocessing.image import load_img, img_to_array

img = load_img("path/to/image.jpg", color_mode="rgb")
img_array = img_to_array(img) / 255.0  # normalize if required by your model
```
For suggestions, questions, or feedback, you can reach me at luis.i.reyes.castro@gmail.com. In case you download this dataset from Kaggle, you can find the original repository here.
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
The CMT1A-BioStampNPoint2023 dataset provides data from a wearable sensor accelerometry study of gait, balance, and activity in 15 individuals with Charcot-Marie-Tooth disease Type 1A (CMT1A). The dataset also includes data for 15 controls who went through the same in-clinic study protocol as the CMT1A participants, with a substantial fraction (9) of the controls also participating in the in-home study protocol. For the CMT1A participants, data is provided for all 15 participants for the baseline visit and the associated home recording period; for a subset of 12 of these participants, data is also provided for a 12-month longitudinal visit and its associated home recording period. For controls, no longitudinal data is provided as none was recorded. The data were acquired using lightweight MC 10 BioStamp NPoint sensors (MC 10 Inc., Lexington, MA), three of which were attached to each participant to gather data over a roughly one-day interval. For additional details, see the description in the "README.md" included with the dataset.
Methods
The dataset contains wearable sensor data and clinical data extracted from the clinical record. The sensor data has not been processed per se, but the start of the recording time has been anonymized to comply with HIPAA requirements. Both the sensor data and the clinical data passed through a Python program for this time anonymization and for standard formatting. Additional details of the time anonymization are provided in the "README.md" included with the dataset.
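The actual anonymization program is described in the dataset's README.md; as an illustrative sketch of the general idea only (the column name and data layout are assumptions), shifting each recording so its start carries no protected date information might look like:

```python
import pandas as pd

def anonymize_start_time(df: pd.DataFrame, time_col: str = "t") -> pd.DataFrame:
    """Shift timestamps so the recording starts at t = 0, removing the
    absolute start date/time while preserving within-recording intervals."""
    out = df.copy()
    out[time_col] = out[time_col] - out[time_col].min()
    return out
```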
License: ODC Public Domain Dedication and Licence (PDDL) v1.0, http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Dataset Overview: This dataset pertains to the examination results of students who participated in a series of academic assessments at a fictitious educational institution named "University of Exampleville." The assessments were administered across various courses and academic levels, with a focus on evaluating students' performance in general management and domain-specific topics.
Columns: The dataset comprises 12 columns, each representing specific attributes and performance indicators of the students. These columns encompass information such as the students' names (which have been anonymized), their respective universities, academic program names (including BBA and MBA), specializations, the semester of the assessment, the type of examination domain (general management or domain-specific), general management scores (out of 50), domain-specific scores (out of 50), total scores (out of 100), student ranks, and percentiles.
Data Collection: The examination data was collected during a standardized assessment process conducted by the University of Exampleville. The exams were designed to assess students' knowledge and skills in general management and their chosen domain-specific subjects. It involved students from both BBA and MBA programs who were in their final year of study.
Data Format: The dataset is provided in a structured format, typically as a CSV (Comma-Separated Values) file. Each row represents a unique student's performance in the examination, while columns contain specific information about their results and academic details. The CSV format allows for easy import and analysis using data analysis tools and programming languages like Python and R, or spreadsheet software like Microsoft Excel.
Data Usage: This dataset is valuable for analyzing and gaining insights into the academic performance of students pursuing BBA and MBA degrees. It can be used for various purposes, including statistical analysis, performance trend identification, program assessment, and comparison of scores across domains and specializations. Furthermore, it can be employed in predictive modeling or decision-making related to curriculum development and student support.
Data Quality: The dataset has undergone preprocessing and anonymization to protect the privacy of individual students. Nevertheless, it is essential to use the data responsibly and in compliance with relevant data protection regulations when conducting any analysis or research.
Here's a column-wise description of the dataset:
Name OF THE STUDENT: The full name of the student who took the exam. (Anonymized)
UNIVERSITY: The university where the student is enrolled.
PROGRAM NAME: The name of the academic program in which the student is enrolled (BBA or MBA).
Specialization: If applicable, the specific area of specialization or major that the student has chosen within their program.
Semester: The semester or academic term in which the student took the exam.
Domain: Indicates the examination domain of the score: general management or domain-specific.
GENERAL MANAGEMENT SCORE (OUT of 50): The score obtained by the student in the general management part of the exam, out of a maximum possible score of 50.
Domain-Specific Score (Out of 50): The score obtained by the student in the domain-specific part of the exam, also out of a maximum possible score of 50.
TOTAL SCORE (OUT of 100): The total score obtained by adding the scores from the general management and domain-specific parts, out of a maximum possible score of 100.
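A short pandas sketch for loading and sanity-checking the file (the file name is hypothetical, and the exact column header strings may differ from the descriptions above):

```python
import pandas as pd

df = pd.read_csv("exam_results.csv")  # hypothetical file name

# Sanity check: the total should equal the sum of the two part scores.
parts = df["GENERAL MANAGEMENT SCORE (OUT of 50)"] + df["Domain-Specific Score (Out of 50)"]
assert (parts == df["TOTAL SCORE (OUT of 100)"]).all()

# Mean total score by program and specialization.
print(df.groupby(["PROGRAM NAME", "Specialization"])["TOTAL SCORE (OUT of 100)"].mean())
```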