8 datasets found
  1. Hyperreal Talk (Polish clear web message board) messages data

    • data.niaid.nih.gov
    Updated Mar 18, 2024
    Cite
    Siuda, Piotr; Shi, Haitao; Świeca, Leszek (2024). Hyperreal Talk (Polish clear web message board) messages data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10810250
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Kazimierz Wielki University in Bydgoszcz
    University of Edinburgh
    Authors
    Siuda, Piotr; Shi, Haitao; Świeca, Leszek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Information

    1. Title of Dataset

    Hyperreal Talk (Polish clear web message board) messages data.

    2. Data Collectors

    Haitao Shi (The University of Edinburgh, UK); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).

    3. Funding Information

    The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.

    Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).

    Data Collection Context

    1. Data Source

    Polish clear web message board called Hyperreal Talk (https://hyperreal.info/talk/).

    2. Purpose

    This dataset was developed within the abovementioned project. The project delves into internet dynamics within disruptive activities, specifically focusing on the online drug trade in Poland. It aims to (1) examine the utilization of the open internet, including social media, in the drug trade; (2) delineate the role of darknet environments in narcotics distribution; and (3) uncover the intricate flow of drug trade-related content and its meanings between the open web and the darknet, and how these meanings are shaped within the so-called drug subculture.

    The Hyperreal Talk forum emerges as a pivotal online space on the Polish internet, serving as a hub for discussions and the exchange of knowledge and experiences concerning drug use. It plays a crucial role in investigating the narratives and discourses that shape the drug subculture and the broader societal perceptions of drug consumption. The dataset has been instrumental in conducting analyses pertinent to the earlier project goals.

    3. Collection Method

    The dataset was compiled using the Scrapy framework, a web crawling and scraping library for Python. This tool facilitated systematic content extraction from the targeted message board.
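    For illustration, a minimal Scrapy spider of this kind might look as follows. The start URL is the forum address given above, but the CSS selectors and field names are hypothetical assumptions, not the project's actual code (see the repositories linked under Related Documentation):

    ```python
    # Hedged sketch of a Scrapy spider; selectors are illustrative only.
    import scrapy

    class ForumSpider(scrapy.Spider):
        name = "hyperreal"
        start_urls = ["https://hyperreal.info/talk/"]

        def parse(self, response):
            # Yield the text of each post on the page (CSS selectors are hypothetical).
            for post in response.css("div.post div.content"):
                yield {"text": " ".join(post.css("::text").getall()).strip()}
            # Follow pagination links, if any.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)
    ```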

    4. Collection Date

    The data were collected in two periods: September 2023 and November 2023.

    Data Content

    1. Data Description

    The dataset comprises all messages posted on the Polish-language Hyperreal Talk message board from its inception until November 2023. These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. The dataset is organized into two directories: “hyperreal” and “hyperreal_hidden.” The “hyperreal” directory contains accessible posts without needing to log in to Hyperreal Talk, while the “hyperreal_hidden” directory holds posts that can only be viewed by logged-in users. For each directory, a .txt file has been prepared detailing the structure of the message board folders from which the posts were extracted. The dataset includes 6,248,842 posts.

    2. Data Cleaning, Processing, and Anonymization

    The data have been cleaned and processed using regular expressions in Python, which were also used to remove personal information. Identifiers related to instant messaging apps and email addresses have been hashed, and all usernames appearing in messages have been eliminated.
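    A minimal sketch of this kind of regex-based cleaning plus irreversible hashing, assuming an illustrative email pattern and replacement format (not the authors' actual expressions):

    ```python
    # Hedged sketch: replace personal identifiers with irreversible pseudonyms.
    import hashlib
    import re

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def hash_identifier(match: re.Match) -> str:
        # One-way pseudonym: SHA-256 digest, truncated for readability.
        digest = hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()
        return f"<id:{digest[:12]}>"

    def clean_message(text: str) -> str:
        return EMAIL_RE.sub(hash_identifier, text)

    print(clean_message("write to john.doe@example.com"))  # -> write to <id:...>
    ```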

    3. File Formats and Variables/Fields

    The dataset consists of the following files:

    Zipped .txt files (hyperreal.zip) containing messages that are visible without logging into Hyperreal Talk. These files are organized into individual directories that mirror the folder structure found on the Hyperreal Talk message board.

    Zipped .txt files (hyperreal_hidden.zip) containing messages that are visible only after logging into Hyperreal Talk. Similar to the first type, these files are organized into directories corresponding to the website’s folder structure.

    A .csv file that lists all the messages, including file names and the content of each post.
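    As an illustration of working with this layout (the archive name follows the description above; exact member paths inside the archives may differ):

    ```python
    # Hedged sketch: list the .txt message files inside hyperreal.zip and
    # preview one of them. Member paths mirror the forum's folder structure.
    import zipfile

    with zipfile.ZipFile("hyperreal.zip") as zf:
        txt_files = [name for name in zf.namelist() if name.endswith(".txt")]
        print(f"{len(txt_files)} message files found")
        with zf.open(txt_files[0]) as fh:
            print(fh.read(300).decode("utf-8", errors="replace"))
    ```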

    Accessibility and Usage

    1. Access Conditions

    The data can be accessed without any restrictions.

    2. Related Documentation

    Attached are .txt files detailing the tree of folders for “hyperreal.zip” and “hyperreal_hidden.zip.”

    Documentation on the Python regular expressions used for scraping, cleaning, processing, and anonymizing the data can be found on GitHub at the following URLs:

    https://github.com/LeszekSwieca/Project_2021-43-B-HS6-00710

    https://github.com/HaitaoShi/Scrapy_hyperreal

    Ethical Considerations

    1. Ethics Statement

    A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:

    Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.

    The primary safeguard was the early-stage hashing of usernames and identifiers from the messages, utilizing automated systems for irreversible hashing. Recognizing that scraping and automatic name removal might not catch all identifiers, the data underwent manual review to ensure compliance with research ethics and thorough anonymization.

  2. Dopek.eu (Polish clear web and dark web message board) messages data

    • data.niaid.nih.gov
    Updated Mar 18, 2024
    Cite
    Siuda, Piotr; Shi, Haitao; Świeca, Leszek (2024). Dopek.eu (Polish clear web and dark web message board) messages data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10810554
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Kazimierz Wielki University in Bydgoszcz
    University of Edinburgh
    Authors
    Siuda, Piotr; Shi, Haitao; Świeca, Leszek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Information

    1. Title of Dataset

    Dopek.eu (Polish clear web and dark web message board) messages data.

    2. Data Collectors

    Haitao Shi (The University of Edinburgh, UK); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).

    3. Funding Information

    The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.

    Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).

    Data Collection Context

    1. Data Source

    Clear web and dark web message board called dopek.eu (https://dopek.eu/).

    2. Purpose

    This dataset was developed within the abovementioned project. The project delves into internet dynamics within disruptive activities, specifically focusing on the online drug trade in Poland. It aims to (1) examine the utilization of the open internet, including social media, in the drug trade; (2) delineate the role of darknet environments in narcotics distribution; and (3) uncover the intricate flow of drug trade-related content and its meanings between the open web and the darknet, and how these meanings are shaped within the so-called drug subculture.

    The dopek.eu forum emerges as a pivotal online space on the Polish internet, serving as a hub for trading, discussions, and the exchange of knowledge and experiences concerning the use of the so-called new psychoactive substances (designer drugs). The dataset has been instrumental in conducting analyses pertinent to the earlier project goals.

    3. Collection Method

    The dataset was compiled using the Scrapy framework, a web crawling and scraping library for Python. This tool facilitated systematic content extraction from the targeted message board.

    4. Collection Date

    The data was collected in October 2023.

    Data Content

    1. Data Description

    The dataset comprises all messages posted on dopek.eu from its inception until October 2023. These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. A .txt file has been prepared detailing the structure of the message board folders from which the posts were extracted. The dataset includes 171,121 posts.

    2. Data Cleaning, Processing, and Anonymization

    The data have been cleaned and processed using regular expressions in Python, which were also used to remove personal information. Identifiers related to instant messaging apps and email addresses have been hashed, and all usernames appearing in messages have been eliminated.

    3. File Formats and Variables/Fields

    The dataset consists of the following types of files:

    Zipped .txt files (dopek.zip) containing all messages (posts).

    A .csv file that lists all the messages, including file names and the content of each post.

    Accessibility and Usage

    1. Access Conditions

    The data can be accessed without any restrictions.

    2. Related Documentation

    Attached is a .txt file detailing the tree of folders for “dopek.zip”.

    Ethical Considerations

    1. Ethics Statement

    A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:

    Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.

    The primary safeguard was the early-stage hashing of usernames and identifiers from the posts, utilizing automated systems for irreversible hashing. Recognizing that scraping and automatic name removal might not catch all identifiers, the data underwent manual review to ensure compliance with research ethics and thorough anonymization.

  3. 2025 Kaggle Machine Learning & Data Science Survey

    • kaggle.com
    Updated Jan 28, 2025
    Cite
    Hina Ismail (2025). 2025 Kaggle Machine Learning & Data Science Survey [Dataset]. https://www.kaggle.com/datasets/sonialikhan/2025-kaggle-machine-learning-and-data-science-survey
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 28, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Hina Ismail
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    Welcome to Kaggle's second annual Machine Learning and Data Science Survey ― and our first-ever survey data challenge.

    This year, as last year, we set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for one week in October, and after cleaning the data we finished with 23,859 responses, a 49% increase over last year!

    There's a lot to explore here. The results include raw numbers about who is working with data, what’s happening with machine learning in different industries, and the best ways for new data scientists to break into the field. We've published the data in as raw a format as possible without compromising anonymization, which makes it an unusual example of a survey dataset.

    Challenge

    This year Kaggle is launching the first Data Science Survey Challenge, where we will be awarding a prize pool of $28,000 to kernel authors who tell a rich story about a subset of the data science and machine learning community.

    In our second year running this survey, we were once again awed by the global, diverse, and dynamic nature of the data science and machine learning industry. This survey data EDA provides an overview of the industry on an aggregate scale, but it also leaves us wanting to know more about the many specific communities represented within the survey. For that reason, we’re inviting the Kaggle community to dive deep into the survey datasets and help us tell the diverse stories of data scientists from around the world.

    The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!

    Submissions will be evaluated on the following:

    Composition - Is there a clear narrative thread to the story that’s articulated and supported by data? The subject should be well defined, well researched, and well supported through the use of data and visualizations.

    Originality - Does the reader learn something new through this submission? Or is the reader challenged to think about something in a new way? A great entry will be informative, thought provoking, and fresh all at the same time.

    Documentation - Are your code, kernel, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high-quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible.

    To be valid, a submission must be contained in one kernel, made public on or before the submission deadline. Participants are free to use any datasets in addition to the Kaggle Data Science survey, but those datasets must also be publicly available on Kaggle by the deadline for a submission to be valid.

    While the challenge is running, Kaggle will also give a Weekly Kernel Award of $1,500 to recognize excellent kernels that are public analyses of the survey. Weekly Kernel Awards will be announced every Friday between 11/9 and 11/30.

    How to Participate

    To make a submission, complete the submission form. Only one submission will be judged per participant, so if you make multiple submissions we will review the last (most recent) entry.

    No submission is necessary for the Weekly Kernel Awards. To be eligible, a kernel must be public and use the 2018 Data Science Survey as a data source.

    Timeline

    All dates are 11:59 PM UTC.

    Submission deadline: December 3rd

    Winners announced: December 10th

    Weekly Kernels Award prize winners announcements: November 9th, 16th, 23rd, and 30th

    All kernels are evaluated after the deadline.

    Rules

    To be eligible to win a prize in either of the above prize tracks, you must be:

    a registered account holder at Kaggle.com;

    the older of 18 years old or the age of majority in your jurisdiction of residence; and

    not a resident of Crimea, Cuba, Iran, Syria, North Korea, or Sudan.

    Your kernels will only be eligible to win if they have been made public on kaggle.com by the above deadline. All prizes are awarded at the discretion of Kaggle. Kaggle reserves the right to cancel or modify prize criteria.

    Unfortunately employees, interns, contractors, officers and directors of Kaggle Inc., and their parent companies, are not eligible to win any prizes.

    Survey Methodology ...

  4. Data from: VitalDB, a high-fidelity multi-parameter vital signs database in surgical patients

    • physionet.org
    Updated Sep 21, 2022
    Cite
    Hyung-Chul Lee; Chul-Woo Jung (2022). VitalDB, a high-fidelity multi-parameter vital signs database in surgical patients [Dataset]. http://doi.org/10.13026/czw8-9p62
    Dataset updated
    Sep 21, 2022
    Authors
    Hyung-Chul Lee; Chul-Woo Jung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In modern anesthesia, multiple medical devices are used simultaneously to comprehensively monitor real-time vital signs to optimize patient care and improve surgical outcomes. However, interpreting the dynamic changes of time-series biosignals and their correlations is a difficult task even for experienced anesthesiologists. Recent advanced machine learning technologies have shown promising results in biosignal analysis; however, research and development in this area is relatively slow due to the lack of biosignal datasets for machine learning. The VitalDB (Vital Signs DataBase) is an open dataset created specifically to facilitate machine learning studies related to monitoring vital signs in surgical patients. This dataset contains high-resolution multi-parameter data from 6,388 cases, including 486,451 waveform and numeric data tracks of 196 intraoperative monitoring parameters, 73 perioperative clinical parameters, and 34 time-series laboratory result parameters. All data is stored in the public cloud after anonymization. The dataset can be freely accessed and analysed using application programming interfaces and a Python library. The VitalDB public dataset is expected to be a valuable resource for biosignal research and development.
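    For example, access via the open-source vitaldb Python package might look like the sketch below; the load_case helper and the track name are taken from the package's public examples and should be verified against the current documentation:

    ```python
    # Hedged sketch, assuming the `vitaldb` package (pip install vitaldb).
    import vitaldb

    # Load the arterial pressure waveform of case 1, resampled to 100 Hz.
    # 'SNUADC/ART' is an example track name from the public track list.
    samples = vitaldb.load_case(1, ['SNUADC/ART'], 1 / 100)
    print(samples.shape)  # numpy array: (n_samples, n_tracks)
    ```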

  5. Modeling Spontaneous Thought: A Network- and Language-based Computational...

    • zenodo.org
    bin, text/x-python +2
    Updated Apr 15, 2025
    Cite
    Jihoon Han; Byeol Kim Lux; Eunjin Lee; Yongseok Yoo; Choong-Wan Woo (2025). Modeling Spontaneous Thought: A Network- and Language-based Computational Method [Dataset]. http://doi.org/10.5281/zenodo.15213401
    Available download formats: bin, text/x-python, xz, txt
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jihoon Han; Byeol Kim Lux; Eunjin Lee; Yongseok Yoo; Choong-Wan Woo
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Modeling Spontaneous Thought: A Network- and Language-based Computational Method

    This repository includes the data and codes to generate analysis results and figures.

    Data anonymization

    To ensure participant confidentiality, we replaced all reported concepts and details—including names mentioned in their personal narratives—with alphanumeric codes. This anonymization process did not compromise the integrity of our analysis, and the results align with our published findings.

    Dependencies

    • CanlabCore
    • cocoanCORE
    • spm12
    • Matlab 2022b
    • Conda 24.9.2
    • Docker version 24.0.7, build afdd53b
    • nvidia-docker2
    • xz (XZ Utils) 5.2.5
    • liblzma 5.2.5
    • Python 3.6
    • Tensorflow 1.15.0

    Python library versions are provided by the requirements.txt file.

    Docker environment

    We trained the Transformer model in a Docker container and provide "fast_nlp_tf1.tar.xz", compressed with XZ Utils.

    To build the environment, follow the command lines below.

    $ xz -d -p {# of processes} fast_nlp_tf1.tar.xz

    $ docker run --gpus all --name fast_nlp_env -v {path of this repo}:/Projects -itd fast_nlp:tf1.15.0-py3
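    The decompressed image archive presumably needs to be registered with Docker before the run command can reference the fast_nlp:tf1.15.0-py3 tag; a hedged sketch of that intermediate step (the tag-to-archive mapping is an assumption):

    $ docker load -i fast_nlp_tf1.tar  # assumption: loads the image saved in the archive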

    Abstract

    Spontaneous thought plays a crucial role in shaping affective traits and mental health. However, its dynamic and unconstrained nature makes it challenging to quantify and model effectively. To address this, we employed the Free Association Semantic Task (FAST) to obtain self-generated spontaneous thoughts, which were then analyzed using network modeling and natural language processing (NLP) to decode key affective content dimensions of self-generated thought, including valence, self-relevance, and time. In two studies (n = 213 and n = 137), we found that degree centrality and semantic distance between consecutive concepts were associated with the overall self-relevance level of self-generated thought. To capture the trial-by-trial dynamics of content dimensions, we developed a Transformer-based model, with which we extracted dynamic features and were able to predict individual differences in general negative affectivity. These findings highlight the potential of computational linguistic and network models to quantify spontaneous thought and predict affective traits, offering a scalable approach for real-time, automated mental health assessments while reducing reliance on retrospective self-reports.
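    To make the two network measures named in the abstract concrete, here is a toy illustration with an invented association chain and random stand-in embeddings; this is not the authors' pipeline:

    ```python
    # Toy illustration of degree centrality and consecutive semantic distance.
    import numpy as np
    import networkx as nx

    rng = np.random.default_rng(0)
    chain = ["ocean", "wave", "storm", "fear"]               # invented FAST-like chain
    vectors = {word: rng.normal(size=50) for word in chain}  # stand-in embeddings

    # Concept network: link consecutive concepts, then compute degree centrality.
    G = nx.Graph()
    G.add_edges_from(zip(chain, chain[1:]))
    centrality = nx.degree_centrality(G)

    # Semantic distance between consecutive concepts: 1 - cosine similarity.
    def distance(a: str, b: str) -> float:
        va, vb = vectors[a], vectors[b]
        return 1.0 - float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

    distances = [distance(a, b) for a, b in zip(chain, chain[1:])]
    print(centrality, distances)
    ```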

  6. Crop Sprayer Drone

    • kaggle.com
    zip
    Updated Jul 21, 2025
    Cite
    Luis (2025). Crop Sprayer Drone [Dataset]. https://www.kaggle.com/datasets/lireyesc/crop-sprayer-drone
    Available download formats: zip (103826093 bytes)
    Dataset updated
    Jul 21, 2025
    Authors
    Luis
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains 200+ images of DJI Agras crop sprayer drones taken by private drone operators throughout South America and the Caribbean. The images are classified by generation and drone model, as shown in the table below.

    | Gen | Model          | Arms | Rotors | Nozzle Type         | Nozzles | Images |
    |-----|----------------|------|--------|---------------------|---------|--------|
    | 02  | DJI Agras T16  | 6    | 6      | Pressure (Flat fan) | 8       | 15     |
    | 02  | DJI Agras T20  | 6    | 6      | Pressure (Flat fan) | 8       | 44     |
    | 03  | DJI Agras T10  | 4    | 4      | Pressure (Flat fan) | 4       | 1      |
    | 03  | DJI Agras T30  | 6    | 6      | Pressure (Flat fan) | 16      | 75     |
    | 04  | DJI Agras T20P | 4    | 4      | Centrifugal         | 2       | 8      |
    | 04  | DJI Agras T40  | 4    | 8      | Centrifugal         | 2       | 67     |
    | 05  | DJI Agras T50  | 4    | 8      | Centrifugal         | 2/4     | 17     |

    A couple of technical notes:

    * The tank size in liters is given in the model name after the letter T, e.g., the T16 has a 16-liter tank, the T30 has a 30-liter tank, and so on. An exception to this rule is the T50, which has a standard tank size of 40 liters and the option to install a 50-liter tank.

    * Each rotor is equipped with two propeller blades. Hence, the total number of propeller blades on a drone is twice the number of rotors.

    Purpose

    This dataset is obviously too small to train models from scratch, but it is ideal to test fine-tuning methods or few-shot learning methods. Here are a few ideas:

    * Combine this dataset with one containing camera drones, i.e., small drones used for photography and videography (e.g., DJI Phantom, Mavic, Inspire, Matrice; Autel EVO; etc.). Fine-tune a model to distinguish crop sprayer drones from camera drones.

    * Fine-tune a model to classify drones by nozzle type: flat fan pressure nozzles (T16/T20/T10/T30) vs. centrifugal nozzles (T20P/T40/T50).

    * Fine-tune a model to classify by number of arms: 6-arm models (T16/T20/T30) vs. 4-arm models (T10/T20P/T40/T50).

    Data Provenance, Anonymization and ICC Profiles

    The majority of the images in this dataset come from WhatsApp group chat conversations and were taken with various smartphone cameras. A small number of images were taken by me using my own smartphone when I worked as a crop spraying services provider.

    To ensure anonymization, all faces and identifying information (e.g., logos, truck license plates) were blurred using Gaussian kernels.

    Additionally, during metadata cleaning, all Exif metadata (including ICC color profiles) was removed. However, all images were originally captured in sRGB or close-to-sRGB color spaces. As a result, standard image viewers (e.g., Ubuntu's default viewer) render them without visible changes. You can safely assume sRGB when loading the images.

    If you are using Python libraries such as PIL, PyTorch, or Keras, you can ensure consistent color handling by explicitly converting images to RGB mode and treating pixel values as standard 0–255 sRGB values.

    Examples of Safe Image Loading

    Using PIL (standalone)

    ```python
    import numpy as np
    from PIL import Image

    img = Image.open("path/to/image.jpg").convert("RGB")  # Force sRGB interpretation
    img_array = np.array(img) / 255.0  # Normalize if needed
    ```

    Using PyTorch with torchvision

    ```python
    import torch
    from torchvision import transforms
    from PIL import Image

    # Note: ConvertImageDtype expects a tensor, so it must come after ToTensor.
    transform = transforms.Compose([
        transforms.ToTensor(),  # Converts to [0, 1] and permutes (H, W, C) to (C, H, W)
        transforms.ConvertImageDtype(torch.float),
    ])

    img = Image.open("path/to/image.jpg").convert("RGB")
    tensor = transform(img)
    ```

    Using Keras

    ```python
    from tensorflow.keras.preprocessing.image import load_img, img_to_array

    # Load image in RGB mode (do not resize unless required)
    img = load_img("path/to/image.jpg", color_mode='rgb')
    img_array = img_to_array(img) / 255.0  # Normalize if required by your model
    ```

    Contact

    For suggestions, questions, or feedback, you can reach me at luis.i.reyes.castro@gmail.com. In case you download this dataset from Kaggle, you can find the original repository here.

  7. CMT1A-BioStampNPoint2023: Charcot-Marie-Tooth disease type 1A accelerometry dataset from three wearable sensor study

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Jun 8, 2023
    Cite
    Karthik Dinesh; Nicole White; Lindsay Baker; Janet Sowden; Steffen Behrens-Spraggins; Elizabeth P Wood; Julie L Charles; David Herrmann; Gaurav Sharma; Katy Eichinger (2023). CMT1A-BioStampNPoint2023: Charcot-Marie-Tooth disease type 1A accelerometry dataset from three wearable sensor study [Dataset]. http://doi.org/10.5061/dryad.p5hqbzktr
    Available download formats: zip
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    University of Rochester Medical Center
    University of Rochester
    Authors
    Karthik Dinesh; Nicole White; Lindsay Baker; Janet Sowden; Steffen Behrens-Spraggins; Elizabeth P Wood; Julie L Charles; David Herrmann; Gaurav Sharma; Katy Eichinger
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    The CMT1A-BioStampNPoint2023 dataset provides data from a wearable sensor accelerometry study conducted for studying gait, balance, and activity in 15 individuals with Charcot-Marie-Tooth disease Type 1A (CMT1A). In addition to individuals with CMT1A, the dataset also includes data for 15 controls who went through the same in-clinic study protocol as the CMT1A participants, with a substantial fraction (9) of the controls also participating in the in-home study protocol. For the CMT1A participants, data are provided for all 15 participants for the baseline visit and associated home recording duration; for a subset of 12 of these participants, data are also provided for a 12-month longitudinal visit and associated home recording duration. For controls, no longitudinal data is provided as none was recorded. The data were acquired using lightweight MC 10 BioStamp NPoint sensors (MC 10 Inc, Lexington, MA), three of which were attached to each participant for gathering data over a roughly one-day interval. For additional details, see the description in the "README.md" included with the dataset.

    Methods

    The dataset contains data from wearable sensors and clinical data. The wearable sensor data were acquired using wearable sensors, and the clinical data were extracted from the clinical record. The sensor data have not been processed per se, but the start of the recording time has been anonymized to comply with HIPAA requirements. Both the sensor data and the clinical data passed through a Python program for the aforementioned time anonymization and for standard formatting. Additional details of the time anonymization are provided in the file "README.md" included with the dataset.
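    The README presumably documents the exact procedure; as a hedged sketch, time anonymization of this kind can be as simple as re-zeroing each recording's clock (the column name below is hypothetical):

    ```python
    # Hedged sketch (not the authors' program): shift timestamps so every
    # recording starts at t = 0, discarding the real-world start time.
    import pandas as pd

    def anonymize_times(df: pd.DataFrame, time_col: str = "timestamp") -> pd.DataFrame:
        out = df.copy()
        out[time_col] = out[time_col] - out[time_col].min()
        return out
    ```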

  8. Students Test Data

    • kaggle.com
    zip
    Updated Sep 12, 2023
    Cite
    ATHARV BHARASKAR (2023). Students Test Data [Dataset]. https://www.kaggle.com/datasets/atharvbharaskar/students-test-data/discussion
    Available download formats: zip (3986 bytes)
    Dataset updated
    Sep 12, 2023
    Authors
    ATHARV BHARASKAR
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    Dataset Overview: This dataset pertains to the examination results of students who participated in a series of academic assessments at a fictitious educational institution named "University of Exampleville." The assessments were administered across various courses and academic levels, with a focus on evaluating students' performance in general management and domain-specific topics.

    Columns: The dataset comprises 12 columns, each representing specific attributes and performance indicators of the students. These columns encompass information such as the students' names (which have been anonymized), their respective universities, academic program names (including BBA and MBA), specializations, the semester of the assessment, the type of examination domain (general management or domain-specific), general management scores (out of 50), domain-specific scores (out of 50), total scores (out of 100), student ranks, and percentiles.

    Data Collection: The examination data was collected during a standardized assessment process conducted by the University of Exampleville. The exams were designed to assess students' knowledge and skills in general management and their chosen domain-specific subjects. It involved students from both BBA and MBA programs who were in their final year of study.

    Data Format: The dataset is available in a structured format, typically as a CSV (Comma-Separated Values) file. Each row represents a unique student's performance in the examination, while columns contain specific information about their results and academic details. The CSV format allows for easy import and analysis using various data analysis tools and programming languages like Python or R, or spreadsheet software like Microsoft Excel.

    Data Usage: This dataset is valuable for analyzing and gaining insights into the academic performance of students pursuing BBA and MBA degrees. It can be used for various purposes, including statistical analysis, performance trend identification, program assessment, and comparison of scores across domains and specializations. Furthermore, it can be employed in predictive modeling or decision-making related to curriculum development and student support.

    Data Quality: The dataset has undergone preprocessing and anonymization to protect the privacy of individual students. Nevertheless, it is essential to use the data responsibly and in compliance with relevant data protection regulations when conducting any analysis or research.


    Here's a column-wise description of the dataset:

    Name OF THE STUDENT: The full name of the student who took the exam. (Anonymized)

    UNIVERSITY: The university where the student is enrolled.

    PROGRAM NAME: The name of the academic program in which the student is enrolled (BBA or MBA).

    Specialization: If applicable, the specific area of specialization or major that the student has chosen within their program.

    Semester: The semester or academic term in which the student took the exam.

    Domain: The examination domain to which the record refers: general management or domain-specific.

    GENERAL MANAGEMENT SCORE (OUT of 50): The score obtained by the student in the general management part of the exam, out of a maximum possible score of 50.

    Domain-Specific Score (Out of 50): The score obtained by the student in the domain-specific part of the exam, also out of a maximum possible score of 50.

    TOTAL SCORE (OUT of 100): The total score obtained by adding the scores from the general management and domain-specific parts, out of a maximum possible score of 100.
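    As an illustration only, the layout described above can be loaded and sanity-checked as below; the file name and header spellings are assumptions taken from the column list and should be verified against the actual file:

    ```python
    # Hedged sketch: load the CSV and check that the total equals the sum of
    # the two 50-point sections.
    import pandas as pd

    df = pd.read_csv("students_test_data.csv")
    total = (df["GENERAL MANAGEMENT SCORE (OUT of 50)"]
             + df["Domain-Specific Score (Out of 50)"])
    assert (df["TOTAL SCORE (OUT of 100)"] == total).all()
    ```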
