7 datasets found
  1. Malware Timestamps 2

    • kaggle.com
    Updated Jan 26, 2019
    Cite
    Chris Deotte (2019). Malware Timestamps 2 [Dataset]. https://www.kaggle.com/cdeotte/malware-timestamps-2/notebooks
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 26, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Chris Deotte
    Description

    This Python dictionary maps Microsoft's Census_OSVersion (str) to a timestamp (datetime.datetime). Use it to add the time when a user's OS was last updated.

    The Microsoft Malware dataset has 579 unique Census_OSVersion values. This dictionary contains dates for 324 of them, which together cover 99.85% of the Microsoft data.

    These timestamps were downloaded from https://support.microsoft.com/en-us/help/4043454/windows-10-windows-server-update-history and https://changewindows.org/build/17134

    For a dictionary of timestamps for AvSigVersion, see https://www.kaggle.com/cdeotte/malware-timestamps
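    A minimal sketch of how such a mapping could be applied to the Microsoft Malware data with pandas. The file names and the exact distribution format of the dictionary are assumptions here (a pickle is one common choice); the column name Census_OSVersion comes from the description above.

    import pickle
    import pandas as pd

    # Load the Census_OSVersion -> datetime dictionary
    # (hypothetical file name; adjust to the file shipped with the dataset)
    with open("OSVersionTimestamps.pkl", "rb") as f:
        os_version_dates = pickle.load(f)

    # Map each row's Census_OSVersion to the date its OS build was released
    train = pd.read_csv("train.csv", usecols=["MachineIdentifier", "Census_OSVersion"])
    train["OSVersion_date"] = train["Census_OSVersion"].map(os_version_dates)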

  2. Dataset for: The Evolution of the Manosphere Across the Web

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 30, 2020
    Cite
    Manoel Horta Ribeiro (2020). Dataset for: The Evolution of the Manosphere Across the Web [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4007912
    Dataset updated
    Aug 30, 2020
    Dataset provided by
    Emiliano De Cristofaro
    Summer Long
    Jeremy Blackburn
    Gianluca Stringhini
    Stephanie Greenberg
    Savvas Zannettou
    Barry Bradlyn
    Manoel Horta Ribeiro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Evolution of the Manosphere Across the Web

    We make available data related to subreddits and standalone forums from the manosphere.

    We also make available Perspective API annotations for all posts.

    You can find the code on GitHub.

    Please cite this paper if you use this data:

    @article{ribeiroevolution2021,
      title={The Evolution of the Manosphere Across the Web},
      author={Ribeiro, Manoel Horta and Blackburn, Jeremy and Bradlyn, Barry and De Cristofaro, Emiliano and Stringhini, Gianluca and Long, Summer and Greenberg, Stephanie and Zannettou, Savvas},
      booktitle={{Proceedings of the 15th International AAAI Conference on Weblogs and Social Media (ICWSM'21)}},
      year={2021}
    }

    1. Reddit data

    We make available data for forums and for relevant subreddits (56 of them, as described in subreddit_descriptions.csv). The Reddit data is provided in /ndjson/reddit.ndjson, one line per post. A sample entry is:

    { "author": "Handheld_Gaming", "date_post": 1546300852, "id_post": "abcusl", "number_post": 9.0, "subreddit": "Braincels", "text_post": "Its been 2019 for almost 1 hour And I am at a party with 120 people, half of them being foids. The last year had been the best in my life. I actually was happy living hope because I was redpilled to the death.

    Now that I am blackpilled I see that I am the shortest of all men and that I am the only one with a recessed jaw.

    Its over. Its only thanks to my age old friendship with chads and my social skills I had developed in the past year that a lot of men like me a lot as a friend.

    No leg lengthening syrgery is gonna save me. Ignorance was a bliss. Its just horror now seeing that everyone can make out wirth some slin hoe at the party.

    I actually feel so unbelivably bad for turbomanlets. Life as an unattractive manlet is a pain, I cant imagine the hell being an ugly turbomanlet is like. I would have roped instsntly if I were one. Its so unfair.

    Tallcels are fakecels and they all can (and should) suck my cock.

    If I were 17cm taller my life would be a heaven and I would be the happiest man alive.

    Just cope and wait for affordable body tranpslants.", "thread": "t3_abcusl" }

    2. Forums

    Here we describe the .sqlite and .ndjson files that contain the data from the following forums:

    (avfm) --- https://d2ec906f9aea-003845.vbulletin.net
    (incels) --- https://incels.co/
    (love_shy) --- http://love-shy.com/lsbb/
    (redpilltalk) --- https://redpilltalk.com/
    (mgtow) --- https://www.mgtow.com/forums/
    (rooshv) --- https://www.rooshvforum.com/
    (pua_forum) --- https://www.pick-up-artist-forum.com/
    (the_attraction) --- http://www.theattractionforums.com/

    The files are in folders /sqlite/ and /ndjson.

    2.1 .sqlite

    All the tables in the .sqlite datasets follow a very simple {key: value} format. Each key is a thread name (for example /threads/housewife-is-like-a-job.123835/) and each value is a Python dictionary or a list. Each .sqlite file contains three tables:

    idx: each key is the relative address of a thread and maps to a dict describing that thread:

    "type": (list) in some forums users can add a descriptor such as [RageFuel] to each topic, and there may also be special types of posts, like stickied/pool/locked posts;
    "title": (str) title of the thread;
    "link": (str) link to the thread;
    "author_topic": (str) username that created the thread;
    "replies": (int) number of replies, may differ from the number of posts due to differences in crawling date;
    "views": (int) number of views;
    "subforum": (str) name of the subforum;
    "collected": (bool) indicates whether raw posts have been collected;
    "crawled_idx_at": (str) datetime of the collection.

    processed_posts: each key is the relative address of a thread and maps to a list with its posts (in order). Each post is represented by a dict:

    "author": (str) author's username;
    "resume_author": (str) author's short self-description;
    "joined_author": (str) date the author joined;
    "messages_author": (int) number of messages the author has;
    "text_post": (str) text of the main post;
    "number_post": (int) number of the post in the thread;
    "id_post": (str) unique post identifier (format depends on the forum), guaranteed unique within a thread;
    "id_post_interaction": (list) ids of the other posts this post quoted;
    "date_post": (str) datetime of the post;
    "links": (tuple) the parsed URL, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw');
    "thread": (str) same as the key;
    "crawled_at": (str) datetime of the collection.

    raw_posts: each key is the relative address of a thread and maps to a list with its unprocessed posts (in order). Each post is represented by a dict:

    "post_raw": (binary) raw HTML binary;
    "crawled_at": (str) datetime of the collection.

    2.2 .ndjson

    Each line consists of a JSON object representing a different comment, with the following fields:

    "author": (str) author's username;
    "resume_author": (str) author's short self-description;
    "joined_author": (str) date the author joined;
    "messages_author": (int) number of messages the author has;
    "text_post": (str) text of the main post;
    "number_post": (int) number of the post in the thread;
    "id_post": (str) unique post identifier (format depends on the forum), guaranteed unique within a thread;
    "id_post_interaction": (list) ids of the other posts this post quoted;
    "date_post": (str) datetime of the post;
    "links": (tuple) the parsed URL, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw');
    "thread": (str) same as the key;
    "crawled_at": (str) datetime of the collection.

    3. Perspective

    We also ran each forum post and Reddit post through the Perspective API; the resulting files are located in the /perspective/ folder and are compressed with gzip. One example output:

    { "id_post": 5200, "hate_output": { "text": "I still can\u2019t wrap my mind around both of those articles about these c~~~s sleeping with poor Haitian Men. Where\u2019s the uproar?, where the hell is the outcry?, the \u201cpig\u201d comments or the \u201ccreeper comments\u201d. F~~~ing hell, if roles were reversed and it was an article about Men going to Europe where under 18 sex in legal, you better believe they would crucify the writer of that article and DEMAND an apology by the paper that wrote it.. This is exactly what I try and explain to people about the double standards within our modern society. A bunch of older women, wanna get their kicks off by sleeping with poor Men, just before they either hit or are at menopause age. F~~~ing unreal, I\u2019ll never forget going to Sweden and Norway a few years ago with one of my buddies and his girlfriend who was from there, the legal age of consent in Norway is 16 and in Sweden it\u2019s 15. I couldn\u2019t believe it, but my friend told me \u201c hey, it\u2019s normal here\u201d . Not only that but the age wasn\u2019t a big different in other European countries as well. One thing i learned very quickly was how very Misandric Sweden as well as Denmark were.", "TOXICITY": 0.6079781, "SEVERE_TOXICITY": 0.53744453, "INFLAMMATORY": 0.7279288, "PROFANITY": 0.58842486, "INSULT": 0.5511079, "OBSCENE": 0.9830818, "SPAM": 0.17009115 } }

    4. Working with sqlite

    A nice way to read some of the files of the dataset is using SqliteDict, for example:

    from sqlitedict import SqliteDict

    processed_posts = SqliteDict("./data/forums/incels.sqlite", tablename="processed_posts")

    for key, posts in processed_posts.items():
        for post in posts:
            # here you could do something with each post in the dataset
            pass
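    The same pattern should work for the other tables; for example, a sketch of listing thread-level metadata from the idx table, using the field names described in Section 2.1 above:

    from sqlitedict import SqliteDict

    # Thread-level metadata: keys are relative thread addresses, values are dicts
    idx = SqliteDict("./data/forums/incels.sqlite", tablename="idx")

    for thread_address, meta in idx.items():
        print(thread_address, meta["title"], meta["replies"], meta["views"])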

    5. Helpers

    Additionally, we provide two .sqlite files that are helpers used in the analyses. These are related to Reddit, not to the forums! They are:

    channel_dict.sqlite: a sqlite where each key corresponds to a subreddit and each value is a list of dictionaries of the users who posted on it, along with timestamps.

    author_dict.sqlite: a sqlite where each key corresponds to an author and each value is a list of dictionaries of the subreddits they posted on, along with timestamps.

    These are used in the paper for the migration analyses.
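    These helpers can presumably be opened with SqliteDict as well; a minimal sketch, assuming the files use SqliteDict's default table name and the path given here is a placeholder:

    from sqlitedict import SqliteDict

    # Keys are subreddit names; values are lists of dicts of users + timestamps
    channel_dict = SqliteDict("./data/channel_dict.sqlite")

    for subreddit, user_records in channel_dict.items():
        print(subreddit, len(user_records))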

    6. Examples and particularities for forums

    Although we did our best to clean the data and be consistent across forums, this is not always possible. In the following subsections we describe the particularities of each forum, note directions for improving the parsing that were not pursued, and give some examples of how things work in each forum.

    6.1 incels

    Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.

    types: for the incels forum the special types associated with each thread in the idx table are "Sticky", "Pool", "Closed", and the custom types added by users, such as [LifeFuel]. These last ones are all in brackets. You can see some examples of these on the example thread page.

    quotes: quote markup in this forum was well structured, and thus all quotations could be resolved deterministically.

    6.2 LoveShy

    Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.

    types: no types were parsed. There are some rules in the forum, but not significant.

    quotes: quotes were obtained from exact text+author match, or author match + a jaccard

  3. Dataset for "Neural embedding of beliefs reveals the role of relative dissonance in human decision-making"

    • figshare.com
    zip
    Updated Feb 6, 2025
    + more versions
    Cite
    Byunghwee Lee (2025). Dataset for "Neural embedding of beliefs reveals the role of relative dissonance in human decision-making". [Dataset]. http://doi.org/10.6084/m9.figshare.28327019.v3
    Available download formats: zip
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    figshare
    Authors
    Byunghwee Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This project contains the dataset used to generate the results of the study "Neural embedding of beliefs reveals the role of relative dissonance in human decision-making" (arXiv:2408.07237).

    Authors: Byunghwee Lee (1), Rachith Aiyappa (1), Yong-Yeol Ahn (1), Haewoon Kwak (1), Jisun An (1)
    (1) Center for Complex Networks and Systems Research, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, Indiana, USA, 47408

    DDO_dataset.zip (original Debate.org dataset)

    This archive contains the original raw Debate.org dataset, which was obtained from the publicly accessible website (https://esdurmus.github.io/ddo.html) maintained by Esin Durmus [1,2]. All credit for this dataset belongs entirely to the original authors, Esin Durmus and Claire Cardie. We do not claim any authorship or modifications to this dataset. It is provided here solely for reproducibility and reference in our study. The dataset includes the following three files:

    - debates.json: a JSON file containing a Python dictionary that maps a debate name (a unique name for each debate) to the debate information
    - users.json: a JSON file containing a Python dictionary with user information
    - readme.md: the readme file from the original authors (Esin Durmus and Claire Cardie)

    When using this dataset, please reference Debate.org and cite the following works:

    [1] Esin Durmus and Claire Cardie. 2019. A Corpus for Modeling User and Language Effects in Argumentation on Online Debating. In Proceedings of the 57th Conference of the Association for Computational Linguistics. Florence, Italy. Association for Computational Linguistics.
    [2] Esin Durmus and Claire Cardie. 2018. Exploring the Role of Prior Beliefs for Argument Persuasion. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).

    df_ddo_including_only_truebeliefs_nodup(N192307).p

    This file contains a pre-processed dataset used in our project (arXiv:2408.07237). The dataset includes records of user participation in debates (both as debaters and voters) as well as voting records across various debates. The belief triplet dataset used for fine-tuning a Sentence-BERT model was generated from this pre-processed dataset. Detailed explanations of the pre-processing procedure are provided in the Methods section of the paper. When using this pre-processed dataset, please cite the following reference (in addition to the two papers mentioned above):

    [3] Lee, B., Aiyappa, R., Ahn, Y. Y., Kwak, H., & An, J. (2024). Neural embedding of beliefs reveals the role of relative dissonance in human decision-making. arXiv preprint arXiv:2408.07237.

    model_full_data.zip

    This zip file contains five fine-tuned S-BERT models trained using a 5-fold belief triplet dataset. After unzipping the files, users can import the models using the 'sentence_transformers' Python library (https://sbert.net/).
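    A minimal sketch of loading one of the unzipped models with the sentence_transformers library; the directory name is a placeholder, since the actual folder names inside model_full_data.zip are not listed in this description.

    from sentence_transformers import SentenceTransformer

    # Path to one of the five unzipped fine-tuned S-BERT model folders (placeholder name)
    model = SentenceTransformer("model_full_data/fold_0")

    # Embed a few belief statements and inspect the result
    embeddings = model.encode(["Climate change is real.", "Vaccines are safe."])
    print(embeddings.shape)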

  4. HRDIC dataset of strain localization in shot peened Ni superalloy

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Jul 19, 2024
    Cite
    Quinta da Fonseca, João (2024). HRDIC dataset of strain localization in shot peened Ni superalloy [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4728015
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Quinta da Fonseca, João
    Orozco-Caballero, Alberto
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We used high-resolution digital image correlation (HRDIC) to measure the strain distribution at each deformation step near the shot peened surface of a Ni superalloy deformed in tension at 450 °C. To make the DIC analysis possible, we first developed a fine, homogeneously distributed gold speckle pattern by remodelling a thin gold layer previously deposited on the polished sample surface. The gold layer was deposited using an Edwards S150B sputter coater for 4 minutes, producing a gold layer ~40-45 nm thick. Then, the sample was placed on the heating plate of the remodelling device for 6 hours at 300 °C. During the remodelling process, water vapour flows onto the surface of the coated material and remodels the gold layer into fine speckles. In order to avoid any possible speckle coarsening during the high-temperature tensile testing, the remodelled sample was heated in a furnace at 475 °C for 5 hours and air cooled to stabilise the gold pattern.

    Due to the required stabilization heat treatment, the average speckle size was 120 nm, which is coarser than in previous studies at room temperature, where no thermal stabilization of the pattern is needed. We used a FEI Magellan HR 400L FE-SEM with a theoretical resolution of ≤ 0.9 nm at <1 kV to take backscattered electron images of the pattern. The images were obtained at a working distance of 3.5 mm, 5 kV and 0.8 nA beam current. Mosaics of 30x15 images were used to cover 950x420 square microns (the area shown in Fig. 1). Each image contains 2048 x 1768 pixels and has a horizontal field of view of 43 microns. The images were overlapped by 20% to enable easy stitching prior to the digital image correlation. We obtained 7 mosaics, one before tensile testing and 6 after each deformation step. The mosaics for the un-deformed and deformed states were correlated using LaVision's digital image correlation (DIC) software (version DaVis 8.3). The correlation was performed using a sub-window size of 16 x 16 pixels and no overlap, which provides a spatial resolution of about 335 nm. This resolution allowed us to handle the large data sets generated when covering such large areas (each mosaic was 45.5k x 20k pixels in size). The correlation produces full-field in-plane displacement maps u(x1, x2, 0) on the plane x1x2 with normal x3.

    Code for reading and visualizing this data is available here: https://doi.org/10.5281/zenodo.4727939. The shear strain data is also provided as a python dictionary saved as a NumPy array. Visualization previews are given as .png and .gif files.
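    A minimal sketch of loading a Python dictionary saved as a NumPy array, as described above; the file name is a placeholder, since the archive's exact file names are not listed in this description.

    import numpy as np

    # Load a Python dictionary that was saved as a NumPy array (placeholder file name)
    data = np.load("shear_strain.npy", allow_pickle=True).item()

    # Inspect the available keys before plotting or further analysis
    print(list(data.keys()))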

  5. FSD50K

    • data.niaid.nih.gov
    • opendatalab.com
    • +2more
    Updated Apr 24, 2022
    + more versions
    Cite
    Xavier Serra (2022). FSD50K [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4060431
    Dataset updated
    Apr 24, 2022
    Dataset provided by
    Xavier Favory
    Jordi Pons
    Xavier Serra
    Frederic Font
    Eduardo Fonseca
    Description

    FSD50K is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.

    Citation

    If you use the FSD50K dataset, or part of it, please cite our TASLP paper (available from [arXiv] [TASLP]):

    @article{fonseca2022FSD50K,
      title={{FSD50K}: an open dataset of human-labeled sound events},
      author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier},
      journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
      volume={30},
      pages={829--852},
      year={2022},
      publisher={IEEE}
    }

    Paper update: This paper has been published in TASLP at the beginning of 2022. The accepted camera-ready version includes a number of improvements with respect to the initial submission. The main updates include: estimation of the amount of label noise in FSD50K, SNR comparison between FSD50K and AudioSet, improved description of evaluation metrics including equations, clarification of experimental methodology and some results, some content moved to Appendix for readability. The TASLP-accepted camera-ready version is available from arXiv (in particular, it is v2 in arXiv, displayed by default).

    Data curators

    Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary Avendano and Sara Fernandez

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions, at efonseca@google.com.

    ABOUT FSD50K

    Freesound Dataset 50k (or FSD50K for short) is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology [1]. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.

    What follows is a brief summary of FSD50K's most important characteristics. Please have a look at our paper (especially Section 4) to extend the basic information provided here with relevant details for its usage, as well as discussion, limitations, applications and more.

    Basic characteristics:

    FSD50K contains 51,197 audio clips from Freesound, totalling 108.3 hours of multi-labeled audio

    The dataset encompasses 200 sound classes (144 leaf nodes and 56 intermediate nodes) hierarchically organized with a subset of the AudioSet Ontology.

    The audio content is composed mainly of sound events produced by physical sound sources and production mechanisms, including human sounds, sounds of things, animals, natural sounds, musical instruments and more. The vocabulary can be inspected in vocabulary.csv (see Files section below).

    The acoustic material has been manually labeled by humans following a data labeling process using the Freesound Annotator platform [2].

    Clips are of variable length from 0.3 to 30s, due to the diversity of the sound classes and the preferences of Freesound users when recording sounds.

    All clips are provided as uncompressed PCM 16 bit 44.1 kHz mono audio files.

    Ground truth labels are provided at the clip-level (i.e., weak labels).

    The dataset poses mainly a large-vocabulary multi-label sound event classification problem, but also allows development and evaluation of a variety of machine listening approaches (see Sec. 4D in our paper).

    In addition to audio clips and ground truth, additional metadata is made available (including raw annotations, sound predominance ratings, Freesound metadata, and more), allowing a variety of analyses and sound event research tasks (see Files section below).

    The audio clips are grouped into a development (dev) set and an evaluation (eval) set such that they do not have clips from the same Freesound uploader.

    Dev set:

    40,966 audio clips totalling 80.4 hours of audio

    Avg duration/clip: 7.1s

    114,271 smeared labels (i.e., labels propagated in the upwards direction to the root of the ontology)

    Labels are correct but could be occasionally incomplete

    A train/validation split is provided (Sec. 3H). If a different split is used, it should be specified for reproducibility and fair comparability of results (see Sec. 5C of our paper)

    Eval set:

    10,231 audio clips totalling 27.9 hours of audio

    Avg duration/clip: 9.8s

    38,596 smeared labels

    Eval set is labeled exhaustively (labels are correct and complete for the considered vocabulary)

    Note: All classes in FSD50K are represented in AudioSet, except Crash cymbal, Human group actions, Human voice, Respiratory sounds, and Domestic sounds, home sounds.

    LICENSE

    All audio clips in FSD50K are released under Creative Commons (CC) licenses. Each clip has its own license as defined by the clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. Specifically:

    The development set consists of 40,966 clips with the following licenses:

    CC0: 14,959

    CC-BY: 20,017

    CC-BY-NC: 4616

    CC Sampling+: 1374

    The evaluation set consists of 10,231 clips with the following licenses:

    CC0: 4914

    CC-BY: 3489

    CC-BY-NC: 1425

    CC Sampling+: 403

    For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses. The licenses are specified in the files dev_clips_info_FSD50K.json and eval_clips_info_FSD50K.json.

    In addition, FSD50K as a whole is the result of a curation process and it has an additional license: FSD50K is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSD50K.doc zip file. We note that the choice of one license for the dataset as a whole is not straightforward as it comprises items with different licenses (such as audio clips, annotations, or data split). The choice of a global license in these cases may warrant further investigation (e.g., by someone with a background in copyright law).

    Usage of FSD50K for commercial purposes:

    If you'd like to use FSD50K for commercial purposes, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.

    Also, if you are interested in using FSD50K for machine learning competitions, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.

    FILES

    FSD50K can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSD50K.dev_audio/                    Audio clips in the dev set
    │
    └───FSD50K.eval_audio/                   Audio clips in the eval set
    │
    └───FSD50K.ground_truth/                 Files for FSD50K's ground truth
    │   │
    │   └─── dev.csv                         Ground truth for the dev set
    │   │
    │   └─── eval.csv                        Ground truth for the eval set
    │   │
    │   └─── vocabulary.csv                  List of 200 sound classes in FSD50K
    │
    └───FSD50K.metadata/                     Files for additional metadata
    │   │
    │   └─── class_info_FSD50K.json          Metadata about the sound classes
    │   │
    │   └─── dev_clips_info_FSD50K.json      Metadata about the dev clips
    │   │
    │   └─── eval_clips_info_FSD50K.json     Metadata about the eval clips
    │   │
    │   └─── pp_pnp_ratings_FSD50K.json      PP/PNP ratings
    │   │
    │   └─── collection/                     Files for the sound collection format
    │
    └───FSD50K.doc/
        │
        └─── README.md                       The dataset description file that you are reading
        │
        └─── LICENSE-DATASET                 License of the FSD50K dataset as an entity

    Each row (i.e. audio clip) of dev.csv contains the following information:

    fname: the file name without the .wav extension, e.g., the fname 64760 corresponds to the file 64760.wav on disk. This number is the Freesound id. We always use Freesound ids as filenames.

    labels: the class labels (i.e., the ground truth). Note these class labels are smeared, i.e., the labels have been propagated in the upwards direction to the root of the ontology. More details about the label smearing process can be found in Appendix D of our paper.

    mids: the Freebase identifiers corresponding to the class labels, as defined in the AudioSet Ontology specification

    split: whether the clip belongs to train or val (see paper for details on the proposed split)

    Rows in eval.csv follow the same format, except that there is no split column.

    Note: We use a slightly different format than AudioSet for the naming of class labels in order to avoid potential problems with spaces, commas, etc. Example: we use Accelerating_and_revving_and_vroom instead of the original Accelerating, revving, vroom. You can go back to the original AudioSet naming using the information provided in vocabulary.csv (class label and mid for the 200 classes of FSD50K) and the AudioSet Ontology specification.
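    A minimal sketch of loading the ground truth with pandas, assuming the archives have been extracted under the directory layout above and that the labels column stores comma-separated class names (check dev.csv itself if in doubt); the column names come from the field list above.

    import pandas as pd

    # Load the dev ground truth: fname, labels, mids, split
    dev = pd.read_csv("FSD50K.ground_truth/dev.csv")

    # Split the smeared label string into a list of class labels per clip
    dev["label_list"] = dev["labels"].str.split(",")

    # Keep only the proposed training portion of the dev set
    train = dev[dev["split"] == "train"]
    print(len(train), "training clips")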

    Files with additional metadata (FSD50K.metadata/)

    To allow a variety of analysis and approaches with FSD50K, we provide the following metadata:

    class_info_FSD50K.json: python dictionary where each entry corresponds to one sound class and contains: FAQs utilized during the annotation of the class, examples (representative audio clips), and verification_examples (audio clips presented to raters during annotation as a quality control mechanism). Audio clips are described by the Freesound id. Note: It may be that some of these examples are not included in the FSD50K release.

    dev_clips_info_FSD50K.json: python dictionary where each entry corresponds to one dev clip and contains: title,

  6. CREMP-CycPeptMPDB: Conformer-rotamer ensembles of macrocyclic peptides for machine learning with permeability annotations

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bz2 +1
    Updated Aug 13, 2024
    Cite
    Colin A. Grambow; Hayley Weir; Christian N. Cunningham; Tommaso Biancalani; Kangway V. Chuang (2024). CREMP-CycPeptMPDB: Conformer-rotamer ensembles of macrocyclic peptides for machine learning with permeability annotations [Dataset]. http://doi.org/10.5281/zenodo.10798262
    Available download formats: application/gzip, bz2, csv
    Dataset updated
    Aug 13, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Colin A. Grambow; Hayley Weir; Christian N. Cunningham; Tommaso Biancalani; Kangway V. Chuang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CREMP-CycPeptMPDB: A resource generated for the rapid development and evaluation of machine learning models for permeable macrocyclic peptides. CREMP-CycPeptMPDB contains 3,258 unique macrocyclic peptides and their high-quality structural ensembles generated using the Conformer-Rotamer Ensemble Sampling Tool (CREST). Altogether, this dataset contains nearly 8.7 million unique macrocycle geometries, each annotated with energies derived from semi-empirical tight-binding DFT calculations and with experimental membrane permeability measurements obtained from the CycPeptMPDB database. We anticipate that this dataset will enable the development of machine learning models that can improve peptide design and optimization for novel therapeutics.

    This dataset complements the CREMP dataset, which contains a larger selection of conformer ensembles for homodetic macrocyclic peptides.

    We provide the data in two formats: as Python pickle files, which provide quick read access with RDKit version 2022.09.5 or later, and as text-based SDF files with associated metadata in JSON format. Each file is named based on its amino acid sequence, with residues separated by periods, using standard one-letter codes with lowercase letters representing D-amino acids and "Me" prefixes representing N-methylated amino acids. The sequences are in no particular order, e.g., "C.R.E.M.P" and "R.E.M.P.C" correspond to the same peptide macrocycle. The filename extensions are ".pickle", ".sdf", and ".json".

    Each file in the “pickle” folder contains a Python dictionary with amino acid sequence, SMILES, CREST metadata, and a single RDKit molecule object containing all conformers. All files in the folder were compressed into a single “pickle.tar.gz” archive. In the “sdf_and_json” folder, each individual SDF file contains all conformers, each associated with its own JSON file that contains CREST metadata. Similarly, all are compressed into another single archive, “sdf_and_json.tar.bz2”. A single summary CSV file is also provided containing ”sequence”, “smiles”, “num_monomers”, “num_atoms”, “num_heavy_atoms”, along with the CREST metadata “totalconfs”, “uniqueconfs”, “lowestenergy”, “poplowestpct”, “temperature”, “ensembleenergy”, “ensembleentropy”, and “ensemblefreeenergy”. The number of unique conformers with different 3D structures is given by “uniqueconfs”, while “totalconfs” includes the number of rotamers in addition.

    The unzipped sizes of the archives are approximately 13 GB for "pickle.tar.gz" and 84 GB for "sdf_and_json.tar.bz2". If you encounter errors when trying to load the pickle files, please make sure your RDKit version is at least 2022.09.5. If that doesn't work, try other Python versions.
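    A minimal sketch of opening one of the pickle files after extracting pickle.tar.gz. The sequence-based file name follows the naming scheme described above but is otherwise a placeholder, and the dictionary keys are inspected rather than assumed, since their exact names are not listed in this description.

    import pickle
    from rdkit import Chem  # RDKit >= 2022.09.5, as noted above

    # Load one peptide's dictionary (sequence-based file name, e.g. "C.R.E.M.P.pickle")
    with open("pickle/C.R.E.M.P.pickle", "rb") as f:
        entry = pickle.load(f)

    # Inspect the available keys (sequence, SMILES, CREST metadata, RDKit molecule)
    print(list(entry.keys()))

    # Count conformers on whichever value is the RDKit molecule object
    for value in entry.values():
        if isinstance(value, Chem.Mol):
            print("conformers:", value.GetNumConformers())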

  7. Data used in 'Local Wind Regime Induced by Giant Linear Dunes: Comparison of ERA5-Land Reanalysis with Surface Measurements'

    • eprints.soton.ac.uk
    • data.subak.org
    • +1more
    Updated Mar 10, 2022
    Cite
    Claudin, Philippe; Baddock, Mattew; Nield, Joanna M.; Delorme, Pauline; Gadal, Cyril; Wiggs, Giles; Narteau, Clément (2022). Data used in 'Local Wind Regime Induced by Giant Linear Dunes: Comparison of ERA5-Land Reanalysis with Surface Measurements.' [Dataset]. http://doi.org/10.5281/zenodo.6343137
    Dataset updated
    Mar 10, 2022
    Dataset provided by
    Zenodo
    Authors
    Claudin, Philippe; Baddock, Mattew; Nield, Joanna M.; Delorme, Pauline; Gadal, Cyril; Wiggs, Giles; Narteau, Clément
    Description

    This repository contains the data used in: Gadal, C., Delorme, P., Narteau, C. et al. Local Wind Regime Induced by Giant Linear Dunes: Comparison of ERA5-Land Reanalysis with Surface Measurements. Boundary-Layer Meteorol 185, 309–332 (2022). https://doi.org/10.1007/s10546-022-00733-6, where wind data measured at 4 different places in and across the Namib Sand Sea are compared to data from the ERA5/ERA5-Land climate reanalyses.

    To use this data, one should first look at the GitHub repository https://github.com/Cgadal/GiantDunes and at the corresponding documentation https://cgadal.github.io/GiantDunes/. The description below sometimes refers to scripts in https://github.com/Cgadal/GiantDunes/tree/master/Processing.

    The two folders 'raw_data' and 'processed_data' contain, respectively, the input raw data and the output data after processing used to make the paper figures. In each of them, '.npy' files contain Python dictionaries with different variables. They can be loaded with numpy as data = np.load('file.npy', allow_pickle=True).item(), and the different keys (variables) can be printed with data.keys(), or data[station].keys() if data.keys() returns the different stations.

    Unless specified otherwise below, all variables are given in the International System of Units (SI), and wind direction is given anticlockwise, with 0 being a wind blowing from the West to the East.

    raw_data:
    DEM: contains the Digital Elevation Models of the two stations from SRTM30, downloaded from https://dwtkns.com/srtm30m/
    ERA5: hourly data from the ERA5 climate reanalysis, on the surface (_BLH) and on pressure levels (_levels), downloaded from https://cds.climate.copernicus.eu/
    ERA5Land: hourly data from the ERA5-Land climate reanalysis, downloaded from https://cds.climate.copernicus.eu/
    KML_points: KML points of the measurement stations; they can be opened directly in Google Earth.
    measured_wind_data: contains the measured in situ data. The wind speed is measured using Vector Instruments A100-LK cup anemometers, the wind direction using a Vector Instruments W200-P wind vane, and the time using Campbell Instruments CR10X and CR1000X dataloggers.

    processed_data:
    'Data_preprocessed.npy': preprocessed data, output of 1_data_preprocessing_plot.py
    'Data_DEM.npy': properties of the processed DEM, output of 2_DEM_analysis_plot.py
    'Data_calib_roughness.npy': data from the calibration of the hydrodynamic roughnesses, output of 3_roughness_calibration_plot.py
    'Data_final.npy': file containing all computed quantities
    'time_series_hydro_coeffs.npy': file containing the time series of the hydrodynamic coefficients calculated by '5_norun_hydro_coeff_time_series.npy'

    Depending on the loaded data file, the main dictionary keys can be:
    'lat': latitude, in degrees
    'lon': longitude, in degrees
    'time': time vector, as datetime objects (https://docs.python.org/3/library/datetime.html)
    'DEM': elevation data array in [m], with dimensions matching the 'lat' and 'lon' vectors
    'z_mes', 'z_insitu', 'z_ERA5LAND': height of the corresponding velocity
    'direction': measured wind direction, in [degrees]
    'velocity': measured wind velocity, in [m/s]
    'orientaion': dune pattern orientation, [deg]
    'wavelength': dune pattern wavelength, [km]
    'z0_insitu': chosen hydrodynamic roughness for the considered station
    'U_insitu', 'Orientation_insitu': hourly averaged measured wind velocities and directions
    'U_era', 'Orientation_era': hourly 10 m wind data from the ERA5-Land dataset
    'Boundary layer height', 'blh': boundary layer height from the hourly ERA5 dataset
    'Pressure levels', 'levels': pressure levels from the pressure-level ERA5 dataset
    'Temperature', 't': temperature from the pressure-level ERA5 dataset
    'Specific humidity', 'q': specific humidity from the pressure-level ERA5 dataset
    'Geopotential', 'z': geopotential from the pressure-level ERA5 dataset
    'Virtual_potential_temperature': virtual potential temperature calculated from the pressure-level ERA5 dataset
    'Potential_temperature': potential temperature calculated from the pressure-level ERA5 dataset
    'Density': density calculated from the pressure-level ERA5 dataset
    'height': vertical coordinates calculated from the pressure-level ERA5 dataset
    'theta_ground': averaged virtual potential temperature within the ABL
    'delta_theta': virtual potential temperature at the ABL
    'gradient_free_atm': virtual potential temperature gradient in the FA
    'Froude': time series of the Froude number U/((delta_theta/theta_ground)*g*BLH)
    'kH': time series of the number 'kH'
    'kLB': time series of the internal Froude number kU/N

    Other keys are not relevant and are stored for verification purposes. For more details, please contact Cyril Gadal (see authors), and look at the GitHub repository https://github.com/Cgadal/GiantDunes, where all the code is available.
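    A minimal sketch of the loading pattern described above, using 'Data_final.npy' from the processed_data folder; whether the top-level keys are station names or variable names depends on the file, so the station-level access below is hedged accordingly.

    import numpy as np

    # Load a '.npy' file that stores a Python dictionary
    data = np.load("processed_data/Data_final.npy", allow_pickle=True).item()

    # Top-level keys may be variable names or station names, depending on the file
    print(data.keys())

    # If the top-level keys are stations, inspect one station's variables
    first_key = next(iter(data))
    if isinstance(data[first_key], dict):
        print(data[first_key].keys())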

