Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This LatentJam dataset contains 12 music latent representations (9 content-based and 3 collaborative-filtering) for 29 275 music tracks from Jamendo. It is released as part of the publication "Similarity of Nearest-Neighbor Query Results in Deep Latent Spaces". For more details on the individual models used for extraction, refer to the text of the paper.
The example code that uses this dataset and reproduces the experiments and analysis done in the paper is available at https://github.com/philtgun/compare-embeddings. See the README for more details on how to use this dataset and individual files.
The content-based representations have been extracted with the Essentia library (essentia-tensorflow 2.1b6.dev374) during the first author's internship at Jamendo in 2021 as part of the MIP-Frontiers project. The collaborative filtering representations have been computed from data provided by Jamendo with the Implicit library (AlternatingLeastSquares algorithm, implicit 0.4.4). The Python version used is 3.7.13.
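For readers unfamiliar with the Implicit library, the sketch below illustrates how ALS item embeddings of this kind can be computed; the interaction matrix, factor count, and shapes are illustrative assumptions, not the exact configuration used for LatentJam.

# Illustrative sketch of computing collaborative-filtering track embeddings
# with implicit 0.4.4; matrix contents and factor count are assumptions.
import scipy.sparse as sparse
from implicit.als import AlternatingLeastSquares

# Sparse item-user interaction matrix (tracks x users); implicit 0.4.x
# expects items on the rows when calling fit().
item_user = sparse.random(29275, 1000, density=0.01, format="csr")

model = AlternatingLeastSquares(factors=128)  # factor count is an assumption
model.fit(item_user)

track_embeddings = model.item_factors  # one latent vector per track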
This dataset is released under CC BY-NC-SA 4.0 License.
Please cite the publication if you use this dataset:
@inproceedings{tovstogan_similarity_2022,
title = {Similarity of nearest-neighbor query results in deep latent spaces},
author = {Tovstogan, Philip and Serra, Xavier and Bogdanov, Dmitry},
booktitle = {Proceedings of the 19th Sound and Music Computing Conference ({SMC})},
year = {2022}
}
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Overview

Hessian QM9 is the first database of equilibrium configurations and numerical Hessian matrices, consisting of 41,645 molecules from the QM9 dataset at the $\omega$B97x/6-31G* level. Molecular Hessians were calculated in vacuum, as well as in water, tetrahydrofuran, and toluene using an implicit solvation model. A pre-print article associated with this dataset is available here.

Data records

The dataset is stored in Hugging Face's dataset format. For each of the four implicit solvent environments (vacuum, THF, toluene, and water), the data is divided into separate datasets containing vibrational analysis of 41,645 optimized geometries. Labels follow the QM9 molecule labelling system given by Ramakrishnan et al. Please note that only molecules containing H, C, N, and O were considered. This exclusion was due to the limited number of molecules containing fluorine in the QM9 dataset, which was not sufficient to build a good description of the chemical environment for fluorine atoms. Including these molecules may have reduced the overall precision of any models trained on our data.

Load the dataset

Use the following Python script to load the dataset dictionary:

from datasets import load_from_disk

dataset = load_from_disk(root_directory)  # root_directory: path to the downloaded dataset
print(dataset)
Expected output:

DatasetDict({
    vacuum: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    thf: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    toluene: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    water: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    })
})
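Individual records can then be indexed per solvent split. The sketch below is a minimal illustration; the field names come from the output above, while the array shapes are assumptions (N denotes the number of atoms).

# Minimal sketch of inspecting one record; field names come from the
# dataset description above, array shapes are assumptions.
import numpy as np

record = dataset["vacuum"][0]
positions = np.array(record["positions"])      # assumed shape (N, 3)
frequencies = np.array(record["frequencies"])  # one value per vibrational mode
print(record["label"], record["energy"], positions.shape, frequencies.shape)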
DFT Methods

All DFT calculations were carried out using the NWChem software package. The density functional used was $\omega$B97x with a 6-31G* basis set, to create data compatible with the ANI-1/ANI-1x/ANI-2x datasets. The self-consistent field (SCF) cycle was considered converged when changes in total energy and density were less than 1e-6 eV. All molecules in the set are neutral with a multiplicity of 1. The Mura-Knowles radial quadrature and Lebedev angular quadrature were used in the integration. Structures were optimized in vacuum and in three solvents (tetrahydrofuran, toluene, and water) using an implicit solvation model. The Hessian matrices, vibrational frequencies, and normal modes were computed for a subset of 41,645 molecular geometries using the finite-differences method.

Example model weights

An example model trained on the Hessian data is included in this dataset. Full details of the model will be provided in an upcoming publication. The model is an E(3)-equivariant graph neural network built with the e3x package. To load the model weights, use:

import jax.numpy as jnp

params = jnp.load('params_train_f128_i5_b16.npz', allow_pickle=True)['params'].item()
This dataset contains all data used in the linked paper. In addition, it contains all Python scripts used for the evaluation of the data. Note that the Python module pynocular is used within the scripts; this module has not yet been published, but its release is planned via https://github.com/baw-de.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 10 files with the data used to create the 11 figures in the journal paper:
Elsender, Daniel & Bate, Matthew R., 2024, Monthly Notices of the Royal Astronomical Society
The files contain the data and plotting scripts required to make the figures. Some files contain smoothed particle hydrodynamics (SPH) dump files; alongside these are the SPLASH (Price 2007) config files that were used to produce the figures in the paper.
The SPH dump files are Fortran binary files written in big endian format and generated by the sphNG code (Benz 1990; Bate 1995; Whitehouse & Bate 2004; Price & Monaghan 2007). They can be read, visualised, and manipulated using the free, publicly available SPLASH visualisation code (which reads sphNG dump files), written by Daniel J. Price, that can be downloaded from:
http://users.monash.edu.au/~dprice/splash/
Files are as follows:
fig_1: Used for plotting the results of the dustywave tests. Contains 4 sub folders: k_1, k_10, k_100, k_1000, which refer to the drag coefficients. Each sub folder has 2 SPH dump files, VD- and VG-, along with 3 SPLASH config files. VD- contains information to plot the dust quantities, and VG- contains information to plot the gas quantities.
fig_2: Used to plot the results of the dustyshock test. Contains 1 SPH dump and 2 SPLASH configs.
fig_3_4: Used to plot maximum densities in 4 different protostellar collapse calculations. Within this folder is a Python script (central_dens.py) and two sub folders: non-rotating and B008. Within these sub folders are dust-as-mixture and dust-as-particles folders, which refer to the two different dust evolution methods used. Within each of these folders there are 6 folders, one for each dust grain species, each containing a text file 'dustmass.txt'. The 'dustmass.txt' files have three columns: time, gas density, and dust density (see the reading sketch after this file list).
fig_5: Contains SPH dump files used to plot snapshots of three of the dust-as-mixture calculations, grain sizes 1, 30, 100 microns. The SPLASH config files are in the top level of this folder with the SPH dump files within sub folders 1mu, 30mu, and 100mu.
fig_6: The same as fig_5 except to plot snapshots from some of the dust-as-particles calculations, with grain sizes 1, 30, and 100 microns.
fig_7: Used to plot a radial dust density profile from 4 dust-as-mixture calculations; 2 calculations use the implicit method described in the paper, and 2 use an explicit timestepping method. In the top level of the folder is the Python script profile.py that will recreate the figure. The data used to make the plots are in files beginning with QD-; radius is in units of [cm] and density in units of [g cm^(-3)].
fig_8: Contains data used to plot snapshots from 2 planet-in-disc calculations that employ either the implicit method from the paper or an explicit method. The SPLASH configs are in the top level of the folder, and the explicit and implicit data files are in sub folders named as such. Each folder contains the 4 SPH dumps used in the paper.
fig_9: 3 text files and 1 Python script. The text files contain 3 columns: time [yrs], dust mass, and gas mass [solar masses]. The files correspond to calculations of discs with embedded planets using different implementations of dust. The data from the implicit method described in the paper are in dustmass_imp.txt, the explicit version of this calculation is in dustmass.txt, and an explicit calculation with a different dust formulation (described in the paper) is in dustmass_old.txt.
fig_10: Contains, in sub folders implicit and explicit, SPH dump files for the results of the dust settle test after 20 orbits, and the SPLASH config files to set the plot up.
fig_11: Two sub folders, explicit and implicit, each containing the SPH dump file needed to plot the cross section of the disc. The SPLASH config files are set up to take a cross section of the disc.
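As a minimal illustration, the three-column text files described above (the 'dustmass.txt' files in fig_3_4, and similarly the fig_9 files) can be read and plotted with a few lines of Python. The file path below is an assumption (adjust it to the sub folder and grain species of interest); only the column layout comes from the descriptions above.

# Illustrative sketch for reading one 'dustmass.txt' file from fig_3_4
# (columns: time, gas density, dust density); the path is an assumption.
import numpy as np
import matplotlib.pyplot as plt

time, gas_density, dust_density = np.loadtxt(
    "fig_3_4/non-rotating/dust-as-mixture/1mu/dustmass.txt", unpack=True
)

plt.semilogy(time, gas_density, label="gas")
plt.semilogy(time, dust_density, label="dust")
plt.xlabel("time")
plt.ylabel("maximum density")
plt.legend()
plt.show()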
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data files contain source code for dataset creation & model learning (neural-jsdf.zip) and a collected synthetic dataset of free & collided postures for the Franka robotic arm (sdf_3m_full_mesh.mat). Follow the Readme.MD files to launch the code if needed.
Corresponding Git repo: https://github.com/epfl-lasa/Neural-JSDF
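The .mat dataset can be inspected from Python as sketched below; the variable names stored inside the file are not documented here, so the snippet simply lists whatever keys are present. This is an illustrative assumption, not part of the released code.

# Illustrative sketch: list the variables stored in the released .mat file.
import scipy.io as sio

data = sio.loadmat("sdf_3m_full_mesh.mat")
print([key for key in data if not key.startswith("__")])  # skip MATLAB metadata keys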
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The "LLM World of Words" (LWOW) [1] is a collection of datasets of English free association norms generated by various large language models (LLMs). Currently, the collection consists of datasets generated by Mistral, LLaMA3, and Claude Haiku. The datasets are modeled after the "Small World of Words" (SWOW) (https://smallworldofwords.org/en/project/) [2] English free association norms, generated by humans, consisting of over 12,000 cue words and over 3 million responses. The purpose of the LWOW datasets is to provide a way to investigate various aspects of the semantic memory of LLMs using an approach that has been applied extensively for investigating the semantic memory of humans. These datasets, together with the SWOW dataset, can be used to gain insights about similarities and differences in the language structures possessed by humans and LLMs.
Free associations are implicit mental connections between words or concepts. They are typically accessed by presenting humans (or AI agents) with a cue word and then asking them to respond with the first words that come to mind. The responses represent implicit associations that connect different concepts in the mind, reflecting the semantic representations that underlie patterns of thought, memory, and language. For example, given the cue word "woman", a common free association response might be "man", reflecting the associative mental relation between these two concepts.
Free associations have been extensively used in cognitive psychology and linguistics as a tool for studying language and cognitive information processing. They provide a way for researchers to understand how conceptual knowledge is organized and accessed in the mind. Free associations are often used to build network models of semantic memory by connecting cue words to their responses. When thousands of cues and responses are connected in this way, the result is a large network model that represents the complex organization of semantic knowledge. Such models enable the investigation of cognitive processes that take place within semantic memory, and can be used to study a variety of cognitive phenomena such as language learning, creativity, personality traits, and cognitive biases.
The LWOW datasets were validated using data from the Semantic Priming Project (https://www.montana.edu/attmemlab/spp.html) [3], which implements a lexical decision task (LDT) to study semantic priming. The semantic priming effect is the cognitive phenomenon whereby a target word (e.g. nurse) is more easily recognized when it is preceded by a related prime word (e.g. doctor) than by an unrelated prime word (e.g. doctrine). We simulated the semantic priming effect within network models of semantic memory built from both the LWOW and the SWOW free association norms by implementing spreading activation processes within the networks [4]. We found that the final activation levels of prime-target pairs correlated significantly with reaction time data for the same prime-target pairs from the LDT. Specifically, the activation of a target node (e.g. nurse) is higher when a related prime node (e.g. doctor) is activated compared to an unrelated prime node (e.g. doctrine). These results show how the LWOW datasets can be used for investigating cognitive and linguistic phenomena in LLMs and demonstrate the validity of the datasets.
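For intuition, the toy function below sketches one generic form of spreading activation on a free-association network: activation starts at a prime node, and a fraction of it is passed to neighbours over a few iterations. This is a hypothetical illustration, not the implementation used in the paper (see [4]); the edge-list file name, retention parameter, and step count are all assumptions.

# Hypothetical sketch of spreading activation on a free-association network;
# not the authors' implementation (reference [4] describes that).
import networkx as nx

G = nx.read_edgelist("edge_lists/mistral.csv", delimiter=",")  # assumed file name/format

def spread_activation(graph, prime, steps=3, retention=0.5):
    # Each step, every node keeps a fraction of its activation and
    # distributes the rest equally among its neighbours.
    activation = {node: 0.0 for node in graph}
    activation[prime] = 1.0
    for _ in range(steps):
        updated = {node: value * retention for node, value in activation.items()}
        for node, value in activation.items():
            if value == 0.0:
                continue
            share = value * (1 - retention) / max(graph.degree(node), 1)
            for neighbour in graph.neighbors(node):
                updated[neighbour] += share
        activation = updated
    return activation

# A related prime should leave more activation on the target than an unrelated one:
print(spread_activation(G, "doctor").get("nurse"))
print(spread_activation(G, "doctrine").get("nurse"))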
To demonstrate how this dataset can be used to investigate gender biases in LLMs compared to humans, we conducted an analysis using network models of semantic memory built from both the LWOW and the SWOW free association norms. We applied a methodology that simulates semantic priming within the networks to measure the strength of association between pairs of concepts, for example, "woman" and "forceful" vs. "man" and "forceful". We applied this methodology using a set of female-related and male-related primes, and a set of female-related and male-related targets. This analysis revealed that certain adjectives like "forceful" and "strong" are more strongly associated with certain genders, shedding light on the types of stereotypical gender biases that both humans and LLMs possess.
The free associations were generated (either via API or locally, depending on the LLM) by providing each LLM with a set of cue words and the following prompt: "You will be provided with an input word. Write the first 3 words you associate to it separated by a comma." This prompt was repeated 100 times for each cue word, resulting in a dataset of 11,545 unique cue words and 3,463,500 total responses for each LLM.
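As a hedged illustration of this procedure, the loop below shows how such responses might be collected; query_llm is a hypothetical stand-in for whichever API or local inference call each model used, and the cue and output file names are assumptions.

# Hypothetical sketch of the generation loop described above; query_llm is a
# placeholder for the actual API or local inference call used for each LLM.
import csv

PROMPT = ("You will be provided with an input word. Write the first 3 words "
          "you associate to it separated by a comma.")

def query_llm(prompt, cue):
    raise NotImplementedError("replace with an API call or local inference")

with open("cues.txt") as f:
    cues = [line.strip() for line in f]

with open("responses.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for cue in cues:
        for _ in range(100):  # each cue is presented 100 times
            responses = query_llm(PROMPT, cue).split(",")
            writer.writerow([cue] + [r.strip() for r in responses[:3]])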
The LWOW datasets for Mistral, Llama3, and Haiku can be found in the LWOW_datasets folder, which contains two subfolders. The .csv files of the processed cues and responses can be found in the processed_datasets folder while the .csv files of the edge lists of the semantic networks constructed from the datasets can be found in the graphs/edge_lists folder.
Since the LWOW datasets are intended to be used in comparison to humans, we have further processed the original SWOW dataset to create a Human dataset that is aligned with the processing we applied to the LWOW datasets. While this Human dataset is not included in this repository due to the license of the original SWOW dataset, it can be easily reproduced by running the code provided in the reproducibility folder. We highly encourage you to generate this dataset, as it enables a direct comparison between humans and LLMs.
To reproduce the analyses, the required external files first need to be downloaded.
Once the files are saved in the correct folders, follow the instructions in each script, which can be found in the reproducibility folder. The scripts should be run in the following order:
Abramski, K., et al. (2024). The "LLM World of Words" English free association norms generated by large language models (https://arxiv.org/abs/2412.01330)
For speaking requests and enquiries, please contact:
[1] Abramski, K., et al. (2024). The "LLM World of Words": English free association norms generated by large language models. https://arxiv.org/abs/2412.01330