Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
PandasPlotBench
PandasPlotBench is a benchmark for assessing the capability of models in writing code for visualizations given the description of a Pandas DataFrame. 🛠️ Task. Given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the Matplotlib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
https://market.oceanprotocol.com/terms
This DataFrame contains an analysis of predictions of fossil CO2 emissions and greenhouse gas (GHG) emissions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Solar Wind Omni and SAMPEX (Solar Anomalous and Magnetospheric Particle Explorer) datasets used in examples for SEAnorm, a time-normalized superposed epoch analysis package in Python.
Both data sets are stored as either an HDF5 file or a compressed csv file (csv.bz2), each containing a Pandas DataFrame of either the Solar Wind Omni or the SAMPEX data set. The data sets were written with pandas.DataFrame.to_hdf() and pandas.DataFrame.to_csv() using a compression level of 9. The DataFrames can be read using pandas.read_hdf() or pandas.read_csv(), depending on the file format.
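For example, a minimal loading sketch (the file names here are placeholders; substitute the actual files from the archive):

```python
import pandas as pd

# HDF5 variant (assumed file name; read_hdf returns the single stored DataFrame)
omni = pd.read_hdf("omni.h5")

# Compressed-CSV variant (assumed file name); pandas infers bz2 compression
# from the .bz2 extension
sampex = pd.read_csv("sampex.csv.bz2", index_col=0, parse_dates=True)
```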
The Solar Wind Omni data set contains solar wind velocity (V) and dynamic pressure (P), the southward interplanetary magnetic field in Geocentric Solar Ecliptic System (GSE) coordinates (B_Z_GSE), the auroral electrojet index (AE), and the Sym-H index, all at 1 minute cadence.
The SAMPEX data set contains electron flux from the Proton/Electron Telescope (PET) in two energy channels, 1.5-6.0 MeV (ELO) and 2.5-14 MeV (EHI), at an approximate 6 second cadence.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels; one iron-based shape memory alloy is also included. Summary files provide an overview of the database, and data from the individual experiments are also included.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
The data is licensed under the Creative Commons Attribution 4.0 International license.
If you have used our data and are publishing your work, we ask that you please reference both:
this database through its DOI, and
any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.
Included Files
Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.
Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.
Unreduced_Data-#_v1-0-0.zip: contains the original (not downsampled) data
Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.
We recommend you un-zip all the folders and place them in one "Unreduced_Data" directory, similar to the "Clean_Data" directory.
The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.
The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
Clean_Data_v1-0-0.zip: contains all the downsampled data
The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.
The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
Database_References_v1-0-0.bib
Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.
File Format: Downsampled Data
These are the "LP_
The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data
Time[s]: time in seconds since the start of the test
e_true: true strain
Sigma_true: true stress in MPa
(optional) Temperature[C]: the surface temperature in degC
These data files can be easily loaded using the pandas library in Python through:

import pandas
data = pandas.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
File Format: Unreduced Data
These are the "LP_
The first column is the index of each data point
S/No: sample number recorded by the DAQ
System Date: Date and time of sample
Time[s]: time in seconds since the start of the test
C_1_Force[kN]: load cell force
C_1_Déform1[mm]: extensometer displacement
C_1_Déplacement[mm]: cross-head displacement
Eng_Stress[MPa]: engineering stress
Eng_Strain[]: engineering strain
e_true: true strain
Sigma_true: true stress in MPa
(optional) Temperature[C]: specimen surface temperature in degC
The data can be loaded and used similarly to the downsampled data.
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
hidden_index: internal reference ID
grade: material grade
spec: specifications for the material
source: base material for the test specimen
id: internal name for the specimen
lp: load protocol
size: type of specimen (M8, M12, M20)
gage_length_mm_: unreduced section length in mm
avg_reduced_dia_mm_: average measured diameter for the reduced section in mm
avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm
avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm
fy_n_mpa_: nominal yield stress
fu_n_mpa_: nominal ultimate stress
t_a_deg_c_: ambient temperature in degC
date: date of test
investigator: person(s) who conducted the test
location: laboratory where test was conducted
machine: setup used to conduct test
pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control
pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control
pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control
citekey: reference corresponding to the Database_References.bib file
yield_stress_mpa_: computed yield stress in MPa
elastic_modulus_mpa_: computed elastic modulus in MPa
fracture_strain: computed average true strain across the fracture surface
c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass
file: file name of corresponding clean (downsampled) stress-strain data
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
import pandas as pd
# date and version are set so that the path resolves to the file name, e.g.
# 'Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv'
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv', index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1], keep_default_na=False, na_values='')
citekey: reference in "Campaign_References.bib".
Grade: material grade.
Spec.: specifications (e.g., J2+N).
Yield Stress [MPa]: initial yield stress in MPa
size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
Elastic Modulus [MPa]: initial elastic modulus in MPa
size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
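With the DataFrame loaded as above, campaign statistics are accessed through the two-level column index; a minimal sketch using the column names listed above:

```python
# Assumes tab1 was loaded with the read_csv call shown above.
mean_fy = tab1[('Yield Stress [MPa]', 'mean')]     # mean initial yield stress per campaign
cv_E = tab1[('Elastic Modulus [MPa]', 'coefvar')]  # coefficient of variation of E per campaign
```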
Caveats
The specimens in the following directories were tested before the protocol was established. Therefore, only the true stress-strain data is available for each:
A500
A992_Gr50
BCP325
BCR295
HYP400
S460NL
S690QL/25mm
S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
https://creativecommons.org/licenses/publicdomain/
This repository contains data on 17,419 DOIs cited in the IPCC Working Group 2 contribution to the Sixth Assessment Report, and the code to link them to the dataset built at the Curtin Open Knowledge Initiative (COKI).
References were extracted from the report's PDFs (downloaded 2022-03-01) via Scholarcy and exported as RIS and BibTeX files. DOI strings were identified from the RIS files by pattern matching and saved as a CSV file. The list of DOIs for each chapter and cross-chapter paper was processed using a custom Python script to generate a pandas DataFrame, which was saved as a CSV file and uploaded to Google BigQuery.
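As an illustration, the pattern-matching step might look like the following sketch (the regular expression and chapter naming are assumptions; the actual pipeline is in preprocessing.R and process.py):

```python
import re
from pathlib import Path
import pandas as pd

# Assumed DOI pattern; the real cleaning rules live in preprocessing.R.
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"<>]+')

records = []
for ris_file in Path("data/scholarcy/RIS").glob("*.ris"):
    text = ris_file.read_text(errors="ignore")
    for doi in sorted(set(DOI_RE.findall(text))):
        records.append({"chapter": ris_file.stem, "doi": doi.lower()})

pd.DataFrame(records).to_csv("IPCC_AR6_WGII_dois.csv", index=False)
```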
We used the main object table of the Academic Observatory, which combines information from Crossref, Unpaywall, Microsoft Academic, Open Citations, the Research Organization Registry, and Geonames to enrich the DOIs with bibliographic information, affiliations, and open access status. A custom query was used to join and format the data, and the resulting table was visualised in a Google Data Studio dashboard.
This version of the repository also includes the set of DOIs from references in the IPCC Working Group 1 contribution to the Sixth Assessment Report as extracted by Alexis-Michel Mugabushaka and shared on Zenodo: https://doi.org/10.5281/zenodo.5475442 (CC-BY)
A brief descriptive analysis was provided as a blogpost on the COKI website.
The repository contains the following content:
Data:
data/scholarcy/RIS/ - extracted references as RIS files
data/scholarcy/BibTeX/ - extracted references as BibTeX files
IPCC_AR6_WGII_dois.csv - list of DOIs
data/10.5281_zenodo.5475442/ - references from IPCC AR6 WG1 report
Processing:
preprocessing.R - preprocessing steps for identifying and cleaning DOIs
process.py - Python script for transforming data and linking to COKI data through Google Big Query
Outcomes:
Dataset on BigQuery - requires a Google account for access and a BigQuery account for querying
Data Studio Dashboard - interactive analysis of the generated data
Zotero library of references extracted via Scholarcy
PDF version of blogpost
Note on licenses: Data are made available under CC0 (with the exception of the WG1 reference data, which have been shared under CC BY 4.0). Code is made available under the Apache License 2.0.
Web archive derivatives of the University Archives collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The cul-1914-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
Binary Analysis
Images
PDFs
Presentation program files
Spreadsheets
Text files
Word processor files
The cul-1914-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.
Domains count file. A text file containing the frequency count of domains captured within your web archive.
Due to file size restrictions in Scholars Portal Dataverse, each of the derivative files needed to be split into 1G parts. These parts can be joined back together with cat. For example:
cat cul-1914-parquet.tar.gz.part* > cul-1914-parquet.tar.gz
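Once extracted, the Parquet derivatives can be read straight into Pandas; a minimal sketch (the subdirectory name is an assumption about the archive layout):

```python
import pandas as pd

# pandas (with pyarrow installed) reads a directory of Parquet part-files
# as a single dataset; "cul-1914-parquet/domains" is an assumed path.
domains = pd.read_parquet("cul-1914-parquet/domains")
print(domains.sort_values("count", ascending=False).head())
```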
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Burke Library New York City Religions collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The cul-1945-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
The cul-1945-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present all of the data across our SNR and abundance study for the molecule H2O for an exoEarth twin. The wavelength range is from 0.515-1 micron, with 25 evenly spaced 20% bandpasses in this range. The SNR ranges from 3-16, and the abundance values range from log10(VMR) = -3.5 to -1.5 in steps of 0.5 and 0.25 (all presented in VMR in the associated table). We present the lower and upper wavelength per bandpass, the input H2O value (abundance case), the retrieved H2O value (presented as the log10(VMR)), the lower and upper limits of the 68% credible region (presented as the log10(VMR)), and the log-Bayes factor for H2O. For more information about how these were calculated, please see Bayesian Analysis for Remote Biosignature Identification on exoEarths (BARBIE) I: Using Grid-Based Nested Sampling in Coronagraphy Observation Simulations for H2O, accepted and currently available on arXiv.
To open this csv as a Pandas dataframe, use the following commands:

import pandas as pd
your_dataframe_name = pd.read_csv('zenodo_table.csv', dtype={'Input H2O': str})
polyOne Data Set
The data set contains 100 million hypothetical polymers each with 29 predicted properties using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.
Full data set including the properties
The data files are in Apache Parquet format. The files start with polyOne_*.parquet.
I recommend using dask (pip install dask) to load and process the data set. Pandas also works but is slower.
Load sharded data set with dask:

```python
import dask.dataframe as dd
ddf = dd.read_parquet("*.parquet", engine="pyarrow")
```

For example, compute the description of the data set:

```python
df_describe = ddf.describe().compute()
df_describe
```
PSMILES strings only
generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Freely Accessible eJournals collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The cul-5921-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
The cul-5921-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Contemporary Composers Web Archive (CCWA) collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The ivy-4019-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
The ivy-4019-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Derivatives of the Web Archive of Independent News Sites on Turkish Affairs collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The ivy-12911-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
The ivy-12911-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Popline and K4Health Web Archive collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The ivy-12006-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
The ivy-12006-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
This dataset contains Dfd population imaging data and behavioral data, including responses to optogenetic stimulation. Part of Braun et al. 2023.
There are two different types of tar.gz files. The ones ending with "processed.tar.gz" contain data related to one fly:
1. "background_image.tif" is a standard deviation projection of raw fluorescence data used as background in figure 2c.
2. "roi_center_annotation.pdf" is a plot indicating the results of semi-automated ROI detection.
3. "ROI_centers.txt" indicates the location of said ROI centers.
4. "ROI_mask.tif" is the mask used for ROI extraction for this fly.
The other tar.gz files each correspond to one 10 min long recording of neuronal activity and behavior of one fly. If multiple recordings were made in one fly, their sequence is indicated with a 3-digit number in the name. Each of those folders contains 3 subfolders: "2p" holds the synchronisation data and metadata and the two-photon recording metadata; "behData" contains the behavioral camera metadata; "processed" contains a pickled pandas DataFrame containing all processed behavioral variables (beh_df.pkl) and all neuronal time series (twop_df.pkl) required to reproduce the figures. Raw behavioral videos and raw fluorescence data are available upon request from the authors and are omitted here because of their size.
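The pickled DataFrames can be opened directly with pandas; a minimal sketch (the relative paths are assumptions based on the folder layout described above):

```python
import pandas as pd

# Assumed paths inside one extracted recording folder.
beh_df = pd.read_pickle("processed/beh_df.pkl")    # processed behavioral variables
twop_df = pd.read_pickle("processed/twop_df.pkl")  # neuronal time series
```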
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Sites of the Quebec International Relations and Economy collection from the Bibliothèque et Archives nationales du Québec. The derivatives were created with the Archives Unleashed Toolkit. Merci beaucoup BAnQ!
These derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset_1: The dataset consists of transaction timestamps (in hours) for a sample of online banking transactions. The timestamps represent the time of day when the transactions occurred.
Dataset_2: The dataset, encapsulated as a pandas DataFrame "trans_David", records the transactional activities of an individual named David. The "channel_cd" column identifies the payment channel David used for each transaction. The dataset comprises 40 entries across 14 columns, with "channel_cd" serving as the basis for deriving the "freq_channel" feature.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the State Elections Web Archive collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The ivy-10793-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
The ivy-10793-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Rare Book and Manuscript Library collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The cul-2766-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
The cul-2766-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Resistance collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The cul-8752-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
Binary Analysis
PDFs
Spreadsheets
Text files
Word processor files
The cul-8752-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.
Domains count file. A text file containing the frequency count of domains captured within your web archive.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Ministry of Environment of Québec (2011-2014) collection from the Bibliothèque et Archives nationales du Québec. The derivatives were created with the Archives Unleashed Toolkit. Merci beaucoup BAnQ!
These derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url