Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
PandasPlotBench
PandasPlotBench is a benchmark for assessing the capability of models in writing code for visualizations given the description of a Pandas DataFrame. 🛠️ Task. Given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the Matplotlib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
https://market.oceanprotocol.com/terms
This DataFrame contains an analysis of predictions of fossil CO2 emissions and greenhouse gas (GHG) emissions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Solar Wind Omni and SAMPEX (Solar Anomalous and Magnetospheric Particle Explorer) datasets used in examples for SEAnorm, a time-normalized superposed epoch analysis package in Python.
Both data sets are stored as either an HDF5 file or a compressed csv file (csv.bz2), each containing a Pandas DataFrame of either the Solar Wind Omni or the SAMPEX data set. The data sets were written with pandas.DataFrame.to_hdf() and pandas.DataFrame.to_csv() using a compression level of 9. The DataFrames can be read using pandas.read_hdf() or pandas.read_csv(), depending on the file format.
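For example, a minimal loading sketch (the file names here are placeholders; substitute the actual files from the archive):

```python
import pandas as pd

# HDF5 variant (assumed file name; read_hdf returns the single stored DataFrame)
omni = pd.read_hdf("omni.h5")

# Compressed-CSV variant (assumed file name); pandas infers bz2 compression
# from the .bz2 extension
sampex = pd.read_csv("sampex.csv.bz2", index_col=0, parse_dates=True)
```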
The Solar Wind Omni data set contains solar wind velocity (V) and dynamic pressure (P), the southward interplanetary magnetic field in Geocentric Solar Ecliptic System (GSE) coordinates (B_Z_GSE), the auroral electrojet index (AE), and the Sym-H index, all at 1 minute cadence.
The SAMPEX data set contains electron flux from the Proton/Electron Telescope (PET) in two energy channels, 1.5-6.0 MeV (ELO) and 2.5-14 MeV (EHI), at an approximate 6 second cadence.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels; one iron-based shape memory alloy is also included. Summary files provide an overview of the database, and data from the individual experiments are also included.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
The data is licensed under the Creative Commons Attribution 4.0 International license.
If you have used our data and are publishing your work, we ask that you please reference both:
this database through its DOI, and
any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.
Included Files
Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.
Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.
Unreduced_Data-#_v1-0-0.zip: contains the original (not downsampled) data
Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.
We recommend you un-zip all the folders and place them in one "Unreduced_Data" directory, similar to the "Clean_Data" directory.
The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.
The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
Clean_Data_v1-0-0.zip: contains all the downsampled data
The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.
The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
Database_References_v1-0-0.bib
Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.
File Format: Downsampled Data
These are the "LP_
The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data
Time[s]: time in seconds since the start of the test
e_true: true strain
Sigma_true: true stress in MPa
(optional) Temperature[C]: the surface temperature in degC
These data files can be easily loaded using the pandas library in Python through:

import pandas
data = pandas.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
File Format: Unreduced Data
These are the "LP_
The first column is the index of each data point
S/No: sample number recorded by the DAQ
System Date: Date and time of sample
Time[s]: time in seconds since the start of the test
C_1_Force[kN]: load cell force
C_1_Déform1[mm]: extensometer displacement
C_1_Déplacement[mm]: cross-head displacement
Eng_Stress[MPa]: engineering stress
Eng_Strain[]: engineering strain
e_true: true strain
Sigma_true: true stress in MPa
(optional) Temperature[C]: specimen surface temperature in degC
The data can be loaded and used similarly to the downsampled data.
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
hidden_index: internal reference ID
grade: material grade
spec: specifications for the material
source: base material for the test specimen
id: internal name for the specimen
lp: load protocol
size: type of specimen (M8, M12, M20)
gage_length_mm_: unreduced section length in mm
avg_reduced_dia_mm_: average measured diameter for the reduced section in mm
avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm
avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm
fy_n_mpa_: nominal yield stress
fu_n_mpa_: nominal ultimate stress
t_a_deg_c_: ambient temperature in degC
date: date of test
investigator: person(s) who conducted the test
location: laboratory where test was conducted
machine: setup used to conduct test
pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control
pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control
pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control
citekey: reference corresponding to the Database_References.bib file
yield_stress_mpa_: computed yield stress in MPa
elastic_modulus_mpa_: computed elastic modulus in MPa
fracture_strain: computed average true strain across the fracture surface
c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass
file: file name of corresponding clean (downsampled) stress-strain data
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
import pandas as pd
# date and version are set so that the path resolves to the file name, e.g.
# 'Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv'
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv', index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1], keep_default_na=False, na_values='')
citekey: reference in "Campaign_References.bib".
Grade: material grade.
Spec.: specifications (e.g., J2+N).
Yield Stress [MPa]: initial yield stress in MPa
size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
Elastic Modulus [MPa]: initial elastic modulus in MPa
size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
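With the DataFrame loaded as above, campaign statistics are accessed through the two-level column index; a minimal sketch using the column names listed above:

```python
# Assumes tab1 was loaded with the read_csv call shown above.
mean_fy = tab1[('Yield Stress [MPa]', 'mean')]     # mean initial yield stress per campaign
cv_E = tab1[('Elastic Modulus [MPa]', 'coefvar')]  # coefficient of variation of E per campaign
```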
Caveats
The specimens in the following directories were tested before the protocol was established. Therefore, only the true stress-strain data is available for each:
A500
A992_Gr50
BCP325
BCR295
HYP400
S460NL
S690QL/25mm
S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
https://creativecommons.org/licenses/publicdomain/
This repository contains data on 17,419 DOIs cited in the IPCC Working Group 2 contribution to the Sixth Assessment Report, and the code to link them to the dataset built at the Curtin Open Knowledge Initiative (COKI).
References were extracted from the report's PDFs (downloaded 2022-03-01) via Scholarcy and exported as RIS and BibTeX files. DOI strings were identified from the RIS files by pattern matching and saved as a CSV file. The list of DOIs for each chapter and cross-chapter paper was processed using a custom Python script to generate a pandas DataFrame, which was saved as a CSV file and uploaded to Google BigQuery.
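As an illustration, the pattern-matching step might look like the following sketch (the regular expression and chapter naming are assumptions; the actual pipeline is in preprocessing.R and process.py):

```python
import re
from pathlib import Path
import pandas as pd

# Assumed DOI pattern; the real cleaning rules live in preprocessing.R.
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"<>]+')

records = []
for ris_file in Path("data/scholarcy/RIS").glob("*.ris"):
    text = ris_file.read_text(errors="ignore")
    for doi in sorted(set(DOI_RE.findall(text))):
        records.append({"chapter": ris_file.stem, "doi": doi.lower()})

pd.DataFrame(records).to_csv("IPCC_AR6_WGII_dois.csv", index=False)
```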
We used the main object table of the Academic Observatory, which combines information from Crossref, Unpaywall, Microsoft Academic, Open Citations, the Research Organization Registry, and Geonames to enrich the DOIs with bibliographic information, affiliations, and open access status. A custom query was used to join and format the data, and the resulting table was visualised in a Google Data Studio dashboard.
This version of the repository also includes the set of DOIs from references in the IPCC Working Group 1 contribution to the Sixth Assessment Report as extracted by Alexis-Michel Mugabushaka and shared on Zenodo: https://doi.org/10.5281/zenodo.5475442 (CC-BY)
A brief descriptive analysis was provided as a blogpost on the COKI website.
The repository contains the following content:
Data:
data/scholarcy/RIS/ - extracted references as RIS files
data/scholarcy/BibTeX/ - extracted references as BibTeX files
IPCC_AR6_WGII_dois.csv - list of DOIs
data/10.5281_zenodo.5475442/ - references from IPCC AR6 WG1 report
Processing:
preprocessing.R - preprocessing steps for identifying and cleaning DOIs
process.py - Python script for transforming data and linking to COKI data through Google Big Query
Outcomes:
Dataset on BigQuery - requires a Google account for access and a BigQuery account for querying
Data Studio Dashboard - interactive analysis of the generated data
Zotero library of references extracted via Scholarcy
PDF version of blogpost
Note on licenses: Data are made available under CC0 (with the exception of the WG1 reference data, which have been shared under CC BY 4.0). Code is made available under the Apache License 2.0.
Web archive derivatives of the University Archives collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The cul-1914-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
Binary Analysis
Images
PDFs
Presentation program files
Spreadsheets
Text files
Word processor files
The cul-1914-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.
Domains count file. A text file containing the frequency count of domains captured within your web archive.
Due to file size restrictions in Scholars Portal Dataverse, each of the derivative files needed to be split into 1G parts. These parts can be joined back together with cat. For example:
cat cul-1914-parquet.tar.gz.part* > cul-1914-parquet.tar.gz
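Once extracted, the Parquet derivatives can be read straight into Pandas; a minimal sketch (the subdirectory name is an assumption about the archive layout):

```python
import pandas as pd

# pandas (with pyarrow installed) reads a directory of Parquet part-files
# as a single dataset; "cul-1914-parquet/domains" is an assumed path.
domains = pd.read_parquet("cul-1914-parquet/domains")
print(domains.sort_values("count", ascending=False).head())
```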
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Burke Library New York City Religions collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The cul-1945-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
The cul-1945-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present all of the data across our SNR and abundance study for the molecule H2O for an exoEarth twin. The wavelength range is from 0.515-1 micron, with 25 evenly spaced 20% bandpasses in this range. The SNR ranges from 3-16, and the abundance values range from log10(VMR) = -3.5 to -1.5 in steps of 0.5 and 0.25 (all presented in VMR in the associated table). We present the lower and upper wavelength per bandpass, the input H2O value (abundance case), the retrieved H2O value (presented as the log10(VMR)), the lower and upper limits of the 68% credible region (presented as the log10(VMR)), and the log-Bayes factor for H2O. For more information about how these were calculated, please see Bayesian Analysis for Remote Biosignature Identification on exoEarths (BARBIE) I: Using Grid-Based Nested Sampling in Coronagraphy Observation Simulations for H2O, accepted and currently available on arXiv.
To open this csv as a Pandas dataframe, use the following commands:

import pandas as pd
your_dataframe_name = pd.read_csv('zenodo_table.csv', dtype={'Input H2O': str})
polyOne Data Set
The data set contains 100 million hypothetical polymers each with 29 predicted properties using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.
Full data set including the properties
The data files are in Apache Parquet format. The files start with polyOne_*.parquet.
I recommend using dask (pip install dask) to load and process the data set. Pandas also works but is slower.
Load sharded data set with dask:

```python
import dask.dataframe as dd
ddf = dd.read_parquet("*.parquet", engine="pyarrow")
```

For example, compute the description of the data set:

```python
df_describe = ddf.describe().compute()
df_describe
```
PSMILES strings only
generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Freely Accessible eJournals collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The cul-5921-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
The cul-5921-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Contemporary Composers Web Archive (CCWA) collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The ivy-4019-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
The ivy-4019-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Derivatives of the Web Archive of Independent News Sites on Turkish Affairs collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The ivy-12911-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
The ivy-12911-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Popline and K4Health Web Archive collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The ivy-12006-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
The ivy-12006-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
This dataset contains Dfd population imaging data and behavioral data, including responses to optogenetic stimulation. Part of Braun et al. 2023.
There are two different types of tar.gz files. The ones ending with "processed.tar.gz" contain data related to one fly:
1. "background_image.tif" is a standard deviation projection of raw fluorescence data used as background in figure 2c.
2. "roi_center_annotation.pdf" is a plot indicating the results of semi-automated ROI detection.
3. "ROI_centers.txt" indicates the location of said ROI centers.
4. "ROI_mask.tif" is the mask used for ROI extraction for this fly.
The other tar.gz files each correspond to one 10 min long recording of neuronal activity and behavior of one fly. If multiple recordings were made in one fly, their sequence is indicated with a 3-digit number in the name. Each of those folders contains 3 subfolders: "2p" holds the synchronisation data and metadata and the two-photon recording metadata; "behData" contains the behavioral camera metadata; "processed" contains a pickled pandas DataFrame containing all processed behavioral variables (beh_df.pkl) and all neuronal time series (twop_df.pkl) required to reproduce the figures. Raw behavioral videos and raw fluorescence data are available upon request from the authors and are omitted here because of their size.
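The pickled DataFrames can be opened directly with pandas; a minimal sketch (the relative paths are assumptions based on the folder layout described above):

```python
import pandas as pd

# Assumed paths inside one extracted recording folder.
beh_df = pd.read_pickle("processed/beh_df.pkl")    # processed behavioral variables
twop_df = pd.read_pickle("processed/twop_df.pkl")  # neuronal time series
```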
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Sites of the Quebec International Relations and Economy collection from the Bibliothèque et Archives nationales du Québec. The derivatives were created with the Archives Unleashed Toolkit. Merci beaucoup BAnQ!
These derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset_1: The dataset consists of transaction timestamps (in hours) for a sample of online banking transactions. The timestamps represent the time of day when the transactions occurred.
Dataset_2: The dataset, encapsulated as a pandas DataFrame "trans_David", records the transactional activities of an individual named David. The "channel_cd" column identifies the payment channel David used for each transaction. The dataset comprises 40 entries across 14 columns, with "channel_cd" serving as the basis for deriving the "freq_channel" feature.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the State Elections Web Archive collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The ivy-10793-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
The ivy-10793-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Rare Book and Manuscript Library collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The cul-2766-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
The cul-2766-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Resistance collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The cul-8752-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
Binary Analysis
PDFs
Spreadsheets
Text files
Word processor files
The cul-8752-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.
Domains count file. A text file containing the frequency count of domains captured within your web archive.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Ministry of Environment of Québec (2011-2014) collection from the Bibliothèque et Archives nationales du Québec. The derivatives were created with the Archives Unleashed Toolkit. Merci beaucoup BAnQ!
These derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url