Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets were used to validate and test the data pipeline deployment following the RADON approach. The dataset consists of a single CSV file containing around 32,000 Twitter tweets, from which 100 CSV files of 320 tweets each have been created. Those 100 CSV files are used to validate and test (performance/load testing) the data pipeline components.
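As an illustration only, here is a minimal Python/pandas sketch of how a single CSV could be split into 100 files of 320 tweets each (the file names are assumptions, not part of the dataset):
import pandas as pd
tweets = pd.read_csv("tweets.csv")  # hypothetical name for the single ~32,000-tweet file
chunk_size = 320
for i in range(0, len(tweets), chunk_size):
    tweets.iloc[i:i + chunk_size].to_csv(f"tweets_part_{i // chunk_size:03d}.csv", index=False)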
This dataset was created by DINESH JATAV
https://webtechsurvey.com/terms
A complete list of live websites using the Import Users From Csv With Meta technology, compiled through global website indexing conducted by WebTechSurvey.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The CsvReader is a component designed to read and process CSV (Comma-Separated Values) files, which are widely used for storing tabular data. This component can be used to load CSV files, perform operations like filtering and aggregation, and then output the results. It is a valuable tool for data preprocessing in various workflows, including data analysis and machine learning pipelines.
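The CsvReader component itself is not reproduced here; as a rough sketch of the load/filter/aggregate workflow it describes, using pandas (file and column names are hypothetical):
import pandas as pd
df = pd.read_csv("input.csv")                            # load the CSV file
filtered = df[df["value"] > 0]                           # filtering step
summary = filtered.groupby("category")["value"].mean()   # aggregation step
summary.to_csv("output.csv")                             # output the results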
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels, and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database, and data from the individual experiments are also included.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
Included Files
File Format: Downsampled Data
These are the "LP_
These data files can be easily loaded using the pandas library in Python through:
import pandas
data = pandas.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
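For instance, the two columns can be pulled out by name after loading a file as shown above (reading "e_true" as true strain and "Sigma_true" as true stress is an assumption based on the column names):
import pandas
data = pandas.read_csv(data_file, index_col=0)  # data_file: path to one of the downsampled CSV files
strain = data["e_true"]       # presumably the true strain history
stress = data["Sigma_true"]   # presumably the true stress history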
File Format: Unreduced Data
These are the "LP_
The data can be loaded and used similarly to the downsampled data.
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
import pandas as pd
# 'date' and 'version' correspond to the release date and version suffix in the file name
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
                   index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
                   keep_default_na=False, na_values='')
Caveats
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Demo
Dataset Summary
This is a demo dataset with two files, train.csv and test.csv. Load it by:
from datasets import load_dataset
data_files = {"train": "train.csv", "test": "test.csv"}
demo = load_dataset("stevhliu/demo", data_files=data_files)
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset Structure
Data Instances
[More Information… See the full description on the dataset page: https://huggingface.co/datasets/Axion004/my-awesome-dataset.
To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.
You can easily subset the data into the car types that you will be modeling by first loading the CSV into R. Here is the code for how you do this:
df <- read.csv('uber.csv')
df_black <- subset(df, df$name == 'Black')  # keep only the 'Black' car type, for example
write.csv(df_black, "nameofthefileyouwanttosaveas.csv")
getwd()  # shows the directory the file was written to
[doc] formats - csv - 1
This dataset contains one csv file at the root:
data.csv
kind,sound
dog,woof
cat,meow
pokemon,pika
human,hello
size_categories:
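For reference, a minimal pandas sketch for loading the data.csv file shown above:
import pandas as pd
df = pd.read_csv("data.csv")  # two columns: kind, sound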
This data release supports interpretations of field-observed root distributions within a shallow landslide headscarp (CB1) located below Mettman Ridge within the Oregon Coast Range, approximately 15 km northeast of Coos Bay, Oregon, USA (Schmidt_2021_CB1_topo_far.png and Schmidt_2021_CB1_topo_close.png). Root species, diameter (greater than or equal to 1 mm), general orientation relative to the slide scarp, and depth below ground surface were characterized immediately following landsliding in response to large-magnitude precipitation in November 1996, which triggered thousands of landslides within the area (Montgomery and others, 2009). The enclosed data include: (1) tests of root-thread failure as a function of root diameter and tensile load for different plant species applicable to the broader Oregon Coast Range and (2) a tape and compass survey of the planform geometry of the CB1 landslide and the roots observed in the slide scarp. Root diameter and load measurements were principally collected in the general area of the CB1 slide for the 12 species listed in Schmidt_2021_OR_root_species_list.csv. The methodology of the failure tests included identifying roots of a given plant species, trimming root threads into 15-20 cm long segments, measuring diameters including bark (up to 6.5 mm) with a micrometer at multiple points along the segment to arrive at an average, clamping a segment end to a calibrated spring, and loading roots until failure while recording the maximum load. Files containing the tensile failure tests described in Schmidt and others (2001) include root diameter (mm), critical tensile load at failure (kg), root cross-sectional area (m^2), and tensile strength (MPa). Tensile strengths were calculated as: (critical tensile load at failure * gravitational acceleration)/root cross-sectional area. The files are labeled: Schmidt_2021_OR_root_AceCir.csv, Schmidt_2021_OR_root_AceMac.csv, Schmidt_2021_OR_root_AlnRub.csv, Schmidt_2021_OR_root_AnaMar.csv, Schmidt_2021_OR_root_DigPur.csv, Schmidt_2021_OR_root_MahNer.csv, Schmidt_2021_OR_root_PolMun.csv, Schmidt_2021_OR_root_PseMen_damaged.csv, Schmidt_2021_OR_root_PseMen_healthy.csv, Schmidt_2021_OR_root_RubDis.csv, Schmidt_2021_OR_root_RubPar.csv, Schmidt_2021_OR_root_SamCae.csv, and Schmidt_2021_OR_root_TsuHet.csv. File naming follows the convention of adopting the first three letters of the binomial system defining genus and species of their Latin names. Live and damaged roots were identified based on their color, texture, plasticity, adherence of bark to woody material, and compressibility. For example, healthy live Douglas-fir (Pseudotsuga menziesii) roots (Schmidt_2021_OR_root_PseMen_healthy.csv) have a crimson-colored inner bark, darkening to a brownish red in dead Douglas-fir roots. Both are distinctive colors. Live roots exhibited plastic responses to bending and strong adherence of bark, whereas dead roots displayed brittle behavior with bending and poor adherence of bark to the underlying woody material. Damaged root threads with fungal infections, resulting from selective tree harvest using yarding operations that damaged the bark of standing trees, exhibited significantly lower measured tensile strengths than their ultimate living tensile strengths (Schmidt_2021_OR_root_PseMen_damaged.csv). The CB1 site was clear-cut logged in 1987 and replanted with Douglas-fir saplings in 1989.
Vegetation in the vicinity of the failure scarp is dominated by young Douglas-fir saplings planted two years after the clear cut, blue elderberry (Sambucus caerulea), thimbleberry (Rubus parviflorus), foxglove (Digitalis purpurea), and Himalayan blackberry (Rubus discolor). The remaining seven species are provided for context of more regional studies. The CB1 site is a hillslope hollow that failed as a shallow landslide and mobilized as a debris flow during heavy rainfall in November 1996. Prior to debris flow mobilization, the ~5-m wide slide with a source area of roughly 860 m^2 and an average slope of 43° displaced and broke numerous roots. Following landsliding, field observations noted a preponderance of exposed, blunt broken root stubs within the scarp. Roots were not straight and smooth, but rather exhibited tortuous growth paths with firmly anchored, interlocking structures. The planform geometry represented by a tape and compass field survey is presented as starting and ending points of slide margin segments of roughly equal colluvial soil depths above saprolite or bedrock (Schmidt_2021_CB1_scarp_geometry.csv and Schmidt_2021_CB1_scarp_pts.shp). The graphic Schmidt_2021_CB1_scarp_pts_poly.png shows the horseshoe-shaped profile and its numbered scarp segments. Segment numbers enclosed within parentheses indicate segments where roots were not counted owing to occlusion by prior ground disturbance. The shapefile Schmidt_2021_CB1_scarp_poly.shp also represents the scarp line segments. The file Schmidt_2021_CB1_segment_info.csv presents the segment information as left and right cumulative lengths, averaged colluvial soil depths for each segment, and inclinations of the ground surface slope relative to horizontal along the perimeter (P) and the slide scarp face (F). Lastly, Schmidt_2021_CB1_rootdata_scarp.csv reports the root diameter of individual threads measured by a micrometer, species, depth below ground surface, live vs. dead roots, general root orientation (parallel or perpendicular) relative to the scarp perimeter, and cumulative perimeter distance within the scarp segments. At CB1 specifically, and more generally across the Oregon Coast Range, root reinforcement occurs primarily by lateral reinforcement, with typically much smaller basal reinforcements.
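As a hedged illustration of the tensile-strength formula quoted above, the following Python sketch recomputes tensile strength from root diameter and failure load; the column names are assumptions, so check the actual headers of the Schmidt_2021_OR_root_*.csv files before use:
import math
import pandas as pd
df = pd.read_csv("Schmidt_2021_OR_root_PseMen_healthy.csv")
g = 9.81                                        # gravitational acceleration, m/s^2
diameter_m = df["diameter_mm"] / 1000.0         # hypothetical column: root diameter in mm
area_m2 = math.pi * diameter_m ** 2 / 4.0       # root cross-sectional area, m^2
load_n = df["failure_load_kg"] * g              # hypothetical column: critical tensile load at failure, kg -> N
tensile_strength_mpa = load_n / area_m2 / 1e6   # (load * g) / area, converted from Pa to MPa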
Train data of the Riiid competition is a large dataset of over 100 million rows and 10 columns that does not fit into a Kaggle Notebook's RAM using the default pandas read_csv, resulting in a search for alternative approaches and formats.
Train data of the Riiid competition in different formats.
Reading the .CSV file for the Riiid competition took a huge amount of time and memory. This inspired me to convert the .CSV into different file formats so that they can be loaded easily into a Kaggle kernel.
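A minimal sketch of that kind of conversion (file names and target formats are assumptions; the published notebook may use different ones):
import pandas as pd
df = pd.read_csv("train.csv")
df.to_parquet("train.parquet")    # requires pyarrow or fastparquet
df.to_feather("train.feather")    # requires pyarrow
df = pd.read_parquet("train.parquet")  # later, reload much faster than the original CSV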
https://webtechsurvey.com/terms
A complete list of live websites using the AIT CSV Import / Export technology, compiled through global website indexing conducted by WebTechSurvey.
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164:
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):

- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- few hours to few month of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as a current directory.
- copy `settings.py` into the extracted folder. Edit the file:
    * set `DATASET_PATH` to some newly created folder path
    * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools` Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt` .
- disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/init.py`, comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speedup the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes.

- create a folder `
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
bob2
This dataset was automatically uploaded from the red-team-agent repository.
Dataset Information
Original file: bob2.csv
Source path: /home/ubuntu/red-team-agent/bob2.csv
Validation: Valid CSV with 6 rows, 5 columns (0.0MB)
Usage
import pandas as pd
from datasets import load_dataset
dataset = load_dataset("aq1048576/bob2")
df = pd.read_csv("hf://datasets/aq1048576/bob2/data.csv")… See the full description on the dataset page: https://huggingface.co/datasets/aq1048576/bob2.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset containing measurements of Linux Kernel binary size after compilation. The reported size, in the column "perf", is the size in bytes of the vmlinux file. It also contains a column "active_options" reporting the number of activated options (set to "y"). All other columns, the list being reported in the file "Linux_options.json", are Linux kernel options. The sampling has been done using randconfig. The version of Linux used is 4.13.3.
Not all available options are present. First, it only contains options for the x86, 64-bit version. Then, all non-tristate options have been ignored. Finally, options that do not take more than one value across the whole dataset, due to insufficient variability in the sampling, are ignored. All options are encoded as 0 for the "n" and "m" option values, and 1 for "y".
In Python, importing the dataset using pandas will assign all columns the int64 dtype, which leads to a very large memory consumption (~50GB). We provide the following way to import it using less than 1 GB of memory, by setting the option columns to int8.
import pandas as pd
import json
import numpy

# Load the list of option column names and read those columns as int8 to save memory
with open("Linux_options.json", "r") as f:
    linux_options = json.load(f)

data = pd.read_csv("Linux.csv", dtype={opt: numpy.int8 for opt in linux_options})
We provide MATLAB binary files (.mat) and comma separated values files of data collected from a pilot study of a plug load management system that allows for the metering and control of individual electrical plug loads. The study included 15 power strips, each containing 4 channels (receptacles), which wirelessly transmitted power consumption data approximately once per second to 3 bridges. The bridges were connected to a building local area network which relayed data to a cloud-based service. Data were archived once per minute with the minimum, mean, and maximum power draw over each one minute interval recorded. The uncontrolled portion of the testing spanned approximately five weeks and established a baseline energy consumption. The controlled portion of the testing employed schedule-based rules for turning off selected loads during non-business hours; it also modified the energy saver policies for certain devices. Three folders are provided: “matFilesAllChOneDate” provides a MAT-file for each date, each file has all channels; “matFilesOneChAllDates” provides a MAT-file for each channel, each file has all dates; “csvFiles” provides comma separated values files for each date (note that because of data export size limitations, there are 10 csv files for each date). Each folder has the same data; there is no practical difference in content, only the way in which it is organized.
Adapted OTTO datasets from JSON to CSV. Code: https://www.kaggle.com/code/adamnarozniak/full-otto-dataset-in-csv-for-pandas
To load the data (and decrease the memory needed), use:
import pandas as pd
import numpy as np
df = pd.read_csv(path, dtype={"session": np.uint32, "aid": np.uint32, "type": np.uint8}, parse_dates=["ts"])
The type was changed compared to the original data using this dictionary:
type_dict = { 'clicks': 0, 'carts': 1, 'orders': 2 }
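To map the encoded values back to their original labels, the dictionary can simply be inverted (a small sketch, assuming the column is named "type" as in the read_csv call above):
type_dict = {'clicks': 0, 'carts': 1, 'orders': 2}
inverse_type_dict = {v: k for k, v in type_dict.items()}
df["type_name"] = df["type"].map(inverse_type_dict)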
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The CSV version of SynD. This contains 22 CSV files.
Csv Investments Private Limited Export Import Data. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.
The GCEW herbicide data were collected from 1991-2010, and are documented at plot, field, and watershed scales. Atrazine concentrations in Goodwater Creek Experimental Watershed (GCEW) were shown to be among the highest of any watershed in the United States based on comparisons using the national Watershed Regressions for Pesticides (WARP) model and by direct comparison with the 112 watersheds used in the development of WARP. This 20-yr-long effort was augmented with a spatially broad effort within the Central Mississippi River Basin encompassing 12 related claypan watersheds in the Salt River Basin, two cave streams on the fringe of the Central Claypan Areas in the Bonne Femme watershed, and 95 streams in northern Missouri and southern Iowa. The research effort on herbicide transport has highlighted the importance of restrictive soil layers with smectitic mineralogy to the risk of transport vulnerability. Near-surface soil features, such as claypans and argillic horizons, result in greater herbicide transport than soils with high saturated hydraulic conductivities and low smectitic clay content. The data set contains concentration, load, and daily discharge data for Devils Icebox Cave and Hunters Cave from 1999 to 2002. The data are available in Microsoft Excel 2010 format. Sheet 1 (Cave Streams Metadata) contains supporting information regarding the length of record, site locations, parameters measured, parameter units, method detection limits, the meaning of zero and blank cells, and unit area load computations. Sheet 2 (Devils Icebox Concentration Data) contains concentration data from all samples collected from 1999 to 2002 at the Devils Icebox site for 12 analytes and two computed nutrient parameters. Sheet 3 (Devils Icebox SS Conc Data) contains 15-minute suspended sediment (SS) concentrations estimated from turbidity sensor data for the Devils Icebox site. Sheet 4 (Devils Icebox Load & Discharge Data) contains daily data for discharge, load, and unit area loads for the Devils Icebox site. Sheet 5 (Hunters Cave Concentration Data) contains concentration data from all samples collected from 1999 to 2002 at the Hunters Cave site for 12 analytes and two computed nutrient parameters. Sheet 6 (Hunters Cave SS Conc Data) contains 15-minute SS concentrations estimated from turbidity sensor data for the Hunters Cave site. Sheet 7 (Hunters Cave Load & Discharge Data) contains daily data for discharge, load, and unit area loads for the Hunters Cave site. [Note: To support automated data access and processing, each worksheet has been extracted as a separate, machine-readable CSV file; see Data Dictionary for descriptions of variables and their concentration units.]
Resources in this dataset:
Resource Title: README - Metadata. File Name: LTAR_GCEW_herbicidewater_qual.xlsx. Resource Description: Defines Water Quality and Sediment Load/Discharge parameters, abbreviations, time-frames, and units as rendered in the Excel file. For additional information including site information, method detection limits, and methods citations, see the Metadata tab. For definitions used in machine-readable CSV files, see the Data Dictionary.
Resource Title: Excel data spreadsheet. File Name: c3.jeq2013.12.0516.ds1_.xlsx. Resource Description: Multi-page data spreadsheet containing data as well as metadata from this study.
A direct download of the data spreadsheet can be found here: https://dl.sciencesocieties.org/publications/datasets/jeq/C3.JEQ2013.12.0516.ds1/download
Resource Title: Devils Icebox Concentration Data. File Name: DevilsIceboxConcData.csv. Resource Description: Concentrations of herbicides, metabolites, and nutrients (extracted from the Excel tab into machine-readable CSV data).
Resource Title: Devils Icebox Load and Discharge Data. File Name: DevilsIceboxLoad&Discharge.csv. Resource Description: Discharge and Unit Area Loads for herbicides, metabolites, and suspended sediments (extracted from Excel tab as machine-readable CSV data).
Resource Title: Devils Icebox Suspended Sediment Concentration Data. File Name: DevilsIceboxSSConcData.csv. Resource Description: Suspended Sediment Concentration Data (extracted from Excel tab as machine-readable CSV data).
Resource Title: Hunters Cave Load and Discharge Data. File Name: HuntersCaveLoad&Discharge.csv. Resource Description: Discharge and Unit Area Loads for herbicides, metabolites, and suspended sediments (extracted from Excel tab as machine-readable CSV data).
Resource Title: Hunters Cave Suspended Sediment Concentration Data. File Name: HuntersCaveSSConc.csv. Resource Description: Suspended Sediment Concentration Data (extracted from Excel tab as machine-readable CSV data).
Resource Title: Data Dictionary for machine-readable CSV files. File Name: LTAR_GCEW_herbicidewater_qual.csv. Resource Description: Defines Water Quality and Sediment Load/Discharge parameters, abbreviations, time-frames, and units as implemented in the extracted machine-readable CSV files.
Resource Title: Hunters Cave Concentration Data. File Name: HuntersCaveConcData.csv. Resource Description: Concentrations of herbicides, metabolites, and nutrients (extracted from the Excel tab into machine-readable CSV data).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains .csv files. The data contains load and generation time series for all the 10 kV or 400 V nodes in the network.
Load and Generation time-series data:
Load time-series
> active and reactive power at 1 hour resolution
> aggregated time-series at the 60 kV-10 kV substation
> individual load time-series at 10 kV or 400 V nodes
> 27 different load profiles grouped into household, commercial, agricultural and miscellaneous
Generation time-series
> active power at 1 hour resolution
> wind and solar generation time-series from meteorological data
This item is a part of the collection 'DTU 7k-Bus Active Distribution Network': https://doi.org/10.11583/DTU.c.5389910
For more information, access the readme file: https://doi.org/10.11583/DTU.14971812