6 datasets found
  1. Data from: arXiv Dataset

    • kaggle.com
    Updated Jul 5, 2025
    Cite
    Cornell University (2025). arXiv Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/7548853
    Dataset updated
    Jul 5, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Cornell University
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About ArXiv

    For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.

    In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

    Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

    The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!

    ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.

    The release of this dataset was featured further in a Kaggle blog post here.

    See here for more information.

    ArXiv On Kaggle

    Metadata

    This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the json format. This file contains an entry for each paper, containing:

    - id: ArXiv ID (can be used to access the paper, see below)
    - submitter: Who submitted the paper
    - authors: Authors of the paper
    - title: Title of the paper
    - comments: Additional info, such as number of pages and figures
    - journal-ref: Information about the journal the paper was published in
    - doi: https://www.doi.org
    - abstract: The abstract of the paper
    - categories: Categories / tags in the ArXiv system
    - versions: A version history
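As a minimal sketch of how the fields listed above might be read, the following assumes the metadata file stores one JSON object per line and that `categories` is a single space-separated string. Both are assumptions about the file layout, not statements from the description. The function name `iter_papers` and the sample records are hypothetical.

```python
import io
import json
from typing import Iterator, TextIO


def iter_papers(lines: TextIO, category: str) -> Iterator[dict]:
    """Yield records whose `categories` field contains `category`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: `categories` is one space-separated string per record.
        if category in record.get("categories", "").split():
            yield record


# Hypothetical sample records that reuse the field names listed above.
sample = io.StringIO(
    '{"id": "0704.0001", "title": "Example A", "categories": "hep-ph"}\n'
    '{"id": "0704.0002", "title": "Example B", "categories": "math.CO cs.CG"}\n'
)
matches = list(iter_papers(sample, "cs.CG"))
print([paper["id"] for paper in matches])
```

Reading line by line keeps memory use flat, which matters if the real file is large; a real run would pass an open file handle instead of the in-memory sample.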

    You can access each paper directly on ArXiv using these links:

    - https://arxiv.org/abs/{id}: Page for this paper including its abstract and further links
    - https://arxiv.org/pdf/{id}: Direct link to download the PDF
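The two link patterns can be filled in from a record's `id` field. This is a small sketch; the helper name `paper_links` and the example ID are illustrative only.

```python
ABS_TEMPLATE = "https://arxiv.org/abs/{id}"
PDF_TEMPLATE = "https://arxiv.org/pdf/{id}"


def paper_links(arxiv_id: str) -> dict:
    """Build the abstract-page and PDF links for one ArXiv ID."""
    return {
        "abs": ABS_TEMPLATE.format(id=arxiv_id),
        "pdf": PDF_TEMPLATE.format(id=arxiv_id),
    }


# Example ID used purely for illustration.
links = paper_links("0704.0001")
print(links["abs"])
print(links["pdf"])
```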

    Bulk access

    The full set of PDFs is available for free in the GCS bucket gs://arxiv-dataset or through Google API (json documentation and xml documentation).

    You can use, for example, gsutil to download the data to your local machine.

    ```
    # List files:
    gsutil ls gs://arxiv-dataset/arxiv/

    # Download PDFs from March 2020:
    gsutil cp -r gs://arxiv-dataset/arxiv/arxiv/pdf/2003/ ./a_local_directory/

    # Download all the source files:
    gsutil cp -r gs://arxiv-dataset/arxiv/ ./a_local_directory/
    ```
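The March 2020 example uses the directory `2003`, which suggests a two-digit year followed by a two-digit month. The sketch below builds a download command for any month under that assumption; the naming scheme is inferred from the single example above, and the helper names are hypothetical. The command is only assembled, not executed.

```python
from typing import List


def pdf_prefix(year: int, month: int) -> str:
    """Return the bucket prefix for PDFs from one month, e.g. 2003 for March 2020."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # Assumption: directories are named <two-digit year><two-digit month>.
    return f"gs://arxiv-dataset/arxiv/arxiv/pdf/{year % 100:02d}{month:02d}/"


def download_command(year: int, month: int, destination: str) -> List[str]:
    """Assemble (but do not run) a recursive gsutil copy for one month."""
    return ["gsutil", "cp", "-r", pdf_prefix(year, month), destination]


command = download_command(2020, 3, "./a_local_directory/")
print(" ".join(command))
```

Returning the command as a list makes it straightforward to hand to `subprocess.run` later without shell quoting concerns.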

    Update Frequency

    We're automatically updating the metadata as well as the GCS bucket on a weekly basis.

    License

    Creative Commons CC0 1.0 Universal Public Domain Dedication applies to the metadata in this dataset. See https://arxiv.org/help/license for further details and licensing on individual papers.

    Acknowledgements

    The original data is maintained by ArXiv; huge thanks to the team for building and maintaining this dataset.

    We're using https://github.com/mattbierbaum/arxiv-public-datasets to pull the original data; thanks to Matt Bierbaum for providing this tool.

  2. ‘COVID-19's Impact on Educational Stress’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 4, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘COVID-19's Impact on Educational Stress’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-covid-19-s-impact-on-educational-stress-49b5/4f12e21a/?iid=019-206&v=presentation
    Dataset updated
    Aug 4, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘COVID-19's Impact on Educational Stress’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/bsoyka3/educational-stress-due-to-the-coronavirus-pandemic on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Made by Statistry

    The survey collecting this information is still open for responses here.

    Context

    I just made this public survey because I want someone to be able to do something fun or insightful with the data that's been gathered. You can fill it out too!

    Content

    Each row represents a response to the survey. A few things have been done to sanitize the raw responses:

    - Column names and options have been renamed to make them easier to work with without much loss of meaning.
    - Responses from non-students have been removed.
    - Responses with ages greater than or equal to 22 have been removed.
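The two row filters described above can be sketched as follows. The keys `is_student` and `age` are hypothetical stand-ins, since the actual column names are not given here, and the sample responses are made up for illustration.

```python
from typing import Dict, List


def sanitize(responses: List[Dict]) -> List[Dict]:
    """Keep only student responses with an age below 22."""
    return [
        response
        for response in responses
        # Hypothetical keys; the real dataset's column names may differ.
        if response["is_student"] and response["age"] < 22
    ]


# Made-up sample responses.
raw = [
    {"is_student": True, "age": 17},
    {"is_student": False, "age": 19},  # dropped: not a student
    {"is_student": True, "age": 22},   # dropped: age is 22 or more
]
kept = sanitize(raw)
print(len(kept))
```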

    Take a look at the column description for each column to see what exactly it represents.

    Acknowledgements

    This dataset wouldn't exist without the help of others. I'd like to thank the following people for their contributions:

    - Every student who responded to the survey with valid responses
    - @radcliff on GitHub for providing the list of countries and abbreviations used in the survey and dataset
    - Giovanna de Vincenzo for providing the list of US states used in the survey and dataset
    - Simon Migaj for providing the image used for the survey and this dataset

    --- Original source retains full ownership of the source dataset ---

  3. Forestry Planting Spaces

    • data.cityofnewyork.us
    • catalog.data.gov
    • +2 more
    Updated Mar 5, 2025
    Cite
    Department of Parks and Recreation (DPR) (2025). Forestry Planting Spaces [Dataset]. https://data.cityofnewyork.us/Environment/Forestry-Planting-Spaces/82zj-84is
    Available download formats: csv, xml, application/rdfxml, application/rssxml, tsv, kml, application/geo+json, kmz
    Dataset updated
    Mar 5, 2025
    Dataset authored and provided by
    Department of Parks and Recreation (DPR)
    Description

    Record of Forestry planting spaces for NYC Parks & Recreation.

    Tree Points and Planting Spaces form the basis of ForMS 2.0’s data inventory and are the core entities with which all Service Requests, Inspections, and Work Orders are associated. The system has built-in rules to ensure that every Tree Point has a Planting Space and that each Planting Space has no more than one active Tree Point at a given time. Locations where one tree was removed and another replanted appear in ForMS 2.0 as a single Planting Space associated with one retired Tree Point (which has a removal Work Order) and one active Tree Point.
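The rule that a Planting Space has at most one active Tree Point can be checked with a short sketch like the one below. The `TreePoint` class, its field names, and the sample records are hypothetical and do not reflect the actual ForMS 2.0 schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TreePoint:
    # Hypothetical fields; the real schema is in the data dictionary.
    tree_point_id: int
    planting_space_id: int
    active: bool


def spaces_violating_rule(tree_points: Iterable[TreePoint]) -> List[int]:
    """Return Planting Space IDs that have more than one active Tree Point."""
    active_counts = Counter(
        point.planting_space_id for point in tree_points if point.active
    )
    return sorted(space for space, count in active_counts.items() if count > 1)


# Space 10: one retired and one active Tree Point (a replanted location).
# Space 20: two active Tree Points, which breaks the rule.
points = [
    TreePoint(1, 10, active=False),
    TreePoint(2, 10, active=True),
    TreePoint(3, 20, active=True),
    TreePoint(4, 20, active=True),
]
violations = spaces_violating_rule(points)
print(violations)
```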

    User guide: https://docs.google.com/document/d/1PVPWFi-WExkG3rvnagQDoBbqfsGzxCKNmR6n678nUeU/edit?usp=sharing

    Data dictionary: https://docs.google.com/spreadsheets/d/1yMfZgcsrvx9M0b3-ZdEQ3WCk2dFxgitCWytTrJSwEAs/edit?usp=sharing

  4. 18 excel spreadsheets by species and year giving reproduction and growth...

    • catalog.data.gov
    • data.wu.ac.at
    Updated Aug 17, 2024
    Cite
    U.S. EPA Office of Research and Development (ORD) (2024). 18 excel spreadsheets by species and year giving reproduction and growth data. One excel spreadsheet of herbicide treatment chemistry. [Dataset]. https://catalog.data.gov/dataset/18-excel-spreadsheets-by-species-and-year-giving-reproduction-and-growth-data-one-excel-sp
    Dataset updated
    Aug 17, 2024
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Excel spreadsheets by species (the 4-letter code is the abbreviation for the genus and species used in the study, the year 2010 or 2011 is the year the data were collected, SH indicates data for Science Hub, and the date is the date of file preparation). The data in a file are described in a read me file, which is the first worksheet in each file. Each row in a species spreadsheet is for one plot (plant). The data themselves are in the data worksheet. One file includes a read me description of the columns in the data set for chemical analysis. In this file, one row is an herbicide treatment and sample for chemical analysis (if taken). This dataset is associated with the following publication: Olszyk, D., T. Pfleeger, T. Shiroyama, M. Blakely-Smith, E. Lee, and M. Plocher. Plant reproduction is altered by simulated herbicide drift to constructed plant communities. ENVIRONMENTAL TOXICOLOGY AND CHEMISTRY. Society of Environmental Toxicology and Chemistry, Pensacola, FL, USA, 36(10): 2799-2813, (2017).

  5. Future Asteroids

    • kaggle.com
    Updated Dec 7, 2022
    Cite
    The Devastator (2022). Future Asteroids [Dataset]. https://www.kaggle.com/datasets/thedevastator/investigate-near-earth-asteroids-track-close-app
    Dataset updated
    Dec 7, 2022
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Future Asteroids

    All known future asteroids poised to pass near Earth, some dangerous (PHO)

    By Mark Di Marco [source]

    About this dataset

    This dataset covers Near Earth Asteroids (NEAs) that are either on their way to us or have recently come by. The data offers insight into how often asteroids fly by our planet and how close they can get. It includes their known names, the dates and times of their close approaches, distances from Earth in both astronomical units and Lunar Distances, velocities relative to us and the sun, and other essential properties of these objects.
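Since distances are given in both astronomical units and Lunar Distances, a short sketch of the conversion may help when comparing the two. The key `distance_au`, the sample rows, and the helper names are hypothetical, and the Earth-Moon distance used here is an approximate mean value rather than a figure taken from the dataset.

```python
from typing import Dict, List

AU_KM = 149_597_870.7  # one astronomical unit in kilometres
LD_KM = 384_400.0      # approximate mean Earth-Moon distance (assumption)


def au_to_ld(distance_au: float) -> float:
    """Convert a distance from astronomical units to Lunar Distances."""
    return distance_au * AU_KM / LD_KM


def closest_approaches(rows: List[Dict], limit_ld: float) -> List[Dict]:
    """Return rows within `limit_ld` Lunar Distances, nearest first."""
    near = [row for row in rows if au_to_ld(row["distance_au"]) <= limit_ld]
    return sorted(near, key=lambda row: row["distance_au"])


# Made-up rows; real column names should be taken from the CSV header.
rows = [
    {"name": "object-a", "distance_au": 0.05},
    {"name": "object-b", "distance_au": 0.002},
    {"name": "object-c", "distance_au": 0.01},
]
nearby = closest_approaches(rows, limit_ld=5.0)
print([row["name"] for row in nearby])
```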

    How to use the dataset

    Introduction

    This dataset provides information about Near Earth Asteroids that will make a close approach to Earth within the next 12 months, or have made a close approach within the last 12 months. The columns of data include characteristics such as distance from Earth and relative velocity, among others. To gain more insight into Near Earth Asteroids, follow the steps below:

    Download the Dataset

    Download the "Investigate Near-Earth Asteroid – Track Close Approaches to Earth!" dataset from Kaggle. The download contains two CSV files: future.csv and all.csv. The first file (future) covers asteroids making a close approach in the next 12 months and ones that have made one in the last 12 months, while all covers asteroids making close approaches at later times (more than twelve months away).

    Analyze, Interpret & Visualize

    Once you have downloaded the data files, open them in Microsoft Excel or Google Sheets to begin analyzing them. Use each program's sorting tools to go through each column, noting classifications, minimum distances and so on, and look for correlations or conclusions about these objects as they relate to our current space environment. After exploring the patterns in the data, move on to visualization: programs such as Tableau or Looker can help create interactive charts and graphs based on attributes like distance and composition classification across both CSV files. Use these charts to present what you found in a form better suited to a broad audience than raw tables.

    Find Trends & Patterns

    The Future spreadsheet lists all known asteroids by Distance Nominal(LD), Composition Classifications (GK), minimum (relative) speed VRelative(km/s) through space, size Vinfinity(km/s), and standard deviation N Sigma of the orbital path relative to Earth. Together with the visualizations created earlier, these metrics allow comparisons that almost anyone can understand regardless of background knowledge, emphasizing the significance each metric holds when assessing the risk posed to society at a given moment and the current yearly trends in the datasheets analyzed beforehand, helping...

    Research Ideas

    • Use the data to build an accurate 3D-printed model of a NEA at different scales, depending on the size and shape it describes
    • Build a computer simulation which simulates close approaches of NEAs and the risk they pose to Earth
    • Develop an interactive map which displays current positions of NEAs and radar detection for confirmed threats

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) - You are free to: - Share - copy and redistribute the material in any mediu...

  6. Covid19arData COVID-19 Argentina data

    • kaggle.com
    Updated Jul 2, 2025
    Cite
    Vladimiro Bellini (2025). Covid19arData COVID-19 Argentina data [Dataset]. https://www.kaggle.com/vladimirobellini/covid19ardata/discussion
    Dataset updated
    Jul 2, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vladimiro Bellini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Argentina
    Description

    Exported spreadsheet

    To access the dynamic spreadsheet: https://docs.google.com/spreadsheets/d/16-bnsDdmmgtSxdWbVMboIHo5FRuz76DBxsz_BbsEVWA/edit?usp=sharing

    Context

    Repository created by Sistemas Mapache to provide open data from the official information in the daily reports on the COVID-19 situation in Argentina.

    Data with finer territorial breakdown from provincial sources is also included.

    The historical data comes from official sources and is not mixed with unofficial sources.

    Data Dictionary

    | Column Name | Description |
    | --- | --- |
    | fecha | Date the data corresponds to |
    | dia_inicio | Number of days since case 1 |
    | dia_cuarentena_dnu260 | Number of days since the quarantine under DNU 260 |
    | osm_admin_level_2 | Administrative name in OpenStreetMap, country level |
    | osm_admin_level_4 | Administrative name in OpenStreetMap, province level |
    | osm_admin_level_8 | Administrative name in OpenStreetMap, city level |
    | tot_casosconf | Total confirmed infection cases. Column that accumulates, row by row, the total of confirmed cases |
    | nue_casosconf_diff | New infection cases for the day |
    | tot_fallecidos | Total deaths. Column that accumulates, row by row, the total of deaths |
    | nue_fallecidos_diff | New deaths for the day |
    | tot_recuperados | Cumulative total of recovered cases |
    | tot_test_negativos | Cumulative total of negative tests |
    | tot_test | Cumulative total of tests |
    | transmision_tipo | Type of transmission as of the date |
    | informe_link | URL of the report the data comes from |
    | transmision_tipo | Region shapefile in WKT |
    | observacion | Notes related to the data or differences between reports |
    | covid19argentina_admin_level_4 | Province format required by covid19argentina.com |
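The dictionary describes `tot_casosconf` as a running total and `nue_casosconf_diff` as the new cases for the day. The sketch below derives daily values from a cumulative series, assuming the daily column is simply the difference between consecutive totals and that the total before the first row is zero. Neither assumption is confirmed by the dictionary, and the sample numbers are made up.

```python
from typing import List


def daily_from_cumulative(totals: List[int]) -> List[int]:
    """Turn a running total into per-day increments."""
    increments = []
    previous = 0  # assumption: nothing was counted before the first row
    for total in totals:
        increments.append(total - previous)
        previous = total
    return increments


# Made-up cumulative totals for four consecutive days in one region.
tot_casosconf = [1, 3, 6, 6]
nue_casosconf_diff = daily_from_cumulative(tot_casosconf)
print(nue_casosconf_diff)
```

When working with the real file, the series would need to be grouped by region first, since the totals accumulate separately for each administrative area.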

Data from: arXiv Dataset

arXiv dataset and metadata of 1.7M+ scholarly papers across STEM

82 scholarly articles cite this dataset (View in Google Scholar)