8 datasets found
  1. Tamil (Tamizh) Wikipedia Text Dataset for NLP
     Building a High-Resource Future for Tamil in NLP: Collaborative Efforts for Data

    • kaggle.com
    Updated Nov 12, 2024
    Cite
    Younus_Mohamed (2024). Tamil (Tamizh) Wikipedia Text Dataset for NLP [Dataset]. http://doi.org/10.34740/kaggle/dsv/9884525
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 12, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Younus_Mohamed
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset is part of a larger mission to transform Tamil into a high-resource language in the field of Natural Language Processing (NLP). As one of the oldest and most culturally rich languages, Tamil has a unique linguistic structure, yet it remains underrepresented in the NLP landscape. This dataset, extracted from Tamil Wikipedia, serves as a foundational resource to support Tamil language processing, text mining, and machine learning applications.

    What’s Included

    - Text Data: This dataset contains over 569,000 articles in raw text form, extracted from Tamil Wikipedia. The collection is ideal for language model training, word frequency analysis, and text mining.

    - Scripts and Processing Tools: Code snippets are provided for processing .bz2 compressed files, generating word counts, and handling data for NLP applications.
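
    As a rough illustration of that kind of processing (the file name tawiki_text.bz2 is a placeholder, not necessarily the dataset's actual file), a minimal Python sketch for streaming a .bz2 file and counting words:

    ```python
    import bz2
    from collections import Counter

    # Stream the compressed text line by line so the full corpus never
    # has to fit in memory; "tawiki_text.bz2" is a placeholder name.
    counts = Counter()
    with bz2.open("tawiki_text.bz2", mode="rt", encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())

    # Show the 20 most frequent tokens.
    for word, n in counts.most_common(20):
        print(word, n)
    ```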

    Why This Dataset?

    Although Tamil has a documented lexicon of over 100,000 words, only a fraction of these are actively used in everyday communication. The largest available Tamil treebank currently holds only 10,000 words, limiting the scope for training accurate language models. This dataset aims to bridge that gap by providing a robust, open-source corpus for researchers, developers, and linguists working on Tamil language technologies.

    How You Can Use This Dataset

    - Language Modeling: Train or fine-tune models like BERT, GPT, or LSTM-based language models for Tamil.

    - Linguistic Research: Analyze Tamil morphology, syntax, and vocabulary usage.

    - Data Augmentation: Use the raw text to generate augmented data for multilingual NLP applications.

    - Word Embeddings and Semantic Analysis: Create embeddings for Tamil words, useful in multilingual setups or standalone applications.

    Let’s Collaborate!

    I believe that advancing Tamil in NLP cannot be a solo effort. Contributions in the form of additional data, annotations, or even new tools for Tamil language processing are welcome! By working together, we can make Tamil a truly high-resource language in NLP.

    License

    This dataset is based on content from Tamil Wikipedia and is shared under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0). Proper attribution to Wikipedia is required when using this data.

  2. Replication Data for: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Narayan, Sneha; Orlowitz, Jake; Morgan, Jonathan T.; Shaw, Aaron D.; Hill, Benjamin Mako (2023). Replication Data for: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users [Dataset]. http://doi.org/10.7910/DVN/6HPRIG
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Narayan, Sneha; Orlowitz, Jake; Morgan, Jonathan T.; Shaw, Aaron D.; Hill, Benjamin Mako
    Description

    This dataset contains the data and code necessary to replicate work in the following paper:

    Narayan, Sneha, Jake Orlowitz, Jonathan Morgan, Benjamin Mako Hill, and Aaron Shaw. 2017. “The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users.” In Proceedings of the 20th ACM Conference on Computer-Supported Cooperative Work & Social Computing (CSCW '17). New York, New York: ACM Press. http://dx.doi.org/10.1145/2998181.2998307

    The published paper contains two studies. Study 1 is a descriptive analysis of a survey of Wikipedia editors who played a gamified tutorial. Study 2 is a field experiment that evaluated the same tutorial. These data are the data used in the field experiment described in Study 2.

    Description of Files

    This dataset contains the following files beyond this README:

    - twa.RData — An RData file that includes all variables used in Study 2.
    - twa_analysis.R — A GNU R script that includes all the code used to generate the tables and plots related to Study 2 in the paper.

    The RData file contains one variable (d), an R dataframe (i.e., table) with the following columns:

    - userid (integer): The unique numerical ID representing each user in our sample. These are 8-digit integers and describe public accounts on Wikipedia.
    - sample.date (date string): The day the user was recruited to the study, in “YYYY-MM-DD” format. For invitees, it is the date their invitation was sent. For users in the control group, this is the date on which they would have been invited to the study.
    - edits.all (integer): The total number of edits made by the user on Wikipedia in the 180 days after they joined the study. Edits to a user's user page, user talk page, and subpages are ignored.
    - edits.ns0 (integer): The total number of edits made by the user to article pages on Wikipedia in the 180 days after they joined the study.
    - edits.talk (integer): The total number of edits made by the user to talk pages on Wikipedia in the 180 days after they joined the study. Edits to a user's user page, user talk page, and subpages are ignored.
    - treat (logical): TRUE if the user was invited, FALSE if the user was in the control group.
    - play (logical): TRUE if the user played the game, FALSE if not. All users in control are listed as FALSE, because any user who had not been invited to the game but played was removed.
    - twa.level (integer): Takes a value of 0 if the user has not played the game; ranges from 1 to 7 for those who did, indicating the highest level they reached in the game.
    - quality.score (float): The average word persistence (over a 6-revision window) over all edits made by this userid. Our measure of word persistence (persistent word revision per word) is a measure of edit quality developed by Halfaker et al. that tracks how long the words in an edit persist after subsequent revisions are made to the wiki page. For more information on how word persistence is calculated, see the following paper: Halfaker, Aaron, Aniket Kittur, Robert Kraut, and John Riedl. 2009. “A Jury of Your Peers: Quality, Experience and Ownership in Wikipedia.” In Proceedings of the 5th International Symposium on Wikis and Open Collaboration (WikiSym '09), 1–10. New York, New York: ACM Press. doi:10.1145/1641309.1641332. Or this page: https://meta.wikimedia.org/wiki/Research:Content_persistence

    How we created twa.RData

    The file twa.RData combines datasets drawn from three places:

    - A dataset created by Wikimedia Foundation staff that tracked the details of the experiment and how far people got in the game. The variables userid, sample.date, treat, play, and twa.level were all generated in a dataset created by WMF staff when The Wikipedia Adventure was deployed. All users in the sample created their accounts within 2 days before the date they were entered into the study. None of them had received a Teahouse invitation, a Level 4 user warning, or been blocked from editing at the time they entered the study. Additionally, all users made at least one edit after the day they were invited. Users were sorted randomly into treatment and control groups, based on which they either received or did not receive an invite to play The Wikipedia Adventure.
    - Edit and text persistence data drawn from public XML dumps created on May 21st, 2015. We used publicly available XML dumps to generate the outcome variables, namely edits.all, edits.ns0, edits.talk, and quality.score. We first extracted all edits made by users in our sample during the six-month period after they joined the study, excluding edits made to user pages or user talk pages. We parsed the XML dumps using the Python-based wikiq and MediaWikiUtilities software, online at: http://projects.mako.cc/source/?p=mediawiki_dump_tools and https://github.com/mediawiki-utilities/python-mediawiki-utilities
    - We o...

    Visit https://dataone.org/datasets/sha256%3Ab1240bda398e8fa311ac15dbcc04880333d5f3fbe67a7a951786da2d44e33018 for complete metadata about this dataset.
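
    For analysts working in Python rather than GNU R, a hedged sketch of loading the replication dataframe with the third-party pyreadr package (assuming the RData object is named d, as documented above):

    ```python
    import pyreadr  # third-party: pip install pyreadr

    # read_r returns a dict-like mapping of R object names to pandas
    # DataFrames; the RData file is documented above as holding one
    # object, d.
    result = pyreadr.read_r("twa.RData")
    d = result["d"]

    # Example: mean article-namespace edits for invited vs. control users.
    print(d.groupby("treat")["edits.ns0"].mean())
    ```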

  3. FiveThirtyEight Daily Show Guests Dataset

    • kaggle.com
    zip
    Updated Jan 13, 2019
    + more versions
    Cite
    FiveThirtyEight (2019). FiveThirtyEight Daily Show Guests Dataset [Dataset]. https://www.kaggle.com/fivethirtyeight/fivethirtyeight-daily-show-guests-dataset
    Explore at:
    zip (37,571 bytes)
    Dataset updated
    Jan 13, 2019
    Dataset authored and provided by
    FiveThirtyEight (https://abcnews.go.com/538)
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Content

    Daily Show Guests

    This folder contains data behind the story Every Guest Jon Stewart Ever Had On ‘The Daily Show’.

    Header definitions:
    - YEAR: The year the episode aired.
    - GoogleKnowlege_Occupation: The guest's occupation or office, according to Google's Knowledge Graph or, if they're not in there, how Stewart introduced them on the program.
    - Show: Air date of the episode. Not unique, as some shows had more than one guest.
    - Group: A larger group designation for the occupation. For instance, US senators, US presidents, and former presidents are all under "politicians".
    - Raw_Guest_List: The person or list of people who appeared on the show, according to Wikipedia. The GoogleKnowlege_Occupation only refers to one of them in a given row.

    Source: Google Knowledge Graph, The Daily Show clip library, Wikipedia.
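
    A short pandas sketch of working with these columns (the file name daily_show_guests.csv follows the FiveThirtyEight GitHub repository, but verify it against the actual download):

    ```python
    import pandas as pd

    # File name assumed from FiveThirtyEight's GitHub repository.
    guests = pd.read_csv("daily_show_guests.csv")

    # Tally appearances per occupation group for each year.
    by_group = guests.groupby(["YEAR", "Group"]).size().unstack(fill_value=0)
    print(by_group.head())
    ```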

    Context

    This is a dataset from FiveThirtyEight hosted on their GitHub. Explore FiveThirtyEight data using Kaggle and all of the data sources available through the FiveThirtyEight organization page!

    • Update Frequency: This dataset is updated daily.

    Acknowledgements

    This dataset is maintained using GitHub's API and Kaggle's API.

    This dataset is distributed under the Attribution 4.0 International (CC BY 4.0) license.

    Cover photo by Oscar Nord on Unsplash
    Unsplash Images are distributed under a unique Unsplash License.

  4. ‘List of Top Data Breaches (2004 - 2021)’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 14, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘List of Top Data Breaches (2004 - 2021)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-list-of-top-data-breaches-2004-2021-e7ac/746cf4e2/?iid=002-608&v=presentation
    Explore at:
    Dataset updated
    Feb 14, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘List of Top Data Breaches (2004 - 2021)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/hishaamarmghan/list-of-top-data-breaches-2004-2021 on 14 February 2022.

    --- Dataset description provided by original source is as follows ---

    This is a dataset containing all the major data breaches in the world from 2004 to 2021.

    Data privacy is a major ongoing issue. Many major companies around the world face this problem every single day, and even those with strong security teams still suffer breaches. To tackle this situation, we must study the issue in depth, so I pulled this data from Wikipedia to conduct data analysis. I encourage others to take a look as well and find as many insights as possible.

    This data contains 5 columns:
    1. Entity: The name of the company, organization, or institute.
    2. Year: The year in which the data breach took place.
    3. Records: How many records were compromised (can include information like emails, passwords, etc.).
    4. Organization type: The sector the organization belongs to.
    5. Method: Was it hacked? Were the files lost? Was it an inside job?

    Here is the source for the dataset: https://en.wikipedia.org/wiki/List_of_data_breaches

    Here is the GitHub link for a guide on how it was scraped: https://github.com/hishaamarmghan/Data-Breaches-Scraping-Cleaning (a minimal sketch of the same approach appears after this description)

    --- Original source retains full ownership of the source dataset ---
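
    A minimal sketch of the scraping approach described in the guide above, using pandas.read_html (the table's position and columns on the live Wikipedia page may differ from the dataset's snapshot; lxml or html5lib must be installed):

    ```python
    import pandas as pd

    # read_html parses every <table> on the page into a DataFrame.
    url = "https://en.wikipedia.org/wiki/List_of_data_breaches"
    tables = pd.read_html(url)

    # Assumption: the first table holds the breach list; inspect to confirm.
    breaches = tables[0]
    print(breaches.columns.tolist())
    print(len(breaches), "rows")
    ```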

  5. The Global Soundscapes Project: overview of datasets and meta-data

    • explore.openaire.eu
    Updated May 11, 2022
    Cite
    Kevin F.A. Darras; Steven Van Wilgenburg; Rodney Rountree; Yuhang Song; Youfang Chen; Thomas Cherico Wanger (2022). The Global Soundscapes Project: overview of datasets and meta-data [Dataset]. http://doi.org/10.5281/zenodo.6537739
    Explore at:
    Dataset updated
    May 11, 2022
    Authors
    Kevin F.A. Darras; Steven Van Wilgenburg; Rodney Rountree; Yuhang Song; Youfang Chen; Thomas Cherico Wanger
    Description

    This is an overview of the soundscape recording datasets that have been contributed to the Global Soundscapes Project, as well as associated meta-data. The audio recording criteria justifying inclusion into the current meta-dataset are:

    - Stationary (no towed sensors or microphones mounted on cars)
    - Passive (no human disturbance by the recordist)
    - Ambient (no focus on a particular species or direction)
    - Recorded over multiple sites of a region and/or days

    The individual columns are described as follows.

    General:
    - ID: primary key
    - name: name of the dataset
    - subset: incremental integer that can be used to distinguish sub-datasets
    - collaborators: full names of the people deemed responsible for the dataset, separated by commas
    - date_added: when the dataset was added

    Space:
    - realm_IUCN: realm from the IUCN Global Ecosystem Typology (v2.0) (https://global-ecosystems.org/)
    - medium: the physical medium the microphone is situated in
    - GADM0: for terrestrial locations, Database of Global Administrative Areas level 0 unit as per https://gadm.org/
    - GADM1: for terrestrial locations, Database of Global Administrative Areas level 1 unit as per https://gadm.org/
    - GADM2: for terrestrial locations, Database of Global Administrative Areas level 2 unit as per https://gadm.org/
    - IHO: International Hydrographic Organisation sea area as per https://iho.int/
    - latitude_numeric_region: study region approximate centroid latitude in WGS84 decimal degrees
    - longitude_numeric_region: study region approximate centroid longitude in WGS84 decimal degrees
    - topography_min_m: minimum elevation of sites from sea level
    - topography_max_m: maximum elevation of sites from sea level
    - ground_distance_m: vertical distance of the microphone from the land ground or ocean floor
    - freshwater_depth_m: vertical distance from the water surface for freshwater datasets
    - sites_number: number of sites sampled

    Time:
    - days_number_per_site: typical number of days sampled per site (or the minimum if too variable)
    - day: whether the sites were sampled during daytime
    - night: whether the sites were sampled during nighttime
    - twilight: whether the sites were sampled during twilight
    - warm_season: whether the warm season was sampled; only outside the tropics (https://en.wikipedia.org/wiki/K%C3%B6ppen_climate_classification)
    - cold_season: whether the cold season was sampled; only outside the tropics
    - dry_season: whether the dry season was sampled; only for the tropics
    - wet_season: whether the wet season was sampled; only for the tropics
    - year_start: starting year of the sampling
    - year_end: ending year of the sampling
    - schedule: description of the sampling schedule, free text
    - recording_selection: criteria used to temporally select recordings (e.g., discarded rainy days)

    Audio:
    - high_pass_filter_Hz: lower frequency of the high-pass filter
    - sampling_frequency_kHz: frequency the microphone was sampled at
    - audio_bit_depth: bit depth used for encoding audio
    - recorder_model: recorder model used
    - microphone: microphone used
    - recordist_position: position of the recordist relative to the microphone during sampling

    Others:
    - comments: free-text field
    - URL_project: internet link for further information
    - URL_publication: internet link to the corresponding publication

    More information on the project can be found here: https://ecosound-web.uni-goettingen.de/ecosound_web/project/gsp
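
    To make the column layout concrete, a small pandas sketch for filtering the meta-dataset (the CSV file name is a placeholder for whatever the Zenodo record ships, and the boolean coding of the day/night columns is an assumption):

    ```python
    import pandas as pd

    # Placeholder name; download the overview table from the Zenodo
    # record (doi:10.5281/zenodo.6537739) and adjust accordingly.
    meta = pd.read_csv("gsp_datasets_overview.csv")

    # Example: datasets with night-time sampling, assuming the night
    # column is stored as a boolean.
    nocturnal = meta[meta["night"] == True]
    print(nocturnal[["name", "sites_number", "year_start", "year_end"]])
    ```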

  6. Group SNAP Dataset

    • paperswithcode.com
    Updated Jul 21, 2018
    Cite
    (2018). Group SNAP Dataset [Dataset]. https://paperswithcode.com/dataset/group-snap-snap-suitesparse-matrix-collection
    Explore at:
    Dataset updated
    Jul 21, 2018
    Description

    Networks from the SNAP (Stanford Network Analysis Platform) Network Data Sets, Jure Leskovec, http://snap.stanford.edu/data/index.html (email: jure at cs.stanford.edu).

    Citation for the SNAP collection:

    @misc{snapnets,
      author       = {Jure Leskovec and Andrej Krevl},
      title        = {{SNAP Datasets}: {Stanford} Large Network Dataset Collection},
      howpublished = {\url{http://snap.stanford.edu/data}},
      month        = jun,
      year         = 2014
    }

    The following matrices/graphs were added to the collection in June 2010 by Tim Davis (problem id and name):

    2284 SNAP/soc-Epinions1: who-trusts-whom network of Epinions.com
    2285 SNAP/soc-LiveJournal1: LiveJournal social network
    2286 SNAP/soc-Slashdot0811: Slashdot social network, Nov 2008
    2287 SNAP/soc-Slashdot0902: Slashdot social network, Feb 2009
    2288 SNAP/wiki-Vote: Wikipedia who-votes-on-whom network
    2289 SNAP/email-EuAll: Email network from a EU research institution
    2290 SNAP/email-Enron: Email communication network from Enron
    2291 SNAP/wiki-Talk: Wikipedia talk (communication) network
    2292 SNAP/cit-HepPh: Arxiv High Energy Physics paper citation network
    2293 SNAP/cit-HepTh: Arxiv High Energy Physics paper citation network
    2294 SNAP/cit-Patents: Citation network among US Patents
    2295 SNAP/ca-AstroPh: Collaboration network of Arxiv Astro Physics
    2296 SNAP/ca-CondMat: Collaboration network of Arxiv Condensed Matter
    2297 SNAP/ca-GrQc: Collaboration network of Arxiv General Relativity
    2298 SNAP/ca-HepPh: Collaboration network of Arxiv High Energy Physics
    2299 SNAP/ca-HepTh: Collaboration network of Arxiv High Energy Physics Theory
    2300 SNAP/web-BerkStan: Web graph of Berkeley and Stanford
    2301 SNAP/web-Google: Web graph from Google
    2302 SNAP/web-NotreDame: Web graph of Notre Dame
    2303 SNAP/web-Stanford: Web graph of Stanford.edu
    2304 SNAP/amazon0302: Amazon product co-purchasing network from March 2 2003
    2305 SNAP/amazon0312: Amazon product co-purchasing network from March 12 2003
    2306 SNAP/amazon0505: Amazon product co-purchasing network from May 5 2003
    2307 SNAP/amazon0601: Amazon product co-purchasing network from June 1 2003
    2308 SNAP/p2p-Gnutella04: Gnutella peer to peer network from August 4 2002
    2309 SNAP/p2p-Gnutella05: Gnutella peer to peer network from August 5 2002
    2310 SNAP/p2p-Gnutella06: Gnutella peer to peer network from August 6 2002
    2311 SNAP/p2p-Gnutella08: Gnutella peer to peer network from August 8 2002
    2312 SNAP/p2p-Gnutella09: Gnutella peer to peer network from August 9 2002
    2313 SNAP/p2p-Gnutella24: Gnutella peer to peer network from August 24 2002
    2314 SNAP/p2p-Gnutella25: Gnutella peer to peer network from August 25 2002
    2315 SNAP/p2p-Gnutella30: Gnutella peer to peer network from August 30 2002
    2316 SNAP/p2p-Gnutella31: Gnutella peer to peer network from August 31 2002
    2317 SNAP/roadNet-CA: Road network of California
    2318 SNAP/roadNet-PA: Road network of Pennsylvania
    2319 SNAP/roadNet-TX: Road network of Texas
    2320 SNAP/as-735: 733 daily instances (graphs) from November 8 1997 to January 2 2000
    2321 SNAP/as-Skitter: Internet topology graph, from traceroutes run daily in 2005
    2322 SNAP/as-caida: The CAIDA AS Relationships Datasets, from January 2004 to November 2007
    2323 SNAP/Oregon-1: AS peering information inferred from Oregon route-views between March 31 and May 26 2001
    2324 SNAP/Oregon-2: AS peering information inferred from Oregon route-views between March 31 and May 26 2001
    2325 SNAP/soc-sign-epinions: Epinions signed social network
    2326 SNAP/soc-sign-Slashdot081106: Slashdot Zoo signed social network from November 6 2008
    2327 SNAP/soc-sign-Slashdot090216: Slashdot Zoo signed social network from February 16 2009
    2328 SNAP/soc-sign-Slashdot090221: Slashdot Zoo signed social network from February 21 2009

    The following problems were then added in July 2018, when all data and metadata from the SNAP data set were imported into the SuiteSparse Matrix Collection.

    2777 SNAP/CollegeMsg: Messages on a Facebook-like platform at UC-Irvine
    2778 SNAP/com-Amazon: Amazon product network
    2779 SNAP/com-DBLP: DBLP collaboration network
    2780 SNAP/com-Friendster: Friendster online social network
    2781 SNAP/com-LiveJournal: LiveJournal online social network
    2782 SNAP/com-Orkut: Orkut online social network
    2783 SNAP/com-Youtube: Youtube online social network
    2784 SNAP/email-Eu-core: E-mail network
    2785 SNAP/email-Eu-core-temporal: E-mails between users at a research institution
    2786 SNAP/higgs-twitter: Twitter messages re: Higgs boson on 4th July 2012
    2787 SNAP/loc-Brightkite: Brightkite location-based online social network
    2788 SNAP/loc-Gowalla: Gowalla location-based online social network
    2789 SNAP/soc-Pokec: Pokec online social network
    2790 SNAP/soc-sign-bitcoin-alpha: Bitcoin Alpha web of trust network
    2791 SNAP/soc-sign-bitcoin-otc: Bitcoin OTC web of trust network
    2792 SNAP/sx-askubuntu: Comments, questions, and answers on Ask Ubuntu
    2793 SNAP/sx-mathoverflow: Comments, questions, and answers on Math Overflow
    2794 SNAP/sx-stackoverflow: Comments, questions, and answers on Stack Overflow
    2795 SNAP/sx-superuser: Comments, questions, and answers on Super User
    2796 SNAP/twitter7: A collection of 476 million tweets collected between June and December 2009
    2797 SNAP/wiki-RfA: Wikipedia Requests for Adminship (with text)
    2798 SNAP/wiki-talk-temporal: Users editing talk pages on Wikipedia
    2799 SNAP/wiki-topcats: Wikipedia hyperlinks (with communities)

    The following 13 graphs/networks were in the SNAP data set in July 2018 but have not yet been imported into the SuiteSparse Matrix Collection. They may be added in the future:

    amazon-meta ego-Facebook ego-Gplus ego-Twitter gemsec-Deezer gemsec-Facebook ksc-time-series memetracker9 web-flickr web-Reddit web-RedditPizzaRequests wiki-Elec wiki-meta wikispeedia

    The 2010 description of the SNAP data set gave these categories:

    • Social networks: online social networks, edges represent interactions between people

    • Communication networks: email communication networks with edges representing communication

    • Citation networks: nodes represent papers, edges represent citations

    • Collaboration networks: nodes represent scientists, edges represent collaborations (co-authoring a paper)

    • Web graphs: nodes represent webpages and edges are hyperlinks

    • Blog and Memetracker graphs: nodes represent time stamped blog posts, edges are hyperlinks [revised below]

    • Amazon networks: nodes represent products and edges link commonly co-purchased products

    • Internet networks: nodes represent computers and edges communication

    • Road networks: nodes represent intersections and edges roads connecting the intersections

    • Autonomous systems: graphs of the internet

    • Signed networks: networks with positive and negative edges (friend/foe, trust/distrust)

    By July 2018, the following categories had been added:

    • Networks with ground-truth communities: ground-truth network communities in social and information networks

    • Location-based online social networks: social networks with geographic check-ins

    • Wikipedia networks, articles, and metadata: talk, editing, voting, and article data from Wikipedia

    • Temporal networks: networks where edges have timestamps

    • Twitter and Memetracker: Memetracker phrases, links and 467 million Tweets

    • Online communities: data from online communities such as Reddit and Flickr

    • Online reviews: data from online review systems such as BeerAdvocate and Amazon

    https://sparse.tamu.edu/SNAP
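
    A minimal sketch of loading one of these graphs in Python after downloading and extracting its Matrix Market archive from https://sparse.tamu.edu/SNAP (the local path is an assumption):

    ```python
    import scipy.io

    # mmread parses the Matrix Market file into a sparse COO matrix;
    # the path assumes the SNAP/wiki-Vote archive was extracted here.
    A = scipy.io.mmread("wiki-Vote/wiki-Vote.mtx").tocsr()

    print(A.shape)         # square adjacency matrix (nodes x nodes)
    print(A.nnz, "edges")  # stored entries = directed edges
    ```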

  7. EOD data for all Dow Jones stocks

    • kaggle.com
    zip
    Updated Jun 12, 2019
    Cite
    Timo Bozsolik (2019). EOD data for all Dow Jones stocks [Dataset]. https://www.kaggle.com/datasets/timoboz/stock-data-dow-jones
    Explore at:
    zip (1,697,460 bytes)
    Dataset updated
    Jun 12, 2019
    Authors
    Timo Bozsolik
    Description

    Update

    Unfortunately, the API this dataset used to pull stock data is no longer free. Instead of keeping the auto-update, I have dropped the last version of the data files here, so at least the historic data is still usable.

    Content

    This dataset provides free end of day data for all stocks currently in the Dow Jones Industrial Average. For each of the 30 components of the index, there is one CSV file named by the stock's symbol (e.g. AAPL for Apple). Each file provides historically adjusted market-wide data (daily, max. 5 years back). See here for description of the columns: https://iextrading.com/developer/docs/#chart

    Originally, this dataset used remote URLs as files and was updated daily by the Kaggle platform so that it always reflected the latest data; per the update above, it is now a static snapshot.
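
    A quick sketch of working with the per-symbol files (the "date" and "close" column names follow the IEX chart documentation linked above and should be verified against the actual CSVs):

    ```python
    import pandas as pd

    # One CSV per Dow component, named by ticker symbol (e.g. AAPL.csv).
    aapl = pd.read_csv("AAPL.csv", parse_dates=["date"]).set_index("date")

    # Daily simple returns from the adjusted close.
    returns = aapl["close"].pct_change().dropna()
    print(returns.describe())
    ```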

    Acknowledgements

    List of stocks and symbols as per https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average

    Thanks to https://iextrading.com for providing this data for free!

    Terms of Use

    Data provided for free by IEX. View IEX’s Terms of Use.

  8. frames-benchmark

    • huggingface.co
    Cite
    frames-benchmark [Dataset]. https://huggingface.co/datasets/google/frames-benchmark
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Google (http://google.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    FRAMES: Factuality, Retrieval, And reasoning MEasurement Set

    FRAMES is a comprehensive evaluation dataset designed to test the capabilities of Retrieval-Augmented Generation (RAG) systems across factuality, retrieval accuracy, and reasoning. Our paper with details and experiments is available on arXiv: https://arxiv.org/abs/2409.12941.

      Dataset Overview
    

    824 challenging multi-hop questions requiring information from 2-15 Wikipedia articles. Questions span diverse topics… See the full description on the dataset page: https://huggingface.co/datasets/google/frames-benchmark.
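
    A minimal sketch of loading the benchmark with the Hugging Face datasets library (the split name "test" is an assumption; check the dataset page for the actual splits):

    ```python
    from datasets import load_dataset  # pip install datasets

    # Split name assumed; verify on the dataset page.
    frames = load_dataset("google/frames-benchmark", split="test")

    print(len(frames))       # expected: 824 questions
    print(frames[0].keys())  # field names of one example
    ```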

