6 datasets found
  1. MCB_languages_county

    • kaggle.com
    Updated Oct 1, 2019
    Cite
    Marisol Brewster (2019). MCB_languages_county [Dataset]. https://www.kaggle.com/mcbrewster/mcb-languages-county/code
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Marisol Brewster
    Description

    Context

    This is a dataset I found online through the Google Dataset Search portal.

    Content

    The American Community Survey (ACS) 2009-2013 multi-year data are used to list all languages spoken in the United States that were reported during the sample period. These tables provide detailed counts of many more languages than the 39 languages and language groups that are published annually as a part of the routine ACS data release. This is the second tabulation beyond 39 languages since ACS began.

    The tables include all languages that were reported in each geography during the 2009 to 2013 sampling period. For the purpose of tabulation, reported languages are classified in one of 380 possible languages or language groups. Because the data are a sample of the total population, there may be languages spoken that are not reported, either because the ACS did not sample the households where those languages are spoken, or because the person filling out the survey did not report the language or reported another language instead.

    The tables also provide information about self-reported English-speaking ability. Respondents who reported speaking a language other than English were asked to indicate their ability to speak English in one of the following categories: "Very well," "Well," "Not well," or "Not at all." The data on ability to speak English represent the person’s own perception about his or her own ability or, because ACS questionnaires are usually completed by one household member, the responses may represent the perception of another household member.
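    As a minimal sketch of how the four self-reported ability categories might be summarized, the snippet below computes the share of speakers who reported speaking English less than "Very well". The function name and the example counts are hypothetical and are not taken from the ACS tables.

```python
from typing import Mapping

# The four answer categories named in the description above.
ABILITY_LEVELS = ("Very well", "Well", "Not well", "Not at all")


def less_than_very_well_share(counts: Mapping[str, int]) -> float:
    """Return the fraction of speakers who reported any level below 'Very well'."""
    unknown = set(counts) - set(ABILITY_LEVELS)
    if unknown:
        raise ValueError(f"unexpected ability levels: {sorted(unknown)}")
    total = sum(counts.get(level, 0) for level in ABILITY_LEVELS)
    if total == 0:
        return 0.0
    limited = total - counts.get("Very well", 0)
    return limited / total


# Hypothetical counts for one language in one county (illustration only).
example_counts = {"Very well": 600, "Well": 250, "Not well": 100, "Not at all": 50}
share = less_than_very_well_share(example_counts)
print(f"{share:.1%} reported speaking English less than 'Very well'")
```

    Because the answers are self-reported, and may be given by another household member, a share like this is best read as a perception measure rather than a tested proficiency level.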

    These tables are also available through the Census Bureau's application programming interface (API). Please see the developers page for additional details on how to use the API to access these data.

    Acknowledgements

    Sources:

    Google Dataset Search: https://toolbox.google.com/datasetsearch

    2009-2013 American Community Survey

    Original dataset: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html

    Downloaded From: https://data.world/kvaughn/languages-county

    Banner and thumbnail photo by Farzad Mohsenvand on Unsplash

  2. Financial Statement Data Sets

    • kaggle.com
    Updated Jul 4, 2025
    Cite
    Vadim Vanak (2025). Financial Statement Data Sets [Dataset]. https://www.kaggle.com/datasets/vadimvanak/company-facts-2
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vadim Vanak
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset offers a detailed collection of US-GAAP financial data extracted from the financial statements of exchange-listed U.S. companies, as submitted to the U.S. Securities and Exchange Commission (SEC) via the EDGAR database. Covering filings from January 2009 onwards, this dataset provides key financial figures reported by companies in accordance with U.S. Generally Accepted Accounting Principles (GAAP).

    Dataset Features:

    • Data Scope: The dataset is restricted to figures reported under US-GAAP standards, with the exception of EntityCommonStockSharesOutstanding and EntityPublicFloat.
    • Currency and Units: The dataset exclusively includes figures reported in USD or shares, ensuring uniformity and comparability. It excludes ratios and non-financial metrics to maintain focus on financial data.
    • Company Selection: The dataset is limited to companies with U.S. exchange tickers, providing a concentrated analysis of publicly traded firms within the United States.
    • Submission Types: The dataset only incorporates data from 10-Q, 10-K, 10-Q/A, and 10-K/A filings, ensuring consistency in the type of financial reports analyzed.
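    The selection rules above can be read as a simple record filter. The sketch below is only an illustration of that reading: the `Fact` record, its field names, and the sample values are hypothetical and do not reflect the dataset's actual schema or the author's extraction code.

```python
from dataclasses import dataclass
from typing import Optional

# Form types and units named in the feature list above.
ALLOWED_FORMS = {"10-Q", "10-K", "10-Q/A", "10-K/A"}
ALLOWED_UNITS = {"USD", "shares"}


@dataclass(frozen=True)
class Fact:
    """Hypothetical record layout for one reported figure."""
    ticker: Optional[str]  # U.S. exchange ticker, or None if the filer has none
    form: str              # submission type, e.g. "10-K"
    unit: str              # unit of measure, e.g. "USD"
    tag: str               # reported concept name
    value: float


def in_scope(fact: Fact) -> bool:
    """True if the fact passes the ticker, form-type, and unit rules."""
    return (
        bool(fact.ticker)
        and fact.form in ALLOWED_FORMS
        and fact.unit in ALLOWED_UNITS
    )


# Made-up sample facts (illustration only).
facts = [
    Fact("ABC", "10-K", "USD", "Revenues", 1_000_000.0),
    Fact("ABC", "8-K", "USD", "Revenues", 1_000_000.0),       # wrong form type
    Fact("ABC", "10-Q", "pure", "DebtRatio", 0.4),            # ratio, not USD or shares
    Fact(None, "10-K", "USD", "Assets", 5_000_000.0),         # no exchange ticker
    Fact("XYZ", "10-Q/A", "shares", "EntityCommonStockSharesOutstanding", 2_000_000.0),
]
kept = [f for f in facts if in_scope(f)]
print([f.tag for f in kept])
```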

    Data Sources and Extraction:

    This dataset primarily relies on the SEC's Financial Statement Data Sets and EDGAR APIs:

    • SEC Financial Statement Data Sets
    • EDGAR Application Programming Interfaces

    In instances where specific figures were missing from these sources, data was directly extracted from the companies' financial statements to ensure completeness.

    Please note that the dataset presents financial figures exactly as reported by the companies, which may occasionally include errors. A common issue involves incorrect reporting of scaling factors in the XBRL format. XBRL supports two tag attributes related to scaling: 'decimals' and 'scale.' The 'decimals' attribute indicates the number of significant decimal places but does not affect the actual value of the figure, while the 'scale' attribute adjusts the value by a specific factor.

    However, in thousands of instances companies have incorrectly used the 'decimals' attribute (e.g., 'decimals="-6"') under the mistaken assumption that it controls scaling. It does not, so some figures may be inaccurately scaled. This dataset does not attempt to detect or correct such errors; it aims to reflect the data precisely as reported by the companies. A future version of the dataset may be introduced to address and correct these issues.
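    To make the distinction concrete, here is a small sketch of how the two attributes behave: 'scale' multiplies the displayed number by a power of ten, while 'decimals' only states how precisely the value is reported. The helper names are hypothetical, and the `looks_misscaled` check is an illustrative heuristic of my own, not something this dataset applies.

```python
from decimal import Decimal


def scaled_value(displayed: str, scale: int = 0) -> Decimal:
    """Value implied by a 'scale' attribute: displayed * 10**scale."""
    return Decimal(displayed) * (Decimal(10) ** scale)


def rounding_unit(decimals: int) -> Decimal:
    """Precision implied by 'decimals': the value is accurate to 10**(-decimals)."""
    return Decimal(10) ** (-decimals)


def looks_misscaled(value: Decimal, decimals: int) -> bool:
    """Heuristic (assumption): a nonzero value smaller than its own rounding unit
    suggests 'decimals' was used as if it were a scaling factor."""
    return value != 0 and abs(value) < rounding_unit(decimals)


# Correct usage: "12.5" shown in millions, declared with scale=6.
correct = scaled_value("12.5", scale=6)

# Suspicious usage: the filer reports 12.5 with decimals=-6 and no scaling,
# so the stored value stays 12.5 even though millions were probably intended.
suspicious = scaled_value("12.5", scale=0)

print(correct, looks_misscaled(correct, decimals=-6))
print(suspicious, looks_misscaled(suspicious, decimals=-6))
```

    A check like this can only flag candidates for review; it cannot prove that a figure was misreported.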

    The source code for data extraction is available here

  3. Spanish Stocks Historical Data from 2000 to 2019

    • kaggle.com
    Updated Jun 7, 2019
    Cite
    alvarobartt (2019). Spanish Stocks Historical Data from 2000 to 2019 [Dataset]. https://www.kaggle.com/alvarob96/spanish-stocks-historical-data/code
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    alvarobartt
    Description

    Introduction

    Since Investing.com does not have an API, I decided to develop this Python package to retrieve historical data for the companies that make up the Continuous Spanish Stock Market. I then decided to generate, via investpy, the datasets for every company so that any data scientist or data enthusiast can work with them and use them for their own research and conclusions.

    The main purpose of developing investpy, the package from which these datasets have been retrieved, was to use it as the data extraction tool for the corresponding section of my Final Degree Project at the University of Salamanca, titled "Machine Learning for stock investment recommendation systems". The package ended up being so consistent, reliable and usable that it is going to be used as the main data extraction tool by other students in their Final Degree Projects, titled "Recommender system of banking products" and "Robo-Advisor Application".

    License

    MIT License

    Additional Information

    investpy, the Python package from which the datasets were generated, is currently in a development beta version, so please open an issue if needed to report any problem the package may be causing or any dataset error. New ideas and proposals are also welcome, and will be gladly implemented in the package if they are positive and useful.

    For further information or any question feel free to contact me via email at alvarob96@usal.es

    You can also check my Medium Publication, where I upload weekly posts related to Data Science and mainly on Data Extraction techniques via Web Scraping. In this case, you can read "investpy — a Python package for historical data extraction from the Spanish stock market" where I explain the basics on investpy development and some insights on Web Scraping with Python.

    Disclaimer

    This Python package has been made for research purposes to fill a need that Investing.com does not cover, so it works like an Application Programming Interface (API) for Investing.com, developed in an altruistic way. Note that this package is not related in any way to Investing.com or any dependent company; the only requirement for developing it was to mention the source from which the data is retrieved.

  4. NVidia - Stock Data - Latest and Updated

    • kaggle.com
    Updated Feb 2, 2025
    Cite
    Kalilur Rahman (2025). NVidia - Stock Data - Latest and Updated [Dataset]. https://www.kaggle.com/datasets/kalilurrahman/nvidia-stock-data-latest-and-updated/versions/167
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kalilur Rahman
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    • Nvidia Corporation is an American multinational technology company incorporated in Delaware and based in Santa Clara, California.

    • It designs graphics processing units (GPUs) for the gaming and professional markets, as well as system on a chip units (SoCs) for the mobile computing and automotive market.

    • Its primary GPU line, labeled "GeForce", is in direct competition with the GPUs of the "Radeon" brand by Advanced Micro Devices (AMD). Nvidia expanded its presence in the gaming industry with its handheld game consoles Shield Portable, Shield Tablet, and Shield Android TV and its cloud gaming service GeForce Now.

    • Its professional line of GPUs is used in workstations for applications in such fields as architecture, engineering and construction, media and entertainment, automotive, scientific research, and manufacturing design.

    • In addition to GPU manufacturing, Nvidia provides an application programming interface (API) called CUDA that allows the creation of massively parallel programs which utilize GPUs. These GPUs are deployed in supercomputing sites around the world. More recently, it has moved into the mobile computing market, where it produces Tegra mobile processors for smartphones and tablets as well as vehicle navigation and entertainment systems. In 2020 it announced an agreement to acquire Arm, but the deal was abandoned in 2022.

    Let us analyze the performance of this solid star!

  5. [CIC-AndMal-2020] Static-Dynamic Malware analysis

    • kaggle.com
    Updated Dec 27, 2021
    Cite
    Alberto Zorzetto (2021). [CIC-AndMal-2020] Static-Dynamic Malware analysis [Dataset]. https://www.kaggle.com/datasets/albertozorzetto/cic-andmal-2020-dynamic-static-analysis
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Alberto Zorzetto
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Introduction

    This dataset contains 200K Android malware apps, each labeled and assigned to its corresponding family. Benign Android apps (200K) were collected from the AndroZoo dataset to balance the dataset. We collected 14 malware categories: adware, backdoor, file infector, no category, Potentially Unwanted Apps (PUA), ransomware, riskware, scareware, trojan, trojan-banker, trojan-dropper, trojan-sms, trojan-spy and zero-day.

    A complete taxonomy of all the malware families of the captured malware apps is created by dividing them into 8 categories: sensitive data collection, media, hardware, actions/activities, internet connection, C&C, antivirus, and storage & settings.

    Dataset details

    Category          Number of families   Number of samples
    Adware            48                   47,210
    Backdoor          11                   1,538
    File Infector     5                    669
    No Category       -                    2,296
    PUA               8                    2,051
    Ransomware        8                    6,202
    Riskware          21                   97,349
    Scareware         3                    1,556
    Trojan            45                   13,559
    Trojan-Banker     11                   887
    Trojan-Dropper    9                    2,302
    Trojan-SMS        11                   3,125
    Trojan-Spy        11                   3,540
    Zero-day          -                    13,340
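
    The sample counts listed above are far from evenly distributed, which matters for any classifier trained on them. The sketch below totals the counts and computes each category's share; it assumes the family and sample columns of the flattened table split as 48 / 47,210 for Adware, 21 / 97,349 for Riskware, and so on.

```python
# Sample counts per malware category, as read from the table above
# (the column split of the flattened table is an assumption).
SAMPLES_BY_CATEGORY = {
    "Adware": 47_210,
    "Backdoor": 1_538,
    "File Infector": 669,
    "No Category": 2_296,
    "PUA": 2_051,
    "Ransomware": 6_202,
    "Riskware": 97_349,
    "Scareware": 1_556,
    "Trojan": 13_559,
    "Trojan-Banker": 887,
    "Trojan-Dropper": 2_302,
    "Trojan-SMS": 3_125,
    "Trojan-Spy": 3_540,
    "Zero-day": 13_340,
}

total = sum(SAMPLES_BY_CATEGORY.values())
shares = {name: count / total for name, count in SAMPLES_BY_CATEGORY.items()}
largest = max(SAMPLES_BY_CATEGORY, key=SAMPLES_BY_CATEGORY.get)
smallest = min(SAMPLES_BY_CATEGORY, key=SAMPLES_BY_CATEGORY.get)

print(f"total malware samples: {total:,}")
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:<15} {share:6.1%}")
```

    Under that reading the listed counts sum to 195,624, slightly below the rounded 200K figure, and Riskware alone accounts for roughly half of the malware samples.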

    Static analysis

    AndroidManifest.xml contains a lot of features that can be used for static analysis. The main extracted features include:

    • Activities: An android activity is one screen of the android app's user interface
    • Broadcast receivers and providers
    • Metadata: It is basically an additional option to store information that can be accessed through the entire project
    • The permissions requested by application: It protects the privacy of the user and is needed to access sensitive user data (such as contacts and SMS)
    • System features (such as camera and internet)
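
    As a rough illustration of this kind of extraction, the sketch below pulls component names and permissions out of a manifest using Python's standard XML parser. It assumes the manifest has already been decoded to plain-text XML (manifests inside an APK are stored in a binary format), and the sample manifest is a hand-written fragment using names from the example below, not a file from the dataset. This is not the dataset authors' extraction tool.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"
NAME_ATTR = f"{{{ANDROID_NS}}}name"  # fully qualified form of android:name


def extract_static_features(manifest_xml: str) -> dict:
    """Collect package name, components, permissions, and intent actions."""
    root = ET.fromstring(manifest_xml.strip())

    def names(path: str) -> list:
        return [el.get(NAME_ATTR) for el in root.iterfind(path) if el.get(NAME_ATTR)]

    return {
        "package": root.get("package"),
        "activities": names("application/activity"),
        "services": names("application/service"),
        "receivers": names("application/receiver"),
        "providers": names("application/provider"),
        "permissions": names("uses-permission"),
        "features": names("uses-feature"),
        "intent_actions": names(".//intent-filter/action"),
        "meta_data": names(".//meta-data"),
    }


# Hand-written fragment for illustration only.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.fb.iwidget">
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-permission android:name="android.permission.CALL_PHONE"/>
  <application>
    <activity android:name="com.fb.iwidget.MainActivity">
      <intent-filter>
        <action android:name="android.intent.action.MAIN"/>
        <category android:name="android.intent.category.LAUNCHER"/>
      </intent-filter>
    </activity>
    <service android:name="com.fb.iwidget.MainService"/>
    <receiver android:name="com.fb.iwidget.ActionReceiver"/>
  </application>
</manifest>
"""

features = extract_static_features(SAMPLE_MANIFEST)
for key, value in features.items():
    print(key, value)
```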

    Static Features

    Feature               Values
    Package Name          "com.fb.iwidget"
    Activities            "com.fb.iwidget.OverlayActivity"
                          "org.acra.CrashReportDialog"
                          "com.batch.android.BatchActionActivity"
                          "com.fb.iwidget.MainActivity"
                          "com.fb.iwidget.PreferencesActivity"
                          "com.fb.iwidget.PickerActivity"
                          "com.fb.iwidget.IntroActivity"
    Services              "com.batch.android.BatchActionService"
                          "com.fb.iwidget.MainService"
                          "com.fb.iwidget.SnapAccessService"
    Receivers/Providers   "com.fb.iwidget.ExpandWidgetProvider"
                          "com.fb.iwidget.ActionReceiver"
    Intents Actions       "android.accessibilityservice.AccessibilityService"
                          "android.appwidget.action.APPWIDGET_UPDATE"
                          "android.intent.action.BOOT_COMPLETED"
                          "android.intent.action.CREATE_SHORTCUT"
                          "android.intent.action.MAIN"
                          "android.intent.action.MY_PACKAGE_REPLACED"
                          "android.intent.action.USER_PRESENT"
                          "android.intent.action.VIEW"
                          "com.fb.iwidget.action.SHOULD_REVIVE"
    Intents Categories    "android.intent.category.BROWSABLE"
                          "android.intent.category.DEFAULT"
                          "android.intent.category.LAUNCHER"
    Permissions           "android.permission.ACCESS_NETWORK_STATE"
                          "android.permission.CALL_PHONE"
                          "android.permission.INTERNET"
                          "android.permission.RECEIVE_BOOT_COMPLETED"
                          "android.permission.SYSTEM_ALERT_WINDOW"
                          "com.android.vending.BILLING"
                          "android.permission.BIND_ACCESSIBILITY_SERVICE"
    Meta-Data             "android.accessibilityservice"
                          "android.appwidget.provider"
    #Icons                331
    #Pictures             0
    #Videos               0
    Audio files           0
    Videos                0
    Size of the App       4.2M

    Dynamic analysis

    For understanding the behavioral changes of these malware categories and families, six categories of features are extracted after executing the malware in an emulated environment. The main extracted features include:

    • Memory: Memory features define activities performed by malware by utilizing memory.
    • API: Application Programming Interface (API) features delineate the communication between two applications.
    • Network: Network features describe the data transmitted and received between other devices in the network. They indicate foreground and background network usage.
    • Battery: Batt...
  6. llama-7b.ggmlv3.q4_1.bin

    • kaggle.com
    Updated Oct 5, 2023
    Cite
    Lorentz (2023). llama-7b.ggmlv3.q4_1.bin [Dataset]. https://www.kaggle.com/datasets/lorentzyeung/llama-7b-ggmlv3-q4-1-bin
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Lorentz
    Description

    Git LFS Details

    Origin: https://huggingface.co/TheBloke/LLaMa-7B-GGML
    SHA256: bcb95f6755597f26046ab2d5ebea51bf1418f440a96e1563f0fecc379c2cbee3
    Pointer size: 135 Bytes
    Size of remote file: 3.79 GB

    Raw pointer file

    Git Large File Storage (LFS) replaces large files with text pointers inside Git, while storing the file contents on a remote server.
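
    Since the listing publishes a SHA256 digest, a downloaded copy of the model file can be checked against it. The sketch below parses a Git LFS pointer (the standard three-line `version` / `oid` / `size` text format) and compares a local file with it. The helper names and the local file path are hypothetical, and the exact byte size is not given in the listing, so it has to come from the real pointer file.

```python
import hashlib
from pathlib import Path

# Digest published in the Git LFS details above.
EXPECTED_SHA256 = "bcb95f6755597f26046ab2d5ebea51bf1418f440a96e1563f0fecc379c2cbee3"


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'version', 'oid' and 'size' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte files do not need to fit in memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_pointer(path, pointer: dict) -> bool:
    """True if the file's size and SHA256 digest both match the pointer."""
    return (
        Path(path).stat().st_size == pointer["size"]
        and sha256_of_file(path) == pointer["digest"]
    )


if __name__ == "__main__":
    model_path = Path("llama-7b.ggmlv3.q4_1.bin")  # hypothetical local path
    if model_path.exists():
        print("digest matches:", sha256_of_file(model_path) == EXPECTED_SHA256)
```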

    Meta's LLaMA 7b GGML

    These files are GGML format model files for Meta's LLaMA 7b.

    GGML files are for CPU + GPU inference using llama.cpp and libraries and UIs which support this format, such as:

    • KoboldCpp, a powerful GGML web UI with full GPU acceleration out of the box. Especially good for story telling.
    • LoLLMS Web UI, a great web UI with GPU acceleration via the c_transformers backend.
    • LM Studio, a fully featured local GUI. Supports full GPU accel on macOS. Also supports Windows, without GPU accel.
    • text-generation-webui, the most popular web UI. Requires extra steps to enable GPU accel via llama.cpp backend.
    • ctransformers, a Python library with LangChain support and OpenAI-compatible AI server.
    • llama-cpp-python, a Python library with OpenAI-compatible API server.

    These files were quantised using hardware kindly provided by Latitude.sh.

    Repositories available

    • GPTQ models for GPU inference, with multiple quantisation parameter options.
    • 2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference
    • Unquantised fp16 model in pytorch format, for GPU inference and for further conversions (https://huggingface.co/huggyllama/llama-7b)

    Prompt template: None

    Compatibility

    Original llama.cpp quant methods: q4_0, q4_1, q5_0, q5_1, q8_0

    These are guaranteed to be compatible with any UIs, tools and libraries released since late May. They may be phased out soon, as they are largely superseded by the new k-quant methods.

    New k-quant methods: q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K

    These new quantisation methods are compatible with llama.cpp as of June 6th, commit 2d43387.

    They are now also compatible with recent releases of text-generation-webui, KoboldCpp, llama-cpp-python, ctransformers, rustformers and most others. For compatibility with other tools and libraries, please check their documentation.
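
    The two groups of quantisation methods can be told apart from a model's filename. The sketch below is a hypothetical helper that assumes the dot-separated naming seen in this dataset's title (llama-7b.ggmlv3.q4_1.bin); it is not part of llama.cpp or any of the libraries listed above.

```python
from typing import Optional

# Method names as listed in the compatibility notes above.
ORIGINAL_QUANTS = {"q4_0", "q4_1", "q5_0", "q5_1", "q8_0"}
K_QUANTS = {"q2_K", "q3_K_S", "q3_K_M", "q3_K_L", "q4_K_S", "q4_K_M", "q5_K_S", "q6_K"}


def quant_method(filename: str) -> Optional[str]:
    """Return the quantisation method named in a dot-separated filename, if any."""
    for part in filename.split("."):
        if part in ORIGINAL_QUANTS or part in K_QUANTS:
            return part
    return None


def quant_family(filename: str) -> str:
    """Classify a model file as 'original', 'k-quant', or 'unknown'."""
    method = quant_method(filename)
    if method in ORIGINAL_QUANTS:
        return "original"
    if method in K_QUANTS:
        return "k-quant"
    return "unknown"


print(quant_family("llama-7b.ggmlv3.q4_1.bin"))    # the file in this dataset
print(quant_family("llama-7b.ggmlv3.q4_K_M.bin"))  # hypothetical k-quant filename
```

    A file classified as "original" falls under the broader compatibility statement above, while a "k-quant" file needs a tool version that includes the newer methods.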



MCB_languages_county

This dataset was used to list all languages spoken in the United States: 2009-13
