# Unstructured.io File Loader

```bash
pip install llama-index-readers-file
```

This loader extracts text from a variety of unstructured files using [Unstructured.io](https://github.com/Unstructured-IO/unstructured). The currently supported file extensions are `.csv`, `.tsv`, `.doc`, `.docx`, `.odt`, `.epub`, `.org`, `.rst`, `.rtf`, `.md`, `.msg`, `.pdf`, `.heic`, `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.ppt`, `.pptx`, `.xlsx`, `.eml`, `.html`, `.xml`, `.txt` and `.json`. Each call to `load_data` processes a single file.

See the Unstructured.io documentation for more details. Notably, this loader lets you parse unstructured data for many use cases. For example, you can download the 10-K SEC filings of public companies (e.g. [Coinbase](https://www.sec.gov/ix?doc=/Archives/edgar/data/0001679788/000167978822000031/coin-20211231.htm)) and feed them directly into this loader without worrying about cleaning up the formatting or HTML tags.
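Since the set of supported extensions is fixed, it can help to filter files before handing them to the loader. The snippet below is a minimal sketch of such a check. `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names used for illustration and are not part of the library; the set is simply copied from the list above.

```python
from pathlib import Path

# Extensions copied from the list in this README (assumption: kept in sync by hand).
SUPPORTED_EXTENSIONS = frozenset(
    {
        ".csv", ".tsv", ".doc", ".docx", ".odt", ".epub", ".org", ".rst",
        ".rtf", ".md", ".msg", ".pdf", ".heic", ".png", ".jpg", ".jpeg",
        ".tiff", ".bmp", ".ppt", ".pptx", ".xlsx", ".eml", ".html", ".xml",
        ".txt", ".json",
    }
)


def is_supported(path) -> bool:
    """Return True if the file's extension appears in the supported list."""
    # Compare case-insensitively so "REPORT.PDF" is treated like "report.pdf".
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported("10k_filing.html"))
print(is_supported("archive.zip"))
```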

## Usage

To use this loader, pass any keyword arguments intended for Unstructured via the `unstructured_kwargs` parameter, for example `filename` for the path to a local file or `file` for an open stream. Optionally, set `split_documents=True` to place each `element` generated by Unstructured.io in a separate document. This guarantees that those elements are kept separate when an index is created in LlamaIndex, which, depending on your use case, can be a smarter form of text splitting. By default, `split_documents` is `False`.

```python
from llama_index.readers.file import UnstructuredReader

loader = UnstructuredReader()
documents = loader.load_data(
    unstructured_kwargs={"filename": "./10k_filing.html"}
)
```
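The effect of `split_documents` can be pictured with plain strings standing in for Unstructured elements. This is a conceptual sketch only, not the reader's implementation: `to_documents` is a hypothetical function, and the blank-line separator used to join elements is an assumption made for illustration.

```python
from typing import List


def to_documents(elements: List[str], split_documents: bool = False) -> List[str]:
    """Toy model: one document per element, or a single combined document."""
    if split_documents:
        # Each element becomes its own document.
        return list(elements)
    # Otherwise all elements are merged into one document
    # (the "\n\n" separator is an assumption, chosen for readability).
    return ["\n\n".join(elements)]


elements = ["Item 1. Business", "Overview of operations.", "Item 1A. Risk Factors"]
print(len(to_documents(elements)))
print(len(to_documents(elements, split_documents=True)))
```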

You can also use this loader together with `SimpleDirectoryReader` to parse selected file types in a directory with Unstructured.io.

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import UnstructuredReader

dir_reader = SimpleDirectoryReader(
    "./data",
    file_extractor={
        ".pdf": UnstructuredReader(),
        ".html": UnstructuredReader(),
        ".eml": UnstructuredReader(),
    },
)
documents = dir_reader.load_data()
```
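When many extensions should go through the same reader, the `file_extractor` mapping can be built programmatically rather than written out by hand. The following is a small sketch under that assumption. `build_file_extractor` is a hypothetical helper, and a placeholder object stands in for `UnstructuredReader()` so the snippet runs without llama-index installed.

```python
from typing import Any, Dict, Iterable


def build_file_extractor(reader: Any, extensions: Iterable[str]) -> Dict[str, Any]:
    """Map every extension to the same reader instance."""
    # Normalize entries so both "pdf" and ".pdf" end up as ".pdf".
    return {
        ext if ext.startswith(".") else f".{ext}": reader for ext in extensions
    }


# Placeholder for UnstructuredReader(); swap in the real reader in practice.
reader = object()
file_extractor = build_file_extractor(reader, ["pdf", ".html", ".eml"])
print(sorted(file_extractor))
```

The resulting dictionary can then be passed as the `file_extractor` argument shown above.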

```python
# Example using a file stream as input, taking advantage of HI_RES partitioning
# and Unstructured's native by_title chunking.
from llama_index.readers.file import UnstructuredReader
from unstructured.partition.utils.constants import PartitionStrategy

# `file` is assumed to be an uploaded-file object exposing `content_type`, and
# `filestream` its binary stream; both come from your application.
documents = UnstructuredReader().load_data(
    unstructured_kwargs={
        "file": filestream,
        "content_type": file.content_type,
        "url": None,
        "strategy": PartitionStrategy.HI_RES,
        "chunking_strategy": "by_title",
    },
    split_documents=True,
    # Generate deterministic ids for each document (or for the whole document
    # when not splitting) to support document lifecycle operations such as
    # upserts.
    deterministic_ids=True,
)
```
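If the input is a local file rather than an upload, the `file` and `content_type` values can be derived with the standard library. The sketch below assumes a hypothetical helper named `build_stream_kwargs`, and it uses the string `"hi_res"` on the assumption that it is the string form of `PartitionStrategy.HI_RES`. An in-memory stream stands in for `open(path, "rb")` so the snippet runs anywhere.

```python
import io
import mimetypes
from pathlib import Path
from typing import Any, BinaryIO, Dict


def build_stream_kwargs(path: Path, stream: BinaryIO) -> Dict[str, Any]:
    """Assemble keyword arguments for a stream input (hypothetical helper)."""
    # Guess the MIME type from the file name; this is None for unknown types.
    content_type, _ = mimetypes.guess_type(path.name)
    return {
        "file": stream,
        "content_type": content_type,
        "strategy": "hi_res",  # assumed string form of PartitionStrategy.HI_RES
        "chunking_strategy": "by_title",
    }


# In practice the stream would come from open(path, "rb").
kwargs = build_stream_kwargs(
    Path("10k_filing.html"), io.BytesIO(b"<html></html>")
)
print(kwargs["content_type"])
```

The returned dictionary is meant to be passed as `unstructured_kwargs` to `load_data`.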

This loader is designed to be used as a way to load data into [LlamaIndex](https://github.com/run-llama/llama_index/).

## Troubleshooting

**"failed to find libmagic" error**: Try `pip install python-magic-bin==0.4.14` (solution documented [here](https://github.com/Yelp/elastalert/issues/1927#issuecomment-425040424)). On macOS, you can also try `brew install libmagic`.
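
To check whether the fix worked before running the loader again, you can test whether the `magic` module imports cleanly. This is a minimal sketch: `libmagic_available` is a hypothetical helper, and it assumes the error surfaces as an `ImportError` (or `OSError`) when the `magic` module is imported.

```python
def libmagic_available() -> bool:
    """Return True if the `magic` module (python-magic) imports without error."""
    try:
        import magic  # noqa: F401  (imported only to verify it loads)
    except (ImportError, OSError):
        # Raised when python-magic or the underlying libmagic library is missing.
        return False
    return True


print("libmagic available:", libmagic_available())
```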
