JSON 파일 불러오기 :: pd.read_json( )
Reference
Pandas In Action
- JSON(Jave Script Object Notation)
- 텍스트 데이터를 저장하고 전송하기 위한 형식
- 키 - 값 쌍으로 구성
- Python의 딕셔너리 객체와 유사
- 린터(Linter)
- 각 키 - 값 쌍을 별도의 줄에 배치하여 JSON 응답을 가독성 있는 형식으로 나타냄
< JSON 파일 불러오기 >
pd.read_csv(
path_or_buf = None,
)
- Option
- path_or_buf : 파일 경로 및 파일 이름
import pandas as pd
nobel = pd.read_json("./Data/nobel.json")
nobel
|
prizes |
0 |
{‘year’: ‘2019’, ‘category’: ‘chemistry’, ‘lau… |
1 |
{‘year’: ‘2019’, ‘category’: ‘economics’, ‘lau… |
2 |
{‘year’: ‘2019’, ‘category’: ‘literature’, ‘la… |
3 |
{‘year’: ‘2019’, ‘category’: ‘peace’, ‘laureat… |
4 |
{‘year’: ‘2019’, ‘category’: ‘physics’, ‘overa… |
… |
… |
641 |
{‘year’: ‘1901’, ‘category’: ‘chemistry’, ‘lau… |
642 |
{‘year’: ‘1901’, ‘category’: ‘literature’, ‘la… |
643 |
{‘year’: ‘1901’, ‘category’: ‘peace’, ‘laureat… |
644 |
{‘year’: ‘1901’, ‘category’: ‘physics’, ‘laure… |
645 |
{‘year’: ‘1901’, ‘category’: ‘medicine’, ‘laur… |
646 rows × 1 columns
< Result >
→ prizes
에 중첩된 딕셔너리가 존재
< 평탄화(Flattening)** or **정규화(Normalizing) >
pd.json_normalize(
data,
record_path = None,
meta = None,
)
- 중첩된 데이터 레코드를 단일 1차원 리스트로 변형하는 과정
- Option
- data : 직렬화되지 않은 JSON 객체
- record_path : 레코드 목록에 대한 각 개체의 경로
- meta : 결과 테이블의 각 레코드에 대한 메타데이터
# `prizes`데이터 중 첫번째 최상위 딕셔너리 키(`year`, `category`, `laureates`)를 추출
pd.json_normalize(data = nobel['prizes'][0])
|
year |
category |
laureates |
0 |
2019 |
chemistry |
[{‘id’: ‘976’, ‘firstname’: ‘John’, ‘surname’:… |
< Result >
→ laureates
에 여전히 중첩된 딕셔너리가 존재
# 중첩된 `laureates`레코드를 정규화
pd.json_normalize(
data = nobel['prizes'][0],
record_path = 'laureates'
)
|
id |
firstname |
surname |
motivation |
share |
0 |
976 |
John |
Goodenough |
“for the development of lithium-ion batteries” |
3 |
1 |
977 |
M. Stanley |
Whittingham |
“for the development of lithium-ion batteries” |
3 |
2 |
978 |
Akira |
Yoshino |
“for the development of lithium-ion batteries” |
3 |
< Result >
→ 새로운 열로 확장했지만 기존의 year
와 category
열이 사라짐
# 최상위 키 - 값 쌍을 유지 (`year`, `category`)
pd.json_normalize(
data = nobel['prizes'][0],
record_path = 'laureates',
meta = ['year', 'category']
)
|
id |
firstname |
surname |
motivation |
share |
year |
category |
0 |
976 |
John |
Goodenough |
“for the development of lithium-ion batteries” |
3 |
2019 |
chemistry |
1 |
977 |
M. Stanley |
Whittingham |
“for the development of lithium-ion batteries” |
3 |
2019 |
chemistry |
2 |
978 |
Akira |
Yoshino |
“for the development of lithium-ion batteries” |
3 |
2019 |
chemistry |
# Error
pd.json_normalize(
data = nobel['prizes'],
record_path = 'laureates',
meta = ['year', 'category']
)
Result
```
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:399, in _json_normalize.._pull_field(js, spec, extract_record)
398 else:
--> 399 result = result[spec]
400 except KeyError as e:
KeyError: 'laureates'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[6], line 2
1 # Error
----> 2 pd.json_normalize(
3 data = nobel['prizes'],
4 record_path = 'laureates',
5 meta = ['year', 'category']
6 )
File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:518, in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
515 meta_vals[key].append(meta_val)
516 records.extend(recs)
--> 518 _recursive_extract(data, record_path, {}, level=0)
520 result = DataFrame(records)
522 if record_prefix is not None:
File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:500, in _json_normalize.._recursive_extract(data, path, seen_meta, level)
498 else:
499 for obj in data:
--> 500 recs = _pull_records(obj, path[0])
501 recs = [
502 nested_to_record(r, sep=sep, max_level=max_level)
503 if isinstance(r, dict)
504 else r
505 for r in recs
506 ]
508 # For repeating the metadata later
File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:422, in _json_normalize.._pull_records(js, spec)
416 def _pull_records(js: dict[str, Any], spec: list | str) -> list:
417 """
418 Internal function to pull field for records, and similar to
419 _pull_field, but require to return list. And will raise error
420 if has non iterable value.
421 """
--> 422 result = _pull_field(js, spec, extract_record=True)
424 # GH 31507 GH 30145, GH 26284 if result is not list, raise TypeError if not
425 # null, otherwise return an empty list
426 if not isinstance(result, list):
File ~/anaconda3/envs/torch/lib/python3.11/site-packages/pandas/io/json/_normalize.py:402, in _json_normalize.._pull_field(js, spec, extract_record)
400 except KeyError as e:
401 if extract_record:
--> 402 raise KeyError(
403 f"Key {e} not found. If specifying a record_path, all elements of "
404 f"data should have the path."
405 ) from e
406 elif errors == "ignore":
407 return np.nan
KeyError: "Key 'laureates' not found. If specifying a record_path, all elements of data should have the path."
```
</details>
**< Result >**
→ Error : prizes Series에 있는 딕셔너리 중 일부는 `laureates`라는 키가 없기 때문..
```python
dictionary.setdefault(
key,
value
)
```
- 딕셔너리 키에 대한 기본 값을 할당
- 딕셔너리에 키가 없는 경우에는 키 - 값 쌍을 할당
- 딕셔너리에 키가 있는 경우 기존 값을 반환
```python
def add_laureates_key(entry):
entry.setdefault('laureates', [])
# prizes에 있는 딕셔너리 자체를 변경하므로 기존의 Series를 덮어쓸 필요가 없음
nobel['prizes'].apply(add_laureates_key)
```
```
0 None
1 None
2 None
3 None
4 None
...
641 None
642 None
643 None
644 None
645 None
Name: prizes, Length: 646, dtype: object
```
```python
# 완성된 JSON 파일 불러오기
winners = pd.json_normalize(
data = nobel['prizes'],
record_path = 'laureates',
meta = ['year', 'category']
)
winners
```
| | id | firstname | surname | motivation | share | year | category |
| ------: | ---: | -------------: | ----------: | ------------------------------------------------: | ----: | ---: | ---------: |
| **0** | 976 | John | Goodenough | "for the development of lithium-ion batteries" | 3 | 2019 | chemistry |
| **1** | 977 | M. Stanley | Whittingham | "for the development of lithium-ion batteries" | 3 | 2019 | chemistry |
| **2** | 978 | Akira | Yoshino | "for the development of lithium-ion batteries" | 3 | 2019 | chemistry |
| **3** | 982 | Abhijit | Banerjee | "for their experimental approach to alleviatin... | 3 | 2019 | economics |
| **4** | 983 | Esther | Duflo | "for their experimental approach to alleviatin... | 3 | 2019 | economics |
| **...** | ... | ... | ... | ... | ... | ... | ... |
| **945** | 569 | Sully | Prudhomme | "in special recognition of his poetic composit... | 1 | 1901 | literature |
| **946** | 462 | Henry | Dunant | "for his humanitarian efforts to help wounded ... | 2 | 1901 | peace |
| **947** | 463 | Frédéric | Passy | "for his lifelong work for international peace... | 2 | 1901 | peace |
| **948** | 1 | Wilhelm Conrad | Röntgen | "in recognition of the extraordinary services ... | 1 | 1901 | physics |
| **949** | 293 | Emil | von Behring | "for his work on serum therapy, especially its... | 1 | 1901 | medicine |
950 rows × 7 columns